Is the Data Lake Dead?

This article is by Joe DosSantos and originally appeared on the Qlik Blog here: https://blog.qlik.com/is-the-data-lake-dead

 

Assumption #1: “Data storage is expensive, so let’s build our Hadoop data lake, where the economics look much more attractive.”

How Does This Assumption Look in Hindsight?

To be sure, my experience has been that the TCO per GB of storage in Hadoop can be 5 percent or less of the cost of traditional RDBMS systems. But even the most experienced enterprises learned quickly just how hard it was to operate an enterprise cluster. The constant updates of open source software, the scarcity of skills to manage an environment, and the relative immaturity of the ecosystem created technical glitches and dependencies that were difficult to manage. On top of this, once Hadoop had replicated the data three times and administrators had added the snapshots and copies needed to work around the limitations of Hadoop updates, 1 TB of RDBMS data might turn into 50 TB in the lake. So much for those savings.
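To make that arithmetic concrete, here is a minimal sketch. The 5 percent cost ratio and the 50x expansion are the figures quoted above; the baseline of 100 cost units per TB is a made-up placeholder, and the function name is purely illustrative.

```python
# Illustrative arithmetic only. The cost ratio and expansion factor come from
# the paragraph above; the baseline cost is an arbitrary, hypothetical number.
RDBMS_COST_PER_TB = 100.0   # hypothetical baseline, arbitrary cost units
HADOOP_COST_RATIO = 0.05    # Hadoop storage at "5 percent or less" of RDBMS cost
EXPANSION_FACTOR = 50       # 1 TB of RDBMS data becoming 50 TB in the lake


def compare_storage_cost(source_tb: float) -> dict:
    """Compare the cost of keeping source_tb in an RDBMS versus in the lake."""
    rdbms_cost = source_tb * RDBMS_COST_PER_TB
    # The lake stores many more bytes, each at a much lower unit price.
    lake_tb = source_tb * EXPANSION_FACTOR
    lake_cost = lake_tb * RDBMS_COST_PER_TB * HADOOP_COST_RATIO
    return {
        "rdbms_cost": rdbms_cost,
        "lake_cost": lake_cost,
        "lake_to_rdbms_ratio": lake_cost / rdbms_cost,
    }


result = compare_storage_cost(1.0)
print(result)
```

Under these assumptions the lake ends up costing roughly 2.5 times what the RDBMS did for the same source data, because a 20x cheaper unit price is outweighed by a 50x increase in volume.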

The Emerging Reality: Cloud and Cloud Data Warehouse

Amazon, Microsoft and Google rushed in to fill these productivity gaps with managed, cloud-based environments, which simplified administration and made data scientists more productive more quickly. Next, consumption models replaced the capital costs of Hadoop on-prem environments, which meant that people were less inclined to simply dump all of their large data sets into one central environment. Rather, they loaded data as required for analytics. As a result, enterprises moved away from large on-prem data lakes toward smaller, purpose-built cloud-based data ponds. Taking this one step further, new cloud warehouses have made accessing and querying this data simple with SQL-based tools, which further unleashes the value of the data for non-technical consumers.

Assumption #2: “Big Data is too big to move. Move the data once and move the compute to the data.”

How Does This Assumption Look in Hindsight?

One key assumption of the data lake was that limitations in network and processing speed would mean that we could not take large copies of data, such as log files, and move them to a cluster for data analytics. Hadoop was also batch oriented, meaning that large batches of these types of data were highly impractical. It turns out that improvements to data replication and streaming, as well as tremendous gains in networking, have caused this to be less true than we thought.

The Emerging Reality: Data Virtualization and Streaming

Improvements in technology have meant that enterprises have choices in how to access data. Perhaps they want to offload queries from transactional systems to a cloud environment; data replication and streaming are now easy solutions. Perhaps a transactional system is built for high-performance queries; in that case, data virtualization capabilities can make that data available on demand. As a result, companies now have options to make data more available on demand for DataOps processes, meaning that there is not always a need to centralize all enterprise data physically in one location.

Assumption #3: “The data lake schema on read will replace the data warehouse schema on write.”

How Does This Assumption Look in Hindsight?

People were sick and tired of how long it took IT teams to write ETL into data warehouses and were desperate to simply unleash data scientists on raw data. There were two major sticking points. First, data scientists often could not easily find the data that they were looking for. Second, once they had the data, analytics leads soon found out that their ETL was simply replaced by data wrangling tools, because data science still required cleanup, such as standardization and foreign-key matching.
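As a rough illustration of the kind of cleanup that did not go away, the sketch below standardizes a key column and matches it against a reference table. Every record, field name and function here is invented for the example; it is not drawn from any particular wrangling tool.

```python
# Hypothetical raw records, as they might land in a lake without any ETL.
customers = [
    {"customer_id": "C001", "country": "US"},
    {"customer_id": "C002", "country": "DE"},
]
orders = [
    {"order_id": 1, "customer_id": " c001 ", "amount": "19.99"},
    {"order_id": 2, "customer_id": "C999", "amount": "5"},
]


def standardize_key(raw: str) -> str:
    """Standardization: trim whitespace and normalize case."""
    return raw.strip().upper()


def match_orders(orders: list, customers: list) -> tuple:
    """Foreign-key matching: split orders into matched rows and orphans."""
    known = {standardize_key(c["customer_id"]): c for c in customers}
    matched, orphans = [], []
    for order in orders:
        key = standardize_key(order["customer_id"])
        # Amounts arrive as strings in the raw data, so convert them here.
        row = {**order, "customer_id": key, "amount": float(order["amount"])}
        if key in known:
            matched.append({**row, "country": known[key]["country"]})
        else:
            orphans.append(row)
    return matched, orphans


matched, orphans = match_orders(orders, customers)
print(matched)
print(orphans)
```

Even in this toy case, someone has to decide how keys are normalized and what happens to orders that match no customer, which is the same work ETL used to do before the data reached the warehouse.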

The Emerging Reality: Data Catalogs and DataOps

A smart data catalog has become essential to finding the data that you need. Companies are now trying to bring to the workplace the same kind of Google-style search that users enjoy at home, with simple solutions to find and access data regardless of the physical location of the data stores that hold it. DataOps processes have also emerged as a way to establish domain-based data sets that are carefully planned and governed to enable maximum analytics productivity. Thus, a data scientist should be able to easily find and trust the data that they are using to discover new insights, and a thoughtful blend of technology and process should allow for rapid operationalization of data pipelines and analytics pipelines to support these new discoveries. This process can enable real-time analytics.

As we at Qlik look to modernize our data analytics architecture, these key emerging realities are at the forefront of our thinking:

  • Cloud-Based Application and Analytics Architectures
  • A re-emergence of Data Warehouse/RDBMS structures in the cloud to maximize value (think Snowflake)
  • Data Streaming to reduce the latency of key data
  • Data Virtualization to reduce the copying of data until required
  • Data Catalogs to carefully inventory and manage access to enterprise data
  • The emergence of DataOps processes to create quick time to market for data and analytics pipelines