Before Making A Big Splash – 5 “Gotchas” To Avoid When Building a Data Lake

October 2nd, 2020

This article is by Carolyn Davis and originally appeared on the Qlik Blog.


There is no debate. Making sense of your data is just good business. Studies from Forrester Research, McKinsey and more show that companies that leverage their data better tend to outperform their less-informed counterparts. Our own recent research, Data as the New Water: The Importance of Investing in Data and Analytics Pipelines, done in partnership with IDC, shows that companies that optimize their data pipelines see enhanced operational efficiency (88 percent vs. an average of 76 percent), improved revenue (86 percent vs. an average of 75 percent) and increased profit (90 percent vs. an average of 76 percent).

Despite this, most businesses use just a fraction of their data, as little as 10 percent, for analysis. Worse yet, fewer than a third of executives state they can derive business value from their data.

This poor usage of data is indicative of companies not optimizing their entire data value chain. In most companies, the effort has gone primarily into the glitzier parts of the data pipeline – analysis, AI, machine learning and eye-catching visualizations. The heavy lifting, the less exciting part of the data pipeline, involves processing data and getting it into a shape where it can be analyzed, and it continues to be underinvested in and underappreciated.

Even companies that recognize the value of building data pipelines often lose their footing on some of the basic tenets of data management. They are so focused on collecting all types of fast-flowing data that they forget that to create value, data needs to be usable and accessible, too.

Although this is more of an issue in data lakes than in data warehouses, which by design store formatted, cleansed and primarily structured data, data lakes continue to be an important architectural construct for organizations due to their ability to store large volumes of highly diverse data from multiple sources. But, if not planned and built properly, it is this very flexibility of data lakes that can quickly turn them into data dumps.

The value of data lakes comes not just from their ability to quickly and cost-effectively store all types of data, but also from processing and refining that raw data into an analytics-ready state, so that the data is actionable and accessible for exploration and analysis.

So, how do you avoid becoming a statistic in yet another failed data lake project? Here are the five questions to keep in mind when building a data lake:

1. How will you fill/hydrate your data lake?

This question should not only trigger thoughts regarding which technology to adopt to ingest data into the data lake but also more foundational questions around the end goal of building the data lake. Is it specific to a business function, or will it be the central source of truth for the enterprise? That will help you determine from which data sources you wish to ingest data and which types of data to select. Do you need to accommodate slowly changing dimensions? What about the frequency of data delivery – does it need to be delivered incrementally, as it comes in, or will batch loads suffice? Check how easy it is to configure, monitor and manage data pipelines as new sources get added and the data architecture evolves.

Filling a data lake is not a one-and-done operation. Speed of data transfer, support for real-time incremental change loading, breadth of source systems and target platforms supported, on premises and in the cloud, and ease of use and automation for pipeline design and execution are all critical considerations to get faster value from your data lake.
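To make the idea of incremental loading with slowly changing dimensions a little more concrete, here is a minimal sketch in Python. It assumes a hypothetical customer record with a single tracked attribute (city) and keeps history in the common "Type 2" style, closing the old version and appending a new one whenever a value changes. The class, function and field names are illustrative only and do not come from any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class CustomerVersion:
    """One historical version of a customer row (hypothetical schema)."""
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_changes(
    history: List[CustomerVersion],
    changes: Dict[int, str],
    load_date: date,
) -> List[CustomerVersion]:
    """Apply one incremental batch while keeping history (Type 2 style).

    `changes` maps customer_id -> new city. Rows whose value did not
    change are skipped; changed rows get their current version closed
    and a new current version appended.
    """
    current = {v.customer_id: v for v in history if v.valid_to is None}
    for customer_id, city in changes.items():
        existing = current.get(customer_id)
        if existing is not None and existing.city == city:
            continue  # no actual change, nothing to record
        if existing is not None:
            existing.valid_to = load_date  # close the old version
        new_version = CustomerVersion(customer_id, city, valid_from=load_date)
        history.append(new_version)
        current[customer_id] = new_version
    return history


history = [CustomerVersion(1, "Boston", date(2020, 1, 1))]
apply_changes(history, {1: "Denver", 2: "Austin"}, date(2020, 10, 2))
apply_changes(history, {1: "Denver"}, date(2020, 10, 3))  # value unchanged, skipped

for version in history:
    print(version)
```

Only the rows that changed are touched on each load, which is the practical difference between incremental delivery and reloading a full batch every time.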

2. What does it take to make the data usable?

It is easier to move data into the data lake than it is to process and refine data (i.e., to make it consumption-ready for analytics, AI and machine learning initiatives in a timely manner). Consider what it takes to get data into the curated, analytics-ready state your data consumers expect. How time- and resource-intensive is the process? Can you propagate changes from source systems and maintain history as data definitions and structures change?

Your programming and data science resources are limited. Automation, reusable ETL scripts and workflows can accelerate raw-to-actionable data pipeline deployment and be the difference between leapfrogging the competition and losing opportunities.
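As a rough illustration of what "reusable" can mean here, the sketch below builds a refinement pipeline out of small, parameterized steps that can be recombined for different datasets. It uses plain Python lists of dictionaries rather than any specific ETL tool, and every name in it (drop_incomplete, normalize_text, deduplicate, build_pipeline, the sample records) is hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[List[Record]], List[Record]]


def drop_incomplete(required: List[str]) -> Step:
    """Return a step that removes records missing any required field."""
    def step(records: List[Record]) -> List[Record]:
        return [
            r for r in records
            if all(r.get(k) not in (None, "") for k in required)
        ]
    return step


def normalize_text(field: str) -> Step:
    """Return a step that trims and lowercases one text field."""
    def step(records: List[Record]) -> List[Record]:
        return [{**r, field: str(r[field]).strip().lower()} for r in records]
    return step


def deduplicate(key: str) -> Step:
    """Return a step that keeps the first record seen for each key value."""
    def step(records: List[Record]) -> List[Record]:
        seen: Dict[object, Record] = {}
        for r in records:
            seen.setdefault(r[key], r)
        return list(seen.values())
    return step


def build_pipeline(*steps: Step) -> Step:
    """Chain steps so the output of each feeds the next."""
    return lambda records: reduce(lambda data, step: step(data), steps, records)


# The same building blocks can be reused for other datasets and fields.
refine_contacts = build_pipeline(
    drop_incomplete(["email"]),
    normalize_text("email"),
    deduplicate("email"),
)

raw = [
    {"name": "Ana", "email": " Ana@Example.com "},
    {"name": "Ana B.", "email": "ana@example.com"},
    {"name": "Raj", "email": None},
]
curated = refine_contacts(raw)
print(curated)
```

Because each step is a small function with its own parameters, a new source usually means composing existing steps in a new order rather than writing a script from scratch.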

3. Can you trust the data?

Issues of data security, quality, consistency and governance are critical to data lake value. Data lakes can quickly become data swamps if data is dumped without consistent data definitions and metadata models. Check for the ability to auto-generate and augment metadata, tag and secure sensitive data, and establish enterprise-wide access controls.

Data in data lakes is of value only if data consumers can understand and use the data, verify its origin and trust its quality. An integrated catalog for automated data profiling and metadata generation, lineage, data security and governance are critical to building a successful data lake.
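The following sketch shows, under simplified assumptions, what automated profiling and sensitive-data tagging might look like. It computes basic statistics per column, tags a column as sensitive when most of its values look like email addresses, and records the source name as a crude stand-in for lineage. The regular expression, the 80 percent threshold, the tag name and the "crm_export" source are all illustrative assumptions, not features of any specific catalog product.

```python
import re
from typing import Dict, List, Optional

# Deliberately simple pattern; real detection rules would be stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(name: str, values: List[Optional[str]]) -> Dict[str, object]:
    """Generate basic metadata for one column and tag likely email data."""
    non_null = [v for v in values if v is not None]
    email_like = sum(
        1 for v in non_null if isinstance(v, str) and EMAIL_PATTERN.match(v)
    )
    tags = []
    # Assumed rule: tag the column if at least 80% of values look like emails.
    if non_null and email_like / len(non_null) >= 0.8:
        tags.append("sensitive:email")
    return {
        "column": name,
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "tags": tags,
    }


def profile_dataset(
    columns: Dict[str, List[Optional[str]]], source: str
) -> Dict[str, object]:
    """Build a catalog entry: where the data came from plus column profiles."""
    return {
        "source": source,  # minimal lineage: the originating system
        "columns": [profile_column(n, v) for n, v in columns.items()],
    }


dataset = {
    "contact": ["ana@example.com", "raj@example.org", None],
    "country": ["US", "IN", "US"],
}
catalog_entry = profile_dataset(dataset, source="crm_export")
print(catalog_entry)
```

Tags generated this way could then drive access controls, for example by masking any column tagged as sensitive for users without the appropriate role.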

4. Is the data accessible?

One of the top reasons for data lake failures is the inability to access and consume data at the speed of the market. Remember – it is not enough to just store data in the data lake; data should also be usable and accessible to create value. Data consumers’ inability to easily find, understand and self-provision desired datasets – or their dependence on data scientists or specialized programmers to extract data – equals delayed, dated data.

A user-friendly marketplace capability for search and evaluation, as well as self-service preparation of derivative datasets, can fast-track data lake value realization.

5. Is the data portable and platform adaptive?

Data architectures are evolving. Companies are migrating to the cloud, switching vendors, managing cross-cloud and hybrid environments and adopting new technologies. The fastest way to obsolescence is tying yourself to a technology or vendor that can’t adapt to your changing needs.

Start with the end in mind for maximum flexibility, so you don’t lock in your data and can move and adapt as organizational architecture and needs evolve.

Data lakes have huge potential for serving the data needs of multiple users and use cases in a cost-effective manner if planned and managed well. The key is to not start blindly. Instead, ask the right questions and consider the various requirements. Even for companies that have already started or are looking for ways to improve the effectiveness of their existing data lakes, it is not too late. Reviewing your project's strengths and weaknesses against the recommendations I’ve laid out above can put your data lake project on the right path and accelerate value realization.