Importance of Managed Data Lakes in Times of Coronavirus

May 25th, 2020

This article is by Ritu Jain and originally appeared on the Qlik Blog.


Unless you are like this American couple living off-grid or Jared Leto, you are probably well aware of the current Coronavirus pandemic, its global impact and the race against time that investigators face in finding treatment options. To expedite treatment development, scientists and medical experts around the world are using innovative approaches to shorten the timeline of drug discoveries and approvals to combat Coronavirus. And, as they move forward, leading researchers are recognizing how critical Big Data and analytics-driven approaches are.

To quickly and efficiently investigate the efficacy of these drugs, scientists need to bring together patient data from across the world, along with the research data on the drugs under evaluation. Then, they must quickly process and refine this mostly unstructured or semi-structured data into an analytics-ready state, to be consumed by artificial intelligence (AI) and machine learning (ML) tools for rapid exploration and evaluation. Data lakes are an efficient and scalable platform to harness all this data and enable analysis.

However, traditional approaches to bringing massive volumes of multi-source, multi-format data together into a data lake and getting it analytics-ready are slow, resource-intensive and error-prone. That holds whether the data is medical data on hundreds of thousands of patients from across the globe for leading drug discovery programs, or customer behavioral and attitudinal data for more commercial hyper-personalization initiatives.

The inability to quickly and easily extract data from a variety of ever-growing source systems; data transfer bottlenecks; difficulty adapting to changing platforms; cumbersome, coding-intensive data refinement processes; and data integrity and trust issues – all of these make it challenging to realize a timely return from data lake initiatives.

Managed data lake creation can help organizations overcome these obstacles and accelerate the delivery of continuously updated analytics-ready data for AI, ML and other data science initiatives to fast-track insights. However, data lake architecture is evolving. As you plan to build more agile, highly performant data lakes, keep the following key considerations in mind to future-proof your investments:

  • Platform Independence. Data sources, target endpoints and platforms are constantly evolving. Ensure that the solution you choose is not tied to any specific cloud provider or analytic platform, so that you have the flexibility to adapt to ever-changing and growing sources, targets and platforms, and can consume data in the analytic tools of your choice.
  • End-to-End Automation. AI and ML models require a constant stream of continuously updated data to iterate and improve. Look for a solution that enables fully automated data lake pipelines, all the way from data ingestion, to transformation and creation of analytics-ready data, to provisioning of business-purposed data sets, thereby ensuring real-time data availability for data consumers.
  • Data Integrity and Trust. Data lakes run the risk of quickly becoming data swamps if data is dumped without consistent data definitions and metadata models – or if consumers can’t quickly access and understand data, verify its origin and trust its quality. The administrative burden of ensuring data accuracy and consistency can delay and even kill the most well-funded analytics projects. When evaluating a solution, ensure it retains the entire change history for end-to-end data lineage, and comes with an integrated and secure data catalog that automatically generates rich metadata to allow data consumers to quickly and easily find, understand and use data.
  • IT and Business Alignment. To realize timely ROI from data lakes, you need to ensure alignment between the needs of IT and those of business users. While IT needs the ability to quickly and easily configure data lake pipelines, provide analytics-ready data for data consumers and ensure data security and governance, business users need the ability to quickly find, understand and self-provision data so they can make it actionable. Choose a solution that provides the robust automation, security and governance your IT teams seek, along with data-consumer-friendly features like an enterprise-wide catalog with a centralized data marketplace, so your data scientists and business users can focus on high-value insight generation.
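
To make the automation and lineage considerations above a little more concrete, here is a minimal Python sketch of a pipeline that ingests raw rows, refines them into an analytics-ready form, and records each step in a catalog entry. It is an illustration under stated assumptions, not how any particular product works: every name in it (CatalogEntry, ChangeRecord, run_pipeline, the patient_id field) is hypothetical, and the "catalog" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangeRecord:
    """One entry in a dataset's change history (hypothetical structure)."""
    step: str
    source: str
    row_count: int
    at: str


@dataclass
class CatalogEntry:
    """Catalog metadata for one dataset, filled in by the pipeline itself."""
    name: str
    columns: list = field(default_factory=list)
    history: list = field(default_factory=list)


def _record(step, source, row_count):
    # Timestamp every step so the lineage can be audited later.
    now = datetime.now(timezone.utc).isoformat()
    return ChangeRecord(step=step, source=source, row_count=row_count, at=now)


def ingest(raw_rows, source):
    # Land the rows as-is, tagging each one with where it came from.
    return [dict(row, _source=source) for row in raw_rows]


def refine(rows):
    # Assumed refinement rule: lowercase the column names and drop rows
    # that have no patient_id. Real refinement would be far richer.
    cleaned = []
    for row in rows:
        normalized = {key.lower(): value for key, value in row.items()}
        if normalized.get("patient_id") is not None:
            cleaned.append(normalized)
    return cleaned


def run_pipeline(name, raw_rows, source, catalog):
    """Ingest, refine and catalog one batch with no manual steps in between."""
    entry = catalog.setdefault(name, CatalogEntry(name=name))

    landed = ingest(raw_rows, source)
    entry.history.append(_record("ingest", source, len(landed)))

    ready = refine(landed)
    entry.history.append(_record("refine", source, len(ready)))

    # Derive column metadata from the data instead of maintaining it by hand.
    entry.columns = sorted({key for row in ready for key in row})
    return ready


if __name__ == "__main__":
    catalog = {}
    batch = [
        {"Patient_ID": 101, "Drug": "A"},
        {"Patient_ID": None, "Drug": "B"},  # dropped during refinement
    ]
    ready = run_pipeline("trial_results", batch, "hospital_a", catalog)
    print(ready)
    for change in catalog["trial_results"].history:
        print(change.step, change.source, change.row_count)
```

The point of the sketch is the shape rather than the details: because the pipeline writes its own history and column metadata, a data consumer can see where a dataset came from and what happened to it without asking IT.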

Coronavirus is not only putting the health of millions of people in peril; it is also ravaging economic stability. As supply chains struggle and consumer spending takes a hit, organizations need superior analytic insights to cope with an unprecedented situation. Managed data lake creation can provide you with actionable insights in real time by bringing together the data you need – operational, transactional, partner and syndicated – in an analytics-ready state.