This article was written by Ankur Gupta and originally appeared on the Collibra Data Governance Blog here: https://www.collibra.com/blog/defining-data-observability
Forbes defines Data Observability as a set of tools to track the health of enterprise data systems, and identify and troubleshoot problems when things go wrong. Data Observability combines monitoring, tracking, and troubleshooting of data to maintain a healthy data system.
According to the rule of ten, it costs ten times as much to complete a unit of work when data is flawed as when it is perfect. The 1-10-100 rule for the cost of quality makes the same point: prevention costs less than correction, which in turn costs less than failure. If catching a data quality error costs $1, fixing it can cost $10, and by the time it affects strategic decisions, the cost can balloon to $100.
By detecting unexpected issues with automated rules, data observability tools can proactively prevent such errors, reduce data downtime, and improve data quality.
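To make the 1-10-100 arithmetic concrete, here is a minimal sketch in Python. The unit costs come from the rule itself; the two scenarios and the function name `total_cost` are made up for illustration and do not describe any real dataset.

```python
# Illustrative unit costs taken from the 1-10-100 rule.
COST_PER_STAGE = {"prevention": 1, "correction": 10, "failure": 100}


def total_cost(errors_by_stage):
    """Sum the cost of errors given the stage at which each was handled."""
    return sum(COST_PER_STAGE[stage] * count
               for stage, count in errors_by_stage.items())


# Hypothetical scenarios: the same 100 errors, handled at different stages.
late = total_cost({"prevention": 10, "correction": 30, "failure": 60})
early = total_cost({"prevention": 80, "correction": 15, "failure": 5})

print(late)   # 10*1 + 30*10 + 60*100 = 6310
print(early)  # 80*1 + 15*10 + 5*100 = 730
```

In this invented example, shifting most errors to the prevention stage cuts the total cost by almost an order of magnitude, which is the intuition behind the rule.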
Increasingly complex data sources
As the volume and variety of data sources increase, organizations struggle with a vast amount of diverse data. The various data storage options, numerous data pipelines, and an array of enterprise applications add to the complexity of data management. Handling these complex sources to deliver trusted data in real-time comes with inherent possibilities for data quality issues.
DataOps engineers rely on standard tools to gain insights into data systems, but often without the business context of the data. Without this context, they lack sufficient information about the data quality issues, their business impact, and their potential causes.
Poor data quality disrupts the business value chain, leading to failed sales orders, delayed shipments, invoices stuck in the system, or poor customer experiences. If organizations cannot identify the criticality and consequences of the data issues, they will have trouble deciding the course of action.
Why monitoring data pipelines is important
Large volumes of data can never be 100% error-free. Duplicate data, inconsistent data, schema changes, and data drift are common data quality issues that keep emerging. DataOps engineers primarily try to minimize errors and to eliminate those that affect the business the most. Data monitoring as part of DataOps helps build confidence in data systems, ensuring that operations proceed as expected and catching errors before they compound. A deeper view of systems adds the context of what is happening, how it can affect downstream applications, whether it can cause outages, and whether it has any severe consequences.
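As a rough illustration of two of the issues named above, the sketch below checks a batch for duplicate keys and for drift in a numeric field. It uses only the Python standard library. The function names, the sample records, and the three-standard-deviation threshold are assumptions chosen for the example, and the drift test is deliberately crude compared with what a real monitoring tool would do.

```python
from collections import Counter
from statistics import mean, stdev


def find_duplicates(records, key_fields):
    """Return key tuples that appear more than once in the batch."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in records)
    return [key for key, n in counts.items() if n > 1]


def has_drifted(baseline, current, max_shift=3.0):
    """Flag drift when the current mean lies more than `max_shift` baseline
    standard deviations away from the baseline mean."""
    spread = stdev(baseline)
    if spread == 0:
        return mean(current) != mean(baseline)
    return abs(mean(current) - mean(baseline)) / spread > max_shift


# Hypothetical sample data.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 22.0},
    {"order_id": 2, "amount": 22.0},  # duplicate key
]
print(find_duplicates(orders, ["order_id"]))             # [(2,)]
print(has_drifted([20, 21, 19, 22, 18], [40, 42, 41]))   # True
```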
Data pipelines ingest data from sources, transform and enrich it, and make it available for storage, operations, or analytics in a governed manner. Managing the multiple processing stages of complex data pipelines requires continuous visibility into the dependencies of data assets and their effect on data quality. Identifying data issues early, before they affect downstream applications, is essential for prioritizing and resolving them quickly.
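One way to picture this dependency visibility is as a graph of data assets. The sketch below assumes a hypothetical lineage map (`DOWNSTREAM`, with invented asset names) and walks it breadth-first to list everything that could be affected when one asset has a quality issue. It is an illustration of the idea, not how any particular observability product stores lineage.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "crm_export": ["customers_clean"],
    "customers_clean": ["sales_mart", "churn_model"],
    "sales_mart": ["revenue_dashboard"],
    "churn_model": [],
    "revenue_dashboard": [],
}


def impacted_assets(start, graph):
    """Breadth-first walk returning every asset downstream of `start`."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(impacted_assets("customers_clean", DOWNSTREAM))
# ['sales_mart', 'churn_model', 'revenue_dashboard']
```

Knowing the list of impacted assets is what lets a team rank an issue by its consequences rather than by where it was first seen.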
Gartner estimates that data downtime, when data is not available or of poor quality, can cost about $140K to $540K per hour, considering all the lost opportunities of the connected complex ecosystem. Data observability reduces data downtime by predicting, identifying, prioritizing, and helping resolve data quality issues before they impact your business.
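For a sense of scale, the hourly range quoted above can be turned into a rough estimate for a single incident. The 90-minute duration and the function name `downtime_cost_range` are assumptions for illustration, and scaling linearly with time is a simplification.

```python
# Hourly cost range quoted above, in US dollars.
LOW_PER_HOUR, HIGH_PER_HOUR = 140_000, 540_000


def downtime_cost_range(minutes):
    """Return (low, high) cost estimates for an incident of the given length."""
    hours = minutes / 60
    return LOW_PER_HOUR * hours, HIGH_PER_HOUR * hours


low, high = downtime_cost_range(90)  # hypothetical 90-minute incident
print(f"${low:,.0f} to ${high:,.0f}")  # $210,000 to $810,000
```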
How to implement Data Observability in your business
You can take a five-step approach when planning to implement a data observability capability.
- Understand the purpose of data, metadata and data governance. Metadata management is a cross-organizational agreement on how to define informational assets for converting data into an enterprise asset. Data governance goes hand in hand with metadata management to ensure access to trusted data that is correctly understood throughout the lifecycle and used in the right context.
- Understand data quality, how you can improve it, and how data observability helps fix data quality at scale.
- Identify roles and responsibilities for the data observability capability in your organization.
  - Data engineers and DataOps engineers monitor and prevent data quality errors, manage data quality processes, and focus on improving system performance.
  - BI analysts, data analysts and data scientists contribute to improving the quality across data sources and models.
  - Data strategists and business leaders ensure correct alignment of business and data strategies, optimize resources, and lead the proposed program.
- Evaluate data on the five pillars of data observability:
  - Volume: Does your data meet the requirements? Is it complete? This pillar offers insights into the health of your data system, alerting you if that health is compromised.
  - Freshness: Is your data up to date? How recent is it? Are there any gaps? The freshness of data is critical for analytics and data-driven decisions.
  - Distribution: Are your data field values within the accepted range? Values in the appropriate range build trust in data. Null values or any abnormal values can indicate issues with the field-level health of data.
  - Schema: Has the formal structure of your data management system changed? If so, who made what changes and when? These insights indicate the health of the data system.
  - Lineage: Do you have the complete picture of your data landscape? How are your upstream and downstream data sources related? Do you know who interacts with your data at which stages? Data lineage also offers insights into governance and whether correct practices are followed.
  These pillars are closely related to the data quality dimensions.
- Choose a scalable, automated and predictive data quality tool that enables everyone to catch errors before they hurt your business.
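The volume, freshness, distribution, and schema pillars can each be checked on a single batch of records. The sketch below is one possible way to do that in plain Python. The column names, thresholds, and the function `check_batch` are all hypothetical, and lineage is left out because it depends on metadata about how assets relate to each other rather than on the contents of one batch.

```python
from datetime import datetime, timedelta, timezone

EXPECTED_COLUMNS = {"order_id", "amount", "updated_at"}  # hypothetical schema


def check_batch(rows, now, min_rows=2, max_age=timedelta(hours=24),
                amount_range=(0, 10_000)):
    """Return a dict mapping pillar name to a list of readable issues."""
    issues = {"volume": [], "freshness": [], "distribution": [], "schema": []}

    # Volume: did we receive at least the expected number of rows?
    if len(rows) < min_rows:
        issues["volume"].append(f"only {len(rows)} rows, expected >= {min_rows}")

    # Schema: does every row carry exactly the expected columns?
    for i, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            issues["schema"].append(f"row {i} has columns {sorted(row)}")

    # Freshness: is the newest record recent enough?
    stamps = [r["updated_at"] for r in rows if r.get("updated_at")]
    if not stamps or now - max(stamps) > max_age:
        issues["freshness"].append("no record updated within the allowed window")

    # Distribution: are values present and inside the accepted range?
    low, high = amount_range
    for i, row in enumerate(rows):
        amount = row.get("amount")
        if amount is None or not low <= amount <= high:
            issues["distribution"].append(f"row {i} has amount {amount!r}")

    return issues


# Hypothetical batch: the second row has a null amount.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "amount": 25.0, "updated_at": now - timedelta(hours=1)},
    {"order_id": 2, "amount": None, "updated_at": now - timedelta(hours=30)},
]
report = check_batch(rows, now)
print(report["distribution"])  # ['row 1 has amount None']
```

Returning issues grouped by pillar, with the offending row index, keeps the link between an alert and the individual records that triggered it.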
Sophisticated data observability
Sophisticated data observability capabilities deliver:
- True end-to-end reliability for healthier data pipelines
- Monitoring all your data at rest without compromising security or regulatory compliance
- Leveraging ML to automatically detect patterns, outliers, anomalies, schema changes, and schema or cell values that suddenly break from past trends
- Drilling down to individual records that violate monitoring rules
- Profiling data sets and providing metrics on actual and inferred data types, minimum and maximum values, value frequencies, null value counts, and unique values
- Profiling time series data and performing anomaly analysis, including spike detection or change point detection, while accounting for seasonality of changes in data
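The last two capabilities can be approximated with simple statistics. The sketch below profiles a column and flags spikes in a time series by comparing each point only with other points at the same position in the seasonal cycle. This is a plain statistical stand-in, not the ML approach such tools use. The function names, the sample series, and the threshold are assumptions, and `profile` assumes the non-null values in a column are mutually comparable.

```python
from collections import Counter
from statistics import mean, stdev


def profile(values):
    """Basic column profile: inferred types, min/max, nulls, and uniques."""
    present = [v for v in values if v is not None]
    return {
        "inferred_types": sorted({type(v).__name__ for v in present}),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "null_count": len(values) - len(present),
        "unique_count": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }


def seasonal_spikes(series, period, threshold=3.0):
    """Flag indexes whose value sits more than `threshold` standard deviations
    from the other observations at the same position in the seasonal cycle."""
    spikes = []
    for i, value in enumerate(series):
        peers = [v for j, v in enumerate(series)
                 if j % period == i % period and j != i]
        if len(peers) < 2:
            continue  # not enough history to judge this point
        spread = stdev(peers)
        if spread > 0 and abs(value - mean(peers)) / spread > threshold:
            spikes.append(i)
    return spikes


print(profile([3, 1, None, 3])["null_count"])  # 1

# Hypothetical series alternating between a low and a high regime (period 2).
series = [10, 50, 11, 51, 10, 49, 12, 90, 11, 50]
print(seasonal_spikes(series, period=2))  # [7]
```

Comparing within the seasonal position is what keeps the regular swing between 10 and 50 from being reported, while the jump to 90 still stands out.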
Data Observability is now rapidly gaining momentum in DataOps, delivering a deep understanding of the data systems and full business context of data quality issues. These capabilities continuously monitor the five pillars, alerting DataOps before any data issues can edge in. In the coming years, data observability will be considered a critical competency of data-driven organizations.