This article was written by Ankur Gupta and originally appeared on the Collibra Data Quality Blog here: https://www.collibra.com/blog/the-7-most-common-data-quality-issues
Data-driven organizations depend on modern technologies and AI to get the most out of their data assets, yet they struggle with data quality issues all the time. Incomplete or inaccurate data, security problems, hidden data – the list is endless. Several surveys reveal the extent of the costs incurred across many verticals because of data quality problems.
What are the most common data quality issues?
Poor data quality is enemy number one to the widespread, profitable use of machine learning. If you want to make technologies like machine learning work for you, you need a sharp focus on data quality. In this blog post, let’s discuss some of the most common data quality issues and how we can tackle them.
1. Duplicate data
Modern organizations face an onslaught of data from all directions – local databases, cloud data lakes, and streaming data. Additionally, they may have application and system silos. There is bound to be a lot of duplication and overlap in these sources. Duplication of contact details, for example, affects customer experience significantly. Marketing campaigns suffer if some prospects get missed out while some may get contacted again and again. Duplicate data increases the probability of skewed analytical results. As training data, it can also produce skewed ML models.
Rule-based data quality management can help you keep a check on duplicate and overlapping records. With predictive DQ, rules are auto-generated and continuously improved by learning from the data itself. Predictive DQ identifies fuzzy and exactly matching data, quantifies it into a likelihood score for duplicates, and helps deliver continuous data quality across all applications.
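To make the idea of a duplicate likelihood score concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not how any particular predictive DQ product works: the `normalize` and `duplicate_candidates` helpers, the sample contacts, and the 0.85 threshold are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(record):
    """Join the fields, lowercase them, and collapse repeated whitespace."""
    return " ".join(" ".join(record).lower().split())


def duplicate_candidates(records, threshold=0.85):
    """Return (index_a, index_b, score) for pairs scoring at or above threshold.

    The score is difflib's similarity ratio between 0.0 and 1.0, where 1.0
    means the normalized records are identical (an exact match).
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


# Hypothetical contact records: (name, email)
contacts = [
    ("Maria Lopez", "maria.lopez@example.com"),
    ("maria  lopez", "Maria.Lopez@example.com"),  # exact match once normalized
    ("Mariah Lopez", "maria.lopez@example.com"),  # fuzzy match
    ("John Smith", "jsmith@example.com"),
]

for a, b, score in duplicate_candidates(contacts):
    print(f"records {a} and {b} look like duplicates (score {score})")
```

Comparing every pair is fine for a small list but grows quadratically, so a real deduplication job would typically first group records by a cheap key (such as email domain or postal code) and only score pairs within each group.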
2. Inaccurate data
Data accuracy plays a critical role in highly regulated industries like healthcare. Recent experience has made the need to improve data quality for COVID-19 and subsequent pandemics more evident than ever. Inaccurate data does not give you a correct real-world picture and cannot help you plan the appropriate response. If your customer data is not accurate, personalized customer experiences disappoint, and marketing campaigns underperform.
Data inaccuracies can be traced back to several factors, including human error, data drift, and data decay. Gartner says that around 3% of data decays globally every month, which is alarming. The quality of data can degrade over time, and data can lose its integrity as it travels across systems. Automating data management can help to some extent, but dedicated data quality tools can deliver much better data accuracy.
With predictive, continuous and self-service DQ, you can detect data quality issues early in the data lifecycle and proactively fix them to power trusted analytics.
3. Ambiguous data
In large databases or data lakes, some errors can creep in even with strict supervision. This situation gets more overwhelming for data streaming at high speed. Column headings can be misleading, formatting can have issues, and spelling errors can go undetected. Such ambiguous data can introduce multiple flaws in reporting and analytics.
By monitoring continuously with autogenerated rules, predictive DQ resolves ambiguity quickly, tracking down issues as soon as they arise. It delivers high-quality data pipelines for real-time analytics and trusted outcomes.
4. Hidden data
Most organizations use only a part of their data, while the remaining may be lost in data silos or dumped in data graveyards. For example, customer data available with sales may not get shared with the customer service team, losing an opportunity to create more accurate and complete customer profiles. Hidden data means missing out on discovering opportunities to improve services, design innovative products, and optimize processes.
If hidden data is a data quality issue for your organization, trust predictive DQ for auto-discovery as well as the ability to discover hidden relationships (such as cross-column anomalies and ‘unknown unknowns’) in your data. Consider investing in a data catalog solution, too. A recent survey concludes that best-in-class companies are 30% more likely to have a dedicated data catalog solution.
5. Inconsistent data
When you’re working with multiple data sources, you are likely to find mismatches in the same information across sources. The discrepancies may be in formats, units, or sometimes spellings. Inconsistent data can also be introduced during migrations or company mergers. If not reconciled constantly, inconsistencies in data tend to build up and destroy the value of data. Data-driven organizations keep a close watch on data consistency because they want only trusted data powering their analytics.
Continuous DQ automatically profiles datasets, highlighting quality issues whenever data changes. For DataOps, a comprehensive dashboard ranks issues by impact, which helps to prioritize triage quickly. The adaptive rules keep learning from data, ensuring that inconsistencies get addressed at the source and that data pipelines provide only trusted data.
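Format and unit mismatches are easier to reason about with a small example. The sketch below assumes two hypothetical sources, `warehouse` and `carrier`, that describe the same shipments with different phone formats and weight units. The field names, the tolerance, and the helper functions are illustrative assumptions, not part of any specific tool.

```python
def normalize_phone(value):
    """Keep digits only so '(555) 010-2000' and '555.010.2000' compare equal."""
    return "".join(ch for ch in value if ch.isdigit())


def to_kilograms(amount, unit):
    """Convert a weight to kilograms; only 'kg' and 'lb' are handled here."""
    factors = {"kg": 1.0, "lb": 0.45359237}
    return amount * factors[unit.lower()]


def find_mismatches(source_a, source_b, tolerance_kg=0.01):
    """Compare records that share a key and list the fields that still disagree."""
    mismatches = []
    for key in sorted(source_a.keys() & source_b.keys()):
        a, b = source_a[key], source_b[key]
        if normalize_phone(a["phone"]) != normalize_phone(b["phone"]):
            mismatches.append((key, "phone"))
        if abs(to_kilograms(*a["weight"]) - to_kilograms(*b["weight"])) > tolerance_kg:
            mismatches.append((key, "weight"))
    return mismatches


# Hypothetical records keyed by shipment id
warehouse = {
    "S1": {"phone": "(555) 010-2000", "weight": (10.0, "kg")},
    "S2": {"phone": "555-010-3000", "weight": (5.0, "kg")},
}
carrier = {
    "S1": {"phone": "555.010.2000", "weight": (22.0462, "lb")},  # same data, other format
    "S2": {"phone": "555-010-3001", "weight": (5.0, "lb")},      # genuinely different
}

print(find_mismatches(warehouse, carrier))
# [('S2', 'phone'), ('S2', 'weight')]
```

Normalizing before comparing separates cosmetic differences (S1) from real disagreements (S2), so only the latter need reconciling at the source.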
6. Too much data
While we focus on data-driven analytics and its benefits, too much data does not seem to be a data quality issue. But it is. When you are looking for data relevant to your analytical projects, it’s possible to get lost in too much data. Business users, data analysts, and data scientists spend 80% of their time locating the right data and preparing it. Other data quality issues become more severe with the increasing volume of data, especially with streaming data and large files or databases.
If you are struggling to make sense of the massive volume and variety of data arriving from various sources, we have the answer. Without moving or extracting any data, predictive DQ can scale up seamlessly and deliver continuous data quality across multiple sources. With fully automatic profiling, outlier detection, schema change detection and pattern analysis, you don’t need to worry about too much data.
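Outlier detection, one of the checks mentioned above, is easier to picture with a small example. The sketch below is not how any particular product implements it. It is a minimal illustration that scores each value against its own group using the modified z-score built on the median absolute deviation (MAD), with the commonly used cutoff of 3.5. The field names `store` and `amount` and the sample rows are assumptions.

```python
from collections import defaultdict
from statistics import median


def grouped_outliers(rows, group_key, value_key, cutoff=3.5):
    """Flag rows whose value is far from the median of their own group.

    Uses the modified z-score: 0.6745 * |value - median| / MAD.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    flagged = []
    for members in groups.values():
        values = [m[value_key] for m in members]
        center = median(values)
        mad = median(abs(v - center) for v in values)
        if mad == 0:
            continue  # no spread in this group, so the score is undefined
        for m in members:
            score = 0.6745 * abs(m[value_key] - center) / mad
            if score > cutoff:
                flagged.append(m)
    return flagged


# Hypothetical daily sales: store A sells around 100, store B around 1000
sales = (
    [{"store": "A", "amount": v} for v in (100, 102, 98, 101, 99, 500)]
    + [{"store": "B", "amount": v} for v in (1000, 1010, 990, 1005, 995)]
)

print(grouped_outliers(sales, "store", "amount"))
# [{'store': 'A', 'amount': 500}]
```

The value 500 sits between the two stores' usual ranges, so a single column-wide check would not single it out. Scoring within each group is what exposes it as unusual for store A.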
7. Data downtime
Data-driven companies rely on data to power their decisions and operations. But there can be short periods when their data is not reliable or not ready (especially during events like M&A, reorganizations, infrastructure upgrades and migrations). This data downtime can hurt companies significantly, leading to customer complaints and poor analytical results. According to a study, about 80% of a data engineer’s time is spent on updating, maintaining and assuring the quality of the data pipeline. The long operational lead time from data acquisition to insight creates a high marginal cost for asking the next business question.
The reasons for data downtime can vary from schema changes to migration issues. The complexity and magnitude of data pipelines can be challenging too. What’s essential is monitoring data downtime continuously and minimizing it through automated solutions.
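Schema changes are one of the simpler downtime causes to catch automatically. As a rough sketch, assuming each pipeline run records its schema as a plain mapping of column name to type name, two snapshots can be compared like this. The `diff_schema` helper and the column names are hypothetical.

```python
def diff_schema(previous, current):
    """Compare two {column_name: type_name} mappings and report the differences."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    retyped = sorted(
        (col, previous[col], current[col])
        for col in previous.keys() & current.keys()
        if previous[col] != current[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schema snapshots from two consecutive runs
yesterday = {"id": "int", "email": "string", "signup_date": "date"}
today = {"id": "string", "email": "string", "signup_ts": "timestamp"}

changes = diff_schema(yesterday, today)
print(changes)
# {'added': ['signup_ts'], 'removed': ['signup_date'], 'retyped': [('id', 'int', 'string')]}

if any(changes.values()):
    print("Schema changed since the last run; review before loading downstream.")
```

A renamed column shows up as one removal plus one addition, so a check like this flags the change but still needs a person or a further rule to decide whether it was a rename.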
Accountability and SLAs can help control data downtime. But what you really need is a comprehensive approach to ensuring constant access to trusted data. Predictive DQ can track issues to continuously deliver high-quality data pipelines that are always ready for operations and analytics.
In addition to the issues above, organizations also struggle with unstructured data, invalid data, redundancy in data, and data transformation errors.
The most common data quality problem statements
|Data quality problem statement|Description|
|---|---|
|Tell me when something suddenly changes in my data|Any column, schema or cell value that suddenly breaks its past trend. Catching this would require thousands of conditional statements and their ongoing management unless you use behavioral analytics for automatic change control.|
|How many phone number formats are in this column?|This DQ problem is common in STRING or VARCHAR fields, where you can end up with many different formats for, say, a zip code, phone number, or SSN. It is helpful to find the majority formats and show the topN data shapes that make up the column values. This helps identify typos and strange formats.|
|Has my row count dropped on any dataset?|It can be important to know if the volume of a dataset drops, also known as a row count drop. When a dataset suddenly has fewer rows than normal, it can mean data is missing in the file or table.|
|The NULL values problem|The null check is generated from the columns’ past behavior or descriptive statistics.|
|I need to detect outliers per some grouping.|Sometimes, basic column-level outliers do not solve the issue. Grouped outlier detection applies when a user wants to find egregious numeric values relative to the population of their group.|
|I need DQ in my data pipeline.|I already have a data pipeline in Python or Scala or Spark and want to control the DQ operations. Some call this an ETL pipeline, making this ETLQ.|
|The Bill Gates, William Gates fuzzy matching problem|This problem is not suited for conditional statements. You need to opt into any grouping of columns and find exact or similar records (fuzzy matching). This can be done at the column or record level. Identify duplicate or redundant data within a dataset via fuzzy or exact matching.|
|I need to compare two tables.|It is common to need validation when loading data from a file into a database table, or from a source database into a target database, to identify missing records, values and broken relationships across tables or systems.|
|I’d like to see a heatmap of where all my data errors exist.|Visualize a blind spot heatmap by time, business units and scheduled jobs.|
|The state and zip code don’t belong to each other in my dataset.|Define relationships by identifying cross-column anomalies. Commonly used for hierarchical and parent/child mis-mappings.|
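The "data shapes" idea from the phone number row can be sketched in a few lines of Python. This is only an illustration under simple assumptions: digits are mapped to `9`, letters to `A`, and everything else is kept as-is. The `shape_of` and `top_shapes` helpers and the sample values are made up for the example.

```python
from collections import Counter


def shape_of(value):
    """Map digits to '9' and letters to 'A', keeping punctuation and spaces."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def top_shapes(values, n=3):
    """Return the n most common shapes in a column with their counts."""
    return Counter(shape_of(v) for v in values).most_common(n)


# Hypothetical phone column with mixed formats and one typo (letter O for zero)
phones = [
    "555-010-2000",
    "555-010-2001",
    "(555) 010-2002",
    "555.010.2003",
    "555-010-2004",
    "5550102005",
    "555-O10-2006",
]

for shape, count in top_shapes(phones, n=len(phones)):
    print(f"{count:>2}  {shape}")
```

The majority shape here is `999-999-9999`, and the rare `999-A99-9999` shape immediately exposes the typo, which would be hard to spot by scanning raw values.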
How do you fix data quality issues?
Data quality is a critical aspect of the data lifecycle. If you want to address data quality issues at the source, the best way is to prioritize it in the organizational data strategy. The next step is to involve and enable all stakeholders to contribute to data quality.
Finally, tools. Choose tools with intelligent technologies to improve the quality as well as unlock the value of data. Incorporate metadata to describe and enrich data in the context of who, what, where, why, when, and how. Consider data intelligence to understand and use your organizational data in the right way.
When evaluating data quality tools, look for tools that deliver continuous data quality at scale. Along with them, use data governance and data catalog to ensure that all stakeholders can access high-quality, trusted, timely, and relevant data.
Data quality issues can be treated as opportunities to fix problems at the root and prevent future losses. With a shared understanding of data quality, leverage your trusted data to improve customer experience, uncover innovative opportunities, and drive business growth.