This article was written by Collibra and originally appeared on the Collibra Data Quality Blog here: https://www.collibra.com/us/en/blog/what-is-data-quality
What is data quality and why is it important?
The Data Management Association (DAMA) defines data quality management as the “Planning, implementation and control of activities… to assure data assets are fit for consumption and meet the needs of data consumers.”
In his book Getting in Front on Data, Tom Redman states, “Data is of high quality if it is fit for its intended use (by customers) in operations, analytics, decision-making, and planning. To be fit for use, data must be ‘free from defects’ (i.e., ‘right’) and ‘possess desired features’ (i.e., be the ‘right data’).” In short, data quality indicates whether data is fit for use to drive trusted business decisions.
The key drivers of data quality are:
- Exponential growth in volume, speed, and variety of business data
- Multiple systems leading to a larger, more complex and more expensive hidden data factory
- Increasing compliance pressure – Regulations such as GDPR, BCBS 239, CCAR, and HIPAA, among others, require data auditing and reporting
- Data migrations – When moving large volumes of data to the cloud or to new storage, it’s important to identify missing records, values, and broken relationships across tables or systems
- High-performing AI initiatives – Monitoring data drift helps detect changes in the accuracy and performance of analytic models over time
- Customer experience – Creating a personalized experience for customers requires fresh and complete data about the individual recipient
Organizations today depend on data for every decision and consider it a significant enterprise asset. As business analysts and data scientists struggle to find trusted data to power their solutions, data quality is becoming a higher priority in business data strategy.
What is good data quality?
Good quality data represents the business scenario correctly and helps you approach the problem at hand more precisely. You can use the foundation of good quality data to derive trusted information, which in turn drives trusted business decisions. Superior business results can further strengthen the case for data quality in a continuous improvement cycle.
Confidence in data is critical for using data collaboratively across the enterprise, and good data quality is an indicator of how quickly you can achieve data-to-value.
Why is data quality important?
Incomplete, duplicated, redundant, or inaccurate data is commonplace in business, resulting from human errors, siloed tools, multiple handovers and inadequate data strategy. Businesses routinely face frustrated customers, higher operational costs, or inaccurate reports due to poor data quality. MIT Sloan Management Review research points out that the cost of bad data is an astonishing 15% to 25% of revenue for most companies.
Streamlining operational processes is a critical use case for data quality.
- Marketing campaigns often yield poorer results because effort is wasted on incorrect addresses or duplicated customers
- Suppliers send the wrong material or quantity because data is mismatched across departments
- Reconciling inconsistent data for compliance requires more manual effort, which raises costs or delays the process
Data quality also strongly affects how quickly you can respond to business changes.
- Inaccurate or old data fails to identify new opportunities
- Analysis based on poor quality data cannot indicate if the current campaigns are working or need changes
- Financial reporting may not represent the correct picture with incomplete or obsolete data, affecting timely actions
As organizations rush to embrace big data and AI-enabled automation, they need to appreciate good quality data even more.
How do you determine data quality?
Measuring data quality in the context of specific domains or tasks is often more relevant and practical. You can begin by taking an inventory of your data assets and choosing a pilot sample data set. The next step is to assess the data set for validity, accuracy, completeness, and consistency. You can also count the instances of redundant, duplicated, and mismatched data. Establishing a baseline on a small data set makes it easier to scale the effort quickly.
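As a rough illustration of establishing a baseline on a pilot sample, the sketch below computes completeness, validity, and duplicate rate for a single field. The records, the field names, and the simple email pattern are hypothetical and chosen only for the example; a real assessment would use your own data and rules.

```python
import re
from collections import Counter

# Hypothetical pilot sample; field names and values are illustrative only.
records = [
    {"id": 1, "email": "ana@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "US"},
    {"id": 3, "email": "bob@example", "country": "DE"},
    {"id": 4, "email": "ana@example.com", "country": "US"},
]

# A deliberately simple pattern (assumption), not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def non_empty(rows, field):
    """Return the values of `field` that are present and non-empty."""
    return [row[field] for row in rows if row.get(field) not in (None, "")]


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    return len(non_empty(rows, field)) / len(rows)


def validity(rows, field, pattern):
    """Share of non-empty values that match the expected format."""
    values = non_empty(rows, field)
    if not values:
        return 0.0
    return sum(1 for value in values if pattern.match(value)) / len(values)


def duplicate_rate(rows, field):
    """Share of non-empty values that repeat an earlier value."""
    counts = Counter(non_empty(rows, field))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n - 1 for n in counts.values()) / total


baseline = {
    "email_completeness": completeness(records, "email"),
    "email_validity": validity(records, "email", EMAIL_PATTERN),
    "email_duplicate_rate": duplicate_rate(records, "email"),
}
print(baseline)
```

Recording these numbers for a small data set gives you a reference point to compare against as you extend the same checks to more fields and sources.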
Rule-based data quality management is an excellent approach, where you can define business rules for specific requirements. You can also establish targets for data quality and compare them with the current levels. Setting targets facilitates continuous measurement, discovering opportunities for improvement, and good data hygiene.
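One way to picture rule-based measurement against targets is the minimal sketch below. The order records, the rule names, and the target pass rates are all assumptions made up for illustration; they do not come from any particular product or standard.

```python
# Hypothetical records; field names and values are illustrative only.
orders = [
    {"order_id": "A1", "quantity": 5, "currency": "USD"},
    {"order_id": "A2", "quantity": 0, "currency": "USD"},
    {"order_id": "A3", "quantity": 2, "currency": "usd"},
    {"order_id": "", "quantity": 1, "currency": "EUR"},
]

ALLOWED_CURRENCIES = {"USD", "EUR"}  # assumed business rule

# Each rule maps a name to (check on one record, target pass rate).
rules = {
    "order_id_present": (lambda r: bool(r["order_id"]), 1.00),
    "quantity_positive": (lambda r: r["quantity"] > 0, 0.95),
    "currency_allowed": (lambda r: r["currency"] in ALLOWED_CURRENCIES, 0.70),
}


def evaluate(rows, rule_set):
    """Compare the current pass rate of each rule with its target."""
    report = {}
    for name, (check, target) in rule_set.items():
        pass_rate = sum(1 for row in rows if check(row)) / len(rows)
        report[name] = {
            "pass_rate": pass_rate,
            "target": target,
            "meets_target": pass_rate >= target,
        }
    return report


report = evaluate(orders, rules)
for name, result in report.items():
    status = "OK" if result["meets_target"] else "BELOW TARGET"
    print(f"{name}: {result['pass_rate']:.2f} (target {result['target']:.2f}) {status}")
```

Running the same evaluation on a schedule turns the targets into a continuous measurement, and the rules that fall below target point to where improvement is most needed.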
According to Gartner, data quality improvement efforts tend to focus narrowly on accuracy. Data consumers have a much broader definition of data quality than technical professionals may realize. For example, data accuracy is meaningless unless the data is also accessible, understandable, and relevant.
What is an example of data quality?
What happens when someone is rushed into an emergency procedure? Healthcare staff quickly retrieve digital patient records, which are expected to present complete information at all times. If the patient data fails to indicate allergies or ongoing medications, the consequences can be severe. Good quality patient data helps ensure that every treatment correctly addresses the unique healthcare needs of the individual at any point in time.
In business, good data quality ensures that your data is fit to support analysis and steers your efforts in the right direction.
Breaking down data quality vs. data integrity
As we’ve seen, data quality is primarily a measure of the data’s reliability and accuracy. It refers to the ability to use the data for an intended business purpose, which might include informing, planning, and driving decision-making. Data quality relies on a number of dimensions, including:
- Consistency. Data entries are standardized and consistent.
- Timeliness. The data is up-to-date.
- Uniqueness. Data sets don’t contain repeated or irrelevant entries.
- Completeness. Data should be representative, giving a clear picture of real-world conditions.
- Validity. The data conforms to the formatting required by your business.
In a way, data quality is a subset of data integrity. Data integrity not only requires that data be accurate, consistent, and complete, but also that it be in context. Put another way, data integrity is the assurance of data quality and consistency over the data’s complete lifecycle. To achieve data integrity, there must be no unintended changes or alterations when data records are modified, updated, or integrated.
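To make the idea of "no unintended changes" concrete, the sketch below fingerprints each record before and after a move and reports what went missing, what was added, and what was altered. The snapshots, the `id` key, and the function names are hypothetical; this is one simple way to check integrity, not a description of any specific tool.

```python
import hashlib
import json


def fingerprint(record):
    """Stable SHA-256 digest of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_changes(before, after, key="id"):
    """Compare two snapshots and report missing, added, and altered keys."""
    before_prints = {row[key]: fingerprint(row) for row in before}
    after_prints = {row[key]: fingerprint(row) for row in after}
    shared = before_prints.keys() & after_prints.keys()
    return {
        "missing": sorted(before_prints.keys() - after_prints.keys()),
        "added": sorted(after_prints.keys() - before_prints.keys()),
        "altered": sorted(k for k in shared if before_prints[k] != after_prints[k]),
    }


# Hypothetical snapshots taken before and after an integration step.
source = [
    {"id": 1, "name": "Ana", "balance": 100},
    {"id": 2, "name": "Bob", "balance": 250},
    {"id": 3, "name": "Cy", "balance": 75},
]
target = [
    {"name": "Ana", "id": 1, "balance": 100},  # same content, different key order
    {"id": 2, "name": "Bob", "balance": 205},  # value changed in transit
]

changes = find_changes(source, target)
print(changes)
```

Because the fingerprint is computed on a canonical form, reordered fields do not count as changes, while a single altered value or a dropped record is reported.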
Identifying and addressing common data quality mistakes
There are a number of common causes of data quality issues within organizations. We’ve already discussed some big-picture reasons, but here are some particular ways to fix common mistakes:
1. Refine your data capture approach
For example, when generating leads, a well-designed form can do a lot of the work of cleaning up your data at the very start of the process. Take advantage of restricted values, pre-populated fields, and other requirements that encourage users to be precise when inputting their data — and ensure they finish the process and submit it.
2. Standardize your team’s approach to data entry
For example, one of the most common ways data is updated is through the sales team. Educating your team members about how the data will be used can help make sure they complete all the necessary fields, and do so accurately. The sales team can bring in a large amount of data, but if it isn’t formatted correctly, you’re missing out on most of its value.
3. Catch and correct duplicate records
Duplicate records are a serious problem that can frustrate sales and muck up automated marketing processes — all while costing you money. Catching duplicate records as early as possible will head off most of the damage they can cause. For that reason, it’s important to establish an alert system to notify you of duplicated records. Analyzing and developing reports to determine how duplicates are generated can help you fix systemic issues.
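A minimal sketch of catching duplicates early and raising an alert might look like the following. It assumes duplicates can be recognized by a normalized email address; the lead records, the normalization rule, and the alert format are all hypothetical, and a real alert would go to your notification system instead of being printed.

```python
from collections import defaultdict

# Hypothetical lead records; ids and emails are illustrative only.
leads = [
    {"id": 101, "email": "Ana@Example.com "},
    {"id": 102, "email": "ana@example.com"},
    {"id": 103, "email": "bob@example.com"},
]


def normalize_email(value):
    """Trim whitespace and lowercase so trivial variants compare equal."""
    return value.strip().lower()


def find_duplicates(rows, key_func):
    """Group record ids by a normalized key and keep groups with repeats."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_func(row)].append(row["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def build_alerts(duplicates):
    """Turn duplicate groups into human-readable alert messages."""
    return [
        f"Possible duplicate for {key}: records {ids}"
        for key, ids in sorted(duplicates.items())
    ]


duplicates = find_duplicates(leads, lambda row: normalize_email(row["email"]))
alerts = build_alerts(duplicates)
for message in alerts:
    print(message)
```

Logging which source or form produced each duplicate group is a natural next step if you want to report on how duplicates are generated.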
How to improve the quality of your data
Identifying and acknowledging the problem is the first step towards solving it. A recent global crisis survey by PwC highlighted the importance of accurate data during crisis management. Data quality is affected by various factors, and they all have their roots in the silos of multiple data sources. You must take a comprehensive approach to understand data and overcome the challenges of managing its quality.
Quoting Tom Redman again – There are two interesting moments in the lifetime of a piece of data: the moment it is created and the moment it is used. The whole point of data quality management is to connect those moments in time — to ensure that the moment of creation is designed and managed to create data correctly, so everything goes well at the moment of use.
- Metadata Management: Metadata management builds on cross-organizational agreement about how informational assets are defined, which helps turn data into an enterprise asset.
- Data Governance: Data governance is a collection of practices and processes to standardize the management of data assets within an organization. A robust data governance foundation builds trust in data.
- Data Catalog: A data catalog empowers users to quickly discover and understand the data that matters, helping them choose trusted data to generate impactful business insights.
- Data Matching: Data matching identifies possible duplicates or overlaps to break down data silos and drive consistency.
- Data Intelligence: Data intelligence is the ability to understand and use your data in the right way. A comprehensive approach to data intelligence promotes and delivers high-quality data.
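Data matching, as described in the list above, often has to cope with records that are similar but not identical. The sketch below uses `difflib.SequenceMatcher` from the Python standard library to flag likely matches between two sources. The customer names, the normalization step, and the 0.85 threshold are assumptions for illustration; production matching usually combines several fields and tuned thresholds.

```python
import string
from difflib import SequenceMatcher

# Hypothetical customer names held in two separate systems.
crm_names = ["John Smith", "Maria Garcia"]
billing_names = ["Jon Smith", "Wei Chen"]

MATCH_THRESHOLD = 0.85  # assumed cut-off; tune it against reviewed examples


def normalize(name):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def similarity(left, right):
    """Similarity ratio between 0 and 1 for two normalized names."""
    return SequenceMatcher(None, normalize(left), normalize(right)).ratio()


def find_matches(left_names, right_names, threshold):
    """Return (left, right, score) for every pair at or above the threshold."""
    matches = []
    for left in left_names:
        for right in right_names:
            score = similarity(left, right)
            if score >= threshold:
                matches.append((left, right, score))
    return matches


matches = find_matches(crm_names, billing_names, MATCH_THRESHOLD)
for left, right, score in matches:
    print(f"{left!r} may match {right!r} (score {score:.2f})")
```

Comparing every pair is fine for a small sample but grows quadratically, so larger data sets typically narrow the candidates first, for example by comparing only records that share a postal code.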
Data quality best practices focus on establishing an enterprise-wide initiative, defining measurement metrics, streamlining procedures, and performing regular audits.
Predictive and continuous data quality offers capabilities such as autonomous rule management, continuous data-drift detection, and automated data profiling. You can enhance these capabilities with data governance, data privacy, data catalog, and data lineage to gain end-to-end control of your data pipelines, bring full business context to data quality, and deliver trusted analytics and AI in a scalable way.
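As a very simple stand-in for data-drift detection, the sketch below measures how far the mean of a new batch has moved from a baseline, expressed in baseline standard deviations. The sample values and the threshold of 3.0 are assumptions; this is a basic statistical check for illustration, not the method used by any particular data quality product.

```python
import statistics

DRIFT_THRESHOLD = 3.0  # assumed: flag shifts beyond three baseline std devs


def drift_score(baseline, current):
    """Shift of the current mean, measured in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    shift = abs(statistics.mean(current) - base_mean)
    if base_std == 0:
        # A constant baseline: any movement at all counts as unbounded drift.
        return 0.0 if shift == 0 else float("inf")
    return shift / base_std


def has_drifted(baseline, current, threshold=DRIFT_THRESHOLD):
    """True when the batch mean has moved further than the threshold allows."""
    return drift_score(baseline, current) > threshold


# Hypothetical transaction amounts: a baseline and two later batches.
baseline_amounts = [100, 102, 98, 101, 99, 100, 103, 97]
steady_batch = [101, 99, 100, 102, 98]
shifted_batch = [120, 118, 125, 122, 119]

print("steady batch drifted:", has_drifted(baseline_amounts, steady_batch))
print("shifted batch drifted:", has_drifted(baseline_amounts, shifted_batch))
```

A check like this only looks at the mean of one numeric column; continuous monitoring would normally also track null rates, value distributions, and row counts over time.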
Gartner estimates that by 2022, 60% of organizations will leverage ML-enabled technology for data quality improvement. How beneficial would it be for your organization if you were able to automate your data quality rule management process and continuously increase the quality of your business-critical data sources and data elements?