Essential data quality tests for data teams
I wrote a LinkedIn post not long ago about data trustworthiness. In it, I described the data quality dimensions to consider when determining whether the data set you provide end users is trustworthy: completeness, consistency, validity, accuracy, uniqueness, and timeliness. Here, I want to expand on each of these dimensions, flesh out the details, and show how you can actually implement the different tests.
Completeness
Data is complete if all required data is present and there are no missing values in critical fields.
For example, if you created a fact table for sales orders, the expectation is that all sales orders are in that fact table. If there are missing sales orders, the fact table is incomplete. To test for this, you can implement a source-to-target test. Here's a SQL query to illustrate that.
SELECT COUNT(*) -- expectation is 0
FROM source_table s
LEFT JOIN target_table t
ON s.primary_key = t.primary_key
WHERE t.primary_key IS NULL
In addition to not missing rows, the table shouldn't be missing values where values are expected. Turning to the sales orders fact table again, quantity is likely a column that should not contain NULL values. Testing for NULLs ensures there are no missing values in those critical columns.
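A minimal sketch of such a NULL test, assuming the fact table is called target_table and quantity is the critical column:
SELECT COUNT(*) -- expectation is 0
FROM target_table
WHERE quantity IS NULL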
Consistency
Consistent data is incredibly important for providing business value. Consistent data means data values should align across systems, or be reconcilable when they don’t. If a customer’s address is in a CRM and an ERP, it should not conflict unless there’s a justified reason.
It's not possible to do robust analysis across different business processes unless the data is at least partly consistent. To take another example, if I have an order management system and a demand planning system that both contain product information, the SKUs on the products across both systems must be the same if any product-level forecasting and analysis are to be done. Master data management, anyone? 😁
Consistent data is critical for analysis of a thing, e.g., products, but also across things. If there's no way to join products, customers, and orders, for example, the data will only be useful in its silo (read: not at all).
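Returning to the SKU example, here's a sketch of a reconciliation test; the table and column names (oms_products, dps_products, product_id, sku) are assumptions for illustration.
SELECT COUNT(*) -- expectation is 0
FROM oms_products o
JOIN dps_products d
ON o.product_id = d.product_id
WHERE o.sku <> d.sku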
Data vault modelling, specifically business data vaults, can really help maintain consistency of data from multiple systems. For more information on data vault, I recommend Patrick Cuba's book.
Validity
Valid data values must conform to the expected type, format, or business rules.
- Expected type: A column should be in its expected data type. The table schema often enforces this constraint for you, but if you have a critical column that should be a double and it's cast to a string in the table, you've got invalid data.
- Format: Does your email column follow a valid email format? How about the phone number column? You can use regular expressions to test that columns are in the format you expect.
- Business rules: This one can vary widely, but imagine you have a column that describes the state of a product: draft, in production, and discontinued. Because the column should have only one of these three values, you can create a test that ensures these are the only values in the column (see the sketch after this list).
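Here's a sketch of a format test and a business-rule test. The table and column names are assumptions, and the pattern below uses LIKE for portability since regex functions vary by warehouse (e.g., REGEXP_LIKE in Snowflake, ~ in PostgreSQL).
-- Format: emails that don't match a basic email shape
SELECT COUNT(*) -- expectation is 0
FROM customer_table
WHERE email NOT LIKE '%_@_%._%'
-- Business rule: product states outside the accepted values
SELECT COUNT(*) -- expectation is 0
FROM product_table
WHERE product_state NOT IN ('draft', 'in production', 'discontinued')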
Accuracy
Accurate data should reflect the real-world thing it represents. Accurate data is "right" in context, not just formatted correctly.
You've gone through and made sure the columns in your sales order table have the right data types and are formatted correctly, but your business partners are telling you that a product should never have a negative price, and some of yours do. Whoops! Your data is inaccurate.
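A sketch of that rule as a test, assuming the fact table is called sales_order_fact and has a price column:
SELECT COUNT(*) -- expectation is 0
FROM sales_order_fact
WHERE price < 0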
In addition to business rules like this one, you can implement tests to ensure your data reflects the real world. For example, if your customer table contains addresses, you can test that those addresses actually exist using reference data.
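As a sketch, assuming a customer_table with a postal_code column and a postal_code_reference table of known-valid codes, the reference-data test follows the same source-to-target pattern as the completeness check:
SELECT COUNT(*) -- expectation is 0
FROM customer_table c
LEFT JOIN postal_code_reference r
ON c.postal_code = r.postal_code
WHERE r.postal_code IS NULL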
Uniqueness
Each entity should appear only once in the dataset unless duplicates are intentional (which should be rare). Duplicates in data can be disastrous in analysis, so testing for uniqueness is a must. Thankfully, testing for duplicates is straightforward for modelled data. Here's a SQL query example of a duplicate check.
SELECT primary_key, COUNT(*) -- expectation is no rows
FROM target_table
GROUP BY primary_key
HAVING COUNT(*) > 1
There are also use cases where fuzzy matching can be helpful to detect "near-duplicates" (e.g., "John Smith" vs "Jon Smith").
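How you implement fuzzy matching depends on your database. As a sketch, PostgreSQL's fuzzystrmatch extension provides a levenshtein() function; the customer_table, customer_name column, and distance threshold below are assumptions.
-- Requires the fuzzystrmatch extension in PostgreSQL
SELECT a.customer_id, b.customer_id, a.customer_name, b.customer_name
FROM customer_table a
JOIN customer_table b
ON a.customer_id < b.customer_id
WHERE levenshtein(lower(a.customer_name), lower(b.customer_name)) <= 2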
Timeliness
Data should be available when expected and meet the agreed-upon freshness requirements. Late or stale data reduces usefulness. Here, you can test and alert by...
- Checking if your ETL pipeline finished within the specified SLA
- Checking that the most recently updated record in your table is within the defined threshold (e.g., the last fact table update was less than 12 hours ago). A sketch of this check follows below.
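Assuming the fact table is called sales_order_fact and has an updated_at timestamp column, a freshness check could look like this (the interval syntax is PostgreSQL-style and varies by warehouse):
SELECT MAX(updated_at) AS last_update -- expectation is no rows
FROM sales_order_fact
HAVING MAX(updated_at) < NOW() - INTERVAL '12 hours'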
Conclusion
Data can be extremely helpful when it's right, and wildly misleading when it's even slightly wrong. Building data systems with these data quality dimensions — completeness, consistency, validity, accuracy, uniqueness, and timeliness — will go a long way to helping your organization make better, data-driven decisions.