Garbage in, garbage out. When it comes to data, this well-known aphorism can prove particularly pernicious: errors or omissions in data can undermine otherwise meticulous analyses, often in ways that are easy to overlook. Rather than informed data-driven decisions, we may fall prey to data-driven delusions.
Since our founding, our goal at Trifacta has been to help people understand what is in a dataset (warts and all) and transform the data to make it actionable and fit for downstream use. Assessing and improving data quality is at the heart of this mission. While there are many different dimensions of data quality (accuracy, completeness, timeliness, etc.), the central concern can be simply stated: is my data fit for purpose?
However, data quality is a constantly moving target. As new data lands, it needs to be checked and reconciled with existing data. Schemas may change, subtle errors may sneak in, and distributions may drift over time. In addition, the intended purpose for the data may change as use cases evolve, bringing new quality requirements.
Working with customer partners, we have launched a new initiative that we call Adaptive Data Quality (ADQ). Traditional data quality rules (logical checks that the data adheres to specified requirements) are a valuable component, and we are excited to now include them as first-class citizens in the Trifacta experience. However, we also want to look beyond traditional rules to consider new ways to express, update, and act upon data quality needs. In particular, we envision data quality tools that are assistive, interpretable, actionable, and ongoing.
Assistive. Traditional data quality rules are logical statements that check that the data does not violate various constraints; for example, we might require that values are unique, non-null, or within an expected range. While valuable, these rules can be tedious to specify by hand and may require data management or mathematical expertise. In keeping with Trifacta’s approach to predictive interaction, we seek to help users express data quality concerns more easily and accurately.
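To make the idea of a rule concrete, here is a minimal sketch in plain Python of the three constraints mentioned above (unique, non-null, within an expected range). It is an illustration only and not how Trifacta implements rules; the sample records, column names, and helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; the column names are illustrative only.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 2, "age": 212},
    {"id": 4, "age": 58},
]

def failing_rows(rows, column, rule):
    """Return the indices of rows whose value in `column` fails `rule`."""
    return [i for i, row in enumerate(rows) if not rule(row[column])]

def not_null(value):
    return value is not None

def in_range(low, high):
    # Nulls are left to the not_null rule so each rule reports one kind of problem.
    return lambda value: value is None or low <= value <= high

def duplicate_rows(rows, column):
    """Return the indices of rows whose value in `column` occurs more than once."""
    counts = Counter(row[column] for row in rows)
    return [i for i, row in enumerate(rows) if counts[row[column]] > 1]

null_failures = failing_rows(rows, "age", not_null)
range_failures = failing_rows(rows, "age", in_range(0, 120))
unique_failures = duplicate_rows(rows, "id")
```

Each check returns row indices rather than a bare pass/fail count, so the offending records can be inspected directly. Even in this toy form, every rule needs its own column, bounds, and null-handling decision, which is where the tedium of writing rules by hand comes from.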
Our first step down this path is data quality rule suggestions: based on a statistical profile of the data, we automatically suggest possible quality rules, including integrity constraints, formatting patterns, and dependencies between columns. Even for data quality experts, these suggestions can reveal otherwise overlooked anomalies. On multiple occasions we’ve even discovered subtle inconsistencies in datasets thought to be “clean.” Going forward, previous rules created by others can also serve as the basis for recommendations, much as Trifacta supports collaborative data transformation suggestions.
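As a rough sketch of how a statistical profile can drive suggestions, the Python below proposes candidate rules for a single column from a few simple statistics. This is a simplified stand-in and not Trifacta's suggestion engine: the `shape` encoding, the `suggest_rules` function, and the 0.8 threshold are assumptions made for illustration.

```python
import re
from collections import Counter

def shape(text):
    """Reduce a string to a coarse pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", text))

def suggest_rules(values, pattern_threshold=0.8):
    """Propose candidate rules for one column based on a simple profile."""
    present = [v for v in values if v is not None]
    suggestions = []
    if not present:
        return suggestions
    if len(present) == len(values):
        suggestions.append(("not_null",))
    if len(set(present)) == len(present):
        suggestions.append(("unique",))
    if all(isinstance(v, (int, float)) for v in present):
        suggestions.append(("range", min(present), max(present)))
    if all(isinstance(v, str) for v in present):
        # Suggest the dominant shape only when most values already follow it.
        top_shape, count = Counter(shape(v) for v in present).most_common(1)[0]
        if count / len(present) >= pattern_threshold:
            suggestions.append(("pattern", top_shape))
    return suggestions

# Hypothetical column of postal codes with one truncated value.
zip_codes = ["94105", "10001", "60614", "9410", "73301", "02139"]
zip_suggestions = suggest_rules(zip_codes)
```

Here the suggested pattern is `99999`. Accepting it would immediately flag the four-digit value, which is the kind of overlooked anomaly a suggestion can surface.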
Data quality rules work well when the underlying issue can be stated as a logical constraint. However, it is sometimes easier for a person to recognize that a value is erroneous than to state precisely why it is erroneous. Given enough examples of “clean” or “dirty” values, machine learning methods can be used to create classifiers for probabilistic data quality rules. Following recent database research efforts, we have started prototyping ways to learn new quality rules from user annotations, providing an additional avenue for expressing data quality concerns.
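To illustrate what a probabilistic rule learned from annotations might look like, the toy Python sketch below estimates how likely a value is to be “dirty” from a handful of labeled examples. It is deliberately simplistic and does not reproduce the research methods referred to above: the single feature (a coarse character shape), the Laplace-smoothed estimate, and all names are assumptions made for illustration.

```python
import re
from collections import defaultdict

def shape(text):
    """Reduce a string to a coarse pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", text))

def learn_rule(labeled_values):
    """Learn P(dirty | shape) from (value, is_dirty) annotations."""
    counts = defaultdict(lambda: [0, 0])  # shape -> [clean count, dirty count]
    for value, is_dirty in labeled_values:
        counts[shape(value)][int(is_dirty)] += 1

    def probability_dirty(value):
        clean, dirty = counts.get(shape(value), (0, 0))
        # Laplace smoothing: an unseen shape gets 0.5, meaning "no evidence".
        return (dirty + 1) / (clean + dirty + 2)

    return probability_dirty

# Hypothetical user annotations on a date column.
annotations = [
    ("2021-03-04", False),
    ("2020-11-30", False),
    ("2019-01-15", False),
    ("03/04/21", True),
    ("4 March", True),
]
probability_dirty = learn_rule(annotations)
p_iso = probability_dirty("2022-07-09")   # shape seen three times, always clean
p_slash = probability_dirty("12/25/20")   # shape seen once, labeled dirty
p_unseen = probability_dirty("n/a")       # shape never seen
```

Unlike a logical rule, the output is a score rather than a verdict, so some threshold has to be chosen and some false positives or false negatives accepted.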
Interpretable. A data quality rule is of limited value if analysts and data engineers cannot readily understand its meaning. One pillar of our adaptive data quality approach is to provide inspectable visualizations: rather than simply listing the percentage of records that pass or fail a test, the Trifacta interface supports detailed exploration and filtering, so users can examine the specific records that pass or fail a rule and how they relate to each other.
Probabilistic rules impose another layer of complexity: they may not always be “correct” due to false positives or false negatives, and it may not always be clear what features of the data are driving a prediction. Building on recent efforts in machine learning interpretability, we are invested in helping users understand the reliability of statistical quality predictions and surfacing the features of the data most strongly associated with those predictions.
Actionable. Data quality tools should also be actionable: not only revealing issues, but providing steps to help fix them. In some cases, data quality issues will inevitably fall outside the scope of a given tool, requiring additional data collection or upstream corrections. Where possible, however, quality rules should be paired with corresponding transformations for users to consider. For example, if the data violates a required formatting pattern (say, for dates, telephone numbers, or addresses), Trifacta can suggest transformations to standardize those values.
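The pairing of a rule with a corrective transformation can be sketched in a few lines of Python. The example below is hypothetical and not Trifacta's transformation logic: it assumes a required `NNN-NNN-NNNN` telephone format and US-style numbers, and the `standardize_phone` helper is invented for illustration.

```python
import re

# Assumed target format for this illustration: NNN-NNN-NNNN.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")

def standardize_phone(raw):
    """Propose a standardized form, or None if the value cannot be repaired here."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop an assumed US country code
    if len(digits) != 10:
        return None  # needs upstream correction or more data
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

phones = ["415-555-0132", "(415) 555 0198", "+1 415.555.0111", "555-0100"]

# Only values that violate the rule receive a proposed fix.
proposals = {
    raw: standardize_phone(raw)
    for raw in phones
    if not PHONE_PATTERN.fullmatch(raw)
}
```

The last value has too few digits, so the proposal is `None`: an example of an issue that falls outside what a transformation can repair and has to be corrected upstream.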
Acting on data quality information is not limited to corrective transformations; it can also involve dynamically reconfiguring large data processing workflows. When orchestrating a collection of transformation jobs at scale, computed quality information should inform subsequent alerting and scheduling. For example, if an updated data table is of unacceptably low quality, the responsible people should be alerted. Meanwhile, downstream jobs may be postponed until data quality is restored, conserving resources and preventing problematic data from making it into production use.
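One way to picture this kind of quality gate is the short Python sketch below. It assumes quality is summarized as a rule pass rate compared against a required threshold; the `plan_downstream` function, the job names, and the shape of the returned plan are hypothetical and do not correspond to any particular scheduler's API.

```python
def plan_downstream(passed, total, threshold, downstream_jobs):
    """Decide whether downstream jobs should run, based on a rule pass rate."""
    pass_rate = passed / total if total else 0.0
    if pass_rate >= threshold:
        return {"run": list(downstream_jobs), "postponed": [], "alert": None}
    message = (
        f"Quality pass rate {pass_rate:.0%} is below the required {threshold:.0%}; "
        f"postponing {len(downstream_jobs)} downstream job(s)."
    )
    return {"run": [], "postponed": list(downstream_jobs), "alert": message}

# Hypothetical job names and counts.
jobs = ["build_report", "refresh_dashboard"]
blocked = plan_downstream(passed=820, total=1000, threshold=0.95, downstream_jobs=jobs)
cleared = plan_downstream(passed=990, total=1000, threshold=0.95, downstream_jobs=jobs)
```

In a real workflow the alert would be routed to whoever owns the table, and the postponed jobs would be retried once a later batch clears the threshold.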
Ongoing. In production contexts, data quality is rarely a “one and done” activity, as both data and use cases can evolve and shift. Consequently, monitoring quality over time and across data batches is a critical component of adaptive data quality. A trend of historical data quality indicators provides more context than point values alone. Users should be alerted to changes in schema and data distributions that may affect fitness for use. Both traditional and probabilistic data quality rules may also need to adapt in response. As such, new experiences for ongoing monitoring of data quality and management of data quality rules are essential.
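A minimal Python sketch of batch-over-batch monitoring follows. It makes two assumptions that are ours and not a description of Trifacta's monitoring: schemas are represented as column-to-type mappings, and a quality indicator is flagged when it falls more than three standard deviations below its historical mean. All names and sample values are hypothetical.

```python
from statistics import mean, stdev

def schema_changes(previous, current):
    """Compare two {column: type} mappings and report what changed."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "retyped": sorted(c for c in shared if previous[c] != current[c]),
    }

def quality_dropped(history, latest, max_sigma=3.0):
    """Flag `latest` if it is far below the historical trend of an indicator."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        return latest < baseline
    return (baseline - latest) / spread > max_sigma

# Hypothetical schemas from two consecutive batches.
old_schema = {"id": "int", "amount": "float", "region": "str"}
new_schema = {"id": "int", "amount": "str", "country": "str"}
changes = schema_changes(old_schema, new_schema)

# Hypothetical pass rates for one rule across earlier batches.
pass_rates = [0.98, 0.97, 0.99, 0.98, 0.97]
alert_on_drop = quality_dropped(pass_rates, latest=0.90)
alert_on_normal = quality_dropped(pass_rates, latest=0.97)
```

Comparing against the trend rather than a fixed cutoff is what gives the historical context: a pass rate of 0.90 might be acceptable for a noisy feed but is a sharp departure for one that has held near 0.98.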
With the release of data quality rules and automatic suggestions, we have just begun our adaptive data quality journey at Trifacta. These features are not just stand-alone aids: they are woven into the core Trifacta user experience, which allows users to view, transform, and validate data in an integrated way. To prevent garbage in, garbage out (GIGO), we are excited to continue developing next-generation data quality tools that are assistive, interpretable, actionable, and ongoing (AIAO).