We’ve covered a lot of ground in our Clean Data Manifesto series, outlining the five tenets of clean data. We’ve reviewed why you should prioritize and set targets, why you should identify issues early and often in your data preparation efforts, why collaboration is the key to strengthening those efforts, and why it’s important to constantly monitor your day-to-day work.
The final tenet in our series on clean data is to ensure transparency.
Clean Data Tenet #5: Ensure transparency
The ability to trust your data hinges on trusting the process you use to clean it. This means having a full audit trail to understand lineage and chain of custody.
It’s not enough to communicate your results—you need to communicate the steps that got you there. Show your work. This is critical for meeting external compliance requirements (take the regulatory reporting needs of financial services firms as an example: banks are required to fully document their data systems and data transformation efforts) as well as for your own internal credibility. To ensure your results can be reproduced, understood and trusted, you have to be able to audit how and when the data was transformed, as well as who transformed it. Be transparent about the ways in which you’ve altered the data in order to build trust, ensure consistency and remove potential bias.
This is a critical step that needs to be factored into the timeline of your ultimate deliverable. In addition to being time-consuming, doing this work manually is error-prone, and the documentation quickly falls out of date. The ongoing maintenance of the audit trail and the associated change management process can easily dwarf the upfront cost. Stakeholders may need to sign off on the data transformation once or multiple times throughout the process. This is why it’s important to understand the chain of custody up front, and to align all of the involved stakeholders from the beginning on the prioritization and targets of the analysis.
A New Approach to Data Preparation
The cost and complexity of this kind of data governance necessitates a metadata-driven approach that is self-documenting and provides built-in lineage, audit and controls. Legacy approaches that involve writing tons of custom code or ad-hoc manipulation in spreadsheets make it impractical, and in some cases impossible, to provide the right level of transparency. For example, when scripting is used, understanding how the data was transformed requires a full code review, often walking through thousands of lines of Python, Java, C++, etc. to ensure integrity. At the other end of the spectrum, when spreadsheets are used, the changes are not rules-driven, and therefore neither consistently repeatable nor collectively verifiable. Worse yet, without any metadata, many of the changes made in spreadsheets are destructive, leaving no clues as to what changed, how it changed and who changed it.
This is why it’s essential to use a data preparation platform that will systematically track how your data has been transformed, building in the governance and controls that will establish and maintain data provenance in an automated way. If it’s a separate process divorced from the work itself, it will add unnecessary overhead and slow you down, or it won’t get done. If it’s locked up in complex code, it will be largely inaccessible—too hard to review, share and validate. Solving this involves balancing efficiency in doing the work with the need to make changes to the data self-evident and unambiguous. Data prep platforms ensure that auditability is a natural byproduct of the act of cleaning the data. How the data is transformed should be easily discoverable through simple steps that define recipes and visualizations that document overall data flows. This ensures proper, predictable outcomes that are verifiable every step of the way.
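To make the idea concrete, here is a minimal sketch of what “auditability as a byproduct” can look like in code. This is a hypothetical illustration, not any particular platform’s API: each cleaning step is a named, rules-driven recipe step, and running the recipe automatically appends an audit entry recording what ran, when, who ran it, and how the row count changed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

@dataclass
class AuditEntry:
    """One record in the audit trail: what ran, when, who, and the effect."""
    step: str
    user: str
    timestamp: str
    rows_before: int
    rows_after: int

@dataclass
class Recipe:
    """A sequence of named cleaning steps that self-documents as it runs."""
    user: str
    steps: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def add_step(self, name: str, fn: Callable[[list], list]) -> None:
        self.steps.append((name, fn))

    def run(self, rows: list) -> list:
        # Applying a step and logging it are one operation, so the
        # audit trail can never drift out of sync with the work itself.
        for name, fn in self.steps:
            before = len(rows)
            rows = fn(rows)
            self.audit_log.append(AuditEntry(
                step=name,
                user=self.user,
                timestamp=datetime.now(timezone.utc).isoformat(),
                rows_before=before,
                rows_after=len(rows),
            ))
        return rows

# Example recipe: drop rows with missing emails, then normalize case.
recipe = Recipe(user="analyst@example.com")
recipe.add_step("drop_missing_email",
                lambda rows: [r for r in rows if r.get("email")])
recipe.add_step("lowercase_email",
                lambda rows: [{**r, "email": r["email"].lower()} for r in rows])

data = [{"email": "A@X.COM"}, {"email": None}, {"email": "b@y.com"}]
clean = recipe.run(data)
```

Because every transformation flows through `run`, anyone reviewing the work can read the step names and the audit log instead of reverse-engineering the changes from the output data—the documentation is generated by doing the work, not alongside it.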