If you follow our blog, then by now you’re probably well aware of our Clean Data Manifesto, the tenets of clean data that we’ve identified, and how they help inform sound data preparation practices. We’ve reviewed the reasons why you should prioritize and set targets, identify issues early and often in your data preparation efforts, and why collaboration is the key to strengthening these efforts.
Fourth on our list of clean data tenets is constantly monitor—or why you should always stay vigilant about the quality or “cleanliness” of your data.
Clean Data Tenet #4: Constantly Monitor
Leveraging external, third-party data sets is often critical to enhancing your analysis, as we explained in our previous post.
These data sets typically arrive with the same formatting and require the same transformations—think customer point of sale (POS) data reports that are generated each month. Instead of manually creating a set of transformations for each new report, organizations are increasingly leveraging scheduling and automation capabilities in order to increase their efficiency. Why keep repeating the same transformations over and over again, if you know exactly how these data sets need to be prepared?
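The idea of defining transformations once and reusing them for every incoming report can be sketched in a few lines. This is a minimal illustration, not Trifacta's actual mechanism; the column names and cleaning rules are hypothetical stand-ins for whatever a monthly POS report might need.

```python
def clean_pos_report(rows):
    """Apply one reusable set of transformations to a monthly POS report.

    `rows` is a list of dicts as read from the raw report. The specific
    fields (store_id, units_sold, revenue) are illustrative assumptions.
    """
    cleaned = []
    for row in rows:
        cleaned.append({
            "store_id": row["store_id"].strip().upper(),   # normalize IDs
            "units_sold": int(row["units_sold"] or 0),     # coerce to int
            "revenue": round(float(row["revenue"] or 0.0), 2),
        })
    return cleaned
```

Because every month's report has the same shape, the same function can be run on each new file by a scheduler instead of being rebuilt by hand.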
However, you can’t leave everything to automation. Modern data preparation—like so many modern business practices—is enabled by the marriage of automated and manual processes. If your data is coming from multiple sources, you’ll need to continually validate its quality and consistency. A best practice is leveraging automation to do the heavy lifting while constantly monitoring the structure and content of inbound data, so that your data pipelines and the resulting analysis aren’t unexpectedly impacted. Examine the data by asking questions such as:
- Is today’s data what we expected?
- How is it different from what we have historically seen?
- Are the variances meaningful?
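Questions like these can be encoded as an automated check that compares today's batch against historical baselines and flags anything a human should review. This is a simplified sketch under assumed inputs: per-column summary metrics as plain dicts, and a hypothetical tolerance threshold of 25%.

```python
def flag_variances(today, history_mean, tolerance=0.25):
    """Flag columns whose metric deviates from its historical mean.

    `today` and `history_mean` map column names to a summary metric
    (e.g. row count or total). Returns a dict of flagged columns with
    a short reason, covering both value drift and structural changes.
    """
    flagged = {}
    for col, value in today.items():
        baseline = history_mean.get(col)
        if baseline is None:
            flagged[col] = "new column"  # structure changed unexpectedly
            continue
        if baseline and abs(value - baseline) / baseline > tolerance:
            flagged[col] = f"{value} vs. historical {baseline}"
    for col in history_mean:
        if col not in today:
            flagged[col] = "column missing"  # schema drift in the other direction
    return flagged
```

A check like this does the tedious comparison automatically; the analyst only steps in when something is flagged, which is the automated-plus-manual balance described above.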
Maintaining your service-level agreement (SLA) expectations will require regular, repeated data quality assessments, especially when you’re onboarding data from third parties. Determine who is responsible and track when they sign off on data pipelines, so that you can keep things moving forward efficiently—without risking data quality.
A New Approach to Data Preparation
If the technologies you leverage today don’t support scheduling or automation—e.g., one-off scripting or Excel macros—then modern data preparation platforms can help. Look for platforms with robust connectivity in order to fuel your downstream work with a diverse set of sources, and flexible scheduling capabilities that fit the needs of your team.
To quickly scan and sign off on the data coming through, platforms that support visual exploration are key. They allow anyone in the organization to identify outliers and data quality issues, whether or not they are responsible for maintaining the data pipeline. Take PepsiCo as an example. To predict sales, PepsiCo used to manually transform each new customer data set in Excel as it was received. With Trifacta, they’re able not only to look at the entirety of that data at once (because the data volume supported in Trifacta far exceeds Excel’s), but also to automate and schedule repeated transformations of data sets. This allows them to dramatically improve efficiency and catch quality issues before publishing downstream.
Modern data preparation leverages the best of automated and manual processes, knowing which is necessary at each stage of the overall data preparation process. Combining technology with human monitoring is the only way to arm analysts with better process efficiency and true quality assurance—so they can produce results that are trustworthy without creating new issues along the way.