Last week on our blog, I introduced the Clean Data Manifesto, the call to action we have developed for anyone who works with data. The manifesto is about committing to clean data—which is critical to being able to stand behind the validity of any analysis. As part of this project, we have identified five tenets of proper data preparation practice—tenets we believe must be followed as part of the clean data commitment.
The first tenet, which I discussed last week, is about the importance of prioritizing and setting targets. In essence, it is about understanding the goal of your analytics project so that you know what level of data quality it requires and what to look out for in your dataset. For the second tenet, we dive into the nuts and bolts of data preparation: identifying issues early and often.
Clean Data Tenet #2: Identify Issues Early and Often
Once you’ve identified your targets and priorities, you need to be able to identify any issues and confirm your data is sound. As a guide, it’s crucial to keep in mind the 4 Cs of data quality as you prepare your data: the consistency, conformity, completeness and currency of the data.
- Consistency: Make sure you have a clear picture of data consistency. Is the data statistically valid? Is it internally coherent? Are there extreme values, outliers or anomalies? Consider, as an example, ad fraud. I mentioned this example in my last post to talk about the need for familiarity with the data in order to register that a spike in activity, or an otherwise normal-looking value, could mean a bot is at work. But once you’ve identified that extreme spikes in traffic equate to bots, the next step is to remove those data points for consistency.
- Conformity: Does the data adhere to acceptable standards and patterns? Are there any mismatched values? Healthcare is a key example of an industry where data needs to conform to specific standards in order to be actionable. Financial services is another, where organizations’ data and reporting need to conform to certain regulations.
- Completeness: Is all the necessary data included? Are there missing values? Marketing is an industry where we see this issue pop up quite a bit: data on marketing leads might need to be enriched with more information to build more robust and actionable contact databases, so that companies (like Malwarebytes) can ultimately market to leads more effectively.
- Currency: How current is the data? Is it up to date? Is it refreshed regularly enough?
Take PepsiCo as an example: PepsiCo receives massive amounts of data about its inventory and warehousing. In order to optimize its margins and projections (and, most importantly, its profits), it needs to make sure the inventory data it has is completely up to date so it can make accurate decisions about shipments.
Using automated tools to assess the data through both statistical and empirical approaches will help you uncover any anomalies, or any Cs that don’t pass muster, and focus your refinement efforts.
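To make the 4 Cs a little more concrete, here is a minimal Python sketch of what an automated check might look like. Everything in it is an assumption made for illustration: the field names (email, visits, updated), the sample rows, the outlier threshold and the seven-day freshness window are all hypothetical, and a real project would pick rules that match its own targets. It is not how Trifacta or any other tool works under the hood.

```python
import re
from datetime import date
from statistics import median

# Hypothetical required fields and a deliberately simple email pattern.
REQUIRED_FIELDS = ("email", "visits", "updated")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_four_cs(rows, today, max_age_days=7, outlier_threshold=5.0):
    """Return a dict mapping each of the 4 Cs to the indices of flagged rows."""
    issues = {"consistency": [], "conformity": [], "completeness": [], "currency": []}

    # Consistency: measure how far each value sits from the median, in units
    # of the median absolute deviation (MAD), which is robust to spikes.
    visits = [r["visits"] for r in rows if r.get("visits") is not None]
    med = median(visits) if visits else None
    mad = median(abs(v - med) for v in visits) if visits else 0

    for i, row in enumerate(rows):
        # Completeness: any required field that is missing or empty.
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            issues["completeness"].append(i)

        # Consistency: extreme values relative to the rest of the column.
        v = row.get("visits")
        if v is not None and mad > 0 and abs(v - med) / mad > outlier_threshold:
            issues["consistency"].append(i)

        # Conformity: values that are present but do not match the pattern.
        email = row.get("email")
        if email and not EMAIL_PATTERN.match(email):
            issues["conformity"].append(i)

        # Currency: records not refreshed within the allowed window.
        updated = row.get("updated")
        if updated is not None and (today - updated).days > max_age_days:
            issues["currency"].append(i)

    return issues


# Made-up sample data, one deliberate problem per C.
rows = [
    {"email": "a@example.com", "visits": 12, "updated": date(2024, 1, 9)},
    {"email": "b@example.com", "visits": 15, "updated": date(2024, 1, 10)},
    {"email": "not-an-email", "visits": 11, "updated": date(2024, 1, 10)},
    {"email": "", "visits": 14, "updated": date(2024, 1, 8)},
    {"email": "e@example.com", "visits": 9000, "updated": date(2024, 1, 10)},
    {"email": "f@example.com", "visits": 13, "updated": date(2023, 11, 1)},
]

issues = check_four_cs(rows, today=date(2024, 1, 10))
for name, flagged in issues.items():
    print(name, flagged)
```

The sketch uses the median rather than the mean for the consistency check because a single bot-driven spike can drag an average far enough to hide itself. It also only flags rows and leaves the decision to remove or repair them to you, since that depends on the targets you set under the first tenet.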
The Next Generation of Data Preparation
The truth is, the best way to get ahead of issues is to identify them early and often. This is difficult to do when you’re shipping your data requirements off to IT and then waiting to receive the data back before you can assess it and define new requirements. It is just as difficult when you’re scrolling through columns upon columns of Excel pages or code. These time-consuming tasks and cycles don’t lend themselves to efficient detection.
New data preparation platforms like Trifacta are visually driven, allowing you to easily spot issues of data quality or completeness. The machine intelligence that powers Trifacta takes that even further: the system offers signals and suggestions that can help you make the best transformation to address any issues, so that you can ultimately get back to what you do best, which is actually analyzing the data. #CommitToCleanData #NoExcuses