Commit to Clean Data: Prioritizing & Setting Targets

July 30, 2018

Of all the ways you can spend your time as an analyst—often up to 80 percent of it—the janitorial work of preparing data isn’t a favorite for most people. It has historically been a manual task: monotonous, time-consuming and not particularly enjoyable. But as most analysts have experienced in one way or another, how well (or poorly) you prepare your data can dictate the quality and trustworthiness of your analysis. Committing to clean data is critical to standing behind the validity of your results.

In an effort to honor the significance of this work, and its notorious challenges, we’ve developed what we’re calling the Clean Data Manifesto, a call to action for anyone who works with data. As part of this manifesto, we have identified five tenets of proper data preparation practice. The first of these tenets is prioritizing and setting targets.

Clean Data Tenet #1: Prioritizing and Setting Targets

For any analytics project, clean data is essential. That said, context matters. The definition of “good” data often varies from project to project, even for the same dataset. An in-depth understanding of your use case will ultimately determine which data quality issues matter most and what “good enough” looks like. What data is most essential to the success of your project? What level of quality is really necessary? How significant are the risks of bad data? Organizations can spend untold resources chasing pristine data quality, yet that investment may not be justified by how much it actually moves the needle. For example, some use cases may require remediating every null value; others may not. Know upfront what is important to your use case so you can maximize return on effort (RoE) and define clear data SLAs.
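To make the idea of per-column targets concrete, here is a minimal sketch of a data SLA check—not Trifacta functionality, just plain Python, with hypothetical column names and thresholds chosen for illustration:

```python
def null_rate(values):
    """Fraction of entries in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def check_null_targets(columns, targets):
    """Return the columns whose null rate exceeds their tolerated maximum.

    columns: mapping of column name -> list of values
    targets: mapping of column name -> maximum tolerated null rate
    Columns without a target are treated as "don't care".
    """
    return {
        name: null_rate(values)
        for name, values in columns.items()
        if null_rate(values) > targets.get(name, 1.0)
    }

# Hypothetical data: customer_id is essential (no nulls tolerated),
# email is useful but not critical, middle_name doesn't matter here.
data = {
    "customer_id": [1, 2, None],
    "email": ["a@x.com", None, None],
    "middle_name": [None, None, None],
}
targets = {"customer_id": 0.0, "email": 0.05}

print(check_null_targets(data, targets))  # only the columns missing their target
```

The point is that `middle_name`, despite being entirely null, raises no flag—because the use case says it doesn’t need to. Effort goes only where the targets say it matters.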

In other instances, it may not be a matter of how much work is put forth to clean the data, but understanding what, exactly, needs to be addressed. For example, to those not familiar with ad fraud detection, it may not register that a spike in activity, or an otherwise normal-looking value, means a bot is at work. Or, to those unfamiliar with complex genome data, it may be difficult to understand the specifics of what needs to be extracted, which data points are most critical for identifying patterns and how to identify instances of different naming conventions for the same thing (or vice versa).
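The naming-convention problem can be sketched as a simple canonicalization step. The gene aliases below are illustrative assumptions, not a real reference table—in practice this mapping is exactly the domain knowledge the passage describes:

```python
# Hypothetical alias table: the same gene often appears under several
# historical names. These particular mappings are illustrative only.
CANONICAL = {
    "p53": "TP53",
    "LFS1": "TP53",
    "HER2": "ERBB2",
    "HER-2/neu": "ERBB2",
}

def normalize(name):
    """Map a variant name to its canonical form; pass unknown names through."""
    name = name.strip()
    return CANONICAL.get(name, name)

records = ["p53", "HER2", "TP53", "BRCA1"]
print([normalize(r) for r in records])  # unknown names are left as-is
```

Building and maintaining that table is the hard part, and it requires someone who knows the domain well enough to recognize that two different strings refer to the same thing.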

Whether the challenge is understanding the impact of quality levels on your project or understanding the nuances of the data itself, both cases underscore the need for the people who know the data best to do this work themselves, rather than enlisting IT professionals who are less familiar with the data or how it will be used downstream. Without close business alignment, it can be difficult to answer the questions you need to ask of your data—or even to ask them in the first place.

A New Approach to Data Preparation

Historically, it was difficult for analysts to do this work themselves. Data preparation was often limited to IT, carried out through complex coding that only IT could undertake. But IT didn’t have the context needed to prepare the data efficiently—the deep understanding that surfaces insights and new questions during preparation and helps reshape the data in useful ways. Analysts would typically define new requirements again and again after seeing the resulting data, a cycle of unnecessary iteration between teams that costs organizations dearly in time and money.

User-friendly data preparation platforms like Trifacta are changing the way that non-technical analysts are able to interact with the data they know best. With an intuitive interface guided by machine learning, business analysts can now prepare data themselves.

The steps to prepare data for one use case may be very different from what is required for another. That’s why it’s so important to know the context inside and out in order to adequately prepare the data — and to ultimately produce analyses founded on data that is clean, suitable and appropriate. #CommitToCleanData #NoExcuses

Sign up for Wrangler to get started.