The appeal of being a data analyst or data scientist does not come from cleaning messy data. And yet, that activity often ends up consuming the majority of an analyst's total analytic time (on average, 80 percent of it) while just 20 percent is dedicated to visualizing and analyzing data, building machine learning models, or other advanced analytics.
It’s easy to glamorize the latter 20 percent, or at the very least deem the initial cleansing of messy data nothing more than a necessary hurdle before beginning the “real” work. But cleaning messy or dirty data deserves more credit. It is a critical part of the broader data preparation process, which, properly conducted, offers insights into the data at hand that allow better questions to be asked of it. And it’s the very foundation of the end analysis. We’ve all heard the expression “garbage in, garbage out,” but it’s important to remember what that really means: fail to adequately clean messy data, and no matter what fancy analytic techniques are applied next, the end analysis will be faulty.
What Is Messy Data?
Rarely do analysts begin working with a new dataset without cleansing it first. That doesn’t always mean the dataset is “messy,” per se, but at the very least, it doesn’t adhere to the required standards for analysis. For example, a messy dataset may contain inconsistent values that need to be standardized, such as “California” spelled as “Calif.” when the standard is “CA.” Or phone numbers that contain dashes when there shouldn’t be any added punctuation.
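As a minimal sketch of this kind of standardization, the snippet below uses a hypothetical lookup table of state-name variants and a simple regular expression for phone numbers; the table, field names, and sample values are illustrative, not from any particular dataset:

```python
import re

# Hypothetical lookup mapping known spellings of a state to its code.
STATE_CODES = {"california": "CA", "calif.": "CA", "ca": "CA"}

def clean_state(value):
    """Normalize any known spelling of a state to its two-letter code."""
    key = value.strip().lower()
    return STATE_CODES.get(key, value.strip())

def clean_phone(value):
    """Strip dashes, spaces, parentheses, and other non-digits."""
    return re.sub(r"\D", "", value)

print(clean_state("Calif."))        # CA
print(clean_phone("415-555-0199"))  # 4155550199
```

A real project would back the lookup table with a complete reference list (all fifty states and their common variants) rather than a handful of entries.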
Beyond standardization, there are many other types of errors that may need to be cleaned in dirty data. These errors can include:
- missing data
- unstructured data
- multiple variables in one column
- variables stored in the wrong places
- observations split incorrectly or left together against normalization rules
- switched columns and rows
- extra spaces
And those are just a few of countless examples. Each of these errors needs to be cleaned before the data can yield valuable insights.
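To make two of the listed errors concrete, here is a small sketch that splits a column holding multiple variables (a full name) and flags missing data explicitly; the records and field names are invented for illustration:

```python
# Toy records: "name" holds two variables, and one city is missing.
rows = [
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "Grace Hopper", "city": ""},
]

cleaned = []
for row in rows:
    # Split the combined column into two separate variables.
    first, _, last = row["name"].partition(" ")
    cleaned.append({
        "first_name": first,
        "last_name": last,
        # Represent missing data explicitly instead of an empty string.
        "city": row["city"] or None,
    })

print(cleaned[1])
# {'first_name': 'Grace', 'last_name': 'Hopper', 'city': None}
```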
Techniques for Data Cleaning
Understanding data cleaning techniques starts with understanding the available data cleaning tools. Historically, analysts have either relied on spreadsheet tools like Excel or programming languages such as SQL, R or Python, depending upon the complexity of the data at hand and/or their own technical expertise. Today, analysts also have the option of using a modern data preparation platform—but more on that below.
No matter the tool they’re using, analysts will begin cleaning data with data parsing—in other words, segmenting the important elements of a data file into a structured format. This allows analysts to actually make sense of the data instead of having to decipher a jumbled mess of values, and lets some of the early distinctions and patterns begin to shine through.
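In Python, for instance, parsing a delimited extract into structured records is a few lines with the standard `csv` module. The raw text, delimiter, and column names below are made up for the sketch; in practice the data would come from a file:

```python
import csv
import io

# A small pipe-delimited extract standing in for a real data file.
raw = "id|name|signup_date\n1|Ada|2021-03-01\n2|Grace|2021-04-15\n"

# DictReader segments each line into named fields, turning
# delimited text into structured records.
reader = csv.DictReader(io.StringIO(raw), delimiter="|")
records = list(reader)

print(records[0]["name"])  # Ada
```

Once the values sit in named fields rather than raw strings, the more specific cleaning steps below become straightforward to apply.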
After data parsing, analysts will move on to more specific data cleaning tasks. As reviewed above, these can span a number of issues, and there’s no set order in which to tackle them. An analyst may start by removing extra spaces or duplicate data, or by structuring the data into new rows and columns right away.
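Two of those starting points, collapsing extra whitespace and dropping duplicates, can be sketched in plain Python on an invented list of values:

```python
values = ["  Alice ", "Bob", "Alice", "Bob ", "  Carol"]

# Collapse leading, trailing, and repeated internal whitespace first,
# so that "  Alice " and "Alice" compare as equal.
normalized = [" ".join(v.split()) for v in values]

# Drop duplicates while preserving order (dict keys keep insertion order).
deduplicated = list(dict.fromkeys(normalized))

print(deduplicated)  # ['Alice', 'Bob', 'Carol']
```

Note the order matters: deduplicating before normalizing whitespace would have left `"  Alice "` and `"Alice"` as two separate records.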
It is important to note, however, that data cleansing shouldn’t be considered a one-time job, but an iterative process. As analysts prepare data themselves, the work often sparks new ideas for transforming it. And even after they’ve moved on to analysis, analysts may return to transform the data differently after they’ve noticed an irregularity or interesting insight.
Challenges with Dirty Data
Historically, when organizations have complained about cleaning dirty data, the complaints relate to one of the following three issues:
- It’s time-consuming.
As mentioned earlier, analysts can spend up to 80% of the total analysis process cleaning dirty data. Though it’s an important part of the analytic process, data cleaning isn’t what analysts were hired to do—nor where their true value lies. And the more time spent cleaning data, the more money it costs the organization.
- It’s technical.
In the case of more advanced data projects, organizations must hire costly data scientists or data engineers with advanced programming skills—only to have them spend the majority of their time cleaning data. Programming languages are effective in wrangling big, complex data, but they limit data preparation to a small group of people, which creates a big bottleneck.
- It’s error-prone.
Anything is error-prone if it can’t be reviewed by another pair of eyes. And that’s often the case with both Excel and programming languages. Since these tools aren’t visual in nature, it’s often difficult (or nearly impossible) for others to review the data cleaning work done with them or add suggestions to improve it.
Wrangling Messy Data with a Data Preparation Platform
As organizations seek to scale analytics, the challenges of working with dirty data are only multiplying. That’s why many have adopted data preparation platforms like Trifacta Wrangler, which accelerates the process of cleansing and preparing data for analysts of all stripes.
Here is how Trifacta counters the challenges that come with messy data:
- Time constraints.
Trifacta offers a visual data preparation experience guided by machine learning, which accelerates the overall process by up to 90 percent. Analysts won’t have to spend hours combing through Excel datasets or writing code to clean data. Instead, a few clicks and swipes with Trifacta can generate the same results.
- Technical constraints.
It isn’t necessary to know programming languages in order to use Trifacta, yet it offers the same power a programmer would have. Making mass changes to data at scale, understanding the statistics of sample data, and performing complex transformations are just a few of the capabilities that any kind of analyst can exercise in Trifacta.
- Inaccurate data.
Datasets need to be prepared effectively as well as efficiently. Trifacta automatically surfaces errors, outliers, and missing data to be corrected so that analysts don’t miss a beat. Plus, its machine-learning-generated suggestions help analysts select the best possible transformations for their data.
Trifacta Wrangler was designed to make data cleaning more efficient and more manageable. Now when analysts are faced with a messy dataset—and all datasets are messy in some form—they can easily wrangle it into usable data for analysis.