With the growing adoption of big data infrastructure technologies like Hadoop has come increased awareness of the different activities involved in successful Hadoop-based analyses. Specifically, users have come to appreciate that “data wrangling” is a crucial first step in preparing data for broader analysis – and one that consumes a significant amount of time and effort. However, too many people regard wrangling as janitorial work; as an unglamorous rite of passage before sitting down to do “real” work.
In fact, wrangling is as much a part of the data analysis process as the final results. Wrangling, properly conducted, gives you insights into the nature of your data that then allow you to ask better questions of it. Wrangling is not something that’s done in one fell swoop, but iteratively. Each step in the wrangling process exposes new potential ways that the data might be “re-wrangled,” all driving towards the goal of generating the most robust final analysis.
A good analogy for data wrangling is food preparation. The work it takes to move from raw ingredients to nutritious and palatable meals is a critical activity, and one that largely dictates the quality of your meal. ETL in traditional data work flows is the equivalent of big, commercial food preparation – doing it well means doing it efficiently and consistently. At the other end of the spectrum, and where I’ll spend more time here, are data wrangling tasks that have a more “exploratory” flavor – where figuring out what you can make is as important as the making.
At Trifacta, we think about data wrangling as a process that includes six core activities:
Before I explain each one, a quick point on order: these activities do not always proceed directly from discovering to publishing, one following the other. Instead, as we will soon see, time spent, for example, enriching data might generate more ideas for cleaning it.
Another point: While each of the six wrangling activities has a distinct meaning, a particular data preparation task might fall in a gray zone between two or more of them. For example, cleaning an attribute/field in a dataset might consist of replacing it (i.e., enriching) with a more standard representation (e.g., augmenting an address with a sales region ID). This is nothing to worry about, especially if the end result is a more insightful piece of analysis.
Finally, while each of these activities may sound labor-intensive and even tedious, remember that modern data wrangling software is heavily automated – from a single keystroke or mouse click, these tools use a suite of algorithms that can identify patterns and leverage them to suggest data transformations. And, breaking with our food analogy, modern software enables undo/redo in ways the physical world does not; which means experimentation is possible on a scale previously unseen.
Now, to the six activities!
1. Discovering is something of an umbrella term for the entire process; in it, you learn what is in your data and what might be the best approach for productive analytic explorations. For example, if you have a customer data set, and you learn that most of your shoppers are from a single part of the country, you’re going to keep that in mind as you proceed with your data work. If all of your users are in the Northeast, for example, that will suggest that you might be interested in winter weather patterns if you want to properly understand their behavior. In the world of food, discovering is akin to learning how to leverage eggs, or figuring out that your garlic is more intense than you’re used to.
2. Structuring is needed because data comes in all shapes and sizes. For example, you might have a transaction log where each entry might have one or more items associated with it (think shopping basket). To conduct an inventory analysis, you will likely need to expand each transaction into individual records for each purchased item. Alternatively, you might want to analyze which products are often bought together. In this case, expanding each transaction into every pair of purchased items might be appropriate. With food, structuring is often about cutting (chopping, dicing, julienning, etc.). But it also involves actions like blending, emulsifying, wrapping or infusing.
3. Cleaning involves taking out data that might distort the analysis. A null value, for example, might bring an analytic package to a screeching halt; you may well want to replace it with a zero or an empty string. You might want to standardize a particular field, replacing the many different ways that a state might be written out — such as CA, Cal and Calif — with a single standard format. In broad strokes, cleaning requires knowledge about data quality and consistency – knowing how various data values might impact your final analysis. Similarly, cleaning food is often about removing the bits that you don’t want or need – e.g., de-seeding a pepper to cut the spiciness or rinsing pickled ingredients to soften their pungency or trimming the fat from a cut of meat to balance the texture.
4. Enriching allows you to take advantage of the wrangling you have already done to ask yourself: “Now that I have a sense of my data, what other data might be useful in this analysis?” Or, “What new kinds of data can I derive from the data I already have?” In other words, enrichment is often about joins and complex derivations (like convolutions and converting a time stamp to a day of week). Purchase transaction data, for example, might benefit from data associated with each customer’s profile or historical purchase patterns. Similarly, a car insurance underwriter might want to know crime rates in the neighborhoods of the customers they insure to better estimate risk. This new information is sometimes available in in-house databases, but, and increasingly so, may be sourced from marketplaces for third-party data. The quintessential example of enrichment in the food world is the addition of spices – salt, pepper, turmeric, saffron, etc. The intent is to complement what you have to achieve a better final result.
5. Validating is the activity that surfaces data quality and consistency issues, or verifies that they have been properly addressed by applied transformations. Validations should be conducted along multiple dimensions. At a minimum, assessing whether the values of an attribute/field adhere to syntactic constraints (e.g., boolean fields encoded as ‘true’/‘false’ as opposed to ‘1’/’0’ or ‘T’/’F’) as well as distributional constraints (e.g., birth dates in a customer database should be fairly uniformly distributed over months of the year). Additional validations might involve cross-attribute/field checks like ensuring all negative bank transactions have the appropriate transaction type (e.g., ‘withdrawal’, ‘bill pay’, or ‘check’). Food evaluations are similarly multi-dimensional – checking things like temperature, taste, appearance and texture.
6. Publishing refers to planning for and delivering the output of your data wrangling efforts for downstream project needs (like loading the data in a particular analysis package) or for future project needs (like documenting and archiving transformation logic). Some data software tools have dramatic performance increases when they encounter data structured in a certain fashion. Efficient data analysts will know this, and will wrangle the data to match its format for the data’s eventual target. Across projects, it often makes sense to replicate a set of data wrangling steps/practices for re-use on other datasets. Experienced data analysts maintain a library (often personal, sometimes shared) of common transformation logic that they can leverage new projects. In food preparation, there are actions that can be taken to speed up cooking times (like de-boning a chicken or pre-soaking ingredients) or to improve the flavor or texture development of a final dish (e.g., peeling the skin of vegetables).
So that is how we at Trifacta think about data wrangling. Two points are worth repeating. First, that you can, and often will, move back and forth among the different activities until you think you’ve nailed your data preparation. Second, modern data wrangling software itself does a lot of the grunt work for you, leveraging sophisticated algorithms and built-in knowledge of downstream constraints to guide users into good wrangling actions.