This article was originally published on SD Times.
Today we work with data that has grown in diversity, scale and complexity. This applies not only to data scientists and academic researchers, but also to the rest of us. Business analysts across a spectrum of industries are asked to include larger volumes of data in their work, data that is now pervasive thanks to the falling costs of collection and storage. Answering real analytic questions that drive business value means adapting methodologies to the reality of the data at hand. To that end, new data preparation tools are gaining adoption, helping business users bring their domain expertise to bear on bigger, thornier data challenges. Based on our experiences navigating these transitions, we’ll share some best practices for evolving data workflows to handle increasing data volumes.
1.) Recognize the limitations inherent to working with full datasets
When presented with a spreadsheet, a business analyst might visually scan all the columns and rows, filter out known, irrelevant values, and run computations to quality check the data before loading it into a BI tool for visualization. But how might you deal with a dataset that’s totally new to you? What if, instead of scrolling through one thousand rows, you were confronted with the daunting prospect of inspecting one million rows?
Unlike in previous data projects, it is impractical to work with all the data at once in this scenario. Visual ballparking is no longer sufficient for assessing data quality. When data volumes exceed desktop or system hardware capabilities, each attempt by the user to edit the data can slow a data tool to a crawl. Instead, structuring, cleaning, and aggregating bigger datasets means starting with a smaller, more manageable subset of the data, which enables fast exploration, iteration, and refinement. From there, an analyst’s deep familiarity with the business questions at hand can accelerate both the understanding of the dataset and its progressive evolution toward the desired end goal.
2.) Create generalized rules to transform your data
At bigger data volumes, modifying data values one by one is prohibitively time-consuming. Instead, it’s helpful to move up a level of abstraction and design data transformation rules that can be systematically applied to groups of columns or rows. For example, rather than changing the value of cell C25, you might define a generalized condition identifying values in column C that contain non-alphabet symbols, and then apply an edit to remove all symbol characters in the column. Once the transformations are designed, this approach takes advantage of the ability of ever more powerful compute systems to process large amounts of data.
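As a minimal sketch of the column C example, the rule below is written once and applied to every row rather than to a single cell. The sample rows, the function name strip_symbols, and the choice to keep letters and spaces only are illustrative assumptions, not a prescribed implementation.

```python
import re

# Assumed rule: keep letters and spaces, drop every other character.
NON_ALPHA = re.compile(r"[^A-Za-z ]")

def strip_symbols(rows, column):
    """Return new rows with non-alphabetic symbols removed from one column."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[column] = NON_ALPHA.sub("", row[column]).strip()
        cleaned.append(new_row)
    return cleaned

# Made-up sample rows; "C" stands in for column C in the example above.
rows = [
    {"A": "1", "C": "Acme, Inc."},
    {"A": "2", "C": "#Globex*"},
    {"A": "3", "C": "Initech"},
]

cleaned = strip_symbols(rows, "C")
for row in cleaned:
    print(row["C"])
```

Because the rule is a function of the column rather than of a cell, the same call works unchanged whether the input holds three rows or three million.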
3.) Use relevant subsets of data during the design process
Transformation rules are ideally designed on a relatively small subset of the data, which lets you explore the data with lightning responsiveness and get to high-level conclusions faster. Of course, at each stage of the preparation process, it’s important to pick the right subset of the data.
When starting out, the first task is to understand a new dataset from a bird’s-eye view. Some typical questions include: Do I have the right columns? Do the columns contain the right type of data? Do the range and distribution of values match my expectations? A random sample drawn from across the big dataset lets you resolve these high-level questions with enough confidence to decide how to proceed with messy data and make it suitable for downstream consumption in analytics. Early on, you are not targeting analytic precision or exact results. Rather, the goal is to develop an overall picture of the data, and to identify next steps.
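One way to draw such a random sample without loading the whole dataset is reservoir sampling, sketched below together with a rough column profile. This is an illustration under assumptions: the generator big_dataset is a synthetic stand-in for a real source, and the helper names and the three-way type guess are hypothetical simplifications.

```python
import random
from collections import Counter

def reservoir_sample(rows, k, seed=0):
    """Draw k rows uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

def guess_type(value):
    """Classify a raw string as 'empty', 'number', or 'text'."""
    if value.strip() == "":
        return "empty"
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"

def profile(sample):
    """Summarize each column: type counts plus min and max of numeric values."""
    report = {}
    for column in sample[0]:
        values = [row[column] for row in sample]
        numbers = [float(v) for v in values if guess_type(v) == "number"]
        report[column] = {
            "types": Counter(guess_type(v) for v in values),
            "min": min(numbers) if numbers else None,
            "max": max(numbers) if numbers else None,
        }
    return report

def big_dataset(n=100_000):
    """Synthetic stand-in for a large source; every 50th amount is malformed."""
    for i in range(n):
        amount = "n/a" if i % 50 == 0 else str(i % 500)
        yield {"order_id": str(i), "amount": amount}

sample = reservoir_sample(big_dataset(), k=1_000)
report = profile(sample)
print(report["amount"]["types"])
```

The type counts answer "do the columns contain the right type of data?", and the minimum and maximum give a first read on the range. The numbers are estimates from the sample, which is all this stage calls for.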
Assuming some part of the data needs to be fixed up, you would naturally want to zoom in on a particular subset that is of interest. Maybe you’re targeting a certain set of rows for data quality issues, or maybe your goal is to roll the data up to an aggregated results table. This is where deep familiarity with the data goes hand in hand with knowledge of the business questions at stake. Connecting the dots between the current state of the data and the desired end goal, you can be specific about the data that is relevant for your task. Narrowing down to the data of interest allows for the speed and interactivity required by exploration-driven data preparation, while ensuring high confidence in the final results.
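Narrowing to the rows of interest and rolling them up might look like the sketch below. The sales rows, the column names, and the rollup helper are made up for illustration; a real task would substitute its own filter and aggregate.

```python
from collections import defaultdict

def rollup(rows, key, value, keep=lambda row: True):
    """Sum `value` per `key` over the rows that pass the `keep` filter."""
    totals = defaultdict(float)
    for row in rows:
        if keep(row):
            totals[row[key]] += float(row[value])
    return dict(totals)

# Made-up sales rows.
sales = [
    {"region": "West", "year": "2015", "revenue": "120.0"},
    {"region": "West", "year": "2016", "revenue": "80.0"},
    {"region": "East", "year": "2016", "revenue": "200.0"},
    {"region": "East", "year": "2016", "revenue": "50.0"},
]

# Business question (assumed): what was 2016 revenue per region?
totals_2016 = rollup(
    sales,
    key="region",
    value="revenue",
    keep=lambda row: row["year"] == "2016",
)
print(totals_2016)  # {'West': 80.0, 'East': 250.0}
```

The filter encodes the business knowledge (which rows matter), while the roll-up encodes the desired end state of the data.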
4.) Plan to validate and iterate
The final step is to verify the results, with additional refinement and iteration if necessary. Assuming that you identified certain subsets of data as prime examples for understanding and crafting the preparation steps, you may want to apply your changes to a different sample, perhaps searching for a common, well-known error in the data to see whether you managed to correct it. Then, once you are relatively confident in the results, take a deep breath: it’s time to apply the changes to the full dataset.
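A hedged sketch of that check: draw a fresh sample, measure how often a known error appears, apply the rule, and measure again. The synthetic data, the seed, and the choice of "stray symbol in a name" as the well-known error are assumptions made for illustration.

```python
import random
import re

NON_ALPHA = re.compile(r"[^A-Za-z ]")

def clean_name(value):
    """The rule under test: drop every character that is not a letter or space."""
    return NON_ALPHA.sub("", value).strip()

def error_rate(values):
    """Share of values that still contain a non-alphabetic symbol."""
    bad = sum(1 for v in values if NON_ALPHA.search(v))
    return bad / len(values)

# Synthetic full dataset: every 10th name carries a stray '#'.
full = ["Customer#" if i % 10 == 0 else "Customer" for i in range(10_000)]

# A fresh draw, separate from whatever sample the rule was designed on.
check_sample = random.Random(2).sample(full, 200)

before = error_rate(check_sample)
after = error_rate([clean_name(v) for v in check_sample])
print(f"before: {before:.1%}, after: {after:.1%}")
```

If the error rate after cleaning is not zero on the fresh sample, the rule missed a variant of the error and needs another iteration before the full run.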
Unlike work on smaller samples, which gives instant feedback, a run over the full dataset may take long enough that you grab a coffee and come back for the output. Once the result has been generated, reviewing the summary statistics on the output can be helpful as a final validation check. If this review surfaces previously undiscovered issues, you would pull up those specific errors and take steps to fix them, further refining your work. These iterations are unavoidable in messy Big Data work. The key is to cycle through them systematically and efficiently to arrive at the final results.
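The final review can be as simple as the sketch below: compute summary statistics on the output, then pull up the rows behind any surprising figure. The output rows and the business rule that amounts are never negative are hypothetical.

```python
import statistics

def summarize(values):
    """Basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

# Made-up output of a full run; one amount is negative, which the
# assumed business rule ("amounts are never negative") forbids.
output = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.0},
    {"order_id": 3, "amount": -5.0},
    {"order_id": 4, "amount": 50.0},
]

stats = summarize([row["amount"] for row in output])
print(stats)

# The negative minimum flags a problem, so pull up the offending rows.
suspects = [row for row in output if row["amount"] < 0]
print(suspects)
```

Here the minimum of -5.0 is the signal, and the suspects list is the starting point for the next round of refinement.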
Here is Big Data demystified: scaling up data volumes need not be overwhelming, but it does call for a different approach. Cleaning and preparing larger datasets is impossible with line-by-line value checks. Instead, it involves crafting systematic rules refined over multiple iterations. Using the right slice of data to answer the right questions at each stage lets business users cycle through these iterations quickly, by themselves, and in time to meet the analytic needs of today’s business decision-makers.
Radiant Advisors, an independent research and advisory firm, outlined the importance of data sampling in their insight paper, which explains why robust data sampling is critical to any data wrangling solution.