At Trifacta we’re trying to help our customers wrangle data as quickly and easily as possible, from messy point A to clean point B where data is structured, normalized, and ready for analysis. The only way our users can efficiently get their data into shape is if we can provide feedback along the way that shows them the effects of each transformation step. As a simple example, consider a user wrangling car data and generating a column of prices. Without prompting, Trifacta provides a histogram over these values to show that most cars cost between $10K and $50K; if the histogram also shows a spike of $1,000K+ cars, a user might want to investigate their script or dataset some more—perhaps the decimal point shifted in some entries. The histogram gives the user a concise and intuitive summary of car prices that avoids the arduous and mistake-prone task of eyeballing perhaps millions or more output rows. We call histograms and similar visual or statistical summaries “profiles”.
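The kind of histogram profile described above can be sketched in a few lines. This is a hypothetical illustration, not Trifacta's implementation: it buckets a column of prices into fixed-width bins so that an outlier spike (e.g. decimal-shifted entries) stands out immediately.

```python
from collections import Counter

def price_histogram(prices, bucket_size=10_000):
    """Bucket prices into fixed-width bins -- a simple column profile."""
    hist = Counter((p // bucket_size) * bucket_size for p in prices)
    return dict(sorted(hist.items()))

# Hypothetical car prices: most fall between $10K and $50K, plus two
# entries whose decimal point shifted (e.g. $25,000 -> $2,500,000).
prices = [12_000, 18_500, 23_000, 31_000, 47_500, 2_500_000, 1_800_000]
hist = price_histogram(prices)
# The lonely buckets far above $50K flag the suspect rows for review.
```

Scanning the bucket keys is enough to spot the anomaly; no one has to eyeball the raw rows.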
As a design pattern, it is handy to think of wrangling as a linear progression of discovering, assessing, shaping, enriching, and distilling data prior to analysis. Reality is not quite so linear, and the wrangling timeline contains loops: every time the analyst finds anomalies or produces unexpected data, they must step back to understand and reconcile the problem. Data profiling is the process that clues analysts into these issues and, after revision, eventually convinces them that the transformed data is ready for use. Each stage of wrangling necessitates different types of profiling. At the beginning of discovery, an analyst initially just cares about extracting the collection of relevant data objects in a file; this data structuring task can be done cheaply with a sample of bytes taken from the head of the file. Near the end, though, before signing off on their output, the analyst might want detailed profiles over every aspect of the entire dataset, a potentially heavyweight operation.
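As one illustration of the cheap head-of-file approach, here is a small sketch (again hypothetical, not Trifacta's code) that reads only the first chunk of a file and uses the standard library's `csv.Sniffer` to guess the delimiter and column count:

```python
import csv
import io

def sniff_structure(path, sample_bytes=64 * 1024):
    """Guess delimiter and column count from a cheap sample at the head of a file."""
    with open(path, "rb") as f:
        head = f.read(sample_bytes).decode("utf-8", errors="replace")
    # Drop the last, possibly truncated, line so it can't skew the guess.
    head = head[: head.rfind("\n")]
    dialect = csv.Sniffer().sniff(head)
    ncols = len(next(csv.reader(io.StringIO(head), dialect)))
    return dialect.delimiter, ncols
```

Because only `sample_bytes` of the file are ever read, this runs in constant time no matter how large the dataset is, which is exactly the tradeoff early-stage profiling wants.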
At Trifacta, we focus on three critical factors and how to trade them off at different points in the wrangling timeline:
- Time taken to produce profiling results
- Robustness, or accuracy, of the profiling result
- The volume of data being profiled
Our philosophy is to optimize for profiling speed early on in the wrangling timeline so the user moves quickly, but as wrangling progresses we pay the compute cost—often in the background—to deliver richer and more robust profiles. For small data volumes, our platform doesn’t need to choose between speed and robustness. For truly big data, our Hadoop support enables detailed profiling at scale. Trading off these three critical factors has been a key aspect of Trifacta’s design throughout the user experience, so that users get good profile information when they need it.
For data profiling, it’s often acceptable to use approximation techniques to trade off time for accuracy. Two powerful and broadly applicable estimation techniques that we use for profiling are sampling and sketching. Both bring us a lot of bang for the buck, with sampling letting us avoid full scans over data and sketching allowing us to summarize data with a small memory footprint.
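To make the two techniques concrete, here is a minimal sketch of each (illustrative only; Trifacta's internals may differ): reservoir sampling draws a uniform sample in a single pass without knowing the stream length, and a count-min sketch approximates per-item frequencies in a fixed memory footprint.

```python
import hashlib
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform k-item sample in one pass, without knowing the stream length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

class CountMinSketch:
    """Approximate per-item counts in O(width * depth) memory.

    Estimates only ever overcount (hash collisions add, never subtract),
    so the reported count is an upper bound on the true count.
    """

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row in range(self.depth):
            h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))
```

The sample avoids a full scan for estimates like quantiles, while the sketch summarizes frequencies in a few kilobytes regardless of how many rows stream through—the two kinds of savings the paragraph above describes.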
On Feb 19th at Strata San Jose, Joe Hellerstein and I will be giving a talk on Agile Data Profiling in the Big Data Era, covering profiling not only in the context of wrangling, but more generally across the analytic lifecycle. We’ll teach you how to bring these techniques to bear on your own projects and provide more detail on how we power Trifacta’s various profiling features.