Since the product’s early days, Trifacta has leveraged Hadoop for large-scale batch processing. Trifacta supports several types of batch workloads. The most prominent is transformation, in which we take the wrangling steps our customers author and execute them on their data. Another such workload is profiling. Automatic profiling of data is one of the most powerful features of Trifacta, and we have heard time and time again about its value to our customers. Profiling gives them a summary view of their data, including value distributions, extreme values, and anomalies in the sample. This post describes how Trifacta has adapted to the challenges of profiling. We wrote about profiling on our blog a year ago, and CSO Joe Hellerstein and Director of Development Adam Silberstein presented at Strata+Hadoop World 2015 to show how it is not as simple a task as it used to be. As data volumes and feature counts grow, profiling can become very expensive if not carefully implemented, sometimes running many times slower than customers’ composed transformation jobs.
In Trifacta’s original architecture, profiling was compiled alongside transformation into a single Pig script and executed as a MapReduce job. While making only one pass over the data is efficient, it means transformation and profiling progress in lockstep, so both results become available only at the end. To get customers their data as fast as possible without losing the insights from automatic profiling, we split profiling into its own job, separate from transformation. Running dual jobs lets us deliver transformation results faster and gives us extra data-processing flexibility; in fact, we need not use the same engine for each job!
This new flexibility triggered our investigation into the most performant options for profiling. Profiling jobs have slightly different constraints than transform jobs. While transform jobs are primarily row-wise operations, profiling jobs look more like OLAP workloads, involving many aggregates across multiple dimensions. This can be very expensive when framed in the context of a MapReduce job. Thankfully, the open source community has developed a number of newer engines specifically targeting these workloads; of these, we chose Spark. It offers high performance through memory residence, which allows multiple passes over the data without additional I/O cost. The new profiler exists as an option right alongside the Pig profiler, which remains available for customers who have not yet adopted Spark.
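To give a feel for the kinds of aggregates involved, here is a minimal sketch in plain Python of per-column profiling: value distributions for categorical columns and extremes for numeric ones. This is an illustration of the general idea only, not Trifacta’s implementation, and all names in it are hypothetical.

```python
from collections import Counter

def profile_columns(rows, numeric_cols, categorical_cols):
    # One pass over the rows, accumulating per-column aggregates:
    # value counts for categorical columns, min/max for numeric ones.
    counts = {c: Counter() for c in categorical_cols}
    extremes = {c: [float("inf"), float("-inf")] for c in numeric_cols}
    for row in rows:
        for c in categorical_cols:
            counts[c][row[c]] += 1
        for c in numeric_cols:
            v = row[c]
            extremes[c][0] = min(extremes[c][0], v)
            extremes[c][1] = max(extremes[c][1], v)
    return {
        "distributions": {c: dict(counts[c]) for c in categorical_cols},
        "extremes": {c: tuple(extremes[c]) for c in numeric_cols},
    }

rows = [
    {"country": "US", "amount": 12.5},
    {"country": "DE", "amount": 99.0},
    {"country": "US", "amount": -3.0},
]
profile = profile_columns(rows, ["amount"], ["country"])
# profile["distributions"]["country"] -> {"US": 2, "DE": 1}
# profile["extremes"]["amount"] -> (-3.0, 99.0)
```

In MapReduce, each such aggregate tends to force extra shuffle or extra passes over the input; a memory-resident engine like Spark can cache the dataset once and compute many aggregates over it without rereading from disk.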
In a side-by-side comparison of our Spark and MapReduce profiling solutions, we saw an average 3X speedup for Spark. Gains were largest for datasets with many columns, and grew linearly with the size of the data. For example, profiling a 10GB dataset with 50 numerical and 50 categorical columns took almost an hour with MapReduce but less than 15 minutes with Spark. When our customers tried the Spark profiler, the improvements were even greater than our internal benchmarks suggested, sometimes up to 10X faster!
With such positive results for Spark, we decided to build a RESTful Spark job server that takes a profiling specification and either stores its results or streams them directly back to the caller, letting us deliver these performance gains straight to our users. Not only do users no longer have to wait for the profiling job to complete before seeing their transformation results, but the profiling results themselves arrive faster as well. Beyond that, our Spark server opens the door to more detailed profiling in the future, with the option of on-demand drill-down by column. Welcome to Spark profiling!
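As a sketch of what a profiling specification for such a job server might look like, here is a hypothetical JSON payload built in Python. The field names and endpoint are illustrative assumptions, not Trifacta’s actual API.

```python
import json

# Hypothetical profiling specification; all field names here are
# assumptions for illustration, not Trifacta's real API.
spec = {
    "dataset": "hdfs:///data/orders.csv",
    "columns": [
        {"name": "amount", "type": "numeric",
         "stats": ["min", "max", "histogram"]},
        {"name": "country", "type": "categorical",
         "stats": ["value_counts"]},
    ],
    # "stream" asks the server to stream results back to the caller
    # rather than storing them.
    "output": {"mode": "stream"},
}

payload = json.dumps(spec)
# The caller would then POST `payload` to the job server, e.g.:
#   POST /profiles  Content-Type: application/json
```

A per-column specification like this is also what makes on-demand drill-down natural: the client can resubmit a spec covering just one column with more detailed stats.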
Sign up for free Trifacta Wrangler to experience the value of automatic profiling for yourself.