Data Wrangling at the Speed of Photon

May 16, 2017

In collaboration with ESG Research, we’re excited to share the latest performance results for Photon, our purpose-built, in-memory data wrangling engine. We’ve been developing Photon at Trifacta over the past three years to realize a key part of the architectural vision of our founders, Joe Hellerstein, Sean Kandel and Jeffrey Heer. We released the first version of Photon over a year ago and have continued to enhance its functionality, improve its performance and integrate it more tightly into our unrivaled interactive data wrangling solution.

Why Photon?

We’re often asked why we chose to build our own engine instead of leveraging one of the many excellent alternatives: MapReduce, Impala, Spark, Google Dataflow and Flink. To provide users with an immersive experience with immediate feedback, we need to minimize the time spent between crafting a transformation and seeing a result. This requires that the engine run “client-side”, in the browser, to eliminate the latency incurred by moving data to and from a server.

Building our own engine also allowed us to optimize for the most prevalent challenges our users face when wrangling their dirty data: processing raw strings, handling noisy data with ambiguous types and creating many complex transformations.

Lastly, not all data is “big data”, and many use cases require only a single computational node. Rather than taking a one-size-fits-all approach, our Intelligent Execution architecture chooses the best execution environment for the given task. This could mean executing Photon jobs on a single node or Spark jobs in parallel over a large cluster. Photon gains an edge over general-purpose engines by avoiding some of their additional, sometimes unpredictable, latencies: garbage collection, just-in-time compilation, and ill-suited internal data storage formats.
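
As a rough illustration only, the idea of routing each job to the engine that suits it can be sketched in a few lines of Python. Everything here is an assumption made for the example: the size-based rule, the threshold value, and the names `Job`, `choose_engine` and `SINGLE_NODE_LIMIT_BYTES` are hypothetical and do not describe how Intelligent Execution actually decides.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the post does not say how the choice is made,
# so we simply assume that about 1 GiB of input fits on a single node.
SINGLE_NODE_LIMIT_BYTES = 1 * 1024**3


@dataclass
class Job:
    """A wrangling job described only by its name and input size (assumed fields)."""
    name: str
    input_bytes: int


def choose_engine(job: Job, limit: int = SINGLE_NODE_LIMIT_BYTES) -> str:
    """Return "photon" for jobs at or under the limit, "spark" otherwise."""
    return "photon" if job.input_bytes <= limit else "spark"


if __name__ == "__main__":
    for job in (Job("sample", 10 * 1024**2), Job("full_history", 50 * 1024**3)):
        print(job.name, "->", choose_engine(job))
```

A real scheduler would likely weigh more than input size, such as the kind of transformations in the job and the resources available at the time.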

Value to Trifacta Users

As you’ll see in the performance results, Photon executes extremely fast in comparison to Spark for wrangling jobs that fit on a single node. This should come as no surprise given that Photon was built for exactly this workload. It’s important to note that Spark plays a key role in Trifacta’s architecture and is leveraged to run large-scale distributed workloads.

While we couldn’t compare in-browser execution with any other engine, this demonstration of Photon’s execution speed on a single node translates readily to the web user interface. There, it improves productivity by giving users fast feedback and guiding them visually as they build their transformation pipelines.

What’s Next for Photon?

While we’re happy with Photon’s initial performance numbers, there remains room to improve. We’ll continue to work on its algorithms and data structures to speed up execution, and to identify and remove bottlenecks that may diminish its multi-threaded efficiency.

Thus far, we’ve leveraged Google Chrome’s PNaCl component to run Photon on the client machine. However, a new standard being developed through the W3C, WebAssembly, is designed to run compiled code at near-native speed in web browsers. The effort has wide industry support, including Google, Mozilla and Microsoft, and we’ll be using the new technology to deliver even faster integration between Photon and our interactive web user interface.