
Pushing the Boundaries of Performance:
On-the-Fly Data Wrangling at Scale in v4

September 27, 2016

At Trifacta, we’ve always strived to deliver the most immersive, interactive data wrangling experience. With an unmatched visual interface and real-time feedback, we encourage users to explore and discover new insights by leveraging the full context of their data. Now, we’re excited to announce the general availability of Photon, Trifacta’s high-performance data wrangling engine, which pushes that bar even higher. Photon enables faster feedback on greater volumes of data, which leads to huge productivity gains for all of our users.

Forging Our Own Path: Building a Unique Compute Framework

Before building Photon, we asked ourselves whether one of the existing modern data engines could meet our requirements. Without doubt, MapReduce, Spark, Flink, and Google Dataflow are all innovative, high-quality frameworks for transforming data. However, none of these tools was built explicitly for our purpose: dealing with dirty data. The messy datasets our customers wrangle pose some unique challenges, including 1) heavy raw string processing, 2) noisy data that leads to ambiguous types and schemas, and 3) many complex transformations.
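To make the second challenge concrete, consider what "ambiguous types" means in practice: a column of noisy values may be mostly numeric with a few stray strings, and the engine must still settle on a usable type. The sketch below is a hypothetical illustration of one simple approach (not Trifacta's actual code): each cell votes on what it parses as, and the dominant parse wins rather than letting a single bad value force the whole column to text.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical sketch: infer a noisy column's type by majority vote.
enum class ColType { Integer, Decimal, Text };

static bool parses_as_long(const std::string& s) {
    if (s.empty()) return false;
    char* end = nullptr;
    std::strtol(s.c_str(), &end, 10);
    return end == s.c_str() + s.size();  // whole string consumed -> integer
}

static bool parses_as_double(const std::string& s) {
    if (s.empty()) return false;
    char* end = nullptr;
    std::strtod(s.c_str(), &end);
    return end == s.c_str() + s.size();  // whole string consumed -> decimal
}

ColType infer_type(const std::vector<std::string>& column) {
    std::size_t ints = 0, decimals = 0, texts = 0;
    for (const auto& cell : column) {
        if (parses_as_long(cell)) ++ints;
        else if (parses_as_double(cell)) ++decimals;
        else ++texts;
    }
    // Majority vote: mismatching cells can be flagged for cleanup later,
    // instead of downgrading the entire column to Text.
    if (ints >= decimals && ints >= texts) return ColType::Integer;
    if (decimals >= texts) return ColType::Decimal;
    return ColType::Text;
}
```

For example, a column like {"1", "2", "oops", "4"} would be inferred as Integer, with "oops" surfaced as a mismatch to fix, rather than silently turning the column into strings.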

In addition, we require sub-second interactivity, yet these engines can introduce additional, unpredictable latencies due to garbage collection, just-in-time compilation, and data storage formats ill-suited to unstructured data. Finally, to eliminate all sources of delay, we require that some computation be done on the user’s own machine, “client-side”. This imposes an additional constraint on memory usage that these big data, distributed engines need not strictly adhere to.

In summary, while these tools thrive in their own domains, we needed to build the right tool for our specific job, in addition to leveraging those tools where appropriate, such as Spark for our distributed batch execution framework.

Photon: Built from the Ground Up to Tackle Dirty Data

So what’s Photon? It’s our state-of-the-art, low-latency (low turn-around time), low-memory engine built from the ground up to deal with dirty data. Our engineering team studied the best that academia and industry have to offer, including Apache Impala, HyPer (now at Tableau), Tupleware, Apache Spark, and Apache Arrow, and mixed those ideas with our own to build the fastest data wrangling engine commercially available.

Photon in Application

Photon maps the domain-specific language (DSL) used in Trifacta’s innovative interface to low-level machine code using C++ and LLVM compilation. It takes advantage of modern computer architectures by using data locality, multi-threading, single-instruction multiple data (SIMD) instructions and thread locality to execute at “light speed”. Its explicit memory and thread management ensures the fast, predictable execution required for our unique wrangling experience. Photon’s architecture allows it to execute on a single server for large data, in-memory wrangling and on the client, in-browser using Portable Native Client (PNaCl).
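One idea behind that execution model is worth spelling out: keeping each column in a single contiguous buffer gives the tight transform loops sequential, cache-friendly access, which is exactly the shape compilers can auto-vectorize into SIMD instructions. The snippet below is only an illustration of that columnar-loop principle, not Photon's generated code (Photon compiles its DSL to such loops via LLVM):

```cpp
#include <cstdint>
#include <vector>

// Illustrative columnar transform: one contiguous pass over the column.
// Sequential access keeps the data cache-local, and the simple loop body
// lets the compiler emit SIMD instructions, with no per-row object overhead.
std::vector<int64_t> add_scalar(const std::vector<int64_t>& column,
                                int64_t delta) {
    std::vector<int64_t> out(column.size());
    for (std::size_t i = 0; i < column.size(); ++i) {
        out[i] = column[i] + delta;
    }
    return out;
}
```

Contrast this with a row-oriented design, where each value sits inside a heterogeneous record: the same transform would chase pointers and stride across memory, defeating both the cache and the vectorizer.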

All of this comes together to let users wrangle orders of magnitude more data with an even more responsive, interactive experience, which we’re excited to deliver with Trifacta v4.


While we’ve come a long way to deliver this first version of Photon, an engine can always perform better. We’ll continue to optimize Photon to provide improved interactivity and scale with ever more data. Also, with this new tool in Trifacta’s toolbox, we’ll surely find new and innovative ways to present data visually and interactively, to continue to keep our users wrangling productively.

For more information on Transform Builder, as well as the rest of the features included in v4, read our official press release and blog post.