Agile data transformation requires fast iteration: assess the current state of your data, transform it to move closer to your desired end state, repeat. Speeding up that interactive loop dramatically reduces the overall time it takes to build data pipelines.
Executing transformations is a major factor in the latency of this loop. Transforming data requires computation, and with the shift from ETL to ELT, increasingly this computation takes place in the cloud platform, leveraging the speed and flexibility of cloud data warehouses and lakes.
To make data transformation as fast as possible, we have introduced our Photon engine on all editions of Trifacta. Previously, Photon was only available in our Trifacta Enterprise edition, but now it is available in every edition, including our 30-day free trial. Photon is an in-memory engine for running jobs. Embedded in Trifacta SaaS, the Photon execution engine is fast and best suited for small to medium-sized jobs.
When you run a job, you can now choose to execute it on Trifacta Photon. By default, Trifacta SaaS selects the most appropriate running environment based on the size of your job.
How does it work?
At Trifacta, we take advantage of cloud computation by translating the transformations customers build into code that can run in a variety of environments. For example, we generate code that runs on Spark or Dataflow, and we increasingly push transformations down into Cloud Data Warehouses (CDWs) like Snowflake and BigQuery when the data already lives in the warehouse.
In the past we’ve also written about the advantages of performing computation directly in the browser. After all, much of the iteration happens in the user interface: instead of fetching data from a server after each change, you can explore your data (or samples of larger data sets) in real time. To support fast transformation in the browser, we implemented Photon, an in-memory engine. Photon transformations in the browser run quickly but are limited to data volumes that can reasonably be loaded into the browser.
Today we are excited to announce that the same technology backing Photon is now available on the server! You can now leverage the speed of Photon on much larger data sets. Additionally, you can run Photon jobs through our built-in scheduler or via API if you use other systems, such as Airflow, to orchestrate your workflows.
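As a rough illustration of what triggering a Photon job via API could look like, here is a minimal Python sketch. The deployment URL, endpoint path, payload fields, and the `"photon"` execution flag are all assumptions made for illustration, not the documented contract; consult the Trifacta API reference for the real request shape.

```python
# Hedged sketch: endpoint, payload fields, and header names are assumptions
# for illustration only, not Trifacta's documented API contract.
import json

API_BASE = "https://example.trifacta.net"  # hypothetical deployment URL


def build_job_request(recipe_id: int, token: str) -> dict:
    """Assemble the pieces of an HTTP request that would launch a job
    on the Photon running environment (field names are assumed)."""
    return {
        "url": f"{API_BASE}/v4/jobGroups",  # assumed endpoint path
        "headers": {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "wrangledDataset": {"id": recipe_id},          # assumed field
            "runParameters": {"execution": "photon"},      # assumed field
        }),
    }


# To actually submit the job, pass the pieces to an HTTP client, e.g.:
# requests.post(req["url"], headers=req["headers"], data=req["body"])
```

Because the request is plain JSON over HTTP, the same call works equally well from a scheduler task in an orchestrator like Airflow.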
Photon is optimized for low-latency transformations over both structured and unstructured data in small and medium-sized datasets (on the order of gigabytes). For large data files we still recommend compiling to Spark or Dataflow jobs; for data already in the CDW we increasingly recommend leveraging our optimizer to push workloads into the warehouse. But if you want fast iteration on a large spreadsheet or log file, give Photon a spin!
Want to see Photon in action? Start your free trial of Trifacta.