Start Free

Speed up your data preparation with Trifacta

Free Sign Up
Summer of SQL

A Q&A Series with Joe Hellerstein

See why SQL is Back
 
All Blog Posts

BigQuery Optimization from Trifacta
Behind The Scenes

July 29, 2021

This is the second blog post in a series on how Trifacta integrates with BigQuery to achieve performance gains with data transformation using SQL queries from Trifacta recipes. In the first part of this series, we discussed how BigQuery Pushdown from Trifacta leverages the scale and efficiency of cloud data warehouses to process data quickly, efficiently, and at scale. In this second part, we take you behind the scenes, open the covers on the technical aspects of BigQuery Pushdown, and dive deep into how you can translate recipes from Trifacta into SQL queries.

A recap of key concepts

Let’s start by reiterating some of the key concepts in Trifacta, the Data Engineering Cloud.

  • Recipe: A natural language representation of the steps to transform data, which are executed by a processing engine.
  • Transform: An internal representation of a recipe. A transform does a single pass on the entire dataset to execute a single operation.
  • Graph: A representation of multiple operations that are connected. The connection determines the order of execution of each operation on the graph.
  • Cross-section: A pair of consecutive steps with a parent-child relationship.

Trifacta-BigQuery Integration

Trifacta’s integration with BigQuery for full SQL pushdown can be listed as a series of steps, outlined below. We will then discuss the details of each of these steps to help you understand how you can achieve efficiencies using Trifacta’s innovation with pushdown optimization.

  1. Recipe representation in a graph: A set of transformation steps (the recipe) will be performed on a given dataset and represented in a graph.
  2. Walk the graph through every level: Specific operations are executed including column pruning, deleting redundant operations, swapping any applicable filters, and restructuring the graph. Additionally, if possible, you will be able to generate a SQL query using Apache Calcite. As you walk the graph, it might be valuable to repeat any relevant steps to avoid forever loops.
  3. Generate profiles and statements: If both the source and the destination are in BigQuery, you can generate Data Definition Language (DDL) statements to create new tables for starring results and generate a profile based on the final result.

Now, let’s peel the layers to understand the technical intricacies of these steps.

Recipe representation in a graph
Trifacta recipes are created with hundreds of available functions, which are internally mapped to a set of transforms.

This can be represented in the form of a graph that shows the order and dependence of operations. Each node represents a single function and can be correlated with the recipe.

Walking the graph

Now that we have a graphical representation of the recipe, let’s walk through each section. This step helps us move forward towards our goal of creating a SQL script from the recipe.

 

select SUM(volume)

from (

  select 

    datetrade, 

    ticker, 

    volume 

  from tbl

  where 

    datetrade is not null

)

group by ticker, datetrade

Let’s start from the source table and walk the graph cross-section-wise. At each cross-section, we apply a transformation step to reduce and simplify the graph. The representation below shows a dotted box that represents a cross-section. We iterate over each cross-section to perform a transformation. Each cross-section presents a parent-child relationship, where the parent is modified depending on the status of the child. This helps make multiple iterations on the same graph and incorporate changes from all children into the parent for a comprehensive SQL script.

Applicable transformations at cross-sections

Column Pruning

We track columns from the parent at each cross-section and track those that are not being used by the child to surface it towards the load step. When we are confident that a particular column is not needed, it is safe to prune that column from the load step and reduce the size of the data that needs to be scanned at each transform.

Transform Absorption / Pushdown

This operation is performed for the cross-section which has a load step as the parent. In this process, we try to absorb the transform to load and generate an equivalent SQL query as in the examples below. The examples show a case for each type of child.

Map pushdown – absorbing a map operation

 

Filter pushdown – absorbing a filter operation

Join pushdown followed by Aggregate pushdown

Filter Swap / Filter Pushdown

We now illustrate an example where a filter operation is capable of significant elimination of rows (~70%), from the previous step. If step 1 spends 5 minutes processing data from the load step and it is seen that 70% of the rows would be filtered out in step 2 and does not depend on step 2, the graph can be rearranged to filter before step 1. This will reduce the time from 5 minutes down to 1.5 minutes. It is recommended to keep filter operations at the top of the recipe to remove a lot of unnecessary data for subsequent operations. This job optimization is called filter pushdown or filter swap.

Supported operations for full execution

Full execution depends on whether we can translate every single transform and function from Trifacta to a SQL equivalent with all the variations that we support around it. Most of the time, there is a direct mapping as in the case of functions such as UPPERCASE, LOWERCASE, etc. In some cases, functions cannot be translated as 1/ it is not possible to express them in BigQuery (eg. WORKDAY_INTL), 2/ SQL may be capable of partially supporting it (eg. timestamp parsing), or 3/ Trifacta’s implementation differs from BigQuery’s (eg. PROPER_CASE because of Unicode support). In such cases, we may be left with unsupported functions and the entire graph may potentially not be condensed to a SQL query. However, we can still absorb a lot of operations and have the job run on Google Cloud Dataflow. This is called Hybrid Execution and the job is optimized to a reduced level of ingestion from BigQuery to the batch-processing pipeline.

Summary

The performance and efficiency gains observed with Pushdown Optimization on Google BigQuery have been encouraging and we have been able to achieve similar efficiencies on other cloud platforms such as Amazon Redshift, Azure Synapse Analytics, and Snowflake. We’re excited to see you achieve the same efficiencies on your datasets with this capability on the cloud data warehouse of your choice.

Stay tuned as we share many more such innovations in our journey towards faster, better data engineering. Want to get behind the wheel and drive towards your destination of usable data with performance gains? Try Dataprep by Trifacta for free today.