As a leader in data engineering, one of the key focus areas for Trifacta is to play a key role in the modern data stack. Cloud data warehouses are moving from traditional ETL to modern ELT architectures and Trifacta plays an important role by enabling the “T” in ELT with advanced data transformation, at the same time empowering the aspects of “E” with extracting and “L” with loading data wherever required.
Faster Data Engineering with Pushdown Optimization
As part of our ongoing innovations, we enabled acceleration with data engineering tasks with “Pushdown Optimization” a capability we introduced earlier this year. This innovative approach harnesses the power of cloud data warehouses using SQL queries for ELT tasks. With Pushdown Optimization, workloads can be pushed down to the source while leveraging the scale and compute of these cloud data warehouses. We released this capability on leading cloud platforms including Google BigQuery, Amazon Redshift, Azure Synapse Analytics, Snowflake, Oracle DB, Microsoft SQL Server, and PostgreSQL.
In this first part of a 2-part series, we’ll discuss Trifacta’s integration with BigQuery to push down data preparation recipes intelligently, resulting in significant performance gains. The second part will dive deep into the technical details of the solution.
With this new capability, the amount of data extracted from different sources (for both columns and rows) is transformed by the power of the Data Engineering Cloud from Trifacta. This results in significant performance gains, especially when combining large volumes of data distributed across systems, transforming them, and pushing them down to the underlying database. The gains could be even more substantial if the data is located in a single cloud data warehouse or a data lake. As an example, we can achieve faster data engineering by up to 15x for data located in BigQuery down to ~20 seconds down from 5 minutes, showing the power of full SQL pushdown.
Dataprep by Trifacta on Google Cloud
Let’s start by introducing a few fundamental concepts when you use Trifacta. Trifacta lets you import your datasets into a flow, define data transformation steps or rules in a recipe that is executed by a processing engine through a job, and publish the results to a destination of your choice. You can learn more about Dataprep by Trifacta here.
Trifacta provides multiple choices for the underlying processing engine for your datasets. These choices enable you to choose the best engine depending on the size of your dataset, the location where the data resides including whether it is centralized or distributed, and your use case for speed and efficiency. By default, Dataprep by Trifacta on Google Cloud leverages Google Cloud Dataflow to process the underlying data. However, if your data is already present in BigQuery, you may want to leverage BigQuery as the processing engine to reap additional performance benefits. An example is shown below summarizing both Dataflow and BigQuery.
Now, let’s dive deeper and look at a detailed example with BigQuery.
Here, both the source and destination are BigQuery tables. The source consists of 3 columns and 2 million rows. If 4 operations are performed in a single row, we’ll end up with 8 million iterations. If we take all the columns under consideration, we’ll have a staggering number of 24 million operations. In such cases, it is more efficient to use BigQuery as the processing engine to effectively use compute resources for faster data processing.
Data processing at scale
Traditionally, large-scale data processing is achieved with the help of batch processing systems such as Apache Spark, Google Cloud Dataflow, Apache Beam, and others. These systems are designed to support large amounts of data spread across distributed systems from a wide range of data sources. Trifacta offers ubiquitous connectivity to your data with Universal Data Connectivity, catering to a wide range of use cases.
If your data is in cloud data warehouses such as Big Query, traditional batch processing systems need to stream data from the warehouse, run map-reduce-filter operations, and finally pipe the output to a destination. This induces delayed processing and potential performance inefficiencies. This can be overcome by performing all computations within BigQuery rather than a separate batch processing system by using SQL queries within the data warehouse. Since the data is local and the format such as columnar storage is suitable for large-scale processing, the data can be transformed to its desired output quickly, efficiently, and at scale. As an example, BigQuery can scan 20 TB of data (~35 billion rows) as fast as 10 seconds!
BigQuery Pushdown from Trifacta
The goal of BigQuery Pushdown is to leverage the local data within a cloud data warehouse and the large-scale processing capabilities of BigQuery. Let’s look at how a recipe (a series of data transformation rules) from Trifacta can be converted into a SQL query, using the same example from the previous section.
A common SQL query for this example would be:
CREATE TABLE PRICES AS ( SELECT UPPER(product), price * volume as total FROM inventory WHERE product <> NULL OR product <> "" )
Here, it is worth noting that many scalar functions can be directly mapped to a SQL function using Trifacta. A few examples are shown below.
Additionally, text processing and data cleaning operations involve regular expressions. Trifacta makes it simple to abstract regular expressions as a single-step built-in function with its simple, yet powerful interface. For example, if you want to split a column “col” three times to create 4 columns, it would be split as shown below.
If one has to accomplish the above thing with Trifacta, we need the recipe step below which is processed by batch-processing engines like Spark and Dataflow where we have the UDF for SPLIT implemented.
SPLIT col with text on separator “_sep_” 3 times
If pushdowns are enabled, Trifacta attempts to convert functions to their SQL equivalent before running the job. If the particular function can be successfully translated to SQL, the execution happens on the SQL database (in this case BigQuery).
The BigQuery compatible SQL query to employ the same logic as SPLIT would be as follows:
SELECT c3, -- if no match, then take entire string, else take till the first match IF (s1 = 0, SAFE.SUBSTRING(c3, 1), SAFE.SUBSTRING(c3, 1, s1 - 1)), -- if no prev match, then null, else take till the next match if possible, else null IF (s2 = 0, SAFE.SUBSTRING(c3, NULLIF(e1, 0)), SAFE.SUBSTRING(c3, e1, s2 - e1)), IF (s3 = 0, SAFE.SUBSTRING(c3, NULLIF(e2, 0)), SAFE.SUBSTRING(c3, e2, s3 - e2)), -- if no prev match, then null, else take till the end of the string SAFE.SUBSTRING(c3, NULLIF(e3, 0)) FROM( SELECT c3, REGEXP_INSTR(c3, r"_sep_", 1, 1, 0) as s1, REGEXP_INSTR(c3, r"_sep_", 1, 1, 1) as e1, REGEXP_INSTR(c3, r"_sep_", 1, 2, 0) as s2, REGEXP_INSTR(c3, r"_sep_", 1, 2, 1) as e2, REGEXP_INSTR(c3, r"_sep_", 1, 3, 0) as s3, REGEXP_INSTR(c3, r"_sep_", 1, 3, 1) as e3 FROM (..., c3, ...) )
As you can see from these examples, Trifacta with BigQuery Pushdown Optimization empowers the ELT architecture and helps you wrangle your data easily, quickly, and efficiently. This innovative capability enables intelligent execution of your data when both the source and the destination are the cloud data warehouse and leverages the scale and processing power of cloud data warehouses such as BigQuery. In the next blog post of this series, we will take you behind the scenes, lift the covers, and reveal the technical details of this innovation. In the meantime, start today with our free trial.