
Making the Most of Your BigQuery Investments for Scalable Data Engineering Pipelines

October 11, 2021

When we released BigQuery Pushdown for Dataprep on Google Cloud back in April, we knew it was a highly anticipated ELT (Extract, Load & Transform) feature that would improve both design time and processing time. However, we did not expect it to be adopted so quickly. Our internal benchmark of 20x job acceleration was crushed by customers reporting jobs running over 500 times faster than on the legacy version of Dataprep (pre-Oct. 2020). This data pipeline execution gain is one of the primary reasons why organizations are moving to the Dataprep Professional edition.

With the latest release of Dataprep, we are adding a host of new BigQuery capabilities on top of BigQuery Pushdown to improve your experience through better data engineering design and faster data pipeline execution.

Dataprep BigQuery Pushdown 101

When data are already located in BigQuery, Dataprep turns the data-wrangling recipe into SQL statements at job runtime and executes them directly inside BigQuery, effectively avoiding any data movement in or out of BigQuery.

Most of the data preparation steps authored in Dataprep, such as joins, unions, filters, aggregations, windowing, pivoting, binning, date or string manipulations, and other forms of calculation, are turned into SQL to transform and clean the data. When SQL doesn't support a particular Dataprep transformation function (yes, Dataprep actually supports transformations beyond what SQL offers, including comprehensive array manipulation, advanced string transformation, unpivoting, JSON structuring, and fuzzy matching), the transformation is realized through JavaScript user-defined functions (UDFs) executed inside BigQuery. Therefore, in all cases, the data never leave BigQuery.
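As a sketch of what such a fallback can look like, here is a hand-written BigQuery JavaScript UDF; the function name, logic, and table are illustrative, not the code Dataprep actually generates:

```sql
-- Illustrative only: a JavaScript UDF running inside BigQuery, the same
-- mechanism Dataprep relies on when a recipe step has no SQL equivalent.
CREATE TEMP FUNCTION normalize_name(input STRING)
RETURNS STRING
LANGUAGE js AS r"""
  // Clean-up logic that is awkward to express in plain SQL
  return input ? input.trim().toLowerCase().replace(/\s+/g, ' ') : null;
""";

SELECT normalize_name(raw_name) AS clean_name
FROM `my_project.my_dataset.customers`;  -- hypothetical table
```

Because the UDF executes inside BigQuery's engine, the rows it touches never leave the warehouse.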

Thanks to this approach, you can trust Dataprep to leverage BigQuery to its best and boost the execution of your data pipeline. This is also a fantastic productivity gain in terms of generating and maintaining highly complex SQL queries with the low-code/no-code Dataprep visual interface. Data engineers and data analysts leveraging Dataprep speak the same language when it comes to cleaning and preparing data for analytics.


What Is Coming Up for Google Next ’21?

Ingesting Files into BigQuery with Pushdown

A primary requirement for anyone using Google Cloud BigQuery for analytics is loading and transforming data contained in files into BigQuery. Well, BigQuery offers an interesting capability with its external tables, which allow you to reference a data structure outside of BigQuery and make it act like a BigQuery table. This is however easier said than done. You need to write an extensive SQL statement to declare the data structure and then create and load the table, using a SELECT statement to apply the transformations.

1.) Create an external table from the GCS file

CREATE EXTERNAL TABLE <ext_table>
OPTIONS
(
        format = 'csv',
        field_delimiter = ',',
        …,
        uris = ['<path_to_gcs_files>']
);

2.) Transform the data contained in the external table with a SELECT statement

CREATE TABLE <dst_table> AS
(
        SELECT …
        FROM <ext_table>
);

The beauty of Dataprep is that it generates these complex SQL statements for you with a no-code approach, radically simplifying your data engineering work. This release supports a wide variety of formats, including CSV, TSV, JSON, plain text, logs, and even compressed files. It also permits parameterized jobs (when the file's name is only known at runtime). If you need to combine (join, union, lookup) BigQuery tables and files in the same data pipeline, Dataprep will do this in a few clicks for maximum productivity gains.
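Under the hood, once the files are exposed as an external table, combining them with a native table boils down to an ordinary join. A hand-written equivalent might look like this (table and column names are hypothetical):

```sql
SELECT o.order_id, o.amount, c.segment
FROM `my_project.my_dataset.orders` AS o         -- native BigQuery table
JOIN `my_project.my_dataset.ext_customers` AS c  -- external table over GCS files
  ON o.customer_id = c.customer_id;
```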

BigQuery Pushdown for Sampling

Sampling is a necessity for self-service data preparation in the case of large data sets. When visually exploring and cleaning data within a browser, there's no way the browser can host millions of rows in its memory. For this reason, Dataprep utilizes sampling techniques to present only a significant subset of the data so that the user can author and validate data transformation recipes. Usually, one will slice and dice the data, re-sampling them as necessary to refine and verify the transformation steps. If your data already reside in BigQuery, Dataprep can now sample them directly within BigQuery, returning results in seconds instead of minutes for extra productivity gains in your wrangling.
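In plain SQL terms, pushing sampling down means letting BigQuery scan only a slice of the table rather than pulling full rows out to the client, along the lines of the following (hypothetical table; Dataprep's actual sampling strategies are more varied than this):

```sql
-- Scan roughly 1% of the table's data blocks to build a working sample
SELECT *
FROM `my_project.my_dataset.events`
TABLESAMPLE SYSTEM (1 PERCENT);
```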

Pre- & Post-SQL Data Management

If your data lake, data warehouse, or data mart resides in BigQuery, there is a good chance that before or after each refresh of your data repository, you may want to run specific customized tasks against BigQuery. These include, but are not limited to, enabling and disabling indexes, validating queries, managing partitions, checking results, refreshing variables, and updating log entries in a database of your choice. In the spirit of data pipeline simplification and automation, you can run SQL statements (against BigQuery, or via JDBC against any other database) from the Dataprep user interface before data ingestion and after the data's publication as a table. You can add multiple SQL scripts to an output object, add parameters within the statements in each script, and dynamically pass different values for each run.
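As an illustration, a post-publication script with a parameter could be as simple as the following sketch (the audit table and variable name are made up for the example; in a real setup the value would be passed in as a script parameter rather than declared inline):

```sql
-- Hypothetical post-publication step: record the refresh in an audit table
DECLARE batch_id STRING DEFAULT 'run_2021_10_11';

INSERT INTO `my_project.ops.load_audit` (batch_id, loaded_at)
VALUES (batch_id, CURRENT_TIMESTAMP());
```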

View Outputs in BigQuery

When your data engineering pipeline ends up loading a BigQuery table, it is quite usual to want to run a few SELECT statements in the database's SQL editor to check the data. In Dataprep, as soon as the data are loaded into BigQuery, you can open the BigQuery editor directly in the Google Cloud Console. This is another nice feature that simplifies data engineers' and data analysts' lives and project delivery thanks to fewer clicks.
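A typical sanity check once the table lands might look like this (hypothetical table and columns):

```sql
SELECT
  COUNT(*) AS row_count,                           -- did the expected volume arrive?
  COUNTIF(customer_id IS NULL) AS null_ids,        -- any broken keys?
  MAX(loaded_at) AS last_load                      -- is the load fresh?
FROM `my_project.my_dataset.dst_table`;
```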


Refreshing the Data Warehouse with Upsert

A leading principle of analytics, whether in a data lake, data warehouse, or data mart, is to keep historical data and avoid altering the raw data so that an analysis can be performed at any point in time. However, keeping all that data can generate some noise, so data are often versioned or tagged as part of a particular batch that is easier to identify within a particular time window. There is also the technique known as a slowly changing dimension. A very convenient way to implement such behavior is to insert a new row if the data do not exist in the table and otherwise update the existing row. This is also known as upsert, and it is managed in Dataprep in the BigQuery output configuration screen. You define the columns used to match preexisting data in the table and then the columns that need to be updated.
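What the configuration screen spares you from writing is, in effect, a BigQuery MERGE statement along these lines (tables and columns are hypothetical):

```sql
MERGE `my_project.my_dataset.dim_customer` AS t
USING `my_project.staging.customer_updates` AS s
ON t.customer_id = s.customer_id          -- the columns used to match existing rows
WHEN MATCHED THEN
  UPDATE SET t.email = s.email, t.updated_at = s.updated_at  -- columns to refresh
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (s.customer_id, s.email, s.updated_at);
```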

 

Addition of BigQuery Datatypes: Arrays

Dataprep extends array support in BigQuery by accounting for heterogeneous elements in arrays. Typically, arrays are of a fixed type (e.g., strings). However, some arrays may contain a mix of data types and can also embed JSON structures. By leveraging the advanced wrangling functions available in Dataprep for processing arrays and objects, users can operate on such heterogeneous array formats and extract the needed elements through indexing and parsing methods.
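For instance, when a string array carries mixed content, including embedded JSON, the elements can be pulled apart by index and parsed. A hand-rolled sketch of the idea (Dataprep's array functions wrap this kind of logic behind the visual interface):

```sql
WITH t AS (
  SELECT ['42', 'hello', '{"id": 7}'] AS mixed_arr  -- strings, numbers-as-text, JSON
)
SELECT
  mixed_arr[OFFSET(0)] AS first_elem,                       -- access by index
  JSON_VALUE(mixed_arr[OFFSET(2)], '$.id') AS embedded_id   -- parse embedded JSON
FROM t;
```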

Looking Ahead

In the near future, Trifacta will introduce further enhancements to support any external table as a source for the pushdown. This will help expand pushdown use cases to Spanner, Cloud SQL, Cloud Bigtable, and other external sources supported by BigQuery.

Soon, end-to-end automation of more complex ELT flows will be possible, enabling selective pushdown to the BigQuery engine to be carried out automatically by the optimizer. For example, if source data originate from another cloud or on-premises storage, then the extraction (E) will be done from outside of BigQuery, the raw data will be moved to Google Cloud Storage, and all transformations from there will be pushed into BigQuery as SQL statements.

Trifacta will also support customer extensions for pushdown, including custom SQL, BigQuery SQL endpoints, and JavaScript UDFs.

In addition, data sampling will be carried out directly on BigQuery to take advantage of the scalable engine. If BigQuery is your data warehouse of choice, then Dataprep by Trifacta should be your primary go-to place to build your data pipelines and generate your SQL code.

If you haven’t yet tried all this BigQuery goodness, now is the time to give it a try!