Join us on April 7-9, 2021

The first industry event focused on data engineering

Register Today
All Blog Posts

How to Extend Cloud Dataprep by Using BigQuery Javascript UDFs

October 12, 2020

Since Trifacta is a data company, we try to be as data-driven as possible. This means that product usage data analysis informs many of our product, sales, and marketing decisions. Our team has built our usage data pipeline entirely in GCP, so we can use both our own technology (Google Cloud Dataprep) and native GCP services like BigQuery to drive our analysis.

This article will explain how you can combine Cloud Dataprep with BigQuery Javascript UDFs to extend the scope of data preparation work that you can accomplish natively in Cloud Dataprep. I’m going to frame this how-to around a usage data question I recently answered…if you don’t care about the background behind the usage data problem and just want to get to the instructions, scroll down to the section titled “How it works”.

If you don’t already have access to Cloud Dataprep, and you want to try this yourself, you can sign up here.


A flow with a single DAG

Recently, our head of engineering asked me to provide him with some information around how many of our users are building “disconnected DAGs” in their flows. Trifacta’s flow view implements a form of directed acrylic graph. Users can either create a 1:1 relationship between DAG and flow, as shown in the image below:

Or users can create a many:1 relationship between DAG and flow, as shown below:

A flow containing 3 DAGs

Our design team has been focused on improving Trifacta’s Flow View (for details on the first iteration of the redesign, you can check out this blog), so understanding how many customers build multiple DAGs in a single flow would help guide future development.

How it works

1. Understand the source data format and required workflow

In BigQuery, the base table that I need to use for this analysis is called “flowedges”. This table reports at the granularity of the flow edge, and maps pairs of input and output nodes to an overall flow ID. The table roughly looks like this:

Sample of data from the flowedges table

Notice that the “flowedges” table doesn’t explicitly identify each separate flow DAG — in order to determine if a flow contains one DAG or many DAGs, you need to traverse the flow graphs and assign each flow edge pair to the appropriate DAG group. Doing this graph traversal requires implementing breadth-first search.

Looking at my GCP toolkit for preparing this data, I have two choices:

  1. Cloud Dataprep
  2. BigQuery

If I think about the format I need to transform my data into, I plan to create an array of arrays as my initial data structure. This will change the granularity of my dataset so that each record represents a single flow, and one column contains all of the pairs of flowNodes associated with that flow. The array structures will allow me to iterate through the pairs using the breadth-first search method and assign pairs to groups.

My initial source to target data transformation

Cloud Dataprep supports nested array operations, but BigQuery doesn’t. So I need Cloud Dataprep to form the foundation for my data preparation work. However, the breadth-first search function that I need to apply is rather esoteric and isn’t available as a native function in Cloud Dataprep. Since BigQuery supports user-defined functions, I’m going to use BigQuery to create the breadth-first search function that I will then invoke in Cloud Dataprep.

My full data preparation pipeline, including both Cloud Dataprep and BigQuery, will look like this:

  1. Transform initial data into an array of arrays using Cloud Dataprep, and stage that table in BigQuery.
  2. Create a Custom SQL dataset in Cloud Dataprep that invokes the BigQuery Javascript UDF.
  3. Transform the result of custom SQL dataset using Cloud Dataprep to count the DAGs per flow.
  4. Write output back into BigQuery.

Once the data is in BigQuery, I can push it into DataStudio for visualization, but I’m not going to show that piece of the pipeline in this article.

2. Write the BigQuery Javascript UDF

Full disclosure: I am not a computer programmer by trade. So although I can code, I much prefer to search StackOverflow and find examples of similar problems that other people have already solved. Luckily, I surfaced this thread ,which gave me some really good insight (and base code for my Javascript UDF) into how to traverse a series of edges and assign them to groups.

Armed with this Javascript, I tweaked the code slightly and converted it into a BigQuery Javascript UDF. I decided to create a persistent UDF, because after speaking with other colleagues, it seemed like this function could have applications beyond my single use case. In BigQuery, persistent UDFs can be invoked by anyone with appropriate permissions on the dataset where the persistent UDF has been saved.

I followed the Google documentation for creating Javascript UDFs. You can reference the documentation here. My final Javascript UDF, called “groupDags”, looks like this:

This function produces a JSON array that contains a separate array for each individual DAG. Using our example from step 1, this UDF will produce the following output:

Input and output from the Javascript UDF

Notice that flow 1 has 2 DAGs, and flow 2 has only 1 DAG.

3. Prepare the source data in Cloud Dataprep for using the UDF

Now that I’ve created my BigQuery UDF, it’s time to create my initial transformed table, which will contain an array of arrays.

In Cloud Dataprep, I first nested the inputFlowNode and outputFlowNode columns into a single array using the “Nest columns into Objects” transform:

And I then aggregated this object back into a list of pairs using the “Group By” transform:

As a final step, I wrote this output into BigQuery as a table called “flowdags”.

4. Create a custom SQL dataset in Cloud Dataprep to invoke the UDF

Now that I’ve created my intermediate table in BigQuery, I can create a custom SQL dataset in Cloud Dataprep to invoke the UDF. Custom SQL datasets allow you to write any valid SQL statement and push the execution of that statement down to the database layer. You can then use the output of that SQL statement like a view, and perform additional transformations in Cloud Dataprep.

To create a custom SQL dataset in Cloud Dataprep, I’ll navigate to the “Import Data” page and click on “Create Dataset with SQL”:

This will open the custom SQL editor. I can simply paste my custom SQL statement into this editor. In the screenshot below, you can see that I’ve invoked the persistent BigQuery UDF that I saved under the name “groupDags”:

The output from this query, including the result of the UDF, will become my new source for any further transformations in Cloud Dataprep.

5. Perform final transformations on the custom SQL dataset and write output to BigQuery

After my UDF has been invoked, I now have a two-column dataset. One column contains the flow ID, and the other column contains the result from the groupDags UDF. All I need to do at this point is count the number of DAGs in each flow. I can use the native “ARRAYLEN” function in Cloud Dataprep to do this:

And as a final step, I can write the resulting table directly into BigQuery. My head of engineering can either query that resulting table directly, or use a downstream tool like DataStudio to create visualizations and reports.

More applications for BigQuery UDFs and Cloud Dataprep

In addition to my usage data project, some other potential applications for invoking BigQuery UDFs in Cloud Dataprep Custom SQL dataset could include the following:

  • Applying advanced text-matching or similarity algorithms (think soundex, Jaro-Winkler, etc)
  • Using NLP to pre-process text columns
  • Enriching data with the results from an API call
  • …or more!

Additional Cloud Dataprep content

My colleague, Victor Coustenoble, also writes pieces about Google Cloud Dataprep. Check out some of his work: