Since Trifacta is a data company, we try to be as data-driven as possible. This means that product usage data analysis informs many of our product, sales, and marketing decisions. Our team has built our usage data pipeline entirely in GCP, so we can use both our own technology (Google Cloud Dataprep) and native GCP services like BigQuery to drive our analysis.
If you don’t already have access to Cloud Dataprep, and you want to try this yourself, you can sign up here.
Recently, our head of engineering asked me to provide some information about how many of our users are building “disconnected DAGs” in their flows. Trifacta’s flow view implements a form of directed acyclic graph (DAG). Users can either create a 1:1 relationship between DAG and flow, as shown in the image below:
Or users can create a many:1 relationship between DAG and flow, as shown below:
Our design team has been focused on improving Trifacta’s Flow View (for details on the first iteration of the redesign, you can check out this blog), so understanding how many customers build multiple DAGs in a single flow would help guide future development.
How it works
1. Understand the source data format and required workflow
In BigQuery, the base table that I need to use for this analysis is called “flowedges”. This table reports at the granularity of the flow edge, and maps pairs of input and output nodes to an overall flow ID. The table roughly looks like this:
Notice that the “flowedges” table doesn’t explicitly identify each separate flow DAG — in order to determine if a flow contains one DAG or many DAGs, you need to traverse the flow graphs and assign each flow edge pair to the appropriate DAG group. Doing this graph traversal requires implementing breadth-first search.
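Assigning each edge pair to a DAG group is essentially the connected-components problem: two edges belong to the same DAG if their nodes are reachable from one another. Here is a minimal sketch of that traversal in plain JavaScript (the language BigQuery UDFs can be written in); the function name, input shape, and node IDs are my own illustration, not Trifacta’s actual implementation:

```javascript
// Group flow-edge pairs into DAG groups using breadth-first search.
// Input: an array of [inputNode, outputNode] pairs for a single flow.
// Output: an array of arrays, one per connected component (DAG group).
function groupDags(pairs) {
  // Build an undirected adjacency list; edge direction doesn't matter
  // when deciding which nodes belong to the same component.
  const adj = new Map();
  for (const [a, b] of pairs) {
    if (!adj.has(a)) adj.set(a, []);
    if (!adj.has(b)) adj.set(b, []);
    adj.get(a).push(b);
    adj.get(b).push(a);
  }
  const seen = new Set();
  const groups = [];
  for (const start of adj.keys()) {
    if (seen.has(start)) continue;
    // BFS from any unvisited node collects one whole component.
    const queue = [start];
    seen.add(start);
    const component = [];
    while (queue.length > 0) {
      const node = queue.shift();
      component.push(node);
      for (const next of adj.get(node)) {
        if (!seen.has(next)) {
          seen.add(next);
          queue.push(next);
        }
      }
    }
    groups.push(component);
  }
  return groups;
}
```

For example, `groupDags([[1, 2], [2, 3], [4, 5]])` returns two groups, because nodes 1–3 are connected to each other but not to nodes 4–5.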
Looking at my GCP toolkit for preparing this data, I have two choices:
- Cloud Dataprep
- BigQuery
If I think about the format I need to transform my data into, I plan to create an array of arrays as my initial data structure. This will change the granularity of my dataset so that each record represents a single flow, and one column contains all of the pairs of flowNodes associated with that flow. The array structures will allow me to iterate through the pairs using the breadth-first search method and assign pairs to groups.
Cloud Dataprep supports nested array operations, but BigQuery doesn’t (BigQuery can’t store an array of arrays directly). So Cloud Dataprep needs to form the foundation of my data preparation work. However, the breadth-first search that I need to apply is rather esoteric and isn’t available as a native function in Cloud Dataprep. Since BigQuery supports user-defined functions (UDFs), I’m going to use BigQuery to create the breadth-first search function that I will then invoke from Cloud Dataprep.
My full data preparation pipeline, including both Cloud Dataprep and BigQuery, will look like this:
- Transform the initial data into an array of arrays using Cloud Dataprep, and stage that table in BigQuery.
- Invoke the breadth-first search UDF on the staged table through a Cloud Dataprep custom SQL dataset.
- Transform the result of the custom SQL dataset in Cloud Dataprep to count the DAGs per flow.
- Write the output back into BigQuery.
Once the data is in BigQuery, I can push it into Data Studio for visualization, but I’m not going to show that piece of the pipeline in this article.
2. Create a breadth-first search UDF in BigQuery
This UDF produces a JSON array that contains a separate array for each individual DAG. Using our example from step 1, the UDF will produce the following output:
Notice that flow 1 has 2 DAGs, and flow 2 has only 1 DAG.
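A persistent BigQuery UDF along these lines might be declared as follows. This is only a sketch: the dataset name (`analytics`), the JSON-string interface, and the JavaScript body are my assumptions, not the article’s actual definition.

```sql
-- Hypothetical sketch of a persistent BigQuery JS UDF for grouping DAGs;
-- dataset name, signature, and interface are assumptions.
CREATE OR REPLACE FUNCTION analytics.groupDags(pairs STRING)
RETURNS STRING
LANGUAGE js AS """
  // `pairs` is a JSON string like "[[1,2],[2,3],[4,5]]".
  // Breadth-first search assigns each node to a DAG group, and the
  // groups come back as a JSON string, e.g. "[[1,2,3],[4,5]]".
  var edges = JSON.parse(pairs);
  var adj = {};
  edges.forEach(function(e) {
    adj[e[0]] = (adj[e[0]] || []).concat([e[1]]);
    adj[e[1]] = (adj[e[1]] || []).concat([e[0]]);
  });
  var seen = {};
  var groups = [];
  Object.keys(adj).forEach(function(start) {
    if (seen[start]) return;
    var queue = [start];
    var component = [];
    seen[start] = true;
    while (queue.length) {
      var node = queue.shift();
      component.push(node);
      adj[node].forEach(function(next) {
        if (!seen[next]) { seen[next] = true; queue.push(next); }
      });
    }
    groups.push(component);
  });
  return JSON.stringify(groups);
""";
```

Making the function persistent (rather than a temporary UDF) is what lets it be referenced later from a Cloud Dataprep custom SQL dataset.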
3. Prepare the source data in Cloud Dataprep for using the UDF
Now that I’ve created my BigQuery UDF, it’s time to create my initial transformed table, which will contain an array of arrays.
In Cloud Dataprep, I first nested the inputFlowNode and outputFlowNode columns into a single array using the “Nest columns into Objects” transform:
And I then aggregated this object back into a list of pairs using the “Group By” transform:
As a final step, I wrote this output into BigQuery as a table called “flowdags”.
4. Create a custom SQL dataset in Cloud Dataprep to invoke the UDF
Now that I’ve created my intermediate table in BigQuery, I can create a custom SQL dataset in Cloud Dataprep to invoke the UDF. Custom SQL datasets allow you to write any valid SQL statement and push the execution of that statement down to the database layer. You can then use the output of that SQL statement like a view, and perform additional transformations in Cloud Dataprep.
To create a custom SQL dataset in Cloud Dataprep, I’ll navigate to the “Import Data” page and click on “Create Dataset with SQL”:
This will open the custom SQL editor. I can simply paste my custom SQL statement into this editor. In the screenshot below, you can see that I’ve invoked the persistent BigQuery UDF that I saved under the name “groupDags”:
The output from this query, including the result of the UDF, will become my new source for any further transformations in Cloud Dataprep.
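For reference, the custom SQL statement could look something like the query below; the dataset, table, and column names here are assumptions based on the steps above, not a copy of the screenshot.

```sql
-- Hypothetical custom SQL for the Dataprep dataset: invoke the
-- persistent groupDags UDF against the staged "flowdags" table.
SELECT
  flowId,
  analytics.groupDags(flowNodePairs) AS dags
FROM analytics.flowdags;
```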
5. Perform final transformations on the custom SQL dataset and write output to BigQuery
After my UDF has been invoked, I now have a two-column dataset. One column contains the flow ID, and the other column contains the result from the groupDags UDF. All I need to do at this point is count the number of DAGs in each flow. I can use the native “ARRAYLEN” function in Cloud Dataprep to do this:
And as a final step, I can write the resulting table directly into BigQuery. My head of engineering can either query that table directly, or use a downstream tool like Data Studio to create visualizations and reports.
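If you’d rather push the counting step down into BigQuery instead of doing it in Cloud Dataprep, an equivalent of the ARRAYLEN step could use BigQuery’s native JSON functions, assuming the UDF output lands in a string column. The table and column names below are illustrative assumptions:

```sql
-- Hypothetical BigQuery equivalent of the ARRAYLEN recipe step:
-- parse the UDF's JSON output and count the top-level arrays.
SELECT
  flowId,
  ARRAY_LENGTH(JSON_EXTRACT_ARRAY(dags)) AS dagCount
FROM analytics.flow_dag_groups;
```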
More applications for BigQuery UDFs and Cloud Dataprep
In addition to my usage data project, some other potential applications for invoking BigQuery UDFs in a Cloud Dataprep custom SQL dataset include the following:
- Applying advanced text-matching or similarity algorithms (think Soundex, Jaro-Winkler, etc.)
- Using NLP to pre-process text columns
- Enriching data with the results from an API call
- …or more!
Additional Cloud Dataprep content
My colleague, Victor Coustenoble, also writes pieces about Google Cloud Dataprep. Check out some of his work: