Most of our customers who use Trifacta for cloud data warehouse use cases want to source tables from transactional systems that update with new records on a daily or weekly cadence. However, performing full loads of these systems to the data warehouse often leads to continually growing data volumes and increasingly long processing times.
In this article, you’ll learn how to use Trifacta’s source parameterization and scheduling capabilities to perform incremental data loads into your data warehouse. Incremental, or delta, loads enable you to apply your data preparation logic to only the most recent source records. This allows you to keep your data warehouse loading pipeline performant by only processing new data.
Our example will build a flow to process daily product usage events stored in BigQuery, and publish each day’s new records to a different BigQuery dataset.
1. Getting started
To get started, you will want to connect your Trifacta workspace to your transactional system. Depending on your Trifacta edition, the available relational connections vary. Refer to our documentation for instructions on how to establish a JDBC connection to your source system.
(If you don’t yet have a Trifacta account, you can sign up here for free: https://www.trifacta.com/start-wrangling/)
2. Identify your delta column
In order to perform an incremental load with Trifacta, your source data needs to have a column that identifies which records are new. In many transactional systems, this is a timestamp column that specifies when a record was inserted into the table.
Our example product usage table includes a column named “createdAt” which identifies when each individual record was created. This column contains a timestamp in the format YYYY-MM-DDTHH:mm:ss. The “createdAt” column is the column that we will use to identify the new rows that need to be incrementally loaded into our data warehouse.
3. Create a custom SQL dataset
Once you have created a connection to your source system, open the “Import Data” page. This is where you will create your logic to identify the most recent records in a table.
From the “Import Data” page, select your source system on the left, and then click the “Create Dataset with SQL” button at the top of the screen.
Trifacta allows you to create a dataset using a fully-specified SQL statement. This gives you the flexibility to incorporate parameters and stored procedures, or optimize your SELECT statement based on your source system.
To perform an incremental load, we are going to create a SELECT statement with a WHERE clause that includes a dynamic parameter.
4. Write your custom SQL statement
In the “Edit SQL” pop-up, you will be able to type a SQL statement. This SQL statement will be pushed down to the source system and executed, so make sure your syntax is correct for the specific flavor of SQL that your database uses.
Since we’re using BigQuery as our source, our statement will look like the following:
select * from gcp.datasources where createdAt >= ‘2020–08–12’
This SQL statement will select all of the records in the datasources table that have been created on or after August 12, 2020. At the moment, this is a static query that has hardcoded the date value in our WHERE clause. In our next step, we will make that filter dynamic.
5. Add a date/time parameter to your SQL statement
After you have written your initial SQL statement, select the date/time string with your mouse. Make sure you don’t highlight any formatting niceties surrounding the date/time string, like quotation marks.
Once you have highlighted the date/time string, you’ll see two black pop-up boxes hovering over your selection. Click on the box that contains the clock icon.
This allows you to add a timestamp parameter to your SQL statement. The timestamp parameter will be dynamically evaluated when you run your job.
After clicking on the “Add Timestamp Parameter” button, you can configure the format of your data and the specific period when the parameter will be evaluated. Make sure that the timestamp format that you enter matches the valid format from your original SQL statement.
The timestamp value determines which specific value will be dynamically inserted into your SQL statement. This value is evaluated at job run time.
For example, if a job that uses this custom SQL dataset as a source is executed on August 14, 2020, and the “Timestamp value” is set to “Relative to job start date minus 1 days”, Trifacta will populate the SQL statement with the string value “2020–08–13”.
Click “Save” on the pop-up to store your parameter settings. After you save your parameter, your custom SQL statement will include a dark blue bubble. This represents your dynamic timestamp parameter.
Click “Create Dataset” to save your custom SQL dataset. Import your custom SQL dataset.
You can now use your custom SQL dataset in any of your Trifacta flows. Go ahead and add any transformation recipes that you need to cleanse, enrich, or structure this source data.
6. Configure a scheduled destination
From the flow that contains your custom SQL dataset, add an output to your final recipe. This will open the details panel. We want to set a scheduled destination to this output. A scheduled destination determines where Trifacta will write your results every time a scheduled job is executed.
To configure a scheduled destination, click the “Add” button next to “Scheduled Destinations” in the output details panel.
This will open the “Scheduled Publishing settings” page. Click the hyperlink in the middle of the page that reads “Add new publishing action” to configure your output settings.
From the publishing settings page, navigate to your data warehouse and click “Create new table”. You can configure the name of the table that you are going to be creating. When performing an incremental load, we recommend selecting the “Append to this table every run” option. This ensures that whenever your job executes, Trifacta will write the newly processed records to the bottom of your existing table.
Click “Update”, followed by “Save settings”. You have now set up your output so that your delta records will be written into an existing table.
7. Schedule your flow
Finally, you’re ready to schedule your flow. From the output details panel, click the “Schedule it” hyperlink to configure the cadence at which Trifacta will execute your job.
Remember that scheduling works hand-in-hand with your parameterized input. Your date/time parameter will be evaluated dynamically whenever your schedule is executed.
That’s it! You have now created an end-to-end pipeline to identify new records in your source system, process those records using Trifacta, and then automate the pipeline to publish those records to your target data warehouse.
For more tips and tricks about Trifacta, incremental loads, and cloud data warehousing, check out my colleague Victor Coustenoble’s work: