Recently, we announced new functionality to support data engineers within growing data operations (DataOps) practices. This is an exciting shift, as we see the role of the data engineer becoming increasingly critical to helping our platform expand and mature. Data engineers are leading the charge to scale data preparation across the organization and ensure that data prep workflows are efficient, repeatable, and governed. As the role of the data engineer has expanded, we’re continuing to update the Trifacta platform to ensure it meets their needs. Our latest functionality demonstrates Trifacta’s commitment not only to the end user, but also to the critical collaboration between those end users and data engineers.
In an earlier blog post, my colleague Sean discussed how RapidTarget allows data engineers to set a predefined schema target. Now we’re excited to talk a little more about Automator, our system for intelligently managing the scaling, scheduling, and monitoring of data prep workflows in production.
Often, there’s little control over how the data that needs to be wrangled is organized. Instead, data engineers must work with what they have. We’ve introduced parameters and variables to help deal with these situations. They allow you to specify which input data you want Trifacta to use each time a job is run. Parameters and variables can be used in datasets created from both files and databases.
Working With File Paths
When you’re reading files from a file system, you need to work around how the data is organized. You might have dates split across multiple levels of your path or file name, parts of a path that you want to match against any instance of a pattern (like an email address), or parts whose value you want to set later when you set up a schedule, kick off a job, or invoke Trifacta’s APIs (like a geographic region or customer name). To do all this and more, you can now use three new features when defining your file paths: datetime parameters, pattern parameters, and variables. I provide an overview of each below.
Datetime parameters let you point out dates or times in your file paths and define rolling ranges that Trifacta will dynamically resolve at job run time. This lets you do valuable things like match the last two weeks of files partitioned by date (e.g. 2018/04/orders_25.csv).
Pattern parameters let you use wildcards or regular expressions (Trifacta patterns coming soon!) in your file paths. You can do things like use a wildcard to ignore the file extension or use a regular expression to select all folders that match an email address.
Variables let you define a part of a path that you’ll have the ability to override later on when running a job, setting up a schedule, or invoking Trifacta’s APIs.
You can use all three together to create powerful file matching rules.
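To make the combination concrete, here is a minimal Python sketch of the idea, not Trifacta’s actual syntax or implementation. It assumes a hypothetical path template in which strftime codes stand in for the datetime parameter, a shell-style wildcard stands in for the pattern parameter, and a {region} placeholder stands in for a variable. The function names and the template format are illustrative assumptions only.

```python
from datetime import date, timedelta
from fnmatch import fnmatchcase


def resolve_patterns(template, run_date, days_back, variables):
    """Expand a path template into one wildcard pattern per day in a rolling range.

    The template format is hypothetical: strftime codes play the role of the
    datetime parameter and {name} placeholders play the role of variables.
    """
    patterns = []
    for offset in range(days_back):
        day = run_date - timedelta(days=offset)
        # Fill in the date parts first, then the variable values.
        patterns.append(day.strftime(template).format(**variables))
    return patterns


def match_files(paths, patterns):
    """Keep the paths that match at least one resolved wildcard pattern."""
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in patterns)]


# Last two weeks of orders for one region, ignoring the file extension.
TEMPLATE = "{region}/%Y/%m/orders_%d.*"
available = [
    "emea/2018/04/orders_25.csv",
    "emea/2018/04/orders_12.json",
    "emea/2018/04/orders_11.csv",   # outside the 14-day window
    "apac/2018/04/orders_25.csv",   # different region
]
patterns = resolve_patterns(TEMPLATE, date(2018, 4, 25), 14, {"region": "emea"})
selected = match_files(available, patterns)
```

In this sketch the date range is resolved relative to the run date, so the same template picks up a different set of files on each run, while the region value can be swapped without editing the template.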
Variables in Custom SQL
Trifacta’s custom SQL editor already provides a powerful way for those of you who are comfortable writing SQL to select exactly the right data. You can select columns, filter your data, pre-aggregate, create calculated fields, and join with other tables in your database, all before bringing your data into Trifacta to wrangle. We’re making it even better by letting you use variables in your SQL statements. You can replace as much or as little of your SQL statement as you want with a variable.
As with file paths, you can pass in values for your variables when you run a job, either from within the application or via our API.
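The general pattern of a SQL statement with default variable values that can be overridden per run can be sketched in Python as follows. This is an illustration under stated assumptions, not Trifacta’s variable syntax: it uses the $name placeholders of Python’s string.Template as a stand-in, and the table, column, and function names are hypothetical.

```python
from string import Template

# Hypothetical statement: both the table name and the filter value are variables.
QUERY = Template("SELECT order_id, total FROM $table WHERE region = '$region'")

# Default values, used when a run does not override them.
DEFAULTS = {"table": "orders", "region": "emea"}


def render_query(template, defaults, overrides=None):
    """Merge per-run overrides over the defaults and substitute them into the SQL."""
    values = {**defaults, **(overrides or {})}
    # substitute() raises KeyError if a placeholder has no value.
    return template.substitute(values)


default_sql = render_query(QUERY, DEFAULTS)
override_sql = render_query(QUERY, DEFAULTS, {"region": "apac"})
```

Because this kind of substitution is purely textual, a variable can stand in for any fragment of the statement. It also means the values should come from trusted sources, since nothing in this sketch escapes them.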
We’re excited about the release of Automator, which gives data engineers critical control over scaling, scheduling, and monitoring data prep workflows. Stay tuned for the last post in this series, where we’ll review the final feature in this release, Deployment Manager.