On Wednesday, we announced new functionality to support data engineers within growing data operations (DataOps) practices. Like its predecesor DevOps, DataOps is a process-oriented function that focuses on improving data quality and reducing analytic cycle times across the organization. As the amount of data has grown, so too has DataOps; in 2018 alone, 73 percent of organizations indicated that they had planned to invest in DataOps.
Encompassed within the broader framework of DataOps is data preparation. Data engineers are responsible for developing, testing, scheduling, and monitoring data preparation pipelines in production. The more effectively and efficiently they can manage data preparation pipelines, the better they can scale out data preparation—and analytics in general—across the organization. Hardening these pipelines are a key area of importance to solving the data preparation bottleneck, which accounts for up to 80% of the time and resources of any data project.
Given the importance of DataOps and the data engineer, our latest functionality addresses the day-to-day requirements of the data engineer. In this post, I would like to talk about RapidTarget.
RapidTarget allows data engineers to set a predefined schema target that provides automated guidance for how diverse data sources must be structured, formatted and renamed in order to match to the desired target.
Let’s start with a simple example. Many of our users belong to industries that rely on external parties for sources of their data. In order for this data to be consumed by internal users or downstream customers, it has to be transformed to match an existing schema or data model. This process is commonly referred to as “data onboarding” and it spans a variety of industries. For example, retailers might have to join together hundreds of data sets from individual vendors, or insurance companies preparing policy details from agents all around the country.
In the following example, some of the steps required to prepare the data are:
- Merge names with a comma
- Change the date format to DD/MM/YYYY
- Standardize gender to M/F
- Split street address into individual attributes
- Re-order and rename columns
Source Raw Data
Desired Target Result
With RapidTarget, users can import constraints and data from a target destination into Trifacta for use during the transformation process. As a starting point, users can import common schema constraints from the target destination, such as
- Structure of the data (Order of columns, Column data types, Uniqueness in Structure of Data)
- Data Formatting (Dates – YYYYMMDD or DDMMYYY)
- System Constraints (Character limits, Timezone formats)
Trifacta uses these constraints to create a set of rules within a logical target that can be attached to a recipe. Within the transformer, Trifacta applies data validation and transformation suggestions to help users align the source dataset with the target destination schema.
How to use the Target to align your source data in Trifacta
From the flow view, users can import a Target and associate it with a recipe. Users can import data from any file, relational database, data platform (Hive / Redshift / SQL Data Warehouse) that Trifacta supports as a source dataset.
Once associated with a recipe, the Target appears in the transformer as an overlay above the source data. You can see the Target column names, order, data types and sample of the data itself in relation to the source data you are transforming.
Here’s another example based on real data
This view provides easy validation of mismatches between columns, with the column level indicators indicating how closely the current source matches the Target.
In most cases, matches are determined by column name, position and type resulting in the GREEN, BLUE and RED indicators above. In this screenshot, we see green indicators, which represent a complete match; blue indicators, which represent a partial match where the column names are identical but the data types are different; and red indicators, which represent data where no match was found between the source and target.
In advanced cases, the column indicators show PURPLE when Trifacta’s ML-based pattern matching algorithm identifies that the data in two columns have a high degree of similarity, but the column names are different.
Hovering over these column level indicators, Trifacta suggests specific transformations for the column based on the target. Here are a few examples:
Reordering the source column position to match the Target.
Converting the source data type format to the Target’s format.
If the user chooses to apply these transformation suggestions, they are added to the recipe as a step which can be edited or removed like any other recipe step.
When dealing with very wide datasets with many columns, the column browser view in the transformer provides a broader view of the mismatches.
In cases where there are many partial matches found, Trifacta can provide transformation suggestions that apply across multiple columns by choosing “Apply Matches by Name.”
Applying the automatic algorithm across all the columns results in multiple steps. Each of these steps can be viewed in the recipe where users review, edit or undo the steps as needed.
The goal is to increase efficiency and reduce errors in the data preparation workflow, while also providing the user with interactivity, transparency into the logic being generated and the ability to easily make custom edits.
We’re excited for the release of RapidTarget, which gives data engineers greater support in aligning their data to predefined schema targets, and is particularly important in instances of data onboarding, as illustrated. Stay tuned for our upcoming blogs where we’ll review the additional features in this release, Automator and Deployment Manager.
RapidTarget is immediately available. If you’d like to get your hands on them schedule a demo with our team.