Join us on April 7-9, 2021

The first industry event focused on data engineering

Register Today
All Blog Posts

Trifacta for Snowflake: Data Prep for Your Cloud Data Warehouse or Data Lake – Part 2

August 19, 2019

Recently we announced our native integration with Snowflake’s zero management cloud data warehouse. Part 1 of this blog series gave an overview of the immediate value organizations can realize when adopting Trifacta for Snowflake. In Part 2 of this blog series, we will focus on one of the three main use cases we highlighted where Trifacta can accelerate time to value — Reporting and Analytics.

Let’s take another look at the following diagram.

All of the above use cases hold Core Preparation in common. In most of the examples we see at our customers, there is a certain amount of initial structuring, cleaning, and blending done up front.

For Reporting and Analytics use cases, there is an additional stage of filtering, aggregating, and subsetting this data to make it fit for purpose for downstream consumption in visualization platforms, compliance and regulatory reports, and other analytics processes. We see these use cases across all industries but they are especially prevalent in financial services, insurance, marketing and retail. This work is often done by data analysts and includes support from data engineers. Data quality is essential for accurate reporting and analytics. Additionally, central governance and collaboration is important for ensuring that access to data is tightly administered, a full audit trail is created through self documenting lineage, and redundant work is eliminated through sharing and collaboration. Features that make this process easier for data analysts include Transform by Example, ranking functions, pivots, unpivots, and group bys, among others.

The biggest bottleneck in time to insight in reporting and analytics is generally that the data is provisioned by a centralized IT team performing ETL processes who are inundated by requests from various business units and data teams. Consumers of the data have certain specifications that require data to be in a certain shape with certain attributes. If IT teams give access to less refined pieces of information, the consumers of data are then required to do a lot of manual work in tools like Excel or Python. Additionally, many organizations are stuck with legacy reporting processes and desktop business tools to drive reports which leaves the potential for data quality issues, redundant work, and lack of compliance. The lack of governance of these platforms leads to multiple sources of truth, a lack of repeatability, and ultimately a lack of trust in data quality. It’s also hard to automate work in Excel and if the users ever change it’s painful to onboard work to others in Python.

Trifacta provides business teams with the automation capabilities of technical platforms with the ease of use of tools like Excel. Trifacta’s platform is centrally governed and access and permissions are configurable. Data never leaves the source systems, ensuring security.  Businesses are able to be more agile, ensure faster time to insight with higher quality data, and collaborate with other teams. Snowflake’s infrastructure as a service frees up technical resources to focus on improving process, access and compliance for business teams, and business teams are able to leverage the data they need and refine it for their purposes.

Let’s take a look at a marketing analytics use case to show the value of Trifacta for Reporting and Analytics.

Marketing Analytics in Trifacta

For this example, we’ll play the role of a marketing analytics team within a broader marketing organization. It’s important to deliver insights that the marketing and sales teams need to effectively sell our product. As our business grows, we need to adapt to new reporting requirements and quickly prepare familiar and unfamiliar data for consumption in our analytics platform. Our IT team has set up integration processes to bring all of our needed data in its raw state onto Snowflake. We need to connect to that data, clean up some of the columns, blend data together, extract important information, and finally set up an automated schedule to make sure the cleaned data is consistently feeding our dashboards. This is the Core Preparation piece of our workflow. Many individuals on the team then take the data automated in Trifacta and do further aggregations and subsetting before analyzing it. This is the aggregating, filtering, and subsetting piece that is key for self-service analytics. Let’s see what that looks like.

Our IT team has already set up our Snowflake connection, so we can go ahead and explore our available data and connect to the datasets we need for our reporting purposes. Trifacta makes it easy to explore the datasets and find what we need to get our job done.

Next, we’ll start to edit our recipes and clean up some of the data we have. Trifacta immediately highlights data quality issues like a lack of consistent date formats, and other anomalies like error messages in the data. It’s easy to address these issues by filtering out rows with error messages, and fixing inconsistent date formats with Transform by Example by entering my preferred output format, and Trifacta will do the work behind the scenes to get my data, regardless of its original format, to my desired output. As an analyst without much technical expertise, this makes my job significantly easier as there isn’t any syntax I need to know to accomplish any of these tasks.

We can use the continuous profiling to further clean and refine our data. There’s one user_id that appears to be showing up at a far higher rate than all others, and it also appears to all be coming form the same source website happening at the same time of day. This might indicate a bot, so we can remove those. We can also see some information that some team members might want to aggregate on in the downstream analysis, like url code and product id string. Just by clicking the column, Trifacta will provide suggestions for extracting the domain. Similarly, we can highlight over the product ID string and Trifacta will generate a suggestion to extract this information from the url column.

Once we’ve done some cleanup work on the first dataset, we can move on to joining this data with the other dataset in the flow. Again, Trifacta is providing continuous profiling to ensure we validate every step taken throughout the preparation process.

Many of the companies in this dataset we have joined in have issues with misspellings. Using Trifacta’s cluster clean, we can standardize all values to the preferred output value, which cleans up all of the spelling mistakes in this column.

Once this data is cleaned up, the core preparation component of the workflow is complete. This flow is ready to be scheduled and made available to others who can find use in it. We’ll create a schedule publish to our Snowflake analytics virtual warehouse daily and label this table MASTER.

In this particular example, a data analyst can pick up this work, do some aggregation and subsetting down to the examples they are looking at, and feed that to their downstream dashboards or analytics platforms.

This example shows a few of the benefits that enable Trifacta users to accelerate time to value in the analytics workflow on Snowflake. It gives users access to the data they need to get their job done, and empowers users with functionality to develop and automate data preparation work. Most importantly, all of this work is done on a centralized platform to ensure data governance best practices.

We’ll continue this blog series in the near future with a dive into the machine learning and AI use case. If you are interested in trying Trifacta to see how it could benefit your organization, sign up today!