Recently, we announced our native integration with Snowflake’s zero-management cloud data warehouse. This post gives an overview of common use cases that the integration supports and the benefits organizations can gain by pairing a modern cloud data lake or data warehouse with Trifacta, such as more efficient data preparation and the democratization of traditionally siloed processes. Whether you’re using Snowflake as a cloud data warehouse or as a data lake, we will share how to use Trifacta to accelerate and improve the following common use cases:
- Reporting and Analytics
- AI and ML
- Data Onboarding/Provisioning
We will cover each of these use cases with an in-depth example in later parts of this series and talk about how our customers tackle each one effectively. In this post, we discuss at a high level how Trifacta’s data preparation platform natively integrates with Snowflake to achieve these objectives. The ideal process and setup for tackling these use cases vary from organization to organization, but a generic example is visualized below.
The diagram describes the following workflow: data is collected from its various source systems and integrated into a staging area, either in Snowflake’s data lake or as raw tables in Snowflake’s data warehouse, using one of the many cloud ETL/ELT or data integration tools. If some of the data is unstructured (say, log files, social media data, or sensor data) and unsupported by the data warehouse, then landing the data in a data lake is the best option. From there, the data must be profiled, structured, cleaned, enriched, validated for data quality, and eventually set up in an automated workflow using Trifacta. This is the Core Preparation stage, and it is important whether the data lands first in the data lake or the data warehouse. Data engineers often perform this core preparation, but they collaborate with data scientists and data analysts to make sure that all of the important information is captured and data quality is ensured.
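To make the Core Preparation steps concrete, here is a minimal sketch in plain Python of profiling, cleaning, and validating a handful of staged records. It is not Trifacta's API or how Trifacta works internally (Trifacta handles these steps through its visual interface); the sample rows, column names, and the `profile`, `clean`, and `validate` helpers are all hypothetical and only illustrate the kind of work involved.

```python
from collections import Counter

# Hypothetical raw records as they might land in a staging area.
raw_rows = [
    {"order_id": "1001", "amount": " 25.50 ", "country": "us"},
    {"order_id": "1002", "amount": "", "country": "US"},
    {"order_id": "1003", "amount": "n/a", "country": "De"},
    {"order_id": "1004", "amount": "12", "country": " de "},
]

def profile(rows):
    """Profile: count empty values per column (after trimming whitespace)."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value.strip() == "":
                missing[column] += 1
    return dict(missing)

def clean(row):
    """Clean: trim whitespace, normalize country codes, parse the amount."""
    try:
        amount = float(row["amount"].strip())
    except ValueError:
        amount = None  # empty or unparseable amounts become None
    return {
        "order_id": int(row["order_id"]),
        "amount": amount,
        "country": row["country"].strip().upper(),
    }

def validate(row):
    """Validate: an assumed quality rule that amount is present and non-negative."""
    return row["amount"] is not None and row["amount"] >= 0

missing_counts = profile(raw_rows)
cleaned = [clean(row) for row in raw_rows]
valid = [row for row in cleaned if validate(row)]
rejected = [row for row in cleaned if not validate(row)]
```

In this sketch, rows that fail the quality rule are set aside in `rejected` rather than silently dropped, so they can be reviewed before the workflow is automated.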
For Reporting and Analytics use cases, there is an additional stage of filtering, aggregating, and subsetting this data to make it fit for purpose for downstream consumption in visualization platforms, compliance and regulatory reports, and other analytics processes. We see these use cases across all industries, but they are especially prevalent in financial services, insurance, marketing, and retail. This work is often done by data analysts with support from data engineers. Data quality is essential for accurate reporting and analytics. Additionally, central governance and collaboration are important for ensuring that access to data is tightly administered, that a full audit trail is created through self-documenting lineage, and that redundant work is eliminated through sharing. Features that make this process easier for data analysts include Transform by Example, ranking functions, pivots, unpivots, and group bys.
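As a rough illustration of the filtering, group-by, ranking, and pivot operations mentioned above, the following plain-Python sketch reshapes a few sales records for a report. The data and field names are made up for the example, and the code only shows the logic of these operations; it is not a representation of Trifacta's own functions.

```python
from collections import defaultdict

# Hypothetical prepared sales records.
sales = [
    {"region": "East", "quarter": "Q1", "revenue": 100.0},
    {"region": "East", "quarter": "Q2", "revenue": 150.0},
    {"region": "West", "quarter": "Q1", "revenue": 200.0},
    {"region": "West", "quarter": "Q2", "revenue": 50.0},
    {"region": "West", "quarter": "Q2", "revenue": 25.0},
    {"region": "North", "quarter": "Q1", "revenue": 0.0},
]

# Filter: drop rows that carry no revenue.
filtered = [row for row in sales if row["revenue"] > 0]

# Group by region and aggregate revenue.
totals = defaultdict(float)
for row in filtered:
    totals[row["region"]] += row["revenue"]

# Rank regions by total revenue, highest first.
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
ranks = {region: position for position, (region, _) in enumerate(ranked, start=1)}

# Pivot: one entry per region, one key per quarter.
pivot = defaultdict(lambda: defaultdict(float))
for row in filtered:
    pivot[row["region"]][row["quarter"]] += row["revenue"]
pivot_table = {region: dict(quarters) for region, quarters in pivot.items()}
```

An unpivot would simply reverse the last step, turning each region and quarter pair in `pivot_table` back into its own row.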
For Machine Learning and AI, data scientists require a stage to engineer features for their ML and AI models after the core cleaning and preparation work is done. This can include common ML preparation functions such as one-hot encoding, scaling, standardizing, and normalizing to ensure the models have the right features in the right structure. Trifacta can eliminate the frustrating and time-consuming aspects of data preparation for each of these tasks, and provides data scientists with the tools they need to create consistent, high-quality, differentiated data for their models. Trifacta’s visual guidance and built-in machine learning create an interface focused on ease of use, instant validation, and powerful transformations, so users can perform the transformations and data quality checks they need without writing or debugging code. This accelerates the time to deploy models and improves accuracy in supported data science platforms like Amazon SageMaker and DataRobot.
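For readers less familiar with these feature engineering steps, here is a small plain-Python sketch of one-hot encoding, min-max scaling, and standardizing. The sample values and the `one_hot`, `min_max_scale`, and `standardize` helpers are illustrative assumptions rather than Trifacta functions; the sketch uses the population standard deviation, and other tools may use the sample version instead.

```python
from statistics import mean, pstdev

# Hypothetical feature columns.
colors = ["red", "green", "red", "blue"]
ages = [20.0, 30.0, 40.0, 50.0]

def one_hot(values):
    """One-hot encode: one 0/1 indicator per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def min_max_scale(values):
    """Scale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]

def standardize(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

encoded = one_hot(colors)
scaled = min_max_scale(ages)
standardized = standardize(ages)
```

Both numeric helpers fall back to `0.0` when every value is identical, which avoids a division by zero on constant columns.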
For Data Onboarding use cases, where the objective is to ready a client’s data to fit a specific schema for use in a proprietary data product, most of the heavy work is done in the core preparation stage. We see this commonly among data service providers in industries like healthcare, pharmaceuticals, marketing analytics, and supply chain. These use cases often involve discovering anomalies, working with and blending unfamiliar data, standardizing formats and entity names, and restructuring the data to the specific schema used by the data product. Trifacta accelerates this process with features like Active Profiling, Rapid Target, Smart Cleaning, and Enhanced Joins.
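The onboarding steps described above (renaming columns, standardizing entity names and formats, and restructuring to a target schema) can be sketched in plain Python as follows. Everything here is hypothetical: the target schema, the column mapping, the alias table, and the `onboard` helper are invented for illustration and do not reflect how features such as Rapid Target are implemented.

```python
from datetime import datetime

# Hypothetical target schema expected by the data product.
TARGET_SCHEMA = ["customer_name", "signup_date", "state"]

# Hypothetical mapping from a client's column names to the target names.
COLUMN_MAP = {"Cust Name": "customer_name", "Signup": "signup_date", "ST": "state"}

# Hypothetical alias table for standardizing entity names.
ENTITY_ALIASES = {
    "acme corp.": "Acme Corporation",
    "acme corporation": "Acme Corporation",
}

# Date formats we assume the client might send.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]

def parse_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def onboard(record):
    """Rename, standardize, and restructure one client record."""
    renamed = {COLUMN_MAP[k]: v for k, v in record.items() if k in COLUMN_MAP}
    name = renamed.get("customer_name", "").strip()
    renamed["customer_name"] = ENTITY_ALIASES.get(name.lower(), name)
    renamed["signup_date"] = parse_date(renamed.get("signup_date", ""))
    renamed["state"] = renamed.get("state", "").strip().upper()
    # Emit exactly the target columns, in the target order.
    return {column: renamed.get(column) for column in TARGET_SCHEMA}

client_rows = [
    {"Cust Name": "ACME Corp.", "Signup": "03/15/2021", "ST": "ca", "Notes": "vip"},
    {"Cust Name": "Acme Corporation", "Signup": "2021-04-01", "ST": "NY"},
]
onboarded = [onboard(row) for row in client_rows]
```

Columns that are not in the mapping (such as `Notes` here) are dropped, and dates that match none of the assumed formats come back as `None` so they can be flagged as anomalies.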
This is part one of a multi-part blog series on Trifacta for Snowflake. In the next part, we will dive deeper into the reporting and analytics use case and how our customers use Trifacta and Snowflake together to solve it. Stay tuned!