Snowflake Data Prep for Data Scientists, Data Analysts and Data Engineers

October 29, 2019

Snowflake’s unique architecture allows organizations to store a wider variety of data formats and data types, including most SQL data types. This diversity opens up new possibilities for creating insight-rich data using a data preparation platform like Trifacta. Whether you use Snowflake as a cloud data warehouse or a data lake, Trifacta can accelerate time to value and improve data accuracy and quality across a variety of Snowflake data and data types in the following use cases:

  • Reporting and Analytics
  • AI and ML
  • Data Onboarding

The ideal process and setup for tackling these use cases vary from organization to organization, but a generic workflow looks like this.

Data of various types is collected from its source systems and integrated into a staging area, either in Snowflake’s data lake or as raw tables in Snowflake’s data warehouse, using one of a variety of cloud ETL/ELT or data integration tools. For semi-structured or completely unstructured data without defined data types (log files, social media data, sensor data, and so on), which traditional data warehouses do not support, Snowflake provides broader compatibility and accommodates a wider variety of data with varying data types. From there, the Snowflake data must be profiled, structured, cleaned, enriched, and validated for data quality against Snowflake data types, and eventually built into an automated workflow using Trifacta. This is the Core Preparation stage, and it is important whether the data lands first in the data lake or the data warehouse. Data engineers, data scientists, and data analysts often collaborate on this stage to make sure all of the important information is captured and data quality is ensured.
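To make the profiling and validation steps concrete, here is a minimal sketch in plain Python. It is not how Trifacta works internally, and Trifacta itself requires no code; the function names (`infer_type`, `profile_column`, `rows_failing_type`) and the sample column are hypothetical, chosen only to illustrate what profiling a raw column by inferred type can look like.

```python
from collections import Counter


def infer_type(value):
    """Classify a raw value as 'null', 'integer', 'float', or 'string'."""
    if value is None or str(value).strip() == "":
        return "null"
    text = str(value).strip()
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "string"


def profile_column(rows, column):
    """Count how many values of each inferred type appear in one column."""
    return Counter(infer_type(row.get(column)) for row in rows)


def rows_failing_type(rows, column, allowed_types):
    """Return the indexes of rows whose value is not one of the allowed types."""
    return [
        index
        for index, row in enumerate(rows)
        if infer_type(row.get(column)) not in allowed_types
    ]


# Hypothetical raw staging records, as they might arrive before cleaning.
raw_rows = [
    {"amount": "10"},
    {"amount": "3.5"},
    {"amount": ""},
    {"amount": "n/a"},
]

profile = profile_column(raw_rows, "amount")
bad_rows = rows_failing_type(raw_rows, "amount", {"integer", "float", "null"})
```

A profile like this shows at a glance which values would fail a cast to a numeric column, which is the kind of question the Core Preparation stage has to answer before data is loaded into typed tables.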

For Reporting and Analytics use cases, there is an additional stage of filtering, aggregating, and subsetting the Snowflake data to make it fit for purpose for downstream consumption in visualization platforms, compliance and regulatory reports, and other analytics processes, with data types aligned to downstream needs. We see these use cases across all industries, but they are especially prevalent in financial services, insurance, marketing, and retail. This work is often done by data analysts with support from data engineers. Data quality is essential for accurate reporting and analytics. Central governance and collaboration are also important: they ensure that access to data is tightly administered, that a full audit trail is created through self-documenting lineage, and that redundant work is eliminated through sharing. Features that make this process easier for data analysts include Transform by Example, ranking functions, pivots, unpivots, and group bys.
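As an illustration of the aggregation and reshaping operations named above, the following plain-Python sketch shows a group by and a pivot over a handful of records. The helper names (`group_sum`, `pivot`) and the sales data are hypothetical, and this is only a conceptual stand-in for what an analyst would do visually in a data preparation tool.

```python
from collections import defaultdict


def group_sum(rows, key, value):
    """Group rows by `key` and sum `value` within each group."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def pivot(rows, index, columns, value):
    """Turn distinct values of `columns` into fields, one output row per `index`."""
    table = defaultdict(dict)
    for row in rows:
        cells = table[row[index]]
        cells[row[columns]] = cells.get(row[columns], 0) + row[value]
    return {row_key: dict(cells) for row_key, cells in table.items()}


# Hypothetical sales records in long format.
sales = [
    {"region": "East", "quarter": "Q1", "revenue": 100},
    {"region": "East", "quarter": "Q2", "revenue": 200},
    {"region": "West", "quarter": "Q1", "revenue": 150},
]

revenue_by_region = group_sum(sales, "region", "revenue")
revenue_wide = pivot(sales, "region", "quarter", "revenue")
```

The pivoted form has one row per region and one field per quarter, which is the shape many visualization and reporting tools expect.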

For Machine Learning and AI, data scientists must engineer features for their models after the core preparation work on the Snowflake data is done. This can include common ML preparation functions such as one-hot encoding, scaling, standardizing, and normalizing, depending on the data types involved, so that the models receive the right features in the right structure. Trifacta can eliminate the frustrating and time-consuming aspects of data preparation for each of these tasks, giving data scientists the tools to create consistent, high-quality, differentiated data for their models, with the correct data structures and data types. Trifacta’s visual guidance and built-in machine learning create an interface focused on ease of use, instant validation, and powerful transformations. Users can perform the transformations and data quality checks they need without writing or debugging code, which accelerates model deployment and improves accuracy in supported data science platforms like Amazon SageMaker and DataRobot.
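For readers unfamiliar with the feature engineering functions mentioned above, here is a minimal plain-Python sketch of one-hot encoding, min-max scaling, and standardization. The function names are hypothetical and the code uses only the standard library; it illustrates the transformations themselves, not Trifacta's implementation of them.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode each categorical value as a dict of 0/1 indicator fields."""
    categories = sorted(set(values))
    return [{category: int(value == category) for category in categories}
            for value in values]


def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


# Hypothetical feature columns.
plan_encoded = one_hot(["basic", "pro", "basic"])
age_scaled = min_max_scale([0, 5, 10])
spend_standardized = standardize([1, 2, 3])
```

Which of these applies depends on the column's data type: one-hot encoding suits categorical text columns, while scaling and standardization apply to numeric ones.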

For Data Onboarding use cases, where the objective is to ready a client’s data to fit a specific schema for use in a proprietary data product, most of the heavy work of preparing Snowflake data is done in the core preparation stage. We see this commonly among data service providers in industries like marketing analytics, healthcare, pharmaceuticals, and supply chain. These use cases often involve discovering anomalies, working with and blending unfamiliar data, standardizing formats and entity names, and restructuring the data to the specific schema and data types used by the downstream data products. Trifacta accelerates this process with features like Active Profiling, Rapid Target, Smart Cleaning, and Enhanced Joins.
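The onboarding tasks described above (standardizing entity names and formats, then restructuring to a target schema) can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the alias table, the field mapping, the source date format, and the target field names are hypothetical and do not come from any real client schema or from Trifacta.

```python
from datetime import datetime

# Hypothetical lookup of known spellings to one canonical entity name.
ENTITY_ALIASES = {
    "acme corp.": "Acme Corporation",
    "acme corporation": "Acme Corporation",
}

# Hypothetical mapping from target schema fields to client source fields.
FIELD_MAP = {
    "customer_name": "Client",
    "signup_date": "Joined",
    "order_total": "Total",
}


def standardize_entity(raw):
    """Collapse whitespace and case, then map known aliases to a canonical name."""
    key = " ".join(raw.lower().split())
    return ENTITY_ALIASES.get(key, " ".join(raw.split()))


def to_iso_date(raw, source_format="%m/%d/%Y"):
    """Parse a date in the assumed source format and return it as YYYY-MM-DD."""
    return datetime.strptime(raw.strip(), source_format).date().isoformat()


def conform(record):
    """Restructure one client record into the target schema with typed values."""
    return {
        "customer_name": standardize_entity(record[FIELD_MAP["customer_name"]]),
        "signup_date": to_iso_date(record[FIELD_MAP["signup_date"]]),
        "order_total": float(record[FIELD_MAP["order_total"]]),
    }


# Hypothetical incoming client record with inconsistent formatting.
client_record = {"Client": " ACME  Corp. ", "Joined": "10/29/2019", "Total": "42.50"}
onboarded = conform(client_record)
```

Keeping the alias table and field mapping as data rather than code means a new client can be onboarded by editing the mappings instead of the logic.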