Snowflake Data Prep for Data Scientists, Data Analysts and Data Engineers

September 6, 2020

Snowflake’s architecture allows organizations to store a wide variety of data formats and data types, including most SQL data types. That diversity opens up new possibilities for creating insight-rich data with a data preparation platform like Trifacta. Whether you use Snowflake as a cloud data warehouse or a data lake, Trifacta can accelerate time to value and improve data accuracy and quality in the following use cases:

  • Reporting and Analytics
  • AI and ML
  • Data Onboarding

Snowflake Data Types in Practice

The ideal process and setup for tackling these use cases vary from organization to organization, but a generic workflow looks like this.

Data of various types is collected from its source systems and integrated into a staging area, either in Snowflake’s data lake or as raw tables in Snowflake’s data warehouse, using one of a variety of cloud ETL/ELT or data integration tools. Semi-structured data, or unstructured data without defined data types (log files, social media data, sensor data, and so on), is often unsupported by traditional data warehouses. Snowflake provides broader compatibility, so a wider variety of data with varying data types can be staged.

From there, the data must be profiled, structured, cleaned, enriched, and validated against the expected Snowflake data types, and eventually set up in an automated workflow using Trifacta. This is the Core Preparation stage, and it is important whether the data lands first in the data lake or the data warehouse. Data engineers, data scientists, and data analysts often collaborate on this stage to make sure all of the important information is captured and data quality is ensured.
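
As a rough illustration of the profiling and validation steps described above, the sketch below infers a type for each raw value in a staged column and flags values that do not match the expected type. It is a minimal example in plain Python, not Trifacta's or Snowflake's implementation; the function names, the sample column, and the `YYYY-MM-DD` date format are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def infer_type(value):
    """Classify one raw string as 'null', 'integer', 'float', 'date', or 'string'."""
    text = value.strip() if value is not None else ""
    if text == "":
        return "null"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")  # assumed date format
        return "date"
    except ValueError:
        return "string"


def profile_column(values):
    """Count how many values fall into each inferred type."""
    return Counter(infer_type(v) for v in values)


def find_mismatches(values, expected_type):
    """Return (row_index, value) pairs that are neither null nor the expected type."""
    return [
        (i, v) for i, v in enumerate(values)
        if infer_type(v) not in ("null", expected_type)
    ]


# Hypothetical staged column that is expected to hold integers.
order_quantity = ["12", "7", "", "three", "4.5", "19"]
profile = profile_column(order_quantity)
mismatches = find_mismatches(order_quantity, "integer")
```

Here `profile` summarizes the mix of types found in the column, and `mismatches` lists the rows that would need cleaning before the column could be loaded as an integer type.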

Snowflake Data Tips

For Reporting and Analytics use cases, there is an additional stage of filtering, aggregating, and subsetting the data to make it fit for downstream consumption in visualization platforms, compliance and regulatory reports, and other analytics processes, with data types that align with downstream needs. We see these use cases across all industries, but they are especially prevalent in financial services, insurance, marketing, and retail.

This work is often done by data analysts with support from data engineers. Data quality is essential for accurate reporting and analytics. Central governance and collaboration are also important: they ensure that access to data is tightly administered, that a full audit trail is created through self-documenting lineage, and that redundant work is eliminated through sharing. Features that make this process easier for data analysts include Transform by Example, ranking functions, pivots, unpivots, and group bys.
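
To make the filtering, group-by, ranking, and pivot operations mentioned above concrete, here is a small sketch in plain Python. The `sales` rows, the column names, and the reporting threshold are hypothetical, and the code only illustrates the general techniques, not how Trifacta performs them.

```python
from collections import defaultdict

# Hypothetical cleaned rows, as they might look after core preparation.
sales = [
    {"region": "East", "quarter": "Q1", "amount": 120.0},
    {"region": "East", "quarter": "Q2", "amount": 80.0},
    {"region": "West", "quarter": "Q1", "amount": 200.0},
    {"region": "West", "quarter": "Q2", "amount": 150.0},
    {"region": "North", "quarter": "Q1", "amount": 50.0},
]

# Filter: keep only rows at or above an assumed reporting threshold.
filtered = [row for row in sales if row["amount"] >= 80.0]

# Group by region and aggregate with a sum.
totals = defaultdict(float)
for row in filtered:
    totals[row["region"]] += row["amount"]

# Rank regions by total, largest first (rank 1 is the top region).
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
ranking = {region: rank for rank, (region, _) in enumerate(ranked, start=1)}

# Pivot: one entry per region, one key per quarter.
pivot = defaultdict(dict)
for row in filtered:
    pivot[row["region"]][row["quarter"]] = row["amount"]
```

The pivoted structure is the shape a visualization tool typically expects, with one row per region and one column per quarter.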

For Machine Learning and AI, data scientists must engineer features for their models after the core preparation work is done. Depending on the Snowflake data types involved, this can include common ML preparation functions such as one-hot encoding, scaling, standardizing, and normalizing, so that the models receive the right features in the right structure.
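
The preparation functions named above can be sketched in a few lines of plain Python. This is a simplified illustration under assumed inputs: the `plan` and `monthly_spend` columns are hypothetical, and production pipelines would normally rely on a tested library rather than hand-written helpers like these.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode categorical values as 0/1 vectors, one position per sorted category."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature columns.
plan = ["basic", "pro", "basic", "enterprise"]
monthly_spend = [10.0, 20.0, 30.0, 40.0]

categories, encoded = one_hot(plan)
scaled = min_max_scale(monthly_spend)
standardized = standardize(monthly_spend)
```

Which function applies depends on the column's data type: one-hot encoding suits categorical strings, while scaling and standardizing apply to numeric columns.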

Trifacta can eliminate the frustrating and time-consuming aspects of data preparation for each of these tasks, giving data scientists the tools they need to create consistent, high-quality, differentiated data for their models, with the correct data structures and data types. Trifacta’s visual guidance and built-in machine learning create an interface focused on ease of use, instant validation, and powerful transformations. Users can perform the transformations and data quality checks they need without writing or debugging code, which accelerates model deployment and improves accuracy in supported data science platforms like Amazon SageMaker and DataRobot.

For Data Onboarding use cases, where the objective is to ready a client’s data to fit a specific schema for use in a proprietary data product, most of the heavy work of preparing Snowflake data is done in the core preparation stage. We see this commonly among data service providers in industries like marketing analytics, healthcare, pharmaceuticals, and supply chain. These use cases often involve discovering anomalies, blending unfamiliar data, standardizing formats and entity names, and restructuring the data to the specific schema and data types used by the downstream data products. Trifacta accelerates this process with features like Active Profiling, Rapid Target, Smart Cleaning, and Enhanced Joins.
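
The onboarding steps described above (standardizing formats and entity names, then restructuring to a target schema) might look like the following in plain Python. Everything here is a hypothetical example: the target schema, the column mapping, the alias table, and the assumed `MM/DD/YYYY` input date format are invented for illustration and do not describe Trifacta's features.

```python
from datetime import datetime

# Hypothetical target schema: field name -> converter to the required type.
TARGET_SCHEMA = {
    "customer_name": str,
    "signup_date": lambda s: datetime.strptime(s, "%m/%d/%Y").date().isoformat(),
    "annual_revenue": lambda s: float(s.replace("$", "").replace(",", "")),
}

# Hypothetical mapping from the client's column names to the target fields.
COLUMN_MAP = {
    "Customer": "customer_name",
    "Signup": "signup_date",
    "Revenue (USD)": "annual_revenue",
}

# Hypothetical lookup of known entity-name variants.
ENTITY_ALIASES = {
    "acme corp.": "Acme Corporation",
    "acme corporation": "Acme Corporation",
}


def conform(record):
    """Rename, convert, and standardize one client record to the target schema."""
    out = {}
    for source_name, target_name in COLUMN_MAP.items():
        raw = record[source_name].strip()
        out[target_name] = TARGET_SCHEMA[target_name](raw)
    name = out["customer_name"]
    out["customer_name"] = ENTITY_ALIASES.get(name.lower(), name)
    return out


client_row = {
    "Customer": " ACME Corp. ",
    "Signup": "09/06/2020",
    "Revenue (USD)": "$1,250,000",
}
conformed = conform(client_row)
```

Keeping the schema, column mapping, and aliases as data rather than logic means a new client can be onboarded by editing the three tables instead of the function.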

Data Preparation for Data Types

Data preparation is necessary to get the most out of the data stored in Snowflake. Without it, valuable insights may be lost. When working with databases, data types, and other information in Snowflake, it is key to use the right data preparation tools. Trifacta integrates with Snowflake to help data analysts make the most of different types of data and glean crucial insights from it. Precision in data analysis is key, and Trifacta and Snowflake together help deliver it. Request a demo to see Trifacta in action.
