
Snowflake Software and Trifacta

September 20, 2020

The cloud computing market is often boiled down to a race between the “Big Three” cloud providers—Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). But while these platforms may anchor an organization’s cloud strategy, they are far from the full picture. The cloud computing market is made up of numerous technologies, products, and services that are ever-expanding as the market continues its astronomical growth.

One such cloud player is Snowflake, a cloud-based data warehousing company that made waves after holding the biggest software IPO ever. It allows companies to unite their cloud data workloads under a single platform that leverages the full power of the cloud. 

Below, we dig into the specific benefits of the Snowflake software and how the Trifacta data preparation platform works hand-in-hand with Snowflake to help users get the most out of their cloud data storage. 

What is Snowflake software?

Snowflake solves a problem that came about with the rise of the cloud: as companies began to migrate their data warehouses and data lakes to the cloud for improved elasticity and scalability, the challenge of interoperating between these systems remained. Plus, many of these systems weren’t built from the ground up for the cloud, which limited their ability to take full advantage of the cloud’s resources and elastic capabilities. 

Enter the Snowflake “data platform,” an integrated cloud platform delivered as a service. Snowflake describes the data platform as letting users securely share and consume data from a single solution that combines the best components of enterprise data warehouses, cloud data warehouses, and modern data lakes. With the Snowflake software, the workloads that organizations run in the cloud are no longer separated; they are connected by and operated under one platform. 

How does Snowflake work?

Snowflake is available as a service on leading cloud providers such as AWS, Azure, and GCP. Its cloud-agnostic layer delivers a consistent experience across cloud regions and providers. As mentioned, it is a platform delivered as a service, which means there is no hardware for users to select, install, configure, or manage. 

What makes the Snowflake software unique is its architecture. In the traditional data platforms of Snowflake’s competitors, compute and storage resources are fixed, which limits concurrency, the ability to perform many tasks simultaneously. In contrast, the Snowflake software uses a multi-cluster, shared data architecture that separates compute from storage so that each can be scaled independently and fully leverage the resources of the cloud. This means a data scientist can query training data for a machine learning model at the same time a data engineer is ingesting data, and neither is the wiser; performance isn’t sacrificed. 

In addition to separate but interrelated compute and storage resources, the Snowflake data platform also offers an additional layer of cloud services, which automates common administrative, security and database tasks. These services help coordinate transactions across all workloads. 

In sum, the Snowflake data platform has three main components that are integrated but scale independently (a brief sketch of this separation follows the list):

  • Storage
    One platform for the storage of all types of data, whether structured or semi-structured.
  • Compute
    Interrelated but independent compute resources that improve performance.
  • Services
    Handles infrastructure, security, metadata, and query optimization across all workloads.
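
To make this separation concrete, the sketch below uses the snowflake-connector-python package; the credentials, warehouse names, and table are hypothetical placeholders rather than part of any specific deployment. It creates two virtual warehouses over the same stored data, so an ingestion job and an analytics query never compete for the same compute:

```python
# A minimal sketch using the snowflake-connector-python package.
# Credentials, warehouse, database, and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
)
cur = conn.cursor()

# Compute sized for a data engineer's ingestion workload.
cur.execute(
    "CREATE WAREHOUSE IF NOT EXISTS etl_wh "
    "WITH WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 60"
)

# Separate compute for a data scientist's queries; resizing it never slows etl_wh.
cur.execute(
    "CREATE WAREHOUSE IF NOT EXISTS analytics_wh "
    "WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60"
)

# Both warehouses read the same underlying tables in the shared storage layer.
cur.execute("USE WAREHOUSE analytics_wh")
cur.execute("SELECT COUNT(*) FROM analytics_db.public.training_data")
print(cur.fetchone())

cur.close()
conn.close()
```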

Data preparation on Snowflake

No matter where data lives, whether it’s big data in Snowflake or data stored in another type of cloud data lake or data warehouse, it must be cleansed and prepared for use in analytic projects. “Garbage in, garbage out,” as the saying goes, and it’s truer now than ever. Data continues to explode in size and complexity, which has pushed organizations to put sound data preparation practices in place. 

Much like cloud computing, data preparation techniques and technologies have also made significant advances in step with the changing data landscape. Instead of hand coding or traditional Extract, Transform, and Load (ETL) tools, data preparation platforms have become the focus of leading analysts and industry professionals. A data preparation platform offers all the power of coding languages under the hood of a visual, user-friendly interface. Data preparation platforms quickly surface errors and outliers and greatly reduce time spent on the overall data prep process. 
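
To give a rough sense of the profiling work such a platform automates, here is a minimal pandas sketch; the file and column names are hypothetical, and a real data preparation platform performs these checks visually and at far greater scale:

```python
# A minimal pandas sketch of basic data profiling; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("orders_raw.csv")

# Surface missing values and column types at a glance.
print(df.isna().sum())
print(df.dtypes)

# Flag rows whose order amount is a numeric outlier (simple z-score rule).
amount = pd.to_numeric(df["order_amount"], errors="coerce")
z_scores = (amount - amount.mean()) / amount.std()
print(df[z_scores.abs() > 3])
```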

So how does a data preparation platform work in tandem with Snowflake? Should you implement the two technologies, the workflow would look something like this:

  1. First, data is collected from its various source systems and integrated into a staging area, either in Snowflake’s data lake or as raw Snowflake tables in the data warehouse, using one of a variety of cloud ETL/ELT or data integration tools. If some of the data is unstructured (for example, log files, social media data, or sensor data) and unsupported by the data warehouse, then landing that data in a data lake is the best option.
  2. Next is the “core preparation” stage, where data is profiled, structured, cleaned, enriched, and validated for quality in the data preparation platform. Data engineers often perform this core preparation work, but they collaborate with data scientists and data analysts to make sure all of the important information is captured and data quality is ensured. (A minimal sketch of this stage and the ML feature-engineering step appears after this list.)
  3. After the data has been properly prepared, it is ready for its respective use case. Common use cases include:
    1. Reporting and Analytics
      In this use case, there is an additional stage of filtering, aggregating, and subsetting this data to make it fit for purpose for downstream consumption in visualization platforms, compliance and regulatory reports, and other analytics processes.
    2. Machine Learning and AI
      In this use case, data scientists require a stage to engineer features for ML and AI models after the core cleaning and preparation work is done. This can include common ML preparation functions like one-hot encoding, scaling, standardizing, and normalizing to ensure the models have the right features in the right structure.
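
For a rough sense of what steps 2 and 3 look like in practice, here is a minimal pandas and scikit-learn sketch; the staged file, column names, and cleaning rules are hypothetical examples, not a prescribed Trifacta or Snowflake workflow:

```python
# A minimal sketch of core preparation and ML feature engineering.
# The staged file, columns, and cleaning rules are hypothetical examples.
import pandas as pd
from sklearn.preprocessing import StandardScaler

raw = pd.read_csv("staged_events.csv")

# Core preparation: deduplicate, drop incomplete rows, fix types, validate.
clean = (
    raw.drop_duplicates()
       .dropna(subset=["customer_id", "event_time"])
       .assign(event_time=lambda d: pd.to_datetime(d["event_time"], errors="coerce"))
)
clean = clean[clean["amount"] >= 0]  # simple data quality rule

# Feature engineering for ML: one-hot encode categoricals, scale numerics.
features = pd.get_dummies(clean, columns=["channel", "region"])
features[["amount"]] = StandardScaler().fit_transform(features[["amount"]])

# The prepared table is ready to publish back to the warehouse or feed a model.
print(features.head())
```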

Snowflake software and Trifacta

Trifacta is widely recognized as the industry leader in data preparation and, in 2019, became the first data preparation platform to natively integrate with Snowflake software. Together, Trifacta and Snowflake empower data professionals to connect, wrangle, and publish clean data for analytics, machine learning, and AI, leveraging a centrally governed, native cloud platform. 

Here are three of the major benefits of using the Trifacta data preparation platform with Snowflake: 

  • Onboard Diverse Data Faster
    Discover anomalies, clean messy data and blend disparate data sources prior to publishing to Snowflake’s data warehouse for analytics.  
  • Modernize Reporting and Analytics
    Utilize visual and machine learning-driven guidance to empower diverse users while also leveraging the elastic scale and automation of modern cloud platforms.
  • Accelerate ML/AI Initiatives
    Improve the speed and quality of common data prep tasks for machine learning such as feature engineering, attribute standardization and one-hot encoding. 

In addition, Trifacta provides support for a range of native Snowflake features such as security, access controls and encryption as well as the cloud platform services required for deploying Snowflake on AWS, Azure and Google Cloud.

Learn More

When it comes to cloud data storage, Snowflake has become a popular choice among leading organizations. But data storage is only one piece of the puzzle—organizations need to ensure they have the right complementary technologies in order to maximize their investment, such as a data preparation platform. 

Trifacta’s leading data preparation platform reduces time spent preparing data by up to 90% and mitigates the risk that dirty data will end up as the basis of business-critical analytics projects. To learn more about Trifacta and its native integration with Snowflake, schedule a demo with our team or try the product out for yourself for free.
