Data School

Presenting The Data School, our online resource for people who work with data

Learn More
All Blog Posts

Data Preparation Best Practices for Snowflake Data Warehouses

November 4, 2020

Snowflake is a platform known for their separation of storage and compute, which makes scaling data more efficient. However, to get the most value from your investment in Snowflake’s Cloud Data Warehouse, your organization must break through the biggest bottleneck to analytics and AI: data preparation.

Here are five data preparation best practices your organization should adopt to harness the scalable computational power, agility, and cost efficiency of the Snowflake data warehouse for analytics, machine learning, and data visualization.

  1. Empower all stakeholders
  2. Focus on the right use cases
  3. Enable self-service while maintaining data governance
  4. Ensure data quality at scale with continuous validation
  5. Automate preparation of data for downstream analytics and machine learning

Empower All Stakeholders

Your data preparation processes on your data warehouse should empower all stakeholders to coordinate and do their jobs faster and easier:

  • Data analysts, who need to explore, structure, clean, blend, aggregate into the correct schema for data warehousing, and validate data quality with data closer to the source to improve time to value and open up new areas for insights
  • Data scientists, who perform data exploration, analytics, modeling, and algorithm development on a wide variety of data sources and structures and collaborate with business leadership to determine the analytical insights that drive innovation and achieve business objectives
  • Data engineers, who design, build, and manage data processes and data architecture to support analytics and data science and need to automate data-related processes to be able to manage more of them

A data preparation solution that offers self-service capabilities and visual guidance and AI-driven recommendations for data transformation can help all stakeholders make the best use of a Snowflake data warehouse for quickly preparing the data and getting it into the right schema for data warehousing.

Focus on the Right Use Cases

Traditional extract, transform, and load (ETL) grew up as a solution for standardizing preparation of data for carefully structured enterprise data warehouses with a strict, set schema. But when it comes to exploring, structuring, blending, and cleaning huge volumes of new, diverse, less-structured data, organizations need new alternatives for accelerating and automating these processes with more flexible and changing downstream schemas. 

Modern cloud data warehouses like Snowflake can operate as a data lake with less strict requirements on structured data and set schema for a data warehouse. Snowflake can deliver value faster if your organization focuses on the right data preparation use cases. 

Most organizations are satisfied with reporting use cases involving structured, unchanging data schemas like transactions data. It doesn’t make sense to start by trying to rebuild what already works. Instead, focus on where your data analysts and data scientists struggle to get beyond traditional reporting, querying, and visualization methods in Snowflake’s data warehouse— for example, using less structured data like IOT, application data, log data, etc. to enrich and enhance data. Focus on use cases involving lots of manual preparation work in desktop tools or code heavy environments. Focus on use cases where business teams rely on IT teams to provision datasets where requirements often change. Focus on free-ranging data exploration initiatives that exceed the capacity of standard SQL or ETL and have changing schemas for data warehousing. Focusing on these types of use cases will help you take advantage of your Snowflake cloud data warehouse

Enable Self-Service While Maintaining Governance

Self-service data preparation is critical. Nontechnical users need solutions that automate steps for exploring, profiling, structuring, cleaning, enriching, and automating manual data preparation work in a Snowflake data warehouse without having to rely on limited IT resources. 

How does your organization find the right balance between empowering users to derive value from data as they see fit while protecting data assets as part of good data governance and security practices?  Here are three ways:

  1. Keep data silos from proliferating as users collect data extracts and run their own preparation routines, often on spreadsheets.
  2. Use shared central catalogs or glossaries to manage data definitions and metadata, and changing schemas for data warehousing 
  3. Track and document data lineage during preparation and transformation processes

Ensure Data Quality at Scale with Continuous Validation

Snowflake data warehouses often contain huge data volumes and a wide spectrum of data types—everything from raw, semi-structured data to structured, transactional data from multiple systems. As such, Snowflake’s data warehouse opens up a broader array of data to extract value from, which requires a more dynamic approach to data quality over more traditional rigid processes. For instance, Snowflake uses independently performing virtual warehouses to process queries.

Your organization can improve overall performance as well as the accuracy, consistency, and completeness of data in a warehouse by using data preparation solutions that combine a visual approach with machine learning to automate data cleaning procedures and provide insights into anomalies and data quality issues. Automation can handle  the scale of Snowflake data warehouses and identify data values that appear to be incorrect, invalid, missing or mismatched. This warehouse also gives organizations the ability to use less structured datasets with changing schemas, which requires a data preparation solution for getting the most out of this data.

As greater and greater volumes of new diverse data is ingested and integrated into your Snowflake data warehouse, your organization needs to continuously validate that  data quality is being maintained and the needed schema for downstream analytics is met. Continuous validation means that users shouldn’t have to wait until the end of a validation process to view and test results—a process that’s integral to agile development methods. Your organization needs a data preparation solution that immediately and automatically detects potential data quality issues in large volumes of data, as well as monitor data flows and alert users when new data is ready for validation.

Automate Preparation of Data for Downstream Analytics and Machine Learning

In your Snowflake warehouse, a vast and growing volume data is retrieved from a huge number of sources, including Internet of Things (IoT) sensors, mobile devices, cameras, customer behavior, applications, and more. As the data generated by digital transformation explodes, so too does the opportunity for outcompeting on differentiated, value rich data. 

Data preparation routines should be scheduled, published, and operationalized and shared to reduce redundancies and ensure broad access to value rich data. Your organization should consider running data preparation natively and automatically within Snowflake’s data warehouse to:

  • Accelerate time to value
  • Reduce operational costs
  • Improve monitoring and governance

Centralizing the scheduling, publishing, operationalizing of data preparation routines results in less redundancy and inconsistency, more portability, and better management and governance. When coupled with integration with data catalogs, centralization increases the potential for reuse across different data consumers who can share knowledge of how data needs to be massaged for front-end tools, machine learning development frameworks, visualizations, and reports. 

Ready to Learn More?

With seamless data preparation across any cloud, hybrid or multi-cloud environment, Trifacta Wrangler is the ideal data preparation solution for your Snowflake Cloud Data Warehouse. Schedule a demo or try Trifacta Wrangler today.