
Data Preparation for the Lakehouse

May 25, 2021

The Lakehouse represents a new way of implementing a data architecture, combining the best of the data warehouse and the data lake: the high performance and ease of use of a traditional data warehouse with the flexibility and low cost of a data lake.

However, an organization seeking to get value from a Lakehouse still faces a significant obstacle. Organizations are made up of people, and people must have the requisite skills to transform data as it flows through the various layers of the Lakehouse architecture. In the past, this ability was confined to a small set of power users who possess the coding and SQL expertise needed to unlock the business value in the data.

This is where data preparation comes in. Data preparation (also known as data wrangling) provides a visual, intuitive interface to perform complex data engineering tasks. It democratizes access to data-driven insights by lowering the barrier to entry, and can be used by knowledge workers, data analysts, data engineers and data scientists alike.

The rest of this article covers how Trifacta and Databricks help you realize the benefits of the Lakehouse architecture and distribute these benefits to a wide audience.

Lakehouse Multi-Hop Architecture

A Lakehouse typically implements a multi-hop architecture, where data flows between tables of differing structure and data quality.

Figure 1: Delta Lake architecture (Image source: Databricks blog).

The data in its raw form is loaded into ingestion tables (bronze). Then, it is progressively structured and cleaned into refined tables (silver) and further into a feature/aggregated data store (gold). Each hop requires data transformation to impose additional structure on the data and improve data quality. The ultimate goal is to make the data fit for downstream consumption. By staging the data in this way, you build a modular architecture with minimal duplication of effort, as there may be multiple consumers at each stage.
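The hops above can be sketched conceptually as a chain of transformations, one per stage. Here is a minimal illustration in plain Python, using lists of dicts to stand in for Delta tables; the field names (`raw_amount`, `region`) and cleaning rules are hypothetical:

```python
# Conceptual sketch of the bronze -> silver -> gold hops. Plain Python
# lists of dicts stand in for Delta tables; field names are hypothetical.

def to_silver(bronze_rows):
    """Structure and clean raw records: parse amounts, drop bad rows."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["raw_amount"])
        except (KeyError, ValueError):
            continue  # drop records that fail basic quality checks
        silver.append({"region": row.get("region", "unknown").strip().lower(),
                       "amount": amount})
    return silver

def to_gold(silver_rows):
    """Aggregate refined records into a per-region summary."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

bronze = [{"raw_amount": "10.5", "region": " EMEA "},
          {"raw_amount": "oops", "region": "EMEA"},   # malformed, dropped
          {"raw_amount": "4.5", "region": "AMER"}]
print(to_gold(to_silver(bronze)))  # {'emea': 10.5, 'amer': 4.5}
```

In a real Lakehouse, each function would read from and write to a Delta table, but the shape of the pipeline is the same: impose structure, improve quality, then aggregate.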

Data Wrangling Steps

For each hop from bronze to silver, or silver to gold, you go through an iterative process to build a pipeline, involving the following steps.

Figure 2: Data wrangling steps

You start by discovering what is in your data and how it relates to your objectives. Because data comes in all shapes and sizes, it must be structured to be fit for purpose. Cleaning corrects or removes information that could distort your analysis. Enriching brings in additional data that may be useful to your analysis. Validating confirms that all the structuring, cleaning, and enrichment steps have been completed correctly and that the resulting data meets the required level of quality. Finally, publishing delivers the output of data wrangling to downstream consumers.
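The steps above can be sketched as a simple pipeline skeleton. Each function below is a hypothetical placeholder for what would be a recipe step in practice:

```python
# Skeleton of the wrangling steps as a pipeline of functions.
# The field names and rules are illustrative placeholders.

def structure(data):  # impose a consistent schema
    return [{"name": d.get("name", ""), "score": d.get("score")} for d in data]

def clean(data):      # remove records that would distort analysis
    return [d for d in data if d["score"] is not None]

def enrich(data):     # join in extra context (here, a derived grade)
    return [dict(d, grade="pass" if d["score"] >= 50 else "fail") for d in data]

def validate(data):   # confirm quality rules before publishing
    assert all(0 <= d["score"] <= 100 for d in data), "score out of range"
    return data

def publish(data):    # deliver the output downstream (here, just return it)
    return data

raw = [{"name": "a", "score": 80}, {"name": "b"}, {"name": "c", "score": 40}]
result = publish(validate(enrich(clean(structure(raw)))))
# result: [{'name': 'a', 'score': 80, 'grade': 'pass'},
#          {'name': 'c', 'score': 40, 'grade': 'fail'}]
```

The point of the sketch is the ordering: validation sits between enrichment and publishing, so bad data is caught before it reaches downstream consumers.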

Here is how the Lakehouse architecture and data wrangling fit together:

Figure 3: Data wrangling for the Lakehouse

Input → Process → Output

So far, we have looked at the high-level concepts behind the Lakehouse and data wrangling. But how does a user put them into practice? In this section, we look at an example of how a user can define the input, process, and output for a data pipeline. All of this can be done visually, without writing any code.

Figure 4: Data pipeline

Input starts with a connection to the data source. In this case, the source is a raw table in the Ingestion stage (bronze). You begin by browsing the connection to find the dataset you are interested in. Trifacta lets you easily browse and preview Databricks tables, including Delta tables, and discover the ones you want.

Figure 5: Import data

Once a dataset has been added, you create a recipe that specifies how the data should be transformed. Transformations can include restructuring data, as shown in the Pivot step (step 2 in the screenshot below), as well as cleansing and data enrichment. As you build the recipe, Trifacta guides you through suggestions for transformations, and provides previews at each step. This allows even a less technical user to rapidly iterate through the recipe and measure their progress as they go.
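To make the kind of restructuring a Pivot step performs concrete, here is a minimal pivot in plain Python. The column names (`region`, `quarter`, `sales`) are hypothetical; Trifacta performs this visually rather than in code:

```python
# Minimal illustration of a pivot: long-format rows (region, quarter,
# sales) become one row per region with a column per quarter.

def pivot(rows, index, columns, values):
    out = {}
    for row in rows:
        out.setdefault(row[index], {})[row[columns]] = row[values]
    return [{index: key, **cols} for key, cols in out.items()]

long_rows = [
    {"region": "EMEA", "quarter": "Q1", "sales": 100},
    {"region": "EMEA", "quarter": "Q2", "sales": 120},
    {"region": "AMER", "quarter": "Q1", "sales": 90},
]
print(pivot(long_rows, "region", "quarter", "sales"))
# [{'region': 'EMEA', 'Q1': 100, 'Q2': 120}, {'region': 'AMER', 'Q1': 90}]
```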

Figure 6: Build recipe

Along with the recipe, the user can also define data quality rules that declare what clean data should look like. These rules guide the transformation process, highlighting anomalies and offering suggestions on how to fix them. Trifacta uses machine learning algorithms to detect patterns and outliers, which greatly enhances the user’s productivity.
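Conceptually, a data quality rule is a named predicate that each row either passes or fails. A minimal sketch of that idea, with hypothetical rules and fields:

```python
# Sketch of declarative data quality rules: each rule is a name plus a
# predicate, and evaluation reports the fraction of rows that pass.
# The rules and field names are hypothetical examples.

rules = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "region_present": lambda r: bool(r.get("region")),
}

def evaluate_rules(rows, rules):
    return {name: sum(pred(r) for r in rows) / len(rows)
            for name, pred in rules.items()}

rows = [{"amount": 10, "region": "EMEA"},
        {"amount": -5, "region": "EMEA"},
        {"amount": 7, "region": ""}]
print(evaluate_rules(rows, rules))  # each rule passes on 2 of 3 rows
```

Trifacta layers machine-learning-driven suggestions on top of checks like these; the sketch only shows the basic pass/fail bookkeeping.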

When the user is happy with the recipe’s output, they can define a publishing action to create an output from it. The output can be written to Databricks tables, including Delta tables, and is stored in the Refined table (silver) or Aggregated data store (gold) stage. The user can choose between creating a new table, appending to an existing table, truncating and writing, or dropping and writing. Optionally, the user can write to an external table.
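Each of the four publishing actions corresponds to a different write semantics. The mapping below is an illustrative sketch of the kind of SQL each action implies, not the SQL Trifacta actually generates:

```python
# Illustrative mapping of the four publishing actions to the kind of
# SQL each corresponds to. Table and statement names are hypothetical.

def publish_sql(action, table, select_stmt):
    if action == "create":
        return f"CREATE TABLE {table} AS {select_stmt}"
    if action == "append":
        return f"INSERT INTO {table} {select_stmt}"
    if action == "truncate_and_write":
        return f"TRUNCATE TABLE {table}; INSERT INTO {table} {select_stmt}"
    if action == "drop_and_write":
        return f"DROP TABLE IF EXISTS {table}; CREATE TABLE {table} AS {select_stmt}"
    raise ValueError(f"unknown action: {action}")

print(publish_sql("append", "silver.orders", "SELECT * FROM staging_orders"))
# INSERT INTO silver.orders SELECT * FROM staging_orders
```

The practical difference: appending preserves history, truncate-and-write keeps the schema but replaces the data, and drop-and-write replaces both.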

 

Figure 7: Publish outputs

The user can choose Databricks as the execution environment and run a job to produce outputs. This causes Trifacta to push recipe execution to the Databricks Spark cluster. It is possible to fine-tune the cluster’s settings to balance performance against cost (e.g., by adjusting memory, the number of cores, etc.). It is also possible to choose between a shared cluster and user- or job-specific clusters for maximum flexibility.
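As a sketch of what performance-versus-cost tuning might look like, here are two hypothetical cluster profiles. The property names are standard Spark configuration keys; the values are illustrative only, not recommendations:

```python
# Hypothetical Spark cluster profiles trading cost against performance.
# The keys are standard Spark configuration properties; values are
# illustrative only.

cluster_profiles = {
    "cost_optimized": {
        "spark.executor.memory": "4g",
        "spark.executor.cores": "2",
        "spark.dynamicAllocation.enabled": "true",
    },
    "performance": {
        "spark.executor.memory": "16g",
        "spark.executor.cores": "8",
        "spark.dynamicAllocation.enabled": "false",
    },
}

def pick_profile(prioritize_cost):
    """Select a profile based on whether cost or speed matters more."""
    return cluster_profiles["cost_optimized" if prioritize_cost else "performance"]
```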

Managing the Pipeline

While one-off job executions are useful, managing a data pipeline often involves running jobs on a schedule and monitoring their execution. Trifacta allows less technical users to manage complex pipelines involving multiple steps. Jobs can be parameterized at run time, and can optionally fire webhook calls to external applications upon success or failure.
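A webhook call is just an HTTP POST with a structured payload describing the job run. The payload fields below are hypothetical, shown only to illustrate the shape of such an integration; a real setup would use Trifacta’s webhook configuration:

```python
# Sketch of a webhook payload a pipeline job might POST to an external
# application on completion. Field names are hypothetical.
import json

def build_webhook_payload(job_id, status, runtime_seconds):
    return json.dumps({
        "job_id": job_id,
        "status": status,            # "success" or "failure"
        "runtime_seconds": runtime_seconds,
    })

payload = build_webhook_payload(42, "success", 137)
# A real integration would then POST `payload` to the configured URL,
# e.g. with requests.post(webhook_url, data=payload).
```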

Each job generates a detailed profile of the results, which serves as an audit trail. You can also monitor data quality by reviewing the results of data quality rules.

Figure 8: Results of data quality rules

If errors are encountered during data processing, an alert can be sent to the user, so that they can take corrective action.

Conclusion

The Lakehouse architecture offers an elegant solution for processing big data. By combining the power of the Lakehouse architecture with self-service data preparation, you can greatly expand the business impact of your technology investment by democratizing the work of data engineering.

Further Reading

  1. Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics [http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf]
  2. Trifacta Partners with Databricks to Deliver Faster ROI on Data Lakehouses [https://www.trifacta.com/blog/trifacta-partners-with-databricks-roi-data-lakehouses/]
  3. Productionizing Machine Learning with Delta Lake [https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html]
  4. It Takes a Village to Raise a Cloud Analytics Platform [https://www.trifacta.com/blog/village-raise-cloud-analytics-platform/]
  5. Create Databricks Tables Connections [https://docs.trifacta.com/display/r076/Create+Databricks+Tables+Connections]
  6. Trifacta’s Partner Databricks Announces GA Launch on Google Cloud [https://www.trifacta.com/blog/databricks-announces-launch-on-google-cloud/]