Good Stuff You Can Learn From Bad Data

September 2, 2021

Let’s say you’re remodeling your kitchen. You want to replace the old linoleum with beautiful new hardwood floors. But as you rip up the old flooring, you realize the subfloor next to your kitchen sink is rotted through.  

What do you do? At a minimum, you need to patch the hole. You may need to replace the subfloor altogether. And you should probably call a plumber to diagnose and repair the leaking pipe. But you don’t start nailing down the new hardwood on top without taking care of the damage, right? You don’t want to cover up the bad stuff with the good.

The same goes for the data in your data lake. Unlike data warehouses, which tend to offer only clean data, data lakes store and retain original raw data for various types of analyses. The ability to identify anomalies in source systems and data is enormously valuable. 

“Data quality routines should not whitewash bad data,” writes David Menninger in a 2021 Ventana Research Analyst Perspective: Why Your Data Lake Needs Bad Data. “Identify and resolve the source of data quality problems so your organization can operate with the most accurate data possible.”

While clean data is important, retaining the original raw data—the good, the bad, and the ugly—in your data lake supports a wider range of analyses.

Beyond Traditional Methods to Adaptive Data Quality

This is what the adaptive data quality capabilities in the Trifacta Data Engineering Cloud allow you to do: profile and assess your data, and continuously deliver high-quality data across existing, updated, and new datasets.

The Trifacta Data Engineering Cloud is an open, interactive platform for intelligently profiling, preparing, and pipelining data at any scale. The platform learns from the data itself and from user interactions to automate the most complex and time-consuming parts of data cleaning and transformation. This is achieved through a set of capabilities we call “adaptive data quality.”

Extending beyond traditional data quality rules, adaptive data quality makes it easy to discover data quality issues and validate the data.

Trifacta’s adaptive data quality techniques help you understand how reliable your data is and suggest corrections for anomalies, ensuring the profiled data is clean, accurate, and of high quality.

Statistical data profiles are used to identify complex patterns and automatically suggest possible quality rules such as integrity constraints, formatting patterns, and column dependencies. Trifacta automatically identifies dataset formats, schemas, and attributes, and infers the relationships among them. Based on classifiers for probabilistic data quality rules, the platform suggests transformations to consider and lets users standardize data with support for sophisticated clustering.
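To make the idea concrete, here is a minimal sketch of profile-driven rule suggestion, written in Python with pandas. It is purely illustrative (the column names, thresholds, and rule strings are assumptions, not Trifacta’s implementation): dominant patterns in a column’s profile become candidate rules, and values that don’t conform become the anomalies to review.

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Collect simple statistics that drive candidate quality rules."""
    values = series.dropna().astype(str)
    return {
        "null_fraction": series.isna().mean(),
        "distinct_fraction": values.nunique() / max(len(values), 1),
        "numeric_fraction": values.str.fullmatch(r"-?\d+(\.\d+)?").mean(),
        "iso_date_fraction": values.str.fullmatch(r"\d{4}-\d{2}-\d{2}").mean(),
    }

def suggest_rules(name: str, profile: dict, threshold: float = 0.7) -> list:
    """Turn dominant patterns into candidate rules for a person to accept or refine."""
    rules = []
    if profile["null_fraction"] == 0:
        rules.append(f"{name} IS NOT NULL")                # integrity constraint
    if profile["distinct_fraction"] == 1:
        rules.append(f"{name} IS UNIQUE")                  # integrity constraint
    if profile["numeric_fraction"] >= threshold:
        rules.append(f"{name} MATCHES numeric pattern")    # formatting pattern
    if profile["iso_date_fraction"] >= threshold:
        rules.append(f"{name} MATCHES YYYY-MM-DD")         # formatting pattern
    return rules

df = pd.DataFrame({
    "order_id":  ["1001", "1002", "1003", "1004"],
    "ship_date": ["2021-08-01", "2021-08-03", "2021/08/05", "2021-08-07"],
})
for col in df.columns:
    print(col, suggest_rules(col, profile_column(df[col])))
# order_id  -> NOT NULL, UNIQUE, numeric pattern
# ship_date -> NOT NULL, UNIQUE, YYYY-MM-DD (3 of 4 values match, so the rule is
#              suggested and the slash-delimited value surfaces as an anomaly)
```

Because the date rule is probabilistic, the one nonconforming value is not hidden; it is exactly the “bad data” worth tracing back to its source.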

These continuous data quality checks help ensure that the data consumed by downstream analytics and AI/ML applications can be trusted. Adaptive data quality goes a long way toward improving ongoing data operations, providing self-monitoring and, in some cases, automated remediation of issues that would otherwise disrupt data pipelines.
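As an illustration of what such a gate might look like, the hedged sketch below (again illustrative Python, not Trifacta’s API) validates an incoming batch, publishes rows that pass downstream, and quarantines the rest for remediation instead of silently dropping them.

```python
import pandas as pd

ISO_DATE = r"\d{4}-\d{2}-\d{2}"

def quality_gate(batch: pd.DataFrame):
    """Split a batch into rows that pass the checks and rows routed to quarantine."""
    passes = (
        batch["order_id"].notna()
        & batch["ship_date"].astype(str).str.fullmatch(ISO_DATE)
    )
    return batch[passes], batch[~passes]

batch = pd.DataFrame({
    "order_id":  [1001, 1002, None, 1004],
    "ship_date": ["2021-08-01", "2021/08/03", "2021-08-04", "2021-08-05"],
})
clean, quarantined = quality_gate(batch)
print(f"{len(clean)} rows published downstream, {len(quarantined)} rows quarantined")
# Quarantined rows are retained rather than discarded, so the upstream cause
# (a feed emitting slash-delimited dates, a missing key) can be fixed at the source.
```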

With its adaptive data quality capabilities, the Trifacta Data Engineering Cloud enables you to create an information architecture that, as Ventana Research advocates, “…includes plans to capture and retain the original data — good or bad — …[and] use this information to create data quality scorecards and set goals for improving or maintaining data quality.”
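To illustrate the scorecard idea, here is a minimal, hypothetical sketch: each rule is scored as the fraction of rows that satisfy it, and the scores are compared against a quality goal. The rule names, columns, and 95% target are assumptions made for the example.

```python
import pandas as pd

def scorecard(batch: pd.DataFrame, rules: dict) -> dict:
    """Score each rule as the fraction of rows in the batch that satisfy it."""
    return {name: float(check(batch).mean()) for name, check in rules.items()}

rules = {
    "order_id is not null": lambda df: df["order_id"].notna(),
    "ship_date is YYYY-MM-DD": lambda df: df["ship_date"].astype(str).str.fullmatch(r"\d{4}-\d{2}-\d{2}"),
}

batch = pd.DataFrame({
    "order_id":  [1001, 1002, None, 1004],
    "ship_date": ["2021-08-01", "2021/08/03", "2021-08-04", "2021-08-05"],
})

goal = 0.95  # quality target set for this dataset
for rule, score in scorecard(batch, rules).items():
    status = "meets goal" if score >= goal else "below goal"
    print(f"{rule}: {score:.0%} ({status})")
```

Tracking these scores run over run is what turns retained bad data into a measurable quality goal rather than a hidden liability.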

Want to get high-quality, usable data by profiling all your data? Get started with Trifacta today.