
4 Key Steps to Sanity Checking Your Data

November 17, 2015

As a Customer Success Manager at Trifacta, I spend most of my time helping our customers wrangle their raw, big data into business insights. On these data wrangling projects, it’s tempting to jump straight into the most interesting problems, but to produce the most accurate results, we should start by performing a set of basic data quality validations. We call these initial data quality validations “sanity checks.”

Running sanity checks is an important (and often neglected) part of the data wrangling process. Your final analysis is only as accurate as your data, which means it’s more than worth it to take a few minutes to validate your data’s accuracy and completeness.

At Trifacta, we believe so strongly in sanity checking that we’ve incorporated robust profiling into our own products. Read on to see how I use Trifacta to perform four key sanity checks during customer projects.

1. Take a random sample of the data.

Why it’s useful

Often, the datasets that I’m working with are too large to easily assess as a complete source. Looking only at a consecutive sample of that data—for example, the first 1000 rows—can give me an incomplete understanding of the contents of a dataset. By generating a random sample over the entirety of the dataset, I get a more accurate picture of the full dataset.

How it works in Trifacta

Trifacta Wrangler Enterprise allows you to swap between a sample of the first rows of a dataset and a random sample from the complete dataset.
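Outside of Trifacta, the same idea can be sketched in plain Python: a reservoir sample draws a uniformly random sample from a stream of unknown length, so even huge log files never need to fit in memory. This is only an illustrative sketch (the function and the toy `size=` log lines are invented for the example), not how Trifacta samples internally.

```python
import random

def reservoir_sample(lines, k, seed=42):
    """Uniformly sample k items from a stream of unknown length
    (Vitter's Algorithm R); the stream is read once and only k
    items are ever held in memory."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            # Replace an existing element with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample

# A first-rows sample only ever sees "small"; a random sample
# has a chance to surface the "large" records further down the file.
stream = ["size=small"] * 1000 + ["size=large"] * 1000
first_rows = stream[:10]
random_rows = reservoir_sample(stream, 10)
```

The point of the sketch is the same as the product behavior described above: the first-rows sample is systematically biased toward whatever happens to be at the top of the file, while the random sample reflects the whole dataset.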

Real-life sanity checking

Last week, I was showing a customer how to wrangle a set of machine-generated logs containing product usage information. Since each product sent back log data on a regular basis, the data volume was absolutely huge—there were thousands of individual log files, each with a size of around 2GB. This meant that we wouldn’t be able to easily examine the complete dataset at once.

After loading the data into Trifacta, we decided to compare the first rows sample with the random sample to validate that the structure of the logs remained consistent throughout the dataset. The following screenshots show what we found (note: the screenshots do not include actual customer data):

[Screenshot: first rows sample]

Data in the first rows sample. Note the number of categories shown in the size1 column.

[Screenshot: random sample]

Data in the random sample. Note the number of categories shown in the size1 column.

You can see that the number of categories, or unique strings, increased when we switched to a random sample. We dug a bit deeper into the differences in the size1 column, and noticed the following:

[Screenshot: expanded column details]

The size1 column sometimes contained an array of additional values! To make sure that the data was structured correctly, we needed to extract this array into a new column. Examining a random sample of our data allowed us to surface this data quality issue and then take steps to address it.

2. Check for datatype mismatches, variations in how values are entered, and missing values.

Why it’s useful

Effective downstream analysis requires consistency. You can’t easily understand the relationships between events if some of those events have dates formatted as yyyy/MM/dd and others have dates formatted as dd/MM/yyyy. Similarly, some systems may store prices as strings ($1,000) while others store them as decimals (1000.00).

How it works in Trifacta

Trifacta Wrangler and Wrangler Enterprise automatically profile your data and provide information about the number of valid, missing, and mismatched values in each column. The results of this profiling are shown in the data quality bar above each column in the dataset:

[Screenshot: data quality bar]

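The valid/missing/mismatched counts behind a data quality bar are easy to reproduce in plain Python for a single column. Here is a minimal sketch for an expected yyyy/MM/dd date format (illustrative only; the function name and sample values are invented, and Trifacta's type inference is much richer than this):

```python
from datetime import datetime

def profile_dates(values, fmt="%Y/%m/%d"):
    """Count valid, mismatched, and missing values in a column,
    mimicking a data quality bar for an expected date format."""
    counts = {"valid": 0, "mismatched": 0, "missing": 0}
    for v in values:
        if v is None or v == "":
            counts["missing"] += 1
            continue
        try:
            datetime.strptime(v, fmt)
            counts["valid"] += 1
        except ValueError:
            counts["mismatched"] += 1
    return counts

dates = ["2015/11/17", "17/11/2015", "", "2015/11/18", None]
print(profile_dates(dates))  # {'valid': 2, 'mismatched': 1, 'missing': 2}
```

Note how the dd/MM/yyyy value from the example above shows up as a mismatch rather than silently parsing into the wrong date.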
3. Look for duplicate records and outliers.

Why it’s useful

Both duplicate records and outliers can distort my analysis, so I need to assess the overall quality of the dataset and determine if it contains any duplicate records or outliers.

How it works in Trifacta

When you’re working with your data in Trifacta Wrangler or Wrangler Enterprise, you can switch into the column details view, which displays high-level summary statistics about the contents of each column in your dataset. I’ve taken a screenshot of the column details view for a column in one of our demo datasets:

[Screenshot: column details view]

This column contains the average satisfaction scores for customer service agents. In the column details view, we can quickly see summary statistics about the data, including the number of unique values in the column and any outliers. Our demo dataset contains some outlier values—it looks like there are two agents with particularly low average satisfaction scores, and one agent with a very high score. Now that I’ve identified the outliers, I can make a decision about whether or not to exclude those values from my analysis.
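As a rough stand-in for the column details view, both checks can be sketched in a few lines of Python. The satisfaction scores and the z-score cutoff of 1.5 below are made-up illustrations chosen so the toy data mirrors the screenshot (two low outliers, one high); this is not Trifacta's actual outlier logic.

```python
from collections import Counter
from statistics import mean, stdev

def find_duplicates(records):
    """Return the records that appear more than once."""
    counts = Counter(records)
    return [r for r, n in counts.items() if n > 1]

def find_outliers(values, z=1.5):
    """Flag values more than z standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > z * sigma]

records = [("agent_a", 1), ("agent_b", 2), ("agent_a", 1)]
scores = [4.1, 4.3, 4.2, 1.2, 4.4, 4.0, 1.5, 4.2, 4.3, 6.8]

print(find_duplicates(records))  # [('agent_a', 1)]
print(find_outliers(scores))     # [1.2, 1.5, 6.8]
```

Once the duplicates and outliers are surfaced, the analyst still has to decide what to do with them; the check only makes the decision visible.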

Real-life sanity checking

One of our customers has been constructing an inventory forecast dashboard that combines data from multiple retailers. Making sure the data used in the dashboard is accurate is crucial, especially since any inaccurate or duplicated data in the final dashboard could have a direct impact on our customer’s bottom line. We performed sanity checks at the beginning of the data wrangling process and discovered that a number of duplicate records had crept into the source data. We were able to resolve this issue and create a highly accurate inventory forecast for upper management to use when collaborating on promotional campaigns with their retail clients.

4. Assess the data distribution for each column.

Why it’s useful

Sometimes the data in a column may look good on the surface—there are no obvious datatype mismatches, duplicate records, outliers, or null values—but when you examine the distribution of data in the column, you notice gaps or a distribution of values that doesn’t make logical sense. An odd data distribution may indicate a larger data quality issue that I need to investigate: maybe some data points were not recorded, or the data was not encoded correctly.

How it works in Trifacta

The column details view (discussed earlier) shows you a detailed histogram of value distributions, as well as summary statistics like the standard deviation, median, and average of the values in each column. If you want to quickly check the data distribution without opening the column details view, you can also look at the small histograms shown above your data. You can see examples of these histograms in the screenshot below:

[Screenshot: column histograms]
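A quick way to approximate these histograms outside the product is to bucket the column yourself. In this toy example (made-up prices, an arbitrary bucket width of 10), a gap in the distribution jumps straight out of the counts:

```python
from collections import Counter

def histogram(values, bucket_size):
    """Bucket numeric values and return counts per bucket, a cheap
    stand-in for a column's distribution histogram."""
    counts = Counter((v // bucket_size) * bucket_size for v in values)
    return dict(sorted(counts.items()))

prices = [12, 14, 18, 22, 25, 27, 29, 75]
print(histogram(prices, 10))  # {10: 3, 20: 4, 70: 1}
```

The empty stretch between the 20 bucket and the 70 bucket is exactly the kind of gap that warrants a closer look at the upstream data.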

Real-life sanity checking

Recently, I was working on a project that used raw interaction data to improve the experience on one of our customer’s websites. We were looking at a set of website visit logs. At first glance, the date/time stamps for each website visit looked valid. However, when we ran a basic sanity check on that data and assessed the distribution of values, we noticed that most of the website visits had occurred between midnight and 3am. Since it seemed unlikely that a majority of the website visitors took care of their online tasks in the middle of the night, there was clearly a problem with the data!

We investigated the process that was providing us with the weblogs, and discovered that the upstream system was truncating any trailing zeroes that appeared in the timestamps. So a timestamp that was really ‘175400’ (17:54:00) was appearing as ‘1754’ (00:17:54). Once we adjusted the data to account for the trailing zero truncation, the data distribution looked much more reasonable.
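The repair itself was simple once the cause was known: re-pad each timestamp on the right before parsing it. A minimal sketch of that fix in Python (the function name and the fixed six-digit HHMMSS format are assumptions based on the example above):

```python
def restore_timestamp(raw, width=6):
    """Undo upstream truncation of trailing zeroes by right-padding
    back to a fixed-width HHMMSS string, then format as HH:MM:SS."""
    padded = str(raw).ljust(width, "0")   # '1754' -> '175400'
    return f"{padded[0:2]}:{padded[2:4]}:{padded[4:6]}"

print(restore_timestamp("1754"))    # 17:54:00 (was being read as 00:17:54)
print(restore_timestamp("175432"))  # 17:54:32 (untruncated values pass through)
```

With the padding restored, the midnight-to-3am spike in the distribution disappears and the visits land back in business hours where they belong.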

The bottom line:

It might be easy to overlook sanity checks, but doing so can have a real impact on your data. Without taking care to understand the complete contents of your data, checking for mismatched or duplicate data, and assessing the distribution across each column, you’ll end up with a skewed analysis—one that can be far harder to untangle than a sanity check would have been to run in the first place.

Check out this blog to meet the rest of our Customer Success team. Our support portal has articles, videos, and a forum to help you tackle your data wrangling challenges, such as how to view mismatched data. Sign up for Trifacta Wrangler to start wrangling today!
