Don't miss Inspire 2024, taking place May 13 - 16, 2024 at the Venetian, Las Vegas. Register Now.

 

4 Key Steps for a Data Sanity Check

Strategy   |   Alteryx   |   Nov 17, 2021

As a Customer Success Manager at Alteryx (formerly Trifacta), I spend most of my time helping our customers wrangle their raw, big data into business insights. On these data wrangling projects, it’s tempting to jump straight into the most interesting problems but to produce the most accurate results, we should start by performing a set of basic data quality validations. We call these initial data quality validations “sanity checks.”

Running sanity checks is an important (and often neglected) part of the data wrangling process. Your final analysis is only as accurate as your data, which means it’s more than worth it to take a few minutes to validate your data’s accuracy and completeness. A quick sanity test goes a long way in accurate analysis. 

At Alteryx, we believe so strongly in sanity checking that we’ve incorporated robust profiling into our own products. Read on to see how I use Designer Cloud to perform four key sanity checks during customer projects.

1. Take a random sample of the data.

Why this sanity test is useful

Often, the datasets that I’m working with are too large to easily assess as a complete source. Looking only at a consecutive sample of that data—for example, the first 1000 rows—can give me an incomplete understanding of the contents of a dataset. By generating a random sample over the entirety of the dataset, I get a more accurate picture of the full dataset.

Alteryx Designer Cloud allows you to swap between a sample of the first rows of a dataset and a random sample from the complete dataset.

Real-life sanity checking

Last week, I was showing a customer how to wrangle a set of machine-generated logs containing product usage information. Since each product sent back log data on a regular basis, the data volume was absolutely huge—there were thousands of individual log files, each with a size of around 2GB. This meant that we wouldn’t be able to easily examine the complete dataset at once.

After loading the data into Designer Cloud, we decided to compare the first rows sample with the random sample to validate that the structure of the logs remained consistent throughout the dataset. The following screenshots show what we found (note: the screenshots do not include actual customer data):

Data in the first rows sample. Note the number of categories shown in the size1 column.

Data in the random sample. Note the number of categories shown in the size1 column.

You can see that the number of categories, or unique strings, increased when we switched to a random sample. We dug a bit deeper into the differences in the size1 column and noticed the following:

 

The size1 column sometimes contained an array of additional values! To make sure that the data was structured correctly, we needed to extract this array into a new column. Examining a random sample of our data during a sanity test allowed us to surface this data quality issue and then take steps to address it.

2. Check for datatype mismatches, variations in how values are entered, and missing values.

Why this sanity test is useful

Effective downstream analysis requires consistency. You can’t easily understand the relationships between events if some of those events have dates formatted as yyyy/MM/dd and some events have dates formatted as dd/MM/yyyy. Similarly, some systems may store prices in strings ($1,000) while some systems store prices in decimals (1000.00).

How it works in Designer Cloud 

Designer Cloud automatically profile your data and provide information about the number of valid, missing, and mismatched values in each column. The results of this profiling is shown in the data quality bar above each column in the dataset:

3. Look for duplicate records and outliers.

Why this sanity test is useful

Both duplicate records and outliers can distort my analysis, so I need to assess the overall quality of the dataset and determine if it contains any duplicate records or outliers.

How it works in Designer Cloud 

When you’re working with your data in Designer Cloud, you can switch into the column details view, which displays high-level summary statistics about the contents of each column in your dataset. I’ve taken a screenshot of the column details view for a column in one of our demo datasets:

 

This column contains the average satisfaction scores for customer service agents. In the column details view, we can quickly see summary statistics about the data, including the number of unique values in the column and any outliers. Our demo dataset contains some outlier values—it looks like there are two agents with particularly low average satisfaction scores and one agent with a very high score. Now that I’ve identified the outliers, I can make a decision about whether or not to exclude those values from my analysis.

Real-life sanity checking

One of our customers has been constructing an inventory forecast dashboard that combines data from multiple retailers. Making sure the data used in the dashboard is crucial, especially since any inaccurate or duplicated data in the final dashboard could have a direct impact on our customer’s bottom line. We performed sanity tests at the beginning of the data wrangling process and discovered that a number of duplicate records had crept into the source data. We were able to resolve this issue and create a highly accurate inventory forecast for upper management to use when collaborating on promotional campaigns with their retail clients.

4. Assess the data distribution for each column.

Why this sanity test is useful

Sometimes the data in a column may look good on the surface—there are no obvious datatype mismatches, duplicate records, outliers, or null values—but if you examine the distribution of data in the column, you notice gaps or a values distribution that doesn’t make logical sense. An odd data distribution may indicate a larger data quality issue that I need to investigate: maybe some data points were not recorded, or the data was not encoded correctly.

How it works in Designer Cloud  

The column details view (discussed earlier) shows you a detailed histogram of value distributions, as well as summary statistics like the standard deviation, median, and average of the values in each column. If you want to quickly check the data distribution without opening the column details view, you can also look at the small histograms shown above your data. You can see examples of these histograms in the screenshot below:

 

Real-life sanity checking

Recently, I was working on a project that used raw interaction data to improve the experience on one of our customer’s websites. We were looking at a set of website visit logs. At first glance, the date/time stamps for each website visit looked valid. However, when we ran a basic sanity check on that data and assessed the distribution of values, we noticed that most of the website visits had occurred between midnight and 3am. Since it seemed unlikely that a majority of the website visitors took care of their online tasks in the middle of the night, there was clearly a problem with the data!

We investigated the process that was providing us with the weblogs, and discovered that the upstream system was truncating any trailing zeroes that appeared in the timestamps. So a timestamp that was really ‘175400’ (17:54.00) was appearing as ‘1754’ (00:17.54). Once we adjusted the data to account for the trailing zero truncation, the data distribution looked much more reasonable.

The bottom line:

It might be easy to overlook sanity checks, but doing so can have a real impact on your data. Without taking care to understand the complete contents of your data, checking for mismatched or duplicate data, and assessing the distribution across each column, you’ll end up with a skewed analysis—which might be more difficult to untangle than running a sanity test in the first place.

Our support portal has articles, videos and a forum for you to combat your data wrangling challenges, such as how to view mismatched data. To get started with sanity tests, sign up for Designer Cloud today!