February ‘19 Wrangler Release – Enhanced Standardization

February 7, 2019

Trifacta’s February ‘19 Wrangler release offers a preview of new data quality features that will be available without limitation in the upcoming Enterprise release in March.

New Feature Highlight:

  • Enhanced Standardization

Enhanced Standardization

Standardizing values is a way of grouping similar values into a single, consistent format. Inconsistent values are especially prevalent in manually entered data. These data quality issues affect reporting and machine learning, because aggregations, statistics, and classifications can all be skewed by miscategorized data.

In this February release of Wrangler, we’re giving users a limited preview of the enhanced Standardization capabilities that we will soon introduce more broadly in our Wrangler Pro and Enterprise editions to help with these data quality problems. With enhanced Standardization, Trifacta gives users access to multiple algorithms for grouping values and improving data quality.

The enhanced Standardization menu presents two options: grouping by string similarity and grouping by pronunciation. Standardization using String Similarity compares strings against a combination of all values and clusters them using either the fingerprint or the n-gram fingerprint algorithm. The accompanying example shows how the clusters are presented to users and how they can standardize them to a single, consistent format.
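Outside the product UI, the idea behind fingerprint clustering can be sketched in a few lines of Python. This is a minimal illustration of the general key-collision approach (normalize each value into a key, then group values whose keys match), not Trifacta’s implementation. The function names `fingerprint`, `ngram_fingerprint`, and `cluster`, the normalization steps, and the default n-gram size of 2 are all assumptions made for the example.

```python
import re
import unicodedata
from collections import defaultdict


def _normalize(value: str) -> str:
    """Trim, lowercase, strip accents, and drop punctuation (assumed steps)."""
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^\w\s]", "", ascii_only)


def fingerprint(value: str) -> str:
    """Key built from the sorted, de-duplicated word tokens."""
    tokens = _normalize(value).split()
    return " ".join(sorted(set(tokens)))


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key built from the sorted, de-duplicated character n-grams."""
    chars = re.sub(r"\s+", "", _normalize(value))
    grams = {chars[i:i + n] for i in range(len(chars) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, key_func):
    """Group values sharing a key; keep groups with 2+ distinct members."""
    groups = defaultdict(list)
    for v in values:
        groups[key_func(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


cities = ["New York", "new york.", "York, New", "Boston"]
print(cluster(cities, fingerprint))
# [['New York', 'new york.', 'York, New']]
```

In this sketch the word-level fingerprint ignores case, punctuation, and word order, while the n-gram variant also ignores spacing, so "New York" and "NewYork" share an n-gram key but not a word-level one. A looser key merges more values, at the cost of more false matches.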

 

Standardization using Pronunciation leverages the Double Metaphone algorithm to compare values across languages by pronunciation. Which clustering algorithm to use depends on the scenario, and Trifacta’s enhanced Standardization feature gives you the flexibility to choose based on the context you have.

Tip: You can mix and match algorithms. Some values may be standardized using spelling, while others are more sensibly standardized based on international pronunciation standards. The accompanying example shows mixing and matching: some values are still highlighted from the string similarity example.
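To show what grouping by pronunciation means in practice, here is a minimal Python sketch of a phonetic key. Double Metaphone is too long to reproduce here, so this example swaps in the much simpler American Soundex code as a stand-in. It is not Double Metaphone and not Trifacta’s implementation, and Soundex is English-oriented rather than cross-language. The function name `soundex` and the sample names are assumptions made for the illustration.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, y, h, and w have no digit.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(value: str) -> str:
    """American Soundex: first letter plus three digits (stand-in key)."""
    letters = [c for c in value.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


# Group values whose phonetic keys collide.
groups = defaultdict(list)
for name in ["Smith", "Smyth", "Schmidt", "Robert", "Rupert"]:
    groups[soundex(name)].append(name)

print(groups["S530"])  # ['Smith', 'Smyth', 'Schmidt']
print(groups["R163"])  # ['Robert', 'Rupert']
```

Values that are spelled differently but sound alike collapse to the same key, which is why a pronunciation-based pass can catch variants that a spelling-based fingerprint misses.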

We are excited to hear your feedback on this new feature. To try out enhanced Standardization and all of the other great features available in our free Wrangler edition, sign up here.

For the full February ‘19 release notes, as well as notes from past months, click here.