February ‘19 Wrangler Release – Enhanced Standardization

February 7, 2019

Trifacta’s February ‘19 Wrangler release previews new data quality features that will be available without limitation in the upcoming Enterprise release in March.

New Feature Highlight:

  • Enhanced Standardization

Enhanced Standardization

Standardizing values means mapping similar values to a single, consistent format. Inconsistent values are especially prevalent in manually entered data. These data quality issues affect reporting and machine learning, because aggregations, statistics, and classifications can all be skewed by miscategorized data.

In this February release of Wrangler, we’re giving users a limited preview of the enhanced Standardization capabilities that we will soon introduce more broadly in our Wrangler Pro and Enterprise editions to help with these data quality problems. With enhanced Standardization, Trifacta gives users access to multiple algorithms for grouping values and drastically improving data quality.

The enhanced Standardization menu presents two options: by string similarity and by pronunciation. Standardization using String Similarity compares each string against all of the other values and clusters them using either the fingerprint or the n-gram fingerprint algorithm. The following example shows how the clusters are presented to users and how they can standardize each cluster to a single, consistent format:
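For readers curious about what fingerprint clustering does, here is a minimal Python sketch of the general key-collision technique. It is an illustration only and not Trifacta's implementation: the function names (`fingerprint`, `ngram_fingerprint`, `cluster`), the normalization steps, and the default n-gram size of 2 are all assumptions made for this example.

```python
import string
import unicodedata
from collections import defaultdict


def _to_ascii_lower(value: str) -> str:
    # Decompose accented characters, drop anything non-ASCII, and lowercase.
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def fingerprint(value: str) -> str:
    """Fingerprint key: strip punctuation, then dedupe and sort the tokens."""
    text = _to_ascii_lower(value)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """N-gram fingerprint key: dedupe and sort character n-grams."""
    chars = "".join(ch for ch in _to_ascii_lower(value) if ch.isalnum())
    grams = {chars[i:i + n] for i in range(len(chars) - n + 1)}
    return "".join(sorted(grams))


def cluster(values, key_func):
    """Group values that share a key; keep only groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[key_func(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    names = ["Acme Inc.", "acme inc", "Inc. ACME", "Globex"]
    print(cluster(names, fingerprint))
    # [['Acme Inc.', 'acme inc', 'Inc. ACME']]

    cities = ["New York", "NewYork", "new-york", "Boston"]
    print(cluster(cities, ngram_fingerprint))
    # [['New York', 'NewYork', 'new-york']]
```

The token-based fingerprint catches differences in case, punctuation, and word order, while the n-gram variant ignores whitespace entirely, so it also catches values that differ only in spacing.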

 

Standardization using Pronunciation uses a double metaphone algorithm to compare values across languages by how they are pronounced. Which clustering algorithm to use depends on the scenario, and Trifacta’s enhanced Standardization feature gives you the flexibility to choose based on the context you have.

Tip: You can mix and match algorithms. Some values may be standardized using spelling, while others are more sensibly standardized based on international pronunciation standards. The example below shows mixing and matching, with some values still highlighted from the string similarity example:
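To give a feel for clustering by pronunciation, the sketch below groups values by a phonetic key. Double metaphone is too long to reproduce here, so this example swaps in American Soundex, a simpler and older phonetic algorithm, as a stand-in. It is not the algorithm the feature uses, and the names `soundex` and `cluster_by_sound` are hypothetical helpers invented for this illustration.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, h, w, and y have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(value: str) -> str:
    """American Soundex key (a simplified stand-in for double metaphone)."""
    letters = [ch for ch in value.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first.upper() + "".join(digits) + "000")[:4]


def cluster_by_sound(values):
    """Group values that share a phonetic key; keep groups with 2+ members."""
    groups = defaultdict(list)
    for value in values:
        groups[soundex(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(cluster_by_sound(["Smith", "Smyth", "Schmidt", "Jones"]))
    # [['Smith', 'Smyth', 'Schmidt']]
```

A phonetic key like this groups spellings that a fingerprint would keep apart ("Smith" and "Schmidt" share no sorted-token key), which is why mixing the two approaches can be useful.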

We are excited to hear your feedback on this new feature. To try out enhanced Standardization and all of the other great features available in our free Wrangler edition, sign up here.