The 6 Steps of Wrangling Insurance Data

December 14, 2015

In this post, Cindy Maike, General Manager for Insurance Solutions at Hortonworks, and Paige Schaefer, Product Marketing Associate at Trifacta, team up to discuss Big Data wrangling in the insurance industry. Trifacta is a Hortonworks Certified Technology partner and most recently received the Hortonworks Industry Certification for the Retail and Insurance verticals. To learn more about our partnership, check out the Partners page on the Trifacta or Hortonworks website.

The insurance industry is wrestling with the tremendous growth of data sources at its disposal. Traditional ETL processes are expensive, time-consuming, and complicated by the variety of data structures and formats. In contrast, Hadoop platforms provide a clean, safe, and manageable environment for data wrangling, the critical first step of the data analysis process.

Forward-thinking insurance companies have embraced data wrangling as more than janitorial work. For them, wrangling is just as important a component as the final results. Properly executed, wrangling provides data insights that improve both analytical inquiries and the quality of the results.

In this post, we look at the six steps of wrangling data according to Trifacta. We then look at how each step applies to the insurance industry.

1.) Discovering

“Discovering is something of an umbrella term for the entire process; in it, you learn what is in your data and what might be the best approach for productive analytic explorations.”

In the insurance world, it’s extremely important to understand accurately what the available data represents. For example, the meaning of a common term such as “household” is shifting as more millennials live with their parents. Accordingly, the lifetime value of a household is now a better metric than the lifetime value of a customer. Discoveries such as these give us a good understanding of how to proceed.

2.) Structuring

“Structuring is needed because data comes in all shapes and sizes.”

Today, the insurance industry is able to pull data not only from structured sources, but from unstructured sources as well. Call center transcripts, a form of unstructured text, can indicate whether a customer has problems with their current policies. These transcripts are an important data point, but using them requires a self-service tool, such as Trifacta, that offers an efficient way to structure this type of unstructured data.
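To make the idea concrete, here is a minimal Python sketch of structuring a transcript by hand. It is not how Trifacta does it; the "SPEAKER: utterance" layout, the keyword list, and every name in the code are assumptions made up for illustration.

```python
import re

# Hypothetical transcript layout: one "SPEAKER: utterance" per line.
TRANSCRIPT = """\
AGENT: Thank you for calling, how can I help?
CUSTOMER: I have a problem with my auto policy renewal.
AGENT: I can look into that for you.
CUSTOMER: Thanks, the premium also seems wrong.
"""

LINE_PATTERN = re.compile(r"^(?P<speaker>[A-Z]+):\s*(?P<text>.+)$")
# Assumed keyword list; a real project would tune this or use a text model.
PROBLEM_KEYWORDS = ("problem", "wrong", "cancel", "complaint")


def structure_transcript(raw):
    """Turn raw transcript text into a list of row dictionaries."""
    rows = []
    for turn, line in enumerate(raw.splitlines(), start=1):
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        text = match.group("text")
        rows.append({
            "turn": turn,
            "speaker": match.group("speaker").lower(),
            "text": text,
            "mentions_problem": any(k in text.lower() for k in PROBLEM_KEYWORDS),
        })
    return rows


rows = structure_transcript(TRANSCRIPT)
for row in rows:
    print(row)
```

Each row now has a fixed set of columns, so the transcript can be filtered, counted, or joined like any other table.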

3.) Cleaning

“Cleaning involves taking out data that might distort the analysis, such as a null value.”

Data cleaning involves more than reformatting null values or fields: it’s also an opportunity to validate the trustworthiness of the data. In addition, regulatory compliance within the insurance industry demands a complete data lineage. Trifacta enables this lineage by tracking all changes from raw data to completion.
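As a rough illustration of cleaning with a lineage trail, the sketch below drops records with a null amount, normalizes a state code, and logs every change it makes. The record layout, field names, and log format are hypothetical, and this is only a hand-rolled stand-in, not Trifacta's lineage tracking.

```python
# Hypothetical raw claim records; field names are assumptions.
RAW_CLAIMS = [
    {"claim_id": "C1", "amount": "1200.50", "state": "CA"},
    {"claim_id": "C2", "amount": None, "state": "ny"},
    {"claim_id": "C3", "amount": "310", "state": " tx "},
]


def clean_claims(records):
    """Return (cleaned_rows, lineage); lineage lists every change made."""
    cleaned, lineage = [], []
    for record in records:
        claim_id = record["claim_id"]
        if record["amount"] is None:
            # A null amount would distort totals, so drop the row and log it.
            lineage.append((claim_id, "dropped", "amount is null"))
            continue
        row = dict(record)  # copy so the raw record is left untouched
        row["amount"] = float(record["amount"])
        normalized = record["state"].strip().upper()
        if normalized != record["state"]:
            detail = f"{record['state']!r} -> {normalized!r}"
            lineage.append((claim_id, "state normalized", detail))
        row["state"] = normalized
        cleaned.append(row)
    return cleaned, lineage


cleaned, lineage = clean_claims(RAW_CLAIMS)
for entry in lineage:
    print(entry)
```

Because the raw records are never modified and every change is logged, an auditor can trace each cleaned value back to its source.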

4.) Enriching

“Enriching allows you to ask questions about other data that might be useful in the analysis, or new data that you can derive from existing data.”

Insurance data can be enriched by joining it with other data. For example, the what, when, and how of individual product purchases can be paired with marketing data. Once joined, the effectiveness of outreach efforts can be gauged, and tailored marketing efforts can target customers with greater success.
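A join of this kind can be sketched in a few lines of Python. The two datasets, their column names, and the campaign names below are invented for illustration; real purchase and marketing data would be far larger and messier.

```python
from collections import Counter

# Hypothetical purchase records: the what, when, and how of each purchase.
PURCHASES = [
    {"customer_id": 1, "product": "auto", "purchased_on": "2015-03-02", "channel": "agent"},
    {"customer_id": 2, "product": "home", "purchased_on": "2015-04-10", "channel": "web"},
    {"customer_id": 3, "product": "life", "purchased_on": "2015-05-21", "channel": "web"},
]
# Hypothetical marketing data: which campaign reached which customer.
CAMPAIGN_CONTACTS = [
    {"customer_id": 1, "campaign": "spring_mailer"},
    {"customer_id": 3, "campaign": "email_offer"},
]


def enrich_purchases(purchases, contacts):
    """Left-join purchases to campaign contacts on customer_id."""
    campaign_by_customer = {c["customer_id"]: c["campaign"] for c in contacts}
    return [
        {**purchase, "campaign": campaign_by_customer.get(purchase["customer_id"])}
        for purchase in purchases
    ]


enriched = enrich_purchases(PURCHASES, CAMPAIGN_CONTACTS)
# Count purchases per campaign, ignoring customers no campaign reached.
purchases_per_campaign = Counter(row["campaign"] for row in enriched if row["campaign"])
print(purchases_per_campaign)
```

The left join keeps customers that no campaign reached, which matters here: they form the baseline against which outreach is measured.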

5.) Validating

“Validating is the activity that surfaces data quality and consistency issues, or verifies that they have been properly addressed by applied transformations.”

When reviewing a total risk profile, it’s important to fact-check all known variables. The number of policies with a customer and subsidiary, for instance, should be checked against the exact timeline of system changes. Validating the data makes it possible to begin visualizing how all of these elements work together, and to confirm that earlier transformations were applied correctly.
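Validation rules are often simple consistency checks. The sketch below, with made-up policy records and two assumed rules (policy IDs must be unique, and a policy must end after it starts), shows the general shape of such a check.

```python
from datetime import date

# Hypothetical policy records; field names and values are assumptions.
POLICIES = [
    {"policy_id": "P1", "customer_id": 1, "start": date(2014, 1, 1), "end": date(2015, 1, 1)},
    {"policy_id": "P2", "customer_id": 1, "start": date(2015, 6, 1), "end": date(2015, 1, 1)},
    {"policy_id": "P1", "customer_id": 2, "start": date(2015, 2, 1), "end": date(2016, 2, 1)},
]


def validate_policies(policies):
    """Return a list of (policy_id, issue) pairs; an empty list means all checks passed."""
    issues = []
    seen_ids = set()
    for policy in policies:
        policy_id = policy["policy_id"]
        if policy_id in seen_ids:
            issues.append((policy_id, "duplicate policy_id"))
        seen_ids.add(policy_id)
        if policy["end"] <= policy["start"]:
            issues.append((policy_id, "end date is not after start date"))
    return issues


issues = validate_policies(POLICIES)
for issue in issues:
    print(issue)
```

Running the same checks again after each transformation is a cheap way to verify that a fix actually resolved the issue it targeted.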

6.) Publishing

“Publishing delivers the output of data wrangling efforts for downstream project needs.”

Publishing enables use of wrangled data, whether by loading the data into a particular analysis package, or by documenting the transformation logic for future needs. In the insurance industry, it’s extremely important that wrangled data is published for the actuarial department, or for downstream business users who will perform risk analysis and underwriting, claims, or customer analytics.


As the insurance industry relies on Hadoop to support new analytical endeavors, proper data wrangling should remain a critical component of the process. Without it, opportunities for enhanced risk selection, customer growth, and the mitigation of claim leakage are lost. Data wrangling is a key process that explores all possible data, thereby providing the best possible understanding of it.

To learn more, download our Six Steps to Preparing Insurance Data ebook.