
Data Wrangling and Visualization on a Future-Proof Platform

September 12, 2016

Trifacta and Hortonworks’ partnership is committed to accelerating the adoption of open source enterprise Apache Hadoop. With Trifacta, Hortonworks Data Platform users are able to work with their data more efficiently and productively. In anticipation of our upcoming PepsiCo webinar with Tableau, Hortonworks VP of Industry Solutions Eric Thorsen blogged on data wrangling and visualization, reposted below. Read on to learn more about the Trifacta, Tableau & Hortonworks solution.

It sounds like the Wild West, and when it comes to data, sometimes it looks like it too. The concept of “wrangling” brings to mind a lone cowboy on horseback, rounding up a herd of cattle. Back in the day, a really good wrangler could take charge of livestock, guiding them in the right direction and keeping them organized along the way.

We can make the same analogy for data wrangling. Rather than a lone cowboy, visualize a frantic business analyst faced with disparate data sources from multiple organizations and a need to organize, index, and query against a variety of elements. Cows may be fickle and need encouragement, but at times the work we do in Microsoft Excel takes just as much cajoling and whip-cracking!

You don’t have to look far before you find a compelling need for three things:

  • A future-proof platform to collect and store all data elements from multiple sources;
  • A data wrangling tool to wrestle and control all these data sources; and
  • An ironclad visualization tool to intelligently display the resulting data.

These requirements exist in all industries, but they are especially pronounced in the Consumer Packaged Goods (CPG) industry.


PepsiCo is a leader in this industry and, like other CPG companies, faces distinct challenges in managing its supply chain and product demand. CPG organizations rely on close relationships with retailers to predict and manage both. This collaboration provides unique insight into the forecasting and replenishment of standard goods. It can make a difference in everything from planning “buy one get one” promotions to minimizing the risk that retailers have empty shelves when consumers arrive at the store to purchase the promoted item.

This process is called Collaborative Planning, Forecasting, and Replenishment (CPFR), and it requires data from all participants. CPG data on UPC details, shelf life, and size supports shipping algorithms and determines how much space is required on the trucks delivering product. Data from retailers contains store codes, quantity on hand, and point-of-sale (POS) data. Additional data such as weather, events, and driver scores can be added as well to optimize delivery routes and manage issues in the supply chain.

The CPFR process is data-intensive. To be successful, it requires a future-proof data platform that truly supports all data, whether it is structured data created by an application or data warehouse, or unstructured data collected from social media, syndicated sources, or other services. All data is combined to provide a unique view into this complex business process.


The volume of this data is overwhelming, and so is the process of managing it. Combining multiple sources that use different ID systems can be very manual and resource-intensive. At Strata in 2015, Matt Derda of PepsiCo shared how they leveraged a series of macros in Microsoft Access to convert customer data into Excel, which then fed a series of queries on their internal servers. Hours and days were spent simply preparing the data. According to our partner Trifacta, over 80% of overall effort is spent preparing data rather than on the true objective, analysis.
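
To make the “different ID systems” problem concrete, here is a minimal Python sketch of joining retailer rows to a manufacturer's product master through an ID crosswalk. It is only an illustration: the field names, codes, and the `combine` helper are hypothetical and do not come from PepsiCo, Trifacta, or Hortonworks.

```python
# Hypothetical sample data; every code and field name here is made up.
retailer_rows = [
    {"store_code": "S001", "retailer_sku": "R-1001", "on_hand": 40},
    {"store_code": "S002", "retailer_sku": "R-1002", "on_hand": 12},
    {"store_code": "S002", "retailer_sku": "R-9999", "on_hand": 7},
]

# Crosswalk from the retailer's item IDs to the manufacturer's UPCs.
sku_to_upc = {"R-1001": "000000000017", "R-1002": "000000000024"}

# Manufacturer's product master, keyed by UPC.
product_master = {
    "000000000017": {"shelf_life_days": 90, "case_size": 24},
    "000000000024": {"shelf_life_days": 120, "case_size": 12},
}


def combine(rows, crosswalk, master):
    """Attach product details to each retailer row; set aside rows with no match."""
    matched, unmatched = [], []
    for row in rows:
        upc = crosswalk.get(row["retailer_sku"])
        if upc is None or upc not in master:
            unmatched.append(row)  # needs manual review
            continue
        matched.append({**row, "upc": upc, **master[upc]})
    return matched, unmatched


matched, unmatched = combine(retailer_rows, sku_to_upc, product_master)
print(len(matched), "matched;", len(unmatched), "need review")
```

Keeping the unmatched rows in their own list, rather than silently dropping them, is what turns hours of manual checking into a short review queue.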


PepsiCo solved this complex CPFR business problem by using Hortonworks Hadoop to store all collected data, Trifacta to wrangle it, and Tableau to deliver rich visualizations.


With this blend of technology, PepsiCo gets faster access to reports and truly supports the “Collaborative” in its CPFR process.

They can easily and quickly import customer-provided data, combine it with internal product data, and enrich it with social media, sentiment analysis, and other unstructured data points. This combined data set is assembled very quickly, using intuitive and approachable scripting logic that provides visualization of data components, as well as potential data errors based on bad or missing characters.
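
As a rough illustration of what flagging “bad or missing characters” can look like, the Python sketch below checks two fields of an imported record. It is not Trifacta's logic: the `find_issues` helper, the field names, and the assumption that UPCs are 12-digit strings are all hypothetical.

```python
import re

# Assumption for this sketch: UPCs arrive as 12-digit strings.
UPC_PATTERN = re.compile(r"\d{12}")


def find_issues(record):
    """Return a list of human-readable problems found in one record."""
    issues = []

    upc = (record.get("upc") or "").strip()
    if not upc:
        issues.append("upc: missing")
    elif not UPC_PATTERN.fullmatch(upc):
        issues.append("upc: unexpected characters or length")

    qty = (record.get("on_hand") or "").strip()
    if not qty:
        issues.append("on_hand: missing")
    elif not qty.isdecimal():
        issues.append("on_hand: not a whole number")

    return issues


# Hypothetical customer-provided rows, as they might arrive from a spreadsheet.
records = [
    {"upc": "000000000017", "on_hand": "40"},
    {"upc": "00000000O017", "on_hand": "12"},  # letter O typed instead of a zero
    {"upc": "", "on_hand": "n/a"},
]

report = {i: find_issues(r) for i, r in enumerate(records)}
for index, issues in report.items():
    print(index, issues or "ok")
```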


A consumer goods company manages customers at two levels: its primary customers, the retailers, and the end consumer. As a result, a CPG organization requires data at multiple levels to make business decisions.

At times, the CPFR process may lead to a business decision to recommend reducing order volumes. If a retailer is ordering 500 cases per week and consumes only 100 cases per week, the right recommendation is a reduced order count to prevent spoilage at the end of the year. This is a byproduct of a truly collaborative relationship within the customer demand chain.
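
The arithmetic behind that recommendation can be sketched in a few lines of Python. The `recommend_order` function and its `safety_stock` parameter are hypothetical, used only to work through the 500-versus-100 example; a real CPFR forecast would account for much more (seasonality, promotions, shelf life).

```python
def recommend_order(ordered_per_week, consumed_per_week, safety_stock=0):
    """Return (recommended_order, weekly_surplus), both in cases.

    This sketch only ever recommends a reduction: the result is capped
    at the current order size.
    """
    weekly_surplus = max(ordered_per_week - consumed_per_week, 0)
    recommended = min(ordered_per_week, consumed_per_week + safety_stock)
    return recommended, weekly_surplus


# The example from the text: 500 cases ordered, 100 consumed each week.
recommended, surplus = recommend_order(500, 100)
print(f"Recommend {recommended} cases/week; current surplus is {surplus} cases/week")
```

At 400 surplus cases every week, the excess builds quickly, which is why a lower order count is the collaborative answer even though it means a smaller sale in the short term.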