Hadoop Summit Session Preview – From Beginner to Expert: Data Wrangling for All

May 26, 2015

Trifacta’s Alon Bartur previews “From Beginner to Expert: Data Wrangling For All,” the session he will present at Hadoop Summit with Trifacta colleagues Jingshu Chen and Joe McKenney.

The data preparation market is growing. As it does, our team at Trifacta understands the importance of building a product that can serve the needs of an ever-expanding group of customers.

When Hadoop first appeared in business environments, its initial users tended to be technical and usually had extensive computer science backgrounds. Because of their experience, they were comfortable with the challenging and often time-consuming task of preparing raw or unstructured data in Hadoop. Typically, this involved writing MapReduce code or Pig scripts to clean up and prepare data for analysis.

As Hadoop continues to make inroads into the enterprise, more and more people are eager to use the platform to gain the sorts of insights that were never possible with traditional approaches to data management and analytics. Very often these are front-line managers and analysts with deep knowledge of their domain, but whose technical ability doesn’t extend beyond Excel or SQL. These users want direct access to the data now residing in Hadoop.

As an example, let’s say one of these users wants to use Hadoop to better understand how customers behave on their company’s website. Today, they would first need to rope in a data scientist or data platform expert from IT to help them access and wrangle the messy online session data for analysis. This creates a bottleneck, and the resulting delays have greatly hindered the ability of enterprises to benefit from their Hadoop investment.

This diversity in user background makes it challenging for companies in the data preparation market to design and build products. One approach is to add optionality: include a number of ways for users to interact with the product and let them configure how they accomplish certain tasks. If poorly executed, this can result in a diminished experience for users of all skill levels. When executed well, however, it can result in a product that doesn’t slow down experienced users, supports novices, and provides enough contextual information and help to ensure that no user is ever left stranded.

At Trifacta, we’ve worked hard to build a product that supports the entire spectrum of Hadoop users. Experienced data scientists and developers can leverage Trifacta’s intelligent recommendations and workflow for enhanced productivity while still having the flexibility to write their own transformations. Less technical analysts can take advantage of various levels of abstraction to help guide them through the data wrangling process without needing to first learn a scripting language.

Bringing ease of use to data wrangling is a crucial step toward letting those who know the data best interact with it directly. As data preparation tools evolve, individuals with a variety of backgrounds will be able to work more easily with data of all shapes and sizes.

In our upcoming session at Hadoop Summit, my colleagues Joe McKenney, Jingshu Chen and I will explain Trifacta’s approach to wrangling data in Hadoop. We’ll provide first-hand examples of how we’re bridging the gap between novices and experts and allowing everyone to take full advantage of the considerable power and promise of Hadoop.

For more on why data wrangling on Hadoop is the next step in the evolution of self-service analytics for business users, we encourage you to download the following report from Gartner Research: “Leverage Big Data Discovery to Simplify Analytics on Big Data.”