Partnering for Transformation in Hadoop

March 18, 2014

As the Hadoop ecosystem has transitioned from an open source experiment to an enterprise solution, it has begun to have a significant impact on the way professional data analysts do their work. Hadoop is often viewed as a set of technologies, but its effect on the analytic process is equally important: it is a platform that encourages agility and scale in data analysis.

To start, Hadoop’s storage is built on a simple filesystem model rather than a complex data model. Questions of data modeling are postponed until the data is used, making Hadoop a friction-free environment for storing data at remarkable scale. This is perfect for the quantified world we live in: anything you can record or measure can be captured in storage without worrying about formats or schemas. Ease of data capture ensures that no data is lost and that the opportunity for subsequent analysis is maximized.
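As a minimal sketch of what that friction-free capture step can look like, consider the third-party HdfsCLI Python client; the NameNode address, user, file names, and directory layout here are purely illustrative assumptions, not part of any particular deployment.

    # Purely illustrative: land raw files in HDFS exactly as recorded,
    # with no schema or modeling decisions made at write time.
    from hdfs import InsecureClient

    # Hypothetical NameNode address and user.
    client = InsecureClient("http://namenode:50070", user="analyst")

    # Raw sensor readings and device logs go in side by side; HDFS
    # imposes no structure, so capture is cheap and nothing is lost.
    client.makedirs("/data/raw/sensors/2014-03-18")
    client.upload("/data/raw/sensors/2014-03-18", "readings.csv")
    client.upload("/data/raw/sensors/2014-03-18", "device-log.json")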

Data in the Hadoop Distributed File System (HDFS) need not stay raw; it can be transformed and manipulated as needed into multiple structures and schemas, which can support different users with different business purposes. Once data is transformed for a particular use case, a growing variety of technologies in the Hadoop ecosystem are available to perform analysis in both batch and interactive modes. The flexibility of transformation into scenario-specific formats encourages analysts to be far more agile and creative in the questions they ask than was typical with traditional enterprise data management software.
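To make that schema-on-read pattern concrete, here is a small hypothetical sketch using PySpark, just one of several engines in the ecosystem; the file path, tab-separated layout, and field names are assumptions made for the example.

    # Purely illustrative: the same raw files can be given structure at
    # analysis time, and different structure for different purposes.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

    # Read raw clickstream lines exactly as they sit in HDFS (hypothetical path).
    raw = spark.read.text("hdfs:///data/raw/clickstream/2014/03/")

    # Impose a schema only now, at time of use: split tab-separated lines
    # into named columns for this particular analysis.
    events = raw.select(F.split(F.col("value"), "\t").alias("cols")).select(
        F.col("cols")[0].alias("timestamp"),
        F.col("cols")[1].alias("user_id"),
        F.col("cols")[2].alias("url"),
    )

    # One possible downstream use: daily event counts, small enough to hand
    # to a visualization tool.
    daily_counts = events.groupBy(F.substring("timestamp", 1, 10).alias("day")).count()
    daily_counts.show()

The raw files are never modified; each analysis commits to a structure only for as long as it needs one, and another team could parse the same files into a completely different shape for a different question.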

Addressing the Transformation Bottleneck

Unlike HDFS, data analysis software is rather demanding of its inputs: visualization tools and predictive analytics both require cleaned, well-structured data. Hadoop’s flexibility at time of data capture does not remove the need to do this work; it simply postpones it to the time of analysis, when it can be done in a targeted fashion. This is an area where the Hadoop ecosystem is still evolving: many organizations can get a fast start with HDFS as a storage platform, but they have had to recruit people with deep technical skills to leverage the power of the platform to analyze raw data. Trifacta entered the Hadoop market to accelerate this aspect of the platform’s evolution, and promote true end-to-end agility in analytic data management.

Our goal here is twofold. For technical data engineers and data scientists, we believe Trifacta can radically reduce the time-to-value in data analysis, speeding analysts through data cleaning and transformation, and encouraging them to iteratively develop new models and insights over a wider variety of data. In addition, we want to enable business data analysts to get “onto the playing field”, making it easy for them to discover raw data at scale, structure and extract key features, and transform the results down to a size that can be loaded into visualization and analysis packages.

Of course this isn’t only a challenge in the Hadoop context. The traditional analytic data processing pipeline—whether it is on Hadoop, databases, or desktop files—has long been a slow, manual process. But with Trifacta’s technology we’re changing that. We’re using a combination of machine learning algorithms and new human-computer interaction design to elevate the experience of working with raw data—moving from manual, tedious data preparation to a new approach to Data Transformation.

Serious Transformation at Scale: A Partnership

When we started Trifacta, our vision was to build a Data Transformation platform that could scale up to the volume and variety seen in production Hadoop implementations. This didn’t preclude us from solving the human-interaction problem for smaller data sets, but we were determined that any solution we developed must have the power to solve data transformation for terabytes to petabytes of data—exactly the environments where Hadoop is most effective. We believe that Trifacta’s combination of an intelligent interface and limitless scalability truly differentiates it from other approaches in the market.

Today we’re excited to announce a strategic partnership with Cloudera that includes not only the certification of Trifacta on Cloudera’s Distribution of Hadoop (CDH), but also continued joint development and solution delivery to Cloudera customers. It’s a milestone in delivering on the initial vision that Jeff, Sean, and I had for the company. Together with Cloudera, we’re focused on lowering the barriers to capturing and leveraging all the world’s data, driving down the time to analysis, and transforming the potential of Big Data into concrete value within the enterprise.

Recently, Amr Awadallah and I had a chance to discuss the evolution of Hadoop, what Trifacta means to Cloudera customers, and the emerging space of Data Transformation. If you have a few minutes, take a look at some of the highlights from our conversation.