With the release of Hadoop 2.0, the Hadoop ecosystem is undergoing a major transition. While Hadoop 1.0 provided one general-purpose programming model operating on semi-structured data, Hadoop 2.0 allows for a much larger ecosystem of special-purpose systems, many of which require structured data. It’s worth reflecting on how this transition came about, and on why the new ecosystem critically needs data transformation platforms like Trifacta.
The Past: Big Data = MapReduce
It’s been almost a decade since Google published the designs of MapReduce and the Google Filesystem, providing the inspiration for Hadoop and sparking an explosion of interest in scale-out data processing on commodity hardware.
For several years after its introduction, people tried to fit as much as possible into the MapReduce programming model. Over time, however, the limited expressive power of a single MapReduce job became clear. Power users responded to these limitations by building a plethora of higher-level languages and workflow schedulers, many of which aimed to ease the burden of plumbing together increasingly complicated MapReduce workflows. Algorithms that were inefficient when implemented as MapReduce jobs were often deemed “good enough”, since previous tools had trouble processing data at scale. As Hadoop’s adoption increased, it became clear that while MapReduce is powerful, it’s not a panacea.
Toward the Right Tool for Every Job
Thankfully, several powerful new tools have been added to the Big Data toolbox in recent years. YARN decouples MapReduce from Hadoop’s cluster management system. This decoupling allows sophisticated analytics tools like Spark and massively parallel databases like Impala to coexist with MapReduce on Hadoop clusters despite their vastly different architectures. The increasing adoption of HDFS gives these diverse systems a common storage layer. We are entering an era where practitioners will be able to choose the software tool best suited to each processing or analysis task, rather than shoehorning algorithms into MapReduce. However, for this new Big Data ecosystem to flourish, we still have to bridge a significant gap.
MapReduce is great at dealing with batch transformation and aggregation of semi-structured data, but many next-generation tools (particularly databases like Impala) require much more structure. Bridging this gap means transforming semi-structured data into structured data. If this task remains the domain of a small group of experts writing complex transformation workflows by hand, the cost of migrating to these next-generation tools will be too high. There is an urgent need for a user-friendly and powerful data transformation tool that allows people who aren’t Big Data experts to bridge the gap.
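To make "transforming semi-structured data into structured data" concrete, here is a minimal hand-written sketch in Python. It is not how Trifacta works; it only illustrates the kind of code an expert might write today. The input format (one JSON object per line), the `SCHEMA` columns, and the function names are all assumptions made up for this example.

```python
import json
from typing import Iterable, List, Optional

# Hypothetical target schema: column name -> converter for that column's type.
SCHEMA = {"user_id": int, "event": str, "duration_ms": float}


def to_row(raw_line: str) -> Optional[dict]:
    """Turn one semi-structured JSON record into a row matching SCHEMA.

    Returns None when the line is not a JSON object. Fields that are
    missing or cannot be converted become None; extra fields are dropped.
    """
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None

    row = {}
    for column, convert in SCHEMA.items():
        value = record.get(column)
        try:
            row[column] = convert(value) if value is not None else None
        except (TypeError, ValueError):
            row[column] = None
    return row


def structure(lines: Iterable[str]) -> List[dict]:
    """Keep only the lines that parse, each coerced to the fixed schema."""
    rows = []
    for line in lines:
        row = to_row(line)
        if row is not None:
            rows.append(row)
    return rows
```

Even this toy version has to decide what to do with malformed lines, missing fields, extra fields, and values of the wrong type. Real workflows multiply those decisions across many sources and formats, which is why writing them by hand does not scale beyond a small group of experts.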
We believe that Trifacta provides that user-friendly bridge. As a backend engineer at Trifacta, I help make our platform scalable, performant and reliable. It is an incredibly complicated and challenging problem, and one that we’re really excited to be working on.