Cloudera’s announcement this morning highlighted the opportunity for Hadoop to have a significant impact on enterprise data processing as an Enterprise Data Hub: the central source of data across the organization. Trifacta was named as part of a large and growing ecosystem of over 100 solutions certified with the latest Cloudera distribution of Hadoop, Cloudera Enterprise 5.
With the ability to ingest vast amounts of unstructured data, Hadoop is a technology well suited to the role of a data hub. It is arguably a better fit for newer data formats, like log and web files, that sit awkwardly in more structured, traditional relational databases. More recently, Hadoop has matured to handle structured data as well, with technologies such as Impala and Hive. But how will Hadoop evolve to support the promise of a hub not just for data but for enterprise analysis? Providing a hub for enterprise analysis means looking beyond an IT-centric view of Hadoop as merely a storage layer. It means looking to the data scientists and analysts who aim to draw business value out of their data. To accomplish their goals, they require Hadoop to deliver both storage and responsive data analysis.
In my previous jobs, as a research scientist at Yahoo! and an engineer at LinkedIn, I spent a fair amount of time working with Hadoop and different datasets (web logs, metadata, etc.). Any time I got my hands on a new data set, I began with the same steps most of us employ: grab a sample of the data, parse it into columns, home in on the relevant parts, and develop hypotheses.
All of this sounds smooth, fun and…linear. The astute reader may notice one of my steps: working on a sample. That’s a standard step in the Big Data era: work quickly with a small sample, then apply that work over the whole data set. That’s where things often slow down and my approach becomes decidedly super-linear. Responsiveness becomes an issue. Even at big companies with large Hadoop clusters, it still takes hours to work on terabytes of data. Run the job, wait, and hope for transformed and understandable results. That almost never happens! Instead, we find the job has failed due to memory errors. Or, if the job did succeed, the output is full of nonsense: columns with anomalies not present in our sample, joins that return no results, etc. When output is large, it may require building post-processing analyses just to detect these problems. All of this forces the analyst to iterate over the data again and again. In the best case, even if we make quick work of the problems as we find them, we still have a lot of idle time waiting for jobs to complete. What seemed like a half-day project can drag out for days.
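The sample-first loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: the log format, field names, and the `sample_lines` helper are illustrative assumptions, not any particular production setup.

```python
import random

# Hypothetical log data standing in for a large file on HDFS.
raw_log = "\n".join(
    f"2014-02-0{d} GET /page/{d} {200 if d != 3 else 500}"
    for d in range(1, 8)
)

def sample_lines(text, k, seed=42):
    """Grab a small random sample of lines to iterate on quickly."""
    lines = text.splitlines()
    random.seed(seed)
    return random.sample(lines, min(k, len(lines)))

def parse(line):
    """Parse a raw log line into named columns."""
    date, method, path, status = line.split()
    return {"date": date, "method": method, "path": path, "status": int(status)}

# Work on the sample first; develop hypotheses (e.g. error rates) cheaply.
sample = [parse(line) for line in sample_lines(raw_log, 3)]
errors = [r for r in sample if r["status"] >= 500]
```

The catch, of course, is that a parser and hypotheses tuned on three rows can break in surprising ways when pointed at the full terabyte.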
At Trifacta, we are focused on improving these day-to-day challenges for the data analyst: areas where the experience of working with Hadoop is not yet as responsive as it could be. Today, one of the key advantages customers get from Trifacta’s Data Transformation Platform is the ability to build a reusable script simply by manipulating example data directly. The system suggests transforms that can be previewed. Suggested transforms can be immediately accepted into the script, or analysts can write their own transformations if they have something in mind. In all cases, they are building a high-level declarative query plan (in database terms, they say ‘what’ they want and not ‘how’ to get it). This enables them to work quickly. It also lets Trifacta do a lot of work behind the scenes.
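The ‘what, not how’ idea can be illustrated with a toy interpreter: the script is a declarative list of transform steps, and an engine is free to decide how to execute them. The step names and fields below are invented for illustration and are not Trifacta’s actual script format.

```python
# A declarative script: each step says WHAT to do, not HOW.
script = [
    {"op": "split",  "column": "line", "on": ",", "into": ["name", "city"]},
    {"op": "filter", "column": "city", "equals": "Austin"},
]

def run(script, rows):
    """A minimal engine that executes the declarative steps in order."""
    for step in script:
        if step["op"] == "split":
            # Replace the source column with the newly split-out columns.
            rows = [
                {**{k: v for k, v in r.items() if k != step["column"]},
                 **dict(zip(step["into"], r[step["column"]].split(step["on"])))}
                for r in rows
            ]
        elif step["op"] == "filter":
            rows = [r for r in rows if r[step["column"]] == step["equals"]]
    return rows

result = run(script, [{"line": "Ada,Austin"}, {"line": "Bob,Boston"}])
```

Because the script only describes intent, the same steps could be replayed on a sample in the browser or compiled down to a distributed job, which is precisely what makes the declarative form valuable.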
As the user builds their script, Trifacta simultaneously draws sophisticated samples from their data and surfaces them. The anomalies that used to be an unpleasant surprise now appear early in the transformation process. When the user runs their job at scale, we really flex the compiler’s muscles. Trifacta generates a query execution plan to produce their output and automatically adds execution steps to produce all the data artifacts presented on the Job Results page. These include basic information about the result (size, number of rows, etc.) but also statistical summaries of every output column and example anomalies. Anomaly identification includes a filtered list of the specific output rows involved, presented so that the user can easily select from them and bring them back into the system for further transformation.
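A minimal sketch of what a per-column artifact might look like, assuming a simple type-check as the anomaly rule; the function name, summary fields, and sample output rows are all hypothetical, not Trifacta’s internals:

```python
from collections import Counter

def column_profile(rows, column, expected_type=int):
    """Summarize one output column and collect the rows whose values
    fail the expected type -- the 'anomalies' surfaced back to the user."""
    values, anomalies = [], []
    for row in rows:
        try:
            values.append(expected_type(row[column]))
        except (ValueError, TypeError):
            anomalies.append(row)
    summary = {
        "rows": len(rows),
        "valid": len(values),
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "top": Counter(values).most_common(1),
    }
    return summary, anomalies

# Hypothetical job output: one 'age' value is malformed.
output = [{"age": "34"}, {"age": "29"}, {"age": "n/a"}, {"age": "41"}]
summary, bad_rows = column_profile(output, "age")
```

The point is that the anomalous rows come back as data, not just as a count, so they can be fed straight into the next round of transformation.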
One of the major challenges that the Trifacta engineering team faced leading up to our latest release was developing appropriate techniques to produce these artifacts at the terabyte to petabyte scale across the variety of use cases our customers bring to Trifacta. And we succeeded. The result? We do the heavy lifting and our users benefit from a much deeper understanding of what their transformation job has produced.
With Cloudera leading the charge to leverage new volumes of data in the enterprise, we’re hoping that analysts can embrace that volume with tools like Trifacta that are purpose-built for data discovery and exploration. In the era of Hadoop, structuring data is more nuanced than in the past: each data analyst brings his or her own context to the data. We aim to make data analysts’ lives easier by increasing their productivity while also providing data processing that takes full advantage of the agility and flexibility of Hadoop’s unique approach to data storage. If that combination takes hold, we’re likely to see the full business benefit of Hadoop in the enterprise, with more responsive data analysis.