The extract, transform, and load process (ETL process), and ETL tools, have been the de facto way to move and transform data within data warehouses since the onset. It was, and for many continues to be, a necessary intermediary between data sources and data warehouse/business intelligence implementations, allowing organizations to extract a variety of datasets from their source and transform them into a standardized format for analysis.
As the big data industry evolves, however, and the demand for more diverse and complex data sources grows, traditional ETL tools have struggled to extend support for the self-service agility required by these emerging analytics use cases. Traditional ETL tools and ETL processes are extensively hardened, highly governed processes that feed critical organizational reporting. For the ETL process, this rigidity is a good thing, for it ensures that the data that lands in the eventual analytics system and databases is accurate and trustworthy.
However, these technologies were never designed to be put into the hands of the business analysts that have the most context for the data and analytics, often resulting in week or month-long lag times for IT teams to act upon the requests of their business counterparts. Despite the best efforts and intentions of vendors, it’s widely acknowledged that data preparation/ data wrangling still accounts for up to 80 percent of the effort in any analytics project and at Trifacta, we believe the only way to reduce the time spent is empowering the individuals who know the data best to do the wrangling.
In response to this changing analytics environment, ETL tools have been joined by an increasing number of new technologies and best practices for more effectively utilizing information. Born out of the “Big Data” era, data wrangling solutions have been developed to accelerate the time it takes business analysts to perform the work of transitioning raw sources into analysis-ready datasets.
Though ETL processes and tools and data wrangling solutions are complementary technologies, they are often incorrectly pitted against each other because both solutions are focused on transforming data. To help clear up some of the confusion on the differences between ETL tools and data wrangling technologies, here are three fundamental differences.
1. Data wrangling users are analysts vs. users of ETL tools are IT professionals.
This is the crux of the difference: ETL tools were built for IT professionals, whereas data wrangling was built with the end-user in mind. For example, every design decision we’ve made at Trifacta has stemmed from our core belief that those who know the data best should be the ones doing the wrangling. Analytics initiatives can’t scale when organizations relegate data preparation to a small group of highly technical (and expensive) employees; instead, this work needs to span departments and technical abilities for the most comprehensive and efficient outcomes.
2. Data wrangling solutions combine visualization, machine learning and human-computer-interaction vs. the ETL process is mapping-based.
Mapping-based ETL processes and tools are designed for IT to handle well-defined DW/BI processes, which doesn’t lend itself to non-technical users, nor encourage agility and iterative data exploration and preparation. On the other hand, data wrangling is powered by a next-gen user experience combining the latest techniques in visualization, ML and HCI so analysts can fluidly explore and prepare data. Feedback from interacting with elements of data is instantaneous, allowing users to test and learn quickly, and ultimately wrangle their data with more accuracy and speed.
3. Data wrangling solutions can handle complex, diverse data vs. ETL tools and the ETL process that mostly focuses on structured data.
Given that ETL tools were originally developed decades ago, they were architectured to handle well-defined, structured data—not the diversity and complexity that have arisen in the “Big Data” era. As data continues to grow in size and complexity, this limitation has proven challenging for organizations that want to leverage new data, and more of it. In contrast, Trifacta was specifically engineered to tackle diverse, semi-structured data of all shapes and sizes. Users can work with CSV and JSON formats simultaneously since Trifacta intuitively recognizes and breaks down complex formats into workable structures.
Final thoughts about ETL Tools vs. Data Wrangling
Despite the stark differences outlined above, it isn’t always a black and white decision to choose between data wrangling and ETL software ; in many cases, our customers use both. For one thing, abandoning ETL processes and tools isn’t an overnight process, but other organizations are simply reexamining where and when ETL solutions fit in this new landscape.
The important thing is to identify your organization’s unique needs, and understand which tool makes the most sense to fill each need. Data wrangling shines as a solution for organizations ready to extend data wrangling to non-technical users, to work with new sources or accelerate existing ETL processes, and to develop a more iterative, agile workflow for analysis. We’re proud to have become an integral component to many of our customers. To learn more about what they’re doing, check out our case studies on The Royal Bank of Scotland or PepsiCo.