SQL Pipelines: ELT vs. ETL is part 2 in the Summer of SQL Series with Joe Hellerstein, professor at UC Berkeley and Co-Founder at Trifacta. For part 1, read Summer of SQL: Why It’s Back.
Q: Joe, as is true of many things in the data industry, ELT and ETL can seem like a bit of alphabet soup. What’s the simple breakdown?
A: In short, ETL stands for Extract-Transform-Load, and ELT is simply those words re-ordered.
This is pulling data from the place where it is generated. That could be any number of sources, depending on the industry. It could include an application like Salesforce or something akin to a transactional database that’s recording purchases, or it could be a collection of flat files. The Extract step is pulling any form of data from its current system and format.
This step is where data preparation and cleansing occur. Essentially, to transform the data is to make it uniform and usable. Whether this happens second or third in the process, data engineers will eventually want their data integrated into a whole that is useful for downstream analytics or machine learning.
Loading data is inputting it into a single storage environment like a data warehouse where the data can be queried and analyzed.
Then, there’s the question of order: transform before loading (ETL) or load before transforming (ELT)?
Traditionally, the transformation step occurred before loading, and as a result, the generic name for the process is ETL. This was standard practice, especially in the early days of “Big Data,” when databases were expensive to maintain and operate. For a data worker using something like Oracle back in the day, there was usually an administrator who ensured all data allowed into the database was cleansed and controlled first. In those days, the data warehouse needed to be the single source of truth; it needed to be perfect.
In the traditional ETL approach, the transformation step took place with programmatic tooling. In a modern, open-source stack today, that would be code-centric tooling based in Java or Python—tools like Spark, Pandas, or other options that primarily target software engineers.
The alternative approach, ELT, is growing in popularity. In this re-ordering of steps, the data is extracted from the source and loaded into the warehouse first. Then, transformation occurs within the warehouse.
ELT is increasingly attractive these days. Modern data warehouses are flexible and increasingly cost-effective, allowing us to store large volumes of data—even messy data that includes volumes of text and images. In this environment, transformations occur in the data warehouse, where the native language is SQL.
Compared to code-centric languages, SQL is a more targeted language for data transformation and querying, which we discussed earlier in this series.
Q: What tech trends are driving today’s shift from ETL to ELT?
A: Today, more companies choose to load data before transforming it for one key reason: cloud data warehouses. There are three major components as to why: cost, administration, and flexibility.
The cloud has made it possible to do ELT without paying the expensive costs of a legacy database vendor. Pricing for cloud data storage is often far less aggressive than that of traditional data warehouses.
Because of the nature of the cloud, businesses don’t have to hire multiple IT administrators to stand up machines to run the database. Cloud data warehouses handle the administration aspect for companies and are readily scalable as needs grow.
Cloud data warehouses enable users to run SQL queries not only on data stored in the warehouse but also on data that is stored in inexpensive cloud storage like Amazon S3, Google Cloud Storage, or the like. In this instance, data is essentially in files and not extracted but can still be queried directly. It may not be the fastest way to access data, but this availability means it’s possible even to take an “E-T” approach and still use SQL as if the process had been ELT, with data loaded into the database.
Q: With ELT, what do I do if I have really messy data in my database?
A: Don’t worry! For any serious data shop, there is messy data somewhere. It’s the inevitable challenge for our times, with the vast quantities and availability of data today. The data starts from somewhere, and the various starting places have various levels of quality, formats, and attributes.
If data comes from disparate sources, it can quickly become a mess without proper data wrangling. That mess can live in files, or that mess can live in the database, but it will need cleaning either way. People like to joke that they aren’t dealing with a “data lake” but rather a “data swamp” —the same can be true of a data warehouse for shops that embrace ELT. It’s not the architecture; it’s the data.
So the relevant questions become these: Where is the data stored, and what language will transform and govern it? Do you have the right people in place to support good processes, and does your stack fit your recruiting pipeline? E.g., if you’re targeting tools for software engineers, are you hiring a software engineering team? That brings us back to the top: ELT or ETL?