We’ve talked about why SQL is back, ELT, and SQL for data transformation in this series. It’s clear data preparation and data wrangling are evolving.
As a professor at UC Berkeley, you taught SQL for data engineering this year in a brand new class format. How was that different than your old databases classes?
Berkeley has a massive and growing data science program—thousands of students per year. As we’ve been evolving the curriculum, it’s become clear that we need to target an entire course to data engineering. Now, I’ve been teaching database courses to budding software engineers in the CS department at UC Berkeley for 25 years. But the goals of data engineering are different than software engineering. So I got together with my colleague Aditya Parameswaran to develop a new course. It was interesting to think through the curriculum and how we need to teach SQL differently than in our established database classes.
When teaching SQL in a typical database class, we focus on simple single-shot queries that correspond to typical business analytics. That includes basic SQL functionality for filtering data, joining tables, counting, grouping… the sorts of queries that generate reports or dashboards.
In this new approach, we ended up emphasizing four steps to “SQL for Data Engineering” that go well beyond a traditional database class: data quality, data reshaping, what I call “spreadsheet tasks,” and data pipeline testing.
How would data quality typically be taught, and why is it different in SQL for data engineering?
Sadly, data quality assessments are mostly overlooked in traditional database classes. At best, we talk about SQL constraints: rules to prevent data from being entered if it’s not “as expected.” The thing is, the real world is full of data that we need to load despite it being full of unexpected data quality errors. Data engineers need to learn how to assess the quality of that unfamiliar data as a first step in transforming data for downstream use.
To break it down:
This is where we start. After extracting and loading data, we need to look at it: what’s there and how’s it shaped? We teach students to write queries that return what you might get from handy bar charts—you know, break each column of a table into bins and count the number of records per bin. This can get tricky with dirty data that’s a hodgepodge of strings and numbers: like when a person’s listed age as “newborn.”
A second issue is standardization, or what’s sometimes called entity resolution. Does your data have multiple names for the same real-world item—say “Apple iPhone 12 mini” and “5G iPhone 12 (mini)”? How can you know when that’s happening? What can you do to standardize on canonical names? A good data engineer needs to understand how to probe for these data quality problems. Students learn relevant techniques from text search, like using SQL to compute how different two strings are in terms of spelling or sound. Then, they use that to rank-order likely matches.
Next, we teach how to take a statistical sample of a large table and use it properly. Then, any remaining quality checks can go quickly. It’s funny—we teach sampling first thing in our data science classes, but it’s rare to cover it in a database class. It’s such a basic tool for dealing with large data! Beyond simple random samples, we teach advanced techniques like stratified sampling, too. These are basic tools for statistical data science, but surprisingly few people know how to implement these methods on large datasets.
Overall, the treatment of SQL in the data engineering class is far more holistic than in the traditional database class. And that begins with getting a sense of what’s in your unruly data.
You also mentioned reshaping, spreadsheet tasks, and data tests as the next core topics in SQL for Data Engineering. Can you break these down?
With data quality established, the next area we teach data engineers is to reshape the data with SQL.
When you’re doing data transformation—especially with data loaded from multiple sources—the rows and columns are typically not in the desired format for downstream work. The data may not even be very schematized. Maybe you have a big field of unstructured text that needs tidbits extracted into new structured columns.
In this reshaping step, we teach students how to wrangle all the tricky components of data representation, like extracting structure from text, normalizing tables, converting tables into matrices (tensors), and vice versa, dealing with nested data from sources like JSON, etc. These skills are different than skills taught in traditional database classes.
The third step in teaching SQL for data engineering is what I like to call “spreadsheet tasks.” These things might be pretty natural to do as a one-off in Excel but turn out to be more or less tricky on full tables in SQL—they usually involve SQL’s “window” queries, something that’s often considered esoteric in a database class.
One of my favorite examples is filling in missing values. Picture this: you have a spreadsheet with a column for “Temperature.” You have “70” in one cell, then a gap of 10 empty cells below that, then “60” below the gap. What should you fill in the missing cells? You could copy the 70 down into the empty cells, copy the 60 up into the empty cells, or fill the empty range with gradually decreasing values: 69, 68, 67, and so on. This is what data scientists call “data imputation.” It’s is necessary for downstream analytics that don’t do well with missing values, including many machine learning algorithms.
Well, this is pretty easy to do in Excel if you’re patching up a single hole in a spreadsheet. But it’s surprisingly complex to write as a general-purpose fix in SQL. A host of ordered operations like this make sense in a spreadsheet’s “grid view” but are tricky to generalize to robust SQL queries.
Another SQL pattern we teach in data engineering is testing for correctness and maintainability.
In data engineering, an ELT pipeline will have dozens of queries in it. The data that’s fed into the start of the pipeline is subject to change over time. So, a transformation pipeline is actually quite a lot of SQL code. In our computer science classes, we teach our budding software engineers to write tests for code in languages like Java and Python as a matter of course. We also need to teach our budding data engineers to write tests for data in their SQL pipelines.
How do you test the result of a SQL query? Broken queries often do not produce error messages. After all, they provide either no answers or wrong answers. A test query can be as simple as checking that the output of a given step is non-empty or has the expected number of rows. Or, we might want more specific checks based on common assumptions about the data. For example, we should not see a state name that we’ve never seen before. Or, the value in the refund column should not exceed the value in the price column, etc.
The most important tests are often the ones that validate the data transformations we write. Recall our example of filling in missing temperature data. We might want a test to ensure that we don’t invent any values by accident. For example, we might check that imputed values always fall in the range between the values that come before and after them.
With long pipelines consisting of many complex SQL queries, it’s essential to have these kinds of tests all along the way for reliability and maintenance.
We have covered a lot of territory in this Summer of SQL series. Where can folks go next for more information?
One good place to go is the website from our Berkeley Data Engineering course last spring. We have lecture notes and some interactive Jupyter notebooks as well. Keep in mind it’s a snapshot of the first version of that class. We have more content and improvements in the works!
Another resource I can recommend is to try out Trifacta in the cloud. Trifacta is an excellent environment for improving skills because you can move seamlessly from visual “no-code” work to SQL and back. It’s a visual environment where you can keep your eyes on the data and data quality indicators. Think of it like a spreadsheet supercharged for data engineering.
If you manipulate the data visually, Trifacta will write the corresponding SQL for you to read. It even has AI agents to recommend what to focus on in your data and how to transform it. At the same time, it’s coder-friendly: you can write your own SQL and incorporate it into the visual flows. That fluid “round-trip” from visual to code and back will grow your skillset gradually—and collaborate with folks whose skills may be different than yours.
Regardless of what tools you choose, let me close out by saying this: Always be looking at your data visually while you do your data engineering! Remember, data engineering is not software engineering. The code is not your product—the data that comes out of the pipeline is your product!
So, after our summer of SQL, let’s remind ourselves: in the end, it’s all about the data.