Just a few weeks into 2022, we’re already learning about what the year has in store for us. It’s not too late to take a look back, and ahead. In a recent episode of The Data Wranglers podcast, Joe Hellerstein and I did just that. We identified three drivers of changes in how we look at data, and it’s worth bringing these thoughts to the page, too. (Looking for the episode? Listen here.)
These were our three big takeaways from data in 2021:
- The continued rise of the cloud
- Issues in data ethics
- The continued COVID pandemic
The Cloud: Not Just Old News
In late 2021, many saw a small spat online between Snowflake and Databricks about which product was better, TPC benchmark results, value, and so forth. It was generally polite and in good spirit, but it does highlight an interesting point. Of course, many people see value in both systems, and that’s somewhat unspoken. There is an assumption that all modern data instances are running in the cloud, and that the cloud is old news because this was predicted years ago. We’ve seen the rise of cloud computing as the place where data work gets done.
But I think this year, we saw even more of the industry and a lot more government services continuing to move to the cloud. This transition brings many benefits, particularly the flexibility to pick best-of-breed tools. There is newfound freedom to select the best solution for tasks like cleaning or moving data, whether the data is in Google BigQuery, Snowflake, Databricks, or other databases. Then, there are reporting and visualization options with tools like Tableau and Looker, ultimately allowing organizations to put together the best custom data environment.
Snowflake cofounder Benoit Dageville spoke about their vision of the Data Cloud at the Wrangle Summit, the industry’s first data engineering cloud conference, hosted by Trifacta. One thing Benoit said that really struck us is his slogan that “In the cloud, fast is free.”
His anecdote was about a customer who used to run a big job on an on-premises database, where it took the whole weekend. After switching to Snowflake, the customer could spend the same amount of money on roughly the same number of machines and again run the job over the weekend, or spend that very same amount on far more machines and finish in an hour. The cost is the same; only the latency changes. In the cloud, fast is free.
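The arithmetic behind that slogan can be sketched in a few lines. This is only an illustration under assumptions that are mine, not Benoit's: billing is strictly per machine-hour, the job scales perfectly linearly across machines, and the machine counts and price are made-up numbers.

```python
def job_cost(machines: int, hours: float, price_per_machine_hour: float) -> float:
    """Total bill under simple per-machine-hour pricing (an assumption)."""
    return machines * hours * price_per_machine_hour


def machines_for_deadline(base_machines: int, base_hours: float, target_hours: float) -> float:
    """Machines needed to finish in target_hours, assuming perfectly linear scaling."""
    machine_hours = base_machines * base_hours
    return machine_hours / target_hours


# Hypothetical numbers: 10 machines busy for a 48-hour weekend.
PRICE = 2.0  # made-up price per machine-hour
weekend_cost = job_cost(10, 48, PRICE)
fast_machines = machines_for_deadline(10, 48, 1)
one_hour_cost = job_cost(int(fast_machines), 1, PRICE)

print(weekend_cost, fast_machines, one_hour_cost)  # 960.0 480.0 960.0
```

Real workloads rarely scale perfectly, so in practice the one-hour run may cost somewhat more, but the machine-hours stay roughly constant and that is the point of the anecdote.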
What about visualization tools in the cloud? Visualization is my area of study, and in terms of the major approaches to basic exploratory visualization and reporting, the models that have worked in previous years continue to be valuable. Now, though, you are more likely to be pulling from a cloud database than hitting an on-prem one. Visualization features such as the ability to publish reports are following suit and moving into the modern environment.
The more interesting potential lies in domain-specific applications of visualization that go beyond the standard BI and reporting world. A number of young companies have been making progress on that front, particularly when it comes to machine learning workloads and how to make sense of all the training data and test data to monitor for potential bias and other data quality issues. This leaves an opportunity to define the role of visualization in aiding model explanation and model assessment.
The Ethics of Data: Understanding The Impact and Bias
Speaking of bias, the second big theme of this past year and continuing into 2022 is data ethics.
Machine learning remains a juggernaut. Alongside that, we are seeing a growing public awareness of some of the pitfalls of ML and AI, even with general data work. These pitfalls include bias in the data that we’re using to train and deploy models and also bias introduced by modeling approaches themselves. We see this play out in continued strong interest for explainable AI methods.
In the public sphere, films like Coded Bias on Netflix have brought a lot of these issues to the forefront, as has the firing of AI ethics leads such as Timnit Gebru and Margaret Mitchell at Google, in part for writings critical of their employer. People are asking how well the industry can regulate itself.
There are examples online of deployed classifiers failing badly, including photo collection software that tags people as animals. Needless to say, this is deeply insulting to people’s basic humanity. In my own research, we’ve built visualization tools to look at things like word embeddings, where you train a model on tons of documents and then consider terms that are definitionally “male” or “female,” such as father, mother, son, daughter, etc. Then, we find the attribute vector that connects them and we look at how all the other terms are distributed relative to that. It’s immediately obvious there are all kinds of stereotypes, many of which are completely inappropriate and harmful to downstream tasks. From the data coming in, to how people are represented, to intermediate models, decision-making issues, and so on, these are issues we have to put front and center and deal with as data professionals.
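For readers who want to see the mechanics, here is a minimal sketch of the attribute-vector idea described above. It is not our research tooling: the three-dimensional vectors are toy values invented for illustration, the neutral names term_a, term_b, and term_c stand in for real vocabulary, and averaging the pair differences is just one simple way to estimate the direction.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def attribute_direction(pairs, emb):
    """Unit vector along the mean difference of the definitional pairs."""
    diffs = [sub(emb[m], emb[f]) for m, f in pairs]
    mean = [sum(component) / len(diffs) for component in zip(*diffs)]
    length = norm(mean)
    return [c / length for c in mean]


def lean(term, direction, emb):
    """Cosine similarity between a term and the attribute direction."""
    vec = emb[term]
    return dot(vec, direction) / norm(vec)


# Toy embedding: made-up vectors, NOT output from a real model.
emb = {
    "father": [1.0, 0.2, 0.0], "mother": [-1.0, 0.2, 0.0],
    "son": [0.8, 0.0, 0.3], "daughter": [-0.8, 0.0, 0.3],
    "term_a": [0.5, 1.0, 0.0], "term_b": [-0.5, 1.0, 0.0],
    "term_c": [0.0, 1.0, 0.0],
}

direction = attribute_direction([("father", "mother"), ("son", "daughter")], emb)
for term in ("term_a", "term_b", "term_c"):
    # Positive leans toward the "male" end, negative toward the "female" end.
    print(term, round(lean(term, direction, emb), 3))
```

With a real embedding, you would plot these scores for thousands of terms; the stereotypes show up as occupation or personality words sitting far from zero when nothing in their definition should put them there.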
In The Data Wranglers podcast episode on this topic, Joe Hellerstein shares stories of how AI technologies in the criminal justice system can be used to convict people and even put them on death row, and he notes that those models are not necessarily explainable. It’s worth listening to the episode to hear his full take and the research he’s working on at UC Berkeley: trifacta.com/podcasts.
The COVID Pandemic: How Do We Use Data
What’s a look back at 2021 without discussing the COVID pandemic? From a data perspective, over the last 18 months or more we’ve seen a proliferation of both data analysis and visualization in response to the pandemic. Visualization and various forms of data crunching have come to the forefront of the public consciousness. Who would’ve predicted in early 2020 that people would be familiar with seven-day moving averages, per capita rates, and even SIR epidemiological models?
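For anyone who has seen those terms on a dashboard without seeing the math, here is a minimal sketch of all three. The assumptions are mine: a trailing (not centered) seven-day window, rates expressed per 100,000 people, and a basic discrete-time SIR update with hypothetical beta and gamma values rather than anything fitted to real data.

```python
def moving_average(values, window=7):
    """Trailing moving average; the first window-1 entries are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def per_100k(count, population):
    """Scale a raw count to a rate per 100,000 residents."""
    return count * 100_000 / population


def sir_step(s, i, r, beta, gamma, dt=1.0):
    """One Euler step of the SIR model (susceptible, infected, recovered)."""
    n = s + i + r
    new_infections = beta * s * i / n * dt
    new_recoveries = gamma * i * dt
    return s - new_infections, i + new_infections - new_recoveries, r + new_recoveries


daily_cases = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up counts
print(moving_average(daily_cases)[-2:])  # [4.0, 5.0]
print(per_100k(50, 1_000_000))           # 5.0

# Hypothetical parameters: beta is the transmission rate, gamma the recovery rate.
state = (990.0, 10.0, 0.0)
for _ in range(30):
    state = sir_step(*state, beta=0.3, gamma=0.1)
print([round(x, 1) for x in state])
```

The moving average smooths out weekday reporting artifacts, the per capita rate makes regions of different sizes comparable, and the SIR loop shows why small changes in beta relative to gamma change the whole trajectory.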
At the same time, however, we’re seeing numerous examples of misleading analyses and visualizations. These range from the ham-handed to the statistically savvy, and many of them torture and willfully misinterpret a dataset to arrive at a foregone conclusion. As we consider media literacy alongside data and statistical literacy, I think we’re seeing this play out and evolve in very different ways in the public sphere.
This brings us, I hope, to a more optimistic note going forward. I recently had a great conversation with Francoise Pickart at the Washington State Department of Health. She made the interesting observation that we have seen 50 years of disinvestment in public health infrastructure, but that’s changing in large part due to the response to COVID. Public health data systems are needed to aid epidemiologists and policymakers, and they’re getting more investment now. This includes modernizing the data stack, moving to the cloud, pulling in lots of different data sources, and then processing them in real time to provide guidance and organize the response to the pandemic. Not only should this help with our current situation, but it should also lay the groundwork so that we’re better prepared to identify, monitor, and respond to future diseases.
If 2022 is anything like the last two years, predictions are truly a guess at best. But learning from 2021, we know that the expansion of the cloud (and specifically cloud data engineering), a greater reckoning with data ethics, and the impact of the pandemic on how we look at data will all continue to be relevant.