January 29, 2014

By Joe Hellerstein

Data Science: From Hubris and Machismo to Human-Centered Design

If you follow discussions of Big Data, you may have heard people bandying about a new phrase: “the death of the data scientist.” The crux of their logic seems to be that technology will automate away the need for data scientists. I’m friendly with many data scientists, and to misquote Mark Twain, I believe rumors of their deaths have been greatly exaggerated.

The idea of “automated data science” flies in the face of hard business realities. Today’s typical business analysts still do not have the technical ability to work with big data. Instead, they rely heavily on data scientists and IT professionals to go from theories and hunches to data sets, models, and analysis. Is the reliance on specialized technical folk a source of frustration in the field? Of course. And when it’s not addressed, it prevents organizations from gathering and using the data they need to stay competitive. But the problem is not going to be solved by declaring the death of data scientists! On the contrary, the solution depends on organizational will to embrace data science methodologies and talent. Annika Jimenez has written thoughtfully and provocatively on this front; DJ Patil’s writing on building data science teams is also relevant here.

More to the point, anyone promising to automate away the need for people in data analysis is engaging in pointless hubris. Data analysis is a process that fundamentally revolves around people, not just technology: people who can understand the links between business problems and relevant data, form hypotheses, and interpret the resulting numbers. At bottom, data science—like all science—is a creative human activity. Take away the science, and all you have is data. My colleague Jeff Heer is fond of citing the famous statistician John Tukey on this subject: “Nothing — not the careful logic of mathematics, not statistical models and theories, not the awesome arithmetic power of modern computers — nothing can substitute here for the flexibility of the informed human mind.”

On the Other Hand…
I will readily admit that the data science community bears some responsibility for the backlash against it. There is a persistent form of nerd machismo that says that serious data scientists have to be three things at once: programmers, statisticians, and businesspeople. In the early days of the Data Science phenomenon, my buddy Mike Driscoll, founder and CEO of Metamarkets, a real-time analytics platform, semi-famously stated that the “sexy skills” of the data scientist must cover “three key, yet independent areas: statistics, data munging, and data visualization.” As time has passed, very few of these triply-skilled data scientists have emerged. (Though the few whom I’ve met are alive and well, I’m happy to say. And they’re rare enough that they can write their own ticket—many are off on brave but idiosyncratic journeys.)

In more recent times, experienced and talented data scientists like Driscoll have gotten more realistic about setting the bar for successful data science. Given that the triple-threat data scientist is a rare bird, how do folks compensate?

One answer is with human resources: by forming teams of two or three people to cover these three areas of competency. DJ Patil speaks to this approach in his writing. This is a sensible solution, but it’s expensive—and infeasible for many organizations in a market in which skilled data people of all stripes are rare.

How Technology and Design Can Help
Another answer is for technology to work synergistically with human analysts to make data science a more productive and broadly accessible job. Over the past decade, we have seen how technology can aid people in data visualization. Companies like Tableau and open-source packages like D3.js and Vega.js have greatly expanded people’s ability to easily create useful and elegantly designed visualizations.

The data analysis process needs a similar technological boost. The key problem is data transformation: what Driscoll called “munging”, and others call “data wrangling”. This set of tasks takes as much as 80 percent of a typical analyst’s or data scientist’s time, inhibiting their ability to find insights and improve business processes. Data transformation traditionally involves significant custom coding or scripting—it is more Software Development than Science. This has made data transformation inaccessible to most business analysts, and unattractive to quantitative data scientists, who tend not to be trained software engineers. Driscoll, who has a degree in the life sciences, equated data munging with “suffering”. Quants have a thirst for data, but they prefer it served neat.
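To make “munging” concrete, here is a minimal sketch of the kind of one-off script this work tends to require. Everything in it is hypothetical: the file names, columns, and formats are invented for illustration. But the shape of the work (skipping bad rows, reconciling date formats, scrubbing currency strings) will be familiar to anyone who has done it.

    # Hypothetical cleanup script. The file, column names, and formats are
    # invented for illustration; only the shape of the work is typical.
    import csv
    from datetime import datetime

    def clean(in_path, out_path):
        with open(in_path) as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=["user_id", "date", "revenue"])
            writer.writeheader()
            for row in reader:
                # Drop rows with no identifier: a judgment call that depends on
                # business context, not on any general-purpose rule.
                if not row.get("user_id"):
                    continue
                # Dates arrive in more than one format; try each in turn.
                raw = (row.get("order_date") or "").strip()
                parsed = None
                for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
                    try:
                        parsed = datetime.strptime(raw, fmt).date()
                        break
                    except ValueError:
                        pass
                if parsed is None:
                    continue
                # Revenue shows up as "$1,234.56" in some extracts and "1234.56"
                # in others; normalize before converting.
                revenue = float(row["revenue"].replace("$", "").replace(",", ""))
                writer.writerow({"user_id": row["user_id"],
                                 "date": parsed.isoformat(),
                                 "revenue": revenue})

    clean("orders_raw.csv", "orders_clean.csv")

None of this is intellectually hard, yet every dataset demands its own variant, and every branch in that script encodes a judgment that only someone with the relevant business context can make.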

On this front, I believe that carefully-crafted technology can indeed remove much of the burden. The key is to recognize first that data transformation for analysis is custom work, which needs to be steered by the humans involved in the analysis. The whole “death of the data scientist” meme only leads to failed attempts at automated solutions that are too brittle for people to work with. Following this observation, the challenge is to develop new transformation technologies that are artfully designed to fit with and enhance the skills and knowledge of the analyst, be they a business person or a data scientist: leveraging human intuition and business context, promoting the ability to explore raw data, and encouraging agile iterative work at the speed of thought.

I’m passionate about this topic because it’s a very big deal, and creative solutions are within reach. Data transformation is at the crux of data analysis in the large, and key to the practical use of data within business. It is the biggest technical bottleneck in effective data analysis today. And I believe it can be made much, much easier and more accessible—to business analysts as well as data scientists—if we take an open-eyed, human-centric approach to technology, and stop declaring the “death” of anybody.