Tye Rattenbury, Trifacta’s Director of Data Science and Solutions Engineering, and Trifacta’s co-founders, Joseph Hellerstein, Jeffrey Heer and Sean Kandel, have been busy authoring a book, “Data Wrangling: Techniques and Concepts for Agile Analytics,” which is previewing at Strata NYC right now. In this post, we sit down with Tye to understand the motivations behind the book.
Q: The movement towards widespread adoption of big data and agile analytics has been building for quite some time. Since it was introduced over a decade ago, big data has made a significant impact by adding value and insights across a variety of industries. What do you see as the key drivers of this movement?
Tye: The drive for big data and data preparation has been growing at an increasing pace. While certainly not an exhaustive list of factors, three key ones are: a widening adoption of new infrastructures, increasing business competition and emerging cultural shifts around data-driven decision making.
New infrastructures that can collect and store larger volumes of data and metadata are a beneficial resource for companies to tap into when making business decisions. These volumes of data can contain key insights on customer preferences or potential operational efficiencies, for example.
Having the opportunity to leverage data to maximize and justify business decisions is an extremely important asset for gaining competitive advantage. Being able to predict what customers want, the risks in an investment, or the potential shift in the value of a good or service can propel a business far ahead of its competitors. Across every industry, leading companies are increasingly leveraging these kinds of predictions. Their advantage is pulling industry laggards into adopting big data practices.
Finally, while many companies have leveraged data to drive financial gains through one-off projects or specific initiatives, a growing minority have shifted their entire organizations to become data-driven, meaning every decision, small or large, leverages analytics as a crucial voice at the table.
Q: Historically, significant analytics projects relied on the same general process: typically an IT team gathers, cleans and sorts the data before handing it off to marketing and business professionals for analysis. How does agile analytics change this process?
Tye: The process you sketch out has some known problems. For example, companies that split their data analysis process into two distinct phases, preparation and analysis, often remove the data analysts who have the most context about the potential business value of the data from the key cleaning and preparation activities.
Q: So you’re saying that since analysts have the responsibility of uncovering insights that will ultimately be used to implement any necessary business changes, they need to be involved in the preparation process? Otherwise they could potentially miss out on the opportunity to truly explore their raw data?
Tye: That’s right. “Data wrangling” is the crucial process of preparing data for analysis, and it’s one that non-technical business professionals should drive, or at least be significantly involved in. When non-technical analysts are empowered with the right tools to wrangle data themselves, they expose a greater breadth of insights from their data, which leads to better follow-up questions. And the virtuous cycle continues: insights -> questions -> insights -> questions…
Q: Got it. What other problems are there with a separated data cleaning and preparation process?
Tye: A wider set of problems is really about cultural shifts around data-driven decision making. As infrastructures for data and metadata are increasingly adopted, there needs to be a parallel and synchronized spread of “data literacy.” At its core, data literacy is about spreading the skills to work with data, to open up the black box of analytics, if you will. This is all about education. And, as the community of data-literate people grows, the oversight, consistency and quality of analytics will also grow. In the long run, businesses driving value from data can draw on these communities to share best practices specific to their data and move toward a wider citizenship model of governance.
Q: So I know Trifacta’s goal is to empower non-technical analysts as well as technical data scientists with the ability to explore and transform data, but how does the book fit in?
Tye: The book is designed to bring some coherence and conceptual consistency to the practice of data wrangling. This is largely a language problem: key concepts and relationships need to be defined so that people can share and build on each other’s wrangling efforts. Today, people write custom code for each project, amassing a personal library of functionality, and ways of talking about that functionality, that even their closest “collaborators” don’t use.
If you’re interested in our book, you can request a free preview of the first three chapters. For more information about data wrangling, check out the resources on our website as well as Tye’s contributed articles in Datanami and Database Trends and Applications. If you prefer a more hands-on data wrangling experience, contact our sales team at firstname.lastname@example.org to ask for a free Wrangler or demo so you can check out Trifacta’s solution for yourself.