The Importance of Data Wrangling for Machine Learning

April 10, 2018

To businesses across every industry, machine learning and artificial intelligence have the potential to drive huge benefits. But what does it look like to implement a machine learning initiative in practice, and what are the pitfalls that organizations often run into along the way? To the latter question, Harvard Business Review has a clear answer: bad data, which they call “enemy number one.”

To learn more about machine learning and the importance of wrangling data for it, we sat down with André Balleyguier, a data scientist at DataRobot. DataRobot accelerates the creation of the AI-Driven Enterprise by automating the critical machine learning process of building and deploying highly accurate predictive models. To André, Trifacta and DataRobot are a perfect pair for accelerating data preparation and model development, leading to faster, more accurate predictions.

1. Machine learning has become somewhat of a buzzword, but what does machine learning actually look like in practice?

Machine learning has indeed become a buzzword! But it’s not a new concept; in fact, it’s been around for decades. When you break it down, machine learning is a set of techniques that enables computers to “learn” patterns and rules from historical data. If you think of a computer as the student, and a data scientist as the professor, then machine learning algorithms are teaching methods and historical data are homework. Once computers have learned from their “homework” and developed models, they can make automated decisions on new data. Ultimately, this is what makes it possible for artificial intelligence to scale—without machine learning, having to manually program all of the possible scenarios for each user interaction would be near impossible.

Today, with the ever-increasing amount of data and computing power available, we see more and more companies adopting machine learning to optimize all parts of their business. In practice, you can already find machine learning systems in many facets of your life: when your bank declines a suspicious transaction, when your phone operator contacts you with a discounted offer, or when your inbox filters junk emails.

2. How are companies leveraging machine learning today? What are some of the challenges that arise along the way?

Some data-driven industries, such as social media or e-commerce, are very advanced with machine learning initiatives because doing so is critical to remaining competitive. However, the vast majority of industries are still in the early stages of adoption, primarily due to a few key challenges. First, it’s difficult and expensive to build up a data science team that can deploy machine learning. Second, it’s not an easy leap for business executives to understand the value of machine learning in the first place: spotting machine learning opportunities with high potential value requires a lot of education. Finally, a lot of businesses have difficulty leveraging their data because it is locked in siloed data warehouses and requires significant effort to access and standardize.

3. What is the importance of wrangling data for machine learning, and what are the potential risks of having quality issues? Can you share any examples of this that you’ve seen play out amongst customers?

As I mentioned, historical data is foundational to machine learning: it is what allows computers to learn and advance their artificial intelligence. So, it follows that if the data is dirty and contains meaningless or unreliable information, then the algorithms will not derive any valuable patterns. The concept of “garbage in, garbage out” is particularly true in machine learning. If your data is not cleaned and prepared in the way that you need it to be, there is a risk that your models will make unexpectedly wrong decisions that could affect your customers or your revenue. It is critical to understand the limitations of the data you use as an input and what you can expect from your model results.

I saw this firsthand while working on a banking project. The objective was to create a system that could identify customers who might be interested in purchasing an asset. One key challenge was building a single view of each company by integrating datasets gathered across the entire business and collected by different teams (traders, sales, market providers, etc.). There was no “unique key” available to identify a customer across the business, and, in some cases, more than 80% of the data was missing because we were unable to match customers across the different sources. Our initial model seemed surprisingly good at identifying customers likely to purchase, but we soon discovered the model was actually “cheating” by leveraging the poor data quality. If these decisions had been used in production, the algorithm could have harmed the business.
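The interview does not say exactly what the model exploited, so the following is only a sketch of one common way this kind of "cheating" can be spotted: checking whether the mere fact that a field is missing lines up with the outcome. The function name, the field names (`trader_notes`, `purchased`), and the toy records are all hypothetical.

```python
def missingness_report(rows, field, target):
    """Compare the target rate for rows where `field` is missing vs. present."""
    missing = [r[target] for r in rows if r.get(field) is None]
    present = [r[target] for r in rows if r.get(field) is not None]

    def rate(values):
        # Average of 0/1 outcomes; None when there is nothing to average.
        return sum(values) / len(values) if values else None

    return {
        "missing_share": len(missing) / len(rows) if rows else None,
        "target_rate_when_missing": rate(missing),
        "target_rate_when_present": rate(present),
    }


# Hypothetical toy data: a field that is only populated for matched customers.
customers = [
    {"id": 1, "trader_notes": "active", "purchased": 1},
    {"id": 2, "trader_notes": None, "purchased": 0},
    {"id": 3, "trader_notes": None, "purchased": 0},
    {"id": 4, "trader_notes": "active", "purchased": 1},
    {"id": 5, "trader_notes": None, "purchased": 0},
]

report = missingness_report(customers, "trader_notes", "purchased")
print(report)
```

A large gap between the two target rates suggests that a model could score well simply by learning which records failed to match, which is a property of the data collection rather than of the customer.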

4. How does data wrangling impact customers with the ability to add new variables and data sources? How does this affect the outcome of their machine learning model?

Data wrangling can be a very time-consuming part of a data scientist’s job. A machine learning project is a highly iterative process in which data wrangling is a critical step. It usually goes something like this:

  • An analyst will typically start with a small set of available data that needs to be wrangled, and will assemble a dataset with a “target variable,” i.e., an outcome that the business would like to predict. For example: How much will we sell next month? Is this transaction fraudulent?
  • After some initial data exploration, the analyst will build the first model.
  • In most cases, the first model will not be good enough, forcing the analyst to find additional data sources or transformations needed to improve the model.
  • The analyst will then need to source new data, wrangle it together with the original dataset, and build and evaluate new models again. And so on.
  • The final model is then deployed into business processes.
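The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any particular product works: the "model" is a simple lookup table standing in for a real learning algorithm, "sourcing new data" is reduced to adding one more column, and every name (`fit_lookup_model`, `iterate`, `amount_band`, `foreign`, `fraud`) is hypothetical.

```python
from collections import Counter, defaultdict


def fit_lookup_model(rows, features, target):
    """Learn the most common target value for each combination of feature values."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[tuple(r[f] for f in features)][r[target]] += 1
    # Fall back to the overall most common target for unseen combinations.
    default = Counter(r[target] for r in rows).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return lambda r: table.get(tuple(r[f] for f in features), default)


def accuracy(model, rows, target):
    return sum(model(r) == r[target] for r in rows) / len(rows)


def iterate(train, holdout, candidate_features, target, good_enough):
    """Add one candidate feature per iteration until the holdout score is good enough."""
    features, history = [], []
    for new_feature in candidate_features:
        features.append(new_feature)  # stands in for "source and wrangle new data"
        model = fit_lookup_model(train, features, target)
        score = accuracy(model, holdout, target)
        history.append((list(features), score))
        if score >= good_enough:
            break  # this model would move on to deployment
    return features, history


# Hypothetical toy data for the "is this transaction fraudulent?" question.
train = [
    {"amount_band": "high", "foreign": True, "fraud": 1},
    {"amount_band": "high", "foreign": False, "fraud": 0},
    {"amount_band": "low", "foreign": True, "fraud": 1},
    {"amount_band": "low", "foreign": False, "fraud": 0},
    {"amount_band": "high", "foreign": False, "fraud": 0},
    {"amount_band": "low", "foreign": False, "fraud": 0},
]
holdout = [
    {"amount_band": "high", "foreign": True, "fraud": 1},
    {"amount_band": "low", "foreign": False, "fraud": 0},
    {"amount_band": "low", "foreign": True, "fraud": 1},
    {"amount_band": "high", "foreign": False, "fraud": 0},
]

final_features, history = iterate(
    train, holdout, ["amount_band", "foreign"], "fraud", good_enough=0.9
)
for features, score in history:
    print(features, score)
```

Keeping the evaluation on a separate holdout set and recording every iteration in `history` makes it easy to see which added data source actually moved the score.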

Within a single project, there could be dozens of iterations. In fact, I’ve seen a lot of data science projects eventually fail because they take too long to produce results. In order to optimize chances of success, it is critical to reduce the overall iteration time and to adopt a “fail fast” approach. In my opinion, the ability to accelerate data wrangling and integrate it with a machine learning framework is a key component for achieving this outcome—it allows for faster time to results, and more opportunity to engage with key stakeholders.

5. What types of users do you see wrangling data and leveraging machine learning? To that end, what capabilities do you consider important in a data wrangling technology?

In the past, machine learning projects have been almost exclusively owned and deployed by expert data scientists. Engineers, too, have been part of this process to build reliable systems that process data and integrate models into applications.

However, an increasing number of modern technologies are lowering the barrier to entry and enabling business analysts (or other non-expert users) to wrangle data and to build and deploy machine learning models themselves. When working with data wrangling technologies geared toward these end users, the capabilities that I’ve considered important are the ability to:

  • Integrate data from various data sources
  • Visually display the contents of data to help guide the right transformations
  • Ensure that the process of wrangling data is as intuitive as possible, ideally by reducing the amount of coding required
  • Allow for reusable data transformation pipelines
  • Scale to work with huge amounts of data and integrate with big data standards
  • Easily integrate the wrangled data into a machine learning framework to build the models and mine the data
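To make "reusable data transformation pipelines" concrete, here is a minimal sketch in plain Python. It is not how Trifacta or DataRobot implement pipelines; the helper `make_pipeline`, the step functions, and the sample records are hypothetical and only illustrate the idea of defining transformations once and applying them to each new batch of data.

```python
def make_pipeline(*steps):
    """Combine row-level transformation steps into one reusable function."""
    def run(rows):
        cleaned = []
        for row in rows:
            for step in steps:
                row = step(row)  # each step returns a new row dict
            cleaned.append(row)
        return cleaned
    return run


def strip_whitespace(row):
    # Trim stray spaces from every string value.
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def parse_amount(row):
    # Convert the text amount into a number.
    return {**row, "amount": float(row["amount"])}


# Define the pipeline once...
clean = make_pipeline(strip_whitespace, parse_amount)

# ...and reuse it on each new batch (hypothetical data).
january = clean([{"customer": " Acme ", "amount": " 12.50 "}])
february = clean([{"customer": "Globex", "amount": "7"}])
print(january, february)
```

Because each step is a small function, steps can be reordered, tested in isolation, or shared between projects.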

6. What do you see as the future of data wrangling and machine learning? With the right technologies, what do you think companies will be capable of?

In my opinion, it is essential that data wrangling technologies work hand-in-hand with machine learning technologies, and that they involve domain experts and business stakeholders in the process as much as possible. I have talked to multiple companies where the teams wrangling data were working independently from the teams building machine learning models, and it led to a dramatically slower iteration cycle.

By providing an intuitive interface for business users, a high degree of automation, and a flexible and transparent environment, modern technologies like Trifacta and DataRobot allow for a much wider range of business experts to drive machine learning projects. These technologies help place the domain knowledge at the forefront of those initiatives. Moreover, I’ve also seen data scientists leverage these technologies to be more productive and free up their time for more complex problems. With successful adoption, companies can address the demand for machine learning and become truly data-driven enterprises.
