In Part 3 of this blog series, we will be looking at how Trifacta helps improve accuracy, speed, and ease when developing machine learning and AI pipelines. If you missed the first two parts, Part 1 of this blog series gave an overview of the immediate value organizations can realize by adopting Trifacta for Snowflake, and Part 2 discussed the ways in which Trifacta can accelerate Reporting and Analytics use cases on Snowflake.
To start, let’s take another look at the following diagram.
In the past, organizations created rigid ETL pipelines for unchanging transactional data. As data has become more granular, moving from transactions to interactions, ETL can no longer cover the full spectrum of potential uses of the data or meet the need for agility. Instead, organizations are adopting an EL-T approach, where data ingestion platforms load less refined data into a central location and the data is then transformed for its downstream purposes. Separating transformation from ingestion allows data to be stored once and used everywhere, rather than stored with specs fit for a single purpose. With this change in paradigm, data scientists now have access to a huge variety of raw data through a centrally managed location. Data scientists don’t need to go out and collect their own data, and IT teams have assurance of proper governance and lineage.
Machine Learning and AI
How does this relate to machine learning and AI? Organizations continue to struggle with the time-consuming, difficult process of preparing data for machine learning and AI. A commonly cited estimate is that roughly 80% of the time on a data science project is spent wrangling the data. Compounding the problem, machine learning models and AI require high-quality data in order to be effective. Traditional tooling requires separate processes for profiling the data, cleaning and preparing the data, and validating the data’s quality. Because this challenge is so widespread, organizations that solve it have a huge opportunity to compete on differentiated data.
We see this commonly in insurance, retail, IoT, and financial services, where organizations are looking to leverage large volumes and varieties of data to gain insights into customer behavior, price optimization, fraud detection, and more. Data scientists are working with complex data formats, raw text, sensor data, and various other forms of structured and semi-structured data. These datasets require a great deal of upfront work to get them into a useful state, followed by the additional work of blending them with other data sources and engineering features to produce effective training data.
Trifacta’s visual guidance and built-in machine learning create an interface focused on ease of use, instant validation, and powerful transformations, improving development efficiency and reducing time-consuming, tedious debugging of code. This shortens the time to deploy models in supported data science platforms like Amazon SageMaker and DataRobot, as well as other open source ML and AI technologies.
Trifacta for Machine Learning – Loan Defaults
In this walkthrough, we’re going to prepare a training dataset for a model that predicts the likelihood of loan defaults. Trifacta accelerates time to value for machine learning projects by taking the 80% problem head on. Data scientists gain efficiency by using Trifacta for data profiling, data cleaning, and feature engineering. You’ll see how Trifacta accelerates data cleaning and feature engineering by eliminating many of the steps that are painful or tedious to accomplish in code. We’ll cover each of these aspects and how they interrelate as we walk through the loan default example.
Any machine learning project requires a high level of familiarity with the data used to train and deploy models. When working with unfamiliar data, it is essential to get a grasp of the values, data types, anomalies, and distributions you are dealing with. To get a quick look at this first dataset, we can check the columns view and see that quite a few columns have few or no valid values. We can start by removing these columns, as they won’t be important to our models.
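For readers who want to see what this profiling step amounts to outside of the visual interface, here is a minimal sketch in pandas. It is not Trifacta’s implementation. The column names, the sample rows, the `drop_sparse_columns` helper, and the 50% threshold are all assumptions made up for illustration.

```python
import pandas as pd


def drop_sparse_columns(df, min_valid_ratio=0.5):
    """Keep only columns whose share of non-missing values meets the threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    valid_ratio = df.notna().mean()  # fraction of non-missing values per column
    keep = valid_ratio[valid_ratio >= min_valid_ratio].index
    return df.loc[:, keep]


# Made-up sample standing in for the raw loan dataset.
loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "loan_amount": [5000.0, 12000.0, None, 8000.0],
    "legacy_code": [None, None, None, "A1"],  # mostly empty column
})

# A small per-column profile: data type and share of valid values.
profile = pd.DataFrame({
    "dtype": loans.dtypes.astype(str),
    "valid_ratio": loans.notna().mean(),
})
print(profile)

trimmed = drop_sparse_columns(loans, min_valid_ratio=0.5)
print(list(trimmed.columns))
```

In this sample, `legacy_code` is only 25% populated, so it falls below the assumed threshold and is dropped, while the other two columns are kept.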
As we continue to profile our data, we may notice some data quality issues we want to address. Trifacta has a variety of ways to allow you to clean up data quality issues, ranging from guided to highly customized steps.
The next step in creating a feature-rich dataset might be to blend two datasets into one. Trifacta makes joining unfamiliar datasets simple by suggesting join keys and providing plenty of profiling information for instant validation. There is no need to run the script and check the results. Take a look at the example below.
Then we may want to normalize and scale numeric columns, or one-hot encode categorical columns. These are demonstrated below.
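To make the idea of join validation concrete, here is a rough pandas sketch of the checks a data scientist would otherwise code by hand. It does not show how Trifacta suggests join keys. The `customer_id` key, the column names, and the sample rows are assumptions for illustration only.

```python
import pandas as pd

# Made-up samples standing in for the two datasets being blended.
loans = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "loan_amount": [5000, 12000, 8000],
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 104],
    "annual_income": [42000, 85000, 63000],
})

# Left join on the assumed key. `validate` raises an error if the right-hand
# key is not unique, and `indicator` adds a `_merge` column showing which
# rows found a match.
joined = loans.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Summarize match quality before trusting the join.
match_counts = joined["_merge"].value_counts()
unmatched = joined[joined["_merge"] == "left_only"]
print(match_counts)
print(unmatched[["customer_id", "loan_amount"]])
```

Here the loan for customer 103 has no matching customer record, which is the kind of gap that profiling a join is meant to surface before the data reaches a model.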
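As a point of comparison, the same two feature engineering steps can be sketched in pandas. This is one common way to do it (min-max scaling and `pandas.get_dummies`), not a description of Trifacta’s transformations. The `min_max_scale` helper, the column names, and the sample values are assumptions for illustration.

```python
import pandas as pd


def min_max_scale(series):
    """Rescale a numeric column to the range [0, 1].

    Hypothetical helper; a constant column is mapped to all zeros
    to avoid dividing by zero.
    """
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


# Made-up sample of one numeric and one categorical feature.
features = pd.DataFrame({
    "loan_amount": [5000.0, 10000.0, 15000.0],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Scale the numeric column into a new feature.
features["loan_amount_scaled"] = min_max_scale(features["loan_amount"])

# One-hot encode the categorical column: one 0/1 indicator column per category.
encoded = pd.get_dummies(features, columns=["home_ownership"], dtype=int)
print(encoded)
```

Min-max scaling is only one choice. Standardizing to zero mean and unit variance is an equally common alternative, and the better option depends on the model being trained.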
Once we have wrangled our data into its output-ready format, we can run the job and publish the output back to Snowflake, where it will be used for training and deploying machine learning models.
Thanks for tuning in to Part 3 of this blog series. In Part 4, we will take a look at Data On-boarding use cases. If you are interested in trying Trifacta out for yourself, start today with Trifacta’s free 14-day trial!