How to Use Modern Data Prep to Optimize AI and ML on AWS

September 24, 2019

At a recent Trifacta-hosted AWS DevDay in San Francisco (a free, hands-on workshop co-sponsored by AWS and partners), we shared tips and tricks on how to use a cloud-native, intelligent data prep solution to accelerate AI and ML use cases on AWS.

Next week we are bringing the DevDay to NYC. Please join us on Oct. 3 for a half-day, free workshop in downtown Manhattan.

Your AI or ML result is only as good as the data you use to train and test your models. Getting the dataset right is step #1, and perhaps the most critical one for the success of your AI and machine learning project.

However, getting high-quality datasets for AI and machine learning can be a daunting and time-consuming task if you don’t have the right tool to prepare your data. Many attendees at our workshop, most of them data scientists, told us they either spend the majority of their time writing code to prepare data, or rely on the lightweight data prep capabilities included in their analytics solutions to fix it. Both approaches are costly and lead to less-than-ideal outcomes. According to the attendees, the coding approach is rigid: whenever the data changes, whether through a variation in the schema or the addition of a new data source for enrichment, the code can often break. In addition, hand-written code gives them no built-in way to assess data quality before and after transformations are applied, which reduces their confidence in the dataset used for model training. As a result, data prep for AI and machine learning becomes extremely time-consuming, error-prone and costly. That is what prompted these users to come to our workshop.

Trifacta offers a best-in-class, machine learning-powered solution to streamline the entire data prep process for all types of users, ensuring clean, relevant data is quickly and easily available for your ML and AI initiatives. For customers running their advanced analytics on AWS, our industry-leading data prep solution is tightly integrated with major AWS services, including Amazon S3, AWS Glue, AWS Identity and Access Management (IAM) and Amazon EMR (for job execution), as well as a wide range of Amazon machine learning and AI services within the AWS ecosystem. These native integrations allow customers to take advantage of the elastic scalability, flexibility, security and cost benefits AWS has to offer.

The hands-on workshop focused on preparing data for a personalization use case. Participants wrangled data with Trifacta on AWS to train and test a movie recommender model on the MovieLens dataset through Amazon Personalize, a machine learning service that lets data scientists and developers build personalized recommendations for their customers. Using Trifacta, attendees visually profiled the MovieLens dataset in Amazon S3, quickly applied a number of transformations by leveraging ML-guided suggestions such as Filtering and RapidTarget, and enriched the dataset with IMDB data by joining the two sources together. The combined dataset was then used to train the recommender system for more accurate movie recommendations. Finally, users could profile the new dataset to validate the effectiveness of the transformation logic.

Fig 1. Profiling the MovieLens data in Trifacta

Fig 2. A flow view in Trifacta: Enriching the MovieLens data with IMDB data

Fig 3. Profiling the transformed MovieLens data to validate the quality
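
For readers who want to see the shape of that workflow outside the product, here is a minimal Python sketch of the same four steps: profile, filter, enrich with a join, and profile again. It is not Trifacta's implementation or API. The sample rows are made up, the column names only mirror the MovieLens ratings layout, and the helper functions (profile, drop_missing, enrich) are hypothetical names introduced for illustration.

```python
from collections import Counter

# Made-up rows shaped like MovieLens ratings; real data would be read from Amazon S3.
ratings = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": 1, "movieId": 20, "rating": None},  # missing rating
    {"userId": 2, "movieId": 10, "rating": 5.0},
    {"userId": 2, "movieId": 30, "rating": 3.5},   # no matching metadata below
]

# Hypothetical IMDB-style metadata keyed by the same movieId (an assumption).
imdb = {
    10: {"title": "Movie A", "genre": "Drama"},
    20: {"title": "Movie B", "genre": "Comedy"},
}


def profile(rows):
    """Return the row count and the number of missing values per column."""
    missing = Counter()
    columns = set()
    for row in rows:
        for col, val in row.items():
            columns.add(col)
            if val is None:
                missing[col] += 1
    return {"rows": len(rows), "missing": {c: missing[c] for c in sorted(columns)}}


def drop_missing(rows, column):
    """Filter step: keep only rows that have a value in the given column."""
    return [r for r in rows if r.get(column) is not None]


def enrich(rows, lookup, key):
    """Left join: keep every row; unmatched rows get None for the new fields."""
    fields = sorted({f for meta in lookup.values() for f in meta})
    enriched_rows = []
    for r in rows:
        meta = lookup.get(r[key], {})
        enriched_rows.append({**r, **{f: meta.get(f) for f in fields}})
    return enriched_rows


before = profile(ratings)                    # quality check before transformations
cleaned = drop_missing(ratings, "rating")    # filter
enriched = enrich(cleaned, imdb, "movieId")  # join with the metadata
after = profile(enriched)                    # quality check after transformations

print("before:", before)
print("after: ", after)
```

Comparing the two profiles shows that the missing rating is gone after filtering, and that the join introduced one row without a title or genre. That is the kind of before-and-after check that gives confidence in a training dataset.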

The AWS DevDay workshop was a fun and engaging experience for everyone. Participants learned first-hand how to use a modern data prep solution alongside analytics services on AWS, while exchanging ideas and best practices with their peers. We look forward to bringing events like this to the broader analytics and data science communities. Ultimately, we hope to empower all data practitioners with a modern data prep solution that shortens the time to analytic insights.