How To Use a Modern Data Prep to Optimize AI and ML on AWS

 
September 24, 2019

At a recent Trifacta-hosted AWS DevDay in San Francisco, a free, hands-on workshop co-sponsored by AWS and partners, we shared tips and tricks on using a cloud-native, intelligent data prep solution to accelerate AI and ML use cases on AWS.

Next week we are bringing the DevDay to NYC. Please join us on Oct. 3 for a half-day, free workshop in downtown Manhattan. Register today to save your seat!

Your AI or ML result is only as good as the data you use to train and test your models. Getting the dataset right is step #1, and perhaps the most critical one for the success of your AI and machine learning project.

However, getting high-quality datasets for AI and machine learning can be a daunting and time-consuming task if you don’t have the right tool to prepare your data. Many attendees at our workshop told us they either spend the majority of their time writing code to prepare data or rely on the lightweight data prep capabilities included in their analytics solutions, which is costly and leads to a less-than-ideal outcome. The feedback from the attendees (the majority of whom were data scientists) revealed that the coding approach is rather rigid: whenever the data changes, whether through a variation in the schema or the addition of a new data source for enrichment, the code can break. In addition, these approaches offer no way to assess data quality before and after transformations are applied, which reduces a user’s confidence in the dataset for model training. As a result, data prep for AI and machine learning becomes extremely time-consuming, error-prone, and costly. That is what prompted these users to come to our workshop.
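To make the brittleness concrete, here is a minimal pandas sketch of the two checks attendees said their hand-written scripts lacked: a schema drift check and a before/after quality profile. It is not Trifacta functionality. The helper names `check_schema` and `profile`, the expected column names, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical expected schema for a ratings feed; these names are assumptions.
EXPECTED_COLUMNS = {"userId", "movieId", "rating", "timestamp"}


def check_schema(df: pd.DataFrame, expected: set) -> dict:
    """Report columns that are missing from, or unexpected in, a DataFrame."""
    actual = set(df.columns)
    return {"missing": expected - actual, "unexpected": actual - expected}


def profile(df: pd.DataFrame) -> dict:
    """Summarize row count, share of rows with nulls, and share of duplicate rows."""
    rows = len(df)
    return {
        "rows": rows,
        "null_fraction": float(df.isna().any(axis=1).mean()) if rows else 0.0,
        "duplicate_fraction": float(df.duplicated().mean()) if rows else 0.0,
    }


# Made-up sample in which an upstream change renamed "timestamp" to "ts".
raw = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 10, 20],
    "rating": [4.0, 4.0, None],
    "ts": [100, 100, 200],
})

drift = check_schema(raw, EXPECTED_COLUMNS)  # flags the renamed column
before = profile(raw)                        # quality before cleaning
clean = raw.drop_duplicates().dropna()       # a simple cleaning step
after = profile(clean)                       # quality after cleaning

print(drift)
print(before)
print(after)
```

Running the check before any transformation turns a silent downstream failure into an explicit report, and comparing `before` with `after` gives a rough measure of what the cleaning step actually changed.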

Trifacta offers a best-in-class, machine learning-powered solution that streamlines the entire data prep process for all types of users, ensuring that clean, relevant data is quickly and easily available for your ML and AI initiatives. For customers running their advanced analytics on AWS, our industry-leading data prep solution is tightly integrated with major AWS services, including Amazon S3, AWS Glue, AWS Identity and Access Management (IAM), and Amazon EMR (for job execution), as well as a wide range of machine learning and AI services within the AWS ecosystem. These native integrations allow customers to take advantage of the elastic scalability, flexibility, security, and cost benefits AWS has to offer.

The hands-on workshop focused on preparing data for a personalization use case. Our participants had the opportunity to wrangle data with Trifacta on AWS to train and test a movie recommender model on the MovieLens dataset through Amazon Personalize, a machine learning service that allows data scientists and developers to easily build personalized recommendations for their customers. Using Trifacta, our attendees were able to visually profile the MovieLens dataset in Amazon S3, quickly apply a number of transformations to the dataset by leveraging ML-guided suggestions such as Filtering and RapidTarget, and enrich the dataset with IMDB data by joining the two sources together. The combined datasets were then used to train the recommender system for more accurate movie recommendations. Users could profile the new dataset to validate the effectiveness of the transformation logic.


Fig 1. Profiling the MovieLens data in Trifacta


Fig 2. A flow view in Trifacta: Enriching the MovieLens data with IMDB data


Fig 3. Profiling the transformed MovieLens data to validate the quality
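
For readers who want to see the shape of these steps in code, the following is a rough pandas approximation of the filter, enrich, and validate flow. It is not the Trifacta recipe used in the workshop. The sample rows, the rating threshold, and the `imdb_genre` column are assumptions for illustration. The `movieId` to `imdbId` mapping mirrors the `links.csv` file that ships with MovieLens, and the final column names follow the `USER_ID`, `ITEM_ID`, and `TIMESTAMP` fields that an Amazon Personalize interactions dataset requires.

```python
import pandas as pd

# Tiny stand-ins for MovieLens ratings, the MovieLens movieId-to-imdbId links,
# and an IMDB extract. All values are made up for illustration.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, 2.0, 5.0, 3.5],
    "timestamp": [964982703, 964981247, 964982224, 964983815],
})
links = pd.DataFrame({"movieId": [10, 20, 30], "imdbId": [114709, 113497, 113228]})
imdb = pd.DataFrame({
    "imdbId": [114709, 113497],
    "imdb_genre": ["Animation", "Adventure"],  # hypothetical enrichment column
})

# Filter: keep ratings of 3.0 and above as positive interactions
# (the threshold is an assumption, not a value from the workshop).
positive = ratings[ratings["rating"] >= 3.0]

# Enrich: attach the IMDB key, then left-join the IMDB attributes.
# validate= guards against duplicate keys; indicator= flags unmatched rows.
enriched = (
    positive
    .merge(links, on="movieId", how="left", validate="many_to_one")
    .merge(imdb, on="imdbId", how="left", validate="many_to_one", indicator=True)
)

# Validate: share of interactions that found a matching IMDB record.
match_rate = float((enriched["_merge"] == "both").mean())

# Reshape to the column names Amazon Personalize expects for interactions.
interactions = enriched.rename(
    columns={"userId": "USER_ID", "movieId": "ITEM_ID", "timestamp": "TIMESTAMP"}
)[["USER_ID", "ITEM_ID", "TIMESTAMP"]]

print(f"IMDB match rate: {match_rate:.2f}")
print(interactions)
```

A left join keeps every interaction even when the IMDB record is missing, so the match rate doubles as a quick quality signal for the enrichment step.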

The AWS DevDay workshop provided a fun and engaging experience for everyone. Our participants had the opportunity to learn first-hand how to use a modern data prep solution and analytics services on AWS, while exchanging ideas and best practices with their peers. We look forward to bringing events like this to the broader analytics and data science communities. Ultimately, we hope to empower all data practitioners with a modern data prep solution that shortens the time to analytic insights.
