Data Preparation with Trifacta for
Amazon SageMaker

October 15, 2018

While writing a recent blog about forecasting with Amazon SageMaker's modeling algorithms, from its unique DeepAR to more traditional ones such as Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ES), it struck me that all of these algorithms expect well-structured, clean data before data scientists can deliver accurate predictions.

However, forecast modeling depends on numerous data sets, such as inventory data, promotions, past orders, products, and even weather data and product ratings. These originate from internal systems but, more often than not, from various parties (retailers, distributors, brokers, manufacturers, CPG companies, public data, social media, etc.), each with its own proprietary formats and standards, and with a very different perspective on data quality than data scientists have.

Because that blog was already getting long, I created this step-by-step guide to outline the process of structuring, cleaning, and combining these disparate data sets into a consistent format that data scientists can use to train a machine learning model in Amazon SageMaker.

Without a tool like Trifacta, most data scientists have to spend a massive amount of their time preparing data before it is ready for machine learning. This is not the best use of a data scientist's time: munging and cleaning data is a poor application of their skills, and few of them enjoy polishing the raw material they depend on.

If you are a data scientist and want to experience the process first hand, from structuring to standardizing and feature engineering for Amazon SageMaker, you can download the files below and follow along with the guide using the free Trifacta Wrangler Edition.

Retail Datasets Before Data Preparation

Retail Datasets Prepared for SageMaker

XGBoost Jupyter Notebook for SageMaker
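
If you prefer code to a visual tool, here is a minimal pandas sketch of the same structure, clean, combine, and feature-engineering flow described above. The file names and column names are hypothetical placeholders rather than the actual schema of the downloads, so treat it as an illustration of the steps, not a drop-in script.

```python
import pandas as pd

# Load the raw retail extracts (hypothetical CSV exports).
orders = pd.read_csv("past_orders.csv", parse_dates=["order_date"])
products = pd.read_csv("products.csv")
promotions = pd.read_csv("promotions.csv")

# Structure: normalize column names so the sources line up.
orders.columns = orders.columns.str.strip().str.lower().str.replace(" ", "_")

# Clean: standardize the join key, drop rows without it, fill gaps.
orders["sku"] = orders["sku"].str.upper().str.strip()
orders = orders.dropna(subset=["sku"])
orders["quantity"] = orders["quantity"].fillna(0).astype(int)

# Combine: join the disparate sources into one consistent table.
combined = orders.merge(products, on="sku", how="left")

# Feature engineering: derive model inputs from the raw fields.
combined["order_month"] = combined["order_date"].dt.month
combined["on_promotion"] = combined["sku"].isin(promotions["sku"]).astype(int)

# SageMaker's built-in XGBoost expects headerless CSV with the
# target variable in the first column.
model_input = combined.select_dtypes("number").copy()
model_input.insert(0, "target", model_input.pop("quantity"))
model_input.to_csv("train.csv", header=False, index=False)
```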

Although the sample uses XGBoost, a popular SageMaker built-in algorithm, the process would be very similar for other training methods on SageMaker, whether other built-in algorithms, deep learning frameworks such as TensorFlow, MXNet, or PyTorch, or a data scientist's own custom algorithms.
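
As a sketch of that hand-off, here is roughly what training the built-in XGBoost algorithm on the prepared CSV looks like with the SageMaker Python SDK (v2). The S3 paths, instance type, and hyperparameters are assumptions for illustration; the Jupyter notebook linked above remains the reference version.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes a SageMaker notebook role

# Resolve the built-in XGBoost container image for this region.
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.5-1"
)

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/sagemaker/output",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# The prepared train.csv: label in the first column, no header row.
train_input = TrainingInput(
    "s3://my-bucket/sagemaker/train.csv", content_type="text/csv"
)
estimator.fit({"train": train_input})
```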

To learn more about SageMaker usage, data scientists can also take a look at the Amazon SageMaker examples repository on GitHub.