

Data Preparation with Trifacta for Amazon SageMaker

October 15, 2018

A recent blog post covered how to forecast better with Amazon SageMaker’s unique modeling algorithms such as DeepAR, as well as more traditional ones such as Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ES). It occurred to me that all of these algorithms expect well-structured, clean data to deliver the most accurate predictions.

However, forecast modeling depends on numerous data sets, such as inventory data, promotions, past orders, products, or even weather data and product ratings. Some originate from internal systems but, more often than not, they come from various parties (retailers, distributors, brokers, manufacturers, CPG, public data, social media, etc.), each with its own proprietary formats, standards, and a very personal perspective on what data quality is.
Because that blog post would have become too long, I created this step-by-step guide to outline the process of structuring, cleaning, and combining these disparate data sets into a consistent format for training a machine learning model in Amazon SageMaker.
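To make "structure, clean, and combine" concrete, here is a minimal sketch in plain Python of the kind of work involved. It is not how Trifacta does it internally; the records, column names (`product_id`, `order_date`, `quantity`, `discount_pct`), and date formats are all hypothetical stand-ins for files received from two different parties.

```python
from datetime import datetime

# Hypothetical raw records, standing in for files from two different parties.
RAW_ORDERS = [
    {"product_id": " p-001 ", "order_date": "2018-10-01", "quantity": "12"},
    {"product_id": "P-002", "order_date": "10/02/2018", "quantity": "7"},
    {"product_id": "p-001", "order_date": "2018-10-03", "quantity": ""},  # missing quantity
]
RAW_PROMOTIONS = [
    {"product_id": "P-001", "discount_pct": "15"},
]

# Assumed date formats used by the different sources.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known date format and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_orders(rows):
    """Standardize IDs and dates, and drop rows with missing values."""
    cleaned = []
    for row in rows:
        quantity = row["quantity"].strip()
        order_date = parse_date(row["order_date"])
        if not quantity or order_date is None:
            continue  # unusable row: skip rather than guess a value
        cleaned.append({
            "product_id": row["product_id"].strip().upper(),
            "order_date": order_date,
            "quantity": int(quantity),
        })
    return cleaned


def combine(orders, promotions):
    """Left-join promotions onto orders by product_id (0.0 when no promotion)."""
    discount_by_product = {
        p["product_id"].strip().upper(): float(p["discount_pct"])
        for p in promotions
    }
    return [
        {**o, "discount_pct": discount_by_product.get(o["product_id"], 0.0)}
        for o in orders
    ]


prepared = combine(clean_orders(RAW_ORDERS), RAW_PROMOTIONS)
```

Even in this toy case, most of the code deals with inconsistent formats and missing values rather than modeling, which is the effort the guide aims to reduce.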

Without a tool like Trifacta, most data scientists would have to spend a massive amount of their time preparing data for ML. This is not the best use of a data scientist’s time. Data scientists don’t love munging and cleaning data, and polishing the materials they rely on is a waste of their skills.

If you want to experience it firsthand, from structuring to standardizing and feature engineering for Amazon SageMaker, you can use these files and follow the guide with Trifacta’s free Wrangler Edition.

Retail Datasets Before Data Preparation
Retail Datasets Prepared for SageMaker
XGBoost Jupyter Notebook for SageMaker

Although the sample uses a popular SageMaker built-in algorithm, XGBoost, the process would be very similar for other training methods on SageMaker: other built-in algorithms, deep learning frameworks such as TensorFlow, MXNet, or PyTorch, or your own custom algorithms.
To learn more about using SageMaker, you can also take a look at the Amazon SageMaker examples here.
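One detail worth knowing when preparing data for the built-in XGBoost algorithm: for CSV training input, it expects no header row and the target variable in the first column. The sketch below shows one way to serialize prepared records into that shape using only the standard library. The function name `to_xgboost_csv` and the column names are hypothetical, chosen for illustration.

```python
import csv
import io


def to_xgboost_csv(rows, label, features):
    """Serialize rows as header-less CSV with the label in the first column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[label]] + [row[name] for name in features])
    return buffer.getvalue()


# Hypothetical prepared records; all values must already be numeric.
rows = [
    {"quantity": 12, "discount_pct": 15.0, "day_of_week": 0},
    {"quantity": 7, "discount_pct": 0.0, "day_of_week": 1},
]
csv_text = to_xgboost_csv(
    rows, label="quantity", features=["discount_pct", "day_of_week"]
)
```

The resulting text could then be saved to a file and uploaded to S3 as the training channel. Other algorithms and frameworks may expect different input layouts, so check the format each one requires.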