
Data Preparation with Trifacta for Amazon SageMaker

October 15, 2018

While writing a recent blog post on Amazon SageMaker's forecasting algorithms, from the unique DeepAR to more traditional ones such as Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ES), it occurred to me that all of these algorithms expect well-structured, clean data before data scientists can deliver accurate predictions.

However, forecast modeling depends on numerous datasets, such as inventory, promotions, past orders, products, and even weather data and product ratings. These datasets originate from internal systems but, more often than not, from third parties (retailers, distributors, brokers, manufacturers, CPG companies, public data, social media, etc.), each with its own proprietary formats and standards, and a very different perspective from data scientists on what data quality means.

Rather than make that blog post too long, I created this step-by-step guide outlining the process of structuring, cleaning, and combining these disparate datasets into a consistent format that data scientists can use to train and create a machine learning model in Amazon SageMaker.
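To make the process concrete, here is a minimal sketch of the kind of standardization and blending the guide walks through, written in pandas. The file names, column names, and join keys below (past_orders.csv, products.csv, promotions.csv, sku, promo_id, units_sold) are hypothetical stand-ins for the retail datasets, not the actual schema from the guide.

```python
import pandas as pd

# Hypothetical input files standing in for the retail datasets in the guide
orders = pd.read_csv("past_orders.csv", parse_dates=["order_date"])
promotions = pd.read_csv("promotions.csv", parse_dates=["start_date", "end_date"])
products = pd.read_csv("products.csv")

# Standardize join keys, which often differ in case and whitespace across source systems
orders["sku"] = orders["sku"].str.strip().str.upper()
products["sku"] = products["sku"].str.strip().str.upper()

# Combine the sources into one flat training table
training = (
    orders.merge(products, on="sku", how="left")
          .merge(promotions, on="promo_id", how="left")
)

# Drop rows missing the prediction target and fill simple gaps
training = training.dropna(subset=["units_sold"])
training["discount_pct"] = training["discount_pct"].fillna(0)

# SageMaker's built-in XGBoost expects the target in the first column
# and no header row in CSV training data
cols = ["units_sold"] + [c for c in training.columns if c != "units_sold"]
training[cols].to_csv("train.csv", index=False, header=False)
```

In practice, Trifacta performs these structuring and standardization steps interactively, but the sequence of operations is the same: normalize keys, blend sources, handle missing values, and write out a consistent training format.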

Without a tool like Trifacta, most data scientists would have to spend a large share of their time getting data ready for machine learning. That is not the best use of a data scientist's time: they don't love munging and cleaning data, and it's a waste of their skills to be polishing the materials they rely on.

If you are a data scientist and want to experience it firsthand, from structuring to standardizing and feature engineering for Amazon SageMaker, you can follow the guide with the files below using the free Trifacta Wrangler Edition.

Retail Datasets Before Data Preparation

Retail Datasets Prepared for Sagemaker

Xgboost Jupyter Notebook for Sagemaker

Although the sample uses a popular SageMaker built-in algorithm, XGBoost, the process would be very similar for other training methods on SageMaker, whether using other built-in algorithms, deep learning frameworks such as TensorFlow, MXNet, or PyTorch, or a data scientist's own custom algorithms.
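For reference, here is a minimal sketch of launching a training job with the built-in XGBoost algorithm via the SageMaker Python SDK. The S3 paths, bucket name, and IAM role ARN are placeholders you would replace with your own, and the prepared CSVs are assumed to follow the built-in XGBoost format (target first, no header), as in the earlier preparation sketch.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder ARN

# Resolve the built-in XGBoost container image for the current region
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/sagemaker/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Point the training job at the prepared CSVs on S3
estimator.fit({
    "train": TrainingInput("s3://my-bucket/prepared/train.csv", content_type="text/csv"),
    "validation": TrainingInput("s3://my-bucket/prepared/validation.csv", content_type="text/csv"),
})
```

Swapping in a different built-in algorithm or framework mostly means changing the container image and hyperparameters; the data-preparation output feeding the job stays the same.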

To learn more about SageMaker, data scientists can also take a look at the Amazon SageMaker examples here.
