
Data Preparation with Trifacta for Amazon SageMaker

October 15, 2018

In a recent blog covering Amazon SageMaker's forecasting algorithms, from its unique DeepAR algorithm to more traditional ones such as Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ES), it struck me that all of these algorithms expect well-structured, clean data before data scientists can deliver the most accurate predictions.
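
For example, while ARIMA and ES implementations typically consume a simple numeric series, SageMaker's DeepAR expects each time series as a JSON Lines record with a start timestamp and a target array. Here is a minimal sketch of building one such record; the values and the optional categorical field are made up for illustration:

```python
import json

# DeepAR consumes JSON Lines: one record per time series.
# "start" is the first timestamp, "target" the observed values.
# All values below are illustrative, not from the sample data sets.
record = {
    "start": "2018-01-01 00:00:00",
    "target": [112.0, 118.5, 0.0, 97.2],
    "cat": [0],  # optional categorical feature, e.g. a product group
}
print(json.dumps(record))
```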

However, forecast modeling depends on numerous data sets, such as inventory data, promotions, past orders, products, weather data, and product ratings. These data sets originate from internal systems, but more often than not from various parties (retailers, distributors, brokers, manufacturers, CPG companies, public data, social media, etc.), each with its own proprietary formats and standards, and a very different perspective from data scientists on what data quality means.

To keep that blog from running too long, I created this step-by-step guide outlining the process of structuring, cleaning, and combining these disparate data sets into a consistent format that data scientists can use to train and create a machine learning model in Amazon SageMaker.
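
To give a flavor of what those steps look like, here is a minimal pandas sketch of the same kind of structuring, cleaning, and joining. In the guide these steps are performed interactively in Trifacta; the file and column names below are hypothetical stand-ins, not the actual sample files:

```python
import pandas as pd

# Hypothetical raw inputs; the column names are illustrative.
orders = pd.read_csv("past_orders.csv", parse_dates=["order_date"])
products = pd.read_csv("products.csv")

# Standardize the join key and drop records with obvious quality issues.
orders["sku"] = orders["sku"].str.strip().str.upper()
orders = orders.dropna(subset=["sku", "quantity"])

# Combine the disparate sources into one consistent training table.
combined = orders.merge(products, on="sku", how="left")
combined.to_csv("training_data.csv", index=False)
```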

Without a tool like Trifacta, most data scientists would have to spend a massive amount of their time preparing data before it is ready for machine learning. This is not the best use of a data scientist's time: data scientists don't love munging and cleaning data, and it's a waste of their skills to be polishing the materials they rely on.

If you are a data scientist and want to experience the process first hand, from structuring to standardization and feature engineering for Amazon SageMaker, you can download the files below and follow the guide using the free Trifacta Wrangler edition.

Retail Datasets Before Data Preparation

Retail Datasets Prepared for SageMaker

XGBoost Jupyter Notebook for SageMaker

Although the sample uses XGBoost, a popular SageMaker built-in algorithm, the process would be very similar for other training methods on SageMaker, whether using other built-in algorithms, deep learning frameworks such as TensorFlow, MXNet, or PyTorch, or a data scientist's own custom algorithms.
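
For reference, training the built-in XGBoost algorithm from a notebook generally follows the pattern below with the SageMaker Python SDK (as of late 2018). This is a minimal sketch: the bucket and S3 paths are hypothetical placeholders for wherever the Trifacta-prepared CSV files were uploaded:

```python
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Resolve the registry path of the built-in XGBoost container for this region.
container = get_image_uri(session.boto_region_name, "xgboost")

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",
    output_path="s3://my-bucket/xgboost-output",  # hypothetical bucket
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="reg:linear", num_round=100)

# The train/validation channels point at the Trifacta-prepared CSVs in S3.
s3_train = sagemaker.s3_input("s3://my-bucket/train.csv", content_type="text/csv")
s3_valid = sagemaker.s3_input("s3://my-bucket/validation.csv", content_type="text/csv")
xgb.fit({"train": s3_train, "validation": s3_valid})
```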

To learn more about using SageMaker, data scientists can also take a look at the Amazon SageMaker examples here.
