Data Preparation in an AWS Data Lake

February 5, 2019

AWS Data Lakes Shift to the Cloud With Help of Data Preparation

Data lakes originated to help organizations capture, store, and process any type of data regardless of shape or size, so it was only a matter of time before enterprise data lakes moved to the cloud to benefit from flexible storage and elastic processing. Moving to the cloud also brings significant economic benefits for analytics workloads such as data lakes, modern data warehouses, or a hybrid of both.

When considering cloud computing and a data lake, AWS is a clear choice, offering the capabilities needed to ingest, store, process, and deliver insight out of the data lake. AWS customers found that data lakes in the cloud had “better security, faster time to deployment, better availability, more frequent feature/functionality updates, more elasticity, more geographic coverage, and costs linked to actual utilization.” Many organizations building a data lake on AWS use Amazon S3 as the storage layer because of its scalability. Services such as Amazon S3, Amazon Glacier, and AWS Glue all work with AWS data lakes and make it easier for end users to access data.

However, one promise of a data lake has been to democratize data access and intelligence by enabling a larger number of analytics users to work with a broader diversity of data in a self-service fashion. This is where data preparation comes into the picture as a critical component of an AWS data lake environment.

The Role of Data Preparation in the AWS Data Lake

As an AWS Data & Analytics Partner Solutions Competency partner, Trifacta leverages typical AWS data lake services such as Amazon S3, Amazon EMR, and Amazon Redshift to bring data preparation capabilities to data scientists, data engineers, and other data and business analysts, so that they can benefit from the abundance of data typically landed in Amazon S3. The Trifacta platform focuses on data preparation, enabling the individuals with the greatest context for the data to transform it more quickly and easily from its raw format into a refined state for analytics and machine learning initiatives.

Architecture of an AWS-based data lake with Trifacta Wrangler Enterprise

A data lake commonly offers several data zones (e.g. Landing Zone, Exploration Zone, Refined Zone, Production Zone) that represent the successive states of the data, from its raw format for exploration to a trustworthy, operationalized state for accurate decision making. These zones are common to an Amazon S3 data lake, and AWS services like Amazon EMR can be applied to whichever zone they suit.

The primary role of Trifacta for data preparation is to enable data lake users to wrangle data in a particular zone and in the process move it from one zone to another zone to fulfill a particular data process. Trifacta seamlessly integrates with AWS data lakes by reading and writing to Amazon S3 data lakes (often raw or intermediary data lake zones) and to Amazon Redshift (for the more refined data zone). Our platform also leverages Amazon EMR to execute data preparation recipes at scale and output data to the next stage in the refinement process of the data lake.
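The idea of moving data from one zone to the next can be illustrated with a minimal Python sketch. It is not Trifacta's API and makes no AWS calls; it only assumes that each zone is a top-level key prefix in a single S3 bucket. The zone order, the `ZonedKey` type, the `parse_key` and `promote` helpers, and the bucket name `example-data-lake` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Assumed zone order, taken from the example zones named in the article.
# Real data lakes may use different names, prefixes, or separate buckets.
ZONES = ["landing", "exploration", "refined", "production"]


@dataclass(frozen=True)
class ZonedKey:
    """An object key split into its zone prefix and the path below it."""
    zone: str
    path: str

    def s3_uri(self, bucket: str) -> str:
        # Build an s3:// URI for this key in the given (hypothetical) bucket.
        return f"s3://{bucket}/{self.zone}/{self.path}"


def parse_key(key: str) -> ZonedKey:
    """Split 'zone/rest/of/path' into a ZonedKey, validating the zone."""
    zone, _, path = key.partition("/")
    if zone not in ZONES or not path:
        raise ValueError(f"key {key!r} is not under a known zone prefix")
    return ZonedKey(zone, path)


def promote(key: ZonedKey) -> ZonedKey:
    """Return the same path under the next zone in the assumed order."""
    idx = ZONES.index(key.zone)
    if idx == len(ZONES) - 1:
        raise ValueError("data is already in the final zone")
    return ZonedKey(ZONES[idx + 1], key.path)


if __name__ == "__main__":
    source = parse_key("landing/sales/2019/orders.csv")
    target = promote(source)
    print(source.s3_uri("example-data-lake"))
    print(target.s3_uri("example-data-lake"))
```

In practice the output of a data preparation job would be written to the promoted location, and the more refined zones might live in Amazon Redshift rather than S3, as described above.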

Governed BYOD in the Data Lake

Trifacta is also used by data lake users who want to bring their own data to the lake. While IT teams can automate data preparation and the ingestion of any volume and format of data into the lake, business users often want to bring in external or unanticipated data sources on demand (e.g. Excel, relational data, third-party data). Doing so gives them an extra level of autonomy to augment the data already in the data lake and gain deeper insights.

These users may also leverage Trifacta to automate their data preparation activities, producing repeatable and accurate data pipelines on which to run their business. With Trifacta, they can create their own personalized, self-service analytics zones in the lake for their own needs or the broader team’s requirements.

We also recognize there is a fine line between data democratization, which enables more people to leverage diverse data and unlocks innovation and business agility, and data anarchy, where data is scattered and ungoverned, leading to security breaches or compliance issues.

Trifacta fully respects all AWS IAM roles in its data preparation to ensure that only the right users can access the data, and it augments existing data governance processes on the data lake by logging all data access, transformation, and interaction within the data lake on AWS.

For more information on Trifacta’s data preparation for AWS, we invite you to register for our upcoming webinar with EMA Research – How to Streamline DataOps on AWS: Modernizing Data Management in the Cloud.
