Data Preparation in an AWS Data Lake

August 22, 2020

Before we jump into the definition of an AWS data lake, let’s review why data lakes are important in the first place. 

A data lake is a central repository capable of storing both structured and unstructured data. The concept of a data lake is only about 10 years old, but it has already reengineered the foundational data strategy of organizations both big and small. Instead of relying on data warehouses, which demanded that data be modified upon storage to fit rigid requirements, organizations can now store data in its original format. The result is more efficient ingestion and flexibility for the organization to leverage data as needed, instead of locking themselves into a specific data structure up front. 
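To make the contrast concrete, here is a minimal Python sketch of applying structure at read time rather than at storage time. The sample events, field names, and the `read_with_schema` helper are hypothetical and exist only for illustration; they are not part of any AWS service.

```python
import json

# Raw events kept exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "country": "US"}',
    '{"user": "b2", "amount": "5.00", "device": "mobile"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and cast them.

    Fields missing from a record are returned as None instead of
    causing the record to be rejected.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two consumers read the same raw data with different structures.
revenue_view = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
device_view = read_with_schema(RAW_EVENTS, {"user": str, "device": str})
```

Neither view required changing the stored data, which is the flexibility described above: each team decides on a structure when it reads, not when the data lands.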

Early data lakes were built on HDFS clusters on-premises, but today’s data lakes aren’t beholden to a specific environment. Increasingly, organizations are moving their data lakes to the cloud to take advantage of the flexible storage and elastic processing benefits that the cloud offers. 

What is an AWS data lake?

An AWS data lake is built on Amazon Simple Storage Service (S3). Organizations can build an AWS data lake of any size based on their own needs, and then scale the lake up or down as the organization changes. What makes the AWS data lake architecture distinctive is that it lets users run big data analytics, artificial intelligence (AI), machine learning (ML), and other initiatives with native AWS services.

Benefits of an AWS data lake

When considering cloud computing and a data lake, the AWS ecosystem is a clear frontrunner. AWS customers found that data lakes in the cloud had “better security, faster time to deployment, better availability, more frequent feature/functionality updates, more elasticity, more geographic coverage, and costs linked to actual utilization.” Let’s take a closer look at some of the major benefits of an AWS data lake: 

    • Scalability
      Many organizations build their data lake on Amazon S3 because of its scalability. They can expand storage capacity as their data requirements grow, without provisioning it in advance.
    • Accessibility
      Amazon Glacier and AWS Glue are both compatible with AWS data lakes and make it easy for end users to access data.
    • Security
      Data in Amazon S3 is protected against failures, errors, and threats, and the service is designed for 99.999999999% (11 9s) of data durability.
    • Integrations with third-party service providers
      Organizations can’t depend on a data lake alone. An AWS data lake allows organizations to leverage a wide variety of AWS Data and Analytics Competency Partners to add to their S3 data lake. 
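To put the 11 9s durability figure in perspective, the short calculation below treats it as an average annual per-object figure, which is an assumption about how to read the number. The function names and the object count are chosen purely for illustration.

```python
from fractions import Fraction

# 99.999999999% durability, kept as an exact fraction to avoid float rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)

# Assumption: the figure is read as an average per-object, per-year value.
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # 1 in 100 billion


def expected_objects_lost_per_year(object_count):
    """Average number of objects expected to be lost in one year."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_single_loss(object_count):
    """Average number of years before one object is expected to be lost."""
    return 1 / expected_objects_lost_per_year(object_count)


# With 10 million stored objects, the expected loss is one object
# roughly every 10,000 years.
print(years_per_single_loss(10_000_000))
```

Under this reading, durability describes how unlikely it is that stored objects are lost. It says nothing about who can access them, which is a separate concern handled by access controls.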

The role of data preparation in an AWS data lake

One of the biggest benefits that a data lake offers is the democratization of data access. With one central repository for data, the organization can theoretically widen access to more users and allow for greater levels of self-service. There is no more dependency on IT to hunt for data stored in specific warehouses. Now, the organization has access to it all.

However, it’s somewhat of a false promise. Even with a data lake, the majority of users don’t have the technical capacity to directly source and prepare data from the data lake for analytic use. There needs to be some sort of intermediary. And that’s where data preparation comes into the picture as a critical component of an AWS data lake environment.

Data preparation platforms give any user the power of an engineer or developer by way of an intelligent, visual interface. Instead of writing code to source and transform data, users can do so with a data preparation platform in a matter of clicks. 

The Trifacta data preparation platform on an AWS data lake

As an AWS Data & Analytics Partner Solutions Competency partner, Trifacta leverages typical AWS data lake services such as Amazon S3, Amazon EMR, or Amazon Redshift to allow users to cleanse and standardize data in Amazon S3. The Trifacta platform was designed to give users the greatest context for their data so that they could more quickly and easily transform it into a refined state for analytics and/or machine learning initiatives.

Figure: Architecture of an AWS-based data lake with Trifacta Wrangler Enterprise

Once users have established the data preparation steps required for a specific initiative, they can use Trifacta to construct repeatable data pipelines that automatically prepare new data as it is received. With their own personalized, self-service analytics zones in the lake, users can become experts in their own niche area of data without having to constantly involve IT.
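The idea of a repeatable pipeline can be sketched in a few lines of plain Python. This is not Trifacta's API or how the platform is implemented; the step functions, the country aliases, and the `prepare` helper are all hypothetical. The point is only that a fixed, ordered list of steps can be reapplied unchanged to each new batch of data.

```python
def strip_whitespace(record):
    """Cleanse: trim leading and trailing whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def standardize_country(record):
    """Standardize: map common spellings to a two-letter code (hypothetical aliases)."""
    aliases = {"usa": "US", "united states": "US", "u.s.": "US"}
    country = record.get("country")
    if isinstance(country, str):
        record = {**record, "country": aliases.get(country.lower(), country.upper())}
    return record


# The ordered steps are defined once and reused for every new batch.
PREPARATION_STEPS = [strip_whitespace, standardize_country]


def prepare(records, steps=PREPARATION_STEPS):
    """Run every record through each preparation step, in order."""
    prepared = []
    for record in records:
        for step in steps:
            record = step(record)
        prepared.append(record)
    return prepared


new_batch = [{"name": " Ann ", "country": " usa "}, {"name": "Bo", "country": "de"}]
print(prepare(new_batch))
```

Because the steps live in one list, adding or reordering a transformation changes the pipeline for all future batches without touching the code that runs it.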

Beyond data preparation, Trifacta can also serve as a way for business users to bring their own data to the lake. Though IT typically handles data ingestion with a different set of tools, business users often uncover external or unanticipated data sources (e.g. Excel, relational data, third-party data) that they want added to the lake. Giving these users the autonomy to bring in these sources allows them to augment existing data in the lake and gain deeper insights.

Governed data democratization

Of course, there’s a fine line between data democratization and data chaos. It’s important to encourage more people throughout the organization to leverage diverse data and unlock innovation, while still enforcing the data governance rules that prevent data security breaches and compliance issues.

Trifacta fully respects all AWS IAM roles in its data preparation to ensure that only the right users can access the data, and it augments existing data governance processes by logging all data access, transformation, and interaction within the data lake on AWS.
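As an illustration of the kind of rule an IAM role can carry, the sketch below builds a read-only policy document scoped to one prefix of an S3 bucket. The bucket name, the prefix, and the `read_only_prefix_policy` helper are hypothetical, and this is not a policy taken from Trifacta. It shows only the general shape of an IAM policy that limits a user to one zone of the lake.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Build an IAM policy document allowing read-only access to one prefix."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only for keys under the given prefix.
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Objects under the prefix can be read but not written or deleted.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Hypothetical bucket and zone names.
policy = read_only_prefix_policy("example-data-lake", "zones/marketing")
print(json.dumps(policy, indent=2))
```

A role carrying a policy like this can read its own zone and nothing else in the bucket, so a tool that honors IAM roles inherits that boundary.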
