Data Lakes Shift to the Cloud
Data lakes originated to help organizations capture, store, and process any type of data regardless of shape or size, so it was only a matter of time before enterprise data lakes moved to the cloud to benefit from flexible storage and elastic processing. Moving to the cloud can also bring significant economic benefits for analytics workloads such as data lakes, modern data warehouses, or a hybrid of both.
For organizations evaluating cloud solutions, AWS is a clear choice for implementing a data lake, offering the data services needed to ingest, store, process, and deliver insight from the data lake. However, one promise of a data lake has been to democratize data access and intelligence by enabling a larger number of analytics users to work with a broader diversity of data in a self-service fashion. This is where data preparation comes into the picture as a critical component of an AWS data lake environment.
The Role of Data Preparation in the AWS Data Lake
Trifacta, an AWS Data & Analytics Partner Solutions Competency partner, leverages typical AWS data lake services such as Amazon S3, Amazon EMR, and Amazon Redshift to enable data scientists, data engineers, and data and business analysts to benefit from the abundance of data typically landed in Amazon S3. The Trifacta platform is focused on enabling the individuals with the greatest context for the data to transform it more quickly and easily from its raw format into a refined state for analytics or machine learning initiatives.
Architecture of an AWS-based data lake with Trifacta Wrangler Enterprise
A data lake commonly offers several coexisting data zones (e.g. Landing Zone, Exploration Zone, Refined Zone, Production Zone) that represent the successive states of the data, from its raw format for exploration to a trustworthy, operationalized state for accurate decision making.
The primary role of Trifacta is to enable data lake users to wrangle data in a particular zone and, in the process, move it from one zone to another to fulfil a particular data process. Trifacta integrates with AWS by reading from and writing to Amazon S3 (often the raw or intermediary data lake zones) and Amazon Redshift (for the more refined data zones). Our platform also leverages Amazon EMR to execute preparation recipes at scale and output data to the next stage in the refinement process of the data lake.
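To make the zone progression concrete, here is a minimal Python sketch of how a dataset might be tracked as it is promoted from one zone to the next. It is only an illustration under stated assumptions: the zone order follows the examples above, while the bucket name, cluster name, location strings, and the `Dataset`, `next_zone`, and `promote` helpers are all hypothetical and are not part of Trifacta or any AWS API.

```python
from dataclasses import dataclass
from typing import Optional

# Zone order taken from the examples in the text (Landing -> Production).
ZONES = ["landing", "exploration", "refined", "production"]

# Hypothetical locations: the bucket and cluster names are illustrative only.
# Earlier zones are assumed to live in Amazon S3, more refined ones in Redshift.
ZONE_LOCATIONS = {
    "landing": "s3://example-data-lake/landing/",
    "exploration": "s3://example-data-lake/exploration/",
    "refined": "redshift://example-cluster/refined",
    "production": "redshift://example-cluster/production",
}


def next_zone(zone: str) -> Optional[str]:
    """Return the zone that follows `zone`, or None if it is the last one."""
    idx = ZONES.index(zone)
    return ZONES[idx + 1] if idx + 1 < len(ZONES) else None


@dataclass(frozen=True)
class Dataset:
    name: str
    zone: str

    @property
    def location(self) -> str:
        """Build an illustrative location string for the dataset's zone."""
        base = ZONE_LOCATIONS[self.zone]
        sep = "" if base.endswith("/") else "/"
        return f"{base}{sep}{self.name}"


def promote(dataset: Dataset) -> Dataset:
    """Return a copy of the dataset placed in the next zone.

    In a real pipeline this is where a preparation job would run;
    here we only model the bookkeeping of the zone change.
    """
    target = next_zone(dataset.zone)
    if target is None:
        raise ValueError(f"{dataset.name} is already in the final zone")
    return Dataset(dataset.name, target)


if __name__ == "__main__":
    ds = Dataset("orders", "landing")
    while True:
        print(ds.zone, "->", ds.location)
        if next_zone(ds.zone) is None:
            break
        ds = promote(ds)
```

Modelling zones as an ordered list keeps the promotion rule explicit: a dataset can only move forward one stage at a time, which mirrors the refinement process described above.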
Governed BYOD (Bring Your Own Data) in the Data Lake
Trifacta is also used by data lake users who want to bring their own data to the lake. While IT teams can automate the ingestion of any volume and format of data into the lake, business users often want to bring in external or unanticipated data sources on demand (e.g. Excel files, relational data, third-party data). This gives them an extra level of autonomy to augment the data already in the data lake and gain deeper insights.
These users may also leverage Trifacta to automate their data preparation activities, producing repeatable and accurate data pipelines on which to run their business. With Trifacta, they can create their own personalized, self-service analytics zones in the lake for their own needs or the broader team’s requirements.
We also recognize that there is a fine line between data democratization, which enables more people to leverage diverse data and unlocks innovation and business agility, and data anarchy, where data is scattered and ungoverned, leading to security breaches or compliance issues.
Trifacta fully respects AWS IAM roles to ensure that only the right users can access the data, and it augments existing data governance processes on the data lake by logging all data access, transformation, and interaction within the data lake on AWS.
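The combination of role-based access and audit logging can be pictured with a small Python sketch. This is a toy model only: real permission checks are evaluated by AWS IAM itself, and the role names, the `ROLE_ZONE_ACCESS` mapping, and the `AuditLog` and `access_zone` helpers below are hypothetical names introduced for illustration, not Trifacta or AWS APIs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set

# Hypothetical mapping of roles to the zones they may touch.
# In practice this decision is made by AWS IAM policies, not application code.
ROLE_ZONE_ACCESS: Dict[str, Set[str]] = {
    "analyst-role": {"exploration", "refined"},
    "engineer-role": {"landing", "exploration", "refined", "production"},
}


@dataclass
class AuditLog:
    """In-memory audit trail; a real system would persist these entries."""
    entries: List[dict] = field(default_factory=list)

    def record(self, user: str, role: str, zone: str,
               action: str, allowed: bool) -> None:
        self.entries.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "zone": zone,
            "action": action,
            "allowed": allowed,
        })


def access_zone(user: str, role: str, zone: str,
                action: str, log: AuditLog) -> bool:
    """Check whether the role may act on the zone and log the attempt.

    Both granted and denied attempts are recorded, so the log reflects
    every interaction rather than only the successful ones.
    """
    allowed = zone in ROLE_ZONE_ACCESS.get(role, set())
    log.record(user, role, zone, action, allowed)
    return allowed


if __name__ == "__main__":
    log = AuditLog()
    access_zone("ana", "analyst-role", "refined", "read", log)
    access_zone("ana", "analyst-role", "production", "write", log)
    for entry in log.entries:
        print(entry)
```

Logging denied attempts alongside granted ones is a deliberate choice in this sketch, since governance reviews usually care about who tried to reach data as well as who succeeded.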
For more information on Trifacta for AWS, we invite you to register for our upcoming webinar with EMA Research – How to Streamline DataOps on AWS: Modernizing Data Management in the Cloud.