Earlier this month, Amazon announced the general availability of AWS Lake Formation, a fully managed service that simplifies building and managing data lakes in Amazon S3, Amazon’s object storage service.
As an AWS certified ML Competency and Data & Analytics Competency partner, Trifacta is excited about the announcement. Natively integrated with the AWS platform, Trifacta accelerates the customer’s analytics journey on AWS by making data preparation faster and easier. With AWS Lake Formation now generally available, Trifacta looks forward to helping organizations of all sizes drive Amazon S3 data lake adoption with well-prepared, trusted data.
With their scalability, flexibility and cost benefits, Amazon S3 data lakes allow companies to store large amounts of data in various shapes and sizes to support diverse analytics use cases in the cloud, including BI reporting and data science projects using AI and machine learning. However, great analytics starts with great data, and great data requires time and effort to obtain. To perform meaningful analytics, data professionals today spend the majority of their time preparing data, often manually, before they can analyze it. This time-consuming process is the biggest impediment to timely insights and to the innovations that analysis can drive.
Expedite data prep for an AWS data lake
Trifacta automates the arduous data prep process for all stakeholders and ensures that clean, relevant and secure data is always available in an Amazon S3 data lake. Designed for users with varying skill sets across the analytics workflow, Trifacta’s modern data prep solution provides an intuitive, visual and interactive experience in which data workers can explore, structure, transform, and share data with no coding required. Our business-friendly, intelligent data prep solution empowers users to take on data wrangling tasks themselves without writing code or relying on IT, and it accelerates the entire data prep process by orchestrating workflows in production. To meet compliance and audit requirements, Trifacta uses AWS IAM roles to centrally manage user access and the AWS Glue Data Catalog to track data lineage for data preparation.
Where does Trifacta fit in the AWS Lake Formation workflow?
Seamlessly integrated with the AWS ecosystem, Trifacta leverages a range of AWS services, including AWS Glue, AWS IAM, Amazon EMR, and Amazon S3, to bring the scalability, flexibility and security of AWS to every stage of data prep when refining an Amazon S3 data lake.
Fig 1. Refining Amazon S3 Data Lake with Trifacta
In a typical AWS Lake Formation workflow, data from different systems and sources is collected and stored in Amazon S3 in its original formats. Once the raw data is in Amazon S3, users can use AWS Glue to crawl and organize the datasets they want in the data lake and move them to a staging area on Amazon S3. Users can then kick off the data prep process by launching a Trifacta instance on Amazon EC2 and start refining the data going into the data lake. Trifacta can read data from Amazon S3 directly or through the AWS Glue Data Catalog, depending on the user’s requirements. To simplify data preparation, Trifacta provides a browser-based, easy-to-use, responsive interface in which end users can visually profile the data, join and enrich it, and transform it using core features such as Transform by Example and Macros, as well as specific functions such as binning and skewness that help wrangle data for machine learning. Users can validate the quality of the data continuously throughout the data prep process. Our ML-powered inferences and transformation recommendations guide user interactions along the way, eliminating the need to write code or wait for IT to deliver the data.
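To make the binning and skewness functions mentioned above concrete, here is a minimal sketch in plain Python of what such calculations typically do. It is not Trifacta's implementation: the function names `equal_width_bins` and `skewness` and the sample `order_totals` column are hypothetical, and the skewness shown is the common population (Fisher-Pearson) definition, which may differ from the one a given tool uses.

```python
from statistics import fmean, pstdev


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-indexed)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land in bin n_bins; clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return 0.0
    return fmean([((v - mean) / std) ** 3 for v in values])


# Hypothetical numeric column with one large outlier.
order_totals = [12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 210.0]

bins = equal_width_bins(order_totals, 4)
skew = skewness(order_totals)
print(bins)
print(round(skew, 2))
```

A strongly positive skewness, as the outlier produces here, is a common signal that a column may benefit from binning or a transformation before it is used as a machine learning feature.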
To operationalize data prep jobs, Trifacta compiles user-defined transformation recipes into Spark jobs that run on Amazon EMR and writes the processed data to the Amazon S3 data lake to fuel downstream analytics use cases.
While empowering business users and data workers with a self-service data prep experience, Trifacta also allows data engineers and IT to schedule, publish, and monitor workflows centrally to reduce silos and improve operational efficiency. Every data preparation recipe, or set of steps, created in Trifacta can be turned into a repeatable pipeline that runs on an hourly, daily, or weekly schedule, or on a user-defined one.
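As a rough illustration of what hourly, daily, weekly, and user-defined schedules mean for a repeatable pipeline, the sketch below computes upcoming run times. It is an assumption-laden example, not Trifacta's scheduler: the `INTERVALS` table, the `next_runs` function, the `"custom"` schedule name, and the start date are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical schedule names mapped to fixed intervals.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_runs(start, schedule, count, custom_interval=None):
    """Return the next `count` run times after `start` for a named or custom schedule."""
    interval = custom_interval if schedule == "custom" else INTERVALS[schedule]
    if interval is None or interval <= timedelta(0):
        raise ValueError("a custom schedule needs a positive interval")
    return [start + interval * i for i in range(1, count + 1)]


# Arbitrary start time chosen for the example.
start = datetime(2024, 1, 1, 6, 0)

daily = next_runs(start, "daily", 3)
every_six_hours = next_runs(start, "custom", 2, custom_interval=timedelta(hours=6))

for run in daily:
    print(run.isoformat())
```

Fixed intervals keep the sketch simple; a production scheduler would more likely use cron-style expressions and account for time zones.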
To ensure data security and governance during the data prep process, Trifacta leverages native cloud services, using AWS IAM roles to control user access and the AWS Glue Data Catalog to manage data lineage, all on a single platform.
Building a cloud data lake is a complex project that can take months or even years to complete. It requires skilled people with expertise in various domains to manage the end-to-end lake formation process. The recent GA release of AWS Lake Formation will simplify the effort of building data lakes on Amazon S3. Augmenting the AWS Lake Formation service with Trifacta’s modern data prep solution gives organizations a quick path to clean, well-prepared data in their data lake, resulting in greater data lake adoption and faster time to analytics insights.
To learn more, watch this video: How to Use Data Preparation to Accelerate Cloud Data Lake Adoption.