When one of the biggest healthcare providers designed and implemented a data lake, it had big expectations. A data lake was an obvious solution for storing its variety of data sources (semi-structured reports, unstructured physicians’ notes, and volumes of spreadsheet data) in their native formats, something a traditional data warehouse couldn’t do. With a data lake, the IT department could better serve the entire organization’s present and future data needs.
Fast forward two years, however, and the data lake had seen little use. Part of the problem was a lengthy implementation process, but the bigger, ongoing issue was that the lake was gated behind highly technical IT users. Business users couldn’t access the Hadoop-stored data they needed, and they couldn’t make accurate requests to IT for data extracts without seeing the raw contents first. In short, all of the time and money IT had invested in the data lake was not seeing returns. They were dealing with a “frozen” data lake.
Data Wrangling for the Data Lake
IT began by talking directly to business units with varying objectives (marketing, healthcare, security, HR) and assessing their data needs. It was clear that the business units had initiatives they wanted to accomplish; they just weren’t able to execute. That’s when they discovered self-service data preparation.
Self-service data preparation, or what we call “data wrangling,” is the process of converting diverse data from its raw formats into a structured, consumable format for business intelligence, statistical modeling, or machine learning tools. What is unique about Trifacta’s solution is that the entire experience is built for non-technical users, making data wrangling intuitive, efficient, and even enjoyable. Trifacta automatically discovers the data, structures it in a familiar grid interface, identifies potentially invalid data, and suggests the best ways to clean and transform it.
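To make these wrangling steps concrete, here is a minimal sketch in pandas of what structuring, validating, and cleaning raw records can look like. This is an illustration of the general technique, not Trifacta's implementation; the column names and cleaning rules are hypothetical.

```python
import pandas as pd

# Hypothetical raw export: visit records with messy, mixed-quality values
raw = pd.DataFrame({
    "patient_id": ["P001", "P002", None, "P004"],
    "visit_date": ["2016-01-05", "2016-01-07", "2016-01-09", "not recorded"],
    "charge": ["125.00", "$240", "", "310.5"],
})

# Structure: parse dates; unparseable entries become NaT instead of erroring
raw["visit_date"] = pd.to_datetime(raw["visit_date"], errors="coerce")

# Clean: strip currency symbols and coerce charges to numbers
raw["charge"] = pd.to_numeric(
    raw["charge"].str.replace("$", "", regex=False), errors="coerce"
)

# Identify potentially invalid rows (missing IDs, dates, or charges)
invalid = raw[raw.isna().any(axis=1)]

# Transform: keep only valid rows for downstream tools
clean = raw.dropna().reset_index(drop=True)
print(len(clean), "valid rows;", len(invalid), "flagged for review")
```

The point of a self-service tool is that a business user gets these same operations (type inference, invalid-value flagging, filtering) through an interactive interface rather than code.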
For this healthcare provider, data wrangling meant access to the Hadoop data lake for business users, allowing them to launch business-critical data initiatives and discover new insights.
Situating Trifacta in a Data Lake
So what does data wrangling look like when implemented? A traditional data lake will have at least three zones, or reservoirs, each of which is a subset of the data lake where a user can access, explore, and refine data. Trifacta sits between the data storage and processing layer and the visualization or statistical applications used downstream. As a best-of-breed technology, it lets our customers assemble a personalized full-stack solution.
Data wrangling in the data lake typically occurs within a zone or is the process for moving between zones. Users may access raw and refined data to combine and structure it for their exploratory work or for defining new transformation rules they want to automate on a regular basis.
Trifacta can also be used for lightweight ingestion, bringing in external data sources (e.g., Excel files and relational data) to augment data already in the data lake for exploration and cleansing.
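As a rough sketch of what such lightweight ingestion amounts to, the snippet below joins a hypothetical extract from the lake with a lookup table pulled from a relational source. The table and column names are invented for illustration; in practice the external source would be an actual database or Excel file.

```python
import sqlite3
import pandas as pd

# Data already in the lake (hypothetical extract of claims records)
lake = pd.DataFrame({
    "provider_id": [10, 11, 12],
    "claim_total": [1200.0, 850.0, 430.0],
})

# External relational source: a provider lookup table in a SQL database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE providers (provider_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO providers VALUES (?, ?)",
    [(10, "West"), (11, "East"), (12, "West")],
)
external = pd.read_sql_query("SELECT * FROM providers", conn)

# Augment the lake data with external attributes before exploring it
augmented = lake.merge(external, on="provider_id", how="left")
print(augmented)
```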
Access for Analytics
Wrangling often occurs in the production zone to deliver data to the business-insight layer. Delivery can happen through SQL-based BI tools, or by exporting the data in a file format (e.g., CSV, JSON, or Tableau Data Extract) for further use with analytics tools such as Tableau, SAS, or R.
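For the file-export path, the handoff to downstream analytics tools can be as simple as writing the cleaned data in the formats those tools read. A minimal sketch with pandas (file names and columns are hypothetical):

```python
import pandas as pd

# A cleaned, production-ready table ready for the business insight layer
clean = pd.DataFrame({
    "patient_id": ["P001", "P002"],
    "charge": [125.0, 240.0],
})

# CSV for file-based tools such as Tableau, SAS, or R
clean.to_csv("claims_clean.csv", index=False)

# Row-oriented JSON for tools that consume JSON records
clean.to_json("claims_clean.json", orient="records")
```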
Trifacta augments existing data governance processes on the data lake by logging all data access, transformation, and interaction within the solution, and by making that data available to data lineage tools so administrators can understand the provenance of their data.
The Bottom Line
For organizations in the process of designing a data lake, it is imperative to incorporate self-service data preparation, lest the investment go to waste. User-friendly data preparation spurs adoption and allows users to truly take advantage of Hadoop’s power and agility without requiring a deep understanding of the underlying technology. Ultimately, this means more users on Hadoop, driving more insights and impacting the organization’s bottom line. For an ice-free data lake, integrating data wrangling is essential.