
The Two Keys to a Functional Data Lake

April 19, 2016

Organizations are increasingly investing in data lakes for greater storage capacity and for the ability to keep data in its native formats in a single environment. In theory, data lakes offer more flexibility. In practice, highly technical Hadoop infrastructure can feel gated by IT and inaccessible to the rest of the business. Strict processes are enforced, so business analysts must ask their technical counterparts for access to specific data, with little room for exploration. Meanwhile, IT resources are bogged down preparing one-off data sets for various business teams. The result is frustration on both sides and a lack of ROI from these initiatives.

The data lake has the potential to transform an organization, but only if it comes with the right tooling for end users and the governance processes needed to drive real value.

Your Data Lake’s Missing Key

So what’s missing? How can business users unlock the potential of their organization’s Hadoop data lake for more effective analytics initiatives? According to 451 Research, that’s self-service data preparation. “Self-service data preparation provides the user interface for reducing the time taken to analyze the data and extract true value from the data lake,” 451 Research reports. “[It is] a means to reduce the burden on IT to prepare data for end users, and in doing so reduce the time taken for users to discover, integrate, cleanse and enrich data to make it suitable for analysis.”

In that sense, it is imperative to design a data lake that incorporates self-service data preparation, lest your investment go to waste. User-friendly data preparation spurs adoption and lets users take full advantage of Hadoop’s power and agility without requiring a deep understanding of the underlying technology.

Access Doesn’t Sacrifice Governance

Of course, increased access comes with a caveat. “While Hadoop offers a more flexible schema-on-read approach to analytics, it is clear that it is also not a free-for-all,” reports 451 Research. “At the very least, a data catalog is required so that users can create an inventory of exactly what data is in the environment in the first place. Data lineage is also important in this regard in enabling analysts and data stewards to understand where the data came from, and what transformations may have been made to it already.”

Data governance is an important component of data wrangling: it lets IT manage data security without walling data off from the business units that need it. Compare legacy tools such as Excel, which encourage duplicate copies and unauthorized access, with a platform that tracks and consolidates all data transformations.
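To make the catalog and lineage ideas above concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the `CatalogEntry` and `DataCatalog` names, their fields, and the HDFS paths are all hypothetical. Each dataset is registered once with its location, and a derived dataset also records its parent and the transformations applied to it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CatalogEntry:
    """One dataset in the lake and the steps that produced it (hypothetical schema)."""
    name: str
    location: str
    source: Optional[str] = None  # parent dataset name; None for raw data
    transformations: List[str] = field(default_factory=list)


class DataCatalog:
    """In-memory inventory of datasets with simple lineage tracking."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, name, location, source=None, transformations=None):
        # Registering each name only once keeps lineage chains free of cycles.
        if name in self._entries:
            raise ValueError(f"dataset already registered: {name}")
        # A derived dataset must point at a parent the catalog already knows.
        if source is not None and source not in self._entries:
            raise KeyError(f"unknown source dataset: {source}")
        entry = CatalogEntry(name, location, source, list(transformations or []))
        self._entries[name] = entry
        return entry

    def inventory(self) -> List[str]:
        """List every dataset name in the environment."""
        return sorted(self._entries)

    def lineage(self, name: str) -> List[CatalogEntry]:
        """Walk from a dataset back to the raw data it was derived from."""
        chain = []
        current = name
        while current is not None:
            entry = self._entries[current]
            chain.append(entry)
            current = entry.source
        return chain


# Example usage with made-up dataset names and paths.
catalog = DataCatalog()
catalog.register("raw_orders", "hdfs:///lake/raw/orders")
catalog.register(
    "clean_orders",
    "hdfs:///lake/curated/orders",
    source="raw_orders",
    transformations=["dropped duplicate order ids", "normalized currency to USD"],
)

print(catalog.inventory())
for entry in catalog.lineage("clean_orders"):
    print(entry.name, entry.transformations)
```

Even at this toy scale, an analyst can see what data exists and a data steward can trace a curated data set back to its raw source, which is the pairing of access and governance the report describes. A production catalog would add persistence, access control, and metadata captured automatically rather than registered by hand.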

The Bottom Line

Designing a Hadoop data lake might seem intimidating for many organizations, given the technical setup and maintenance required. But 451 Research makes the case for a data lake that offers the best of both worlds: easy access and regular use for business units, with data governance that keeps IT compliant. The key is adopting a self-service data preparation solution that delivers an intuitive experience while ensuring enterprise-wide data governance.

To learn more about 451 Research’s findings, read the full report here

451 Sink or Swim