Building An Effective Cloud Data Lake with Trifacta

March 24, 2020

Ask a number of organizations what kind of data they use and they’ll likely each give you a different answer. Long gone are the days when data merely referred to spreadsheets or other structured formats; today’s data is big, messy, and of increasing variety. And with this phenomenon of “big data,” the ways that organizations use data have transformed dramatically.

Instead of clearly defined use cases that pull data from operational systems, today’s analytics initiatives are exploratory in nature and draw upon a wide range of complex data sources, both internal and external to the organization. The stakes are heightened, too. Today’s data may serve as the foundation of multi-billion dollar products or as the common language between IoT sensors, to name two examples, which means capturing and understanding this data is business critical.

As the types of data and the growth of data initiatives have changed, so too has the data architecture needed to support them. In recent years, two of the most talked-about data architecture strategies have been the data lake and cloud storage. But what is a data lake, and how does it work with cloud storage? In this post, we’ll take a closer look at what a data lake is, why organizations are deploying both cloud data lakes and traditional data lakes, what you need to get started, and how a data preparation platform enables their success.

What is a data lake?

A data lake is a central repository, similar to a database, that allows for the storage of structured and unstructured data at scale. Data can be stored in its original format instead of being structured to fit particular specifications, as was the case with traditional data warehouses.

This flexibility is valuable for a couple of key reasons. One, it removes a huge amount of the upfront work involved in loading data into warehouses (which has only grown more time-consuming and costly as data has exploded in size and variety). And two, this type of data storage lends itself to exploratory analytics. Since data can be stored as is, organizations don’t have to define how it will be used up front but can instead explore its usage in varying analytics initiatives. The data is also stored in one central repository, instead of in many data warehouse silos across the organization. That means analysts don’t have to request data access one source at a time (or, more commonly, never fully understand the breadth of data available to them), but can leverage the entirety of their data at once and uncover unforeseen patterns between varying data types.
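
To make the "store as is, decide later" idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular data lake product works: the sensor records, field names and the two reader functions are all hypothetical. The raw records are kept exactly as they arrived, and each analytics use case applies its own structure when it reads them.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no upfront schema).
RAW_LINES = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2020-03-24T10:00:00"}',
    '{"device": "sensor-2", "temp_c": "19.0", "ts": "2020-03-24T10:00:05", "battery": 87}',
    '{"device": "sensor-3", "ts": "2020-03-24T10:00:09"}',
]


def read_temperatures(lines):
    """First use case: keep records that carry a temperature and cast it to float."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            rows.append((record["device"], float(record["temp_c"])))
    return rows


def read_battery_levels(lines):
    """Later use case: read the same raw data with a different structure in mind."""
    records = (json.loads(line) for line in lines)
    return {r["device"]: r["battery"] for r in records if "battery" in r}


temperatures = read_temperatures(RAW_LINES)
battery_levels = read_battery_levels(RAW_LINES)
```

Neither reader required the data to be reshaped before it was stored, which is the property that makes exploratory work cheap. The trade-off is that every reader has to cope with missing or inconsistently typed fields itself.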

Moving the data lake to the cloud

Recently, “cloud” has been added to the data lake equation as well. Traditional data lakes were built on on-premises HDFS clusters, but the current trend is to move and maintain data lakes in the cloud on infrastructure-as-a-service, an arrangement called a “cloud data lake.” With no hardware to install or maintain, setting up a cloud data lake is much easier than setting one up on-premises. Cloud providers like AWS, GCP and Azure also offer the flexibility to scale storage up and down depending on an organization’s usage, for increased cost savings and agility.

Migrating a data lake to the cloud, however, is no easy task. A few key considerations before transitioning to a cloud data lake are:

  1. Don’t rush.
    Organizations must start small. Proving out ROI quickly requires that organizations not bite off more than they can chew. Organizations must set clear and manageable goals hand-in-hand with their cloud provider to ensure that their organization’s individual needs will be met.  
  2. Interconnectivity is key.
    During the transition, there will inevitably be a lengthy intermediary period in which organizations manage a mix of on-prem and cloud solutions. For some organizations, given their security constraints, this blend of on-prem and cloud will be permanent. In any case, understanding which systems need to connect with each other for the short or long term is critical.
  3. Choose cloud-native technologies.
    Cloud-native technologies, such as Amazon EMR on AWS and HDInsight on Azure, are far more efficient and allow organizations to leverage the full benefits of the cloud. These technologies are tightly integrated with the entire cloud ecosystem, including storage, processing and security.

Avoiding the “data swamp”

Given the advantages of a data lake (and now, the cloud data lake), the concept has quickly grown in popularity. In fact, MarketWatch is expecting the global data lakes market to grow approximately 28% between 2017 and 2023. If a large organization hasn’t already implemented a data lake or cloud data lake, odds are it has at least considered how the strategy would impact its own operations.

However, there is a different story told about the data lake, regardless of whether it is hosted in the cloud as a cloud data lake or on-prem as a traditional data lake. For as much potential as a cloud data lake has to bring value to an organization, there is also a huge risk that it will be a burden, even if it is completely secure. Congregating large amounts of complex data is only useful if that data is used, but for many organizations the data in a data lake can go untouched because the lake wasn’t designed in accordance with the specific needs of its users. And the fact that a huge amount of different types of data can be stored in one repository invites opportunity for cutting-edge initiatives, but it can also invite chaos if that data lake isn’t managed correctly. A successful data lake must include the following:

  1. Defined data zones. 
    Organizations must set up zones that align with the given number of steps in their environment. This may include a landing area, an ingesting area or a modelling area. It’s also important to have different zones for different uses and create a sandbox area for ad-hoc operations. Organization is key—all of these zones should include an intuitive namespace design pattern that makes sense for the organization. 
  2. Structured and DIY loading.
    Ad-hoc data loads, or DIY loading, allow for initial exploration of data that may be unknown or unfamiliar. They let users rapidly assess the data and understand how it may be used, or whether it’s useful at all. Structured loading comes into play once analysts have understood the data they need and want to repeat the ingestion of such data. In these scenarios, organizations are dealing with high-quality, structured data that has been organized for long-term maintenance. Both are important strategies to integrate into a data lake and should be leveraged when appropriate.
  3. Data governance.
    A common misconception about the data lake is that just because you have the ability to dump anything and everything into it, you should. However, that’s the easiest way to end up with a disorganized and messy data lake: in other words, a data swamp. A data lake may entice organizations with its flexibility, but it still demands a certain level of structure in the form of data governance. Without repeatable data governance processes and procedures, analysts won’t be able to trust their data and, ultimately, won’t be able to take action on it. Robust data governance includes defined metadata with data lineage, the ability to identify and control sensitive data, data quality and monitoring, policies around data source availability, and master data management.
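
As a rough illustration of the zones and governance points above, the Python sketch below builds storage paths from a simple namespace pattern and records lineage and sensitive columns in a small catalog entry. Everything in it is an assumption made for the example: the zone names echo the ones mentioned above, while the path layout, the `CatalogEntry` class, and the dataset and owner names are hypothetical and not part of Trifacta or any cloud provider’s API.

```python
from dataclasses import dataclass, field
from datetime import date

# Zone names taken from the list above; the exact set is an assumption.
ZONES = ("landing", "ingest", "model", "sandbox")


def zone_path(zone, source, dataset, load_date):
    """Build a path using one hypothetical namespace pattern:
    <zone>/<source>/<dataset>/load_date=<YYYY-MM-DD>/
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{source}/{dataset}/load_date={load_date.isoformat()}/"


@dataclass
class CatalogEntry:
    """Minimal governance metadata for one dataset (illustrative only)."""
    path: str
    owner: str
    derived_from: list = field(default_factory=list)       # lineage: upstream paths
    sensitive_columns: list = field(default_factory=list)  # columns needing access control

    def is_sensitive(self):
        return bool(self.sensitive_columns)


load_day = date(2020, 3, 24)

# A raw extract lands first and is flagged as containing sensitive data.
raw = CatalogEntry(
    path=zone_path("landing", "crm", "customers", load_day),
    owner="it-data-platform",
    sensitive_columns=["email"],
)

# A modelled dataset records where it came from and drops the sensitive column.
modelled = CatalogEntry(
    path=zone_path("model", "crm", "customer_segments", load_day),
    owner="analytics",
    derived_from=[raw.path],
)
```

Rejecting unknown zone names in `zone_path` is one simple way to keep ad-hoc writes from inventing new top-level areas, and the `derived_from` list is the smallest possible form of lineage. A real deployment would typically rely on a dedicated catalog or governance tool instead.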

Adoption: The ultimate measure of data lake success

There are many complex architectural development and management processes involved in building a data lake, as well as in the transition from an on-premises data lake to one hosted in the cloud. IT groups must work in partnership with the organization as a whole to build systems that work in accordance with core use cases and account for key data types.

However, even when IT manages to flawlessly execute these huge undertakings, the ultimate business success of a data lake depends on its adoption and usage. In other words, how many people are deriving value from this data lake? What new analytics initiatives are users able to take on by way of the data lake? This isn’t always within IT’s direct control. Sure, IT can create an organized data lake that is well-managed and houses trustworthy data, but if the larger community of business users doesn’t have the technical skills to access that data, the data lake will forever be relegated to a small group of data scientists and data engineers who can only drive a certain number of analytics initiatives. Investing in big data architecture is a time-consuming, costly endeavor that has proven its worth over and over again, but only if users are readily leveraging the data at hand.

The 80% problem

One of the most difficult aspects of the analytics process, and often the biggest barrier for data analysts to access the data they need, is data preparation. It has been well-documented that data preparation routinely demands up to 80% of the overall time spent on any analytics project, and that percentage certainly hasn’t decreased as data has gotten bigger and messier. The end results of an analytics initiative can only be trusted if data has been properly cleansed and prepared up front. 

While IT will inevitably take on some of this data quality work, they can’t (and shouldn’t) take on all of it. First, each analytics initiative will demand that the data be prepared in different ways that IT simply can’t predict. Second, having IT spend all of their time preparing data instead of doing the difficult architectural management work described above is inefficient. Of course, IT will still curate the most valuable data and make sure it is sanctioned and re-used (this ensures a single version of truth and increases efficiency). But with business context and ownership over the finishing steps in cleansing and data preparation, business users can ultimately decide what’s acceptable, what needs refining, and when to move on to analysis.

In order to bypass the data preparation barrier and increase data lake adoption, organizations need to adopt a data preparation platform that enables users of all technical abilities to access and prepare data.

Trifacta: Enabler to data lake success

Trifacta offers a unique data preparation platform that reduces time spent preparing data by up to 90%. Its interface is easy to use, intelligent and interactive, improving users’ ability to understand data immediately. Trifacta starts by automatically presenting users with the most compelling and appropriate visual representation based on their data. Every profile is customized and completely interactive, allowing the user to simply select certain elements of the profile to prompt transformation suggestions. Finally, users can choose to explore more detailed visual representations, which present the data at its most granular level for deeper exploration and analysis.

Perhaps most importantly, Trifacta can sit on top of all major cloud and on-premises platforms, providing users with a single data preparation experience no matter where their data lives. Even as data platforms change, Trifacta can remain constant. This is important because as specific data architecture trends come and go, data preparation will continue to be a need for every analytics project. An investment in the Trifacta data preparation platform is future-proof.
