In this post, we’ll discuss what a data lakehouse is and why it’s gained popularity in recent years. But first, let’s dive into the context of how data lakehouses came about.
The long reign of the data warehouse
For a long time, data warehouses were the architectural standard for storing and processing business data. They were the ideal intermediary between operational systems and business applications; data engineers could bring together many different data sources in a single data warehouse.
However, data first had to be transformed into a warehouse-friendly format, and that transformation work eventually became the main burden of managing a data warehouse.
While data warehouses were extremely reliable for smaller, more structured data sets, the architecture began to falter under the influx of huge quantities of unstructured data in the modern big data era. Since all data had to be properly structured before it could be stored in a data warehouse, the time and cost of maintaining data warehouses exploded.
It became evident that organizations were shoehorning modern data use cases into an outdated architectural model. They needed a solution that would increase their ability to work with unstructured, unfamiliar data sources, and one that also wouldn’t come at the cost of slow load times.
The data lake solution
As a solution, organizations began to build data lakes: repositories that store any type of raw data, structured or unstructured, so that it can be transformed later for specific business uses.
Not only did the approach save time up front by limiting ETL (extract, transform, and load) jobs, but it allowed for increased innovation. Without a need to define data requirements in advance of analytics projects, there was much greater opportunity for new insights down the road.
While the data lake strategy was promising, it proved to be less than smooth sailing. The term “data swamp” quickly caught on to describe a data lake whose contents hadn’t been properly organized or enriched with metadata. And while load times improved, the irregularity and sheer size of the stored data meant that batch jobs over it often ran for a long time.
Similarly, there was the term “frozen lake,” which referred to the inaccessibility of the data within the data lake for business users. Data lakes were typically managed by a small team of data platform engineers, who, despite their deep technical expertise, had little insight into business goals or context.
This isn’t to say that data lakes have been a failure—many organizations continue to successfully manage and reap huge benefits from data lakes today—but it’s clear that they, too, aren’t an entirely fault-free solution.
Putting the pieces together—a combined approach
With both data warehouses and data lakes having their benefits, and neither entirely fulfilling an organization’s needs on its own, many organizations put two and two together and built solutions that include both data lakes and data warehouses.
The approach makes sense in a lot of ways; structured data can be transformed and stored in a data warehouse, while unstructured, exploratory data may fall under the data lake category. That way, no one system is overloaded with too much data.
The downside here is that organizations must be equipped with the right team to be able to consistently move data back and forth between systems. Even then, there can be delays as data is moved or copied. Though the solution packages up the benefits of both data warehouses and data lakes, it inevitably introduces more complexity into the organization.
What is a data lakehouse?
And now, we’ve finally arrived at the introduction of the data lakehouse. The lakehouse is the most recent answer to the years-long question of the best way to process and store modern data types.
A data lakehouse combines elements of both a data lake and a traditional data warehouse and can simplify a multiple-system setup that includes a data lake, several data warehouses, and other specialized systems.
Some of its common elements, drawn from data warehouses and data lakes, include:
- The separation of storage and compute
Separating storage and compute has become a widely accepted approach for a number of analytic technologies. Persistent data lives in remote storage, typically low-cost object storage, while compute resources hold only transient data, such as caches and intermediate results, locally. Because the two are decoupled, each can be scaled independently, which increases availability and scalability and lowers costs.
- ACID transaction support
ACID is an acronym for the four key properties that define a transaction: Atomicity, Consistency, Isolation, and Durability. ACID transactions ensure data reliability and integrity, particularly in cases of failure or when different components perform concurrent operations. For example, if the power went out in the middle of a transaction, ACID support would ensure that the transaction is either applied in full or not at all, and that any data committed before the outage is still there afterward.
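As a rough illustration of atomicity, here is a minimal Python sketch. It uses the standard library's sqlite3 module purely as a stand-in for a transactional store; it is not how any particular lakehouse implements transactions. The accounts table, the transfer function, and the fail_midway flag are all hypothetical names made up for this example.

```python
import sqlite3


def make_accounts():
    """Create an in-memory table with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move `amount` from src to dst as one transaction: all or nothing."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            if fail_midway:
                # Stand-in for a crash between the two writes.
                raise RuntimeError("simulated failure")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except RuntimeError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = make_accounts()
transfer(conn, "a", "b", 30, fail_midway=True)
print(balances(conn))  # the half-finished debit was rolled back
transfer(conn, "a", "b", 30)
print(balances(conn))  # both writes were applied together
```

The failed transfer leaves both balances untouched, because the debit that had already run is undone when the transaction rolls back. Only the second, complete transfer changes the data.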
- Support for a wide range of data types
Like data lakes, data lakehouses can also store a wide range of data types, such as images, video, audio, semi-structured data, and text. The variety of data types that data lakehouses can store allows organizations to reduce data movement.
- Direct access to source data
Data lakehouses provide direct access to data for use in business applications, which improves the freshness of the data and reduces the maintenance involved in keeping multiple copies of the data in sync between a data warehouse and a data lake.
- Schema support and data governance
Data lakehouses offer mechanisms to uphold schema standards and apply governance, ensuring that data within a data lakehouse is properly organized, governed, and consistent.
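To give a sense of what upholding a schema standard can look like at write time, here is a small, self-contained Python sketch. It is an assumption-laden toy rather than any vendor's API: EXPECTED_SCHEMA, SchemaError, validate, and append are hypothetical names, and the "table" is just an in-memory list.

```python
# Hypothetical schema for an orders table: column name -> required Python type.
EXPECTED_SCHEMA = {"order_id": int, "customer": str, "total": float}


class SchemaError(ValueError):
    """Raised when a record does not match the expected schema."""


def validate(record, schema=EXPECTED_SCHEMA):
    """Check that a record has exactly the expected columns and types."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise SchemaError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise SchemaError(
                f"{column!r} should be {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )


def append(table, records, schema=EXPECTED_SCHEMA):
    """Validate the whole batch first, so one bad record rejects the write."""
    for record in records:
        validate(record, schema)
    table.extend(records)


orders = []
append(orders, [{"order_id": 1, "customer": "acme", "total": 19.5}])

try:
    # "total" arrives as a string, so the whole batch is refused.
    append(orders, [{"order_id": 2, "customer": "globex", "total": "n/a"}])
except SchemaError as err:
    print("rejected:", err)

print(len(orders))  # only the valid record was stored
```

Rejecting bad records before they land, instead of cleaning them up afterward, is one way to keep a lake from drifting toward the disorganized state described earlier in this post.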
- End-to-end streaming
By supporting streaming, data lakehouses can fulfill many organizations’ need for real-time reports, without having to incorporate an additional application.
What are some of the benefits of data lakehouses?
A solution that combines elements of data warehouses and data lakes has significant benefits, including:
Reduced data movement – Organizations no longer need to move data back and forth between a data warehouse and a data lake.
Reduced costs – Fewer ETL processes and less duplicated data equate to lower costs.
Reduced maintenance – Compared with the time spent maintaining a data lake or transforming data for a data warehouse, data lakehouses require relatively little maintenance.
Reduced complexity of data governance – With all data housed under one platform, adhering to data schema and governance rules is much less complicated.
What are some examples of data lakehouses?
The first data lakehouse marketed as such was Databricks Lakehouse Platform, which was released in 2020 as a rebranding that better suited its strategy of marrying data lake features with those of data warehouses.
Databricks, however, was not the first to use the term; in late 2019, Amazon described Amazon Redshift Spectrum, its service that allows Amazon Redshift users to query data stored in Amazon S3, as a “data lake house.” Certainly, this architecture allows for a convergence of a data warehouse (Redshift) with data lake storage (S3).
While Snowflake has not explicitly labeled itself a data lakehouse, it could also be classified as such. Snowflake describes its platform as a “data cloud,” which “combines data warehouses, subject-specific data marts, and data lakes into a single source of truth.”
Data warehouse vs. data lake vs. data lakehouse: Which one is right for you?
Between the three options, here’s a quick run-down of what different organizations may choose:
A data warehouse functions well for organizations that work with structured data and focus almost entirely on business intelligence and data analytics use cases. Data warehouses are best suited for data analysts, as they allow seamless connectivity to business applications, especially when hosted in the cloud, as is the case with Google BigQuery or Amazon Redshift.
If organizations are working with unstructured or raw data, and want to take on more complex machine learning use cases, they would be better off opting for a data lake built on storage such as Amazon S3, Google Cloud Storage, or Azure Data Lake Storage. However, this requires more technical data workers, such as data scientists and data engineers, as well as tight coordination and dedicated maintenance to ensure that the data doesn’t become disorganized or degrade in quality.
A data lakehouse, with the combined elements of both a data warehouse and a data lake, is best for organizations that need to provide direct access to data to business users, but also want to take on machine learning use cases or work with unstructured data.
What does the future hold for data lakehouses?
Data lakehouses are a relatively new technology and have yet to be tested among organizations to the extent that data lakes, and certainly data warehouses, have. Only time will tell whether they will live up to their promise, or whether cracks in the approach will begin to appear, much as they did with the data lake.
For now, they seem to have a bright future that aligns perfectly with the needs of data-driven organizations and the technologies at hand.