By now, most organizations understand that there is a clear difference between data availability and data accessibility. It’s not enough to store large amounts of business-critical data and assume that, because it’s available, the rest of the organization will go looking for it.
Instead, data should be accessible—not buried behind complex architecture or reachable only through IT, but searchable and usable in the moment that the data is needed. In response, a wide range of user-friendly technologies and platforms have emerged onto the market in recent years that aim to give business users better access to the data they need.
However, the underlying data architecture needed to best support these platforms and the larger goal of data democratization has been a tougher problem to crack. There are many moving parts, time-consuming change processes, and varied opinions. For many organizations, solving this problem has required a huge investment of time and money.
Recently, data mesh has gained popularity for its innovative approach that solves some of the core challenges in previous architectural solutions. First developed by ThoughtWorks consultant Zhamak Dehghani, the concept has since gained serious traction.
Data mesh calls for the end of the monolith
Before we dive into what a data mesh is, it’s important to explain the context in which it came about. Principally, the data mesh stands in stark contrast with monolithic data infrastructures that aim to centralize organizational data.
The quintessential example of a monolithic data infrastructure is the data lake, which was popularized around 2010 as an alternative to traditional, siloed data warehouses. While data warehouses were reliable for smaller and more structured data, the architecture began to falter under the influx of huge quantities of unstructured data in the modern big data era. Since all data had to be properly structured before it could be stored in a data warehouse, the number (and difficulty) of ETL jobs grew quickly.
Unlike data warehouses, data lakes allowed companies to store any type of data—structured or unstructured—so that it could later be transformed for specific business uses. Not only did the approach save time up front by limiting ETL jobs, but it also allowed for increased innovation. Without a need to define data requirements in advance of analytics projects, there was much greater opportunity for new insights down the road.
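The contrast between the two approaches can be sketched in a few lines of toy code. Everything here is illustrative—the records, function names, and schema are invented for the example, not drawn from any particular platform:

```python
# Toy records: one structured, one semi-structured. A warehouse's ETL path
# must enforce a schema up front, while a lake's ELT path stores raw data as-is.
raw_events = [
    {"user": "alice", "amount": "42.50"},        # structured record
    '{"user": "bob", "clicks": [1, 2, 3]}',      # semi-structured JSON string
]

def etl_for_warehouse(events):
    """Transform BEFORE load: only records matching the schema get in."""
    table = []
    for e in events:
        if isinstance(e, dict) and "amount" in e:  # schema check up front
            table.append({"user": e["user"], "amount": float(e["amount"])})
        # anything else is rejected, or requires writing a new ETL job
    return table

def elt_for_lake(events):
    """Load BEFORE transform: everything lands raw, shaped later per use case."""
    lake = list(events)          # store as-is, nothing rejected
    def transform(fn):           # schema decisions are deferred to analysis time
        return [fn(e) for e in lake]
    return lake, transform

warehouse = etl_for_warehouse(raw_events)
lake, transform = elt_for_lake(raw_events)

print(len(warehouse))  # the unstructured record was dropped at load time
print(len(lake))       # the lake keeps both records for later transformation
```

The point of the sketch is the asymmetry: the warehouse path silently loses the record it cannot structure, while the lake keeps everything and pushes the structuring decision downstream.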
While the data lake was a promising strategy, it proved to be less than smooth sailing. "Data swamp" quickly caught on as a term for data within a data lake that hadn't been properly organized or enriched with metadata. And, due to the irregularity and size of the data stored in a data lake, batch jobs often ran long.
Similarly, there was the term “frozen lake,” which referred to the inaccessibility of the data within the data lake for business users. Data lakes were typically managed by a small team of data platform engineers, who, despite their deep technical expertise, had little insight into business goals or context. Some tools on the market, such as data engineering platforms, solve this problem by giving users the ability to manage data pipelines and transform data themselves. Still, the underlying architecture didn’t seem best-suited to scale to meet the needs of business users.
What’s a data mesh?
Instead of isolating the consumption, transformation, and output of all organizational data in one place, a data mesh treats each domain of the organization as a unique consumer. As such, each domain handles its own data pipelines.
In this way, data functions somewhat like a product. It is uniquely designed and best suited for the needs of its consumers, which means that each domain controls the ingestion, cleaning, and integration of its own data—or what we broadly categorize as the "ETL" or "ELT" processes.
Domains do not function like islands, however—connecting all domains is a layer of universal interoperability that ensures data governance, data standards, and data observability. In this way, each domain has the power to transform its own data without risk of breaching security, using inconsistent language, or operating off of faulty data.
Similarly, while each domain should control all of the functional elements of its data, that doesn't mean each needs its own separate storage layer. In fact, organizations will often operate off of a single storage layer, but that layer should have independent schemas and independent access rights management so that data can be freely and independently distributed.
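A minimal sketch of this split—domain-owned pipelines publishing into a shared store with per-domain schemas and access rights—might look like the following. The domain names, table names, and access-control model are all hypothetical, chosen only to illustrate the shape of the architecture:

```python
# A shared storage layer that every domain writes to, with an independent
# schema (namespace) and reader list per domain. All names are hypothetical.
class SharedStore:
    def __init__(self):
        self.schemas = {}  # schema name -> {table name -> rows}
        self.grants = {}   # schema name -> set of principals allowed to read

    def create_schema(self, name, readers):
        self.schemas[name] = {}
        self.grants[name] = set(readers)

    def write(self, schema, table, rows):
        self.schemas[schema].setdefault(table, []).extend(rows)

    def read(self, schema, table, principal):
        # Access rights are enforced per schema, independently of other domains.
        if principal not in self.grants[schema]:
            raise PermissionError(f"{principal} cannot read schema {schema}")
        return self.schemas[schema][table]

# One storage layer, two domains, each with its own schema and readers.
store = SharedStore()
store.create_schema("marketing", readers={"marketing_analyst"})
store.create_schema("finance", readers={"finance_analyst"})

# Marketing owns its own pipeline (ingest -> clean -> publish) end to end.
raw = [{"campaign": "spring", "clicks": "120"},
       {"campaign": "spring", "clicks": None}]
cleaned = [{**r, "clicks": int(r["clicks"])} for r in raw if r["clicks"] is not None]
store.write("marketing", "campaign_clicks", cleaned)

print(store.read("marketing", "campaign_clicks", "marketing_analyst"))
```

The design choice worth noticing is that nothing stops the two domains from sharing physical storage; the independence lives in the schema boundaries and the grants, which is what lets each domain distribute its data freely without touching its neighbors.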
In Zhamak Dehghani’s article about the data mesh, she references how the data mesh parallels microservices architecture: loosely coupled services that each have a specific business context or purpose. While the idea has gained popularity in software engineering, it hadn’t made the leap to data—until now.
The benefits of a data mesh
The benefits of a data mesh can be boiled down to five main concepts:
- Increased data accessibility
This is perhaps the biggest benefit of a data mesh architecture. Putting business owners in charge of their own data gives them more leverage to transform and integrate that data for analytics projects.
- Improved analytics
Leading with business context, instead of the technical knowledge of a central group of data engineers, often leads to more productive use of the data and, ultimately, better analytic results.
- Customized data pipelines
As organizations seek out increasingly complex analytic projects, the data pipelines necessary to execute those projects grow increasingly complex, too. The data mesh model embraces the complete customization of data pipelines.
- Standardized data observability
Data observability, or a pulse check on the health of your data, is a best practice for organizations, and one that a data mesh includes as a key part of its strategy.
- Decreased time to analytics
Business teams no longer have to wait on a small group of data engineers to fulfill data requirements; instead, they can control the timing of their own analytics projects.
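The "pulse check" framing of data observability above can be made concrete with a toy health check over a batch of records. The metric names and thresholds here are illustrative choices, not an industry standard:

```python
from datetime import datetime, timedelta, timezone

def health_check(rows, value_field, timestamp_field,
                 max_null_rate=0.1, max_staleness=timedelta(hours=24)):
    """A toy observability check: null rate and freshness for one dataset."""
    nulls = sum(1 for r in rows if r.get(value_field) is None)
    null_rate = nulls / len(rows)
    newest = max(r[timestamp_field] for r in rows)
    staleness = datetime.now(timezone.utc) - newest
    return {
        "null_rate": null_rate,
        "fresh": staleness <= max_staleness,
        "healthy": null_rate <= max_null_rate and staleness <= max_staleness,
    }

now = datetime.now(timezone.utc)
rows = [
    {"revenue": 10.0, "loaded_at": now - timedelta(hours=1)},
    {"revenue": None, "loaded_at": now - timedelta(hours=2)},
    {"revenue": 7.5,  "loaded_at": now - timedelta(hours=3)},
]
report = health_check(rows, "revenue", "loaded_at")
print(report)  # a 1-in-3 null rate exceeds the 10% threshold, so the data is flagged
```

In a data mesh, checks like this would run against every domain's published data products under the shared interoperability layer, so that "healthy" means the same thing in marketing as it does in finance.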
What’s the difference between data mesh & data fabric?
You may have heard of another, similar-sounding data architecture term: data fabric. While the two concepts have like-minded goals, their specific designs are actually quite different.
A data fabric is an integrated layer that encompasses all data connections and data sources within an organization, as well as the relationships that exist between that data. It is not a singular technology, but a design concept that leverages many different technologies, which work concurrently to ensure that all data is easily searchable. Since a data fabric has its finger on the pulse of all data throughout the organization, it can answer virtually any analytics query.
Like a data mesh, a data fabric prioritizes data accessibility and ease of use across the organization. It, too, has an integrated layer that connects all interoperable architectural components in order to maintain data standards and governance.
However, there are three core differences between a data fabric and a data mesh, which are:
- Product thinking is integral to a data mesh, not a data fabric. In a data mesh, using product design thinking as a means to serve the unique needs of each domain is fundamental.
- Data fabric relies on metadata; data mesh relies on domain owners. The underpinning of a data fabric is a rich layer of metadata, which drives recommendations for data pipelines, data delivery, the categorization of data assets, and so on. By contrast, a data mesh relies on subject matter owners to set the requirements up front for their domain’s data.
- Data fabric works with a center of excellence (COE); data mesh calls for new processes. A COE is a strategy that revolves around a data lake or data platform, where a small group of highly specialized data managers (typically, data engineers) control the organization’s data. A data fabric can work well with this strategy; however, a data mesh advocates for the complete opposite. Centralized, monolithic data platforms are counter to the goals of a data mesh.
Potential pitfalls of a data mesh
Despite the momentum of the data mesh (J.P. Morgan is one of many organizations throwing its weight behind this strategy), the industry has voiced several concerns. There are several potential pitfalls that organizations should be aware of before starting on a data mesh journey, which include:
- Data duplication
As data is reconfigured for a specific domain use, it strays from its original form and can become redundant or inconsistent. This duplication effect can prove to be a big problem for companies, impacting both data management costs and organizational trust in data.
- Increased number of technically-skilled employees
Under a data mesh architecture, each domain is responsible for its own data pipelines and infrastructure, and concerns have been raised about how organizations will supply the necessary skills. What’s more, the approach can appear inefficient; assigning technically skilled employees to each domain, instead of to the whole of the organization’s data, could quickly lead to scalability problems.
However, this issue is at least partially resolved by the fact that a data mesh should capture domain-agnostic data infrastructure capabilities in a centralized layer that handles back-end storage and processing for each domain.
- Risk of technical debt
Since each domain creates its own data pipelines, there is a risk that a lack of maintenance and deteriorating quality will lead to a large amount of technical debt. To prevent this, an organization needs to clearly mandate who is responsible for quality checks and general data maintenance.
- Slow-to-adopt process
The data mesh is an ambitious architectural change and, as such, requires a lot of investment up front in order to be able to see it through. While this may not be a specific fault of data mesh, as most large-scale data infrastructure overhauls are time-intensive, decentralizing all data operations is no small feat. Organizations should be prepared to take on a lot of work before diving into a data mesh.
- Incorrectly chosen technologies
While the autonomy granted to individual domains is ultimately a boon of the data mesh, adopting technologies shouldn’t be a free-for-all. Each technology has an effect on the organizational data platform, which means there should be clear guidelines and oversight in place to ensure that the technologies adopted are standardized across the organization and future-proofed. For example, look for data engineering platforms with rich integration capabilities and a shared language, such as Trifacta.
- Inhibits cross-domain analytics
While the data mesh certainly solves many problems brought about by data lakes, it often fails to address a key benefit of the data lake: cross-functional analytics. Organizations should ensure they have a plan to address this need lest they close a door on analytic innovation.
Getting started with a data mesh
Despite its challenges, a data mesh still benefits a wide variety of organizations. It allows for wide-reaching data accessibility, fast turnaround times, and customized, domain-driven solutions all while operating under centralized data standards, governance, and observability.
In general, organizations with the most complex data infrastructure requirements are the most likely to benefit from a data mesh architecture. In other words, those with a large number of data sources and data domains, a huge number of data analysts and engineers, and a high priority on data governance. As these types of organizations embark on their data mesh journey, they should be sure to learn from successful examples, such as Intuit, AutoZone, or the aforementioned J.P. Morgan.