What is Data Fabric?
As data grows increasingly complex and distributed, new data management techniques have emerged to meet the challenges that come along with it. One such technique is called “data fabric,” which Gartner named as one of its top 10 data and analytics technology trends for 2021.
What is a Data Fabric?
A data fabric is an integrated layer that encompasses all data connections and data sources within an organization, as well as the relationships that exist between that data. It is not a singular technology, but a design concept that leverages many different technologies, which work concurrently to ensure that all data is easily searchable. Because a data fabric maintains visibility into all data throughout the organization, it is well positioned to answer virtually any analytics query.
Metadata is the Backbone of a Data Fabric
A data fabric thrives on rich metadata. Metadata is "data about data"—in other words, information such as what the data contains or how it is structured—and is essential for all stages of the data lifecycle. In a data fabric, the goal is for metadata to connect interoperable components, serve as a barometer for the fabric's success, and point to areas for improvement.
To do so, a data fabric depends on two types of metadata: "active" and "passive," as defined by Gartner. Passive metadata is metadata designed for a predetermined use (such as data models, schemas, or glossaries); it also includes runtime metadata, such as logs or audit information. Active metadata, on the other hand, is AI-driven. In a data fabric, active metadata is what drives continued improvements to the data fabric design.
As much as possible, Gartner recommends that a data fabric convert passive metadata into active metadata. This can look like "continuously analyzing available metadata for key metrics and statistics and then building a graph model" or "leveraging key metadata metrics to enable AI/ML algorithms that learn over time and churn out advanced predictions regarding data management and integration." In both cases, the metadata plays an active role in improving the distribution of data across the organization.
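To make the idea concrete, here is a minimal sketch of "activating" passive metadata. The access logs, job names, and dataset names are all hypothetical; the point is that runtime metadata (which jobs read which datasets) can be turned into a co-access graph and a usage-based recommendation without any human intervention:

```python
from collections import Counter, defaultdict

# Hypothetical runtime (passive) metadata: logs of which jobs read which datasets.
access_logs = [
    {"job": "daily_report", "reads": ["sales", "customers"]},
    {"job": "churn_model", "reads": ["customers", "support_tickets"]},
    {"job": "exec_dashboard", "reads": ["sales", "customers"]},
]

# Build a co-access graph: datasets read together by the same job are
# likely related -- a simple relationship a fabric can surface automatically.
graph = defaultdict(set)
usage = Counter()
for log in access_logs:
    for ds in log["reads"]:
        usage[ds] += 1
    for a in log["reads"]:
        for b in log["reads"]:
            if a != b:
                graph[a].add(b)

# The "active" output: recommend the most-used dataset as a candidate
# for caching, replication, or closer quality monitoring.
hot_dataset, hits = usage.most_common(1)[0]
print(hot_dataset, hits)       # customers 3
print(sorted(graph["sales"]))  # ['customers']
```

A production fabric would of course use far richer signals, but the pattern is the same: metadata in, graph and recommendations out.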
Due to its critical role in a data fabric, metadata should be an important qualifier when selecting technologies. Organizations should prioritize technologies that share their metadata using open APIs and open standards in order to build a successful data fabric.
Why is a Data Fabric Necessary?
If the goal of a data fabric is unifying data for increased searchability and accessibility, why, you might ask, can't organizations use data lakes or data warehouses to combine all of their data, instead of a data fabric? First off, data fabrics and other common data repositories aren't mutually exclusive—in fact, a data fabric works best when accompanied by them.
However, the truth is, it isn't realistic to expect organizations to rely on a single centralized store. Most use a mixture of public clouds, or a combination of on-premises and cloud storage. On top of that, organizations ingest data from a variety of sources, such as social media or IoT.
In the past, other solutions used to tie together the many data storage and access points have fallen short. Organizations have tried point-to-point integrations, but each new integration adds significant cost and maintenance work, and these integrations are not particularly scalable. Data hubs are another architectural solution that attempted to solve this problem, but they often introduce a higher risk of poor data quality.
The Benefits of a Data Fabric
The benefits of a data fabric ripple out to nearly all facets of an organization and primarily fall under three categories:
- Self-service data access & increased insights
This is perhaps the most tangible benefit of a data fabric. Since a data fabric allows for increased data integration and the ability for organizations to routinely analyze larger quantities of data at once, there is a much greater potential for new and more frequent analytic insights.
Additionally, a data fabric provides the business with a single access point to find data—no longer do they have to request IT to piece together data from various data silos. The ability for business users to find the data they need fuels further innovation and new analytics projects across the organization, the monetary gains of which can be tremendous.
- Automated governance
Incorporated as part of a data fabric is a data governance layer, which is uniformly distributed across all data access points. As a result, organizations are afforded increased trust and data transparency, and can automatically enforce data policies across the organization.
Depending on the level of AI involved, organizations can also use their data fabric to automatically apply data governance based on the language used in certain documents or policies. In a matter of minutes, organizations can prove compliance and avoid potentially huge fines in the process.
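As an illustration of uniformly enforced policy at the access layer, here is a toy sketch. The rule, column names, and the simple pattern-based PII detection are all assumptions for the example, not a real governance product's behavior:

```python
import re

# Hypothetical governance rule: any column tagged as PII, or any value that
# looks like a US SSN, is masked before data is delivered to a consumer.
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def enforce_policy(row: dict, pii_columns: set) -> dict:
    """Apply the masking policy uniformly, regardless of where the row came from."""
    masked = {}
    for col, value in row.items():
        if col in pii_columns or (isinstance(value, str) and PII_PATTERN.search(value)):
            masked[col] = "***"
        else:
            masked[col] = value
    return masked

row = {"name": "Ada", "ssn": "123-45-6789", "region": "EU"}
print(enforce_policy(row, pii_columns={"ssn"}))
# {'name': 'Ada', 'ssn': '***', 'region': 'EU'}
```

Because the policy lives in one place and is applied at every access point, proving compliance becomes a matter of inspecting a single enforcement layer rather than auditing every consumer.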
- Automated data engineering tasks
Unlike traditional, end-to-end data integrations and manual data pipeline monitoring, a data fabric works largely on its own—there is little or no code to create or maintain. Not only does this save data engineers a huge amount of valuable time, but it also reduces the human error that inevitably comes with hand-written code.
Using metadata, a data fabric also automatically helps optimize data integration, which improves data delivery, as well as workload balancing and elastic scaling. A data fabric can even help automate data discovery tasks, depending on the unique needs of the organization, to accelerate a data asset’s time to value. In essence, a data fabric reduces a lot of the necessary data engineering work.
The Main Components of a Data Fabric
As mentioned, a data fabric is not a singular technology, but the combination of many technologies. Using metadata as the underlying thread, these technologies must provide certain capabilities, which, as defined by Gartner, include:
- Data Catalog
A data catalog is a critical component of a data fabric. It allows organizations to access and represent all metadata types, and serves as an inventory for all data assets. Therefore, it is the data catalog that gives data the right metadata context so that it can be shared across environments. A data catalog also allows metadata to be added to certain data types automatically, and can extract certain metadata for storage.
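A data catalog can be thought of as a searchable inventory keyed by metadata. The following is a minimal sketch of that idea; the entry fields (location, schema, tags) are illustrative and not taken from any particular catalog product:

```python
from dataclasses import dataclass, field

# A toy catalog entry carrying passive metadata about one data asset.
@dataclass
class CatalogEntry:
    name: str
    location: str                 # where the asset physically lives
    schema: dict                  # column name -> type
    tags: list = field(default_factory=list)

catalog = {}

def register(entry: CatalogEntry):
    """Add an asset to the inventory."""
    catalog[entry.name] = entry

def search(tag: str):
    """Find assets by tag, regardless of which store actually holds them."""
    return [e.name for e in catalog.values() if tag in e.tags]

# Assets in completely different environments share one inventory.
register(CatalogEntry("orders", "s3://lake/orders", {"id": "int", "total": "float"}, ["finance"]))
register(CatalogEntry("ledger", "postgres://erp/ledger", {"id": "int"}, ["finance", "audit"]))
print(search("finance"))  # ['orders', 'ledger']
```

The key property is that the search result spans storage environments—the metadata, not the storage location, is what makes the asset findable.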
- Knowledge Graph
A knowledge graph is what gives a data fabric its meaning. A knowledge graph enriches data with semantics about the usage of data across the organization so that it’s easy for analytics leaders to interpret. With the knowledge graph, the organization can better identify relationships across multiple data repositories, which can then be used in AI/ML algorithms to power data models.
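A common way to represent such relationships is as subject–predicate–object triples. Here is a minimal sketch under that assumption; the asset names, predicates, and repositories are hypothetical:

```python
# A tiny triple store: (subject, predicate, object) facts linking data
# assets across different repositories.
triples = {
    ("orders", "stored_in", "data_lake"),
    ("ledger", "stored_in", "erp_db"),
    ("orders", "joins_with", "ledger"),
    ("ledger", "owned_by", "finance_team"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern (None acts as a wildcard)."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# How does 'orders' relate to the rest of the organization's data?
print(query(subject="orders"))
# [('orders', 'joins_with', 'ledger'), ('orders', 'stored_in', 'data_lake')]
```

Even this toy version shows how a graph lets the fabric answer relationship questions ("what joins with what, and where does it live?") that span repositories—exactly the kind of features AI/ML models can then consume.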
- Active Metadata Management
Active metadata management technologies are critical to surfacing suggested changes to the data fabric brought about by active metadata. This allows the data fabric to improve continuously and automatically, without constant revision by data engineers.
- Data Preparation & Delivery Layer
The data preparation and delivery layer of a data fabric is where data is made available to users. It is important that the technology (or technologies) selected for this layer be accessible to all types of users—not just those within the IT department. In particular, business users should play a critical role in driving data preparation; their unique context allows the data to be best transformed and used for analytics.
For this to happen, organizations should follow an ELT (as opposed to an ETL) style. This allows for data transformations to happen after raw data has been extracted and loaded into its respective repository, which gives users more autonomy in deciding how it should be transformed. Selecting a data engineering platform that enables this ELT style and user-friendly data preparation should be a top priority for organizations interested in building a data fabric.
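The difference between the two styles is where the transform step runs. The sketch below illustrates ELT using an in-memory SQLite database as a stand-in for the destination repository (table names and the cleanup logic are invented for the example): raw data is extracted and loaded untouched, and the transformation happens afterwards, inside the repository, where users can decide how to shape it:

```python
import sqlite3

# Stand-in for the destination repository (warehouse, lake, etc.).
conn = sqlite3.connect(":memory:")

# Extract + Load: raw records land untransformed in a staging table.
raw = [("2021-01-05", " 100 "), ("2021-01-06", "250")]
conn.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw)

# Transform: happens AFTER loading, inside the repository, so users can
# reshape the raw data on their own terms (here: trim and cast amounts).
conn.execute("""
    CREATE TABLE sales AS
    SELECT day, CAST(TRIM(amount) AS INTEGER) AS amount
    FROM raw_sales
""")

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350
```

In an ETL style, the trimming and casting would instead happen in pipeline code before loading, locking the transformation choices in before any user ever sees the raw data.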
- Orchestration & DataOps
In order for data to flow continuously, and on time, from one place to another, certain processes and scheduling must be in place. That's what the orchestration and DataOps layer of a data fabric accounts for. In many cases, this functionality is built into data preparation and data engineering platforms, since it is essential for seamless data preparation. Organizations should be able to "set and forget" many of their routine data preparation pipelines in order to ensure that timely and fresh data is always delivered.
It can be intimidating to begin a data fabric journey, but odds are, you already have a good place to start—your ETL processes. It is through these processes that you've historically handled the majority of your data integration work, and it is where you can now begin to adjust processes (such as moving to an ELT style) and add in necessary technologies to fill in any gaps in metadata, governance, data preparation, etc.
Adding more and more data to your core (with extensive metadata, of course) is the next step to building out your data fabric. The active metadata and machine learning models may be a bigger need to fill, but take your time—it’s better to start small and grow out a data fabric slowly than take too much on at once.
One thing is for sure: there's a reason that Gartner named data fabric as one of its 2021 trends—the technique addresses a lot of needs, and will only grow more popular in the coming years.
If you liked this post, read more about our thoughts on the latest trends in data engineering on the Trifacta blog.