Why a Data-Driven World Needs a New Approach to Metadata

December 10, 2015

In the evolving context of the data-driven enterprise—which is so often focused on agile analytics and structure-on-use—we need a new approach to metadata services. In this post, I’ll explain why metadata services are stuck in the past, and what we need to support modern use cases.

Metadata Management Then & Now

In 20th-century enterprises, metadata was a deeply engineered artifact: the blueprint for enterprise-wide data constructs, painstakingly designed in advance to be the “golden master” of enterprise truth. Metadata management software of that era was designed to support this architectural philosophy and its associated waterfall engineering processes.

By contrast, today’s data-driven enterprises are using data to generate new value, by exhaustively monitoring and creatively analyzing data about emergent behaviors—whether that data is generated by customers, products, devices, or physical sensors.  Creative data analysis is the key to turning logs of behavior into decisions and products of business value.  As a result, the lion’s share of useful metadata should arise through the agile work processes needed to capitalize on an ever-expanding world of behavioral data. The nature of this metadata is not known—or knowable!—until it is generated, as a byproduct of people working with data.  In short, data-driven enterprises need to capture metadata on use: the emergent, contextual information that arises naturally when data is assessed and analyzed in service of generating value to the organization.

Metadata on Use

To get a sense of this distinction, consider a scenario of how metadata might be generated in the lifecycle of the modern data-driven business. An innovative consumer electronics company decides to leverage usage logs from its devices for multiple purposes: to understand customer behavior, improve product usability, and offer new differentiated product features. This effort begins with aggressive data wrangling of the usage logs. People who understand the business context work to gather raw logs across a variety of devices, teasing apart their structure and content, assessing and remediating data quality, and blending the multiple logs to enable analyses across users, devices, time and geography. The final result might be a number of structured datasets that can be leveraged for analysis to support a variety of business purposes.

But the resulting data products are not the only value created during this process. The details that the analysts uncovered during the hard work of exploring and wrangling this data are gold, but they typically stay in the analysts’ heads and are quickly forgotten. These details include why they chose some data and discarded other data, how and why they transformed their data, who was involved, and which datasets were associated with which people. When this contextual information is captured and logged alongside the data products, there are benefits in the short term for documenting and debugging those products, and benefits in the long term for the organization’s understanding of what data it has, how it gets used, and who knows about it.
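As a rough illustration of what capturing this context could look like, here is a minimal Python sketch of an append-only log kept alongside the data products. The names `UsageEvent` and `UsageLog`, the field names, and the example dataset and users are all hypothetical, chosen only to mirror the questions above (what was done, why, and by whom); they do not describe any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class UsageEvent:
    """One piece of context captured while someone works with a dataset (hypothetical schema)."""
    dataset: str   # which data product the note belongs to
    user: str      # who was involved
    action: str    # what was done, e.g. "discard", "transform", "join"
    reason: str    # why the choice was made
    details: dict[str, Any] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class UsageLog:
    """Append-only log of usage events, stored next to the data products."""

    def __init__(self) -> None:
        self._events: list[UsageEvent] = []

    def record(self, event: UsageEvent) -> None:
        self._events.append(event)

    def history(self, dataset: str) -> list[UsageEvent]:
        """Everything recorded about one dataset, in the order it was recorded."""
        return [e for e in self._events if e.dataset == dataset]

    def people(self, dataset: str) -> set[str]:
        """Who has worked with this dataset, i.e. who knows about it."""
        return {e.user for e in self._events if e.dataset == dataset}


# Example usage with made-up values
log = UsageLog()
log.record(UsageEvent("device_sessions", "ana", "discard",
                      "older logs lack timestamps", {"rows_dropped": 1200}))
log.record(UsageEvent("device_sessions", "raj", "join",
                      "blend with geography lookup"))
print(log.people("device_sessions"))
```

In the short term, `history` supports documenting and debugging a data product; over time, `people` starts to answer who knows about which data.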

To support this fluid, unanticipated generation of knowledge, metadata services must be able to continuously support new users, new data sources, new types of metadata and new software components. At the same time, they have to provide an environment in which people (and software) can add value over time: mining, culling and organizing metadata in accordance with its utility, measured both in grassroots terms (e.g., via frequency of use) and in strategic terms (e.g., stated value to the organization).
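One way to picture the two measures of utility is a simple score that blends frequency of use with a stated value. The sketch below is only an assumption about how such a ranking might work: the function name `rank_by_utility`, the 0-to-1 scale for stated value, and the `weight` parameter are all hypothetical.

```python
from collections import Counter


def rank_by_utility(accesses: list[str],
                    stated_value: dict[str, float],
                    weight: float = 0.5) -> list[tuple[str, float]]:
    """Rank metadata items by a blend of grassroots and strategic utility.

    accesses:     one entry per time an item was used (grassroots signal)
    stated_value: value assigned by the organization, assumed to be in [0, 1]
    weight:       share of the score given to frequency of use
    """
    counts = Counter(accesses)
    max_count = max(counts.values(), default=1)
    items = set(counts) | set(stated_value)
    scores = {
        item: weight * counts[item] / max_count
        + (1 - weight) * stated_value.get(item, 0.0)
        for item in items
    }
    # Highest score first; ties broken by name so the order is stable
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Example usage with made-up values
ranking = rank_by_utility(
    accesses=["schema_note", "schema_note", "quality_flag"],
    stated_value={"quality_flag": 1.0, "owner_tag": 0.2},
)
print(ranking)
```

Items that nobody uses and nobody values sink to the bottom, which makes them natural candidates for culling.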

3 Requirements for a New Metadata Service

In order to succeed in the Hadoop environment, a new metadata service needs to meet basic criteria of interoperability and openness suited to metadata on use. The most important of these criteria can be derived from previously successful systems like HDFS:

  • It needs to be an open-source, vendor-neutral project.
    We need open metadata repositories with common APIs that foster connections across software.
  • It needs to provide a minimum of functionality and a maximum of flexibility, to leave opportunities for a broad range of unanticipated uses and value-added services.
    Postel’s Law advises being conservative in what you do and liberal in what you accept from others. Applied here, we should be conservative in how we define protocols, but liberal in what we accept from others.
  • It needs to scale out arbitrarily, both in volume and in workload; experience shows that metadata services can be Big Data problems in their own right.
    New metadata services must handle literal scale (data at scale, processing at scale) and must also scale out with the diversity of use cases that will arise.
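To make the second requirement concrete, here is a minimal Python sketch of a store with a deliberately small API (two operations) that accepts any JSON-serializable record without a fixed schema. The class name `MetadataStore` and its methods are hypothetical and do not correspond to any existing project. It runs in memory on one machine, so it illustrates only the "minimum of functionality, maximum of flexibility" idea, not openness or scale-out.

```python
import json
from typing import Any


class MetadataStore:
    """Hypothetical minimal metadata store: put a record, find records."""

    def __init__(self) -> None:
        self._records: list[dict[str, Any]] = []

    def put(self, record: dict[str, Any]) -> None:
        # Liberal in what is accepted: any keys and values are allowed,
        # as long as the record can be serialized to JSON.
        json.dumps(record)  # raises TypeError for non-serializable values
        self._records.append(dict(record))

    def find(self, **criteria: Any) -> list[dict[str, Any]]:
        # Conservative in what is offered: exact-match lookup only.
        # Richer services can be layered on top by other software.
        return [
            r for r in self._records
            if all(r.get(k) == v for k, v in criteria.items())
        ]


# Example usage with made-up values: records need not share a schema
store = MetadataStore()
store.put({"dataset": "device_sessions", "note": "timestamps are UTC"})
store.put({"dataset": "device_sessions", "owner": "ana", "tags": ["wrangled"]})
print(store.find(dataset="device_sessions"))
```

Keeping the core this small leaves room for unanticipated metadata types and for value-added services built by others.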

The Bottom Line

The time has come for open metadata services—and the potential for what we could do with these new services is huge. To learn more about the power of metadata, take a look at my presentation from this year’s Strata Hadoop event in New York: