The following piece from Trifacta Data Scientist Tye Rattenbury was originally published in DBTA’s Big Data Quarterly.
Data preparation is gaining considerable visibility as a distinct aspect of data management and analytics work. With leaders in the analyst community – Dresner Advisory Services, Gartner and Forrester – producing the first series of reports on the emerging data preparation market, we’re seeing this space defined in real time. So what is driving the increased focus on data preparation?
The most important driver in the emergence of the data preparation space is the maturing big data market. More than a decade old now, big data has had significant impact, adding value in many industries and even helping new industries emerge. However, this impact has been unevenly distributed. Industries with the requisite skills and cultures to adopt high-tech infrastructures, such as smart manufacturing, social media and internet-based market-makers, including Uber and Airbnb, have been the clear beneficiaries.
Whether you believe the big data hype – that the technology and approach to analysis will revolutionize every industry – is practically irrelevant, as the shift is already happening. The desire to get ahead of this movement, coupled with the natural fear of losing competitive edge, will pull nearly every organization into exploring and likely adopting big data technologies and processes.
History as Context
The adoption of big data technologies and processes parallels, in many ways, that of their 20th-century predecessors: enterprise data warehousing (EDW) and standard analytics pipelines.
The adoption of EDWs fundamentally changed the way organizations leveraged data, enabling, for the first time, a centralized mechanism for integrating and standardizing analysis of known value. EDWs generally operate in a schema-on-write fashion, requiring data to be structured in ways specified a priori. Consequently, these tools struggle with unstructured or heterogeneously structured data. On the personnel front, the focus on structured data results in a heavy reliance on database technicians, engineers and SQL scripters. After an initial implementation phase, these users tend to work primarily on pre-specified reporting and analytical pipelines. Little analytical exploration or discovery occurs because the technology and processes were not designed to support it.
Now consider big data technologies and processes. While they have demonstrated incremental impact by lowering the costs of data management and analysis compared to EDWs, big data’s real impact will stem from innovations revealed by data-driven exploration. To fuel this innovation, big data workflows must involve an expansive range of data, mixing structured, unstructured, historically unused and newly generated data. The ability to handle this range of data is enabled by policies like schema-on-read and tools that both efficiently manipulate poorly structured or heterogeneously structured data and provide interactive feedback on aspects of data quality.
Ultimately, these tools help users surface potentially valuable patterns and predictions. The goal then is to assess their business value. Consequently, exploratory data analysis will increasingly rely on business analysts, subject matter experts and data scientists embedded in business units. They understand the business contexts where the analyses could have genuine impact.
In the big data ecosystem, data preparation capabilities will constrain efforts to fill an innovation funnel because the innovation will be driven more by data variety than analysis variety – incorporating unstructured and heterogeneously structured datasets, as well as new slices and aggregations of existing datasets. Part of the preparation time is discovery-driven. Users need to figure out what a dataset contains, as well as the quality and reliability of the values within that dataset. The rest of the preparation is about structuring, filtering and enriching datasets so they capture phenomena at the right level of detail and quality and are compatible with downstream analysis and visualization tools like Tableau, R and SAS.
Data preparation broadly refers to any actions taken to alter the granularity, scope, or structure of a dataset. The granularity of a dataset is defined by the kind of entities it contains information about. In their most common form, datasets contain information about many instances of the same kind of entity. For example, health metrics pulled from patients in a controlled study, or weather measurements taken at different locations and times. Patients are the granularity in the first example, location/time pairs in the second. Common data preparation steps might shift the granularity from patients to cohort groups or from locations to regions.
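A granularity shift of this kind is, mechanically, an aggregation. The sketch below illustrates the patients-to-cohorts example with hypothetical field names (`cohort`, `heart_rate`) and plain Python; it is not any particular tool's API.

```python
from statistics import mean

# Hypothetical patient-level records: one row per patient (fine granularity).
patients = [
    {"patient_id": 1, "cohort": "treatment", "heart_rate": 72},
    {"patient_id": 2, "cohort": "treatment", "heart_rate": 80},
    {"patient_id": 3, "cohort": "control",   "heart_rate": 68},
    {"patient_id": 4, "cohort": "control",   "heart_rate": 74},
]

def shift_granularity(records, key, metric):
    """Aggregate records so each output row describes one group, not one entity."""
    groups = {}
    for row in records:
        groups.setdefault(row[key], []).append(row[metric])
    return {group: mean(values) for group, values in groups.items()}

# Coarser granularity: one value per cohort instead of one row per patient.
cohort_means = shift_granularity(patients, key="cohort", metric="heart_rate")
```

The same pattern, with a different grouping key, would roll locations up into regions.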
The scope of a dataset has two major dimensions. The first concerns the range of characteristics of an entity that are represented in the dataset. For example, patient health metrics could include blood chemistry, nervous system performance, bacterial populations or musculoskeletal integrity. The second dimension concerns population coverage. Are all the entities (e.g., patients, locations) represented in the dataset, or have any been intentionally and/or systematically excluded?
The structure of a dataset references its format or encoding. A commonly used format is a two-dimensional table where rows correspond to entities and columns to entity characteristics. Tabular data makes an explicit assumption that the same characteristics are measured for all entities in the dataset. When that assumption breaks down, it often makes sense to encode a dataset in a more extensible format, like JSON. Further along the spectrum are datasets where the characteristics of each entity have not been resolved or isolated, occurring as part of an unstructured field of text or numbers.
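To make the tabular assumption concrete, here is a minimal sketch (with invented patient fields) of what happens when two entities carry different characteristics: forcing them into one table produces missing cells, while an extensible encoding like JSON lets each record keep its own shape.

```python
import json

# Two patients measured on different characteristics; a fixed table would
# force missing cells, while an extensible encoding stores only what exists.
records = [
    {"patient_id": 1, "blood_chemistry": {"glucose": 5.4}},
    {"patient_id": 2, "musculoskeletal": {"grip_strength": 31}},
]

# Flattening to a table exposes the broken assumption: the union of all
# columns appears, with None wherever a characteristic was never measured.
columns = sorted({key for row in records for key in row})
table = [[row.get(col) for col in columns] for row in records]

# The JSON encoding round-trips without inventing any missing cells.
encoded = json.dumps(records)
```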
Overlaying the granularity, scope and structure characteristics of a dataset is the user’s awareness of these characteristics. The process of a user becoming more knowledgeable of their dataset is referred to as discovery.
Effective “Big Data” Preparation
Discovery is an ongoing activity and represents one of three important sub-processes of data preparation. The other two are specification and validation – specification of data transformations to address granularity, scope and structure issues; validation to confirm the intended transformations were executed.
To streamline discovery, specification and validation, follow these best practices:
- In-line profiling
- Sample-driven iterations with immediate scalability
- Structured transformation previews
- Transformation suggestions
Profiling refers to the wide range of checks that can be run against a dataset (before, during and after transformation). These checks provide the core mechanism for validating data transformation steps. Unfortunately, the overhead of writing profiling logic limits the use of profiling feedback. Providing more efficient and comprehensive profiling will make transformation validation more effective, increasing overall data preparation agility.
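A minimal sketch of what such a check might look like: one profiling pass over a single column, run before and after a transformation so the two profiles can be compared. The column name and records are hypothetical; real tools run many such checks across every column.

```python
def profile_column(rows, column):
    """Count missing values, observed value types, and the numeric range."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    return {
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }

raw = [{"age": 34}, {"age": "34"}, {"age": None}, {"age": 29}]
before = profile_column(raw, "age")   # reveals mixed types and a missing value

cleaned = [{"age": int(r["age"])} for r in raw if r["age"] is not None]
after = profile_column(cleaned, "age")  # validation: one type, nothing missing
```

Comparing `before` and `after` is exactly the validation step described above: the transformation is confirmed by the profile change rather than by eyeballing rows.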
Sampling and Scalability
Data preparation in the big data context is most efficiently and effectively driven via a variety of small, manageable samples of the full dataset. Small samples enable real-time profiling feedback. While there are mechanisms for providing near real-time feedback on large datasets, they require significant preprocessing. When the goal is to transform a dataset, however, preprocessing becomes inline processing and these mechanisms cease to be near real-time.
Once a script of transformations has been authored and validated against an appropriate set of samples, it’s time to run the script at scale. Running on the entire dataset should surface any remaining data quality issues. Ideally, this is a one-click process. For example, the script of transformations authored and executed on the sample should run, as is, on the entire dataset.
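The sample-then-scale workflow can be sketched as follows, using an invented dataset of raw temperature strings; the point is that the authored `transform` function runs unchanged on both the sample and the full dataset.

```python
import random

# Hypothetical full dataset: raw temperature strings, one of them malformed.
full_dataset = [{"temp": f"{t}C"} for t in range(1000)] + [{"temp": "n/a"}]

def transform(rows):
    """The authored script: filter bad records, then parse and convert units."""
    parsed = [r for r in rows if r["temp"].endswith("C")]
    return [{"temp_f": int(r["temp"][:-1]) * 9 / 5 + 32} for r in parsed]

# Author and validate against a small, manageable sample...
random.seed(0)
sample = random.sample(full_dataset, 20)
preview = transform(sample)          # fast enough for interactive feedback

# ...then run the same script, as is, over the entire dataset.
result = transform(full_dataset)
```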
Structured Transformation Previews
A close cousin to profiling is structured previews. Structured previews visualize the relevant functional aspects of the transformation in addition to its final results. For example, a transform that removes a subset of records should illustrate this deletion in the context of the dataset by highlighting the rows being deleted before applying the deletion. These structured previews help users specify and validate the most effective data transformation steps.
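The row-removal example above might look like the following sketch, in which the transform is first previewed by flagging each row rather than deleting it; the records and predicate are illustrative only.

```python
# A structured preview for a row-removal transform: rather than deleting
# immediately, annotate each row with whether the predicate would remove it,
# so the user can validate the effect in context before committing.
rows = [
    {"city": "Austin", "population": 964000},
    {"city": "", "population": 0},           # suspect record
    {"city": "Boston", "population": 676000},
]

def preview_removal(rows, predicate):
    """Pair every row with a removal flag instead of mutating the dataset."""
    return [(row, predicate(row)) for row in rows]

def missing_city(row):
    return not row["city"]                   # remove rows with no city name

flagged = preview_removal(rows, missing_city)  # user inspects flagged rows

# Only after validation is the deletion actually applied.
applied = [row for row, remove in flagged if not remove]
```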
Transformation Suggestions
For users who are ramping up on data preparation as a practice, it is important that data preparation tools provide targeted transformation suggestions. These suggestions are driven by a number of factors like the user’s prior behavior, dataset metadata and lightweight user interactions. Paired with structured previews and profiling, suggestions give users the ability to quickly and expansively explore their data.
Fast forward to a few years from now, and data volumes will have increased and nearly everyone in an organization will have the access, tools and sensibility to explore data and surface valuable insights. In other words, the top of the data-driven innovation funnel will be full.
In this setting, the bottleneck of data-driven innovation will shift back to “productionalization.” Organizations will spend significant resources determining how to leverage all the great options surfaced in their data exploration efforts. Productionalization issues will come in two forms. One will be the management problem of tracking, connecting, and consolidating all the innovation options. The second will involve building the infrastructure and expertise to quickly prototype and realize the value of these options. Both will be good problems to have.