Big Data Comes in all Shapes… and Sizes

October 9, 2014

It’s no secret that the aggressive use of data can provide major competitive advantages. Effective data-driven organizations employ technologies that make data easy to capture, and offer that data up for the entire organization to access, understand and analyze. In recent years, it has become clear that these “Big Data” technologies are used for a wide variety of data sets. In practice, Big Data comes in all shapes … and sizes.

For Big Data deployments to succeed, you have to (a) understand the causes and nature of this variety of shapes and sizes, and (b) provide analysts with the technology to take data of any size or shape, and transform it into useful forms.

How Big is “Big”?
To start with, how big is “Big” Data? Well, it’s all relative—even within an organization.

As an illustration, a friend recently posed the following thought experiment: how much storage would it take to maintain a bank balance for every person on the planet? The arithmetic is in the footnote*, but here’s the answer: 104 GB of data, uncompressed. That’s right, a global-scale bank-balance data set fits on an iPad.

By contrast, Google’s web index is reportedly some 2 million times larger: on the order of 200 Petabytes (a Petabyte is 1024 Terabytes).

Two global-scale, alphanumeric data sets—both useful—span roughly six orders of magnitude in size. For perspective, that’s like comparing the height of arguably the tallest player in NBA history, Manute Bol (who is, without question, Big), with the distance across America. The definition of Big depends entirely on your perspective and what you’re trying to accomplish.
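For readers who like to check such comparisons, here is a quick back-of-the-envelope sanity check in Python. The specific figures (Bol’s 7'7" height and a rough 4,500 km coast-to-coast distance) are approximations introduced here purely for illustration:

```python
# Rough sanity check of the size comparison above (all figures approximate).
bank_balances_gb = 104                    # global bank-balance data set (see footnote)
web_index_gb = 200 * 1024 * 1024          # ~200 Petabytes, expressed in GB

bol_height_m = 2.31                       # Manute Bol, 7'7", in meters
across_america_m = 4_500_000              # ~4,500 km coast to coast

print(f"data ratio:     {web_index_gb / bank_balances_gb:,.0f}x")   # ~2 million
print(f"distance ratio: {across_america_m / bol_height_m:,.0f}x")   # also ~2 million
```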

We see similar variance in dataset sizes among our customers at Trifacta, and, importantly, we see it within individual customers. The key point is that size determines what data professionals can expect to accomplish, and therefore how they work. When you’re crunching 100 Terabytes, you expect to go out for coffee … and dinner … and perhaps a good night’s sleep. That means you need a reliable batch-processing technology like MapReduce, and a design discipline that involves a fair bit of preliminary work, typically on samples of your data. By contrast, if you’re ticking through a hundred datasets, each of which fits in memory, you want to work on each dataset as a whole at the speed of thought: stay as agile and informed as possible, using interactive technologies. Fortunately, the Hadoop community has come to understand the importance of supporting a variety of scales, and has delivered not only batch-processing MapReduce, but also interactive engines like Spark, Impala, Tez and Stinger.
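As a very rough sketch of that difference in working style (not how any particular engine or product makes the call, and with the memory budget, sample size, and file path as placeholders), the discipline often boils down to something like this:

```python
import os

import pandas as pd

MEMORY_BUDGET_BYTES = 8 * 1024**3   # placeholder: roughly the RAM you can spare

def load_for_exploration(path):
    """Load a dataset for interactive work, falling back to a sample when it won't fit."""
    if os.path.getsize(path) <= MEMORY_BUDGET_BYTES:
        # Small enough: pull the whole dataset into memory and iterate at the speed of thought.
        return pd.read_csv(path)
    # Too big for memory: prototype the transformation on a sample, then run the
    # settled logic as a batch job (MapReduce, Spark, etc.) over the full data.
    return pd.read_csv(path, nrows=1_000_000)
```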

The maturation of these engines is an important development for the Big Data ecosystem.  Organizations investing in Big Data can benefit enormously from this progression—but only if they empower their data professionals with technologies up the stack that can leverage the right engine and methodology for the right job.

The Shape of Data Today
Big Data software doesn’t discriminate—it makes it easy to store data in any structure or format. Organizationally, this “schema on use” approach is fantastic. Everybody in the organization can use the same storage infrastructure—gather at a single watering hole—and share resources and expertise. When an individual or team decides to invest time and energy into wrangling a particular data set into shape, everyone can benefit from the work they put in, and the data products they generate.

Of course, this diversity presents technical challenges. The Big Data customers we see at Trifacta deal with data coming in all kinds of layouts: ragged unstructured text files like UNIX system logs, irregular and nested semistructured data in formats like JSON, tabular data in text formats like CSV, and the list goes on. Meanwhile, we see growing adoption of a variety of serialized and compressed storage formats that the Hadoop community is embracing, including Avro, Parquet and ORC. And of course once an organization has made sense of a data set, it should consider describing it via metadata in standards like HCatalog, so it is accessible to a wide range of end-user tools for querying and analytics.

By contrast, analytic tools—including both predictive analytics and BI visualizations—fundamentally need their input data to be in a tabular format.  After all, the job of those tools is to “run the numbers”. That requires getting the numbers together into rows and columns.

As a result, somebody inevitably has to take the disorderly Big Data inputs, and wrangle out the relevant parts into structured formats for analysis. This explains why data transformation is such a critical part of the Big Data ecosystem, and why analysts report spending up to 80% of their time transforming data.
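As a tiny, hand-rolled illustration of that wrangling step (not Trifacta’s implementation; the log lines and field names below are invented), here is how one might pull the relevant parts out of JSON-formatted log lines and flatten them into rows and columns that an analytics tool can consume:

```python
import csv
import json

# Semi-structured, JSON-formatted log lines: nested and irregular (invented examples).
raw_lines = [
    '{"user": {"id": 42, "country": "US"}, "event": "click", "ts": "2014-10-09T12:01:33Z"}',
    '{"user": {"id": 7}, "event": "purchase", "amount": 19.99, "ts": "2014-10-09T12:05:10Z"}',
]

# Wrangle the relevant parts into flat rows and columns.
rows = []
for line in raw_lines:
    record = json.loads(line)
    rows.append({
        "user_id": record["user"]["id"],
        "country": record["user"].get("country", ""),  # irregular: not always present
        "event": record["event"],
        "amount": record.get("amount", 0.0),
        "ts": record["ts"],
    })

# Emit tabular CSV that BI and analytics tools can run the numbers on.
with open("events.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["user_id", "country", "event", "amount", "ts"])
    writer.writeheader()
    writer.writerows(rows)
```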

Data Transformation: Extracting Value from all Shapes and Sizes
A healthy Big Data environment begins with an investment in data storage, but must lead to payoffs via aggressive data usage. The path from storage to usage goes directly through data transformation. The progression is straightforward:

  1. A shared watering hole of format-agnostic Big Data storage makes data capture easy, which attracts users across the organization.

  2. The capability to easily capture data attracts more data—and more interesting data—over time. Inevitably, that data comes in a wide variety of formats.

  3. Data transformation is required to wrangle that variety of data into structured formats and features for analysis.

So data transformation is a critical task, and platforms to support the task need to address the broad variety of users who gather at the watering hole: whether they’re data scientists, data engineers, or business data analysts.

Moreover, it’s important to realize that every analytic task is undertaken with a business purpose, and data transformation has to be performed with that purpose in mind.  For example, Trifacta is often used to transform raw logs of customer interactions with a product. But these logs get transformed in different ways for different purposes within the organization. Product designers, for example, may need specific detailed aspects of product usage: say, the time that elapses between each user button-press, and the corresponding result being shown. These extracted metrics may then be joined with data from customer support records to drive an analysis of the correlation between product performance and customer satisfaction. By contrast, product marketers may be more interested in coarser-grained customer engagement—say, what times of day the customers tend to use the product—and join that information with customer demographics to do better-targeted advertising. This diversity of purposes means that going from raw data to business value is not a mechanical process: it requires empowering people to make sense of a variety of data, and manipulate that data in purpose-driven ways to drive a particular analytic or BI process.
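To make that concrete, here is a small sketch of the two purpose-driven transformations described above, applied to the same invented interaction log; the field names and timestamps are made up for illustration:

```python
from collections import Counter
from datetime import datetime

# Invented records from a raw product-interaction log.
events = [
    {"user": "u1", "action": "button_press", "ts": "2014-10-09T09:15:02"},
    {"user": "u1", "action": "result_shown", "ts": "2014-10-09T09:15:03"},
    {"user": "u2", "action": "button_press", "ts": "2014-10-09T17:42:10"},
    {"user": "u2", "action": "result_shown", "ts": "2014-10-09T17:42:14"},
]

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")

# For product designers: elapsed time between each button press and its result.
pending, latencies = {}, []
for e in events:
    if e["action"] == "button_press":
        pending[e["user"]] = parse(e["ts"])
    elif e["action"] == "result_shown" and e["user"] in pending:
        latencies.append((e["user"], (parse(e["ts"]) - pending.pop(e["user"])).total_seconds()))

# For product marketers: which hours of the day customers tend to use the product.
usage_by_hour = Counter(parse(e["ts"]).hour for e in events if e["action"] == "button_press")

print(latencies)       # [('u1', 1.0), ('u2', 4.0)]
print(usage_by_hour)   # Counter({9: 1, 17: 1})
```

Each result would then be joined with a different second data set (customer support records in the first case, customer demographics in the second) before the downstream analysis.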

And of course, as I mentioned up front, the scale of the datasets being transformed has to be taken into account. The ability to choose the right engine technology—whether batch or interactive—to fit the size of data and type of work you’re doing is a requirement for effective data transformation.

Trifacta v2
In Trifacta v2, we’ve worked hard to ensure that our customers can use our groundbreaking data transformation technology across the broad range of shapes and sizes they deal with in the Hadoop ecosystem. The Predictive Interaction™ technology we initially brought to market in v1 has evolved with the introduction of new Visual Profiling features, resulting in the most advanced and elegant interface ever developed for assessing and manipulating data. And now we’re proud to provide what is by far the most complete range of coverage for the Big Data landscape from any data transformation product. Trifacta v2 works with a wide variety of data formats spanning from structured to semi-structured to unstructured. We enable processing at a variety of scales as well: massive data is supported by MapReduce, and data that fits in memory is supported by Spark. Thanks to extensive input from our customers and the hard work and innovation of our team, Trifacta v2 is designed to match the realities of Big Data today—in all its shapes and sizes.

*There are approximately 7 billion people on earth. A unique ID for each person would require a 64-bit (8-byte) integer to represent. So, using 8 bytes * 7 billion people = 56 billion bytes, we can store the personal ID of everyone on the planet. Bill Gates’ net worth is approximately $81 billion, which requires another 64-bit integer to represent, so let’s store another 64-bit (8-byte) integer for each person on the planet: another 56 billion bytes. That’s 112 billion bytes in total; dividing successively by 1024 we get 109,375,000 KB = 106,812 MB = 104 GB.
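For anyone who wants to double-check that arithmetic, here is the same calculation as a short Python snippet:

```python
# Back-of-the-envelope storage for a global bank-balance table (footnote arithmetic).
people = 7 * 10**9              # approximate world population
bytes_per_person = 8 + 8        # 64-bit ID plus 64-bit balance

total_bytes = people * bytes_per_person   # 112 billion bytes
kb = total_bytes / 1024                   # 109,375,000 KB
mb = kb / 1024                            # ~106,812 MB
gb = mb / 1024                            # ~104 GB

print(f"{total_bytes:,} bytes = {kb:,.0f} KB = {mb:,.0f} MB = {gb:,.0f} GB")
```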