Data Enrichment in the Cloud – Why Data Marketplaces need Data Prep

July 29, 2020

These days, we can’t get enough of the cloud. Everyone is moving to the cloud or, more precisely, has already moved some portion, if not all, of their data analytics there.

And for good reason. When you need spare compute capacity for a one-time analysis, you just spin up some servers, run your analytics, and then spin them back down. No need to go through the capital budgeting process, procure servers, racks, and cabling, wait for space in your data center, and then wait for someone to provision the servers, the OS, and the network, all so you can add the machines to a cluster and run your analysis. The cloud takes care of all of that, which is wonderful.

And now that more organizations are firmly committed to the cloud, so are 3rd-party data providers that offer datasets for data enrichment. These companies provide high-quality, curated financial data, weather data, health trend data, or industry-specific data, to name just a few examples. It used to be that you had to contact a salesperson at a specific data vendor about the dataset you were interested in, negotiate to get a sample, and then, if you liked it, purchase a subscription, usually for a year at a time.

Finding the right data in the first place and then completing the financial transaction could be a time-consuming hassle. Fortunately, the process has changed recently: the large cloud vendors have begun to provide marketplaces, like AWS Data Exchange, that curate a wide variety of vendor data in a single location. Data marketplaces make it extremely easy to find, subscribe to, and import 3rd-party vendor data into your analytics environment.

However, one step still remains after you import the data: blending and prepping it so that it aligns with the data you already have. The reality is that you can’t always rely on the data you just purchased to work with your own data out of the box, and sometimes its quality is lacking as well. No two vendor datasets are ever the same, and somehow the process of getting vendor data into a specific format or schema always takes longer than expected. It isn’t just a question of accessing the data. You still need to prepare, cleanse, and blend it with existing datasets to make it useful.

Organizations typically take one of two approaches to address this issue: the first is Excel, and the second is code. Both will work, but they sit at opposite ends of the spectrum. Excel is very easy to use, as most people are familiar with it, but it lacks the sophistication required to manipulate large datasets, automate work, or see the data lineage. Writing code is the polar opposite. You can do pretty much anything you want to your data using Python, given its sophistication and flexibility, but this is not something the average data analyst can do well. It requires a coding expert, and even then it can be a difficult and time-consuming process to explore the data’s contents and to manipulate and blend the data.
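To give a sense of what the code route involves, here is a minimal sketch in plain Python. Everything in it is hypothetical and for illustration only: the internal sales records, the vendor weather feed, the field names, and the `clean_vendor_row` and `enrich` helpers are all assumptions. It converts vendor rows to the internal date format, region codes, and temperature units, then left-joins them onto the existing records.

```python
from datetime import datetime

# Hypothetical internal records: ISO dates and upper-case region codes.
internal_sales = [
    {"date": "2020-07-01", "region": "CA", "units_sold": 120},
    {"date": "2020-07-02", "region": "CA", "units_sold": 95},
    {"date": "2020-07-02", "region": "NY", "units_sold": 80},
]

# Hypothetical vendor weather feed: US-style dates, inconsistent region
# casing and whitespace, Fahrenheit values as strings, one missing reading.
vendor_weather = [
    {"obs_date": "07/01/2020", "state": "ca ", "temp_f": "86.0"},
    {"obs_date": "07/02/2020", "state": "Ca", "temp_f": ""},
    {"obs_date": "07/02/2020", "state": "ny", "temp_f": "77.0"},
]


def clean_vendor_row(row):
    """Convert one vendor row to the internal conventions."""
    date = datetime.strptime(row["obs_date"], "%m/%d/%Y").date().isoformat()
    region = row["state"].strip().upper()
    raw = row["temp_f"].strip()
    # Keep missing readings as None instead of guessing a value.
    temp_c = round((float(raw) - 32) * 5 / 9, 1) if raw else None
    return {"date": date, "region": region, "temp_c": temp_c}


def enrich(sales, weather):
    """Left-join cleaned weather rows onto sales by (date, region)."""
    lookup = {}
    for row in map(clean_vendor_row, weather):
        lookup[(row["date"], row["region"])] = row["temp_c"]
    return [
        {**sale, "temp_c": lookup.get((sale["date"], sale["region"]))}
        for sale in sales
    ]


enriched = enrich(internal_sales, vendor_weather)
for row in enriched:
    print(row)
```

Even in this toy case, most of the code deals with format mismatches rather than the join itself, which is the effort this paragraph describes.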

Why can’t you have both ease of use AND sophistication? The short answer is, you can!

This is where data preparation platforms like Trifacta come into play. Trifacta helps analytics teams enrich their analysis with 3rd-party data more efficiently by combining the ease of use of tools like Excel with self-service automation. It offers visual guidance to help you discover and understand 3rd-party data, automatically visualizes your data, and proposes how to join and transform that data so it blends with your own.

No matter the structure or complexity of your data or the 3rd-party data, Trifacta accelerates the process of blending the two and getting your data ready to use.

If you want to learn more about how Trifacta can help, check out our eBook online. It shows how using an automated data wrangling tool greatly simplifies the data enrichment process. Put your 3rd-party vendor data to work for you more quickly: try out Trifacta for yourself today.
