Today’s blog is a guest post from Geoffroy de Viaris, a Data and Analytics Project Manager at a high fashion luxury goods manufacturer.
I’ve been working for more than 10 years with big businesses in the foodservice industry as well as luxury goods manufacturers managing, cleaning, and enhancing databases with millions of records. I’ve always had one goal in mind: provide our business with quality insights to monitor operations, inform decisions, and drive strategy. Over the years, I’ve encountered various technologies and processes, and I can say that I’ve been able to identify the best practices and tools that provide the highest impact for business. I’m now able to deliver scalable data warehouses for self-service analytics in record time.
Creating a data warehouse used to require a large team, long lead time, and a consequently large budget. Now, there is an affordable way for you to design your data warehouse quickly thanks to some innovative solutions in the cloud.
I recently delivered a webinar about this, and wanted to summarize the key takeaways to put you on the fast track with an end-to-end demo to get you started on your new cloud analytics journey. This blog focuses on the Google Cloud analytics suite and on Cloud Dataprep in particular so you can turn your raw data into gold. I also added a few best practices recommendations so you can be sure to scale from day one.
The old way: On-premise analytics
The main strategy for many years was to use Extract Transform Load (ETL) tools on employees’ computers or in the company server room, a.k.a. “on-premise”. The goal here was to prepare the data (deduplication, cleaning, combining, etc.) to create your local data warehouse. This worked quite well for some time but had a few limitations.
ETL/On-Premise limitations: Forecast, Scalability & Price
This approach can work, but it’s important to bear in mind what your current data strategy is and where your business is heading.
As time passes, the volume of data you deal with will grow, and it may be hard to forecast your data needs in advance. Having your own server room is costly, and you don’t want to pay for something you won’t need. Scaling your servers up or down requires resources and time that turn into big costs and waste. I can tell you from experience that many companies, big or small, have a hard time forecasting their data needs in the long run. Which brings us to a new approach to tackle analytics initiatives.
The new way: Cloud Analytics
Over the past few years, the cloud has increasingly been adopted by big and small businesses because of its flexibility, price attractiveness, as well as its robust and secure environment for data. Let’s see how it can fix on-premise limitations. With the cloud, you can:
- Scale-up and scale-down in real-time as required without forecasting future needs
- Pay only for what you use, which is often cheaper and more secure than doing it yourself
- Process vast amounts of data faster than on-premise, thanks to the large number of servers available to parallelize the load
- Change your value to become a business partner versus a technology geek
Getting the data ready for analytics
With the cloud, a new generation of data services has been designed specifically to take full advantage of its benefits. This blog post uses Google Cloud to demonstrate data preparation as a new approach to designing and automating a data warehouse pipeline. In the Google Cloud, we can leverage Cloud Dataprep by Trifacta, which is the native tool Google Cloud provides for data preparation. It allows your teams to collaborate easily and save valuable time in getting data ready for analytics.
Let’s get started with the basic concepts of data preparation for data warehousing to turn your raw data into refined data (cf. Fig.1).
Raw data refers to your input data before it has gone through preparation (structuration, cleaning, enriching, validation, deduplication, combining, etc.). Input data can come from various sources, such as files (Excel or Google spreadsheets, CSV, etc.), Cloud Services such BigQuery (Google Cloud Data Warehouse), Google Cloud storage, and additional databases and applications.
Refined data refers to the prepared data that often resides in your Data Warehouse. Once data has been refined, it’s ready to be used in various ways. Some examples include:
- As a backend database or for direct analysis
- Implemented into a dashboard, with tools such as Data Studio, Tableau Software, QlikView, etc.
- Used to build machine learning models
The Cloud Dataprep user experience and concepts
To use Dataprep, it’s useful to get familiar with a few key concepts:
- The library is where you import your datasets (files and databases)
- A recipe is a list of steps that transforms one dataset into another
- The flows section is where you can arrange recipes and link them together
- The jobs section is where you run your recipes and flows and monitor the results
Guiding you to your first dashboard
Next, we’ll go through the steps required to leverage Cloud Dataprep to transform your raw data into refined data in Google BigQuery. We’ll use an e-commerce sales analytics example where we want to display the refined data in an attractive dashboard.
For reporting and dashboarding we’ll use Data Studio, Google’s free data visualization service. Once we’ve gone through the steps, the final result will look like this:
Below is a recorded demo you can follow to design your end-to-end data warehouse from data acquisition to dashboarding.
Here are all the files you need to try out this demo for yourself :
- The source files
- The Cloud Dataprep flow – go to the flow view and import the file
- The Data Studio dashboard (make a copy of it so you can customize it and connect to your data source)
If you don’t have a Google Cloud account, you can try it for 12 months with $300 worth of Google Cloud credit here: https://cloud.google.com/free
Bonus: The Top 5 Dataprep tips to scale fast
After using Cloud Dataprep for some time, you’ll notice that it can save you a massive amount of time. However, you can get even more out of it by following a few rules of thumb. I’ve compiled additional tips and best practices you can use early on in your data preparation journey. With these tips, you’ll be able to scale with confidence and facilitate broad adoption and collaboration in your organization with peace of mind.
1. Document your steps with comments
This might seem obvious to say, but more often than not, when you create your transformation steps, you know what you’re doing. So adding comments doesn’t seem necessary when you’re building the recipe. However, after adding dozens or hundreds of steps over a few weeks, you might come back to a specific step and not remember why you did it or why it’s in that particular order. Commenting also makes it easier for multiple people to collaborate on the same recipes. And yes, you can share data preparation flows and invite colleagues to contribute to refining the data.
Remember to not only describe in your comment WHAT you do but also WHY also you do it.
To add a comment in a recipe, just add a new step and type comment in the search field. Or, you can click the “Comment” icon on the icon bar.
2. For each dataset, define a “cleaning” recipe
When creating a flow, you might import 5 or 10 datasets and create even more recipes. And no matter how clean you may think the data is, it will usually require some additional cleaning and formatting before you can use that dataset in more complex operations such as joining or grouping. A good practice is to always add a cleaning recipe as close to the source as possible (even if at first you think you won’t need it).
3. Split your steps into multiple recipes wisely
If a recipe can have as many steps as you want, why not have all the steps in one big recipe? This, in fact maybe a bad decision. It is better to regroup all the steps achieving a common task in separate recipes. You will future-proof your data preparation flows. This facilitates unit testing, provides more readability to your flow, and helps with maintenance over time. These benefits apply not only to your cleaning steps but also to your enriching steps (like joining with an external dataset), your calculation steps, your aggregation steps, and others.
Remember that after each recipe, you can output the result with a job. This is particularly useful when testing a specific output or refreshing a sample to verify some data. You can also create multiple recipes, which branch out your flow as follows:
4. Create recipes dedicated to joins
The join step is a complex one. Joins will not only impact your flow but also the time it takes to open a recipe in your browser. For example, having your join steps in separate recipes will save you some time by letting you load a certain join step only when you really need to.
5. Backup your flows
This tip might also seem trivial, but at the end of the day, mistakes happen. So use the “Export Flow” feature to your advantage and define a backup strategy for all your flows. You’ll be happy you did when you need it.
Backup your flows according their importance and how often they change; the more important the content is, the more it gets updated, and the more you should back it up.
A good rule of thumb is that you should backup everything monthly, backup important flows weekly, and do a backup after each important change.
With these tips in mind, you can fast-track your data preparation practices with scale in mind, saving you additional time and ensuring your team members can work and collaborate much more efficiently!
I hope these best practices and quick guide will help you set up your own data warehouse in the Google Cloud with Cloud Dataprep. You’ve learned how to prepare your data for BigQuery and visualize it in Data Studio. Finally, now you also know some fundamental tips to take your Dataprep skills to the next level.
If you want to go one step further, you can read these other blogs:
- How to Automate a Cloud Dataprep Pipeline on File’s Arrival (leveraging Cloud Functions)
- Automation of Data Wrangling and Machine Learning on Google Cloud (leveraging Cloud Composer and BigQuery ML)