As a craft beer enthusiast, I occasionally seek out friends with similar tastes for recommendations on beers I am interested in trying. With over 3,500 microbreweries in the United States alone, relying solely on friends for beer recommendations is not a scalable approach!
Fortunately, we have crowdsourced beer review data from the website BeerAdvocate.com, where beer aficionados from all over the world have rated and critiqued beers. There are close to 1.6 million reviews from 1999 to 2012, spanning almost 66,000 different beers globally.
So, what does the raw review data look like?
The raw reviews file is just under 2 GB and is made up of semi-structured rows of text with data quality issues such as stray special characters and missing rating values for many of the reviews. So before I can do anything meaningful with this data, I need to put my Wrangle skills into action and tame this beast of a dataset with Trifacta.
Loading & initial peek at the dataset
I have loaded this dataset onto my Hadoop cluster, and upon bringing it into Trifacta for the first time I get a look and feel of the data similar to what is displayed above.
Data preparation is an iterative task, especially when dealing with large-scale datasets and building predictive models. Therefore, I start with a random sample of the data, and as I go through the preparation process, I can continuously resample to ensure I am working with a statistically representative sample. Also note that as I manipulate the data, each transformation step I add to the script is written in Wrangle, our Domain Specific Language, which is designed to abstract users away from the underlying code or execution framework that ultimately runs the transformations at scale.
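The sampling idea itself is simple to sketch outside of Trifacta. Here is a minimal Python illustration; the function name and sample size are my own assumptions for the example, not anything from the product:

```python
import random

# Sketch: draw a reproducible random sample of reviews so each wrangling
# iteration runs against a manageable subset; resample as the steps evolve.
random.seed(42)  # fixed seed so the sample is reproducible across runs

def sample_reviews(reviews, k):
    """Return a random sample of at most k reviews."""
    return random.sample(reviews, min(k, len(reviews)))

# hypothetical stand-in for the full 1.6M-review corpus
reviews = [f"review_{i}" for i in range(100_000)]
subset = sample_reviews(reviews, 1_000)
```

Working against a fixed-seed sample keeps each iteration fast while still letting you redraw a fresh sample whenever the transformation script changes.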
The current split for creating new records is based on a newline ('\n') character. However, with a slight change in logic, we can make a more intelligent split, as shown below, where the multi-line reviews have been flattened and I end up with a single row for each review.
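To illustrate the flattening logic, here is a rough Python sketch. It assumes, hypothetically, that each raw review is a block of `key: value` lines separated by a blank line; the actual Trifacta/Wrangle step works differently under the hood:

```python
# Sketch: flatten multi-line review records into one row per review.
# The blank-line-delimited block format below is an assumption for
# illustration, not the dataset's confirmed layout.

def flatten_reviews(lines):
    """Yield one flattened record (a single string) per review block."""
    block = []
    for line in lines:
        if line.strip():            # still inside the current record
            block.append(line.strip())
        elif block:                 # a blank line ends the record
            yield " | ".join(block)
            block = []
    if block:                       # flush the final record
        yield " | ".join(block)

raw = [
    "beer/name: Example IPA",
    "review/aroma: 4.5",
    "",
    "beer/name: Example Stout",
    "review/aroma: 3.0",
    "",
]
rows = list(flatten_reviews(raw))  # each review now occupies one row
```

The key change is that a newline alone no longer ends a record; only the blank line between blocks does.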
I then carry out a series of interactions on the data to extract some of the key fields I am interested in analyzing more closely. As I do this, I start to get a better picture of the format I want this data to take for my analysis – this beer dataset is beginning to get some life! Trifacta’s histograms and data quality bars enable me to quickly understand how my data is distributed and the quality of each feature of the data. This continuous visual feedback based on my interactions lets me adjust what I am doing at any step in the process.
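As a rough analogue of that field extraction, the sketch below pulls values out of a flattened row with a regular expression. The field names (`beer/name`, `review/aroma`) and the `|`-delimited row format are illustrative assumptions, not the dataset’s confirmed schema:

```python
import re

# Sketch: extract named fields from a flattened "key: value | key: value" row.
row = "beer/name: Example IPA | review/aroma: 4.5 | review/overall: 4.0"

def extract(row, key):
    """Return the value following `key:` in the row, or None if absent."""
    m = re.search(rf"{re.escape(key)}:\s*([^|]+)", row)
    return m.group(1).strip() if m else None

name = extract(row, "beer/name")      # a string field
aroma = extract(row, "review/aroma")  # still a string at this point
```

Note that every extracted value starts life as a string – which is exactly why the type inference described next matters.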
I go through the steps of renaming the fields as I extract them. Trifacta is also performing data type inference on these newly extracted fields, and the data quality bars start to give me initial indicators of the quality of these attributes. As shown below, the Aroma field has a red band indicating some mismatched values. This is because Aroma was inferred as an Integer type by default, even though those records contain decimal values.
I can then go ahead and choose the data type I wish to assign to Aroma from the many possibilities: primitive data types (Decimal in this case), pre-defined dictionaries, custom user-defined types, and even custom dictionaries.
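To see why an Integer type flags decimal ratings as mismatched, consider this small Python sketch – a stand-in for the idea behind Trifacta’s type validation, not its actual implementation:

```python
# Sketch: why the Aroma column shows mismatches under an Integer type.
# Values like "4.5" fail integer parsing but parse cleanly as decimals.
values = ["4", "4.5", "3.0", "5"]

def mismatches(vals, caster):
    """Return the values that the given type-caster rejects."""
    bad = []
    for v in vals:
        try:
            caster(v)
        except ValueError:
            bad.append(v)
    return bad

as_int = mismatches(values, int)      # decimal strings are mismatched
as_float = mismatches(values, float)  # every rating parses as a decimal
```

Switching the declared type from Integer to Decimal makes the red band disappear because every value now parses cleanly.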
Please stay tuned for part two of this blog series, where I walk through how I complete the wrangling process for this beer review dataset and dive further into analyzing the data.
If you want to try this out for yourself, I encourage you to sign up for the free trial of Trifacta and kick off your own wrangling experiment.