This summer I have had the privilege of spending the last two months interning at Trifacta, working hand-in-hand with the marketing team on a variety of initiatives. I have long been intrigued by big data and the hype surrounding it, so this was a great opportunity to learn a ton about an industry that fascinates me and to hear firsthand from many of the brightest minds leading the charge toward a new world of big data analytics. Two months ago, I knew little about what the term “data wrangling” really meant; now I can’t imagine an organization that doesn’t do some sort of wrangling every day. One of the perks of my internship was getting to spend some hands-on time with the product, so I thought there would be no better way to close out my incredible experience here at Trifacta than to share one of the many skills I have learned along the way: data wrangling.
As an avid sports fan, I decided to focus on something every fellow sports fan is familiar with: March Madness. Everyone wants a competitive advantage when filling out a March Madness bracket, but going on gut intuition is unlikely to be successful, and copying your favorite sports analyst is unfulfilling. What about doing your own statistical analysis to find trends in historical data and applying them to a predictive model? At face value, this may seem fairly complicated. And indeed, gathering the data, then enriching, cleaning, and structuring it can be half the battle and a huge chunk of the time.
For this analysis, I used data provided by Kaggle.com for a past competition held before the 2014 NCAA tournament. In that competition, the goal was to create the machine learning algorithm that best forecast the results of the 2014 NCAA Division I basketball tournament. Kaggle did not provide perfectly clean data; instead, it provided multiple datasets that, on their own, are pretty much meaningless. Below you will see an example of one of the provided datasets.
The dataset above, regular_season_results.csv, is one of six provided datasets. The other five are tourney_results, tourney_slots, seasons, teams, and tourney_seeds, all in .csv format. As you can see, the information is a jumble of numbers and symbols; it is hard to perform any statistical analysis with the data in that form. Instead, competitors were tasked with first manipulating the datasets into something useful.
Cleaning, structuring, and enriching all of these datasets with a coding approach such as Python or R would be neither as visual nor as intuitive. Instead, I will use Trifacta to make this process much quicker and easier.
Beginning the Transformation
First, I have to load all the datasets into Trifacta from my desktop. Doing this is quite easy: you simply upload each file and name it. The result of uploading and renaming the six datasets I will be working with is shown below.
For my first transformation, I will focus on three of the datasets: Regular Season Results, Teams, and Seasons. Regular Season Results, which was pictured earlier, is a dataset with the season (labeled with a letter, A through R), the day number (day 0 being the start of the season), the winning team’s id number, the winning team’s score, the losing team’s id number, the losing team’s score, the location of the winning team (home, away, or neutral), and lastly the number of overtime periods. The Teams dataset contains a list of team id numbers (501 through 856) and the school matching each number, sorted alphabetically. The Seasons dataset contains each season (1995-96 through 2012-13) along with its matching letter (A through R) and the four regions of the tournament (i.e., East, Midwest, South, West). Clicking “Transform” on Regular Season Results, we are presented with the following:
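For readers who want to peek at the raw files programmatically, here is a minimal stdlib-only sketch of what the regular season data looks like once parsed. The column headers follow the Kaggle competition files; the row values here are made up for illustration:

```python
import csv
import io

# A sample row mimicking regular_season_results.csv (values are invented;
# column names are the ones used in the Kaggle competition files).
raw = io.StringIO(
    "season,daynum,wteam,wscore,lteam,lscore,wloc,numot\n"
    "A,7,501,68,502,62,H,0\n"
)

# DictReader keys each row by the header line, just like a spreadsheet.
games = list(csv.DictReader(raw))
print(games[0]["wteam"])  # 501 (note: a string; cast to int before arithmetic)
```

As the comment notes, every field comes back as a string, which is part of why the raw files are hard to analyze directly: the team ids still need to be matched against the Teams dataset before they mean anything.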
Trifacta recognized the dataset as a csv file, split the columns on the comma delimiter, and automatically inferred the headers and data types of each column. But there is not enough information in this dataset alone to be of use to me, so I will click on the hammer icon in the bottom left, next to the Transform editor, and choose Join to begin combining this dataset with the Teams and Seasons datasets.
Joining the Datasets
Joining in Trifacta is very simple: I select the dataset I want to join against, and Trifacta automatically examines the two datasets and recommends the appropriate key for the join. In this case, I want to match the “id” column in Teams to both the winning-team and losing-team columns of Regular Season Results, which pairs each winning team’s id with that team’s name and likewise each losing team’s id with that team’s name. After joining the two datasets, reordering the columns, and dropping the columns that are no longer important to me, I now have the following:
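If you were doing the same step in code, the id-to-name join can be sketched in plain Python. The field names (`wteam`, `lteam`) follow the Kaggle files, and the sample rows are invented for illustration:

```python
# Hypothetical sample rows mimicking the two datasets being joined.
games = [
    {"season": "A", "daynum": 7, "wteam": 501, "wscore": 68,
     "lteam": 502, "lscore": 62},
]
teams = {501: "Duke", 502: "Kansas"}  # id -> school name, from teams.csv

# The join: for each game, look up both ids in the Teams table and attach
# the matching school names, like Trifacta's recommended join on "id".
joined = [
    {**g,
     "winning_school": teams[g["wteam"]],
     "losing_school": teams[g["lteam"]]}
    for g in games
]
print(joined[0]["winning_school"], joined[0]["losing_school"])  # Duke Kansas
```

Because the Teams dataset is a one-to-one mapping from id to school, a simple dictionary lookup per row is all the join amounts to here.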
You can now see that I have a school matched with each winning team, followed by its score, as well as a school matched with each losing team, followed by its score. However, the season is still represented by a letter, which isn’t very helpful. To move forward, I can perform a lookup against the Seasons dataset: I click the drop-down in the header of the “season” column, choose “Lookup”, select the Seasons dataset, and choose “season” as my lookup key. After performing this lookup, I now have the following dataset:
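The lookup step is the same idea as the join, but against a small reference table. A minimal sketch, assuming a letter-to-years mapping like the one in seasons.csv (the labels below are invented examples):

```python
# Hypothetical Seasons lookup table: letter -> human-readable years.
seasons = {"A": "1995-1996", "B": "1996-1997"}

rows = [{"season": "A", "winning_school": "Duke", "wscore": 68}]

# The lookup: append the readable season label for each row's letter code.
for row in rows:
    row["years"] = seasons[row["season"]]

print(rows[0]["years"])  # 1995-1996
```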
You can see I now have a lot of valuable information, compiled into a format that I can use.
To gain an additional level of insight into the teams with the most wins, I use Trifacta’s column details feature to profile the content of the “Winning_Team” column:
Mind you, Trifacta works with a random sample of the full data—the transformation I create on this random sample is applied to the entire dataset once I am satisfied—so the win counts here may be misleading. Still, among the top values (by number of occurrences) in the “Winning_Team” column you can see familiar powerhouses like Kentucky, Kansas, Duke, UNC, and Syracuse.
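A column profile like this is essentially a value count. In Python, the same summary over the winning-team column can be sketched with the standard library (the list of schools below is made up; the real values would come from the joined dataset):

```python
from collections import Counter

# Invented sample of the "Winning_Team" column for illustration.
winning_teams = ["Kentucky", "Kansas", "Duke", "Kentucky", "Duke", "Kentucky"]

# Equivalent of Trifacta's column profile: occurrences per distinct value,
# ordered from most to least frequent.
top_winners = Counter(winning_teams).most_common(2)
print(top_winners)  # [('Kentucky', 3), ('Duke', 2)]
```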
My regular season data is now in a state that has value to me. But my goal is to use this regular season data to help predict outcomes in the NCAA tournament, so I’m going to repeat the same process on the separate Tourney Results dataset.
I’ll walk through the rest of the wrangling I did for this analysis in Part II of this blog post, which will be posted later this week. In the meantime, I encourage you to try this out for yourself by signing up for Trifacta Wrangler and downloading the NCAA Basketball data from Kaggle.com. I also recommend creating an account on Kaggle.com if you are interested in participating in upcoming data science competitions.