Start Free

Speed up your data preparation with Designer Cloud powered by Trifacta

Free Sign Up
All Blog Posts

My Summer Wrangling Data – Part II

August 5, 2015

The process of enriching and structuring the Tourney Results dataset is similar that of the Regular Season Results dataset from Part I of my blog post, so I’ll save myself from being redundant and show some of the steps I didn’t picture before. Like before, I’m going to combine my Tourney Results data with the Teams dataset so that I can match team names to each game rather than seeing their id number (which means nothing to me).


Here Trifacta examines the attributes in the separate data sets and suggests I match the “lteam” (losing team) column in the Tourney Results dataset with the “id” column (team id) from the Teams dataset. This will also bring in the “name” column from Teams, which will give me the name of the losing team in each row. I’ll do the same process to get the winning teams names as well.

To get the right season information, I will perform a “Lookup” command on the “season” column. I click on the column drop down menu, select “Lookup”, select the Seasons dataset, and finally select “season” as my lookup key. This process can be seen below:

Screen Shot 2015-08-05 at 6.17.23 PMI also have another dataset I want to combine with the Tourney Results dataset, Tourney Seeds, but it is not ready to be joined yet. The data set gives me info on the season (by letter again), the seed and region (W06 would be the 6th seed from region W), and the team id number. So before I can combine it with the Tourney Results set, I’d like it to show the season by year and the team by name instead of its id number. After doing the same join process as I have done twice now, I am left with this as my new dataset:


Now I have the information of each team’s seed by year and the region they played in. I’m starting to sound like a broken record here, but one last time I will be joining this dataset with Tourney Results. This time you’ll notice one slight difference, however. I want to match up the team name from Tourney Results to the team name from Tourney Seeds, but this isn’t enough—it could be matching, say, Kentucky’s seed from 1996 to the Kentucky team in 2003. Luckily, I can add multiple join criteria. So, in addition to matching team names, I will click on the plus next to “Join Keys” and add an additional criteria to match years:


Cleaning Up
After doing this, you’ll see that my columns are not in an ideal order, and they still have the column headers from the initial dataset, so I can take a few quick steps to clean this up.


In the picture above, you can see a few options I can choose from to perform on the column. I want to rename some of the columns to make things clearer, and can also drop columns that are no longer useful like the team id numbers in the “wteam” and “lteam” columns and season letters in the “season” column. In the picture below, I simply clicked on the top left icon to open the column browser, and this allows me to move columns around, hide columns, etc.


Running the Job
Now that I am done cleaning and combining both my Regular Season Results and Tourney Results datasets, I want to run the job to apply this transformation to the entire dataset. I’ll click on “Run Job” in the top right of the screen, and I get this:


I have the option to run the job to and output the results in a number of different file formats such as CSV, JSON, Avro, and Parquet, and if I click on the drop down next to “Run on Trifacta Server”, I get the option to run the job on Hadoop depending on the size of the dataset I am working with. Below are the job results for the Regular Season Results dataset.


Trifacta provides you with information such as the percentage of valid, mismatched and missing values (in this case there were none of the latter) along with the number of rows and columns, and also provides informative distributions. You’ll see there are 91,224 rows, indicating there were that many regular season games in the provided dataset. We can see the teams with the most wins over this period (not many surprises here), as well as the teams with the most losses. You’ll also see a distribution of winning scores and losing scores, with info such as the min, max, and median (median winning score of 75 and losing score of 63 sounds reasonable). You can also see the location of the winning team, with home teams winning nearly twice as many games as away teams.

Exporting My Results
The last step is exporting the dataset that has been created. At the top left of the job results screen, I’ll click Export Results.


I am given the option to export the results in JSON, CSV, or Tableau TDE. I’ll export my new dataset in a CSV file to compare the new output from what I started with. I’ll open this CSV in an Excel spreadsheet.


This data is far more valuable to me now than when I started. I can do the same with the Tourney Results and start to do some statistical analysis!

Trifacta allowed me to accomplish something that would have been extremely difficult and time consuming, especially to those without a programming background. All of this took no more than 15 minutes, and allowed me to get right into the fun part of creating my predictive model!

I hope you enjoyed and were able to follow along, and if you missed out on Part I and would like to download the datasets to wrangle on Trifacta, you can find them here on Kaggle’s website.

I want to say one last big thank you to the Trifacta team for bringing me on board and treating me as part of the family!