Wrangling US Flight Data – Part 2

May 25, 2015

In part 1 of this blog series, I shared how to locate and begin wrangling on-time performance data for flights by US airlines that took place in February 2015 using the free version of Trifacta Wrangler.

In this post, I will continue to clean up the data and enrich it with airport geo-locations and airline names:

  • Conditionally fill in missing values that can be calculated from other columns
  • Change the data type of numeric columns that represent boolean values
  • Perform an automatic lookup using another dataset
  • Enrich with airport geo-location data from DBpedia

Finally, I will also present a few results from my analysis of the prepared data in Tableau.

Filling in arrival delays & cancellation reasons for diverted flights

The data set documentation mentioned that the cancellation reason, the actual elapsed time, and the arrival delay are empty for diverted flights. I quickly verify this by selecting diverted flights that did not reach their destination and filtering the data grid to show only those rows.
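Trifacta records this filter as a step in its own Wrangle language. As a rough pandas sketch of the same selection — the column names `Diverted` and `DivReachedDest` are assumptions about the BTS schema, and the sample rows are made up for illustration:

```python
import pandas as pd

# Tiny sample standing in for the February 2015 on-time data.
# By assumption, DivReachedDest is 1.0 when a diverted flight
# eventually reached its destination.
flights = pd.DataFrame({
    "FlightNum": [1, 2, 3],
    "Diverted": [1.0, 1.0, 0.0],
    "DivReachedDest": [0.0, 1.0, None],
})

# Show only diverted flights that never reached their destination.
stranded = flights[(flights["Diverted"] == 1.0) &
                   (flights["DivReachedDest"] != 1.0)]
print(stranded["FlightNum"].tolist())  # [1]
```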

To enable easy analysis in Tableau, I want to treat diverted flights similarly to regular flights. If they did not reach their destination, they should have a cancellation code, and if they did reach it, there should be a correct arrival delay and elapsed time.

First, I author a transform that sets the cancellation code to ‘Diverted’ for flights that were diverted but did not reach their destination.

Then I fill in the actual elapsed time using a similar approach, calculate the arrival delay for those rows, and remove the diversion-related columns afterwards.
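The conditional fills above could be sketched in pandas as follows — this is only an approximation of the Trifacta transforms, and the diversion-related column names (`DivReachedDest`, `DivActualElapsedTime`, `DivArrDelay`) are assumed for illustration:

```python
import pandas as pd

# Sample diverted-flight rows; the Div* columns are the
# diversion-specific values assumed to exist in the BTS data.
flights = pd.DataFrame({
    "Diverted": [1.0, 1.0],
    "DivReachedDest": [0.0, 1.0],
    "CancellationCode": [None, None],
    "ActualElapsedTime": [None, None],
    "DivActualElapsedTime": [None, 260.0],
    "ArrDelay": [None, None],
    "DivArrDelay": [None, 35.0],
})

diverted = flights["Diverted"] == 1.0
stranded = diverted & (flights["DivReachedDest"] != 1.0)
arrived = diverted & (flights["DivReachedDest"] == 1.0)

# Diverted flights that never arrived get their own cancellation code...
flights.loc[stranded, "CancellationCode"] = "Diverted"
# ...and those that did arrive get a real elapsed time and arrival delay.
flights.loc[arrived, "ActualElapsedTime"] = flights.loc[arrived, "DivActualElapsedTime"]
flights.loc[arrived, "ArrDelay"] = flights.loc[arrived, "DivArrDelay"]

# Finally, drop the diversion-related columns.
flights = flights.drop(columns=["DivReachedDest", "DivActualElapsedTime", "DivArrDelay"])
```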

Change Cancelled and Diverted into boolean columns

The ‘Cancelled’ and ‘Diverted’ columns contain boolean values but encode them as 1.00 and 0.00. To make sure they show up as a boolean category in Tableau, I change their values to true/false.
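As a minimal sketch of that recoding step, with pandas standing in for the Trifacta transform:

```python
import pandas as pd

flights = pd.DataFrame({"Cancelled": [1.0, 0.0, 0.0],
                        "Diverted": [0.0, 0.0, 1.0]})

# Map the 1.00/0.00 encoding to real booleans so Tableau treats
# the columns as a true/false category rather than a measure.
for col in ["Cancelled", "Diverted"]:
    flights[col] = flights[col] == 1.0

print(flights["Cancelled"].tolist())  # [True, False, False]
```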

Looking up the airline name

The airline codes are somewhat cryptic. Fortunately, the BTS provides a separate lookup table, which I download and wrangle so it can be used to enrich the data set by matching airline codes with actual airline names.

I then look up the code column of the airline dataset against the UniqueCarrier column of the on-time dataset.

After removing the old UniqueCarrier column and renaming the new column to Carrier, the dataset contains the full carrier names.
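Trifacta’s lookup behaves much like a left join on the carrier code. A hedged pandas equivalent — the lookup table’s column names (`Code`, `Description`) and the sample rows are assumptions for illustration:

```python
import pandas as pd

flights = pd.DataFrame({"UniqueCarrier": ["DL", "UA", "MQ"]})

# Lookup table wrangled from the BTS carrier list (sample rows).
airlines = pd.DataFrame({
    "Code": ["DL", "UA", "MQ"],
    "Description": ["Delta Air Lines Inc.", "United Air Lines Inc.", "Envoy Air"],
})

# The lookup is essentially a left join on the carrier code.
flights = flights.merge(airlines, left_on="UniqueCarrier",
                        right_on="Code", how="left")

# Keep only the readable name, renamed to Carrier.
flights = (flights.drop(columns=["UniqueCarrier", "Code"])
                  .rename(columns={"Description": "Carrier"}))
print(flights["Carrier"].tolist())
```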

Wrangling and joining the airport geo-locations

To enable geospatial analysis in Tableau, it would be great to have the actual airport locations, i.e., latitude and longitude. However, they are not part of the original dataset. Fortunately, almost all airports are described on Wikipedia, and there is a DBpedia table with airport data that can be downloaded as CSV.

The airport dataset took about 30 minutes to wrangle; I’ll spare you the details here.

Now I join the airport geo-locations into the original dataset. I use a left outer join so I don’t drop rows for which there is no airport geo-location, and I verify that the number of output rows matches the number of input rows in our on-time data table.
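The join and the row-count check could be sketched in pandas like this — the airport column names (`IATA`, `Latitude`, `Longitude`) and the sample values are assumptions, not the actual DBpedia schema:

```python
import pandas as pd

flights = pd.DataFrame({"Origin": ["ORD", "JFK", "XXX"]})
airports = pd.DataFrame({
    "IATA": ["ORD", "JFK"],
    "Latitude": [41.9786, 40.6398],
    "Longitude": [-87.9048, -73.7789],
})

n_before = len(flights)

# Left outer join: rows without a matching airport keep NaN
# geo-locations instead of being dropped.
flights = flights.merge(airports, left_on="Origin",
                        right_on="IATA", how="left")

# Sanity check: a left join on a unique key must not change the row count.
assert len(flights) == n_before
```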

I run a Hadoop job to execute the script that I’ve constructed on the whole February 2015 dataset and export the result as a .csv file, which I load into Tableau.

Analysis in Tableau

Now that the wrangling process is complete, I pull up the transformation output in Tableau to further analyze this prepared data.
In reviewing the data, I notice that flights in February that departed from the Northeast and Midwest were more delayed on average than flights that departed from other parts of the continental US. I’m able to plot the different airports on a map because I have combined the air traffic data with the airport geo-locations using Trifacta.

For airports with more than 5,000 flights in February 2015, the NYC airports (JFK & LaGuardia) have the highest average delay, at more than 24 minutes. Chicago O’Hare was also problematic: it has a very high volume of flights, and its average delay was more than 23 minutes.

Drilling further into O’Hare, I take a quick look at the number of flights and the average delay by carrier. While Frontier Airlines had the longest average delay, at almost 42 minutes, it operated fewer than 500 flights at O’Hare in February 2015. Envoy Air, on the other hand, had over 5,000 flights and an average delay of over 26 minutes.

Next time you’re planning to fly out of Chicago O’Hare in February, you’ll want to schedule plenty of time for potential delays and pick the right airline if you have options.

Wrap Up

To recap, in this blog series, I’ve analyzed February 2015 flight on-time performance data from the Bureau of Transportation Statistics. I’ve cleaned, reformatted and enriched the dataset using Trifacta and analyzed the result in Tableau.

With Trifacta, I have:

  • Cleaned up quotes in several columns
  • Removed unrelated and redundant columns
  • Removed the state part of origin and destination city names
  • Reformatted the arrival and departure timestamps
  • Replaced the cancellation code with an easy-to-understand description
  • Conditionally filled in missing values that can be calculated from other columns
  • Changed numeric columns that represent boolean values into boolean columns
  • Performed an automatic lookup using another dataset
  • Joined in airport geo-locations obtained by wrangling DBpedia data

With Tableau, I have explored various aspects of the data, including:

  • The geospatial distribution of departure delays
  • The departure delays at airports with more than 5,000 flights
  • The departure delays by carrier for Chicago O’Hare airport

Feel free to try this exercise out for yourself using the free trial of Trifacta and the February 2015 on-time performance data from the Bureau of Transportation Statistics.

Happy wrangling!
