In part 1 of this blog series, I shared how to locate and begin wrangling on-time performance data for flights by US airlines that took place in February 2015 using the free version of Trifacta Wrangler.
In this post, I will continue to cleanup the data and enrich it with airport geo-locations and airline names:
- Conditionally fill in missing values that can be calculated from other columns
- Change the data type of numeric columns that represent boolean values
- Perform an automatic lookup using another dataset
- Enrich with airport geo-location data from DBpedia
Finally, I will also present a few results from my analysis of the prepared data in Tableau.
Filling in arrival delays & cancellation reasons for diverted flights
The data set documentation mentioned that the cancellation reason, the actual elapsed time and the arrival delay, are empty for diverted flights. I quickly verify this by selecting diverted flights that did not reach their destination and filtering the data grid to show only those rows:
To enable easy analysis in Tableau, I want to treat diverted flights similar to regular flights. If they did not reach their destination, they should have a cancellation code, and if they reach their destination, there should be a correct arrival delay and elapsed time.
First, I author a transform that sets the cancellation code to ‘Diverted’ for flights that were diverted, but did not reach their destination:
The ‘Cancelled’ and ‘Diverted’ columns contain boolean values, but are using 1.00 and 0.00 to encode them. To make sure they show up as a boolean category in Tableau, I change their values to true/false:
The airline names are somewhat cryptic. Fortunately, the BTS provides a sepalookup table that I download and wrangle it so it can be used to enrich the data set by matching airline codes with actual airline names.
I then lookup the code column of the airline dataset against the UniqueCarrier column of the on-time dataset:
To enable geospatial analysis in Tableau, it would be great to have the actual airport locations, i.e. latitude and longitude. However, they are not part of the original dataset. Fortunately, almost all airports are described in wikipedia, and there is a DBpedia table with airport data that can be downloaded as CSV.
The airport dataset took about 30 minutes to wrangle. I’ll spare you the details here, but this is how it looked like before and after:
Now I join the airport geo-locations into the original dataset. I use a left outer join so I don’t drop columns for which there are no airport geolocations and verify that the number of output rows matches the number of input rows in our on-time data table.
Analysis in Tableau
Now that the wrangling process is complete, I pull up the transformation output in Tableau to further analyze this prepared data.
In reviewing the data, I notice flights in February that departed from the North-East and Midwest were more delayed on average than flights that departed in other parts of the continental US. I’m able to plot the different airports on a map because I have combined the air traffic data with the airport geo-locations using Trifacta.
For airports with more than 5,000 flights in February 2015, the NYC airports (JFK & LaGuardia) have the highest average delay with more than 24 minutes. Chicago O’Hare was problematic because it has a very high volume of flights and the average delay was also more than 23 minutes.
Drilling further into O’Hare, I take a quick look at the number of flights and the average delay by carrier. Whereas Frontier Airlines has the longest average delay with almost 42 minutes, they had less than 500 flights in February 2015 at O’Hare airport. Envoy Air, on the other hand, had over 5,000 flights and an average delay of over 26 minutes.
To recap, in this blog series, I’ve analyzed February 2015 flight on-time performance data from the Bureau of Transportation Statistics. I’ve cleaned, reformatted and enriched the dataset using Trifacta and analyzed the result in Tableau.
With Trifacta, I have:
- Cleaned up quotes in several columns
- Removed unrelated and redundant columns
- Removed the state part of origin and destination city names
- Reformatted the arrival and departure timestamps
- Replaced the cancellation code with an easy-to-understand description
- Conditionally filled in missing values that can be calculated from other columns
- Changed numeric columns that represent boolean values into boolean columns
- Performed an automatic lookup using another dataset
- Joined airport geo-location that I get by wrangling DBpedia data
With Tableau, I have explored various aspects of the data, including:
- The geospatial distribution of departure delays
- The departure delays at airports with more than 5,000 flights
- The departure delays by carrier for Chicago O’Hare airport