In the wake of the recent story on how Qantas Airlines was at risk of losing valuable landing slots at London’s busy Heathrow airport, I became interested in examining flight data across the US to see if I could determine what airlines or airports had the most delays.
Finding air-traffic data turns out to be easy. The Bureau of Transportation Statistics (BTS) provides the on-time performance data of US airlines dating back to 1987. For the purposes of this analysis, I decided to focus only on recent performance, so I downloaded the files for the flights taking place in February 2015.
I now want to get an overview of the on-line performance data and to re-structure it for my analysis. In this blog, I will walk through how I used Trifacta to prepare the data in the following way:
- Clean up markup in data such as quotes
- Remove unrelated and redundant columns
- Remove the state part of city names
- Reformat timestamps
- Perform a manual lookup using the set transform
In a second blog post, I will enrich and further clean the data with Trifacta and then present a few results from my analysis of the prepared data in Tableau.
Upon registering the data within Trifacta, the system quickly examines the structure of the file and automatically infers how to split the data set into rows and columns and promote the header row of the file.
Cleaning Up Quotations
Following the initial transformation steps, I notice that there were a few columns that had quotes around the text. By selecting the quote character, Trifacta generates transform suggestions to clean up the quotes, including the one I want to use to remove quotations from all of the columns:
Removing unrelated and redundant columns
This flight performance dataset is incredibly wide with over 111 columns in the fill. Instead of dealing with all of these columns, I quickly remove the ones that are not relevant for my analysis. To do this, I use Trifacta’s column browser and column details views to explore the different columns, and then select columns that I want to remove. For example, the data contains a FlightDate column, making the year, month etc. column redundant:
Similarly, I remove other redundant and unrelated columns after inspecting them quickly using the column browser and the column details features. At the end of this process, I reduce the number of columns to 32, making the whole data set more manageable.
Removing the state from city names
Inspecting the OriginCityName columns shows that the state name is included in the city name. Below the city name, there is also a origin state column that I quickly open in the column details view.
Since there is are separate state columns for origin and destination, I decide to remove the state part from the OriginCityName and DestCityName columns. I select the example text ‘, UT’ (including comma) in the grid:
This does not cover all rows, so I select ‘, CA’ as a second example:
Trifacta’s Predictive Transformation has generalized my selection and now covers all the text that I wanted to select. Since I want to remove it, I select the ‘replace’ preview card:
The suggested transform would remove the states from the origin city. I could just repeat the same steps to also remove it from the destination city, but a faster way is to just include the destination city in the transform. I click the ‘modify’ button and add the ‘DestinationCity’ to the col parameter in the transform editor:
Exploring the data further using a combination of column details and column browser, I find that the scheduled departure and arrival time columns are using the format ‘hhmm’:
The interpretation of a timestamp as a number can lead to analysis problems like the gaps in the histogram above. I want to change this into ‘hh:mm’ to enable easy recognition in downstream tools. I perform a long-click between the 2nd and the 3rd character of an example scheduled departure time value:
Then I click the headers of the two new columns to get a merge transform suggestion. The merged column is shown at the right side of the data grid.
Next, I modify this transform in the editor to insert the colon and take a look the merged column in the column details. Trifacta now interprets the column as a time and shows hour-of-day and minute-of-hour patterns.
After dropping the source columns and renaming the merged column, I repeat the same steps for the arrival time.
Setting the cancellation reason
The cryptic cancellation code is somewhat unintuitive. I lookup the meaning of cancellation code on the BTS website. Since there are only four different codes, I write a few ‘set’ transforms to use a descriptive name instead of a code, and then rename to column to ‘CancellationReason’.
The data already looks pretty good now. So far, I have
- Cleaned up quotes in several columns
- Removed unrelated and redundant columns
- Removed the state part of origin and destination city names
- Reformatted the arrival and departure timestamps
- Replaced the cancellation code with an easy-to-understand description
In the second part of the series, I will continue to cleanup the data and enrich it with airport geo-locations and airline names:
- Conditionally fill in missing values that can be calculated from other columns
- Change numeric columns that represent boolean values into boolean columns
- Perform an automatic lookup using another dataset
- Join airport geo-location that I get by wrangling DBpedia data
Finally, I will also present a few results from my analysis of the prepared data in Tableau. Stay tuned.