
Wrangling US Flight Data – Part 1

May 12, 2015

In the wake of the recent story on how Qantas was at risk of losing valuable landing slots at London’s busy Heathrow airport, I became interested in examining flight data across the US to see if I could determine which airlines or airports had the most delays.

Finding air-traffic data turns out to be easy. The Bureau of Transportation Statistics (BTS) provides the on-time performance data of US airlines dating back to 1987. For the purposes of this analysis, I decided to focus only on recent performance, so I downloaded the files for the flights taking place in February 2015.

You can follow along yourself by downloading the files here and requesting access to a trial account of Trifacta.

I now want to get an overview of the on-time performance data and restructure it for my analysis. In this blog post, I will walk through how I used Trifacta to prepare the data in the following ways:

  • Clean up markup in data such as quotes
  • Remove unrelated and redundant columns
  • Remove the state part of city names
  • Reformat timestamps
  • Perform a manual lookup using the set transform

In a second blog post, I will enrich and further clean the data with Trifacta and then present a few results from my analysis of the prepared data in Tableau.

When I register the data in Trifacta, the system quickly examines the structure of the file, automatically infers how to split the dataset into rows and columns, and promotes the header row of the file.
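If you want to follow the same steps outside Trifacta, a rough pandas equivalent of this initial parsing step is sketched below. The file name is an assumption based on how BTS names its prezipped monthly downloads, so adjust it to whatever you saved:

    import pandas as pd

    # Assumed BTS file name for February 2015; adjust to your local copy.
    df = pd.read_csv("On_Time_On_Time_Performance_2015_2.csv", low_memory=False)

    print(df.shape)         # number of rows and columns
    print(df.columns[:10])  # a peek at the first few column names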

Cleaning up quotations

Following the initial transformation steps, I notice that a few columns have quotes around the text. When I select the quote character, Trifacta generates transform suggestions to clean up the quotes, including the one I want to use to remove quotations from all of the columns:

[Screenshot: transform suggestions for removing quotes]
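For reference, pandas’ CSV parser already strips standard quoting on load, so a pandas version of this step only matters for stray quote characters left inside values. A minimal sketch, continuing with the df from the earlier snippet:

    # Strip leftover double quotes from every text column.
    text_cols = df.select_dtypes(include="object").columns
    for col in text_cols:
        df[col] = df[col].str.strip('"')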

Removing unrelated and redundant columns

This flight performance dataset is incredibly wide, with 111 columns in the file. Instead of dealing with all of these columns, I quickly remove the ones that are not relevant for my analysis. To do this, I use Trifacta’s column browser and column details views to explore the different columns, and then select the ones I want to remove. For example, the data contains a FlightDate column, making the Year, Month, and similar date-part columns redundant:

[Screenshot: column browser with redundant date-part columns selected]

Similarly, I remove other redundant and unrelated columns after inspecting them quickly using the column browser and the column details features. At the end of this process, I reduce the number of columns to 32, making the whole data set more manageable.
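In pandas, the same pruning is a single drop call. The list below is only an illustrative subset of the date-part columns made redundant by FlightDate; the names follow the BTS schema:

    # FlightDate already encodes these, so the per-part date columns can go.
    redundant = ["Year", "Quarter", "Month", "DayofMonth", "DayOfWeek"]
    df = df.drop(columns=redundant, errors="ignore")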

Removing the state from city names

Inspecting the OriginCityName column shows that the state abbreviation is included in the city name. There is also a separate OriginState column, which I quickly open in the column details view.

[Screenshot: OriginCityName column with embedded state abbreviations]

[Screenshot: OriginState column in the column details view]

Since there are separate state columns for origin and destination, I decide to remove the state part from the OriginCityName and DestCityName columns. I select the example text ‘, UT’ (including the comma) in the grid:

[Screenshot: selecting ‘, UT’ in the data grid]

This does not cover all rows, so I select ‘, CA’ as a second example:

[Screenshot: selecting ‘, CA’ as a second example]

Trifacta’s Predictive Transformation has generalized my selection and now covers all the text that I wanted to select. Since I want to remove it, I select the ‘replace’ preview card:

[Screenshot: the ‘replace’ preview card]

The suggested transform would remove the states from the origin city. I could repeat the same steps to remove them from the destination city, but a faster way is to include the destination city in the same transform. I click the ‘modify’ button and add ‘DestCityName’ to the col parameter in the transform editor:

[Screenshot: transform editor with both city columns in the col parameter]
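The generalized selection behaves roughly like the regular expression in this pandas sketch, which strips a trailing comma plus two-letter state abbreviation from both city columns at once:

    # Remove the trailing ", XX" state abbreviation from both city columns.
    for col in ["OriginCityName", "DestCityName"]:
        df[col] = df[col].str.replace(r",\s*[A-Z]{2}$", "", regex=True)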

Reformatting timestamps

Exploring the data further using a combination of column details and column browser, I find that the scheduled departure and arrival time columns are using the format ‘hhmm’:

[Screenshot: histogram of the scheduled departure time column in ‘hhmm’ format]

Interpreting a timestamp as a number can lead to analysis problems like the gaps in the histogram above. I want to change the format to ‘hh:mm’ so that downstream tools recognize these values as times. I perform a long-click between the 2nd and the 3rd character of an example scheduled departure time value:

[Screenshot: split suggestion from the long-click selection]

Then I click the headers of the two new columns to get a merge transform suggestion. The merged column is shown at the right side of the data grid.

[Screenshot: merge transform suggestion with the merged column in the data grid]

Next, I modify this transform in the editor to insert the colon and take a look at the merged column in the column details view. Trifacta now interprets the column as a time and shows hour-of-day and minute-of-hour patterns.

[Screenshot: merged column interpreted as a time in the column details view]

After dropping the source columns and renaming the merged column, I repeat the same steps for the arrival time.
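In pandas, this split-and-merge sequence collapses into a couple of string operations. CRSDepTime and CRSArrTime are the BTS columns for the scheduled departure and arrival times; the zero-padding guards against values like 730 that lose their leading zero when parsed as numbers:

    # Convert 'hhmm' values (e.g. 730, 1455) into 'hh:mm' strings.
    for col in ["CRSDepTime", "CRSArrTime"]:
        s = df[col].astype(str).str.zfill(4)   # 730 -> "0730"
        df[col] = s.str[:2] + ":" + s.str[2:]  # "0730" -> "07:30"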

Setting the cancellation reason

The single-letter cancellation codes are cryptic, so I look up their meaning on the BTS website. Since there are only four different codes, I write a few ‘set’ transforms to replace each code with a descriptive name, and then rename the column to ‘CancellationReason’.

[Screenshot: set transforms mapping cancellation codes to descriptions]
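The four codes and their meanings come from the BTS lookup table; a pandas version of those set transforms might map them like this:

    # BTS cancellation codes and their descriptive names.
    reasons = {
        "A": "Carrier",
        "B": "Weather",
        "C": "National Air System",
        "D": "Security",
    }
    # Flights that were not cancelled have no code and map to NaN.
    df["CancellationReason"] = df["CancellationCode"].map(reasons)
    df = df.drop(columns=["CancellationCode"])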

The data already looks pretty good now. So far, I have:

  • Cleaned up quotes in several columns
  • Removed unrelated and redundant columns
  • Removed the state part of origin and destination city names
  • Reformatted the arrival and departure timestamps
  • Replaced the cancellation code with an easy-to-understand description

In the second part of the series, I will continue to clean up the data and enrich it with airport geo-locations and airline names:

  • Conditionally fill in missing values that can be calculated from other columns
  • Change numeric columns that represent boolean values into boolean columns
  • Perform an automatic lookup using another dataset
  • Join airport geo-locations that I get by wrangling DBpedia data

Finally, I will also present a few results from my analysis of the prepared data in Tableau. Stay tuned.
