Wrangling Census Data to Predict the Location of Amazon HQ2

July 12, 2018

Last fall, Amazon announced its search for a second headquarters and went on to consider applications from more than 200 cities. That field has now been narrowed to 20 “Amazon HQ2” hopefuls that span the country and include one Canadian city, Toronto. For these shortlisted cities, there is a lot at stake: the winning city stands to gain 50,000 high-paying jobs over the next few years and tens of billions of dollars’ worth of investment.

There has been much speculation as to which city has the best chance of being selected, but here at Trifacta, we decided to take a data-driven approach to predicting where Amazon HQ2 will land. Using publicly available Census data and Wrangler, we measured how the cities on Amazon’s shortlist stack up in terms of population, education, key occupational profiles, and diversity in order to make an educated guess about the ideal HQ2.

Based on our analysis, we have concluded that the Washington DC metro area is the best choice for Amazon’s second headquarters. Within this area, we believe Northern Virginia to be the best sub-location. The rest of this article describes how we arrived at this conclusion.

Figure 1: Amazon’s Shortlist (Source: Amazon)

We used data provided by the U.S. Census Bureau’s American Community Survey as the basis of our analysis, specifically pulling from the Public Use Microdata Sample (PUMS), which provides rich demographic detail. Relying solely on U.S. Census data meant excluding Toronto from consideration, but we agreed that was acceptable—given the political implications of an international choice, we believe the odds of Toronto winning are very low. That left us with 19 remaining candidates and, using Trifacta’s data preparation platform, we got to work.

The high-level steps that we took to prepare the data in Wrangler were as follows:

  1. First, we imported the 5-year PUMS data for the entire U.S., which was split into four files.
  2. Then, we brought in a mapping table that maps Public Use Microdata Areas (PUMAs) to Combined Statistical Areas (CSAs). This table was produced using a “spatial join” in QGIS, an open source geographical information system.
  3. Next, we combined the four files from step 1 into a single dataset by using the “union” function in Trifacta and performed a lookup against the mapping table from step 2 to add CSAs to the core dataset.
  4. We limited the dataset to only the necessary rows and columns, and prepared a lookup table from the PUMS data dictionary. We used this lookup table to convert codes in the core dataset to human-readable text.
  5. Finally, we ran a job in Trifacta to produce a .TDE file that we could visualize in Tableau to assess the results.
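To make steps 1 through 4 above concrete, here is a minimal sketch in plain Python. It is not the Trifacta recipe we ran. The column names (ST, PUMA, PWGTP, SCHL, AGEP) follow the PUMS data dictionary, but every record, PUMA code, and CSA label below is invented for illustration, and the function name `prepare` is hypothetical.

```python
from itertools import chain

# Step 1 (assumed shape): four PUMS person files, one toy record each.
part_1 = [{"ST": "11", "PUMA": "00101", "PWGTP": 20, "SCHL": "21", "AGEP": 34}]
part_2 = [{"ST": "51", "PUMA": "59301", "PWGTP": 15, "SCHL": "22", "AGEP": 41}]
part_3 = [{"ST": "48", "PUMA": "05301", "PWGTP": 12, "SCHL": "16", "AGEP": 29}]
part_4 = [{"ST": "56", "PUMA": "00100", "PWGTP": 9, "SCHL": "21", "AGEP": 52}]

# Step 2 (assumed shape): PUMA-to-CSA mapping table. PUMA codes are only
# unique within a state, so the key is the (state, PUMA) pair.
puma_to_csa = {
    ("11", "00101"): "Washington DC",
    ("51", "59301"): "Washington DC",
    ("48", "05301"): "Austin",
}

# Step 4 (assumed excerpt): code-to-label lookup built from the data dictionary.
schl_labels = {
    "16": "Regular high school diploma",
    "21": "Bachelor's degree",
    "22": "Master's degree",
}


def prepare(parts, mapping, labels):
    """Union the parts, attach a CSA, drop unmapped rows, decode SCHL codes."""
    core = []
    for record in chain.from_iterable(parts):  # step 3: union
        csa = mapping.get((record["ST"], record["PUMA"]))  # step 3: lookup
        if csa is None:
            continue  # step 4: keep only rows inside a shortlisted CSA
        core.append({  # step 4: keep only the needed columns, with readable text
            "CSA": csa,
            "PWGTP": record["PWGTP"],
            "EDUCATION": labels.get(record["SCHL"], "Unknown"),
        })
    return core


core = prepare([part_1, part_2, part_3, part_4], puma_to_csa, schl_labels)
for row in core:
    print(row)
```

The fourth toy record falls outside the mapping table, so it is dropped, which mirrors how limiting the data to shortlisted areas shrinks the national file.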

Once in Tableau, our first step was to rank the cities by population, since size is a leading indicator of whether a city can provide Amazon with a deep labor pool and adequate support infrastructure. To compare cities on a consistent basis, we used the Combined Statistical Area construct defined by the United States Office of Management and Budget, which takes into account a major city and its surroundings in a consistent way.
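Because PUMS is a sample, a population estimate for an area comes from summing the person weights (PWGTP) of its records rather than counting rows. The ranking itself was done in Tableau; the sketch below only illustrates the arithmetic, with invented weights and a hypothetical function name.

```python
from collections import defaultdict

# Toy (CSA, person weight) pairs; the numbers are invented for illustration.
weighted_records = [
    ("Washington DC", 120),
    ("Austin", 60),
    ("Denver", 80),
    ("Washington DC", 95),
    ("Austin", 30),
]


def rank_by_population(records):
    """Sum person weights per CSA and sort from largest to smallest."""
    totals = defaultdict(int)
    for csa, weight in records:
        totals[csa] += weight
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_population(weighted_records)
for position, (csa, population) in enumerate(ranking, start=1):
    print(position, csa, population)
```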

Figure 2: Population

Note that some of the “cities” on Amazon’s shortlist (Newark, Montgomery County and Northern Virginia) are actually part of another Combined Statistical Area.

The next key criterion is the availability of a well-educated labor force, which is a fundamental requirement for a robust talent pool. Here is how the cities stack up in terms of the proportion of the population with a bachelor’s degree or higher.
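A proportion like this can be computed as a weighted share: the sum of person weights for residents with a bachelor’s degree or higher, divided by the sum of all person weights in the area. The sketch below assumes the PUMS educational attainment codes 21 through 24 (bachelor’s, master’s, professional, doctorate) and applies no age filter; both are assumptions for illustration, and the records are invented.

```python
from collections import defaultdict

# Assumed SCHL codes for bachelor's degree or higher.
BACHELORS_OR_HIGHER = {"21", "22", "23", "24"}

# Toy (CSA, SCHL code, person weight) records; the numbers are invented.
education_records = [
    ("Washington DC", "21", 30),
    ("Washington DC", "16", 20),
    ("Washington DC", "22", 50),
    ("Denver", "16", 60),
    ("Denver", "21", 40),
]


def degree_share(records):
    """Return the weighted share of each CSA holding a bachelor's or higher."""
    with_degree = defaultdict(int)
    everyone = defaultdict(int)
    for csa, schl, weight in records:
        everyone[csa] += weight
        if schl in BACHELORS_OR_HIGHER:
            with_degree[csa] += weight
    return {csa: with_degree[csa] / everyone[csa] for csa in everyone}


shares = degree_share(education_records)
for csa, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{csa}: {share:.0%}")
```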

Figure 3: Education

Washington DC ranks 3rd overall. But when you specifically look at STEM skills, Washington DC rises to the top. For a technology company like Amazon, we believe that access to a rich pool of STEM talent is a key enabler for future growth.

Figure 4: STEM Rank

The census data also provides an alternate perspective of the talent pool based on occupational profiles. In particular, there are five occupational profiles that Amazon has identified as being particularly important: software development, legal, accounting, management and administrative. Washington DC has a robust labor pool in all of these occupations.

Figure 5: Occupational Profile

Having a diverse and multicultural workforce can provide a distinct advantage for a company that wants to grow rapidly and encourage innovation. For this reason, we looked at diversity, both in terms of gender and racial identities. In both areas, Washington DC shines due to its vibrant and diverse workforce.

Washington DC tops the charts in terms of female talent in the “Computer & Math” occupational group. Washington DC also has a diverse population with many races and cultures.

Figure 6: Gender Diversity

Figure 7: Racial Diversity

For the reasons listed above, we believe the Washington DC metro area is the ideal location for Amazon’s second headquarters. The Combined Statistical Area contains 3 entries on Amazon’s shortlist: Washington DC, Northern Virginia and Montgomery County (Maryland). While all 3 are excellent choices, we believe the selection among them comes down to qualitative factors. We picked Northern Virginia because it is already the site of one of Amazon’s largest data centers (US-East) and is relatively close to top-notch Virginia universities that will provide an ongoing source of tech talent.

Please see this interactive dashboard for more details and the ability to drill into the data used in this article.