Last week a few of us from Trifacta ran in the J.P. Morgan Corporate Challenge. The race is a charity event that donated over $700,000 last year to a number of not-for-profit organizations.
Each racer in the event wears a number with an embedded chip that records their time, and the race organizers post these times shortly after the finish. Each participating organization then gets to select groups of four runners to enter as “scoring” teams. There are three possible team configurations: an all-men’s team, an all-women’s team, and a balanced coed team. Captains have a few days to decide which racers to enter into their scoring teams.
Being as into data as we are, we decided to use the race results to choose our best scoring team (assuming every other organization would choose its best scoring team too). The interface on the race site only returned one thousand records per gender, so we started out by writing a simple web scraper in Python to pull down all of the race results. As you might expect, we threw the scraped results into Trifacta to wrangle them into shape.
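For readers who want to try something similar, here is a minimal sketch of a paging loop that keeps requesting results until a gender's list is exhausted. It is not our actual scraper: fetch_page is a hypothetical callable standing in for the HTTP request and HTML parsing, and the idea that the site can be paged by offset is an assumption.

```python
def scrape_all(fetch_page, genders=("M", "F"), page_size=1000):
    """Collect every result row by requesting one page at a time per gender.

    fetch_page(gender, offset, limit) is a hypothetical callable that returns
    a list of result rows (possibly empty). It stands in for the request and
    parsing logic, which depend entirely on the race site.
    """
    rows = []
    for gender in genders:
        offset = 0
        while True:
            page = fetch_page(gender, offset, page_size)
            if not page:
                break  # nothing left for this gender
            rows.extend(page)
            if len(page) < page_size:
                break  # a short page means we reached the end
            offset += len(page)
    return rows
```

Passing the fetch function in as an argument keeps the loop easy to test with fake data before pointing it at a real site.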
After initial structuring, we immediately saw that a few of the records had been incorrectly shifted due to some inconsistencies with the site that our hastily written scraper didn’t handle well. We used Trifacta to split these out correctly and we were in business.
We noticed that of the 10,000 runners registered for the event, only 7,042 had results. A number of factors could explain the gap, such as participants not showing up for the race, not finishing it, or not being picked up by the automatic timing system. We could also see which companies had the highest number of participants.
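The same two checks are easy to reproduce outside Trifacta. The sketch below assumes each result is a dict with a "company" key; that column name and both function names are made up for illustration.

```python
from collections import Counter


def finishers_by_company(results):
    """Count result rows per company; 'company' is an assumed column name."""
    return Counter(row["company"] for row in results)


def finish_rate(finishers, registered):
    """Share of registered runners who ended up with a recorded result."""
    return finishers / registered


# With the figures above, 7,042 of 10,000 registered runners had results.
print(f"{finish_rate(7042, 10_000):.1%}")  # prints 70.4%
```

Calling most_common(10) on the returned Counter gives the ten companies with the most finishers.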
Salesforce topped the list by a wide margin with a whopping 521 finishers. Trifacta had 16 participants, a respectable showing when you consider that’s just shy of 20% of the company.
Since the strategy is based on finish times, we turned our focus there next. Trifacta has a visual profiler that automatically creates summary visualizations and statistics, which help a user understand both the data and the effect of each transformation step. We drilled into the details of our time column and found what appeared to be a roughly Poisson-shaped distribution of times with a long right tail (likely due to the participants who walked the race).
One of my favorite features in Trifacta is the linking between histograms, which helps users understand relationships that might exist in their data. One example in this data set: when we selected the fastest 1,000 runners of each gender, a bimodal distribution of finishing times emerged.
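Here is a rough sketch of that selection in plain Python, assuming runners arrive as (gender, seconds) pairs; the function name and tuple layout are assumptions. Plotting the two returned lists together is one way to look for the two humps, which would plausibly correspond to the two genders' different typical times.

```python
import heapq


def fastest_n_by_gender(runners, n=1000):
    """Return the n smallest finish times for each gender, fastest first.

    runners: iterable of (gender, seconds) pairs (an assumed layout).
    """
    by_gender = {}
    for gender, seconds in runners:
        by_gender.setdefault(gender, []).append(seconds)
    # nsmallest returns a sorted list and copes with groups smaller than n.
    return {g: heapq.nsmallest(n, times) for g, times in by_gender.items()}
```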
Our next step was to identify the two fastest and the four fastest members of each gender for each organization. We created boolean indicator columns that told us whether a company had enough participants to field a team in each configuration.
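In code, this step might look roughly like the sketch below. It assumes runners come as (company, gender, seconds) tuples with gender "M" or "F", and that a balanced coed team means two men and two women; the function and key names are invented for the example and are not the Trifacta recipe itself.

```python
from collections import defaultdict


def team_options(runners):
    """Find each company's fastest runners and which teams it could field.

    runners: iterable of (company, gender, seconds); gender is 'M' or 'F'.
    Assumption: a coed team is two men plus two women.
    """
    times = defaultdict(lambda: {"M": [], "F": []})
    for company, gender, seconds in runners:
        times[company][gender].append(seconds)

    options = {}
    for company, by_gender in times.items():
        men = sorted(by_gender["M"])
        women = sorted(by_gender["F"])
        options[company] = {
            "top4_men": men[:4],
            "top4_women": women[:4],
            "top2_men": men[:2],
            "top2_women": women[:2],
            # Boolean indicators: are there enough runners for each team?
            "can_field_men": len(men) >= 4,
            "can_field_women": len(women) >= 4,
            "can_field_coed": len(men) >= 2 and len(women) >= 2,
        }
    return options
```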
Once that was done, we aggregated our runner data and computed some per-company statistics, such as fastest and slowest times, number of runners, and scores for each of the hypothetical teams a company could field. With our data cleaned up, structured, and prepared, we exported it from Trifacta and started our analysis.
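A plain-Python version of that aggregation could look like the following sketch. It makes two assumptions that are worth flagging: a team's score is taken to be the sum of its four members' times in seconds (lower is better), and a coed team is two men plus two women. The tuple layout and key names are also illustrative.

```python
def company_summary(runners):
    """Aggregate per-company statistics and hypothetical team scores.

    runners: iterable of (company, gender, seconds); gender is 'M' or 'F'.
    Assumptions: score = sum of the four members' times, and a coed team is
    two men plus two women. A score of None means the team cannot be fielded.
    """
    by_company = {}
    for company, gender, seconds in runners:
        groups = by_company.setdefault(company, {"M": [], "F": []})
        groups[gender].append(seconds)

    summary = {}
    for company, groups in by_company.items():
        men, women = sorted(groups["M"]), sorted(groups["F"])
        everyone = men + women
        coed_ok = len(men) >= 2 and len(women) >= 2
        summary[company] = {
            "runners": len(everyone),
            "fastest": min(everyone),
            "slowest": max(everyone),
            "men_score": sum(men[:4]) if len(men) >= 4 else None,
            "women_score": sum(women[:4]) if len(women) >= 4 else None,
            "coed_score": sum(men[:2] + women[:2]) if coed_ok else None,
        }
    return summary
```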
We ran our cleaned results through a simple solver that took each company’s hypothetical team times and built its highest-scoring (best-ranking) team. The solver was greedy: it started with the best possible slot (finishing first) and picked an organization and a team to fill that slot. Then it filled the second-best slot, and so on until every team had been assigned. The result put us in 21st place if we submitted our top four men in the Men’s category (assuming all the teams that could have ranked higher than us submitted their optimal-ranking team). Not too shabby given that we had the 117th smallest team!
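To give a flavor of the idea, here is one possible reading of such a greedy pass, written as a small Python sketch rather than the solver we actually ran. It assumes each organization enters exactly one team, that lower times are better, and that the input maps each category to the team time every eligible organization could post there; all names are hypothetical.

```python
def greedy_assign(candidates):
    """Greedily fill the best open rank across categories.

    candidates: {category: {org: team_time}}, lower time is better.
    Assumption: every organization ends up in exactly one category.
    Returns {org: (category, rank)} with ranks starting at 1.
    """
    # One fastest-first queue of organizations per category.
    queues = {
        cat: sorted(times, key=lambda org: (times[org], org))
        for cat, times in candidates.items()
    }
    next_rank = {cat: 1 for cat in candidates}
    placed = {}
    while True:
        options = []
        for cat, queue in queues.items():
            # Skip organizations already placed in another category.
            while queue and queue[0] in placed:
                queue.pop(0)
            if queue:
                org = queue[0]
                options.append((next_rank[cat], candidates[cat][org], cat, org))
        if not options:
            break
        # Best open slot first; ties go to the faster team.
        rank, _, cat, org = min(options)
        placed[org] = (cat, rank)
        next_rank[cat] += 1
    return placed
```

Like any greedy approach, this commits to each choice as it goes and does not backtrack, so it is a quick estimate rather than a guaranteed optimum.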
We’re always looking for fun ways to apply our product, and for great people to join the team (especially those who agree to run in next year’s race). If the problems we’re working on seem interesting, give us a shout. We’d love to talk!