The following is a guest post from Jennie Rogers, Assistant Professor at Northwestern University.
Knowing the fundamentals of data cleaning and preparation is essential for any budding data scientist. Yet this topic is rarely addressed in data science curricula, which are more likely to emphasize analysis of already-cleaned data: no typos, no mismatched data types, no missing values. This leaves students ill-prepared for real-world datasets, especially given the well-known 80% Rule, which holds that data scientists spend about 80% of their time on data cleaning and preparation.
In Northwestern University’s Data Science Seminar, we teach students how to analyze real-world datasets with approaches ranging from SQL to visualization. For this class, we teamed up with the Invisible Institute’s Citizens Police Data Project (CPDP) to guide students through the process of answering questions about police misconduct data. The CPDP’s dataset includes complaint reports filed by civilians against officers, use-of-force reports filed by the officers, the results of the investigations into the complaint reports, and countless messy supporting documents from these investigations. The Invisible Institute, a journalism production company on the South Side of Chicago, receives this data, with periodic updates, from the Chicago Police Department (CPD) as the result of a nearly decade-long lawsuit that the institute filed and won. As a result, they have several overlapping copies of this dataset, each of which has its own method of categorizing the complaint reports.
Students worked in small groups, each of which investigated a self-selected theme using the CPDP dataset. Project themes included analyzing correlations between sociodemographic data and the outcomes of complaint report investigations, identifying patterns in the early-career conduct of officers who were the subject of numerous complaint reports, and analyzing the differences in complaint reports among the individual datasets sent by the Chicago Police Department. For the last of these projects, the students used Trifacta to develop a mapping table between the complaint report categories in each dataset and to identify anomalies in the mappings. For example, some complaints that were initially labelled as “absent without permission” became “seatbelts”!
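To give a feel for what such a mapping table involves, here is a minimal Python sketch. It is not the students' Trifacta recipe. The records are made-up toy data, and `build_mapping` and `find_anomalies` are hypothetical helper names. The sketch assumes each complaint carries a stable ID across dataset versions, and it flags any category pairing that departs from the most common mapping for a given original category.

```python
from collections import Counter, defaultdict

# Made-up toy records: (complaint_id, category) in two versions of the dataset.
older = [
    ("C1", "absent without permission"),
    ("C2", "absent without permission"),
    ("C3", "absent without permission"),
    ("C4", "use of force"),
]
newer = [
    ("C1", "absent without permission"),
    ("C2", "absent without permission"),
    ("C3", "seatbelts"),
    ("C4", "use of force"),
]


def build_mapping(older, newer):
    """Count how often each old category maps to each new category."""
    newer_by_id = dict(newer)
    mapping = defaultdict(Counter)
    for complaint_id, old_category in older:
        new_category = newer_by_id.get(complaint_id)
        if new_category is not None:  # skip complaints missing from the newer copy
            mapping[old_category][new_category] += 1
    return mapping


def find_anomalies(mapping):
    """Return (old, new, count) for pairings that differ from the dominant one."""
    anomalies = []
    for old_category, counts in mapping.items():
        dominant, _ = counts.most_common(1)[0]
        for new_category, count in counts.items():
            if new_category != dominant:
                anomalies.append((old_category, new_category, count))
    return anomalies


mapping = build_mapping(older, newer)
anomalies = find_anomalies(mapping)
print(anomalies)
```

On the toy data, the only flagged pairing is the one complaint that moved from “absent without permission” to “seatbelts”. A real mapping would need a human to decide which flagged pairings are relabelings and which are errors.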
We also wanted to challenge the students to verify their findings with additional datasets. We used Trifacta to integrate arrest records, including the arresting officers, into the CPDP database. The students worked on a shared Trifacta instance on Amazon EC2, where teams collaborated on a single data cleaning pipeline. They started with 15 years of noisy, messy Excel files from the CPD, stored in an Amazon S3 bucket, and harmonized the schema, which varied from year to year. They also cleaned up the text fields, correcting the formatting of dates, officers’ names, and case identifiers. After that, they integrated this new data with the CPDP dataset by matching the officers by name. This enabled the students to probe the dataset with questions about cover charges, that is, arrests allegedly made to cover up bad behavior or to justify a use of force. Although our findings were inconclusive, Trifacta made it easy for us to identify outliers: officers with many arrests and use-of-force reports in overlapping time windows.
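For readers who think in code, the steps above (harmonize the schema, clean the text fields, match officers by name) might look roughly like the following Python sketch. This is an illustration under stated assumptions, not the pipeline the class built in Trifacta. The column aliases, date formats, sample rows, and function names are all hypothetical, and every cell value is assumed to be a string.

```python
from datetime import datetime

# Hypothetical header variants seen across years, mapped to one canonical schema.
COLUMN_ALIASES = {
    "officer": "officer_name",
    "arresting officer": "officer_name",
    "date": "arrest_date",
    "arrest date": "arrest_date",
    "cb_no": "case_id",
    "case id": "case_id",
}
# Hypothetical date formats to try, in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")


def harmonize(row):
    """Rename known columns to canonical names and trim whitespace."""
    clean = {}
    for key, value in row.items():
        canonical = COLUMN_ALIASES.get(key.strip().lower())
        if canonical:
            clean[canonical] = value.strip()
    return clean


def normalize_name(name):
    """Turn 'DOE, Jane ' or 'jane  doe' into 'jane doe'."""
    name = " ".join(name.split())
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return name.lower()


def parse_date(text):
    """Return a date for the first format that fits, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def match_arrests(arrest_rows, officers):
    """Attach an officer_id to each arrest by normalized name (None if unmatched)."""
    by_name = {normalize_name(o["name"]): o["officer_id"] for o in officers}
    matched = []
    for row in arrest_rows:
        clean = harmonize(row)
        matched.append({
            "officer_id": by_name.get(normalize_name(clean["officer_name"])),
            "arrest_date": parse_date(clean["arrest_date"]),
            "case_id": clean["case_id"],
        })
    return matched


# Made-up rows imitating two years with different headers and formats.
arrests_2005 = [{"Arresting Officer": "DOE, Jane ", "Arrest Date": "03/07/2005", "CB_NO": " 1001"}]
arrests_2006 = [{"officer": "john  smith", "date": "2006-11-02", "Case ID": "1002"}]
officers = [{"officer_id": 1, "name": "Jane Doe"}, {"officer_id": 2, "name": "John Smith"}]

matched = match_arrests(arrests_2005 + arrests_2006, officers)
print(matched)
```

Matching on name alone is fragile: in this sketch, two officers who share a name would silently collapse into one dictionary entry. Any real integration would need extra fields or manual review to tell them apart.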
Our students come from diverse backgrounds, and many of them are not programmers by training. Being able to navigate large datasets visually, using samples and histograms to spot anomalies, mismatched data types, and mappings between disparate datasets, made it possible for all of our students to rapidly learn the principles of data cleaning by doing. Rather than coding a cleaning pipeline from scratch, they could quickly add steps to recipes and drag and drop the recipes within their data flows. At the end of the class, I asked the students how many of them agreed with the 80% Rule of data science. Everyone raised their hand. Class dismissed!
Want to start wrangling your own data? Take a free test drive of Trifacta!