Arushi Arora is a Data Science Master’s student at Columbia University. She has an undergraduate degree in computer science and engineering from India, where she first learned to love data. Arushi enjoys using data to learn new things about the world, from floods and meteorites to baby weights and Foursquare check-ins.
I was first introduced to Trifacta Wrangler by a professor who taught a class called “Storytelling with Data.” At the time, we were working with a messy dataset about meteorite landings observed on Earth, which had inconsistent date-time fields—some entries contained only the date, while others had both the date and time. At first, we tried to standardize the data in Python, but that was tedious and time-consuming. It meant continually checking for inconsistencies and then scripting the remaining variations. Once we adopted Trifacta Wrangler, however, it took less than a minute to clean the entire dataset. I was pretty excited, to say the least.
Now, I’m using Trifacta Wrangler for lots of other projects. In my “Exploratory Data Analysis and Visualization” course, for example, we were given a dataset on global flood records, which contained data such as the country the flood occurred in, its main cause and the total damage caused. One of the value types had multiple variations to describe what kind of rain caused the flood; “Heavy Rain,” for example, was spelled as “Heavy rain”, “heavy rain”, “HEAVY Rain” and even “Monsoonal Rain.” Much like the previous dataset, I faced the same issue when cleaning the data in Python—I’d have to first find all of the variations in order to create a code to replace them. However, rather than spending two hours to complete this task using R, with Trifacta Wrangler, I standardized my data in ten minutes. Finally, once my data was cleaned in Trifacta Wrangler, I used R or D3.js for visualization.
Below, I’ve included a step-by-step guide of how Trifacta Wrangler helped clean my flood data.
1) Upon launching Trifacta Wrangler on my desktop, I created and named my data set.
2) Then I checked out the mismatched values, highlighted for me in red. Here, the glide column’s data is being read as a binary variable, causing the mismatch. Once I changed it, the issue was resolved.
3) Then, I got rid of columns that contain zeros or insignificant variables by using the drop function.
4) With the “replace” suggestion card I was able to easily get rid of the blank spaces in front of each country.
5) Column Details show me how frequent the word appears in a column. It’s here that I noticed there is one value that shows up only once: “Bangledesh”. I quickly identified it as a spelling error and changed it to the correct spelling, “Bangladesh”.
6) Additionally, column details provides insights into which countries are affected by floods most frequently. Over the course of October 2001 – January 2016, the top two flooded countries were the United States and China.
7) From here, I decided it was important to my analysis to keep time and date as separate columns. I was able to easily do this by highlighting one of the dashes in between the two variables. This led to the split suggestion, which I selected and added to my script.
8) Once I finished my transformations, I was able to choose to export my data as a CSV, JSON or TDE file.
9) The “Global Floods” dataset stayed on my Trifacta homepage, where I was able to see a job results summary with an overview of my data.
All in all, learning to use the tool was incredibly easy, even for those who don’t understand how to write commands into the Script. If you don’t know how to write your own commands, you can either trial-and-error with various versions or simply perform a selection to be prompted a series of transformation suggestions to add to your Script.
To try out Trifacta Wrangler for yourself, download it here. And if you have a story to tell, we’d love to hear it! Email us at firstname.lastname@example.org and you could be featured as our next guest blogger.