Jean-Philippe Gouigoux has spent several years as the lead technical project manager specializing in debugging, software architecture and integration of R&D. He has authored numerous published works and is currently working on a book. You can view his blog here for updates on his book as well as the French translation of this article. Be sure to click on the ‘PDF’ icon in JP’s blog to view the full article.
In the past couple years, my taxes have increased considerably. I wanted to know if my incomes had increased as much, more, or less—I had a vague idea of this, but, since my account files didn’t contain any clear categories, it didn’t make for an easy analysis. I also wanted to know which part of my expenses were taxes vs. leisure vs. mandatory expenses. Since I have additional income from renting a B&B, as well as from my IT books and videos, I wanted to understand how this participated to the balance with all its variations.
At the same time, I was writing a book on Open Data, which covers a variety of tools for data cleansing and analysis. Apart from the traditional Excel and QlikView, I wanted to try something different, and my wanderings brought me to Trifacta Wrangler, which definitely is different from the traditional tools.
First of all, its ergonomics are extremely well thought of. It began as a research project at Stanford University from Ph.D’s Joe Hellerstein and Jeffrey Heer with their then Ph.D student Sean Kandel, with the goal to guide the user through suggestions on the data. This approach is extremely intuitive and helps to achieve great results quickly. Since then, Joe, Jeff and Sean launched Trifacta to continue their work in building a tool that end users of all skill levels could use.
And now, let’s explore the tool!
The initial Excel file used to follow up on expenses and income looks like the following:
Of course, the content has been anonymized so that the amounts do not convey any private information. For the most curious readers, this operation has been performed through the following macro:
In order to avoid encoding issues due to French special characters (Excel only records in UTF-16, while Trifacta Wrangler reads UTF-8), the Excel spreadsheet is exported in a CSV file (using semi-colon as the separator) and then encoded in UTF-8:
The input file is now ready, and we can proceed to the first operation in Wrangler itself, which will be the creation of the associated dataset. Once logged into Wrangler, we use the Create Dataset command:
One click on Create and Transform brings us to the transformation workshop of Wrangler:
The setup is done and we will start analyzing and manipulating the data.
Cleaning unnecessary data
We will start with removing clutter from the input file, taking the columns one after the other. Trifacta Wrangler is particularly helpful, since it not only recognized the column type, but sorts out the content and shows which part is correct, which is not and which is empty. In the case of the first column, the type has been correctly determined as a date, and Wrangler indicates there are some missing values:
From the way the file has been created, all lines without a date should be considered as clutter, as a financial movement always has a related date. Let us click on the black zone at the top of the column, corresponding to the missing values. The result is that these lines are selected:
Instead of the suggestion of keeping the lines, we will remove them by selecting the second block in the suggestions list, named Delete, and clicking on Add to Script. While we are at it, the columns called Balance and Comments have no use in our analysis, so we will suppress them as well. We start by selecting them by simply clicking on their title:
The suggestion (Drop) is the right one, and we accept it by clicking again on Add to Script. The data seems nice and clean, but there is still a little something we should take care of, as the spectrum of the Date column shows:
There seems to be an incorrect date located between 1900 and 1904, which should be a false input, as the bank account has started being traced in 2012.
The simplest way to correct this is to go back in the CSV file, modify the entry, and then save it. The detection of the line that causes a problem can be done within Wrangler tools, and in particular the Column Browser. We can open it through the following icon:
This allows us to access to additional information on the Date column:
Interestingly, the “incorrect while correct” data has been detected by Wrangler, and classified as an “outlier”:
The above manipulation was thus not necessary here, but we wanted to demonstrate it nonetheless, as this is a general-purpose operation that can be a good help in some other situations.
In order to refresh the results once the input has been corrected and the CSV file saved, we have to indicate Wrangler to reload the dataset. To do so, we will use the function allowing to change the input, simply indicating the same file:
Once inside the dialog, we use the Add File command again, and point to the same file:
This brings us to the following dialog, where we click on Select:
The column analysis seems perfect now:
No mismatched values, no outliers: we are ready to proceed with the next step of the analysis. We simply need for now to return to the data grid:
Determining the category
The next step in our analysis is to determine categories of expenses and income, in order to allow for an aggregated point of view. To do so, we will create a column containing this information and add values to it step by step, by looking for keywords and patterns in the description of the bank operations.
First, we need to create the empty column. For this, we will use the derive action, and this time, we will not use any suggestion, but simply start typing the name of the command in the Transform Editor:
The typing assistance shows that the first parameter is value. In our case, we will use an empty string, so as to keep the column blank by default. Next, we use the as parameter to express the name of the column to be created, and click on Add to Script to execute the operation:
Since we will use the words in the Description column and they are both upper- and lowercase, we will first operate a transformation on this column to set the same case everywhere. To do so, we select the column by its title header, and look for the right suggestion:
The second option in the Set suggestions block seems to be what we were looking for, so we accept it.
We can then use the content to determine the category of the lines. To do so, we will use the set command again, but this time with a given value, and a condition on which line to apply it to. One of the most obvious value is the Food category, which should be used whenever the words “Restaurant”, “Boulangerie”, “Boucherie” or “Carrefour” appear (the last three are words for bakery, butcher shop and a supermarket brand in French). This time, just like above, we will directly use the Transform Editor to express what we mean to execute. The first information to pass, apart from the operation name of course, is the column on which it will act:
The next important option is the value, which is fixed in this example:
Finally, we have to create a condition on the content of the description, which will be created like this:
The matches function allow to filter the lines for which there is a match between the content of the description column and the regular expression passed as the second parameter. Notice that the separators are backward apostrophes, and not the regular ones used for the string values, just like in the value parameter. With such an example, we see above the expression that Wrangler has shown the result on a matched line.
It would of course be possible to duplicate this operation for each pattern we look for, but since we already have to duplicate the operation for each category, it sounds cleaner to search for all patterns for a given category in only one operation. To do so, we will simply use the or boolean operator, which is expressed by the double pipe symbol in Wrangler:
Clicking Add to Script executes the command, and we will add as many as necessary for the different categories we want to retrieve. For example, “EDF” stands for “Electricité de France”, and “SAUR” is a water supplier, so all lines related go to the House category, just like the internet subscription:
“Forfait” and “Virement” are bank-related French words:
“Voiture” is the French word for “car” and “Essence” for “gasoline”, so all this goes to Transport:
After a few more of these manipulations, the spectrum at the top of the Category column should show that there are not that many more unrelated lines:
This means we can simply take the rest and sort it as “Other”. But to do so, we still have to make the difference between “Other expenses” and “Other incomes”, of course. We will use the fact that either the Credit column or the Expense column is filled. The first operation to add is the following:
This solves some part of the problem:
The second operation is as follows:
We were expecting to see no line without a category, but one remains:
Since the Column Details do not show the exact value, we will use the same trick as shown above, and add a temporary Keep operation on the missing value. The result is the following line:
Again, this seems like an incorrect input, as there should not be a Credit and an Expense in the same line. We correct the CSV file and reload it, like before. The explanation for this exception is that the payment for those cinema tickets have never been taken by the bank, so the original account Excel file showed the same amount in the Credit and Expense, to account for this peculiar situation. The amounts are now not the same because of the anonymizing macro.
Since the categories are now OK, we can get rid of the Description column, and this will be the last operation before the next step in our analysis:
Thanks for tuning in to part 1 of wrangling banking data! In part 2, released next week, JP will go over the final steps of Grouping and Exporting his results. Be the first to know by following us on Twitter @Trifacta as well as on LinkedIn. Also, if his wrangling has inspired you to do a bit of your own, you can sign up for Trifacta Wrangler for free. Share your stories with us at firstname.lastname@example.org!