Jean-Philippe Gouigoux has spent several years as the lead technical project manager specializing in debugging, software architecture and integration of R&D. He has authored numerous published works and is currently working on a book. You’re tuning into part two of his blog, the first of which is published on our site here. You can also view his blog for updates on his book as well as the French translation of this article. Be sure to click on the ‘PDF’ icon in JP’s blog to view the full article.
Grouping the values
The categories are now revealed, and the data has been cleaned. Since what we want is to have an analysis of the amounts in the different categories for some periods of time, we will now group the values on these two axes.
First, we will operate on the time axis. To do so, we need to get rid of the incomplete years using a delete command, in order to allow for a meaningful comparison. The spectrum at the top of the Date column shows that there are three complete years, and that 2012 and 2016 have fewer values:
The following operation allows to get rid of the unwanted periods (be aware of the difference between “Date” the column and “date” the function to express a date in Wrangler):
This is of course reflected by the spectrum for the Date column:
In our analysis, we are interested in aggregating the values per year, so we are going to add an operation to keep only the year part of the date:
Note that we could have inverted the last two operations, which would have made the first one a bit easier (comparison of years instead of actual dates).
Once we reached this point, we should normally be settled and ready to make an aggregation. Let us try, by opening the additional Tools section (click on the hammer icon on the left of the Transform Editor and click on Aggregate):
Using the handles on the left of the four available data lines, drag and drop the columns so as to reach such a state of your interface:
There seems to be a problem, as all the summed values are equal to zero. This comes from the fact that the decimal separator in France is the comma, and not the point, but Wrangler does not use the regional settings yet. We then have to replace this in the CSV file using a text editor, and reload the file in the dataset again. Let us click on Cancel to come back to the previous state (even if we could activate the command, which is true in itself).
When coming back to the operation, the result is better:
We can click on Add to Script to make this operation part of the Wrangler script:
Finally, in order to ease the use of the amounts, we can create a column that express only the difference between the credits and expenses, so that the two columns are grouped into one, which will be positive for the incomes and negative for the expenses.
The two original columns can be removed:
The result is as we would expect, except for the values of course, due to the anonymization:
Exporting the result
As a summary, the script is the following:
It should be noted, though, that all categories have not been shown here, for the sake of simplicity.
It is now time to explore what we have created, and export the data in order to create graphs on it. To do so, we click on Generate Results and set the required options:
Our dataset was small (a few thousand lines), so everything was loaded when we were creating the analysis, but the force of Trifacta is to be able to cluster the calculation on huge datasets. Once the process is finished, we are brought back to the initial interface, in which Wrangler offers an icon to go to the resulting folder:
The resulting CSV file can be imported in Excel (for French users, let us not forget to change the separator back to the expected comma):
Of course, all sorts of graphs can then be created from the data, like the following:
The budget of the family can then be better explored and understood. Moreover, when lines are added, it is very easy to go back to the Wrangler workspace and run the analysis again, which makes for a smooth experience.
Wrangler is a great and ergonomic tool that I was able to explore and show you through this blog. For more examples, stay tuned on my blog for the announcement of my book on Open Data analysis coming to you soon.
Thanks for tuning in to part 2 of wrangling banking data and stay tuned for future pieces by JP and other guest bloggers. Be the first to know when we publish by following us on Twitter @Trifacta, LinkedIn and hitting the “Subscribe” button in the image at the top of the page. Also, if JP’s wrangling has inspired you to do a bit of your own, you can sign up for Trifacta Wrangler for free. Share your stories with us at email@example.com!