Process, Business and Data Analyst at OptimalBI Daniel Reed takes a second look at Trifacta in his ‘Question Time‘ series. If you’ve been following along with OptimalBi’s journey through Trifacta, you’ve read Reed’s insights on learning to code with Trifacta as well as his colleagues’ thoughts on how to get started, the data wrangling process, and navigating unfamiliar data. We’re excited to share another piece from our friends at OptimalBI and hope you enjoy reading it as much as we did!
Over the last few weeks I’ve been working through a series of ‘Question Time’ blogs, taking a look at the many sets of Open Data available from NZ and the wider world. Putting these together took a bit of time and a couple of tools so thought as well as sharing the outputs, I’d share the journey and learning too. Those awesome tools were Trifacta and Qlik Sense … and a little bit of Excel. Most of the data came in XLS or CSV form, with a KML thrown in for some mapping. To get from raw source to report ready I used Trifacta to prep my data.
When I started with Trifacta I had the wrong idea about it, I was trying to use it to change individual values in the data, but that’s not really what it’s designed for. I needed to change my thinking, it’s designed as a data wrangling* tool. For those that aren’t familiar with the term, think about it as a tool to apply the same change across all of your data, not to edit individual values. While, technically, you can do that, it’s not really Trifacta’s purpose. Once I shifted my thinking, I realised it was an awesome on-the-fly data preparation*. The most impressive part is that it doesn’t really need any kind of coding expertise just an understanding of the data itself and how you want to change it.
In the spirit of keeping my process consistent (and simple), I transformed in Trifacta before bringing it into Qlik Sense. That way I knew I had a clean, ready to go data set with minimal configuration (and mess) in Qlik.
First off, if you haven’t used Trifacta before, check out this guide to getting started. It’ll tell you all about installing, through to loading the data and the basics to get started. Once you’re familiar with the app, then the examples below will help you with some of those transformations.
When I first loaded the file I noticed a couple of pre-canned transformations, Trifacta worked out the format of my file and added in the transformations needed to parse the file. What’s cool is you can edit them, even loading the file is customisable! The three it added for my CSV were:
This looks for an end of line character and creates a new row each time it finds one.
EXAMPLE: splitrows col: column1 on: ‘n’ quote: ‘”‘
This looks for a separator character, in my case it was a comma, Trifacta then breaks them out into columns.
EXAMPLE: split col: column1 on: ‘,’ limit: 24 quote: ‘”‘
The last function was to remove the quotes that signify a value. It does this by simply replacing them with nothing.
EXAMPLE: replace col: * on: ‘”‘ with: ” global: true
You can see and edit the script items using the Open Script button (top button circled below).
On top of those initial transforms I needed to remove some extra lines and get rid of the empty columns. You can see by the bars at the top that my data is inconsistent due to the extra data in the file, so to get my data report ready I added a few more changes:
Used to remove a column. Due to the formatting of the file, it had a small gap in between each column.
EXAMPLE: drop col: column6
Used as a way to remove rows based on the value in a specific column, be careful – this will remove any rows which match that specific value.
EXAMPLE: delete row: column2 == ‘Value to Exclude’
Used to identify the top record as the header row, assigning those values to the column titles. In my example, I removed all of the descriptive rows first before applying the header row, ensuring the values were the top record.
Used as a way of merging two fields. For my example I had a month and year column as separate and wanted to join the two together separated by a hyphen.
EXAMPLE: merge col: column3,’-‘,column2
Used to rename columns in the data.
EXAMPLE: rename col: column27 to: ‘Month’
Most of the transforms stripped out content I didn’t need, but there were a couple more; my dates were split over two columns (month and year), to chart them I needed to create a single value, Trifacta did this really simply.
I highlighted both columns (using the shift key) and scrolled along to Merge, it wasn’t quite perfect but close, I needed to flip them around and add a hyphen, so clicked on Modify script and edited it to the order I needed:
merge col: column3,’-‘,column2
After that I was ready to Generate Results and import into Qlik, ready to make something a bit more visual. If you want to find out what happened next, keep an eye out for Question Time: Behind the Scenes with Qlik Sense coming soon.
Keep exploring! Daniel.
Have a wrangling experience to share? Send it to us at firstname.lastname@example.org and you could see your story on our blog!
*Correction: changed from original blog’s terminology ‘ETL’ to correct term ‘data wrangling’ and ‘data preparation’