Start Free

Speed up your data preparation with Designer Cloud powered by Trifacta

Free Sign Up
All Blog Posts

Intelligence to Automate Common Data Structuring

January 13, 2016

It is well-known that data analysts spend 80% of their time on preparing data to make it available for analysis, and only 20% on the actual analysis. Our goal at Trifacta is to drastically change that ratio so that you can spend more time on the actual analysis.

The first step in making data available for analysis is to bring it into a tabular format. In our most recent release, Trifacta v3, we completely reworked our machine intelligence to help automate this first step of applying tabular structure to a wide range of file formats, to accelerate the overall preparation process.

Splitting data into tables

A very common data format is the comma separated value (CSV) format. In a CSV file, every line is one data record; the fields of the record are separated by commas. Specific fields may be surrounded by quotes, and the first row may be a header row, meaning that it contains the names of the columns in the file. For example, a file might look like this:


If you look closely, you can already see the tabular structure: there are three columns called “Date”, “ItemID”, and “Amount”, and seven rows of actual data. Trifacta supports parsing of these files since v1. If you load this file into Trifacta, you are immediately presented with your data, ready to wrangle.

Without any user interaction, Trifacta automatically recognizes the file format and generated the necessary steps to make the data usable, which you can see if you expand the script:

It splits the data at commas, removes the now redundant quotes, and recognizes that the first row contains the column names.

Other formats

While a comma is the most commonly used separator, there are other popular choices, for example a tabstop character, a pipe character (|), or a semicolon; Google’s DoubleClick recommends the use of a thorn character (þ); and our customers report separators ranging from dollar signs to zero width space characters. When it comes to choosing separator characters, the possibilities seem limitless.

Excerpt of a DoubleClick Data Transfer file



To make matters even more challenging, separators don’t have to be a single character, and they can vary. For example, the access logfile of a web server typically looks like this. - - [05/Jan/2016:16:02:24 -0800] "GET /logs/access.log HTTP/1.1" 404 73 "" - - [05/Jan/2016:16:02:56 -0800] "GET /index.html HTTP/1.1" 200 14007 "-" - - [05/Jan/2016:16:05:28 -0800] "GET /data/5/3 HTTP/1.1" 200 935 "-" - - [05/Jan/2016:16:05:36 -0800] "GET /logs/access.log HTTP/1.1" 200 697 "" - - [05/Jan/2016:16:43:22 -0800] "GET /images/back.png HTTP/1.1" 404 88 "-" - - [05/Jan/2016:16:10:07 -0800] "GET /pdf/report.pdf HTTP/1.1" 404 73 "" - - [05/Jan/2016:16:10:07 -0800] "GET /index.html HTTP/1.1" 404 73 "" - - [05/Jan/2016:16:10:08 -0800] "GET /index.html HTTP/1.1" 404 73 "" - - [05/Jan/2016:16:12:05 -0800] "GET /datasets/ HTTP/1.1" 200 2266 "" - - [05/Jan/2016:16:12:51 -0800] "GET /images/front.png HTTP/1.1" 206 207560 "" - - [05/Jan/2016:16:13:03 -0800] "GET /images/middle.gif HTTP/1.1" 200 2816 "" - - [05/Jan/2016:16:29:30 -0800] "GET /index.html HTTP/1.1" 200 23234 "-" - - [05/Jan/2016:16:33:18 -0800] "GET /style.css HTTP/1.1" 200 1344 "" - - [05/Jan/2016:16:38:47 -0800] "GET /js/script.js HTTP/1.1" 200 5372 ""


Several fields stand out: IP addresses, dates, URLs, and more, but they are separated by spaces, brackets, and quotes. We can load the data into Trifacta and use Predictive Transformation to manually select and extract the fields within minutes. But wouldn’t it be nice if that happened automatically, just as with CSV files? In v3, we made that possible.

Extending the intelligence

To handle all three example files we have to understand the structure of a file; since the data changes from customer-to-customer and use case-to-use case, this has to happen without human intervention. To accomplish this, we designed an algorithm that can detect the structure of a wide range of file formats, even formats we haven’t encountered yet. The algorithm takes the data and converts it to a directed acyclic graph. We then apply methods of graph theory to simplify the graph, grouping similar patterns in the file, until the structure emerges. This allows us to handle formats without knowing their structure beforehand.

Example: DoubleClick Data Transfer file

For the DoubleClick file, the structure detection concludes that the thorn character is used as a separator.


It creates the script to split at that character and also recognizes that the first row contains the column names.


So immediately after loading your data, you are able to work with it.


Example: Server access log file

For the server access log, the structure detection does considerably more work.

Luckily that this is done automatically for you in the background, and you get clean access to your data.


With this new feature, we’re excited to provide our users with automated structuring and cleaning transformations that will dramatically accelerate the preparation process. We’ll be continuing to expand upon this feature in our future releases to handle a wider variety of data formats and also provide more automated transformation suggestions for common data preparation tasks.

For more on what’s included in Trifacta v3, watch our recorded webinar providing an comprehensive overview of the release here.