Start Wrangling

Speed up your data preparation with Trifacta

Free Sign Up
Trifacta Ranked #1 in Data Preparation Market Study

Dresner Advisory Services study reviews and ranks 24 vendors

Get the Report
Schedule a Demo

Intelligence to Automate Common Data Structuring

January 13, 2016

It is well-known that data analysts spend 80% of their time on preparing data to make it available for analysis, and only 20% on the actual analysis. Our goal at Trifacta is to drastically change that ratio so that you can spend more time on the actual analysis.

The first step in making data available for analysis is to bring it into a tabular format. In our most recent release, Trifacta v3, we completely reworked our machine intelligence to help automate this first step of applying tabular structure to a wide range of file formats, to accelerate the overall preparation process.

Splitting data into tables

A very common data format is the comma separated value (CSV) format. In a CSV file, every line is one data record; the fields of the record are separated by commas. Specific fields may be surrounded by quotes, and the first row may be a header row, meaning that it contains the names of the columns in the file. For example, a file might look like this:

"Date","ItemID","Amount"
"2015-11-02","123/16879,"3"
"2015-11-14","134/7735","1"
"2015-11-30","136/13129","4"
"2015-12-01","140/16416","1"
"2015-12-14","218/19876","5"
"2015-12-16","001/23453","9"
"2016-01-05","232/17943","2"

If you look closely, you can already see the tabular structure: there are three columns called “Date”, “ItemID”, and “Amount”, and seven rows of actual data. Trifacta supports parsing of these files since v1. If you load this file into Trifacta, you are immediately presented with your data, ready to wrangle.

Without any user interaction, Trifacta automatically recognizes the file format and generated the necessary steps to make the data usable, which you can see if you expand the script:

csv-script
It splits the data at commas, removes the now redundant quotes, and recognizes that the first row contains the column names.

Other formats

While a comma is the most commonly used separator, there are other popular choices, for example a tabstop character, a pipe character (|), or a semicolon; Google’s DoubleClick recommends the use of a thorn character (þ); and our customers report separators ranging from dollar signs to zero width space characters. When it comes to choosing separator characters, the possibilities seem limitless.

Excerpt of a DoubleClick Data Transfer file

TimeþUser-IDþAdvertiser-IDþBuy-IDþAd-IDþCreative-IDþSite-IDþCountry-IDþState/ProvinceþZip-Code
2015-11-06-00:26:17þ879þ3313229þ1710681þ53134298þ41049110þ23673202þ256þILþ60010
2015-11-08-01:15:51þ480þ2120992þ8671778þ30870463þ24233880þ20039043þ256þNJþ07726
2015-11-14-04:03:01þ908þ6105126þ1900862þ16229185þ45712263þ22556009þ256þOHþ44224
2015-11-17-06:15:23þ721þ2756980þ7481325þ74323754þ24319503þ47438046þ256þILþ60446
2015-11-26-12:41:50þ862þ7431670þ2130323þ25822702þ76902553þ97728210þ256þNJþ08088
2015-11-27-13:20:17þ441þ6037923þ4622762þ82691958þ63388703þ81776267þ256þGAþ30736
2015-11-29-17:42:15þ949þ8432620þ6343815þ53805117þ65154412þ28374994þ256þUTþ84067
2015-12-07-18:55:40þ281þ9823152þ4809114þ63658276þ43359283þ34394161þ256þMAþ01420
2015-12-12-19:52:39þ925þ8900507þ1494382þ51845562þ10949377þ94698867þ256þNCþ27330
2015-12-14-23:59:15þ910þ4745409þ3329241þ30038457þ28640469þ46989876þ256þMAþ02135

 

To make matters even more challenging, separators don’t have to be a single character, and they can vary. For example, the access logfile of a web server typically looks like this.

192.168.0.102 - - [05/Jan/2016:16:02:24 -0800] "GET /logs/access.log HTTP/1.1" 404 73 "https://google.com/"
172.19.2.102 - - [05/Jan/2016:16:02:56 -0800] "GET /index.html HTTP/1.1" 200 14007 "-"
10.102.2.45 - - [05/Jan/2016:16:05:28 -0800] "GET /data/5/3 HTTP/1.1" 200 935 "-"
192.168.16.105 - - [05/Jan/2016:16:05:36 -0800] "GET /logs/access.log HTTP/1.1" 200 697 "www.google.com/"
172.18.0.3 - - [05/Jan/2016:16:43:22 -0800] "GET /images/back.png HTTP/1.1" 404 88 "-"
192.168.200.103 - - [05/Jan/2016:16:10:07 -0800] "GET /pdf/report.pdf HTTP/1.1" 404 73 "https://www.microsoft.com/en-us/windows"
10.0.3.5 - - [05/Jan/2016:16:10:07 -0800] "GET /index.html HTTP/1.1" 404 73 "https://www.yahoo.com/"
192.168.17.103 - - [05/Jan/2016:16:10:08 -0800] "GET /index.html HTTP/1.1" 404 73 "http://www.yahoo.com/"
172.16.3.15 - - [05/Jan/2016:16:12:05 -0800] "GET /datasets/ HTTP/1.1" 200 2266 "https://www.trifacta.com/about-us/"
192.168.125.102 - - [05/Jan/2016:16:12:51 -0800] "GET /images/front.png HTTP/1.1" 206 207560 "https://www.google.com/"
172.26.0.222 - - [05/Jan/2016:16:13:03 -0800] "GET /images/middle.gif HTTP/1.1" 200 2816 "https://www.google.com/"
172.16.0.214 - - [05/Jan/2016:16:29:30 -0800] "GET /index.html HTTP/1.1" 200 23234 "-"
192.168.29.104 - - [05/Jan/2016:16:33:18 -0800] "GET /style.css HTTP/1.1" 200 1344 "https://www.baidu.com/"
10.222.5.88 - - [05/Jan/2016:16:38:47 -0800] "GET /js/script.js HTTP/1.1" 200 5372 "https://www.yahoo.com/"

 

Several fields stand out: IP addresses, dates, URLs, and more, but they are separated by spaces, brackets, and quotes. We can load the data into Trifacta and use Predictive Transformation to manually select and extract the fields within minutes. But wouldn’t it be nice if that happened automatically, just as with CSV files? In v3, we made that possible.

Extending the intelligence

To handle all three example files we have to understand the structure of a file; since the data changes from customer-to-customer and use case-to-use case, this has to happen without human intervention. To accomplish this, we designed an algorithm that can detect the structure of a wide range of file formats, even formats we haven’t encountered yet. The algorithm takes the data and converts it to a directed acyclic graph. We then apply methods of graph theory to simplify the graph, grouping similar patterns in the file, until the structure emerges. This allows us to handle formats without knowing their structure beforehand.

Example: DoubleClick Data Transfer file

For the DoubleClick file, the structure detection concludes that the thorn character is used as a separator.

doubleclick-grid-preview

It creates the script to split at that character and also recognizes that the first row contains the column names.

doubleclick-script

So immediately after loading your data, you are able to work with it.

doubleclick-grid

Example: Server access log file

For the server access log, the structure detection does considerably more work.

Luckily that this is done automatically for you in the background, and you get clean access to your data.

access-grid

With this new feature, we’re excited to provide our users with automated structuring and cleaning transformations that will dramatically accelerate the preparation process. We’ll be continuing to expand upon this feature in our future releases to handle a wider variety of data formats and also provide more automated transformation suggestions for common data preparation tasks.

For more on what’s included in Trifacta v3, watch our recorded webinar providing an comprehensive overview of the release here.

Related Posts

Cloud Dataprep Enhancements: Macros, Transform by Example and Cluster Clean

Cloud Dataprep by Trifacta’s September ‘19 Release brings three great new features: Macros, Transform by... more

  |  September 11, 2019

Data Preparation For Data Mining

Data preparation for data mining is a critical step to take in any big data effort. Sometimes, beginner data... more

  |  June 21, 2016

Addressing the Growing Need for Data Wrangling Education

Tye Rattenbury, Trifacta’s Director of Data Science and Solutions Engineering, along with Trifacta’s... more

  |  September 29, 2015