Start Free

Speed up your data preparation with Trifacta

Free Sign Up
Moving Analytics to the Cloud?

Survey of 600+ data workers reveals biggest obstacles to AI/ML in the cloud

Get the Report
Schedule a Demo

Giddyup!
Wrangling JSON Metadata via Un-Nesting

October 27, 2017

JSON is a popular file format used to store unstructured content. Many popular databases use JSON to, for example, store media files. While it’s very easy for machines to parse JSON, JSON presents some challenges for analysts because data is nested inside of the document.

For our example, we’ll look at Twitter data. Opening our 59MB file in a text editor reveals the data.

{“twitter_info1″:”{\”hashtags\”:[],\”created_date\”:\”2015-04-16T19:38:58\”,\”mention\”:[\”USTELCOCares\”],\”twitter_text\”:\”@USTELCOcares i haven’t been an USTELCO customer in 10+ years, but suddenly received bill for svcs never used. customer svc ppl can’t find record\”}”,”user_info1″:”{\”user_name\”:\”Jason Oh\”,\”user_id\”:\”jasonoh\”,\”screen_name\”:[\”USTELCOCares\”],\”location\”:\”ManhUSTELCOan\”,\”friends_count\”:\”79\”}”,”created_date”:”2015-04-16″,”created_time”:”19:38:58″,”twitter_text”:”@USTELCOcares i haven’t been an USTELCO customer in 10+ years, but suddenly received bill for svcs never used. customer svc ppl can’t find record”,”twitter_text2″:”@USTELCOcares”,”Handle”:”@USTELCOcares”,”friends_count2″:”79″, “location2″:”ManhUSTELCOan”,”screen_name2″:”[\”USTELCOCares\”]”,”user_id2″:”jasonoh”,”user_name2″:”Jason Oh”,”friends_count1″:”79″,”location1″:”ManhUSTELCOan”,”screen_name1″:”[\”USTELCOCares\”]”,”user_id1″:”jasonoh”,”user_name1″:”Jason Oh”,”friends_count”:”79″,”location”:”ManhUSTELCOan”,”user_id”:”jasonoh”,”user_name”:”Jason Oh”} {“twitter_info1″:”{\”hashtags\”:[\”USTELCO\”,\”cisco\”,\”network\”],\”created_date\”:\”2015-04-16T19:39:45\”,\”mention\”:[],\”twitter_text\”:\”#USTELCO taps Ciena, #Cisco to build its 100G U.S. metro #network http://t.co/SZLdecjWdC\”}”,”user_info1″:”{\”user_name\”:\”ExterNetworks Inc\”,\”user_id\”:\”ExterNetworks\”,\”screen_name\”:[],\”location\”:\”Piscataway, NJ 08854 \”,\”friends_count\”:\”1229\”}”,”created_date”:”2015-04-16″,”created_time”:”19:39:45″,”twitter_text”:”#USTELCO taps Ciena, #Cisco to build its 100G U.S. metro #network http://t.co/SZLdecjWdC”,”twitter_text2″:null,”Handle”:null,”friends_count2″:”1229″,”location2″:”Piscataway, NJ 08854 “,”screen_name2″:”[]”,”user_id2″:”ExterNetworks”,”user_name2″:”ExterNetworks Inc”,”friends_count1″:”1229″,”location1″:”Piscataway, NJ 08854 “,”screen_name1″:”[]”,”user_id1″:”ExterNetworks”,”user_name1″:”ExterNetworks Inc”,”friends_count”:”1229″,”location”:”Piscataway, NJ 08854 “,”user_id”:”ExterNetworks”,”user_name”:”ExterNetworks Inc”}

For the purpose of demonstration, we can format one record a little differently by hand so we can more clearly see the content of the tweet and the associated metadata:

{  

   “twitter_info1”:“{\”hashtags\”:[],\”created_date\”:\”2015-04-16T19:38:58\”,\”mention\”:[\”USTELCOCares\”],\“twitter_text\”:\”@USTELCOcares i haven’t been an USTELCO customer in 10+ years, but suddenly received bill for svcs never used. customer svc ppl can’t find record\”}”,

   “user_info1”:“{\”user_name\“:\”Jason Oh\”,\”user_id\”:\”jasonoh\”,\”screen_name\”:[\”USTELCOCares\”],\”location\”:\“ManhUSTELCOan\”,\”friends_count\”:\”79\”}”,

   “created_date”:“2015-04-16”,

   “created_time”:19:38:58,

   “twitter_text”:“@USTELCOcares i haven’t been an USTELCO customer in 10+ years, but suddenly received bill for svcs never used. customer svc ppl can’t find record”,

   “twitter_text2”:“@USTELCOcares”,

   “Handle”:“@USTELCOcares”,

   “friends_count2”:“79”,

   “location2”:“ManhUSTELCOan”,

   “screen_name2”:“[\”USTELCOCares\”]”,

   “user_id2”:“jasonoh”,

   “user_name2”:“Jason Oh”,

   “friends_count1”:“79”,

   “location1”:“ManhUSTELCOan”,

   “screen_name1”:“[\”USTELCOCares\”]”,

   “user_id1”:“jasonoh”,

   “user_name1”:“Jason Oh”,

   “friends_count”:“79”,

   “location”:“ManhUSTELCOan”,

   “user_id”:“jasonoh”,

   “user_name”:“Jason Oh”

}

The JSON document consists of name/value pairs, where values can also be arrays. In the above example, on a user record, one attribute is named `friends_count1` and the value is `79`. As an analyst, you may have thousands or millions of JSON documents to analyze. Pulling out the values and ordering them can be extremely complex. Fortunately, with Trifacta, there are some easy shortcuts that make working with JSON data a snap.

Un-nest Transformation

When you import your JSON file, Trifacta automatically begins to organize your data and provide column headers.

 

Built into every edition of Trifacta is an unnest transformation. The Unnest Transform expands the contents of an array or map and creates new columns in your dataset for each element of the array or map. For maps, unnest takes each key-value pair and creates a new column with the key as the column header and the values as the columns rows. Simply selecting all the data in your column, Trifacta will recognize each name/value pair in the JSON map and split them out into separate columns with the appropriate headings.

More advanced Trifacta users can also write a recipe using the Transform Builder to achieve the same results.

Your Turn

And that’s it! As always, if you have any questions, you can visit the Trifacta Community for help. 

Related Posts

February ‘19 Wrangler Release – Enhanced Standardization

Trifacta’s February ‘19 Wrangler release brings a preview of new data quality features to be released... more

  |  February 7, 2019

Google Launches Public Beta of Cloud Dataprep, Built in Collaboration With Trifacta

Google recently announced that Google Cloud Dataprep—the new managed data wrangling service developed in... more

  |  September 21, 2017

2019: a Year of Innovation with Cloud Dataprep on the Google Cloud

As we get into the new year, it’s a good time to look back in 2019 and reflect on a productive year of... more

  |  January 18, 2020