Ever experience the excitement of discovering a deep pattern in a dataset? Or been continually amazed at a predictive engine’s ability to find the right answer? The joy of working with data is offset only by the soul-crushing tedium of preparing that data. (Ok, I might be overstating that a bit, but only a bit.)
I joined Trifacta because their vision is the right one — “radically simplify the way people work with data.” One of our core approaches to realizing that vision involves building learning into our product. Learning enables a consistent improvement in the link between ambiguous initial user inputs and fully specified actions. In other words, over time, our software gets better at predicting what a user wants to do from the first few inputs that they provide.
So where are the challenges? In the rest of this blog, I’ll touch on a couple of general solutions that we employ to make our product intuitive and intelligent.
The Trifacta Data Transformation Platform helps users wrangle their data – manipulating its structure and contents for the benefit of downstream analytical tasks. The main challenge we face is the sheer complexity of data wrangling actions that could be, or should be, performed.
To get into the details a bit, a data wrangling step has at least three elements: (a) the type of transformation (e.g., splitting, extracting, replacing, or deriving); (b) the parameters to that transformation (e.g., split on [pattern], or replace [pattern] with [pattern]); and (c) subsets or metadata from the dataset being wrangled (e.g., names of columns or counts of pattern occurrences). The variety of parameter values (be they regular expression patterns, functions, or column names) and the variety of elements from a dataset result in a virtually infinite space of wrangling steps. More importantly, the parameter values and elements from the data are discrete, so the resulting action space, the space in which we need to generate predictions, is full of discontinuities. The learning models we are exploring must incorporate combinatorial search routines that generate candidate predictions, which are subsequently scored and ranked to produce a final set of predictions.
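To make the generate-then-rank idea concrete, here is a minimal sketch in Python. Everything in it is invented for illustration — the transform names, the pattern vocabulary, and the scoring heuristic are assumptions, not Trifacta's actual model — but it shows the shape of the problem: enumerate a discrete cross product of candidates, then score and rank them.

```python
import re
from dataclasses import dataclass

# Hypothetical sketch of the generate-then-rank approach described above.
# Candidate wrangling steps are enumerated combinatorially from a small
# vocabulary of transforms and patterns, then scored against a sample value.

@dataclass
class Candidate:
    transform: str   # e.g., "split" or "extract" (illustrative names)
    pattern: str     # delimiter or regex parameter
    column: str      # target column name

TRANSFORMS = ["split", "extract"]
PATTERNS = [r",", r"\t", r"\d+", r"[A-Za-z]+"]  # toy pattern vocabulary

def generate_candidates(columns):
    """Enumerate the (discrete) cross product: transforms x patterns x columns."""
    return [Candidate(t, p, c)
            for t in TRANSFORMS for p in PATTERNS for c in columns]

def score(candidate, sample_value):
    """Toy score: how often the candidate's pattern occurs in a sample value."""
    return len(re.findall(candidate.pattern, sample_value))

def rank(columns, sample_value, top_k=3):
    """Score every candidate and keep the top_k highest-scoring steps."""
    cands = generate_candidates(columns)
    return sorted(cands, key=lambda c: score(c, sample_value), reverse=True)[:top_k]

top = rank(["raw"], "2015-03-01,alpha,42")
```

Even this toy version shows why pruning matters: the candidate list grows multiplicatively with every new transform, pattern, or column, which is exactly the combinatorial pressure the contextual constraints below are meant to relieve.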
So how do we tackle this complexity problem? We have two primary solutions:
First, we try to prune the combinatorial candidate generation process as much as possible by leveraging the widest set of contextual information. We group our contextual information into two broad classes: historical and circumstantial. One of the key forms of circumstantial context that we are exploring involves the occurrence and position statistics of common data patterns within a dataset. Think of this as a kind of fingerprint, where the occurrences of common patterns correspond to ridges, the gaps between patterns correspond to troughs, and we are particularly interested in places where patterns (ridges) intersect. Situating a user’s input relative to a fingerprint can provide valuable constraints on the kinds of actions they intend to perform. Furthermore, similarities between fingerprints can be leveraged to link previous actions (historical context) to new datasets.
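As a toy illustration of the fingerprint idea (the pattern vocabulary and the representation here are my own assumptions for this post, not Trifacta's implementation), one could record where common patterns occur within each cell of a column:

```python
import re

# Illustrative "fingerprint": for each common pattern, record the (start, end)
# spans where it occurs across a column's cells. Occurrence counts and span
# positions give the ridge/trough structure described above.

COMMON_PATTERNS = {
    "digits": r"\d+",
    "letters": r"[A-Za-z]+",
    "punct": r"[,;:/-]",
}

def fingerprint(values):
    """Map each pattern name to a list of (start, end) spans across all cells."""
    fp = {name: [] for name in COMMON_PATTERNS}
    for value in values:
        for name, pat in COMMON_PATTERNS.items():
            fp[name].extend(m.span() for m in re.finditer(pat, value))
    return fp

fp = fingerprint(["2015-03-01", "2015-04-15"])
```

A user selection can then be situated relative to these spans (does it start at a ridge boundary? span a trough?), and two columns whose fingerprints look alike are candidates for reusing a previously learned action.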
Second, we are exploring interface design directions that reduce the ambiguity of user inputs. For example, suppose the user selects a sequence of characters within a dataset. Do they want the application to generalize their selection to all “similar” character sequences? Do they want the application to find all exact matches of that character sequence? Do they want the application to find a pattern that matches anything except that character sequence? Inferring the semantics of the selection, along with the appropriate fully specified action(s), is a much harder problem than it would be if the user could attach some semantic typing to their selections, removing the need for our application to infer that aspect of their behavior. Of course, the central concern here is assessing the intuitiveness of a typed selection interaction. Should we succeed in designing such an interaction, we will significantly improve the statistical regularity between user inputs and intended actions.
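To see the ambiguity concretely, here is a hedged sketch (the helper names and the generalization heuristic are invented for illustration) of how a single character selection fans out into several candidate interpretations, each a different pattern:

```python
import re

# Illustrative only: one selection, several competing interpretations.

def generalize(selection):
    """Toy heuristic: replace digit runs with \\d+ and letter runs with
    [A-Za-z]+ (punctuation passes through unescaped in this toy version)."""
    return re.sub(
        r"\d+|[A-Za-z]+",
        lambda m: r"\d+" if m.group().isdigit() else "[A-Za-z]+",
        selection,
    )

def interpretations(selection):
    """Return candidate (label, regex) pairs for a user's character selection."""
    exact = re.escape(selection)
    return [
        ("exact", exact),                        # find exact matches
        ("generalized", generalize(selection)),  # find "similar" sequences
        ("negated", f"(?!{exact}).*"),           # match anything except it
    ]

cands = interpretations("2015-03")
```

If the interface let the user type their selection up front ("treat this as a pattern" vs. "treat this as a literal"), the application would only have to rank parameters within one interpretation rather than infer which interpretation was meant.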
In our upcoming talk at the Strata + Hadoop World Conference in San Jose, Jeff Heer, Sesh Mahalingam and I will present some software architectural patterns that support building learning into complex interfaces. See you there!
Tye Rattenbury is the lead Data Scientist at Trifacta. Tye holds a PhD in Computer Science from UC Berkeley. He was a Data Scientist at Facebook and the Director of Data Science Strategy at R/GA. Follow him on Twitter: @TyeDataSci