Discovering exactly what is in your data, and how it might serve different analytic explorations, is key to quickly identifying the value or potential use of a dataset. This exploration gives you an understanding of the data's distinguishing characteristics, such as value distributions and outliers, which in turn informs the transformation and analysis process.
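As a rough illustration of what profiling a single column can involve, the sketch below counts missing values, lists the most frequent values and flags outliers with the common 1.5 x IQR rule. It is plain Python, not Trifacta functionality; the function name `profile_column` and the sample `order_totals` data are assumptions made for this example.

```python
from collections import Counter
from statistics import quantiles


def profile_column(values):
    """Summarize one column: missing count, frequent values, IQR outliers."""
    present = [v for v in values if v is not None]
    summary = {
        "missing": len(values) - len(present),
        "top_values": Counter(present).most_common(3),
        "outliers": [],
    }
    numeric = [v for v in present if isinstance(v, (int, float))]
    if len(numeric) >= 4:
        # Quartiles of the numeric values; the 1.5 * IQR fence is a
        # conventional rule of thumb, not the only way to define outliers.
        q1, _, q3 = quantiles(numeric, n=4)
        spread = q3 - q1
        low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
        summary["outliers"] = [v for v in numeric if v < low or v > high]
    return summary


# Hypothetical sample data: one missing value and one suspicious total.
order_totals = [20, 22, 19, 21, 23, 20, None, 950]
report = profile_column(order_totals)
```

A summary like this is usually enough to decide which cleaning and structuring steps a column needs before any analysis begins.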
What is Data Wrangling?
Successful analysis relies on accurate, well-structured data that has been formatted for the specific needs of the task at hand. Yet today’s data is bigger and more complex than ever before, and wrangling it into a format suitable for analysis is time-consuming and technically challenging. Data wrangling is the process of turning raw source data into prepared outputs that can be used for analysis and other business purposes.
What is Trifacta?
At Trifacta, we’re focused on providing software that helps individuals and organizations more efficiently explore, transform and join together diverse data for analysis. Whether you’re working with files on your desktop, disparate data in the cloud or within large-scale data lake environments, Trifacta will accelerate the process of getting data ready to use.
Visually explore, transform, clean and join together diverse desktop files. Completely free.
Intelligent recommendations for cleaning and formatting data
Best-of-breed hybrid desktop application
Utilize advanced self-service data preparation for teams and departments.
Leverage a shared platform for data preparation
Automate wrangling operations that span diverse sources
Deploy an enterprise data wrangling platform for large-scale analytics initiatives.
Empower analyst teams while maintaining governance
Diverse deployment & processing support to scale to any workload
The Data Wrangling Process in Trifacta
Structuring is needed because data comes in all shapes and sizes. Data lacking human-readable structure is difficult to work with using traditional applications. Even well-structured datasets often lack the proper formatting or the appropriate level of aggregation required for the analysis at hand.
Cleaning involves taking out data that might distort the analysis. A null value, for example, might bring an analytic package to a screeching halt; it may need to be replaced with a zero or an empty string. Particular fields may need to be standardized: the many different ways a state might be written out (such as CA, Cal and Calif) can be replaced with a single standard format.
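The two cleaning steps described above can be sketched in a few lines of plain Python. This is an illustrative example only, not Trifacta's implementation; the `STATE_ALIASES` table, the `clean_record` function and the sample rows are all assumptions.

```python
# Hypothetical lookup from variant spellings to one standard code.
STATE_ALIASES = {"ca": "CA", "cal": "CA", "calif": "CA", "california": "CA"}


def clean_record(record, defaults):
    """Return a copy with nulls replaced by defaults and the state standardized."""
    cleaned = dict(record)
    for field, default in defaults.items():
        if cleaned.get(field) is None:
            cleaned[field] = default
    state = cleaned.get("state")
    if isinstance(state, str):
        # Normalize whitespace, a trailing period and letter case before lookup.
        key = state.strip().rstrip(".").lower()
        # Unknown spellings are kept (trimmed) rather than guessed at.
        cleaned["state"] = STATE_ALIASES.get(key, state.strip())
    return cleaned


# Hypothetical sample rows with a null amount and inconsistent state values.
rows = [
    {"customer": "A", "state": "Calif.", "amount": None},
    {"customer": "B", "state": " cal ", "amount": 12.5},
    {"customer": "C", "state": "NV", "amount": 3.0},
]
cleaned_rows = [clean_record(r, {"amount": 0, "state": ""}) for r in rows]
```

Whether a null should become a zero, an empty string or stay null depends on the downstream tool, which is why the defaults are passed in rather than hard-coded.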
Enriching allows you to augment the scope of your analysis by incorporating disparate internal or third-party data. This step includes executing common preparation tasks such as joins, unions or complex derivations. Purchase transaction data, for example, might benefit from data associated with each customer's profile or historical purchase patterns.
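The transaction example above amounts to a left join: every transaction is kept, and profile fields are attached where a matching customer exists. The following is a minimal sketch in plain Python under that assumption; `left_join`, the `customer_id` key and the sample records are hypothetical names chosen for illustration.

```python
def left_join(transactions, profiles, key):
    """Attach profile fields to each transaction; unmatched rows are kept."""
    by_key = {p[key]: p for p in profiles}
    enriched = []
    for tx in transactions:
        profile = by_key.get(tx[key], {})
        # Transaction fields win if both sides share a column name.
        enriched.append({**profile, **tx})
    return enriched


# Hypothetical sample data: customer 9 has no profile record.
transactions = [
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
    {"customer_id": 9, "amount": 7.5},
]
profiles = [
    {"customer_id": 1, "segment": "loyal"},
    {"customer_id": 2, "segment": "new"},
]
enriched = left_join(transactions, profiles, "customer_id")
```

Keeping unmatched transactions, rather than dropping them as an inner join would, makes gaps in the profile data visible instead of silently shrinking the dataset.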
Validating is the activity that surfaces data quality and consistency issues, or verifies that they have been properly addressed by applied transformations. Validations should be conducted along multiple dimensions. At a minimum, they should assess whether the values of an attribute or field adhere to syntactic constraints (for example, a required format) as well as distributional constraints (for example, an expected range or spread of values).
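To make the two kinds of constraint concrete, the sketch below checks one syntactic rule (values must match a pattern) and one distributional rule (a minimum share of values must fall in a plausible range). It is a plain Python illustration, not Trifacta's validation logic; the ZIP code pattern, the age range of 0 to 120 and the 95% threshold are all assumptions.

```python
import re

# Assumed format: five digits with an optional four-digit extension.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def syntactic_violations(values, pattern):
    """Return the values that do not fully match the expected pattern."""
    return [v for v in values if not (isinstance(v, str) and pattern.fullmatch(v))]


def share_in_range(values, low, high):
    """Return the fraction of values inside the closed range [low, high]."""
    if not values:
        return 0.0
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


# Hypothetical sample columns.
zips = ["94105", "9410", "10001-0001", None]
bad_zips = syntactic_violations(zips, ZIP_PATTERN)

ages = [34, 29, 41, 230, 38]
age_ok = share_in_range(ages, 0, 120)
distribution_passes = age_ok >= 0.95  # assumed acceptance threshold
```

Running the same checks before and after a transformation is one simple way to confirm that the transformation actually fixed the issue it targeted.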
Publishing refers to planning for and delivering the output of your data wrangling efforts for downstream project needs (like loading the data in a particular analysis package) or for future project needs (like documenting and archiving transformation logic). Downstream analytic tools often perform dramatically better when they receive data structured in the way they expect.
How It Works
Trifacta sits between data storage and processing environments and the visualization, statistical or machine learning tools used downstream in the analysis process. Our solution is designed to help data analysts do the work associated with data preparation without having to manually write code or use complex mapping-based systems.
With Trifacta, users can interactively explore the content of their data and, through a process called Predictive Transformation, define a recipe for how the dataset should be transformed. This logic defines how the data is processed, whether on your desktop, on a server, in a cloud environment or on Hadoop. Before executing the transformation, the user defines the desired location and format for the clean, well-structured output dataset used in analysis.
Optimized for Today's Leading Cloud Platforms
Trifacta offers extensive support for, and integration with, leading cloud platforms. As a growing number of computing workloads transition to cloud-based environments, our ability to support a variety of cloud deployments and integrate with a growing number of cloud services is critical to the data wrangling needs of modern organizations.
Who is it for?
Trifacta has brought an entirely new level of productivity to the way our analyst and IT teams work together to explore diverse data and define analytic requirements.