At Trifacta, we’re committed to clean data. That’s why we created the Clean Data Manifesto, which is focused on this commitment and anchored by our five tenets of proper data preparation. I’ve walked through the first two tenets in my last two blog posts—the first, focused on prioritizing and setting targets, and the second, about diving into the data to make sure that you’re able to identify issues early and often.
As is the case with so much critical work, data preparation is a team effort that requires the seamless orchestration of a lot of moving parts in order to curate data quality. That’s why our third tenet is collaborative curation.
Clean Data Tenet #3: Collaborative Curation in Data Preparation
Collaboration strengthens your data preparation by bringing a broader collective context to the effort, and we see this happening in organizations in a number of ways. From business teams partnering with their IT organization, to team members collaborating among themselves, to analysts leveraging external resources, and even to teams leaning on AI to guide decision-making, we see collaboration as a critical means to optimize data preparation.
Collaboration between business teams and IT is foundational; an equitable partnership between these two functions is the root of successful data preparation. Whereas IT teams were once burdened with maintaining data quality across the entire organization, from ingestion through delivering requirements to the business, many organizations are shifting responsibility for data quality toward business users. For one, this is a more efficient approach: instead of a small task force chasing down data quality issues, there are more eyes on the data. It also leads to better curation for the end analysis. IT will still curate the most valuable data and make sure it is sanctioned and reused, which ensures a single version of truth and increases efficiency. But, with business context and ownership over the finishing steps in cleansing and data preparation, business users can ultimately decide what’s acceptable, what needs refining, and when to move on to analysis.
Across a team, collaboration looks a lot like sharing best practices to streamline operations. Who’s doing this work the most efficiently? What have they learned about a particular data set that can help inform data preparation of similar sets? Ideally, analysts should be able to easily share their work in order to improve the performance of the entire team. Data preparation tools can also make approvals more efficient: letting managers sign off on work without requiring analysts to walk them through it step by step can save hours of headaches.
There is huge potential for knowledge sharing within an organization, but sometimes that isn’t enough. Certain use cases require teams to think beyond internal experts and leverage external resources, and here data preparation tools are invaluable. Are there external experts or organizations who have better context for this data? What third-party data sets could you leverage to supplement existing findings? Incorporating these data sets early on can potentially mean the difference between acceptable findings and unforeseen insights.
Finally, in some cases, modern teams are leaning on AI-driven technology to guide them toward the right data transformations. In this back-and-forth collaboration, AI-generated suggestions and the user’s actions each inform the other: the suggestions become increasingly accurate, and the user is able to make selections faster. As opposed to manual processes that depend entirely on the user to build each data transformation from scratch, this type of collaboration is key to accelerating data preparation and landing on an output with more assurance. Analysts can leverage AI to automate some of the more complex parts of the data cleansing process, but should complement these efforts with workflows that encourage openness, iteration, and crowdsourcing to tap into the collective intelligence of the organization.
A New Approach to Data Preparation
In order to build a collaborative environment, it’s important to look for alternatives to tools that limit transparency, such as scripting languages or common spreadsheet tools. Modern data preparation platforms like Trifacta are built around a visual interface, which allows anyone in the organization to speak the same language, and they maintain clear data lineage to improve transparency and encourage feedback about how particular data sets have been transformed.
The flip side of this, limiting data preparation to one person, can present real challenges. For example, one financial services firm used to employ one person as the sole owner of its critical regulatory reporting function. He had built a complex web of Excel spreadsheets that only he understood how to use, which meant the organization risked losing valuable processes if he were to leave. The company even ended up taking out an insurance policy on his data preparation job. Seriously. Now that the company uses Trifacta, more of the team can not only take on the work of data preparation but also collaborate on reporting initiatives, leveraging sharing capabilities and data lineage tracking to keep a meticulous record of their regulatory reporting efforts. (Learn more about how financial firms use us for compliance here.)
The aforementioned company is a large one—over 200 analysts use Trifacta—but this tenet isn’t limited to large teams. Consider small ways to involve other data sets or to leverage the collective intelligence of other analysts during data preparation; we’ve built a site devoted to our Wrangler community for this very purpose. At any scale, collaboration can strengthen the decisions you make and improve the result of your end analysis.