View from the Summit: It Takes a Village to Raise a Dataset at Eli Lilly

September 24, 2021

As the saying goes, it takes a village—of family members, teachers, neighbors, and a greater community—to raise a child into a healthy, productive member of society.

Similarly, it takes a village to raise a productive dataset, according to Randy Santiano, associate technical consultant at global healthcare leader Eli Lilly. Randy presented his work at Wrangle Summit 2021, Trifacta’s inaugural industry conference. In this third post in the series, I’d like to showcase Randy’s work at Eli Lilly to inspire modern data workers everywhere to make data useful and accessible.

To put Randy’s work in context, it’s helpful to understand the value of reducing cycle time. Previously, integrating datasets for clinical trials took up to six months. The company also faced excessive costs associated with moving to native AWS services and with depending on IT to create shared reference datasets. And as always in high-stakes clinical trials, governance and risk management were critical to the entire project.

Leveraging Trifacta’s Data Engineering Cloud, Eli Lilly overhauled its complex manual processes and enabled business analysts to perform much of the work, which improved collaboration across teams and reduced the manual execution of flows. This approach shrank the cycle time for integrating datasets from six months to one week, saved hundreds of thousands of dollars, and gave Eli Lilly one common platform that improves both collaboration and governance, ultimately reducing risk.

So, how did Randy and the team do this? Let’s dive into his story.

Collaborative Engineering in a Clinical Trial Environment 

Randy was tasked with setting up web-based dashboards to chart progress made in Eli Lilly’s clinical trials. These dashboards present metrics that are viewed and shared among operational teams and executives. The goal is to shorten the development cycles and make it easier and faster for his team to deploy different data products throughout the organization. 

The Trifacta Data Engineering Cloud enables Eli Lilly’s IT team, Trifacta power users, dashboard developers, peer reviewers, and a governance team to share insights and work collaboratively to create datasets that ultimately contribute value to the organization.

Randy identified three steps to creating quality data products:

  • Step 1: Establish the Trifacta infrastructure
  • Step 2: Develop the flow sharing economy
  • Step 3: Commit to a cycle of data product maturity

Establishing the Trifacta Infrastructure 

At the heart of the collaborative engineering environment at Eli Lilly, the Trifacta Data Engineering Cloud is used to flow information from source to target. Information from the Trifacta flow is presented as a dataset to be uploaded into a data visualization tool like Power BI. 

Capitalizing on the ability to create and share workflows in the Trifacta Data Engineering Cloud, Randy established naming conventions to identify where each dataset is in the development cycle. 

(Interested in digging into the details of Randy’s nomenclature methodology and workflow tagging protocol? Watch his full Wrangle Summit 2021 presentation.)

Nomenclature proved to be instrumental here. The prefix and file names allow everyone to trace the origin of each dataset—no need to review a reference table to know what’s connected to what; everything anyone needs to know about the flow is right there in its name. 
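To make the idea concrete, here is a minimal sketch of how a name-based convention could be read programmatically. The `stage_source_dataset` pattern, the `dev`/`val`/`prod` prefixes, and the helper names are all hypothetical illustrations; this post does not spell out the convention Randy's team actually uses.

```python
from dataclasses import dataclass

# Hypothetical stage prefixes (an assumption for illustration only;
# the real Eli Lilly convention is not described in this post).
STAGES = {
    "dev": "in development",
    "val": "in validation",
    "prod": "in production",
}


@dataclass(frozen=True)
class FlowName:
    stage: str    # where the dataset is in the development cycle
    source: str   # the system the data originated from
    dataset: str  # the dataset's own name


def parse_flow_name(name: str) -> FlowName:
    """Split an assumed '<stage>_<source>_<dataset>' name into its parts."""
    parts = name.split("_", 2)
    if len(parts) != 3:
        raise ValueError(f"expected '<stage>_<source>_<dataset>', got {name!r}")
    stage, source, dataset = parts
    stage = stage.lower()
    if stage not in STAGES:
        raise ValueError(f"unknown stage prefix {stage!r} in {name!r}")
    return FlowName(stage, source, dataset)


def describe(name: str) -> str:
    """Summarize origin and status using nothing but the name itself."""
    flow = parse_flow_name(name)
    return f"{flow.dataset} from {flow.source} is {STAGES[flow.stage]}"
```

With a scheme like this, a call such as `describe("VAL_ctms_site_enrollment")` can report a dataset's origin and status without consulting any reference table, which is the property the naming convention is meant to provide.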

Developing the Flow Sharing Economy

When Randy sought to build a flow sharing economy among his team of developers and a wider set of contributors, he relied on the Trifacta Data Engineering Cloud and its self-service capabilities that enable teams to collaborate, share, and leverage the collective wisdom of the organization to create innovative data products. 

Using the naming conventions he established in the Trifacta infrastructure, Randy tagged the names of datasets to create a flow sharing economy from one developer to another. He also created a community of Trifacta power users as a pilot group to review, revise, and optimize workflows and share feedback among the developer team. “It makes your developers a lot more valuable and powerful because they’re able to operate within this community of other people,” Randy explained.

Once the pilot group completes its work, the validation process starts. Validation may involve quality-checking the source or code replication. Then workflow versions are confirmed and locked down and can be shared as output. 

Committing to Data Product Maturity

“Data maturity” refers to the point at which “customers” of a dataset interact with the information and then help it to mature by incorporating business or technical insights. In a typical IT environment, the data maturity cycle goes from creation to validation to production.
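The creation-to-validation-to-production cycle can be sketched as a small state machine. This is only an illustration of the typical cycle described above: the `Stage` and `promote` names are hypothetical, and the single `checks_passed` flag is an assumed stand-in for whatever review a team actually performs.

```python
from enum import Enum


class Stage(Enum):
    """The typical IT data maturity cycle named in the text."""
    CREATION = 1
    VALIDATION = 2
    PRODUCTION = 3


def promote(stage: Stage, checks_passed: bool) -> Stage:
    """Advance a dataset one stage, but only if its checks passed.

    A dataset that fails its checks stays where it is, and a dataset
    already in production has nowhere further to go in this sketch.
    """
    if not checks_passed or stage is Stage.PRODUCTION:
        return stage
    return Stage(stage.value + 1)
```

Keeping promotion to one step at a time mirrors the idea that a dataset matures through repeated review rather than jumping straight to production.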

There’s no hard and fast rule for how data should mature, but Randy outlined the steps from the data source to an endpoint he calls “the marketplace,” a relatively new concept at Eli Lilly. The marketplace is a point at which his team can produce a data product that can be shared across the organization. 

Reaching a level of data product maturity takes time. The output promoted to the marketplace goes through many cycles, but the marketplace is not the final destination. A governance team evaluates datasets to determine which data products should be shared with the rest of the organization.

Once a data product is created, it can be shared easily with dashboard developers, who can share the data product with an audience. The clinical trial datasets are an aggregation of many other datasets or other data sources. Randy’s team uses the Trifacta Data Engineering Cloud to pull Eli Lilly’s clinical trial datasets together in a way that can be replicated easily and can be refreshed automatically.

Now It’s Your Turn

As Randy demonstrated, the Trifacta Data Engineering Cloud democratizes data and expands the number of people and teams who can contribute to the data engineering process. The Trifacta-based collaborative engineering environment at Eli Lilly empowered the organization to turn datasets into valuable data products that can be used to drive advanced insights and analytics.

What would a collaborative engineering environment do for your organization?
How does your organization create and share datasets and workflows?
Who in your organization would be part of your flow sharing economy?
What would data product maturity look like in your organization? 

Please share your thoughts with me at comments@trifacta.com.