View from the Summit: Data vs. Delta (and Other Infectious Diseases)

September 8, 2021

Here in the United States, it looks like the Delta variant of COVID-19 may drive us back into our pandemic pods instead of back to school. But I have reasons to be optimistic.

Fighting infectious diseases is, of course, a health problem, but the key to enabling every stakeholder – from the scientists to the public health officials to school superintendents to parents – is to get them the data they need to understand what is really happening and to make critical decisions. I’m confident that data—and the technologies that radically improve the productivity of people who work with data—will help researchers and health organizations worldwide to collaborate, experiment, iterate, and quickly find ways to prevent, treat, and cure infectious diseases like COVID-19. 

I’d like to share three examples of health organizations that are using the Trifacta Data Engineering Cloud to curate and share data at unprecedented speed and scale, develop better prevention strategies, and advance scientific discoveries. 

CDC Puts Data Quality on the Map

The U.S. Centers for Disease Control and Prevention (CDC) uses fine-grained maps of transmission dynamics to help control outbreaks of infectious diseases. But to understand who’s connected to whom and see how different outbreaks and transmission chains spread across a certain locale, computational biologists at the CDC needed to turn seemingly endless rows of names, addresses, dates, and other data points into sleek, visual maps. 

For example, is the address 2400 N DRUID HILLS RD NE the same place as DRUID HILLS TARGET? Is IGNACIO RODRIGUEZ the same person as NACHO RODRIGUES? Is 2020/03/04 the date for March 4, 2020 or April 3, 2020? While these kinds of data quality issues seem small and niggling, they have enormous consequences when it comes to tracking COVID-19 cases, particularly in bidirectional contact tracing. 
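To make these ambiguities concrete, here is a minimal Python sketch using only the standard library. It is not how Trifacta or the CDC resolves these questions; the function names and the two assumed date formats are illustrative choices, and simple string similarity is only a stand-in for real record matching.

```python
from datetime import datetime
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_dates(raw: str) -> set:
    """Return every calendar date the string could mean under the assumed formats."""
    # Assumption: the source might use year/month/day or year/day/month.
    formats = ("%Y/%m/%d", "%Y/%d/%m")
    dates = set()
    for fmt in formats:
        try:
            dates.add(datetime.strptime(raw, fmt).date())
        except ValueError:
            pass  # the string is not a valid date under this format
    return dates


# Spelling variants of a surname score high on plain string similarity.
surname_score = name_similarity("RODRIGUEZ", "RODRIGUES")

# A nickname (NACHO for IGNACIO) is only partly captured by string similarity;
# a real pipeline would likely need a nickname lookup as well.
full_name_score = name_similarity("IGNACIO RODRIGUEZ", "NACHO RODRIGUES")

# "2020/03/04" is valid under both formats, so it is ambiguous.
ambiguous = candidate_dates("2020/03/04")

# "2020/03/25" is valid under only one format, so it is not.
unambiguous = candidate_dates("2020/03/25")
```

A record whose date yields more than one candidate, or whose name score falls in a gray zone, is the kind of case that would be flagged for review rather than silently accepted.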

The CDC relies on the Trifacta Data Engineering Cloud to automate this tedious, time-consuming, labor-intensive work. Trifacta’s adaptive data quality techniques help interpret and assess the reliability of data and provide intelligent suggestions to correct anomalies, so that the profiled data is clean, accurate, and of high quality. These continuous data quality checks ensure that only trusted data is consumed by downstream applications, preventing faulty data from compromising the CDC’s mapping outputs. 

IDDO Discovers Data Sharing (and Curation and Harmonization) Is Caring

At the University of Oxford, the Infectious Diseases Data Observatory (IDDO) data platform hosts one of the largest international collections of clinical data related to COVID-19. Thousands of independent hospitals and health institutes worldwide have shared their individual patient data, treatment data, symptom data, and microbiology data. The data arrives as everything from simple spreadsheets to exports from sophisticated statistical packages, and all of it must be standardized to be useful. 

In parallel, IDDO is also working to analyze and share information on all of the COVID-19 clinical trials registered across the international study registries. The lack of standardized data capture and nomenclature from country to country makes understanding and using these data very difficult. At one point, IDDO identified more than 35,000 distinct terms for drugs.

Precious COVID-19 clinical trial data, desperately needed at a time of global pandemic and great suffering, was a mess before curation.

Image credit: https://wellcomeopenresearch.org/articles/5-116

To untangle this massive, messy hairball of data, IDDO relies on the Trifacta Data Engineering Cloud to do the data dirty work—curating the data, standardizing it according to common codes and practices, and harmonizing it to create clean, usable datasets that researchers worldwide can see and share. This is what this data looks like after curation:

Image credit: https://wellcomeopenresearch.org/articles/5-116
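As a rough illustration of what harmonizing drug terms can involve, the Python sketch below normalizes free-text terms and maps known variants to one preferred name. The synonym table and the sample terms are hypothetical and are not taken from IDDO's data or from Trifacta; real curation relies on established coding dictionaries and far larger mappings.

```python
import re

# Hypothetical synonym table: normalized variant -> preferred term.
SYNONYMS = {
    "hcq": "hydroxychloroquine",
    "hydroxy chloroquine": "hydroxychloroquine",
    "hydroxychloroquine sulfate": "hydroxychloroquine",
}


def normalize(term: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", term.lower())
    return cleaned.strip()


def harmonize(term: str) -> str:
    """Map a raw term to its preferred name, or return the normalized term."""
    key = normalize(term)
    return SYNONYMS.get(key, key)


# Hypothetical raw entries as they might appear across different registries.
raw_terms = [
    "Hydroxychloroquine",
    "HCQ",
    "hydroxy-chloroquine",
    "Hydroxychloroquine  Sulfate",
]

# Four distinct raw spellings collapse to a single harmonized term.
distinct_after = {harmonize(t) for t in raw_terms}
```

Terms that are not in the table pass through in normalized form, which keeps unknown drugs visible for a curator instead of forcing them into the wrong bucket.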

The result? COVID-19 researchers now know who’s collecting what type of data and where. They can see which clinical trials are already underway so they don’t waste time on redundancies. And they get a massive head start on their work with clean, standardized, readily available datasets.

Genomics England Builds Data Pipelines to New Genetic Discoveries

Genomics England partnered with the GenoMICC Consortium, led by the University of Edinburgh, to create a new research environment for understanding the role of genetic risk factors in patient responses to COVID-19. 

Tens of thousands of patients diagnosed with asymptomatic, mild, severe, and long-haul cases of COVID-19 were recruited to contribute to Genomics England’s data ecosystem. Genomics England decided the best and fastest way to stand up a new research environment was to build it in the cloud. 

Genomic data has to be supplemented with clinical data from the National Health Service (NHS), Public Health England, and other providers to provide context for research. Some of these datasets are mind-bogglingly huge. A single 10-gigabyte dataset from the NHS on hospital statistics was 138 fields wide and more than 3 million rows long! 

Data on this massive scale streams into the Trifacta Data Engineering Cloud where it’s stored, de-identified, cleaned, profiled, standardized, and automatically pipelined out to researchers. 

Genomics England’s ability to harness the power of data may lead to the discovery of genetic risk factors and new treatments for COVID-19 patients. This kind of breakthrough work, at breakneck speed, wouldn’t have been possible even a few years ago. 

These three organizations presented their incredible work at Wrangle Summit 2021, our inaugural industry conference held in April 2021. You can view the full presentations of their important work on our website here.

Now It’s Your Turn

Do you have a story about data’s role in helping health researchers to collaborate, experiment, iterate, and quickly find ways to prevent, treat, and cure infectious diseases like COVID-19? 

Please share your thoughts with me at msarbiewski@trifacta.com.