Jeff Everham, Informationalist at Knowledgent, blogs about the company’s recent TeKathon (Knowledgent’s version of an analytics hackathon) using Trifacta for Health and Life Sciences analytics in this guest post. The analytics firm had two teams, one Health and Life Sciences and one Financial, learn Trifacta software and then compete in a data wrangling contest. Working under real-world constraints of time and resources, each team was then assessed on innovation, execution, business model, and presentation.
As stated in the previous TeKathon II Overview, the Healthcare and Life Sciences team used Trifacta to prepare data for a Pharmacovigilance data science project. Pharmacovigilance is the detection, assessment, understanding, and prevention of drug-related adverse events, where adverse events are unfavorable and unintended signs, symptoms, or diseases associated with the use of a medicinal product. Pharmacovigilance is increasingly important in understanding and managing drug safety and has the following goals:
- Continuous monitoring of medicines in clinical practice to identify new hazards or changes in safety profile
- Assessing the risks and benefits of medicines, determining actions to better protect patients
- Providing awareness of new findings such as increased dangers related to comorbid conditions or drug interactions
These goals perfectly align with Knowledgent’s mission:
- Protecting people from the harm and dangers of prescription medicine
- Improving lives with better care from awareness of risks when choosing therapies
- Reducing cost and liability from patients experiencing adverse events
- Building trust and reputation of drug manufacturers through better drug safety
To look at Pharmacovigilance and adverse events in Trifacta, we chose to look at FAERS (FDA Adverse Events Reporting System) data. With over 7 million reports since 1997, FAERS is the world’s largest database of adverse events. This database provides real-world evidence of drug safety issues, but in its raw form it is messy and difficult to use.
There are seven normalized files released each quarter, all related by one common key field: ISR (Individual Safety Report) prior to Q4 2012 and Primary ID from Q4 2012 to present. The demographics file is the primary file, while files with data on drugs, reactions, therapy, outcomes, indication, and reporting source provide additional information about the adverse event.
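The join structure described above can be sketched in pandas. This is a minimal illustration using toy stand-in frames, not the real FAERS extracts; the column names (primaryid, drugname) follow FAERS naming conventions but the data is made up:

```python
import pandas as pd

# Toy stand-ins for two of the seven FAERS files (illustrative data only).
# The demographics file is the primary table; related files share the key.
demo = pd.DataFrame({
    "primaryid": [1001, 1002, 1003],
    "age": [54, 67, 31],
    "sex": ["F", "M", "F"],
})
drug = pd.DataFrame({
    "primaryid": [1001, 1001, 1003],
    "drugname": ["DRUG A", "DRUG B", "DRUG C"],
})

# Left-join so every demographics report is kept, even when a related
# file has no matching rows (report 1002 has no drug rows here).
events = demo.merge(drug, on="primaryid", how="left")
```

Note that for data prior to Q4 2012 the join key would be ISR rather than Primary ID, so a real pipeline must handle both keys.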
We faced many obstacles while using this data:
- The files for each quarter needed to be appended together
- A large number of fields were missing
- The units were inconsistent
- Newer information replaced earlier reports that were still being reported
- The seven tables needed to be joined together
- There were periodic format changes
- There were many duplicate entries
- There were many drug name variations
Clearly, the data was a disaster.
However, a lot of these issues were squashed using Trifacta. Advanced data wrangling greatly speeds data preparation and paves the way to achieve Pharmacovigilance for everyone. We were able to turn messy data into clean data by profiling, transforming, standardizing, cleansing, de-duplicating, and enriching the data through the steps below.
- Combine datasets: Here, we appended quarterly files (e.g., 2007Q1 through 2012Q3).
- Remove columns: We removed unnecessary columns.
- Split columns: We created a new column from an existing column by extracting part of the value.
- Deduplication: We eliminated duplicate records across quarters.
- Profile data: We profiled the data to identify invalid values, missing values, data types, and outliers.
- Normalize units: We converted inconsistent units to a common unit (e.g., the original data file had age reported in years, months, days, and even hours).
- Join datasets: We joined the demographic file with the other files to enable more thorough evaluation of things like reactions, outcomes, and comorbidity.
- Clean up fields: We cleaned up poorly formatted fields.
- Enrich data: To make the FAERS data more usable, we enriched it by mapping active ingredients to drug names, allowing analysis across drugs with different names (such as generics) but with the same active ingredient.
- Export wrangled dataset: We exported the dataset and then imported it into Tableau and QlikView to visualize the data for exploration and analysis.
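Several of the steps above (combine, deduplicate, normalize units, export) can be sketched in pandas. This is a minimal sketch with made-up quarterly frames; the unit codes (YR, MON, HR) and conversion factors are assumptions for illustration, not Trifacta's actual transformations:

```python
import pandas as pd

# Toy quarterly demographics extracts (illustrative columns, not the
# exact FAERS layout).
q1 = pd.DataFrame({
    "primaryid": [1, 2],
    "age": [54.0, 6.0],
    "age_cod": ["YR", "MON"],   # age units: years, months
})
q2 = pd.DataFrame({
    "primaryid": [2, 3],
    "age": [6.0, 720.0],
    "age_cod": ["MON", "HR"],   # repeat of report 2; age in hours for 3
})

# Combine datasets: append the quarterly files.
combined = pd.concat([q1, q2], ignore_index=True)

# Deduplicate: drop repeated reports across quarters, keeping the latest.
deduped = combined.drop_duplicates(subset="primaryid", keep="last")

# Normalize units: convert every age to years (assumed conversion table).
to_years = {"YR": 1.0, "MON": 1 / 12, "DY": 1 / 365, "HR": 1 / (365 * 24)}
deduped = deduped.assign(
    age_years=deduped["age"] * deduped["age_cod"].map(to_years)
)

# Export wrangled dataset for a BI tool such as Tableau or QlikView.
deduped.to_csv("faers_wrangled.csv", index=False)
```

The deduplication here keys on the report ID alone; a real pipeline would also reconcile newer reports that replace earlier ones, as described above.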
The primary goal of the TeKathon was to focus on real-life use cases. Our primary use case was to improve lives through a patient-based early warning system. This would protect patients from potentially harmful treatments, reduce the cost of complications for payers, help providers deliver better care, and reduce product liability for pharmaceutical companies. Our second use case was to improve data quality and value, which was accomplished through rapid analysis of outliers, correction of anomalies, and enrichment of the data with active ingredient information. For our third use case, we proposed that the data be used for research and continuing medical education by identifying patient cohorts who are at higher risk of adverse events. This informs patient and physician decisions to modify treatment, providing better care with lower risk and reducing the cost of complications for healthcare providers and health insurance companies.
The greatest achievement was accomplishing all of this in only a few weeks. We achieved analysis-ready data quickly and cost-efficiently, which is impressive given that data preparation typically consumes 80% of an analysis effort, and we demonstrated three use cases without the manual, grueling data munging and cleanup typically associated with messy datasets. Trifacta’s sophisticated visual interface and its predictive qualities made this possible. Trifacta is one of several advanced data wrangling tools Knowledgent has seen in the market that hold tremendous promise to speed time to insight as the world’s data continues to grow and companies seek faster, more cost-effective insights with less labor from their mountains of data. As we collect more data faster than ever before, the opportunities for greater insights abound…but the data won’t get cleaner on its own. Sophisticated data wrangling tools will pave the way for data scientists and business users alike to rapidly and confidently prepare data for the insights of the future.