With stringent regulations and heightened security concerns, today’s organizations must keep a close eye on their data: where it’s going, where it’s been, and exactly how it was transformed. This ability to trace the footprints of an organization’s data usage, called “data lineage,” is key to effective data management. It allows organizations to ask questions such as: How is certain privacy-sensitive data being used? Where do errors or outliers arise, and how do they propagate forward? Where are inefficient or unnecessary processing steps being taken? Which data sets and table columns are driving key performance indicators?
In the financial services industry, data lineage is closely tied to use cases around regulatory reporting and compliance, an initiative that often demands its own team and budget. According to the Financial Times, some banks, especially those designated as Global Systemically Important Banks (G-SIBs), have added as much as $4B in expenses related to compliance technology and staff. Being able to efficiently and effectively produce records that demonstrate compliance is increasingly important for financial services firms of all sizes.
The Challenge of Tracking Data Lineage with Legacy Tools
Though existing tools offer some form of data lineage, each has limitations. Spreadsheet programs such as Excel do offer cell-level lineage, or the ability to see which cells depend on one another, but the structure of the transformation is lost. Similarly, ETL or mapping software provides transform-level lineage, yet this view typically doesn’t display the data itself and is too coarse-grained to distinguish transforms that are logically independent (e.g., transforms that operate on distinct columns) from those that are dependent.
The most common approach we’ve seen customers take to tracking lineage is a spreadsheet that holds one record for each column in a table. These columns are usually categorized into at least two classes: “source” columns, which come directly from the source data, and “derived” columns, which are computed from source columns or other derived columns. Each record typically carries metadata describing how the column should be interpreted and, for derived columns, how it was computed. With this layout, however, tracking dependencies across complex workflows can be tedious.
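To make the spreadsheet layout concrete, here is a minimal Python sketch of such a catalog, with one record per column. The column names, the field names (`kind`, `derived_from`, `note`), and the helper `source_columns` are all hypothetical and chosen only for illustration. The sketch shows why the layout gets tedious: answering "which source columns feed this derived column?" means following references from record to record.

```python
# Hypothetical column catalog mirroring the spreadsheet layout:
# one record per column, tagged as "source" or "derived".
catalog = [
    {"column": "balance", "kind": "source", "derived_from": [],
     "note": "account balance in USD"},
    {"column": "rate", "kind": "source", "derived_from": [],
     "note": "annual interest rate"},
    {"column": "region", "kind": "source", "derived_from": [],
     "note": "customer region code"},
    {"column": "interest", "kind": "derived",
     "derived_from": ["balance", "rate"], "note": "balance * rate"},
    {"column": "risk_score", "kind": "derived",
     "derived_from": ["interest", "region"],
     "note": "scored from interest and region"},
]


def source_columns(catalog, column):
    """Return every source column that `column` ultimately depends on."""
    by_name = {row["column"]: row for row in catalog}
    found, seen, stack = set(), set(), [column]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited; avoids repeated work
        seen.add(name)
        row = by_name[name]
        if row["kind"] == "source":
            found.add(name)
        # Derived columns may depend on other derived columns,
        # so keep walking until only source columns remain.
        stack.extend(row["derived_from"])
    return found


print(sorted(source_columns(catalog, "risk_score")))
# prints ['balance', 'rate', 'region']
```

In a real spreadsheet this walk is done by hand, one lookup per dependency, which is where the tedium comes from as workflows grow.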
Introducing Column Lineage: Building Upon Trifacta’s Advanced Data Lineage Tracking
Trifacta’s column lineage feature provides in-depth data lineage visibility for various compliance and regulatory requirements. While Trifacta already provides visibility into the lineage of data created from the wrangling process, we knew that our customers, especially those in financial services, would value additional ways to understand how data was transformed. To support these use cases, we have been developing interactive visualizations for column-level data lineage.
The column lineage feature in Trifacta gives users the ability to trace exactly how a given column or set of columns was created, as well as to trace forward and see the downstream dependencies of source or intermediate attributes. In addition, users can filter the lineage interface to focus only on specific types of dependencies or on transformation steps that affect columns in particular ways.
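The general idea of tracing backward, tracing forward, and filtering by step type can be sketched in a few lines of Python. This is an illustrative model only, not Trifacta’s implementation or API: the `Step` record, the step kinds, the column names, and the `trace` function are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical model: each transformation step reads some input
# columns and writes one output column.
Step = namedtuple("Step", ["kind", "inputs", "output"])

steps = [
    Step("merge", ("first_name", "last_name"), "full_name"),
    Step("derive", ("price", "quantity"), "revenue"),
    Step("derive", ("revenue", "cost"), "margin"),
    Step("aggregate", ("margin",), "avg_margin"),
]


def trace(steps, column, direction="back", kinds=None):
    """Collect columns related to `column`.

    direction="back" follows outputs to their inputs (how was it made?);
    direction="forward" follows inputs to outputs (what depends on it?).
    `kinds`, if given, restricts the walk to those step kinds.
    """
    related, frontier = set(), [column]
    while frontier:
        current = frontier.pop()
        for step in steps:
            if kinds is not None and step.kind not in kinds:
                continue  # filtered out by step type
            if direction == "back" and step.output == current:
                neighbors = step.inputs
            elif direction == "forward" and current in step.inputs:
                neighbors = (step.output,)
            else:
                continue
            for name in neighbors:
                if name not in related:
                    related.add(name)
                    frontier.append(name)
    return related


print(sorted(trace(steps, "margin")))
# prints ['cost', 'price', 'quantity', 'revenue']
print(sorted(trace(steps, "price", direction="forward")))
# prints ['avg_margin', 'margin', 'revenue']
print(sorted(trace(steps, "avg_margin", kinds={"aggregate"})))
# prints ['margin']
```

Because `full_name` shares no columns with `margin`, it never appears in either trace, which is the distinction between logically independent and dependent transforms that a transform-level view cannot make.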
Impact on Financial Services Organizations
With Trifacta’s column lineage feature, financial services companies can understand exactly how derived data or features used in reports and analytic projects were created. This will have a significant impact on the way our financial services customers execute compliance and regulatory reporting, as well as customer 360° initiatives. As we roll out this feature, we’re excited to see what new insights our customers discover, and how the overall wrangling process becomes more efficient and effective.