If you are trying to manage data operations in your team, the hub and spoke model may be the right way to go. First introduced as a transportation distribution paradigm in the 1950s, its principles can easily be applied to the way data teams are distributed and managed.
Like many companies, we at Trifacta started with a data team. Tasked with a wide range of responsibilities, we brainstormed ways to keep the data team relatively flexible and nimble while still enabling the entire organization to make data-driven decisions. We became the hub. In this model, we redefined what it means to be a centralized data team – instead of providing insights, we would provide accessible, easily digestible data. The hub’s responsibilities revolve around guaranteeing a sturdy infrastructure, an uptime SLA, and monitored access to critical data.
So, how do other teams make use of this data? They’re the spokes. In this model, each data-consuming team has a data analyst, or someone acting in a data analyst role, who helps that team leverage data to drive decision making.
More than just a team, the hub also refers to our data hub, or data warehouse. At Trifacta we’ve given preference to the Google data stack, making BigQuery our data hub containing anything from analytics to financial data.
As the hub team, our deliverable is a set of documented, digested, ready-to-use tables and views that anyone with a question can leverage to get an answer.
The hub team
The centralized, cross-functional data team comprises four functions split across two sub-teams: infrastructure and operations.
The infrastructure sub-team is responsible for access, availability, security and monitoring of the data throughout its lifecycle. In infrastructure we have two functions:
- Infrastructure Engineer – responsible for enabling secure access to data stores and implementing secure data replication workflows.
- Data Engineer – responsible for the data orchestration pipelines from the original source to the analytics storage.
The operations sub-team, on the other hand, ensures that the data is coherent and can be consumed by the spoke teams. There are also two main functions in this sub-team:
- Data Analyst – responsible for preparing data for downstream consumption by the spoke teams.
- Product Manager – responsible for setting priorities and keeping the work aligned with critical business objectives.
The modern data stack hub
Our data infrastructure is designed around centralizing all data operations and granting secure access to the spokes.
The majority of the data is orchestrated from its source to Google Cloud using Apache Airflow. In a nutshell, Airflow orchestrates the accessing, querying, copying and pre-processing of the data, which is later dumped in Google Cloud Storage. There are two main types of alerting along this pipeline – data availability and data quality alerts.
Data availability alerts check that the data can be accessed and notify Slack if the data suddenly becomes unavailable or a query times out. Data quality alerts check several data quality rules, including empty files and columns whose rows all hold the same value.
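The two data quality rules named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production check: the function name `check_quality` and the sample columns are hypothetical, the extract is assumed to be a CSV with a header row, and the Slack notification itself is left out.

```python
import csv
import io


def check_quality(csv_text: str) -> list[str]:
    """Return human-readable rule violations for one CSV extract (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        # Rule 1: the file has no data rows (empty or header-only).
        return ["empty file"]

    problems = []
    for column in rows[0]:
        distinct = {row[column] for row in rows}
        # Rule 2: every row carries the same value in this column.
        # A single-row file is skipped, since one row is trivially constant.
        if len(rows) > 1 and len(distinct) == 1:
            problems.append(f"column '{column}' has a single constant value")
    return problems


# Hypothetical sample extracts.
good = "user_id,jobs\n1,4\n2,7\n"
bad = "user_id,jobs\n1,0\n2,0\n"

print(check_quality(good))  # []
print(check_quality(bad))   # ["column 'jobs' has a single constant value"]
print(check_quality(""))    # ['empty file']
```

In a real pipeline, a non-empty result from a check like this would be what triggers the alert.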
The minimally pre-processed raw data is loaded into Google Cloud Storage, which acts as a data lake. At this point, the data is still structured in the same way as our application metadata, and is not ready yet for consumption.
For the data to be ready for consumption, it needs to be prepared. We leverage Cloud Dataprep, a Trifacta product developed and supported jointly with Google Cloud. This guarantees that any transformations to prepare the data are not only easily readable, but also that collaboration and iteration are possible without any need for deep technical skills like Python or advanced SQL. At this stage we do more advanced validations of data quality and data availability, made possible and easy by the UI-first functionality of the tool.
The last stop for the transformed, ready-to-consume data is the data warehouse. We leverage Google BigQuery’s extremely performant querying capabilities and out-of-the-box integration with Google Data Studio to run all of our department and company level reporting.
Once the data lands in our BigQuery data warehouse, each of the individual spoke teams can use that prepared data as the basis for business-specific pipelines and analyses. One of my responsibilities is leading adoption analytics for our customer success team. I’ll chat a little about how I drive reporting and analytics about our customer base by leveraging the platform that Cesar and his hub team have created.
Customer Analytics Use Case
My team has one overarching goal: helping our customers become successful with Trifacta as quickly as possible. Naturally, having access to product usage data can provide a wealth of insights and information. Our first major project was further preparing product usage data to feed a customer adoption dashboard so that we can understand customer activity on a daily, weekly, and monthly basis.
Connecting to Data
All of my source tables originate from the BigQuery datasets that Cesar’s hub team populates on a daily basis. I work with product usage data and Salesforce data.
- Product Usage Data
Our product usage data has an extremely complex structure. If I were to query data directly from our SaaS metadata repository, answering a simple question like “how many jobs does the average user run on a daily basis” would require 5 or 6 joins! This is where the hub team’s initial preparation work has been a critical time saver for me. Instead of needing to perform all of the initial joins myself, and risk making a mistake, I am able to start my data preparation work from a standard daily activity table created by Cesar’s hub team. Even better, that standardized table serves as the source of truth for every spoke team’s product usage analyses, so we never need to worry about Sales and Customer Success running with slightly different definitions for each critical dimension.
- Salesforce Data
Not all of our usage data corresponds to paying customers. Our data warehouse includes usage from Trifactans, free trials, partners, and custom POCs. These data points do not belong in my customer activity analysis. Consequently, I enrich our product usage data with account-level data from Salesforce. This allows me to identify paying customers and also enhance my dashboards with firmographic data contained in Salesforce.
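To make the enrichment step concrete, here is a minimal Python sketch of the idea: join usage rows to account records and keep only paying customers. Every field name and value here (`account_id`, `type`, `industry`, and so on) is a hypothetical stand-in, not our actual warehouse or Salesforce schema, and in practice I do this step in Dataprep rather than in code.

```python
# Hypothetical daily activity rows from the warehouse.
usage = [
    {"account_id": "A1", "user": "ana", "jobs_run": 5},
    {"account_id": "A2", "user": "bo", "jobs_run": 2},
    {"account_id": "A3", "user": "cy", "jobs_run": 9},  # no matching account
]

# Hypothetical Salesforce account records, keyed by account_id.
accounts = {
    "A1": {"type": "customer", "industry": "Retail"},
    "A2": {"type": "free_trial", "industry": "Finance"},
}


def enrich_paying(usage_rows, account_lookup):
    """Inner-join usage to accounts and keep only paying customers."""
    enriched = []
    for row in usage_rows:
        account = account_lookup.get(row["account_id"])
        # Drop usage with no account match and usage from non-customers.
        if account is None or account["type"] != "customer":
            continue
        # Attach firmographic data to the usage row.
        enriched.append({**row, "industry": account["industry"]})
    return enriched


print(enrich_paying(usage, accounts))
# [{'account_id': 'A1', 'user': 'ana', 'jobs_run': 5, 'industry': 'Retail'}]
```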
Just like the hub team, I use Google Cloud Dataprep by Trifacta to perform all of my data preparation and data cleansing work. I can write SQL, but I typically have so many other responsibilities that I really don’t want to take the time to compose, debug, and maintain hundreds of lines of SQL code. For me, Dataprep is much faster than messing around with code.
One of my core Dataprep flows works like this: I pull project-level and user-level data from our BigQuery data warehouse, compute multiple aggregated rollups of that data, and then join the rollups back together. This gives me a table with multiple levels of granularity that I can use as the base of my dashboard.
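For readers who think in code, the rollup-and-rejoin pattern can be sketched roughly as follows. This is an illustration of the pattern only, not the Dataprep flow itself: the function `build_base_table` and the `project`, `user`, and `jobs_run` fields are hypothetical.

```python
from collections import defaultdict

# Hypothetical activity rows: one row per user per project per day.
rows = [
    {"project": "p1", "user": "ana", "jobs_run": 3},
    {"project": "p1", "user": "bo", "jobs_run": 1},
    {"project": "p2", "user": "cy", "jobs_run": 4},
    {"project": "p1", "user": "ana", "jobs_run": 2},
]


def build_base_table(activity_rows):
    """Compute user-level and project-level rollups, then join them."""
    user_jobs = defaultdict(int)     # rollup 1: jobs per (project, user)
    project_jobs = defaultdict(int)  # rollup 2: jobs per project
    for r in activity_rows:
        user_jobs[(r["project"], r["user"])] += r["jobs_run"]
        project_jobs[r["project"]] += r["jobs_run"]

    # Join the project rollup back onto the user rollup so each output
    # row carries both levels of granularity.
    return [
        {"project": p, "user": u, "user_jobs": n, "project_jobs": project_jobs[p]}
        for (p, u), n in sorted(user_jobs.items())
    ]


for line in build_base_table(rows):
    print(line)
```

Keeping both totals on the same row lets a dashboard show, for example, each user's share of a project's activity without another join.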
Since the data team chose to standardize both the hub and spoke data engineering work in a single tool, I’m also able to reuse logic that my colleagues on other teams have created. For example, the spoke team responsible for product and engineering reporting had previously developed logic for one of my aggregations. Instead of starting from scratch and redoing that work, I was simply able to copy the relevant steps from their flow and paste those steps into my flow.
Publishing and Automating
Within BigQuery, the hub team has set up individual datasets for each spoke team to store their work. I typically write my output tables to one of two datasets: sandbox_connor when I’m developing and testing my logic, and c4l_base_datasets when I am ready to schedule my work on a daily basis.
Each of the pipelines that feed my customer usage dashboard needs to run on a daily basis. However, these pipelines have upstream dependencies on the tables that the hub team populates. To ensure that my data remains in sync with the rest of the pipeline, I share my production-ready flows with the hub team’s dedicated scheduling and orchestration account. The hub team then slots my outputs into the correct point in their existing scheduled plan.
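Slotting an output into the correct point of a plan is, at heart, a dependency-ordering problem. The sketch below shows that idea with Python's standard `graphlib` module. It is not how the hub team's scheduler is implemented, and the table names in `dependencies` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each table maps to the set of tables it depends on.
dependencies = {
    "daily_activity": {"raw_usage"},
    "salesforce_accounts": {"raw_salesforce"},
    # A spoke output that needs both hub tables to be fresh first.
    "customer_adoption": {"daily_activity", "salesforce_accounts"},
}


def run_order(deps):
    """Return one valid execution order in which every table runs after its inputs."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Whatever order is printed, `customer_adoption` always comes after both of the hub tables it reads from, which is the guarantee the shared schedule provides.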
That’s it…daily product analytics delivered to the entire organization.
Benefits of the Hub and Spoke Model
As you can see, the hub and spoke model provides us with both centralized governance of key data assets and the flexibility for different data teams to leverage the data in the unique ways they need to. It’s a cross-functional effort but one that has clear roles and responsibilities that enable the different team members to work efficiently without ambiguity.
For this model to be effective, data preparation is vital: members of the spoke teams must be able to flexibly clean, transform, and blend together the different foundational datasets provided by the hub team. This is a process change for many organizations, but one that leads to tremendous gains in the quality, speed, and efficiency of an organization’s analytics efforts.
Feel free to try this model out for yourself by signing up for a trial of Trifacta and inviting your colleagues to test out how this process would work on your team.