Leverage Cloud Functions and APIs to Monitor Cloud Dataprep Jobs Status in a Google Sheet

July 28, 2020

If you manage a data and analytics pipeline in Google Cloud, you may want to monitor it and obtain a comprehensive view of the end-to-end analytics process in order to react quickly when something breaks.

This article shows you how to capture Cloud Dataprep job statuses via APIs by leveraging Cloud Functions. We then write the statuses to a Google Sheet for an easy way to check whether jobs succeeded. Using the same principle, you can combine the statuses of other Google Cloud services in Google Sheets to obtain a comprehensive view of your data pipeline.

To illustrate this concept, we will assume you want to monitor a daily scheduled Dataprep job, with a quick look at a Google Sheet giving an overview of potential failures. The icing on the cake is that you will also be able to check the recipe name and job profile results in Google Sheets.

This article is a step-by-step guide to the process of triggering Cloud Functions when a Cloud Dataprep job is finished and publishing the job results, status, and direct links into a Google Sheet.

Here is an example of a Google Sheet with jobs results and links published.

Fig. 1 – Google Sheet Dataprep job results with access to job profile PDF
Fig. 2 – High-level process to trigger a Cloud Function based on a Cloud Dataprep job execution


1. Getting Started

To make this guide practical, we are sharing the Node.js code for the Cloud Function here on GitHub.

You need a valid Google account and access to Cloud Dataprep and Cloud Functions to try it out. You can start from the Google Console https://console.cloud.google.com/ to activate the services.

REMARK: To call the APIs, you need an Access Token, and you must be a Google Cloud project owner to generate one. If you are not a Google Cloud project owner, you can try it out with a personal Gmail account.

Fig. 3 – Get the access token from the Settings menu


2. Create the HTTP Cloud Function to Publish in a Google Sheet 

First, we need to create the HTTP Cloud Function that will be triggered as a Webhook when a Dataprep job has finished.

Create a Cloud Function from the Google Cloud console here. The trigger type must be “HTTP”. Give it a name; you will get a URL similar to https://us-central1-dataprep-premium-demo.cloudfunctions.net/Dataprep-Webhook-Function. We will need this URL later when creating the Webhook in Dataprep. In our example, we use Node.js as the Runtime, with the code provided above, under the Source Code section.

Fig. 4 Create the Cloud Function to be called from Dataprep

 

If you want to explore more about Cloud Functions, check out this tutorial.

The Cloud Function code follows this logic:

  1. Retrieve the job id and status (Failed or Completed) provided by Dataprep.
  2. Leverage the Dataprep user’s Access Token in the API calls so they are authenticated by Cloud Dataprep.
  3. Get more information (status, recipe id) about the job with the getJobGroup Dataprep API call. Documentation on this Dataprep API endpoint can be found here: https://clouddataprep.com/documentation/api/#operation/getJobGroup
  4. Get information (name, description) about the job’s recipe with the getWrangledDataset Dataprep API call. Documentation on this Dataprep API endpoint can be found here: https://clouddataprep.com/documentation/api/#operation/getWrangledDataset
  5. Publish the information and links to a Google Sheet. A link to the job result page and a link to download the result profile as a PDF are written to the Google Sheet:
Fig. 5 – Node.js code to get job details and publish job result status in a Google Sheet
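The steps above can be sketched in a few helper functions. This is a minimal sketch: the endpoint paths follow the Dataprep API documentation linked above, while the helper names themselves are illustrative, not part of the published code.

```javascript
// Minimal sketch of the Cloud Function logic; helper names are illustrative.
const DATAPREP_API = 'https://api.clouddataprep.com/v4';

// Step 1: read the job id and status that Dataprep POSTs to the webhook.
function parseWebhookPayload(body) {
  return { jobId: body.jobid, jobStatus: body.jobstatus };
}

// Step 2: every Dataprep API call is authenticated with the user's Access Token.
function authHeaders(token) {
  return { Authorization: 'Bearer ' + token, 'Content-Type': 'application/json' };
}

// Step 3: URL for getJobGroup (job status and recipe id).
function jobGroupUrl(jobId) {
  return DATAPREP_API + '/jobGroups/' + jobId;
}

// Step 4: URL for getWrangledDataset (recipe name and description).
function wrangledDatasetUrl(recipeId) {
  return DATAPREP_API + '/wrangledDatasets/' + recipeId;
}
```

The full function in the GitHub repository wires these pieces together and then performs step 5, the Google Sheets append.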

 

The Node.js code is here. You need to edit it and replace the placeholder values below with the ones you retrieved in your own Cloud Dataprep project.

  • Access Token to call the Dataprep API:
    • var DataprepToken = "eyJhdjkghdjkghjkghjkdfsghk";
  • Google Sheet ID where you want to publish the results:
    • const JobSheetId = "1X63lFIfsdfdsfsdfdsfN0wm3SKx-Ro";

To retrieve the Google Spreadsheet ID, follow the explanations here.
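As a reminder, the ID is the long token between /d/ and /edit in the spreadsheet’s URL; a small helper can extract it:

```javascript
// Extract the spreadsheet ID from a Google Sheets URL, e.g.
// https://docs.google.com/spreadsheets/d/<ID>/edit#gid=0
function spreadsheetIdFromUrl(url) {
  const match = url.match(/\/spreadsheets\/d\/([A-Za-z0-9_-]+)/);
  return match ? match[1] : null;
}
```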

  • Google API Key:
    • sheetsAPI.spreadsheets.values.append({
    • key: "AIzaSyAN7szfsdfsfsdfLh0qu8qlXUA",

To retrieve the Google API Key, follow the explanations here.
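The append call takes a single parameter object; building it in a small helper keeps the function readable. This is a sketch under assumptions: the range and the column layout of the appended row are illustrative choices, not mandated by the article’s code.

```javascript
// Build the parameter object passed to sheetsAPI.spreadsheets.values.append().
// The range and column layout here are illustrative; adapt them to your sheet.
function buildAppendParams(sheetId, apiKey, jobId, jobStatus, recipeName, resultUrl) {
  return {
    key: apiKey,
    spreadsheetId: sheetId,
    range: 'A1',                      // values.append writes below the last non-empty row
    valueInputOption: 'USER_ENTERED', // let Sheets parse dates and links
    resource: {
      values: [[new Date().toISOString(), jobId, jobStatus, recipeName, resultUrl]],
    },
  };
}
```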

You also need to add the following dependencies to your Node.js Cloud Function (PACKAGE.JSON tab):

{
  "dependencies": {
    "googleapis": "^42",
    "request": "^4.0.3"
  }
}

Fig. 6 – Node.js dependencies packages

 

You then need to deploy the Cloud Function. Once deployed, the Cloud Function runs and waits to be called from Cloud Dataprep when a job is executed. You can learn more about deploying and executing Cloud Functions here.

3. Create a Cloud Dataprep Flow and Configure a Webhook

Next, you need to create the Cloud Dataprep flow that will call the HTTP Cloud Function to publish the job result in Google Sheets.

You need to create and configure a Webhook task in your flow that will call your HTTP Cloud Function.

Fig. 7 – Creating a Cloud Dataprep flow and configuring a Webhook task on a flow

 

The Webhook task needs to be configured with this information:

  • URL: This is the URL of the HTTP Cloud Function you previously created. For example, https://us-central1-dataprep-premium-demo.cloudfunctions.net/Dataprep-Webhook-Function.
  • Headers: Use headers like those shown in the screenshot below, with Content-Type set to application/json.
  • Body: Use the value {"jobid":"$jobId","jobstatus":"$jobStatus"}, as shown in the screenshot below.
  • Trigger event: You can decide to trigger the Webhook for any status, or only for failed or completed jobs.
  • Trigger object: You can decide to trigger the Webhook only for specific outputs in the flow, or for any job executed in the flow.

When you have entered this information, you can test your Webhook task that calls your Cloud Function.

Fig. 8 – Webhook task parameters to call the Cloud Function

 

After you save the Webhook task, it is then ready to be called when the job is executed.

Fig. 9 – Webhook task created


4. Testing the End-to-End Process

You are now ready to test the end-to-end process by running a job from your Dataprep flow and seeing the job result status added to your Google Sheet.

Fig. 10 – Run a Dataprep job
Fig. 11 – Job result status and links published in the Google Sheet

Lastly, you can also check proper execution details (API call with the parameter and Cloud Dataprep job status) by reviewing the Google Cloud Functions logs located here.

Fig. 12 – Cloud Functions logs


Conclusion

You should now understand the fundamental principles of automatically publishing Dataprep job results to a Google Sheet, so you can monitor and share summary information easily with a broader team.

You have learned about:

  • Cloud Dataprep APIs
  • Cloud Dataprep Webhooks
  • Cloud Functions calling an API

You can also extend this solution to monitor additional Google Cloud services for end-to-end data pipeline monitoring.

You’re now ready to automate the monitoring of your job statuses. You can also automate Cloud Dataprep jobs by leveraging another Cloud Function or an external scheduler. Take a look at these articles explaining how to orchestrate Cloud Dataprep jobs using Cloud Composer and how to automate a Cloud Dataprep pipeline when a file arrives in Cloud Storage.
