Since rolling out the General Availability of Google Cloud Dataprep in Sept 2018, we’ve continued to see rapid growth in users and Dataprep jobs from around the world. While you will see many improvements across the product in our latest release, there are two areas of particular focus: job customization and data quality capabilities.
Simple Dataprep Job Customization
With several hundred thousand jobs running per month, customers have frequently asked for more options for tuning the Dataflow job that Dataprep creates. While this could be achieved through Dataflow templates, we wanted to make it much easier and more automated for any user. As a result, users can now choose regional endpoints, zones, and machine types directly in the Dataprep job UI or configure them as a project-level setting.
Enhanced Data Quality Capabilities
The new suite of capabilities focuses on making data quality assessment, remediation and monitoring more intelligent and efficient. These new capabilities are designed to help address data quality issues that hinder the success of analytics, machine learning, and other data management initiatives within the Google Cloud Platform.
If you want to learn more about these new capabilities and experience them live, we’ll be demonstrating them at Google Next SF 2019 next week. Please join us at booth S1623 next to the Serverless Analytics section.
Run Cloud Dataprep jobs in different regions or zones with customized execution options
Performance has always been the main reason to execute data processing as close as possible to where the data resides. More recently, last year’s rollout of GDPR raised the stakes by adding legal requirements for data locality.
Specifically, data locality requires that certain data (mostly customer data and Personally Identifiable Information, or PII) remain within the borders of a particular country or region. While such laws are not new, those in the European Union carry significant penalties, so the cost of not complying can be very high. As the majority of our Dataprep users are found outside of the US, it has become increasingly important to ensure that data stays within specific geographical regions when it is processed and stored. This also applies to companies in the US or other countries that process the data of European Union customers.
Prior to this release, customers who wanted to run their Dataprep job in a specific region were required to use Cloud Dataflow templates to execute it. While effective, this approach was cumbersome to set up and maintain, and it wasn’t available for scheduled jobs. Now, the most commonly used settings for Cloud Dataflow are available directly in the Dataprep job UI. Users can directly configure the location where Dataprep jobs will be executed to match the data storage locations defined for Cloud Storage and BigQuery.
You can select the regional endpoint and specific zone to which Dataprep submits the job, so that the Cloud Dataflow service initializes and processes the data there. This ensures that the underlying Cloud Dataflow job executes in the same location where the source and target data reside, thereby improving network performance and maintaining geo-locality.
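The idea of matching the job location to the data location can be sketched as a small pre-flight check. This is only an illustration in Python, not part of Dataprep or Dataflow: the function names and the `data_locations` dictionary are hypothetical, and the one fact it relies on is that a Compute Engine zone name is its region name plus a letter suffix (for example, zone `europe-west1-b` in region `europe-west1`).

```python
def zone_belongs_to_region(zone: str, region: str) -> bool:
    """True if the zone name is the region name plus a single-letter suffix."""
    prefix, _, suffix = zone.rpartition("-")
    return prefix == region and len(suffix) == 1 and suffix.isalpha()


def check_job_locality(region: str, zone: str, data_locations: dict) -> list:
    """Return a list of problems; an empty list means everything lines up.

    data_locations maps a label (hypothetical, chosen by the caller) to the
    location reported for that source or target.
    """
    problems = []
    if not zone_belongs_to_region(zone, region):
        problems.append(f"zone {zone} is not in region {region}")
    for name, location in data_locations.items():
        # Location strings are compared case-insensitively.
        if location.lower() != region.lower():
            problems.append(f"{name} is stored in {location}, job runs in {region}")
    return problems


locations = {"source bucket": "EUROPE-WEST1", "target dataset": "europe-west1"}
print(check_job_locality("europe-west1", "europe-west1-b", locations))
```

This simple comparison treats a multi-region location such as `EU` as a mismatch with any single region, so a real check would need its own rule for those cases.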
Expanding on these options, we’ve also made it possible to select the Compute Engine machine type used by Cloud Dataflow. This is particularly useful for processing-intensive transformations such as joins, aggregations and flat aggregations, sessioning, windowing, unpivoting, or pivoting, which benefit from more power on a single node. Note: Auto-scaling is turned on for all supported machine types.
All of these Dataflow execution options are saved for each Dataprep job so that scheduled jobs will automatically pick up the latest settings. In addition, you can configure these settings at the project level within the user profile settings.
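One way to picture how project-level and job-level settings could interact is shown below. This is a minimal sketch, not Dataprep's implementation: the `ExecutionSettings` class and `resolve_settings` function are hypothetical, the machine type names are only examples, and the rule that a job-level value takes precedence over the project default is an assumption made for illustration.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ExecutionSettings:
    """Hypothetical container for the three options discussed above."""
    region: Optional[str] = None
    zone: Optional[str] = None
    machine_type: Optional[str] = None


def resolve_settings(project_defaults: ExecutionSettings,
                     job_overrides: ExecutionSettings) -> ExecutionSettings:
    """Assumed precedence: a value set on the job wins, otherwise the
    project-level default is used."""
    chosen = {k: v for k, v in asdict(job_overrides).items() if v is not None}
    return replace(project_defaults, **chosen)


project = ExecutionSettings("europe-west1", "europe-west1-b", "n1-standard-1")
job = ExecutionSettings(machine_type="n1-highmem-8")  # only override the machine type
print(resolve_settings(project, job))
```

Keeping the settings immutable means a scheduled run always resolves against whatever was saved last, rather than against a value changed in place.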
To learn more about Dataflow regional endpoints and zones, please see the product documentation here.
New and enhanced Data Quality capabilities
Gartner Inc. has determined that 40% of all failed business initiatives are a result of poor-quality data and that data quality affects overall labor productivity by as much as 20%. As more data is used in analytics and other organizational initiatives, the risk grows that inaccurate data will be incorporated into analytic pipelines, leading to flawed insights. In order to truly capitalize on the unprecedented business opportunity of machine learning and AI, organizations must ensure their data meets high standards of quality.
The new/enhanced features to further support data quality initiatives on Google Cloud include:
- A new Selection Model creates a seamless experience that highlights data quality issues and offers interactive guidance on how to resolve these issues.
- Column selection provides expanded histograms, data quality bars, and pattern information to offer immediate insight into column distributions and data quality issues. These visuals update with every change to the data and offer instant previews of every transformation step.
- Interaction with profiling information drives intelligent suggestions and methods for cleaning that the user can choose from.
- Cluster Clean uses state-of-the-art clustering algorithms to group similar values and resolve them to a single standard value.
- Pattern Clean handles composite data types like dates and phone numbers that often have multiple representations. It identifies the datatype patterns in the dataset and allows users to reformat all values to a chosen pattern with a single click.
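To make the last two ideas concrete, here is a small Python sketch of what "group similar values and resolve them to a single standard value" and "reformat all values to a chosen pattern" can look like. It is not Dataprep's algorithm, which is not described here. The clustering uses a simple key-collision fingerprint as a stand-in, and the list of date patterns, including reading `04/02/2019` as month first, is an assumption chosen for the example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime


def fingerprint(value: str) -> str:
    """Grouping key: lowercase, drop punctuation, sort the unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster_clean(values):
    """Replace each value with the most frequent spelling in its group."""
    groups = defaultdict(Counter)
    for v in values:
        groups[fingerprint(v)][v] += 1
    standard = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [standard[fingerprint(v)] for v in values]


# Hypothetical list of input date patterns, tried in order.
DATE_PATTERNS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]


def pattern_clean(values, target="%Y-%m-%d"):
    """Rewrite every value that matches a known pattern to the target
    pattern; values that match nothing are left unchanged."""
    cleaned = []
    for v in values:
        for pattern in DATE_PATTERNS:
            try:
                cleaned.append(datetime.strptime(v.strip(), pattern).strftime(target))
                break
            except ValueError:
                continue
        else:
            cleaned.append(v)
    return cleaned


print(cluster_clean(["Acme Inc.", "acme inc", "Acme Inc.", "Inc Acme", "Globex"]))
print(pattern_clean(["2019-04-02", "04/02/2019", "02.04.2019", "n/a"]))
```

Fingerprinting only catches differences in case, punctuation, and word order. Misspellings would need a distance-based method, which this sketch leaves out.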