Making the Most of Your BigQuery Investments for Scalable Data Engineering Pipeline
When we released BigQuery Pushdown for Dataprep on Google Cloud back in April, we knew that it was a highly anticipated ELT (Extract, Load & Transform) feature that would improve both design time and processing time. However, we did not expect it to be adopted so quickly. Our internal benchmark of 20x job acceleration was […]
Introducing the Trifacta Python SDK
Background In recent years, Python has become one of the most popular object-oriented programming languages. Whether you are a beginner or an experienced programmer, Python’s simple, easy-to-learn syntax enables quick readability and integration with heterogeneous systems. This simple method of programming makes Python very attractive for scripting as well as connecting different components of software […]
Back to SQL: Data Engineering
As part of growing our massive new Data Science program at Berkeley, it became clear that we needed to target a class specifically for Data Engineering. The goals of Data Engineering are different than Software Engineering. So it was interesting to think through this curriculum and how we would teach it differently than our established database classes.
In this new approach, we ended up emphasizing four steps to SQL for Data Engineering that are atypical of a traditional databases class: data quality, data reshaping, “spreadsheet tasks,” and data pipeline testing.
Transformation: Next Level SQL
When we use SQL for Transformation—the “T” in ELT—the focus changes. In this case, we’re taking many messy and disparate tables and manipulating them into a more usable or common form. To take our example from before, we may be extracting and loading sales data from 17 electronics chains that sold the phones, and our job in SQL is to write transformation queries that integrate that data together.
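As a rough illustration of that kind of integration query, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a data warehouse. The table names, column names, and sample rows (chain_a_sales, chain_b_sales, all_sales) are hypothetical and only two chains are shown; the point is the pattern of loading raw tables as they arrive and then using SQL to reshape them into one common form.

```python
import sqlite3


def load_raw(conn):
    """The 'E' and 'L' steps: load each chain's data as-is (hypothetical schemas)."""
    conn.executescript("""
        CREATE TABLE chain_a_sales (sale_date TEXT, model TEXT, units INTEGER);
        CREATE TABLE chain_b_sales (sold_on TEXT, phone_model TEXT, qty TEXT);
        INSERT INTO chain_a_sales VALUES ('2021-06-01', 'Phone X', 3);
        INSERT INTO chain_b_sales VALUES ('2021-06-01', ' phone x ', '2');
        INSERT INTO chain_b_sales VALUES ('2021-06-02', 'PHONE Y', '5');
    """)


def integrate_sales(conn):
    """The 'T' step: align names, casing, and types, then stack the tables."""
    conn.executescript("""
        CREATE TABLE all_sales AS
        SELECT 'chain_a' AS chain,
               sale_date,
               UPPER(TRIM(model)) AS model,
               units
        FROM chain_a_sales
        UNION ALL
        SELECT 'chain_b',
               sold_on,
               UPPER(TRIM(phone_model)),
               CAST(qty AS INTEGER)
        FROM chain_b_sales;
    """)
    # Summarize units sold per model across all chains.
    return conn.execute(
        "SELECT model, SUM(units) FROM all_sales GROUP BY model ORDER BY model"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_raw(conn)
    print(integrate_sales(conn))
```

In a real warehouse the same SELECT ... UNION ALL pattern would run inside the warehouse engine itself; SQLite is used here only so the sketch is self-contained.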
SQL Pipelines and ELT
ELT is increasingly attractive these days. Modern data warehouses are flexible and increasingly cost-effective, allowing us to store large volumes of data—even messy data that includes volumes of text and images. In this environment, transformations occur in the data warehouse, where the native language is SQL.
Summer of SQL: Why It’s Back
For the first decades of the millennium, it seemed like the Java-centric approach was the "hot new thing," but SQL has been roaring back. Today, SQL seems to be the focus of every data engineering conversation and is popping back up on billboards in Silicon Valley.
The comparison of the two "shops" inevitably leads to the question: which is better? There are pros and cons to emphasizing one or the other.
Tracking Dataprep Metadata and Profile Results with Google Cloud Data Catalog
Maintain BigQuery data lineage by enriching Google Cloud Data Catalog tags with Dataprep metadata and profiling results. Cataloging Dataprep Pipelines: Google Cloud Data Catalog is the de facto metadata cataloging solution for your analytics initiatives on Google Cloud. Data Catalog natively and automatically captures BigQuery datasets, tables, and views, which gives you visibility into your data […]
Data Preparation for the Lakehouse
The Lakehouse represents a new way of implementing a data architecture. It combines the best benefits of data warehouse and data lake architectures. In particular, a Lakehouse combines the high performance and ease of use of a traditional data warehouse with the flexibility and low cost of a data lake. However, an organization seeking to […]
It Takes a Village to Raise a Cloud Analytics Platform
As a person passionate about technology and working to help customers better deliver on their mission goals, I spend a lot of time thinking about patterns across the myriad of customers I’m fortunate enough to meet and work with on projects. I’m continually learning, assessing and working to understand things and as I come across […]
The Road to Data Preparation
Over the course of my career, I’ve used my fair share of technologies to clean and transform data. I enjoy coding and love SQL. I built ETL pipelines for 10 years. I still use Excel (and Google Sheets) quite often in my product marketing role at Trifacta. When I first discovered data preparation technologies, I […]
Trifacta’s Partner Databricks Announces GA Launch on Google Cloud
Today, our partner Databricks announced their GA launch on Google Cloud. We are very excited to have Databricks join Dataprep by Trifacta on Google Cloud Platform Marketplace. This new service will provide a simple, open lakehouse platform for data engineering, data science, analytics, and machine learning with tight integrations to Google Cloud’s analytics solutions. For […]
The Different Approaches to “T” in ELT and What’s Required to Drive Mass Adoption
Much has been written about the shift from ETL to ELT and how ELT enables superior speed and agility for modern analytics. One important move to support this speed and agility is creating a workflow that enables data transformation to be exploratory and iterative. Preparing data for analysis requires an iterative loop of forming and […]
What Is ETL? ETL vs. ELT vs. Data Wrangling in the Cloud
Is ETL dead? Did ELT take over or is something new taking its place? It’s a question that has come up a lot in recent years as organizations modernize their analytics infrastructure. Huge shifts are afoot in the analytics landscape and it isn’t always clear where this change leaves ETL. The short answer? No, ETL […]
Google Sheets: Data Validation Tips & Tricks
Google Sheets is one of the most widely-used spreadsheet tools. Still, many of its best features go undiscovered. Let’s take a closer look at how to do data validation in Google Sheets, which is commonly used to build drop-down lists. Why data validation matters Data validation is like the analytic version of copyediting. As much […]
Easily Publish to Data Warehouses with New Rename Functions in Trifacta
Chances are you’re having to work with several different databases and data warehouses in your analytics stack. It just is what it is today. In order to get an accurate picture in your reporting you have to use everything. However, working with these different databases can be like, well this: When publishing tables in different […]
How to Automatically Deploy a Google Cloud Dataprep Pipeline Between Workspaces
This article explains how to use Cloud Composer to automate Cloud Dataprep flow migration between two workspaces. This process can be leveraged for your Cloud Data Warehouse project to move between development, test, and production, following what is known as a Continuous Integration and Continuous Delivery (CI/CD) pipeline in agile development. At a high level, this […]
Data Preparation Best Practices for Snowflake Data Warehouses
Snowflake is a platform known for its separation of storage and compute, which makes scaling data more efficient. However, to get the most value from your investment in Snowflake’s Cloud Data Warehouse, your organization must break through the biggest bottleneck to analytics and AI: data preparation. Here are five data preparation best practices your organization […]
How to Change Date Format in Excel
When you enter a date into Microsoft Excel, the program will format it according to the default date settings. For example, if you want to enter the date February 6, 2020, the date could appear as 6-Feb, February 6, 2020, 6 February, or 02/06/2020, all depending on your settings. You may find that if you […]