Data Engineering Glossary

An ad-hoc query is a single-use query written to answer an “on-the-fly” business question for which no pre-written query or standard procedure exists. Because they serve a one-off purpose, ad-hoc queries are generally not stored as procedures to be run again in the future. ...
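
For illustration, here is a minimal sketch of an ad-hoc query in Python against an in-memory SQLite table; the table and column names are invented for the example.

    import sqlite3

    # Throwaway in-memory database with a few sample sales rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [("West", 120.0), ("East", 75.5), ("West", 30.0)])

    # An ad-hoc query: written once to answer a specific, on-the-fly question,
    # not saved as a stored procedure for reuse.
    for row in conn.execute("SELECT region, amount FROM sales WHERE amount > 100"):
        print(row)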

AI & analytics explainability is the idea that humans should be able to understand and interpret how AI/ML models arrive at their predictions. Explainability occurs when there is transparency that gives humans the ability to question the inner workings of machine learning models. Organizations pursu ...  

Alternative data is data that is drawn from non-traditional sources and allows for additional insights when combined with traditional data sources. Examples of alternative data include web scraped data, satellite data, credit card transactions, product reviews, and geolocation data. Alternative data ...  

An Analytics Engineer’s job is to build the bridge between data engineering and data analysis, maintaining well-tested, well-documented, and up-to-date datasets that the rest of the company can use for analytics. Analytics Engineers enable self-service analysis by providing the infrastructure that ...  

Automated machine learning (AutoML) is the automation of the manual tasks involved in the building and training of machine learning models. It improves efficiency by making machine learning more accessible to non-experts and enabling data scientists to spend less time building machine learning pipel ...  

Batch processing refers to the scheduling and processing of large volumes of data simultaneously, generally at periods of time when computing resources are experiencing low demand. Batch jobs are typically repetitive in nature and are often scheduled (automated) to occur at set intervals, such as at ...  
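
As a rough sketch of the idea, the Python function below processes a full day's accumulated records in one pass; the file layout and column names (order_id, amount_cents) are assumptions for the example.

    import csv

    def run_nightly_batch(input_path, output_path):
        """Process an entire day's accumulated records in one scheduled run."""
        with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=["order_id", "amount_usd"])
            writer.writeheader()
            for row in reader:  # every record collected since the last run is handled here
                writer.writerow({"order_id": row["order_id"],
                                 "amount_usd": int(row["amount_cents"]) / 100})

    # A scheduler (cron, an orchestrator, etc.) would trigger this at a low-demand hour, e.g.:
    # run_nightly_batch("orders_2024-01-01.csv", "orders_2024-01-01_clean.csv")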

Big Data Analytics is a set of data analysis methods that enable the examination of large, often complex sets of raw data. Unlike traditional analytics, which is generally optimized for small volumes of relationally structured data, big data analytics methods are designed to handle vast qua ...

A cloud data warehouse is a database that is managed as a service and delivered by a third party, such as Google Cloud Platform (GCP), Amazon Web Services (AWS), or Microsoft Azure. Cloud data architectures are distinct from on-premises data architectures, where organizations manage their own phy ...

Customer analysis is the process of determining who is most likely to buy or utilize a company’s product or services. ...  

Data aggregation is the process of compiling data (often from multiple data sources) to provide high-level summary information that can be used for statistical analysis. An example of a simple data aggregation is finding the sum of the sales in a particular product category for each region you opera ...  
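
The sales-by-region example above can be sketched in a few lines of pandas; the column names and figures are invented for illustration.

    import pandas as pd

    sales = pd.DataFrame({
        "region":   ["West", "West", "East", "East"],
        "category": ["Shoes", "Hats", "Shoes", "Hats"],
        "amount":   [120.0, 80.0, 95.0, 60.0],
    })

    # Aggregate: total sales per product category within each region.
    summary = sales.groupby(["region", "category"])["amount"].sum()
    print(summary)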

Data Analysts are responsible for analyzing and interpreting data to answer questions and solve business problems. Their day-to-day tasks involve collecting data that is relevant to current business questions, cleaning and transforming that data into a format that is useful for analysis, and crea ...  

Data applications are applications built on top of databases that solve a niche data problem and, by means of a visual interface, allow for multiple queries at the same time to explore and interact with that data. Data applications do not require coding knowledge in order to procure or understand th ...  

Data blending is the process of bringing data together from different sources to create a unified dataset for visualization or analysis. Data blending is similar to data joining, but blending goes a step further by combining data from separate tools. During data blending, data may be brought togethe ...  
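
A rough sketch of blending in pandas, assuming two hypothetical exports, one from a CRM and one from a separate web-analytics tool, joined on a shared email key:

    import pandas as pd

    # Export from a CRM (hypothetical columns).
    crm = pd.DataFrame({
        "email": ["a@example.com", "b@example.com"],
        "plan":  ["pro", "free"],
    })

    # Export from a separate web-analytics tool (hypothetical columns).
    web = pd.DataFrame({
        "email":      ["a@example.com", "b@example.com"],
        "page_views": [42, 7],
    })

    # Blend the two sources into one unified dataset for analysis.
    blended = crm.merge(web, on="email", how="left")
    print(blended)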

A data catalog is a comprehensive collection of an organization’s data assets, which are compiled to make it easier for professionals across the organization to locate the data they need. Just as book catalogs help users quickly locate books in libraries, data catalogs help users quickly search or ...  

Data cleansing, also called data cleaning or data scrubbing, is the first step in the overall data preparation process. It is the process of analyzing, identifying, and correcting messy raw data. Data cleaning involves filling in missing values, identifying and fixing errors, and determining if all the informatio ...
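
A minimal pandas sketch of common cleansing steps, with the data and rules invented for illustration: dropping duplicates, standardizing casing, and filling in missing or impossible values.

    import pandas as pd

    raw = pd.DataFrame({
        "customer": ["Ann", "Ann", "Bob", "Cara"],
        "age":      [34, 34, None, -1],        # missing and impossible values
        "country":  ["US", "US", "us", "DE"],  # inconsistent casing
    })

    clean = (
        raw.drop_duplicates()                                     # remove exact duplicate rows
           .assign(country=lambda d: d["country"].str.upper())    # standardize casing
    )
    clean.loc[clean["age"] < 0, "age"] = float("nan")             # treat impossible ages as missing
    clean["age"] = clean["age"].fillna(clean["age"].median())     # fill missing values
    print(clean)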

A data dictionary is a collection of the technical names, definitions, and attributes used for data elements and models across an organization. Using a shared dictionary ensures that all data elements have the same quality, meaning, and relevance for all team members. Data dictionaries are helpful r ...  

Data discovery compiles data from multiple sources, and then configures the data so it can be understood and examined. The steps in data discovery can be broken down a few different ways, but they all include (1) data preparation, (2) data visualization and (3) data analysis. Completing the data dis ...  

Data drift refers to a change in data structure or meaning that can occur over time and cause machine learning models to break. It occurs frequently when ML models seek to describe continually changing (dynamic) circumstances or environments. For example, an ML model could be trained to identify reck ...
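
Drift detection approaches vary; as one deliberately simplified sketch (not the only method), the check below compares a feature's recent mean against the mean seen at training time and flags a large shift. The threshold and values are invented for illustration.

    from statistics import mean, stdev

    def drifted(training_values, recent_values, z_threshold=3.0):
        """Flag drift when the recent mean moves far from the training mean,
        measured in training standard deviations (a rough heuristic)."""
        mu, sigma = mean(training_values), stdev(training_values)
        shift = abs(mean(recent_values) - mu)
        return sigma > 0 and shift / sigma > z_threshold

    training_speeds = [61, 63, 60, 62, 64, 59, 61]   # values seen when the model was trained
    recent_speeds   = [82, 85, 80, 84, 83, 81, 86]   # values observed in production today
    print(drifted(training_speeds, recent_speeds))   # True -> time to investigate or retrain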

Data Engineers are the individuals in an organization responsible for setting up the data infrastructure, overseeing the data processes, and building the data pipelines that convert raw data into consumable data products. They set up and maintain the systems that are used to collect, store, manage, ...  

Data enrichment is the process of combining first-party data from internal sources with disparate data from other internal systems or third-party data from external sources. The data enrichment process makes data more useful and insightful. A well-functioning data enrichment process is a fundamental ...

Data exploration is one of the initial steps in the analysis process that is used to begin exploring and determining what patterns and trends are found in the dataset. An analyst will usually begin data exploration by using data visualization techniques and other tools to describe the characteristic ...  

A data fabric is an architectural design that enables connection to data regardless of where it is stored. This makes it possible to store data in separate “siloed” data lakes or data warehouses, each with localized control and governance, while still allowing users to perform queries across the ...  

A Data Glossary (often called a Business Glossary) is a central document that defines how business terms are used across an organization. It ensures that the organization has a single source of truth for key concepts and terms. This helps avoid misunderstandings when it comes to data quality, dat ...  

Data governance is the collection of policies, processes and standards that define how data assets can be used within an organization and who has authority over them. Governance dictates who can use what data and in what way. This ensures that data assets remain secure and adhere to agreed upon qual ...  

Data ingestion is the process of transporting data from its original source to a data storage medium, such as a data lake, data mart, or data warehouse. In data ingestion, data can come from a wide variety of sources, such as clickstreams, spreadsheets, sensors, APIs, or other databases, to name ...  

Data integration is the process of gathering data from multiple locations and combining it into one view. It’s the process of consolidating data with the intent of providing consistent access and delivery of the information. ...  

Data integrity refers to the accuracy and consistency of data over its entire lifecycle, as well as compliance with necessary permissioning constraints and other security measures. In short, it is the trustworthiness of your data. High data integrity means data hasn’t been altered, corrupted or mi ...  

A data lake is a data storage location optimized to hold large amounts of raw data at a low cost. The inexpensive, scalable nature of data lakes makes it possible for organizations to store large quantities of data without worrying about high storage expenses. This makes it more cost effective to st ...  

A data lakehouse is a data management architecture that seeks to combine the strengths of data lakes with the strengths of data warehouses. The idea behind the data lakehouse is to merge the cheap, reliable storage of data lakes with the powerful data management and data structure capabilities found ...  

Data Lineage is a record of a dataset’s lifecycle, including the data’s origins and where it moves over time. The ability to track, manage, and view data lineage simplifies tracing errors back to the data source and aids in debugging the data flow process. By tracking and utilizing the data line ...

A data mart is a segment of a data warehouse containing data that aligns with a particular team or business unit in an organization, such as sales, finance, or marketing. The primary purpose of a data mart is to provide easy access to the data needed by individual business units while limiting acces ...  

A data mesh is a new approach to designing data architectures. It takes a decentralized approach to data storage and management, with individual business domains retaining ownership over their datasets rather than funneling all of an organization’s data into a centrally owned data lake. Data is acces ...

Data mining is the process of extracting value from large data assets. Data mining also includes the presentation of this information intended to create action or provide new insight, meaning that data mining isn’t complete until its lessons are internalized by its consumers, the decision makers i ...  

Data modeling is the process of visualizing and representing data for storage in a data warehouse. The model is a conceptual representation of an organization's data elements and how they interrelate. This can be accomplished through diagrams, symbols, or text that represent data relationships. Data ...  

Data munging is the process of manual data cleansing prior to analysis. It is a time-consuming process that often gets in the way of extracting true value and potential from data. In many organizations, 80% of the time spent on data analytics is allocated to data munging, where IT manually cleans th ...

Data observability refers to the ability of an organization to monitor, track, and make recommendations about what’s happening inside their data systems in order to maintain system health and reduce downtime. Its objective is to ensure that data pipelines are productive and can continue running wi ...  

Data onboarding is the process of preparing and uploading customer data into an online environment. It allows organizations to bring customer records gathered through offline means into online systems, such as CRMs. Data onboarding requires significant data cleansing to correct for errors and format ...  

DataOps refers to the data management practices that help an organization deliver high quality data pipelines with speed and precision. It involves working with people, technology, and processes to remove obstacles and unnecessary complexity at every stage of the data lifecycle and allow data teams ...  

A data pipeline is a sequence of steps that collect, process, and move data between sources for storage, analytics, machine learning, or other uses. For example, data pipelines are often used to send data from applications to storage devices like data warehouses or data lakes. Data pipelines are ...  
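
A toy end-to-end sketch of the idea in plain Python, with each step as its own function; real pipelines would use orchestration tools and real sources, which are invented here.

    import json

    def extract():
        """Collect raw records (hard-coded here; in practice an API, file, or database)."""
        return [{"user": "ann", "ms_on_page": "5400"}, {"user": "bob", "ms_on_page": "1200"}]

    def transform(records):
        """Convert raw fields into the shape downstream consumers expect."""
        return [{"user": r["user"], "seconds_on_page": int(r["ms_on_page"]) / 1000}
                for r in records]

    def load(records, path):
        """Move the processed data into its storage destination (a file, for the sketch)."""
        with open(path, "w") as f:
            json.dump(records, f)

    load(transform(extract()), "page_time.json")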

Data preparation is the process of cleaning, structuring, and enriching raw data into a desired output for analysis. It’s commonly referred to as “janitorial work,” but it is mission-critical to ensuring robust, accurate downstream analytics. Properly conducted, data prepara ...

Data profiling is the process of evaluating the contents and quality of data. It is used to identify data quality issues at the start of a data project and define what data transformation steps may be needed to bring the dataset into a ready-to-use state. Data profiling checks for accuracy, complete ...  
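
A quick sketch of basic profiling checks in pandas; the dataset is invented for illustration.

    import pandas as pd

    orders = pd.DataFrame({
        "order_id": [1, 2, 2, 4],
        "amount":   [19.99, None, 5.00, -3.00],
        "country":  ["US", "US", "DE", None],
    })

    print(orders.dtypes)                          # which type each column landed as
    print(orders.isna().sum())                    # completeness: missing values per column
    print(orders["order_id"].duplicated().sum())  # uniqueness of the supposed key
    print(orders["amount"].describe())            # ranges that may reveal invalid values (e.g. negatives)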

Data quality is a measure of how well data meets the requirements of an organization for an intended purpose. High quality data that is accurate, formatted correctly, consistent, and easy to process is useful for analysis and machine learning. On the other hand, data that is incomplete, inconsistent ...  

A Data Stack is the set of tools and technologies an organization uses to accomplish their data workflows, such as analytics, data science, data engineering, and machine learning. In modern data stacks, technologies have been developed that specialize in handling specific stages within data pipeline ...  

Data stewards are the people within an organization who ensure adherence to data laws and internally established data governance policies. Their area of responsibility addresses issues such as data quality, accessibility, usability, and security. Data stewards are the go-to experts on data within a ...

Data streaming is the processing of data in real-time, as soon as it is generated, on a record-by-record basis. This allows for live monitoring of important information, enabling businesses to respond quickly when necessary. Examples of data streams include social media activity, online video game a ...  

Data transformation is the process of converting data into a different format that is more useful to an organization. It is used to standardize data between data sets, or to make data more useful for analysis and machine learning. The most common data transformations involve converting raw data into ...  

Data validation is the process of ensuring that your data is accurate and clean. Data validation is critical at every point of a data project’s life—from application development to file transfer to data wrangling—in order to ensure correctness. Without data validation from inception to iterati ...  
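
As a small illustration, the checks below validate a batch of records before it moves downstream; the rules and field names are made up for the example.

    def validate_order(order):
        """Return a list of human-readable problems; an empty list means the record is valid."""
        problems = []
        if not order.get("order_id"):
            problems.append("missing order_id")
        if not isinstance(order.get("amount"), (int, float)) or order["amount"] < 0:
            problems.append("amount must be a non-negative number")
        if order.get("currency") not in {"USD", "EUR", "GBP"}:
            problems.append("unknown currency: %s" % order.get("currency"))
        return problems

    batch = [
        {"order_id": "A1", "amount": 25.0, "currency": "USD"},
        {"order_id": "",   "amount": -5,   "currency": "XYZ"},
    ]
    for record in batch:
        issues = validate_order(record)
        print(record["order_id"] or "<blank>", "->", issues or "ok")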

A data warehouse is a data storage technology that brings together data from multiple sources into a single system. It serves as a centralized data hub holding large amounts of historical data that users can query for the purpose of analytics.

While data warehouses are very useful for an ...  

Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time. Data wrangling tools are increasingly being used to enable self-service analytics, allowing analysts to tackle more complex data more quickly, produce more acc ...  

Distributed computing is a processing technique that connects multiple computers and uses them as a single system, treating an entire network of machines as if it were one computer. This setup allows for powerful, scalable processing at lower overall cost. One of the primary reaso ...

Exploratory data analysis is the process of summarizing key characteristics of data to help develop an informed hypothesis. Exploratory data analysis is like taking stock of your kitchen before cooking a meal—assessing which ingredients you have in your pantry and how they might coincide with each ...  

A metadata repository is a database created for the purpose of storing and sharing metadata. Metadata describes what data contains, how it is structured, and where it is located within a storage device. Data warehouses use metadata repositories as roadmaps to help users quickly loc ...

MLOps refers to the set of practices that help an organization deploy, monitor, and manage machine learning models with speed and precision. It encompasses the people, processes, and technologies involved in developing ML algorithms and putting them into production. Generally speaking, MLOps seeks t ...  

A machine learning (ML) pipeline is a sequence of automated steps used to train and deploy a machine learning model. These steps typically include data extraction, data processing, model training, model deployment, model validation, and model re-training, with the last three steps being continuously ...  
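
The training portion of such a pipeline can be sketched with scikit-learn's Pipeline object, which chains preprocessing and model fitting into one reusable sequence; the data here is synthetic and the model choice is arbitrary.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the "data extraction" step.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Chain data processing and model training into one pipeline object.
    pipeline = Pipeline([
        ("scale", StandardScaler()),       # data processing
        ("model", LogisticRegression()),   # model training
    ])
    pipeline.fit(X_train, y_train)         # train
    print(pipeline.score(X_test, y_test))  # validate before deployment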

A regex (short for regular expression) is a sequence of characters used to specify a search pattern. It allows users to easily conduct searches matching very specific criteria, saving large amounts of time for those who regularly work with text or analyze large volumes of data. An example of a regex ...  
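
A small sketch with Python's built-in re module, matching a deliberately loose email pattern; the pattern is illustrative, not a production-grade email validator.

    import re

    # Loose pattern: some characters, an @, a domain, a dot, a suffix.
    email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    text = "Contact ann@example.com or sales@data.example.org for details."
    print(email_pattern.findall(text))  # ['ann@example.com', 'sales@data.example.org']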

Reverse ETL is a process that transmits data from storage systems like data warehouses or data lakes into operational systems such as SaaS tools or CRMs. One example of reverse ETL could be developing a pipeline that brings lead scoring information into Salesforce. By using reverse ETL, teams can el ...  

A User Defined Function (UDF) is a custom programming function that allows users to reuse processes without having to rewrite code. For example, a complex calculation can be programmed using SQL and stored as a UDF. When this calculation needs to be used in the future on a different set of data, rat ...
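
As a minimal illustration of the idea, SQLite (via Python's standard library) lets you register a Python function and call it from SQL like a built-in; the discount calculation and table are invented for the example.

    import sqlite3

    def discounted_price(price, pct_off):
        """Reusable business calculation, registered once and callable from any query."""
        return round(price * (1 - pct_off / 100.0), 2)

    conn = sqlite3.connect(":memory:")
    conn.create_function("discounted_price", 2, discounted_price)  # register the UDF
    conn.execute("CREATE TABLE items (name TEXT, price REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", [("mug", 10.0), ("tee", 24.0)])

    # The UDF is now reused in SQL without rewriting the calculation.
    for name, sale in conn.execute("SELECT name, discounted_price(price, 15) FROM items"):
        print(name, sale)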