In this four-part series, we’ll explore the data lake ecosystem—its various components, supporting technologies, and how to best outfit your lake for success. In our first post, we discussed how creating a data catalog in partnership with data wrangling instills data governance. Now, we’ll talk about the other side of data preparation: data ingestion.
Fig 1. Data Lake Block Diagram
Data ingestion is the process of moving data from its origin to one or more data stores, such as a data lake, though this can also include databases and search engines. As you might imagine, the quality of your ingestion process corresponds with the quality of data in your lake: ingest your data incorrectly, and it can make for a more cumbersome analysis downstream, jeopardizing the value of your data altogether. When your ingest is working well, your data arrives in the lake on time, with the right fidelity, and ready for data wrangling and analytic use.
The Key Functions of Ingestion
So, what does proper ingestion look like? What are the primary objectives of each ingestion? Below, we've listed the top three functions of ingestion:
- Collect data from the source
Sources can be clickstreams, data center logs, sensors, APIs or even databases. They use various data formats (structured, unstructured, semi-structured, multi-structured), can make data available in a stream or batches, and support various protocols for data movement.
- Filter and sanitize
Processing at this early stage of the data life cycle can be simple field manipulations, JSON parsing, de-duplication and masking functions. More complex operations can be executed using scripts or by calling out to external data services.
- Route to one or more data stores
Routing of data from source to data stores can be simple or complex, with routing rules based on attributes of the data, and with automatic conversion of data types and formats.
It’s important to note that these ingestion functions need to be performed as a low-latency, high-throughput, continual process, even when the characteristics of the incoming data change.
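To make the three functions above concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the JSON-lines clickstream format, the `id` and `email` field names, and the in-memory lists standing in for real data stores. A production pipeline would use an ingestion framework rather than hand-rolled generators, but the collect → sanitize → route shape is the same.

```python
import hashlib
import json

def collect(raw_lines):
    """Collect: parse each raw JSON line from a hypothetical clickstream source."""
    for line in raw_lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed records at the edge

def sanitize(records):
    """Filter and sanitize: de-duplicate on a record id and mask an email field."""
    seen = set()
    for rec in records:
        rid = rec.get("id")
        if rid in seen:
            continue  # de-duplication
        seen.add(rid)
        if "email" in rec:
            # Masking: replace PII with a one-way hash
            rec["email"] = hashlib.sha256(rec["email"].encode()).hexdigest()
        yield rec

def route(records, stores):
    """Route: send each record to every store whose rule matches its attributes."""
    for rec in records:
        for predicate, store in stores:
            if predicate(rec):
                store.append(rec)

# Usage: route click events and errors to separate in-memory "stores".
clicks, errors = [], []
raw = [
    '{"id": 1, "type": "click", "email": "a@example.com"}',
    '{"id": 1, "type": "click", "email": "a@example.com"}',  # duplicate
    '{"id": 2, "type": "error"}',
    'not json',
]
route(sanitize(collect(raw)),
      [(lambda r: r["type"] == "click", clicks),
       (lambda r: r["type"] == "error", errors)])
```

After running, `clicks` holds one de-duplicated record with a hashed email and `errors` holds the single error record; the malformed line was dropped at collection time.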
Ingestion + Data Wrangling
Ingestion and data wrangling are natural complements. Upon ingesting data, users may perform light sanitization on the source data in order to support universally acknowledged policies, such as masking personally identifiable information or using canonical data representations, as well as monitoring the inbound data flow for completeness, consistency and accuracy.
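Canonical data representations are one of the easiest wins at ingest time. As a sketch, the function below normalizes timestamps arriving in several assumed source formats into a single ISO-8601 UTC string; the list of input formats is hypothetical and would, in practice, be driven by what your sources actually emit.

```python
from datetime import datetime, timezone

# Hypothetical set of timestamp formats seen across source systems.
FORMATS = ["%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M", "%Y-%m-%dT%H:%M:%S%z"]

def canonicalize_timestamp(value):
    """Normalize a timestamp string to a canonical ISO-8601 UTC representation."""
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next known format
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assume UTC when unzoned
        return dt.astimezone(timezone.utc).isoformat()
    raise ValueError(f"unrecognized timestamp: {value!r}")

# Usage: two source formats collapse to the same canonical value.
print(canonicalize_timestamp("2017-03-01 12:00:00"))
print(canonicalize_timestamp("01/03/2017 12:00"))
```

Downstream wrangling then only ever sees one timestamp format, which is exactly the kind of janitorial work you want pushed upstream into ingest.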
Once this data lands in the data lake, the baton is handed to data scientists, data analysts or business analysts for data preparation, in order to then populate analytic and predictive modeling tools. From a data preparation standpoint, the ideal ingestion system will have cleaned the data as much as possible so that data preparation is primarily focused on exploration and insight for business needs. During this discovery phase, analysts may uncover new specifications and tuning rules for the ingestion process to obtain higher data sanitization standards while the data is flowing to the lake.
With a solid ingestion process in place, data should have received a basic level of sanitization by the time it lands in the lake. However, if users need data in the lake to be as raw as possible for compliance, it's also possible to extend the ingestion process into the data lake, such as running a set of one-time transformations on new data as a nearline compute process in order to minimize the janitorial work required during data preparation.
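One way such a nearline pass might look in practice: a job that periodically scans a landing directory for newly arrived batches, applies a one-time cleanup, and marks each batch as processed. The sketch below assumes a hypothetical layout of `.jsonl` batch files with sibling `.done` marker files; real systems would typically track state in a metastore instead.

```python
import json
import tempfile
from pathlib import Path

def nearline_pass(landing_dir, processed_marker=".done"):
    """One-time cleanup pass over newly landed batch files.

    Assumed layout: each ingested batch is a .jsonl file in landing_dir;
    a sibling marker file records that the batch was already processed.
    """
    for batch in sorted(Path(landing_dir).glob("*.jsonl")):
        marker = batch.with_name(batch.name + processed_marker)
        if marker.exists():
            continue  # batch already cleaned on a previous pass
        cleaned = []
        for line in batch.read_text().splitlines():
            rec = json.loads(line)
            # Example one-time transformation: strip whitespace from strings.
            cleaned.append({k: v.strip() if isinstance(v, str) else v
                            for k, v in rec.items()})
        batch.write_text("\n".join(json.dumps(r) for r in cleaned))
        marker.touch()  # mark the batch as processed

# Usage: land one messy batch, run the pass, inspect the result.
tmp = tempfile.mkdtemp()
Path(tmp, "batch1.jsonl").write_text('{"name": "  alice  "}')
nearline_pass(tmp)
print(Path(tmp, "batch1.jsonl").read_text())  # {"name": "alice"}
```

Because processed batches are skipped on subsequent runs, the pass is safe to schedule repeatedly without re-transforming data.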
Creating an Ingestion Pipeline
Ingestion has aspects of both development and operations. From a development perspective, data engineers must create ingest pipelines: logical connections between a source and one or more destinations. The popular methods for ingest to date have been Sqoop, Flume and Kafka, all of which involve custom coding in a programming language to move data. However, this reliance on developers is evolving; Trifacta partner StreamSets, for example, has built a higher-level integrated development environment for creating and running pipelines using a visual UI, which minimizes the amount of custom coding required.
Ingestion must also be treated as an operations process, since it involves recurring, highly time-sensitive data flows. Ingest pipelines must be monitored continually to ensure that they are not dropping data or that the data is not becoming corrupted over time.
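A simple form of this operational monitoring is volume-based completeness checking: count records per time window and alert when a window falls below an expected floor. The sketch below is illustrative only; the `expected_min` threshold is an assumption that, in practice, would come from historical baselines, and real deployments would emit alerts to a monitoring system rather than a list.

```python
import time

class IngestMonitor:
    """Track records per time window and flag suspicious drops in volume."""

    def __init__(self, expected_min, window_seconds=60):
        self.expected_min = expected_min      # assumed per-window floor
        self.window_seconds = window_seconds
        self.count = 0
        self.window_start = time.monotonic()
        self.alerts = []

    def record(self, n=1):
        """Called by the pipeline for each batch of ingested records."""
        self.count += n

    def tick(self, now=None):
        """Close the window if it has elapsed; alert on low volume."""
        now = time.monotonic() if now is None else now
        if now - self.window_start >= self.window_seconds:
            if self.count < self.expected_min:
                self.alerts.append(
                    f"only {self.count} records in window "
                    f"(expected >= {self.expected_min})")
            self.count = 0
            self.window_start = now

# Usage: simulate a window that ingested too few records.
mon = IngestMonitor(expected_min=100, window_seconds=60)
mon.record(40)
mon.tick(now=mon.window_start + 61)  # force the window to close
print(mon.alerts)
```

The same windowed-counter pattern extends naturally to per-source counts or checksum comparisons for detecting the corruption mentioned above.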
The Bottom Line
In short, data ingestion is the other side of the coin from data exploration and preparation. The adoption of both technologies can help you operationalize a smooth-running data lake that efficiently delivers insights to the business.
Want to learn more about data ingestion? Stay tuned for the next post in this series, where Trifacta partner StreamSets will go in-depth from their perspective as a data flow management software provider. In the meantime, sign up for Trifacta Wrangler to experience data wrangling for yourself!