In this four-part series, we’ll explore the data lake ecosystem—its various components, supporting technologies, and how to best outfit your lake for success. Our second post talked about data ingestion, and now we’ll take a deeper dive by hearing from Trifacta partner StreamSets.
Your business has decided to build a data lake: a cost-effective storage and compute environment that will drive analytic insights to improve how you run your business. To ensure that business analysts, data analysts and data scientists can derive maximum benefit from data wrangling, it is critical that they start with a solid foundation of consistent data. Yet ingestion is often treated as an afterthought, and the complexity of moving data from source to store is often greatly underestimated, which can lead to serious challenges in delivering value from your data lake.
One of the key facets of delivering value through a data lake is ensuring you ingest data efficiently and with confidence. Faulty ingestion pipes spewing incomplete and inaccurate data muddy the data lake at a very basic level.
When Ingestion is Overlooked
Ingestion gets inadequate consideration for two reasons. First, because it’s not as sexy as data science and data analytics, it gets less attention than the analytics side of the data lifecycle. The plumbing of your data ingestion system, while critical to delivering novel insights that you can trust, doesn’t stimulate data architects in quite the same way as machine learning, data wrangling and the like.
Second, data ingestion is perceived to be easy. You have the data (the logs, signals and feeds) and know where it lives, so how hard can it be to move it to the data store? Well, if you only had one data flow to worry about, and you had to move the data only once, then it wouldn’t be so hard. But your data lake will be fed by numerous tributaries of batch and streamed data, and the analytic value of the data in the lake depends on it being complete, accurate and consistent over time. Your goal should therefore not be to build pipelines per se, but rather a continual ingest operation, and this is a complex endeavor requiring planning, specialized tools and expertise.
Confronting Big Data Challenges
Once you get past the realization that ingestion must be planned out and built separately, what catches you by surprise is that the key assumptions that work for traditional transaction data are violated when it comes to big data. Most problematic of these is that you cannot expect data characteristics for big data systems to be stable, and hence you cannot “set and forget” your ingest pipelines that easily. The truth is that a key aspect of big data is data drift, defined as: the unpredictable, unannounced and unending mutation of data characteristics caused by the operation, maintenance and modernization of data source systems.
Data drift shows itself as changes to data structures (added, deleted, changed fields) and semantics (the values need to be interpreted differently downstream). You can read this white paper or watch this webinar about data drift if you would like more details of its causes, implications, and solutions.
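To make the structural side of data drift concrete, here is a minimal Python sketch that compares an incoming record against the fields and types a pipeline expects. The names `DriftReport` and `detect_drift`, and the sample fields, are hypothetical and chosen only for illustration; they do not come from any particular ingestion tool. Note that this kind of check only catches structural changes. Semantic drift, where a field keeps its name and type but its meaning shifts, needs separate checks on the values themselves.

```python
from dataclasses import dataclass, field


@dataclass
class DriftReport:
    """Structural differences between an expected schema and one record."""
    added: set = field(default_factory=set)      # fields present but not expected
    removed: set = field(default_factory=set)    # fields expected but missing
    retyped: dict = field(default_factory=dict)  # field -> (expected type, actual type)

    @property
    def drifted(self) -> bool:
        return bool(self.added or self.removed or self.retyped)


def detect_drift(expected: dict, record: dict) -> DriftReport:
    """Compare a record against an expected {field_name: type} mapping."""
    report = DriftReport()
    report.added = set(record) - set(expected)
    report.removed = set(expected) - set(record)
    for name in set(expected) & set(record):
        if not isinstance(record[name], expected[name]):
            report.retyped[name] = (
                expected[name].__name__,
                type(record[name]).__name__,
            )
    return report


# Hypothetical schema and a record from a source system that has drifted.
expected = {"user_id": int, "amount": float, "ts": str}
record = {"user_id": "42", "amount": 9.99, "region": "EU"}

report = detect_drift(expected, record)
# Here "region" is an added field, "ts" is a removed field,
# and "user_id" has changed type from int to str.
print(report.drifted, report.added, report.removed, report.retyped)
```

Running a check like this on every record, or on a sample, turns silent structural drift into something a pipeline can log or alert on.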
Also, the tooling for big data ingestion is immature when compared to the tooling for traditional data, which has had a couple of decades to evolve into a high-functioning ecosystem. The prevailing big data ingest tools are Apache projects that were donated from or took inspiration from large data-driven internet companies like Google, Facebook and LinkedIn. Given the technical pedigree of the donor companies, it’s not surprising that these tools tend to be low-level, developer-centric frameworks that cannot themselves adapt to changes in the data.
Because these challenges often go unrecognized, the usual approach to data lake ingestion is to assign a data engineer or two to code up some pipelines using Sqoop, Flume or Kafka. And that’s fine until the sources start proliferating and the data starts changing. These changes, combined with an explosion of sources, mean that data engineers spend all of their time patching their low-level code.
The bigger issue is being confident in the completeness and consistency of your data as it lands in the data lake. If you’re lucky, data drift causes your ingest pipeline to break loudly, and you can re-architect to embrace this new reality. But often (and worse), these changes to structure or semantics aren’t detected and silently corrode the data and compromise your analysis. In this case data drift leads to false insights that can drive bad decisions and when discovered belatedly—through analytic output that just “doesn’t make sense”—doubt seeps into the validity of any of the data in the lake.
Enabling Effective Ingestion
How should you think about data lake ingestion in the face of this reality? Here are a few recommendations:
1) Treat data ingestion as a separate project that can support multiple analytic projects. Design a data flow architecture that treats each data source as the start of a separate swim lane. Even if there is only one consumer for a data source today, the power of the data lake concept is that you can leverage all of your sources to any consuming application, some of which have not yet been conceived.
2) Embrace change. Recognize you’re working in an environment plagued by data drift and think through how you will deal with changes large and small. What happens when a field is added, moved or deleted? Can you tolerate pipeline breakage, and for how long? If data semantics change, how will you know? When you have to upgrade your message queue or data store, can you do so without a data or service blackout?
3) Uplevel your tooling. Check out StreamSets Data Collector, a new open source integrated development environment with a visual UI for designing and operating data flows. It simplifies pipeline development (custom coding is the exception and not the rule), provides better operational visibility (with KPIs to provide early warning) and, most importantly, was designed on the assumption of inexorable data drift.
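As a rough illustration of the questions raised in the recommendations above, the following Python sketch routes records that no longer match expectations into a quarantine list instead of letting them break the flow or land silently in the lake, and it tracks the quarantine rate as a simple early-warning indicator. This is not StreamSets Data Collector’s API. The function name `ingest_batch`, the 5% threshold and the sample fields are all assumptions made for the example.

```python
def ingest_batch(records, expected, max_error_rate=0.05):
    """Split a batch into clean and quarantined records.

    `expected` maps field names to the Python types the pipeline relies on.
    Returns (good, quarantined, error_rate, alert), where `alert` is True
    when the share of quarantined records exceeds `max_error_rate`.
    """
    good, quarantined = [], []
    for record in records:
        matches = all(
            name in record and isinstance(record[name], expected_type)
            for name, expected_type in expected.items()
        )
        (good if matches else quarantined).append(record)

    total = len(good) + len(quarantined)
    error_rate = len(quarantined) / total if total else 0.0
    return good, quarantined, error_rate, error_rate > max_error_rate


# Hypothetical batch in which one source has started sending user_id as text.
expected = {"user_id": int, "amount": float}
batch = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": 3.0},
    {"user_id": "3", "amount": 1.0},
    {"user_id": 4, "amount": 2.5},
]

good, quarantined, error_rate, alert = ingest_batch(batch, expected)
print(len(good), len(quarantined), error_rate, alert)  # prints: 3 1 0.25 True
```

The design choice here is to keep the flow running while making drift visible: clean records continue to the lake, suspect records are preserved for inspection, and the alert flag gives operators a signal before bad data compromises downstream analysis.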
In summary, it is important and non-trivial to implement an ingestion infrastructure that supplies your data lake with timely, complete and consumption-ready data. To turn a popular aphorism on its head, perhaps we must say “begin with the beginning in mind” if you want to ensure your data lake ends are met.
Stay tuned for our next post, where we’ll explore how data ingestion and data preparation enable successful data visualization, with help from one of our partners, Zoomdata. Also, to try data prep or the data catalog out for free, you can sign up for Trifacta Wrangler.