In this two-part series, we’re talking about the Hadoop data lake, in terms of both the necessary components and the people involved. Our first post covered the different staging areas of the lake and what they should accomplish.
Data is flowing into businesses faster than ever. From marketing to customer support to operations, each part of the organization has its own objectives. To accommodate this growing volume and variety of data, many IT organizations are choosing to adopt data lakes. As explained in part 1, data moves through a Hadoop data lake in four zones, from landing to production.
However, to build a successful Hadoop data lake, one that fosters adoption across the business units, the organization needs to evolve. IT’s role in supporting the Hadoop data lake should shift from implementer and gatekeeper to enabler and trusted resource. To evolve effectively and create a successful Hadoop data lake, organizational roles must be aligned with team capabilities and resources.
Below, we’ve outlined best practices for organizational alignment around the Hadoop data lake:
Landing zone
Why it’s important: The landing zone preserves data in its native format, maintaining data provenance and fidelity, all in real time.
Who should own it: While the landing zone has traditionally been the realm of IT, next-gen data preparation and wrangling tools, such as Trifacta, have made it easy for the business to handle its own data requirements with little IT involvement.
Refinery zone
Why it’s important: The refinery zone is where minimally processed data with minimal security constraints is used for discovery, exploration, and experimentation.
Who should own it: IT has also traditionally owned this zone, using it to transform and standardize raw data that can’t be used as is. With data wrangling tools such as Trifacta, however, business users (primarily the data scientist or data analyst) can use it to explore the data and share datasets for team collaboration.
Production zone
Why it’s important: The production zone is like the production website: it is where the business data is stored in a clean, structured format that informs critical business decisions and drives efficient operations. The quality of this data depends heavily on the data preparation work done in the preceding zones.
Who should own it: Here, IT automates the business and data transformation rules to deliver controlled and validated outcomes, but the production zone should meet the needs of its users in the business units, as it’s where most will do their analyses.
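For teams that want to record this ownership split somewhere a pipeline or access-control script can read it, the zones described above can be sketched as a small lookup table. This is only an illustration of the ownership model as we read it from the descriptions above, not part of any Hadoop or Trifacta API; the names `Zone`, `ZONES`, and `zones_owned_by` are hypothetical, and the owner labels are a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """One staging area of the data lake (hypothetical structure)."""
    name: str
    purpose: str
    owner: str          # who runs the zone day to day (simplified label)
    primary_users: str  # who mainly works with the data in this zone


# Assumption: ownership follows the recommendations in the text above.
ZONES = (
    Zone(
        name="landing",
        purpose="preserve data in its native format",
        owner="business",
        primary_users="business users with data wrangling tools",
    ),
    Zone(
        name="refinery",
        purpose="discovery, exploration, and experimentation",
        owner="business",
        primary_users="data scientists and data analysts",
    ),
    Zone(
        name="production",
        purpose="clean, structured data for business decisions",
        owner="IT",
        primary_users="business units",
    ),
)


def zones_owned_by(owner: str) -> list[str]:
    """Return the names of the zones assigned to the given owner."""
    return [zone.name for zone in ZONES if zone.owner == owner]
```

Calling `zones_owned_by("business")` returns the landing and refinery zones, while `zones_owned_by("IT")` returns only production, which mirrors the shift of IT from gatekeeper to enabler.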
Proper data preparation is critical at every stage of the process. We’ve created a helpful summary of the zones and the ownership:
The bottom line
The right self-service data preparation tools will foster adoption of the data lake and give it the best chance of success. With next-generation tools, your non-technical users can access data in the big data ecosystem quickly, while also augmenting existing data governance policies and not jeopardizing security or accuracy.
Help ensure that the business is maximizing the potential of its resources by not relegating everything to technical employees. This way, other IT objectives are not compromised, and business users will be more fully engaged with the data lake, so that the entire organization reaps the benefits of the Hadoop data lake.
To learn more about how Trifacta fits into the context of your data lake, download our white paper, “Trifacta Data Wrangling for Hadoop: Accelerating Business Adoption While Ensuring Security & Governance.”