Extending Hadoop Towards the Data Lake

Table of Contents

  1. Summary
  2. The Promise of the Data Lake
  3. Is Hadoop Ready for This?
  4. Combining Hadoop and NoSQL
  5. Keeping the Data Lake Clean
  6. Key Takeaways
  7. About Paul Miller

1. Summary

The data lake has become an increasingly important part of Hadoop’s appeal. Referred to in some contexts as an “enterprise data hub,” it now attracts interest not only from Hadoop’s existing adopters but also from a far broader set of potential beneficiaries. The vision is of a single, comprehensive pool of data, managed by Hadoop and accessed as required by diverse applications such as Spark, Storm, and Hive. Such a pool offers opportunities to reduce duplication of data, increase efficiency, and create an environment in which data from very different sources can meaningfully be analyzed together.

Fully embracing the opportunity promised by a comprehensive data lake requires a shift in attitude and careful integration with the existing systems and workflows that Hadoop often augments rather than replaces. Existing enterprise concerns about governance and security will certainly not disappear, so suitable workflows must be developed to safeguard data while making it available for newly feasible forms of analysis.

Early adopters in a range of industries are already finding ways to exploit the potential of their data lakes, operationalizing internal analytic processes and integrating rich real-time analyses with more established batch processing tasks. They are integrating Hadoop into existing organizational workflows and addressing challenges around the completeness, cleanliness, validity, and protection of their data.

In this report, we explore several key issues that recur as significant in these successful data lake implementations.

Key findings in this report include:

  • As Hadoop continues to move beyond its MapReduce-based origins, the case for using it as a data lake, a single source of data for multiple applications and workloads, grows more persuasive.
  • Operational workloads, which are an important part of most large organizations’ data processing, place very different demands on IT infrastructure than the analytical batch processing traditionally associated with Hadoop.
  • Even when fully implemented, a Hadoop-based data lake augments rather than replaces existing IT systems of record such as the enterprise data warehouse.
  • Hadoop’s code is being hardened and enhanced in order to cope with the increasingly stringent requirements associated with security, compliance, and audit functions. Progress in all of these areas was required before commercial adopters—especially in heavily regulated sectors such as finance and health care—were comfortable deploying Hadoop for key workloads.
