Apache Hadoop: Is one cluster enough?

Table of Contents

  1. Summary
  2. Hadoop at work
  3. YARN as an enabler
  4. The Hadoop Distributed File System
  5. Moving beyond a single cluster
  6. Next steps
  7. Key takeaways
  8. About Paul Miller

1. Summary

The open-source Apache Hadoop project continues to evolve rapidly and is now capable of far more than its traditional use case of running a single MapReduce job on a single large volume of data. Projects like Apache YARN expand the types of workloads for which Hadoop is a viable and compelling solution, leading practitioners to think more creatively about the ways data is stored, processed, and made available for analysis.

Enthusiasm is growing in some quarters for the concept of a “data lake” — a single repository of data stored in the Hadoop Distributed File System (HDFS) and accessed by a number of applications for different purposes. Most of the prominent Hadoop vendors provide persuasive examples of this model at work but, unsurprisingly, the complexities of real-world deployment do not always neatly fit the idealized model of a single (huge) cluster working with a single (huge) data lake.

In this report we discuss some of the circumstances in which more complex requirements may exist, and explore a set of solutions emerging to address them.

Key findings from this report include:

  • YARN has been important in extending the range of suitable use cases for Hadoop.
  • Although mainstream Hadoop deployments still largely favor a single cluster, that simple model does not make sense in a range of technical, practical, and regulatory situations.
  • In these cases, deploying a number of independent clusters is more appealing, but this fragments the data lake and risks reducing the value of the whole approach. Techniques are now emerging to address this challenge by virtually recreating a seamless view across data stored in different physical locations.
