A guide to big data workload-management challenges

Table of Contents

  1. Summary
  2. Understanding the new class of applications
    1. New applications supporting new business models
    2. Requirements for new technology underpinnings
  3. Volume, velocity and variety of data
    1. Data volume
    2. Data velocity
    3. Data variety
  4. Real-time, massively scalable and closed-loop characteristics of applications
    1. Massively scalable
    2. Closing the loop
  5. Approaches to scalability
    1. Emerging NoSQL databases
    2. Cassandra
  6. Alternatives to Cassandra
    1. Oracle NoSQL
    2. HBase
    3. Sustainable versus nonsustainable differentiators
  7. Driving better decisions into online applications
    1. A note on the distinction between Cassandra and DataStax Enterprise
  8. Key takeaways
    1. Technology considerations
  9. Appendix A: understanding the assumptions in NoSQL databases as a class of systems relative to SQL databases
  10. About George Gilbert

1. Summary

The explosive growth in the volume, velocity, variety and complexity of data has challenged both traditional enterprise application vendors and companies built around online applications. In response, new applications have emerged that are real-time, massively scalable and equipped with closed-loop analytics. These applications require very different technology underpinnings from those of their predecessors.

Traditional applications had a common platform that captured business transactions. A software pipeline then extracted, cleansed and loaded that information into a data warehouse, which reorganized the data primarily to answer questions that were known in advance. Tying the answers back into better decisions in the form of transactions was mostly an offline, human activity.

The emerging class of applications requires new functionality that closes the loop between incoming transactions and the analytics that drive action on those transactions. Closing the loop between decisions and actions can take two forms: analytics can run directly on the transactional database in real time, or they can run as closely integrated but offline tasks on Hadoop. Hadoop typically supports data scientists who take in data that’s far more raw and unrefined than the type found in a traditional enterprise data warehouse. Working from raw data makes it easier to find new patterns, and those patterns define new analytic rules that are inserted back into the online database.
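For readers who want a concrete picture, the two forms of closing the loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `OnlineStore`, `offline_job`, the merchant and chargeback fields, and the threshold are all hypothetical, and plain in-memory objects stand in for the transactional database and for Hadoop.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class OnlineStore:
    """Hypothetical stand-in for the online transactional database."""
    transactions: list = field(default_factory=list)
    flagged_merchants: set = field(default_factory=set)  # current analytic rules

    def handle(self, txn: dict) -> str:
        """Real-time path: apply the current rules as the transaction is recorded."""
        decision = "review" if txn["merchant"] in self.flagged_merchants else "approve"
        self.transactions.append({**txn, "decision": decision})
        return decision


def offline_job(raw_history: list, min_chargebacks: int = 2) -> set:
    """Offline path (stand-in for a Hadoop job): mine raw history for a new rule.

    The rule here is an assumption chosen for illustration: flag any merchant
    with at least `min_chargebacks` chargebacks in the raw data.
    """
    counts = Counter(t["merchant"] for t in raw_history if t.get("chargeback"))
    return {merchant for merchant, n in counts.items() if n >= min_chargebacks}


# Made-up raw history that the offline job scans.
history = [
    {"merchant": "m1", "amount": 20, "chargeback": True},
    {"merchant": "m1", "amount": 35, "chargeback": True},
    {"merchant": "m2", "amount": 50, "chargeback": False},
]

store = OnlineStore()
before = store.handle({"merchant": "m1", "amount": 10})

# Close the loop: insert the newly found rule back into the online store.
store.flagged_merchants |= offline_job(history)
after = store.handle({"merchant": "m1", "amount": 10})
```

In this sketch the same incoming transaction is approved before the offline job runs and sent for review afterward, because the rule discovered offline now drives the real-time decision. A production system would replace both stand-ins with real services.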

This paper is targeted at technology-aware business executives, IT generalists and those who recognize that many emerging applications need new data-management foundations. The paper surveys this class of applications and its technology underpinnings relative to more-traditional offerings from several high-level perspectives: the characteristics of the data, the distinctiveness of the new class of applications, and the emerging database technologies — labeled NoSQL, for lack of a better term — that support them. Although the NoSQL label has been applied to many databases, this paper will focus on the class of systems with a rich data model typified by Cassandra. Other databases in this class include HBase, DynamoDB and Oracle NoSQL.