Real-time query for Hadoop democratizes access to big data analytics

Table of Contents

  1. Summary
  2. What’s driving the need for real-time analysis?
  3. Drivers for a more unified big data analytics platform
  4. Retail example: enhancing the customer experience
  5. Benefits of real-time query and the emerging unified big data analytics platform
  6. Impala under the covers: Hadoop as a closer complement to traditional RDBMS
  7. Toward a converged big data analytics platform
  8. Conclusion
  9. About George Gilbert

1. Summary

The delivery of real-time query makes Hadoop accessible to more users, by orders of magnitude. Its significance goes well beyond delivering the kind of database management system (DBMS) query engine that other products have had for decades. Rather, Hadoop as a platform now supports a whole new paradigm of analytics.

Real-time query is the catalyst for delivering a new level of self-service in analytics to a much broader audience. Interactive response and the accessibility of a structured query language (SQL) interface through open database connectivity/Java database connectivity (ODBC/JDBC) make the incremental discovery and enrichment of data possible for a greater and more varied audience of users than just data scientists. Hadoop can now reach an even wider array of users who are familiar with business intelligence tools such as Tableau and MicroStrategy.

That incremental discovery and enrichment process has two other major implications. First, it dramatically shortens the time between collecting data from source applications and extracting some signal from that data’s background noise. Second, it becomes a self-reinforcing exercise in crowdsourcing the process of refining meaning from the data. Both had previously been major bottlenecks in the exploitation of traditional data warehouses.

Hadoop’s traditional appeal

Historically, Hadoop has been a favorite among organizations needing to store, process, and analyze massive volumes of multistructured data cost-effectively. Its primary uses have included index building, pattern recognition across multisource data, analysis of machine data from sources such as sensors and communications networks, creation of profiles that support recommendation engines, and sentiment analysis.

However, several obstacles have limited the scope of Hadoop’s appeal. The MapReduce programming framework operated only in batch mode, even when supporting SQL queries based on Hive. And because Hadoop was a repository that collected unrefined data from many sources, with little structure or organization, data scientists were required to extract meaning from it.

The traditional appeal of RDBMS-based analytic applications

Relational database management systems (RDBMS) have traditionally been deployed as data warehouses for analytic applications when most of the questions were known up front. Their care and feeding required a sophisticated, multistep process and a lot of time. This process supported the need for strong information governance, verifiability, quality, traceability, and security.

Traditional data warehouses are ideal for a certain class of analytic applications. Their sweet spot includes both running the same reports and queries and tracking the same set of metrics over time. But when the questions change, things break, and large parts of the end-to-end process must be redeveloped, often starting with the collection of new source data.

Moving toward a more unified platform for big data analytics

With the introduction of real-time query, Hadoop has taken a major step toward unifying the majority of big data analytic applications onto one platform. With that opportunity in mind, this research paper targets information technology professionals who have in-depth experience with traditional RDBMS and seek to understand where the Hadoop ecosystem and big data analytics fit.

In discussing this topic, we will address the following:

  • What’s driving the need for real-time analysis? (Real time can be broken down as either interactive or streaming.)
  • What’s driving the need for a more unified platform for big data analytics?
  • What will customers be able to do when they fully implement real-time query?
  • What are four key benefits of real-time query across customer use cases?
  • What does Impala look like under the covers?
  • How can we move toward a converged big data analytics platform?