Implementing a Publish-Subscribe Distributed Notification System in Hadoop - CSI 2013

Jyotiska NK
December 14, 2013


  1. A Publish-Subscribe Distributed Notification System on Hadoop, by Jyotiska Nath Khasnabish

  2. Hadoop
    • Open source distributed framework for processing "Big Data".
    • Offers a distributed file system (HDFS) for storing massive amounts of data across clusters.
    • Uses MapReduce as the programming model for processing large amounts of data.
    • Adopted and used in production by 1000+ companies worldwide.
    • 20+ popular Hadoop-based subprojects and growing.
  3. Distributed Notification System
    • [HDFS-1742] discusses a system that could notify interested clients about major HDFS events (such as file creation and deletion) and MapReduce job completion.
    • [HDFS-2760] discusses adding a PubSub system on HDFS for sending notification messages to clients subscribed to specific services.
    • [HDFS-7821] discusses an event notification system which:
        • Provides periodic updates to subscribed users.
        • Lets users specify 'interesting events'.
        • Provides a 'customizable' and 'configurable' interface so that users can also subscribe to user-defined parameters.
  4. Publish Subscribe Model

  5. Messaging Systems
    Apache ActiveMQ
    • Uses JMS (Java Message Service) for sending and receiving messages.
    • Three components: Publisher, Broker, Subscriber.
    • Supports both persistent and non-persistent messaging.
    Apache Kafka
    • Developed by LinkedIn.
    • Three components: Producer, Broker, Consumer.
    • Supports both persistent and non-persistent messaging.
    • Uses ZooKeeper for coordination.
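The publisher, broker and subscriber roles listed above can be illustrated with a minimal in-memory sketch. It is not ActiveMQ or Kafka code: the `Broker` class, its method names and the topic names are hypothetical and exist only to show how a broker decouples publishers from subscribers.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a message broker (hypothetical, not ActiveMQ/Kafka)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback that receives every message published to topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver message to each subscriber of topic; return the delivery count."""
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
received = []

# The subscriber registers interest in one topic only.
broker.subscribe("hdfs.file.created", received.append)

# The publisher does not know who, if anyone, is listening.
delivered = broker.publish("hdfs.file.created", {"path": "/data/part-00000"})
ignored = broker.publish("hdfs.file.deleted", {"path": "/tmp/old"})
```

Here the second message reaches nobody because no client subscribed to that topic. A real broker would add network transport, persistence and delivery guarantees on top of this routing step.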
  6. Architecture

  7. Use Cases

  8. 1. Message Passing
    • Hadoop services can send status flags or progress reports of running jobs to one another.
    • Hadoop services can take the role of either a publisher or a subscriber.
    • Example: TaskTrackers notify the JobTracker of their status only when the status changes.
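The TaskTracker example can be sketched as a reporter that publishes only when the observed status differs from the last one it sent. The `StatusReporter` class, the status strings and the `publish` callable are assumptions made for illustration; they are not part of Hadoop's API.

```python
class StatusReporter:
    """Publishes a status message only when the status actually changes."""

    def __init__(self, tracker_id, publish):
        self.tracker_id = tracker_id
        self._publish = publish      # hypothetical callable that sends a message
        self._last_status = None

    def report(self, status):
        """Return True if a message was published, False if it was suppressed."""
        if status == self._last_status:
            return False
        self._last_status = status
        self._publish({"tracker": self.tracker_id, "status": status})
        return True


inbox = []  # stands in for the subscriber side (e.g. a JobTracker listener)
reporter = StatusReporter("tracker-1", inbox.append)

observed = ["IDLE", "IDLE", "RUNNING", "RUNNING", "RUNNING", "IDLE"]
sent = [reporter.report(status) for status in observed]
```

Six observations produce three messages in this run, which is the kind of traffic reduction the slide describes.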
  9. 2. Notification for Data Availability
    • Chained jobs get notified when a job they depend on completes.
    • No need to poll the NameNode for data availability in HDFS.
    • Multiple subscribed services or jobs can be notified when the data is available.
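A rough sketch of this use case, assuming a hypothetical `DataAvailabilityTopic` helper: interested jobs register a callback for an output path once, and all of them are notified when the path is announced, so none of them has to poll.

```python
class DataAvailabilityTopic:
    """Hypothetical helper: notifies waiting jobs when a path becomes available."""

    def __init__(self):
        self._waiting = {}  # path -> list of callbacks waiting for that path

    def when_available(self, path, callback):
        """Register callback to run once, when path is announced."""
        self._waiting.setdefault(path, []).append(callback)

    def announce(self, path):
        """Notify every waiter for path and return how many were notified."""
        callbacks = self._waiting.pop(path, [])
        for callback in callbacks:
            callback(path)
        return len(callbacks)


topic = DataAvailabilityTopic()
started = []

# Two downstream jobs wait for the same upstream output.
topic.when_available("/output/job-a", lambda p: started.append(("job-b", p)))
topic.when_available("/output/job-a", lambda p: started.append(("job-c", p)))

notified = topic.announce("/output/job-a")   # upstream job finished
repeat = topic.announce("/output/job-a")     # nobody is waiting any more
```

The path names and job names are placeholders. In a deployment the announcement would come from whichever service observes the upstream job finishing.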
  10. 3. Event Based Job Chaining
    • Multiple MapReduce jobs can be chained based on events occurring in the Hadoop cluster.
    • Easier for workflow managers to chain jobs and trigger workflows automatically.
    • Job dependencies are set automatically for heavily chained MapReduce jobs that together accomplish a complex computation.
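Event-based chaining can be sketched as a small dependency tracker that reacts to job-completion events. The `JobChainer` class and the job names are hypothetical, and the sketch only records which jobs become ready; it does not submit anything to a cluster.

```python
class JobChainer:
    """Tracks job dependencies and reports jobs that become ready to run."""

    def __init__(self, dependencies):
        # dependencies: job name -> iterable of upstream job names
        self._pending = {job: set(deps) for job, deps in dependencies.items()}
        self.launched = []  # jobs whose dependencies have all completed, in order

    def on_job_completed(self, finished):
        """Handle a completion event; return the jobs it unblocked."""
        ready = []
        for job, deps in self._pending.items():
            if finished in deps:
                deps.discard(finished)
                if not deps:
                    ready.append(job)
        # Remove unblocked jobs after the loop so the dict is not resized mid-iteration.
        for job in ready:
            del self._pending[job]
            self.launched.append(job)
        return ready


chainer = JobChainer({
    "join": {"extract_a", "extract_b"},   # needs both extracts
    "report": {"join"},                   # needs the join
})

first = chainer.on_job_completed("extract_a")
second = chainer.on_job_completed("extract_b")
third = chainer.on_job_completed("join")
```

Subscribing `on_job_completed` to a job-completion topic would let a workflow manager trigger each stage as soon as its inputs are done, rather than on a schedule.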
  11. Cluster Configuration

                          Machine #1      Machine #2      Machine #3
    Processing Speed      2.3 GHz         2.3 GHz         2.3 GHz
    RAM                   2 GB            2 GB            2 GB
    Disk Space            8 GB            8 GB            8 GB
    OS                    Ubuntu 12.04    Ubuntu 12.04    Ubuntu 12.04
    Hadoop Version        1.1.1           1.1.1           1.1.1
    ActiveMQ Version      5.8.0           5.8.0           5.8.0
    Kafka Version         0.8             0.8             0.8
  12. Performance Analysis: ActiveMQ vs Kafka

  13. Performance Analysis: Single Node vs Multi Node

  14. Performance Comparison With and Without Notification System

  15. Hadoop Cluster Load (Before and After)

  16. Network Bandwidth Consumption (Before and After)

  17. Mobile Client

  18. Conclusion
    • A distributed notification system based on the publish-subscribe messaging model.
    • Can be used to pass messages between services, notify subscribed clients and chain multiple jobs.
    • Significantly reduces cluster load and network bandwidth consumption, resulting in better use of hardware and resources.
    • Can be scaled to large Hadoop clusters (more than 100 or 1,000 nodes) to handle heavily inter-dependent jobs.
  19. Thank you