Implementing a Publish-Subscribe Distributed Notification System in Hadoop - CSI 2013

Jyotiska NK
December 14, 2013

Transcript

  1. Hadoop
     • Open source distributed framework for processing “Big Data”.
     • Offers a distributed file system (HDFS) for storing massive amounts of data across clusters.
     • Provides MapReduce as a programming model for processing large amounts of data.
     • Adopted and used in production by 1000+ companies worldwide.
     • 20+ popular Hadoop-based subprojects and growing.
  2. Distributed Notification System
     • [HDFS-1742] talks about a system that could notify interested clients about major HDFS events (such as file creation and deletion) and about MapReduce job completion.
     • [HDFS-2760] talks about adding a PubSub system on HDFS for sending notification messages to clients subscribed to specific services.
     • [HDFS-7821] talks about an event notification system that would:
         • Provide periodic updates to subscribed users.
         • Let users specify 'interesting events'.
         • Provide a 'customizable' and 'configurable' interface so that users can also subscribe to user-defined parameters.
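The idea of letting users specify 'interesting events' can be illustrated with a small sketch. This is not the design proposed in any of the tickets above; `HdfsEvent`, `EventNotifier`, the event kinds and the paths are all hypothetical names chosen for illustration. Each subscriber supplies a filter function, and the notifier delivers an event only to subscribers whose filter accepts it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class HdfsEvent:
    """Hypothetical event record; the field names are illustrative only."""
    kind: str  # e.g. "CREATE", "DELETE", "JOB_END"
    path: str


EventFilter = Callable[[HdfsEvent], bool]
EventCallback = Callable[[HdfsEvent], None]


class EventNotifier:
    """Delivers each event only to subscribers whose filter accepts it."""

    def __init__(self) -> None:
        self._subscribers: List[Tuple[EventFilter, EventCallback]] = []

    def subscribe(self, interested_in: EventFilter, callback: EventCallback) -> None:
        self._subscribers.append((interested_in, callback))

    def publish(self, event: HdfsEvent) -> int:
        """Notify matching subscribers and return how many were notified."""
        delivered = 0
        for interested_in, callback in self._subscribers:
            if interested_in(event):
                callback(event)
                delivered += 1
        return delivered


def demo() -> List[HdfsEvent]:
    """Subscribe to deletions under /logs/ and publish three events."""
    received: List[HdfsEvent] = []
    notifier = EventNotifier()
    notifier.subscribe(
        lambda e: e.kind == "DELETE" and e.path.startswith("/logs/"),
        received.append,
    )
    notifier.publish(HdfsEvent("CREATE", "/logs/a"))
    notifier.publish(HdfsEvent("DELETE", "/logs/a"))
    notifier.publish(HdfsEvent("DELETE", "/tmp/b"))
    return received
```

In this sketch only the deletion under `/logs/` reaches the subscriber; the other two events are dropped by the filter before any callback runs.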
  3. Messaging Systems
     Apache ActiveMQ
     • Uses JMS (Java Message Service) for sending and receiving messages.
     • Three components: Publisher, Broker, Subscriber.
     • Supports both persistent and non-persistent messaging.
     Apache Kafka
     • Developed by LinkedIn.
     • Three components: Producer, Broker, Consumer.
     • Supports both persistent and non-persistent messaging.
     • Uses ZooKeeper for coordination.
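Both systems share the same three-role shape: a publisher (producer) hands messages to a broker, and the broker forwards them to subscribers (consumers). The following is a minimal in-memory sketch of that shape only. It does not use the ActiveMQ or Kafka client APIs, and the `Broker` class and the topic names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Broker:
    """In-memory stand-in for a message broker that routes by topic name."""

    def __init__(self) -> None:
        self._topics: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a subscriber callback for one topic."""
        self._topics[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Forward a message to every subscriber of the topic.

        The publisher never references subscribers directly; it only
        knows the topic name. Returns the number of deliveries.
        """
        handlers = list(self._topics.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


def demo() -> dict:
    """Two subscribers on one topic, one on another (hypothetical topics)."""
    inbox = {"scheduler": [], "monitor": [], "auditor": []}
    broker = Broker()
    broker.subscribe("job-status", inbox["scheduler"].append)
    broker.subscribe("job-status", inbox["monitor"].append)
    broker.subscribe("hdfs-events", inbox["auditor"].append)

    broker.publish("job-status", "job_42 finished")
    broker.publish("hdfs-events", "created /data/in/part-0")
    return inbox
```

The decoupling is the point of the sketch: adding a third subscriber to `job-status` requires no change on the publishing side.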
  4. 1. Message Passing
     • Sending status flags or progress reports of running jobs among multiple Hadoop services.
     • Hadoop services can take the role of either a publisher or a subscriber.
     • Example:
         • TaskTrackers notify the JobTracker of their status only when that status changes.
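The TaskTracker example amounts to publishing on change instead of on every check. A minimal sketch of that rule follows, assuming a hypothetical `StatusReporter` wrapper and a plain callable standing in for the broker's publish operation; neither is part of Hadoop.

```python
from typing import Callable, List, Optional


class StatusReporter:
    """Publishes a status message only when the status differs from the last one sent."""

    def __init__(self, name: str, publish: Callable[[str], None]) -> None:
        self._name = name
        self._publish = publish
        self._last_status: Optional[str] = None

    def report(self, status: str) -> bool:
        """Return True if a message was published, False if it was suppressed."""
        if status == self._last_status:
            return False
        self._last_status = status
        self._publish(f"{self._name}:{status}")
        return True


def demo() -> List[str]:
    """Five status checks, but only two actual status changes."""
    sent: List[str] = []
    reporter = StatusReporter("tracker-1", sent.append)
    for observed in ["RUNNING", "RUNNING", "RUNNING", "SUCCEEDED", "SUCCEEDED"]:
        reporter.report(observed)
    return sent
```

Here five observations produce two messages, one per change, which is the source of the reduced message traffic in this use case.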
  5. 2. Notification for Data Availability
     • Chained jobs get notified about the completion of some other job on which they are dependent.
     • No need to poll the NameNode for data availability in the HDFS.
     • Multiple subscribed services or jobs can be notified when the data is available.
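The contrast with polling can be sketched as a registry of callbacks keyed by path. Waiting jobs register once and are called back when the producing job announces its output, so nothing queries the NameNode in a loop. `DataAvailabilityRegistry` and the path used below are hypothetical; this is an in-memory illustration, not an HDFS API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set

Callback = Callable[[str], None]


class DataAvailabilityRegistry:
    """Calls back every waiting subscriber once a path is announced as available."""

    def __init__(self) -> None:
        self._waiting: DefaultDict[str, List[Callback]] = defaultdict(list)
        self._available: Set[str] = set()

    def when_available(self, path: str, callback: Callback) -> None:
        """Run callback now if the data already exists, otherwise when it arrives."""
        if path in self._available:
            callback(path)
        else:
            self._waiting[path].append(callback)

    def mark_available(self, path: str) -> int:
        """Announce that data is ready; return how many subscribers were notified."""
        self._available.add(path)
        callbacks = self._waiting.pop(path, [])
        for callback in callbacks:
            callback(path)
        return len(callbacks)


def demo() -> List[str]:
    """Two jobs wait on the same (hypothetical) output path."""
    started: List[str] = []
    registry = DataAvailabilityRegistry()
    registry.when_available("/data/out/part-0", lambda p: started.append("job-A"))
    registry.when_available("/data/out/part-0", lambda p: started.append("job-B"))
    registry.mark_available("/data/out/part-0")
    # A late subscriber is notified immediately because the data is already there.
    registry.when_available("/data/out/part-0", lambda p: started.append("job-C"))
    return started
```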
  6. 3. Event Based Job Chaining
     • Multiple MapReduce jobs can be chained based on events occurring in the Hadoop cluster.
     • Easier for workflow managers to chain jobs and trigger workflows automatically.
     • Automatic setting of job dependencies for heavily chained MapReduce jobs in order to accomplish a complex computation.
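One way to picture event-based chaining is a small dependency tracker driven by job-completion events: each event removes the finished job from the pending dependencies of the others, and any job left with no pending dependencies is started. The `JobChainer` class and the job names are assumptions for illustration, not part of Hadoop or of any workflow manager.

```python
from typing import Dict, List, Set


class JobChainer:
    """Starts each job as soon as completion events have arrived for all its dependencies."""

    def __init__(self, dependencies: Dict[str, Set[str]]) -> None:
        # Copy the sets so the caller's mapping is not mutated.
        self._pending: Dict[str, Set[str]] = {
            job: set(deps) for job, deps in dependencies.items()
        }
        self.started: List[str] = []

    def start_ready_jobs(self) -> List[str]:
        """Start every job with no remaining dependencies (in name order)."""
        ready = sorted(job for job, deps in self._pending.items() if not deps)
        for job in ready:
            del self._pending[job]
            self.started.append(job)
        return ready

    def on_job_completed(self, finished_job: str) -> List[str]:
        """Handle a completion event and return the jobs it unblocked."""
        for deps in self._pending.values():
            deps.discard(finished_job)
        return self.start_ready_jobs()


def demo() -> List[str]:
    """A four-job chain: report needs clean and stats, which both need extract."""
    chainer = JobChainer({
        "extract": set(),
        "clean": {"extract"},
        "stats": {"extract"},
        "report": {"clean", "stats"},
    })
    chainer.start_ready_jobs()
    chainer.on_job_completed("extract")
    chainer.on_job_completed("clean")
    chainer.on_job_completed("stats")
    return chainer.started
```

In a real cluster the completion events would arrive through the broker; here they are simulated by direct calls to `on_job_completed`.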
  7. Cluster Configuration

                        Machine #1     Machine #2     Machine #3
     Processing Speed   2.3 GHz        2.3 GHz        2.3 GHz
     RAM                2 GB           2 GB           2 GB
     Disk Space         8 GB           8 GB           8 GB
     OS                 Ubuntu 12.04   Ubuntu 12.04   Ubuntu 12.04
     Hadoop Version     1.1.1          1.1.1          1.1.1
     ActiveMQ Version   5.8.0          5.8.0          5.8.0
     Kafka Version      0.8            0.8            0.8
  8. Conclusion
     • Distributed notification system based on the publish-subscribe messaging model.
     • Can be used to pass messages between services, notify subscribed clients and chain multiple jobs.
     • Reduces cluster load and network bandwidth consumption significantly, resulting in better use of hardware and resources.
     • Can be scaled to large Hadoop clusters (100 to 1000+ nodes) for handling heavily inter-dependent jobs.