Implementing a Publish-Subscribe Distributed Notification System in Hadoop - CSI 2013

A Publish-Subscribe Distributed Notification System on Hadoop
Jyotiska Nath Khasnabish
IIIT-Bangalore

Hadoop
• Open-source distributed framework for processing “Big Data”.
• Offers a distributed file system (HDFS) for storing massive amounts of data across clusters.
• Provides MapReduce as a programming model for processing large amounts of data.
• Adopted and used in production by 1000+ companies worldwide.
• 20+ popular Hadoop-based subprojects and growing.

Distributed Notification System
• [HDFS-1742] discusses a system that could notify interested clients about major HDFS events (such as file creation and deletion) and about MapReduce job completion.
• [HDFS-2760] discusses adding a PubSub system on HDFS for sending notification messages to clients subscribed to specific services.
• [HDFS-7821] discusses an event notification system which would:
  • Provide periodic updates to subscribed users.
  • Let users specify 'interesting events'.
  • Provide a 'customizable' and 'configurable' interface so that users can also subscribe to user-defined parameters.
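
The idea of letting subscribers specify 'interesting events' can be sketched as a predicate attached to each subscription. This is a minimal illustration only, not the design from the HDFS issues: the `HdfsEvent` and `EventNotifier` names, the event kinds, and the paths are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class HdfsEvent:
    kind: str   # illustrative event kinds such as "CREATE" or "DELETE"
    path: str


class EventNotifier:
    """Hypothetical notifier: delivers an event only to subscribers whose filter accepts it."""

    def __init__(self) -> None:
        self._subscriptions: List[
            Tuple[Callable[[HdfsEvent], bool], Callable[[HdfsEvent], None]]
        ] = []

    def subscribe(
        self,
        is_interesting: Callable[[HdfsEvent], bool],
        deliver: Callable[[HdfsEvent], None],
    ) -> None:
        self._subscriptions.append((is_interesting, deliver))

    def emit(self, event: HdfsEvent) -> int:
        """Send the event to every matching subscriber; return the delivery count."""
        count = 0
        for is_interesting, deliver in self._subscriptions:
            if is_interesting(event):
                deliver(event)
                count += 1
        return count


notifier = EventNotifier()
seen: List[HdfsEvent] = []
# This subscriber only cares about deletions under /logs/ (made-up path).
notifier.subscribe(
    lambda e: e.kind == "DELETE" and e.path.startswith("/logs/"),
    seen.append,
)
created = notifier.emit(HdfsEvent("CREATE", "/logs/a"))
deleted = notifier.emit(HdfsEvent("DELETE", "/logs/a"))
other = notifier.emit(HdfsEvent("DELETE", "/tmp/x"))
```

Only the second event reaches the subscriber; the other two are filtered out before delivery.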

Publish Subscribe Model

Messaging Systems
Apache ActiveMQ
• Uses JMS (Java Message Service) for sending and receiving messages.
• Three components: Publisher, Broker, Subscriber.
• Supports both persistent and non-persistent messaging.
Apache Kafka
• Developed by LinkedIn.
• Three components: Producer, Broker, Consumer.
• Supports both persistent and non-persistent messaging.
• Uses ZooKeeper for coordination.
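
Both systems share the same three-role shape: a publisher (producer), a broker, and a subscriber (consumer). The sketch below shows that shape with a toy in-memory broker. It does not use the ActiveMQ or Kafka client APIs; the `Broker` class and the topic names are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-memory broker: routes each published message to the topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        """Deliver the message to every subscriber of the topic; return the delivery count."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
received: List[str] = []

# Subscriber side: register interest in one topic (topic name is made up).
broker.subscribe("hdfs.file.created", received.append)

# Publisher side: the publisher never needs to know who is listening.
delivered = broker.publish("hdfs.file.created", "/data/part-00000")
ignored = broker.publish("hdfs.file.deleted", "/tmp/old")  # no subscribers on this topic
```

The publisher and subscriber are coupled only through the topic name, which is what allows a Hadoop service to take either role.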

Architecture

Use Cases

1. Message Passing
• Sending status flags or progress reports of running jobs among multiple Hadoop services.
• Hadoop services can take the role of either a publisher or a subscriber.
• Example: TaskTrackers notify the JobTracker of their status only when there is a status change.
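
The TaskTracker example amounts to suppressing reports that repeat the last known status. A minimal sketch of that idea follows; `StatusPublisher`, the tracker id, and the status strings are hypothetical and are not Hadoop APIs, and a plain list stands in for the JobTracker's subscription.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (tracker_id, status)


class StatusPublisher:
    """Publishes a tracker's status only when it differs from the last one sent."""

    def __init__(self, send: Callable[[Message], None]) -> None:
        self._send = send
        self._last_status: Dict[str, str] = {}

    def report(self, tracker_id: str, status: str) -> bool:
        """Return True if a message was published, False if it was suppressed."""
        if self._last_status.get(tracker_id) == status:
            return False  # unchanged status: nothing is sent
        self._last_status[tracker_id] = status
        self._send((tracker_id, status))
        return True


inbox: List[Message] = []  # stands in for the JobTracker's subscription
publisher = StatusPublisher(inbox.append)
for status in ["RUNNING", "RUNNING", "RUNNING", "SUCCEEDED"]:
    publisher.report("tracker-1", status)
```

Four reports are made but only the two status changes are published, which is the kind of message reduction this use case targets.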

2. Notification for Data Availability
• Chained jobs get notified about the completion of some other job on which they are dependent.
• No need to poll the NameNode for data availability in HDFS.
• Multiple subscribed services or jobs can be notified when the data is available.
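
Replacing polling with notification can be sketched as a registry of callbacks keyed by output path: consumers register once, and the producing job announces the path when it finishes. The `DataAvailabilityNotifier` class, the job names, and the paths below are assumptions for illustration, not part of Hadoop.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


class DataAvailabilityNotifier:
    """Holds callbacks per path and fires them once the path is announced."""

    def __init__(self) -> None:
        self._waiting: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)
        self._available: Set[str] = set()

    def when_available(self, path: str, callback: Callable[[str], None]) -> None:
        if path in self._available:
            callback(path)  # data is already there: notify immediately
        else:
            self._waiting[path].append(callback)

    def announce(self, path: str) -> int:
        """Called by the producing job; returns how many waiting subscribers were notified."""
        self._available.add(path)
        callbacks = self._waiting.pop(path, [])
        for callback in callbacks:
            callback(path)
        return len(callbacks)


notifier = DataAvailabilityNotifier()
log: List[str] = []
notifier.when_available("/out/job1", lambda p: log.append("report-job saw " + p))
notifier.when_available("/out/job1", lambda p: log.append("export-job saw " + p))
notified = notifier.announce("/out/job1")  # one announcement, two consumers notified
```

Neither consumer ever asks whether the data exists; each is called exactly once when the producer announces it.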

3. Event Based Job Chaining
• Multiple MapReduce jobs can be chained based on events occurring in the Hadoop cluster.
• Easier for workflow managers to chain jobs and trigger workflows automatically.
• Automatic setting of job dependencies for heavily chained MapReduce jobs in order to accomplish a complex computation.
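
Event-based chaining can be sketched as a dependency table that is updated by job-completion events: a job starts as soon as the last job it depends on reports completion. The `JobChain` class and the job names are hypothetical, and "starting" a job here just records its name where a real system would submit it to the cluster.

```python
from typing import Dict, List, Set


class JobChain:
    """Starts each job once completion events for all of its dependencies have arrived."""

    def __init__(self, dependencies: Dict[str, Set[str]]) -> None:
        self._pending = {job: set(deps) for job, deps in dependencies.items()}
        self.started: List[str] = []

    def start_ready_jobs(self) -> None:
        # sorted() copies the keys, so deleting entries inside the loop is safe.
        for job in sorted(self._pending):
            if not self._pending[job]:
                del self._pending[job]
                self.started.append(job)  # a real system would submit the job here

    def on_job_completed(self, finished: str) -> None:
        """Event handler: clear the finished job from every dependency set."""
        for deps in self._pending.values():
            deps.discard(finished)
        self.start_ready_jobs()


chain = JobChain({
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"clean", "aggregate"},
})
chain.start_ready_jobs()  # only "extract" has no dependencies

# Simulate each started job finishing and emitting its completion event.
i = 0
while i < len(chain.started):
    chain.on_job_completed(chain.started[i])
    i += 1
```

No component tracks the whole workflow at run time; the order emerges from the completion events alone.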

Cluster Configuration
Parameter           Machine #1     Machine #2     Machine #3
Processing Speed    2.3 GHz        2.3 GHz        2.3 GHz
RAM                 2 GB           2 GB           2 GB
Disk Space          8 GB           8 GB           8 GB
OS                  Ubuntu 12.04   Ubuntu 12.04   Ubuntu 12.04
Hadoop Version      1.1.1          1.1.1          1.1.1
ActiveMQ Version    5.8.0          5.8.0          5.8.0
Kafka Version       0.8            0.8            0.8

Performance Analysis: ActiveMQ vs Kafka

Performance Analysis: Single Node vs Multi Node

Performance Comparison With and Without Notification System

Hadoop Cluster Load: Before vs After

Network Bandwidth Consumption: Before vs After

Mobile Client

Conclusion
• Distributed notification system based on the Publish-Subscribe messaging model.
• Can be used to pass messages between services, notify subscribed clients, and chain multiple jobs.
• Reduces cluster load and network bandwidth consumption significantly, resulting in more efficient use of hardware and resources.
• Can be scaled to large Hadoop clusters (more than 100 or even 1000 nodes) for handling heavily inter-dependent jobs.

Thank you
