Slide 1

Slide 1 text

A Publish-Subscribe Distributed Notification System on Hadoop
Jyotiska Nath Khasnabish
IIIT-Bangalore

Slide 2

Slide 2 text

Hadoop
● Open-source distributed framework for processing "Big Data".
● Offers a distributed file system (HDFS) for storing massive amounts of data across clusters.
● Uses MapReduce as the programming model for processing large amounts of data.
● Adopted and used in production by 1000+ companies worldwide.
● 20+ popular Hadoop-based subprojects and growing.

Slide 3

Slide 3 text

Distributed Notification System
● [HDFS-1742] discusses a system that could notify interested clients about major HDFS events (file creation, deletion, etc.) and about MapReduce job completion.
● [HDFS-2760] discusses adding a PubSub system on HDFS for sending notification messages to clients subscribed to specific services.
● [HDFS-7821] discusses an event notification system that would:
    ● Provide periodic updates to subscribed users.
    ● Let users specify 'interesting events'.
    ● Provide a 'customizable' and 'configurable' interface so that users can also subscribe to user-defined parameters.
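
The idea of letting users specify 'interesting events' can be sketched as a subscription that carries a filter. This is only a minimal in-memory illustration, not code from any of the JIRA proposals: the `Event` and `NotificationService` names, the event kinds, and the paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str   # hypothetical event kinds, e.g. "FILE_CREATED", "FILE_DELETED"
    path: str   # HDFS path the event refers to (illustrative values only)


class NotificationService:
    """Hypothetical in-memory service: delivers an event only to interested subscribers."""

    def __init__(self):
        self._subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        # The predicate is the user's definition of an 'interesting event'.
        self._subscriptions.append((predicate, callback))

    def publish(self, event):
        # Returns how many subscribers received the event.
        delivered = 0
        for predicate, callback in self._subscriptions:
            if predicate(event):
                callback(event)
                delivered += 1
        return delivered


service = NotificationService()
inbox = []

# This client only cares about files created under /logs/.
service.subscribe(
    lambda e: e.kind == "FILE_CREATED" and e.path.startswith("/logs/"),
    inbox.append,
)

service.publish(Event("FILE_CREATED", "/logs/2013-01-01.gz"))
service.publish(Event("FILE_DELETED", "/logs/old.gz"))
service.publish(Event("FILE_CREATED", "/tmp/scratch"))
```

Only the first event ends up in `inbox`; the other two are filtered out before delivery, so the client never sees them.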

Slide 4

Slide 4 text

Publish Subscribe Model

Slide 5

Slide 5 text

Messaging Systems

Apache ActiveMQ
● Uses JMS (Java Message Service) for sending and receiving messages.
● Three components: Publisher, Broker, Subscriber.
● Supports both persistent and non-persistent messaging.

Apache Kafka
● Originally developed at LinkedIn.
● Three components: Producer, Broker, Consumer.
● Supports both persistent and non-persistent messaging.
● Uses ZooKeeper for coordination.
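
The difference between persistent and non-persistent messaging can be illustrated with a toy broker. The sketch below is loosely modeled on the JMS notion of durable subscribers; it does not use the ActiveMQ or Kafka APIs, and the `Broker` class, its methods, and the topic and subscriber names are assumptions made for illustration.

```python
from collections import defaultdict, deque


class Broker:
    """Toy broker: persistent messages are kept for durable subscribers that are offline."""

    def __init__(self):
        self._online = defaultdict(dict)    # topic -> {subscriber name: callback}
        self._durable = defaultdict(set)    # topic -> names of durable subscribers
        self._backlog = defaultdict(lambda: defaultdict(deque))  # topic -> name -> messages

    def subscribe(self, topic, name, callback, durable=False):
        self._online[topic][name] = callback
        if durable:
            self._durable[topic].add(name)
            # Replay anything stored while this subscriber was disconnected.
            queue = self._backlog[topic][name]
            while queue:
                callback(queue.popleft())

    def disconnect(self, topic, name):
        self._online[topic].pop(name, None)

    def publish(self, topic, message, persistent=False):
        # Subscribers that are connected right now always get the message.
        for callback in list(self._online[topic].values()):
            callback(message)
        # Persistent messages are additionally stored for offline durable subscribers.
        if persistent:
            for name in self._durable[topic]:
                if name not in self._online[topic]:
                    self._backlog[topic][name].append(message)


broker = Broker()
received = []

broker.subscribe("jobs", "monitor", received.append, durable=True)
broker.disconnect("jobs", "monitor")

broker.publish("jobs", "job-1 finished", persistent=True)   # stored for later
broker.publish("jobs", "heartbeat", persistent=False)        # lost: nobody is online

broker.subscribe("jobs", "monitor", received.append, durable=True)  # backlog replayed
```

After the subscriber reconnects, `received` holds only the persistent message; the non-persistent heartbeat was dropped because no subscriber was connected when it was published.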

Slide 6

Slide 6 text

Architecture

Slide 7

Slide 7 text

Use Cases

Slide 8

Slide 8 text

1. Message Passing
● Sending status flags or progress reports of running jobs among multiple Hadoop services.
● Hadoop services can take the role of either a publisher or a subscriber.
● Example: TaskTrackers notify the JobTracker only when their status changes.
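
Publishing only on a status change can be sketched as a small wrapper around whatever publish call the messaging system provides. This is an assumption-laden illustration: `StatusReporter`, the status strings, and the message layout are hypothetical and do not correspond to Hadoop's TaskTracker code.

```python
class StatusReporter:
    """Hypothetical publisher that stays silent while its status is unchanged."""

    def __init__(self, name, publish):
        self.name = name
        self._publish = publish   # any callable that sends a message to the broker
        self._last_status = None

    def observe(self, status):
        # Returns True if a message was published, False if it was suppressed.
        if status == self._last_status:
            return False
        self._last_status = status
        self._publish({"source": self.name, "status": status})
        return True


messages = []  # stands in for the broker in this sketch
tracker = StatusReporter("tasktracker-1", messages.append)

for status in ["IDLE", "IDLE", "RUNNING", "RUNNING", "RUNNING", "IDLE"]:
    tracker.observe(status)
```

Six observations produce three messages (IDLE, RUNNING, IDLE), because repeated readings of the same status are never sent.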

Slide 9

Slide 9 text

2. Notification for Data Availability
● Chained jobs get notified about the completion of some other job on which they are dependent.
● No need to poll the NameNode for data availability in HDFS.
● Multiple subscribed services or jobs can be notified when the data is available.
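
Replacing polling with a notification can be sketched as follows. Consumers register interest in a path once and are called back when a producer announces it; nothing here talks to a real NameNode, and `AvailabilityTopic`, its method names, and the example path are hypothetical.

```python
from collections import defaultdict


class AvailabilityTopic:
    """Hypothetical registry: callbacks fire once when a path is announced as available."""

    def __init__(self):
        self._waiting = defaultdict(list)   # path -> callbacks still waiting
        self._available = set()             # paths already announced

    def when_available(self, path, callback):
        if path in self._available:
            callback(path)                  # data is already there: notify immediately
        else:
            self._waiting[path].append(callback)

    def mark_available(self, path):
        # Called by the producer (e.g. an upstream job) when its output is complete.
        self._available.add(path)
        for callback in self._waiting.pop(path, []):
            callback(path)


topic = AvailabilityTopic()
log = []

# Two independent consumers wait for the same dataset without polling.
topic.when_available("/data/daily/part-00000", lambda p: log.append(("report-job", p)))
topic.when_available("/data/daily/part-00000", lambda p: log.append(("export-job", p)))

topic.mark_available("/data/daily/part-00000")
```

Both consumers are notified by the single `mark_available` call, in the order they subscribed, and neither had to ask repeatedly whether the data existed.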

Slide 10

Slide 10 text

3. Event Based Job Chaining
● Multiple MapReduce jobs can be chained based on events occurring in the Hadoop cluster.
● Easier for workflow managers to chain jobs and trigger workflows automatically.
● Job dependencies are set automatically for heavily chained MapReduce jobs that together accomplish a complex computation.
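
Event-based chaining can be sketched as a dependency table that reacts to "job finished" events. The example is a simplified illustration under assumed names (`JobChain`, the job names, and the event handler are hypothetical); "starting" a job here only records its name rather than submitting anything to a cluster.

```python
from collections import defaultdict


class JobChain:
    """Hypothetical chain: a job starts once all of its prerequisites have finished."""

    def __init__(self):
        self._pending = {}                     # job -> prerequisites not yet finished
        self._dependents = defaultdict(list)   # job -> jobs waiting on it
        self.started = []                      # jobs in the order they were started

    def add_job(self, name, depends_on=()):
        self._pending[name] = set(depends_on)
        for dependency in depends_on:
            self._dependents[dependency].append(name)

    def start_ready(self):
        # Start every job that has no prerequisites at all.
        for name, pending in self._pending.items():
            if not pending and name not in self.started:
                self.started.append(name)

    def on_job_finished(self, name):
        # Handler for a "job finished" notification delivered by the broker.
        for child in self._dependents[name]:
            self._pending[child].discard(name)
            if not self._pending[child] and child not in self.started:
                self.started.append(child)


chain = JobChain()
chain.add_job("extract")
chain.add_job("clean", depends_on=["extract"])
chain.add_job("index", depends_on=["extract"])
chain.add_job("join", depends_on=["clean", "index"])

chain.start_ready()               # starts "extract"
chain.on_job_finished("extract")  # unblocks "clean" and "index"
chain.on_job_finished("clean")    # "join" still waits for "index"
chain.on_job_finished("index")    # now "join" starts
```

The resulting start order is extract, clean, index, join: each downstream job is triggered by the completion event of its last outstanding prerequisite rather than by a scheduler polling for state.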

Slide 11

Slide 11 text

Cluster Configuration

                    Machine #1      Machine #2      Machine #3
Processing Speed    2.3 GHz         2.3 GHz         2.3 GHz
RAM                 2 GB            2 GB            2 GB
Disk Space          8 GB            8 GB            8 GB
OS                  Ubuntu 12.04    Ubuntu 12.04    Ubuntu 12.04
Hadoop Version      1.1.1           1.1.1           1.1.1
ActiveMQ Version    5.8.0           5.8.0           5.8.0
Kafka Version       0.8             0.8             0.8

Slide 12

Slide 12 text

Performance Analysis: ActiveMQ vs Kafka

Slide 13

Slide 13 text

Performance Analysis: Single Node vs Multi Node

Slide 14

Slide 14 text

Performance Comparison With and Without Notification System

Slide 15

Slide 15 text

Hadoop Cluster Load: Before vs. After

Slide 16

Slide 16 text

Network Bandwidth Consumption: Before vs. After

Slide 17

Slide 17 text

Mobile Client

Slide 18

Slide 18 text

Conclusion
● Distributed notification system based on the Publish-Subscribe messaging model.
● Can be used to pass messages between services, notify subscribed clients, and chain multiple jobs.
● Reduces cluster load and network bandwidth consumption significantly, resulting in better use of hardware and resources.
● Can be scaled to large Hadoop clusters (more than 100 or 1000 nodes) for handling heavily inter-dependent jobs.

Slide 19

Slide 19 text

Thank you