• High Performance
• Distributed
• Fault Tolerant
• Scalable

• Easy for developers.
• Messages without loss.
• Messages in sequence.
Not enough!
• Storm is an open source distributed real-time computation system.
• Storm makes it easy to reliably process unbounded streams of data.
• Storm does for real-time processing what Hadoop did for batch processing.
• Simple, and can be used with any programming language.
• Similar platforms: S4, MillWheel, etc.
• A Storm cluster has 3 sets of nodes:
• Nimbus node (master)
  • Uploads computations for execution
  • Distributes code across the cluster
  • Launches workers across the cluster
  • Monitors computation and reallocates workers as needed
• Zookeeper nodes
  • Coordinate the Storm cluster
• Supervisor nodes
  • Communicate with Nimbus through Zookeeper; start and stop workers according to signals from Nimbus
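Each node learns these roles from a shared configuration file. As a rough illustration only, a minimal storm.yaml for a Supervisor node might look like the sketch below (hostnames are placeholders; key names follow the 0.9.x-era configuration):

    # Hypothetical storm.yaml sketch; hostnames are placeholders.
    storm.zookeeper.servers:            # Zookeeper ensemble coordinating the cluster
      - "zk1.example.com"
      - "zk2.example.com"
    nimbus.host: "nimbus.example.com"   # master node that distributes code and work
    supervisor.slots.ports:             # one worker slot per port on this node
      - 6700
      - 6701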
• The work is delegated to different types of components, each responsible for a simple, specific processing task.
• The input stream of a Storm cluster is handled by a component called a spout.
• The spout passes the data to a component called a bolt, which transforms it in some way.
• A bolt either persists the data in some sort of storage, or passes it to some other bolt.
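As a concrete (hypothetical) sketch of this spout-to-bolt pipeline, the Java snippet below wires one spout into one bolt using the classic backtype.storm TopologyBuilder API; SentenceSpout and FilterBolt are illustrative components sketched on the following slides:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;

    public class WordTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // Spout: handles the input stream of the cluster.
            builder.setSpout("sentences", new SentenceSpout(), 1);
            // Bolt: consumes the spout's stream and transforms it.
            builder.setBolt("filter", new FilterBolt(), 2)
                   .shuffleGrouping("sentences");

            // Run in-process for testing; a real deployment would use StormSubmitter.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-topology", new Config(), builder.createTopology());
        }
    }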
Spouts are sources of streams:
• Read from a Kestrel/Kafka queue. {tuples = events}
• Read from an HTTP server log. {tuples = http requests}
• Read from the Twitter streaming API. {tuples = tweets}
Refer: http://www.slideshare.net/KrishnaGade2/storm-at-twitter
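A rough sketch of a custom spout: it extends BaseRichSpout and emits one tuple per call to nextTuple(). The in-memory queue here is a stand-in for a real Kestrel/Kafka client:

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        // Placeholder for a real message source (e.g. a Kafka consumer).
        private BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            String event = queue.poll();
            if (event != null) {
                // The message ID enables Storm's ack/replay tracking (see later slides).
                collector.emit(new Values(event), UUID.randomUUID().toString());
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }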
Bolts consume an input stream and produce an output stream:
• Filtering tuples in a stream
• Aggregation of tuples
• Joining multiple streams
• Arbitrary functions on streams
• Communication with external caches/DBs
Refer: http://www.slideshare.net/KrishnaGade2/storm-at-twitter
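A minimal filtering bolt, sketched with BaseBasicBolt (which acks input tuples automatically); the "storm" keyword filter is purely illustrative:

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class FilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String sentence = input.getStringByField("sentence");
            // Filtering: only pass tuples that mention "storm" downstream.
            if (sentence.toLowerCase().contains("storm")) {
                collector.emit(new Values(sentence));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }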
• Field grouping – groups tuples by a field.
• All grouping – replicates the stream to all tasks.
• Global grouping – sends the entire stream to one task.
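Groupings are chosen per bolt when the topology is wired together; a sketch with hypothetical component names:

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 2);

    // Field grouping: tuples with the same "word" value go to the same task.
    builder.setBolt("count", new CountBolt(), 4)
           .fieldsGrouping("words", new Fields("word"));

    // All grouping: every tuple is replicated to all tasks of the bolt.
    builder.setBolt("metrics", new MetricsBolt(), 3)
           .allGrouping("words");

    // Global grouping: the entire stream is sent to a single task.
    builder.setBolt("report", new ReportBolt(), 1)
           .globalGrouping("count");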
• At most once delivery (S4).
• Exactly once delivery (Storm).
  • E.g., CPC (Cost Per Click) billing of a search engine for advertisements.
• Convenience for app developers.
1. Storm attaches a random 64-bit “message ID” to each new tuple that flows through the system.
2. New tuples can be produced when processing a tuple (e.g., a tuple that contains an entire Tweet is split by a bolt into a set of trending topics, producing one tuple per topic for the input tuple). Such new tuples are assigned a new random 64-bit ID, and the list of tuple IDs is also retained in a provenance tree associated with the output tuple.
3. When a tuple finally leaves the topology, a backflow mechanism is used to ack the tasks that contributed to that output tuple. This backflow mechanism eventually reaches the spout that started the tuple processing in the first place, at which point it can retire the tuple.
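From the developer's side, this provenance tree is built by "anchoring" emitted tuples to their input tuple; a sketch with BaseRichBolt (SplitTopicsBolt and extractTopics are hypothetical names, and the whitespace split stands in for real topic extraction):

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    public class SplitTopicsBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tweet) {
            for (String topic : extractTopics(tweet.getStringByField("tweet"))) {
                // Anchoring: passing the input tuple links each new tuple into
                // the provenance tree, so failures are replayed from the spout.
                collector.emit(tweet, new Values(topic));
            }
            // Ack tells the acker that this processing step is complete.
            collector.ack(tweet);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("topic"));
        }

        private String[] extractTopics(String tweet) {
            // Hypothetical helper; a real implementation would detect trending topics.
            return tweet.split("\\s+");
        }
    }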
• A naive implementation would require keeping track of the lineage for each tuple: for each tuple, its source tuple IDs must be retained until the end of the processing of that tuple.
• Novel implementation:
  • Using bitwise XORs. New message IDs are XORed and sent to the acker bolt along with the original tuple message ID and a timeout parameter. Thus, the acker bolt keeps track of all the tuples.
  • When the processing of a tuple is completed or acked, its message ID as well as its original tuple message ID is sent to the acker bolt.
  • The acker bolt locates the original tuple and its XOR checksum. This XOR checksum is again XORed with the acked tuple ID. When the XOR checksum goes to zero, the acker bolt sends the final ack to the spout that admitted the tuple.
  • The spout then knows that this tuple has been fully processed.
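The key algebraic fact is that x XOR x = 0: XORing every tuple ID into a checksum once when the tuple is created and once when it is acked drives the checksum to zero exactly when all tuples are accounted for. A standalone sketch of that bookkeeping (not Storm's actual acker code):

    import java.util.Random;

    public class XorAckSketch {
        public static void main(String[] args) {
            Random rnd = new Random();
            long[] tupleIds = {rnd.nextLong(), rnd.nextLong(), rnd.nextLong()};

            long checksum = 0L;
            // Each tuple ID is XORed in when the tuple is created...
            for (long id : tupleIds) checksum ^= id;
            // ...and XORed in again when the tuple is acked.
            for (long id : tupleIds) checksum ^= id;

            // x ^ x == 0, so the checksum is zero iff every created tuple was acked.
            System.out.println(checksum == 0); // prints: true
        }
    }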
• Fast – benchmarked at processing one million 100-byte messages per second per node.
• Scalable – parallel calculations run across a cluster of machines.
• Fault-tolerant – when workers die, Storm automatically restarts them. If a node dies, the worker is restarted on another node.
• Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
• Easy to operate – standard configurations are suitable for production on day one; once deployed, Storm is easy to operate.
Source: http://hortonworks.com/hadoop/storm/
Use cases at Twitter:
• Computing tweet features for search result ranking.
• Real-time analytics for ads.
• Internal log processing.