• High Performance
• Distributed
• Fault Tolerant
• Scalable

• Easy for developers.
• Messages without loss.
• Messages in sequence.
Not enough!
• Storm is an open source distributed real-time computation system.
• Storm makes it easy to reliably process unbounded streams of data.
• Storm does for real-time processing what Hadoop did for batch processing.
• Simple, and can be used with any programming language.
• Similar platforms: S4, MillWheel, etc.
• A Storm cluster has 3 sets of nodes:
• Nimbus node (master)
  • Uploads computations for execution
  • Distributes code across the cluster
  • Launches workers across the cluster
  • Monitors computation and reallocates workers as needed
• Zookeeper nodes
  • Coordinate the Storm cluster
• Supervisor nodes
  • Communicate with Nimbus through Zookeeper; start and stop workers according to signals from Nimbus
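Each node learns these roles from a shared configuration file. As a rough illustration only, a minimal storm.yaml for a Supervisor node might look like the sketch below (hostnames are placeholders; key names follow the 0.9.x-era configuration):

    # Hypothetical storm.yaml sketch; hostnames are placeholders.
    storm.zookeeper.servers:            # Zookeeper ensemble coordinating the cluster
      - "zk1.example.com"
      - "zk2.example.com"
    nimbus.host: "nimbus.example.com"   # master node that distributes code and work
    supervisor.slots.ports:             # one worker slot per port on this node
      - 6700
      - 6701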
• The work is delegated to different types of components, each responsible for a simple, specific processing task.
• The input stream of a Storm cluster is handled by a component called a spout.
• The spout passes the data to a component called a bolt, which transforms it in some way.
• A bolt either persists the data in some sort of storage, or passes it to some other bolt.
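As a concrete (hypothetical) sketch of this spout-to-bolt pipeline, the Java snippet below wires one spout into one bolt using the classic backtype.storm TopologyBuilder API; SentenceSpout and FilterBolt are illustrative components sketched on the following slides:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;

    public class WordTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // Spout: handles the input stream of the cluster.
            builder.setSpout("sentences", new SentenceSpout(), 1);
            // Bolt: consumes the spout's stream and transforms it.
            builder.setBolt("filter", new FilterBolt(), 2)
                   .shuffleGrouping("sentences");

            // Run in-process for testing; a real deployment would use StormSubmitter.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-topology", new Config(), builder.createTopology());
        }
    }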
Spouts are sources of streams:
• Read from a Kestrel/Kafka queue. {tuples = events}
• Read from an HTTP server log. {tuples = http requests}
• Read from the Twitter streaming API. {tuples = tweets}
Refer: http://www.slideshare.net/KrishnaGade2/storm-at-twitter
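A rough sketch of a custom spout: it extends BaseRichSpout and emits one tuple per call to nextTuple(). The in-memory queue here is a stand-in for a real Kestrel/Kafka client:

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        // Placeholder for a real message source (e.g. a Kafka consumer).
        private BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            String event = queue.poll();
            if (event != null) {
                // The message ID enables Storm's ack/replay tracking (see later slides).
                collector.emit(new Values(event), UUID.randomUUID().toString());
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }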
Bolts consume an input stream and produce an output stream:
• Filtering tuples in a stream
• Aggregation of tuples
• Joining multiple streams
• Arbitrary functions on streams
• Communication with external caches/DBs
Refer: http://www.slideshare.net/KrishnaGade2/storm-at-twitter
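A minimal filtering bolt, sketched with BaseBasicBolt (which acks input tuples automatically); the "storm" keyword filter is purely illustrative:

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class FilterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            String sentence = input.getStringByField("sentence");
            // Filtering: only pass tuples that mention "storm" downstream.
            if (sentence.toLowerCase().contains("storm")) {
                collector.emit(new Values(sentence));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }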
• Field grouping – groups tuples by a field.
• All grouping – replicates the stream to all tasks.
• Global grouping – sends the entire stream to one task.
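Groupings are chosen per bolt when the topology is wired together; a sketch with hypothetical component names:

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 2);

    // Field grouping: tuples with the same "word" value go to the same task.
    builder.setBolt("count", new CountBolt(), 4)
           .fieldsGrouping("words", new Fields("word"));

    // All grouping: every tuple is replicated to all tasks of the bolt.
    builder.setBolt("metrics", new MetricsBolt(), 3)
           .allGrouping("words");

    // Global grouping: the entire stream is sent to a single task.
    builder.setBolt("report", new ReportBolt(), 1)
           .globalGrouping("count");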
• At most once delivery (S4).
• Exactly once delivery (Storm).
  • E.g., CPC (Cost Per Click) billing of a search engine for advertisements.
• Convenience for app developers.
1. Storm attaches a random 64-bit “message ID” to each new tuple that flows through the system.
2. New tuples can be produced when processing a tuple (e.g., a tuple that contains an entire Tweet is split by a bolt into a set of trending topics, producing one tuple per topic for the input tuple). Such new tuples are assigned a new random 64-bit ID, and the list of tuple IDs is also retained in a provenance tree associated with the output tuple.
3. When a tuple finally leaves the topology, a backflow mechanism is used to ack the tasks that contributed to that output tuple. This backflow mechanism eventually reaches the spout that started the tuple processing in the first place, at which point it can retire the tuple.
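From the developer's side, this provenance tree is built by "anchoring" emitted tuples to their input tuple; a sketch with BaseRichBolt (SplitTopicsBolt and extractTopics are hypothetical names, and the whitespace split stands in for real topic extraction):

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import java.util.Map;

    public class SplitTopicsBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tweet) {
            for (String topic : extractTopics(tweet.getStringByField("tweet"))) {
                // Anchoring: passing the input tuple links each new tuple into
                // the provenance tree, so failures are replayed from the spout.
                collector.emit(tweet, new Values(topic));
            }
            // Ack tells the acker that this processing step is complete.
            collector.ack(tweet);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("topic"));
        }

        private String[] extractTopics(String tweet) {
            // Hypothetical helper; a real implementation would detect trending topics.
            return tweet.split("\\s+");
        }
    }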
• A naive implementation would require keeping track of the lineage for each tuple: for each tuple, its source tuple IDs must be retained until the end of the processing of that tuple.
• Novel implementation:
  • Using bitwise XORs. New message IDs are XORed and sent to the acker bolt along with the original tuple message ID and a timeout parameter. Thus, the acker bolt keeps track of all the tuples.
  • When the processing of a tuple is completed or acked, its message ID as well as its original tuple message ID is sent to the acker bolt.
  • The acker bolt locates the original tuple and its XOR checksum. This XOR checksum is again XORed with the acked tuple ID. When the XOR checksum goes to zero, the acker bolt sends the final ack to the spout that admitted the tuple.
  • The spout then knows that this tuple has been fully processed.
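The key algebraic fact is that x XOR x = 0: XORing every tuple ID into a checksum once when the tuple is created and once when it is acked drives the checksum to zero exactly when all tuples are accounted for. A standalone sketch of that bookkeeping (not Storm's actual acker code):

    import java.util.Random;

    public class XorAckSketch {
        public static void main(String[] args) {
            Random rnd = new Random();
            long[] tupleIds = {rnd.nextLong(), rnd.nextLong(), rnd.nextLong()};

            long checksum = 0L;
            // Each tuple ID is XORed in when the tuple is created...
            for (long id : tupleIds) checksum ^= id;
            // ...and XORed in again when the tuple is acked.
            for (long id : tupleIds) checksum ^= id;

            // x ^ x == 0, so the checksum is zero iff every created tuple was acked.
            System.out.println(checksum == 0); // prints: true
        }
    }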
• Fast – benchmarked at processing one million 100-byte messages per second per node.
• Scalable – parallel calculations run across a cluster of machines.
• Fault-tolerant – when workers die, Storm automatically restarts them. If a node dies, the worker is restarted on another node.
• Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
• Easy to operate – standard configurations are suitable for production on day one; once deployed, Storm is easy to operate.
Source: http://hortonworks.com/hadoop/storm/
Use cases at Twitter:
• Computing tweet features for search result ranking.
• Real-time analytics for ads.
• Internal log processing.