Slide 1

Into the Storm
An introduction to and overview of Apache Storm
Oliver Hall, Engineer, MetaBroadcast

Slide 2

What is Storm?
● "free and open source distributed realtime computation platform"
● tasks made from nodes, spread over multiple physical hosts
● at-least-once guarantee for message processing
● fault tolerant

Slide 3

Who uses it?
● Twitter
● Groupon
● Ooyala
● Taobao
● Alibaba
● and, of course... MetaBroadcast

Slide 4

What are we using it for?
● labelling
● statistics
● impressions counting
● and potentially much more...

Slide 5

History
● developed by Nathan Marz at BackType
● BackType was acquired by Twitter
● initial release in September 2011
● currently at version 0.8, still under development

Slide 6

So, what is it?

Slide 7

Overview
● runs on a cluster of machines
● consists of topologies (continuously running processing tasks)
● the cluster is a series of nodes
  ○ a master
  ○ one or more workers

Slide 8

Master Node
● the master node runs Nimbus
● distributes code around the cluster
● monitors for failures

Slide 9

Worker Node
● runs a Supervisor
● listens for work assigned to its machine
● starts / stops worker processes as necessary
● each worker process runs a sub-section of a topology
● a topology is therefore multiple worker processes spread across several machines

Slide 10

ZooKeeper
● Nimbus and the Supervisors communicate via a ZooKeeper cluster

Slide 11

Topologies
● a graph of computation nodes
● links show data flow

Slide 12

Spouts
● source of data
● reliable or unreliable (see the sketch below)
● can emit tuples to one or more streams
(diagram: several data sources feeding a spout)
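
Not on the original slide: a minimal sketch of what "reliable" means here, using an illustrative spout whose class name, fields and sentences are invented, while the BaseRichSpout methods are Storm's own. Emitting with a message id lets Storm call ack or fail on the spout later; an unreliable spout simply omits the id.

import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Illustrative reliable spout: every emit carries a message id, so Storm can
// call ack() or fail() once the tuple tree it produced succeeds or fails.
public class ReliableSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Queue<String> pending;          // sentences waiting to be emitted
    private Map<UUID, String> inFlight;     // message id -> sentence awaiting ack

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.pending = new LinkedList<String>(Arrays.asList(
                "the cow jumped over the moon",
                "four score and seven years ago"));
        this.inFlight = new HashMap<UUID, String>();
    }

    @Override
    public void nextTuple() {
        String sentence = pending.poll();
        if (sentence == null) return;
        UUID msgId = UUID.randomUUID();
        inFlight.put(msgId, sentence);
        // the second argument (a message id) is what makes this spout reliable;
        // an unreliable spout would just call collector.emit(new Values(sentence))
        collector.emit(new Values(sentence), msgId);
    }

    @Override
    public void ack(Object msgId) {
        inFlight.remove(msgId);              // fully processed downstream
    }

    @Override
    public void fail(Object msgId) {
        pending.add(inFlight.remove(msgId)); // put it back to be replayed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}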

Slide 13

Bolts
● where all Storm processing occurs
● filters, aggregations, functions, database calls, and more (see the sketch below)
(diagram: input tuples flowing into processing, producing one or more streams of output tuples)
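
Not on the original slide: the word-count bolt on slide 17 extends BaseBasicBolt, which anchors and acks tuples automatically. As a sketch of what that convenience hides, here is an illustrative bolt built on BaseRichBolt that anchors its output to the input tuple and acks by hand; this anchoring is what backs the at-least-once guarantee from slide 2. The class and field names are invented for illustration.

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Illustrative bolt that upper-cases a sentence, anchoring and acking manually
public class UpperCaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // anchoring the output to the input ties it into Storm's tuple tree,
        // so a downstream failure causes the spout's original tuple to be replayed
        collector.emit(input, new Values(input.getString(0).toUpperCase()));
        collector.ack(input);   // mark this input as fully handled by this bolt
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}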

Slide 14

Topologies (again)
● topologies tell bolts and spouts where to send their data
● can parallelize every step
N.B. you can define spouts, bolts and topologies in many languages, including non-JVM languages

Slide 15

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("spout", new RandomSentenceSpout(), 5);

builder.setBolt("split", new SplitSentence(), 8)
       .shuffleGrouping("spout");

builder.setBolt("count", new WordCount(), 12)
       .fieldsGrouping("split", new Fields("word"));

Config conf = new Config();
conf.setMaxTaskParallelism(3);

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());
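
Not on the original slide: the snippet above runs the topology in-process via LocalCluster, which suits development and testing. Submitting the same topology to a running cluster goes through StormSubmitter instead; a minimal sketch, assuming the builder and conf created above:

// Submit to a real cluster rather than an in-process LocalCluster.
// StormSubmitter.submitTopology throws checked Storm exceptions, so call it
// from a method that declares or handles them.
StormSubmitter.submitTopology("word-count", conf, builder.createTopology());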

Slide 16

public class RandomSentenceSpout extends BaseRichSpout {
    ...
    @Override
    public void nextTuple() {
        Utils.sleep(100);
        String[] sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature" };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        _collector.emit(new Values(sentence));
    }
    ...
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

Slide 17

public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

Slide 18

Trident
● a high-level abstraction on top of Storm
● allows high-throughput, stateful stream processing (think Pig or Cascading)
● exactly-once message semantics

Slide 19

Exactly-once
● tuples are processed in small batches
● each batch has a unique batch id (transaction id)
● state updates are ordered among batches (see the sketch below)
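
Not from the original slides: a minimal sketch of why ordered batches with unique ids make state updates idempotent when both spout and state are transactional. The counter below is hypothetical, not a Storm class; it stores the transaction id of the last batch it applied next to the value, so a replayed batch (same txid, same tuples) can simply be skipped.

// Hypothetical sketch, not Storm's actual API: a transactional counter that
// keeps (count, lastTxid) so a replayed batch is applied exactly once.
class TransactionalCounter {
    long count;
    long lastTxid = -1;

    // apply the partial count from one batch; safe to call again on replay
    void applyBatch(long txid, long batchCount) {
        if (txid == lastTxid) {
            return; // this exact batch was already applied, so skip it
        }
        // batches are applied in order, so an unseen txid means new work
        count += batchCount;
        lastTxid = txid;
    }
}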

Slide 20

Trident Topologies
● a stream is channelled through a number of processing stages
● each stage can be a filter, aggregation, function, or other similar process
● sound familiar?
● individual steps are combined into spouts / bolts at runtime

Slide 21

What are functions?
● basic building blocks in Trident (see also the filter sketch below)
(diagram: input tuples → processing → output tuples)

public class Split extends BaseFunction {
    public void execute(TridentTuple tuple, TridentCollector collector) {
        String sentence = tuple.getString(0);
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }
}
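
Not on the original slides: slide 20 also lists filters as a processing stage, so here is a companion sketch to the Split function above. BaseFilter and isKeep are Trident's API; the PositiveCount class and the assumption that the selected field holds a numeric count are invented for illustration.

import storm.trident.operation.BaseFilter;
import storm.trident.tuple.TridentTuple;

// Hypothetical filter: keep only tuples whose selected field is a positive count
public class PositiveCount extends BaseFilter {
    @Override
    public boolean isKeep(TridentTuple tuple) {
        return tuple.getLong(0) > 0;   // field 0 is whichever field .each() selects
    }
}

It would be attached to a stream with something like .each(new Fields("count"), new PositiveCount()), which drops any tuple for which isKeep returns false.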

Slide 22

Example Trident Topology

TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .parallelismHint(6);

Slide 23

DRPC
● Distributed Remote Procedure Calls
● executes an RPC across a Storm cluster
● the query is transformed into a tuple, then flows through a topology (see the example below)
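
Not on the original slide: the Trident tutorial (linked on the final slide) pairs the word-count topology from slide 22 with a DRPC stream that answers "what is the total count of these words?". A sketch along those lines, assuming the topology, wordCounts state and Split function from the earlier slides; MapGet, FilterNull and Sum are Trident's built-in operations:

// DRPC query stream: split the query arguments into words, look each word up
// in the persistent wordCounts state, drop words with no count, and sum them
topology.newDRPCStream("words")
    .each(new Fields("args"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
    .each(new Fields("count"), new FilterNull())
    .aggregate(new Fields("count"), new Sum(), new Fields("sum"));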

Slide 24

Trident State
● means of persistence, either in memory or in a store such as Cassandra
● state updates are idempotent in the face of retries or failures (see the sketch below)
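
Not from the original slides: with an opaque transactional spout a replayed batch may contain different tuples, so skipping an already-seen txid (as in the sketch after slide 19) is not enough. Opaque transactional state therefore stores the previous value alongside the current one and recomputes on replay. A hypothetical counter, not Storm's actual state classes:

// Hypothetical sketch of the opaque-state update rule: keep (prev, curr, txid)
// so a replay of the batch that produced curr can be recomputed from prev,
// even if that batch's contents have changed.
class OpaqueCounter {
    long prev;
    long curr;
    long lastTxid = -1;

    void applyBatch(long txid, long batchCount) {
        if (txid == lastTxid) {
            // same batch replayed, possibly with different tuples:
            // discard the earlier attempt and recompute from the previous value
            curr = prev + batchCount;
        } else {
            // a new batch: the current value becomes the rollback point
            prev = curr;
            curr += batchCount;
            lastTxid = txid;
        }
    }
}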

Slide 25

Trident Spouts
● can be one of three types
  ○ Non-transactional
  ○ Transactional
  ○ Opaque Transactional

Slide 26

Achieving exactly-once semantics

Spout type              Non-transactional state   Transactional state   Opaque Transactional state
Non-transactional       No                        No                    No
Transactional           No                        Yes                   Yes
Opaque Transactional    No                        No                    Yes

Slide 27

Issues
● Storm does have some negative points
  ○ lack of documentation
  ○ logging issues
  ○ no testing framework for Trident
  ○ rapidly changing
● however, it is still early days

Slide 28

Summary
● Storm is a scalable, resilient, realtime computation platform
● Trident offers very strong (up to exactly-once) message guarantees
● still an evolving technology - much may change

Slide 29

Thank you! Any questions?

Images from the Storm Tutorial - https://github.com/nathanmarz/storm/wiki/Tutorial