Slide 1

Data Stream Processing with Storm
Mariusz Gil

Slide 2

No content

Slide 3

Data expires fast. Very fast.

Slide 4

No content

Slide 5

realtime processing?

Slide 6

Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

Slide 7

Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Slide 8

conceptual architecture

Slide 9

Stream: an unbounded sequence of tuples, e.g. (val1, val2), (val3, val4), (val5, val6), ...
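A tuple is just a named list of values: the names come from the declared output fields, the values from whatever a component emits. A minimal sketch (declarer and collector are the usual Storm callbacks; the field names are illustrative, not from the deck):

    // declare the schema: every tuple in this stream carries two fields
    declarer.declare(new Fields("val1", "val2"));

    // emit tuples matching that schema
    collector.emit(new Values("a", 1));
    collector.emit(new Values("b", 2));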

Slide 10

Spouts: sources of streams.

Slide 11

Reliable and unreliable spouts: replay or forget about tuples.
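In code the difference is only the emit overload used. A minimal sketch, assuming a sentence value and an application-chosen msgId (both illustrative):

    // reliable: anchor the tuple to a message id, so Storm can later call
    // ack(msgId) or fail(msgId) on the spout and the tuple can be replayed
    _collector.emit(new Values(sentence), msgId);

    // unreliable: no message id, Storm forgets the tuple once emitted
    // and will never replay it
    _collector.emit(new Values(sentence));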

Slide 12

Spouts: sources of streams. Integration: Storm-Kafka.
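Wiring Kafka in is mostly configuration. A minimal sketch, assuming the storm-kafka module's SpoutConfig/KafkaSpout API; the ZooKeeper address, topic, and ids are placeholders:

    // locate the Kafka brokers through ZooKeeper
    BrokerHosts hosts = new ZkHosts("zk1.example.com:2181");
    // topic to consume, ZooKeeper root for offset storage, consumer id
    SpoutConfig config = new SpoutConfig(hosts, "sentences", "/kafka-spout", "sentence-reader");
    // decode raw Kafka messages as plain strings
    config.scheme = new SchemeAsMultiScheme(new StringScheme());
    builder.setSpout("kafka-spout", new KafkaSpout(config), 4);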

Slide 13

Spouts: sources of streams. Integration: Storm-Kestrel.

Slide 14

Spouts: sources of streams. Integration: Storm-AMQP-Spout.

Slide 15

Spouts: sources of streams. Integration: Storm-JMS.

Slide 16

Spouts: sources of streams. Integration: Storm-PubSub*.

Slide 17

Spouts: sources of streams. Integration: Storm-Beanstalkd-Spout.

Slide 18

Bolts: process input streams and produce new streams.

Slide 19

Bolts: process input streams and produce new streams; bolts can be chained, each consuming the streams of the previous one.

Slide 20

Topologies: networks of spouts and bolts. TextSpout -> [sentence] -> SplitSentenceBolt -> [word] -> WordCountBolt -> [word, count]

Slide 21

Topologies: networks of spouts and bolts. The word-count pipeline again (TextSpout -> [sentence] -> SplitSentenceBolt -> [word] -> WordCountBolt -> [word, count]), extended with a second TextSpout and an xyzBolt consuming the same streams.

Slide 22

server architecture

Slide 23

Nimbus: the master process, responsible for distributing processing across the cluster.

Slide 24

Supervisors: worker processes responsible for executing a subset of a topology.

Slide 25

ZooKeeper: the coordination layer between Nimbus and the Supervisors.

Slide 26

Fail fast: cluster state is stored locally or in ZooKeeper, so the Nimbus and Supervisor daemons can be killed and restarted safely.

Slide 27

sample code

Slide 28

Spouts

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector _collector;
    Random _rand;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        _collector = collector;
        _rand = new Random();
    }

    @Override
    public void nextTuple() {
        // emit a random sentence roughly every 100 ms
        Utils.sleep(100);
        String[] sentences = new String[] {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away",
            "four score and seven years ago",
            "snow white and the seven dwarfs",
            "i am at two with nature" };
        String sentence = sentences[_rand.nextInt(sentences.length)];
        _collector.emit(new Values(sentence));
    }

    @Override
    public void ack(Object id) {
    }

    @Override
    public void fail(Object id) {
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

Slide 29

Bolts

public static class WordCount extends BaseBasicBolt {
    // in-memory running count per word
    Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) {
            count = 0;
        }
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

Slide 30

Bolts

public static class ExclamationBolt implements IRichBolt {
    OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tuple) {
        _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
        _collector.ack(tuple);
    }

    public void cleanup() {
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    public Map getComponentConfiguration() {
        return null;
    }
}

Slide 31

Topology

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            // submit to a real cluster
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // run in-process for local testing
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}

Slide 32

Bolts

public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
        // delegate tuple processing to an external Python process
        super("python", "splitsentence.py");
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

splitsentence.py:

import storm

class SplitSentenceBolt(storm.BasicBolt):
    def process(self, tup):
        words = tup.values[0].split(" ")
        for word in words:
            storm.emit([word])

SplitSentenceBolt().run()

Slide 33

github.com/nathanmarz/storm-starter

Slide 34

stream grouping

Slide 35

Topology

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        // shuffle grouping: sentences are spread randomly across the split tasks
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        // fields grouping: the same word always reaches the same counting task
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}

Slide 36

Grouping: shuffle, fields, all, global, none, direct, local-or-shuffle.
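The grouping is chosen per input when a bolt is wired into the topology. A sketch against TopologyBuilder; the bolt classes other than SplitSentence and WordCount are hypothetical placeholders:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 5);
// shuffle: tuples are distributed randomly and evenly across tasks
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
// fields: tuples with the same "word" value always reach the same task
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
// all: every tuple is replicated to every task of the bolt
builder.setBolt("stats", new StatsBolt()).allGrouping("split");
// global: the entire stream goes to a single task
builder.setBolt("total", new TotalBolt()).globalGrouping("count");
// none: no routing preference (currently equivalent to shuffle)
builder.setBolt("logger", new LoggerBolt()).noneGrouping("split");
// direct: the emitting task decides which consumer task gets the tuple
builder.setBolt("router", new RouterBolt()).directGrouping("split");
// local or shuffle: prefer tasks in the same worker process, else shuffle
builder.setBolt("mirror", new MirrorBolt()).localOrShuffleGrouping("split");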

Slide 37

distributed rpc

Slide 38

Distributed RPC: the client sends arguments and receives results; inside the topology each request flows as [request-id, arguments] and comes back as [request-id, results].

Slide 39

Distributed RPC: [request-id, arguments] in, [request-id, results] out.

public static class ExclaimBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String input = tuple.getString(1);
        collector.emit(new Values(tuple.getValue(0), input + "!"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "result"));
    }
}

public static void main(String[] args) throws Exception {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
    builder.addBolt(new ExclaimBolt(), 3);

    Config conf = new Config();
    LocalDRPC drpc = new LocalDRPC();
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));
    System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));
    cluster.shutdown();
    drpc.shutdown();
}
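Against a deployed cluster the LocalDRPC stub is replaced by the DRPC client. A minimal sketch; the host name is a placeholder and 3772 is the default DRPC port:

// connect to a remote DRPC server and invoke the "exclamation" function
DRPCClient client = new DRPCClient("drpc.example.com", 3772);
String result = client.execute("exclamation", "hello");  // "hello!"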

Slide 40

Realtime analytics, personalization, search, revenue optimization, monitoring.

Slide 41

Content search, realtime analytics, generating feeds. Integrated with Elasticsearch, HBase, Hadoop and HDFS.

Slide 42

Realtime scoring, moments generation. Integrated with Kafka queues and HDFS storage.

Slide 43

Storm-YARN enables Storm applications to utilize the computational resources of a Hadoop cluster, along with accessing Hadoop storage resources such as HBase and HDFS.

Slide 44

thanks!
mail: [email protected]
twitter: @mariuszgil