Streams processing with Storm

REVISITED version

Mariusz Gil

July 06, 2013

Transcript

  1. Data streams processing with STORM
    Mariusz Gil


  3. data expire fast. very fast


  5. realtime processing?


  6. Storm is a free and open source distributed realtime
    computation system. Storm makes it easy to reliably
    process unbounded streams of data, doing for realtime
    processing what Hadoop did for batch processing.


  7. Storm is fast: a benchmark clocked it at over a million
    tuples processed per second per node. It is scalable,
    fault-tolerant, guarantees your data will be processed,
    and is easy to set up and operate.

  8. concept architecture


  9. Stream
    unbounded sequence of tuples
    [diagram: a stream of tuples, e.g. (val1, val2), (val3, val4), (val5, val6)]
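
A tuple is simply a named, ordered list of values, and the field names form the schema of the stream. A minimal sketch of that idea (the field names and values are made up for illustration; package names follow the 2013-era backtype.storm API, newer releases use org.apache.storm):

    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class TupleSketch {
        public static void main(String[] args) {
            // the schema of a stream: an ordered list of field names
            Fields schema = new Fields("user", "score");
            // one tuple's values, positionally aligned with the schema
            Values tuple = new Values("mariusz", 42);
            // field name to position lookup, which is what getXxxByField(...) relies on
            System.out.println(tuple.get(schema.fieldIndex("score"))); // prints 42
        }
    }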

  10. Spouts
    source of streams
    [diagram: a spout emitting a stream of tuples]

  11. Reliable and unreliable Spouts
    replay or forget about a tuple
    [diagram: a spout emitting a stream of tuples]
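
The difference between the two spout types is whether a tuple is emitted with a message id. When it is, Storm later calls ack or fail with that id, and the spout can forget or replay the message. A minimal sketch under assumptions (the pending map and the hard-coded sentence stand in for a real message source; package names follow the 2013-era backtype.storm API):

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public class ReliableSentenceSpout extends BaseRichSpout {
        SpoutOutputCollector _collector;
        // tuples emitted but not yet acked, keyed by message id
        Map<String, String> _pending;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            _collector = collector;
            _pending = new ConcurrentHashMap<String, String>();
        }

        @Override
        public void nextTuple() {
            String sentence = "the cow jumped over the moon"; // stand-in for a real source
            String msgId = UUID.randomUUID().toString();
            _pending.put(msgId, sentence);
            // emitting with a message id makes the tuple tracked, i.e. reliable
            _collector.emit(new Values(sentence), msgId);
        }

        @Override
        public void ack(Object msgId) {
            _pending.remove(msgId); // fully processed: forget about the tuple
        }

        @Override
        public void fail(Object msgId) {
            String sentence = _pending.get(msgId);
            if (sentence != null) {
                _collector.emit(new Values(sentence), msgId); // replay the tuple
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

An unreliable spout simply calls emit(new Values(sentence)) without a message id and leaves ack and fail empty, as the RandomSentenceSpout later in this deck does.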

  12. Spouts
    source of streams: Storm-Kafka

  13. Spouts
    source of streams: Storm-Kestrel

  14. Spouts
    source of streams: Storm-AMQP-Spout

  15. Spouts
    source of streams: Storm-JMS

  16. Spouts
    source of streams: Storm-PubSub*

  17. Spouts
    source of streams: Storm-Beanstalkd-Spout
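
Wiring one of these ready-made spouts into a topology is a one-liner on the builder. A sketch for Kafka (class names follow the storm-kafka module as later bundled with Apache Storm; early storm-kafka releases used slightly different constructors, and the ZooKeeper address, topic and ids are placeholders):

    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.TopologyBuilder;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class KafkaSpoutWiring {
        public static void main(String[] args) {
            // ZooKeeper ensemble used by the Kafka brokers (placeholder address)
            BrokerHosts hosts = new ZkHosts("zookeeper1:2181");

            // topic to read, ZK root for storing offsets, and a consumer id (placeholders)
            SpoutConfig spoutConfig = new SpoutConfig(hosts, "sentences", "/kafka-offsets", "sentence-reader");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
            // downstream bolts attach to "kafka-spout" exactly as with any other spout
        }
    }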

  18. Bolts
    process input streams and produce new streams
    [diagram: a bolt consuming a stream of tuples and emitting a new one]

  19. Bolts
    process input streams and produce new streams
    [diagram: several bolts chained together, each consuming and emitting streams of tuples]

  20. Topologies
    network of spouts and bolts
    [diagram: TextSpout -[sentence]-> SplitSentenceBolt -[word]-> WordCountBolt -> [word, count]]

  21. Topologies
    network of spouts and bolts
    [diagram: two TextSpout/SplitSentenceBolt branches emitting [sentence] and [word] tuples;
    one branch feeds a WordCountBolt producing [word, count], the other feeds an xyzBolt]

  22. server architecture

  23. Nimbus
    master process responsible for distributing work across the cluster

  24. Supervisors
    worker processes responsible for executing a subset of a topology

  25. ZooKeeper
    coordination layer between Nimbus and the Supervisors

  26. fail fast
    cluster state is stored locally or in ZooKeeper

  27. sample code


  28. Spouts
    public class RandomSentenceSpout extends BaseRichSpout {
        SpoutOutputCollector _collector;
        Random _rand;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            _collector = collector;
            _rand = new Random();
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            String[] sentences = new String[] {
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away",
                "four score and seven years ago",
                "snow white and the seven dwarfs",
                "i am at two with nature"};
            String sentence = sentences[_rand.nextInt(sentences.length)];
            _collector.emit(new Values(sentence));
        }

        @Override
        public void ack(Object id) {
        }

        @Override
        public void fail(Object id) {
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

  29. Bolts
    public static class WordCount extends BaseBasicBolt {
        Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) count = 0;
            count++;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

  30. Bolts
    public static class ExclamationBolt implements IRichBolt {
        OutputCollector _collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        public void execute(Tuple tuple) {
            // anchor the emitted tuple to its input and ack it manually
            // (BaseBasicBolt does both automatically, a raw IRichBolt does not)
            _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
            _collector.ack(tuple);
        }

        public void cleanup() {
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

        public Map getComponentConfiguration() {
            return null;
        }
    }

  31. Topology
    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8)
                   .shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12)
                   .fieldsGrouping("split", new Fields("word"));

            Config conf = new Config();
            conf.setDebug(true);

            if (args != null && args.length > 0) {
                conf.setNumWorkers(3);
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                conf.setMaxTaskParallelism(3);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("word-count", conf, builder.createTopology());
                Thread.sleep(10000);
                cluster.shutdown();
            }
        }
    }

  32. Bolts
    public static class SplitSentence extends ShellBolt implements IRichBolt {
        public SplitSentence() {
            super("python", "splitsentence.py");
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

        // required by IRichBolt
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }

    # splitsentence.py
    import storm

    class SplitSentenceBolt(storm.BasicBolt):
        def process(self, tup):
            words = tup.values[0].split(" ")
            for word in words:
                storm.emit([word])

    SplitSentenceBolt().run()

  33. github.com/nathanmarz/storm-starter


  34. stream groupings

  35. Topology
    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8)
                   .shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12)
                   .fieldsGrouping("split", new Fields("word"));

            Config conf = new Config();
            conf.setDebug(true);

            if (args != null && args.length > 0) {
                conf.setNumWorkers(3);
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                conf.setMaxTaskParallelism(3);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("word-count", conf, builder.createTopology());
                Thread.sleep(10000);
                cluster.shutdown();
            }
        }
    }

  36. Grouping
    shuffle, fields, all, global, none, direct, local or shuffle
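
Each name on the slide maps to a method on the input declarer returned by setBolt. A sketch reusing the spout and bolts from the word-count example (the bolt ids and parallelism hints are only illustrative, and WordCount is reused for the extra consumers just to keep the sketch self-contained):

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class GroupingExamples {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);

            // shuffle: tuples are distributed randomly but evenly across the bolt's tasks
            builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");

            // fields: tuples with the same "word" value always go to the same task
            builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

            // all: every task of the bolt receives a copy of every tuple
            builder.setBolt("broadcast", new WordCount(), 2).allGrouping("split");

            // global: the entire stream goes to a single task
            builder.setBolt("total", new WordCount(), 1).globalGrouping("count");

            // local or shuffle: prefer tasks in the same worker process, otherwise shuffle
            builder.setBolt("log", new WordCount(), 4).localOrShuffleGrouping("count");

            // none currently behaves like shuffle, and direct lets the emitter choose
            // the receiving task: noneGrouping(...) / directGrouping(...)
        }
    }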

  37. distributed rpc


  38. Distributed RPC
    [diagram: arguments go in, results come back; inside the topology tuples carry
    [request-id, arguments] on the way in and [request-id, results] on the way out]

  39. Distributed RPC
    public static class ExclaimBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String input = tuple.getString(1);
            collector.emit(new Values(tuple.getValue(0), input + "!"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
        builder.addBolt(new ExclaimBolt(), 3);

        Config conf = new Config();
        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));
        System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));

        cluster.shutdown();
        drpc.shutdown();
    }
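
Outside of the LocalDRPC used above, a remote client invokes the same function by name through the cluster's DRPC server. A sketch under assumptions (the host name is a placeholder, 3772 is the default DRPC port, and DRPCClient(host, port) is the pre-1.0 backtype.storm constructor):

    import backtype.storm.utils.DRPCClient;

    public class ExclamationClient {
        public static void main(String[] args) throws Exception {
            // connect to one of the cluster's DRPC servers (placeholder host, default port)
            DRPCClient client = new DRPCClient("drpc-server.example.com", 3772);

            // blocks until the "exclamation" topology returns the result for this request
            String result = client.execute("exclamation", "hello");
            System.out.println("Results for 'hello': " + result);
        }
    }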

  40. realtime analytics
    personalization
    search
    revenue optimization
    monitoring

  41. content search
    realtime analytics
    generating feeds
    integrated with Elasticsearch, HBase, Hadoop and HDFS

  42. realtime scoring
    moments generation
    integrated with Kafka queues and HDFS storage

  43. Storm-YARN enables Storm applications to utilize the computational
    resources in a Hadoop cluster along with accessing Hadoop storage
    resources such as HBase and HDFS

  44. thanks!
    mail: [email protected]
    twitter: @mariuszgil
