
Streams processing with Storm

REVISITED version

Mariusz Gil

July 06, 2013
Transcript

  1. Data streams processing with STORM Mariusz Gil

  3. data expires fast. very fast

  5. realtime processing?

  6. Storm is a free and open source distributed realtime computation

    system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
  7. Storm is fast, a benchmark clocked it at over a

    million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
  8. concept architecture

  9. Stream: an unbounded sequence of tuples, e.g. (val1, val2), (val3, val4), (val5, val6), …
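The tuple concept above can be sketched in plain Java. This is a conceptual toy only (a hypothetical `TupleSketch` class, not Storm's real `Tuple` interface, which is richer): a tuple is a named, ordered list of values, and a stream is an unbounded sequence of such tuples.

```java
// Conceptual sketch only, not Storm's API: a tuple pairs an ordered list of
// field names with an ordered list of values.
public class TupleSketch {
    private final String[] fields;
    private final Object[] values;

    public TupleSketch(String[] fields, Object[] values) {
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by field name (Storm's Tuple offers a similar
    // getValueByField accessor).
    public Object getValueByField(String field) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                return values[i];
            }
        }
        throw new IllegalArgumentException("no such field: " + field);
    }
}
```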
  10. Spouts: sources of streams
  11. Reliable and unreliable spouts: replay or forget about a tuple
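The reliable-spout idea can be sketched outside Storm's API. This hypothetical class (not Storm's `IRichSpout`) shows the bookkeeping a reliable spout does: remember each emitted tuple under a message id until it is acked, and keep it available for replay on failure. An unreliable spout simply skips this bookkeeping and forgets tuples as soon as they are emitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of reliable-spout bookkeeping (not Storm's actual API):
// emitted tuples stay in a pending map keyed by message id; ack() forgets a
// tuple, fail() fetches it for replay.
public class ReliableSpoutSketch {
    private final Map<Long, String> pending = new HashMap<>();
    private long nextId = 0;

    // Emit a tuple and remember it until it is acked.
    public long emit(String tuple) {
        long id = nextId++;
        pending.put(id, tuple);
        return id;
    }

    // Downstream processing completed: forget the tuple.
    public void ack(long id) {
        pending.remove(id);
    }

    // Downstream processing failed or timed out: return the tuple for replay
    // (null if it was already acked).
    public String replayOnFail(long id) {
        return pending.get(id);
    }

    public int pendingCount() {
        return pending.size();
    }
}
```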
  12. Spouts, source of streams: Storm-Kafka
  13. Spouts, source of streams: Storm-Kestrel
  14. Spouts, source of streams: Storm-AMQP-Spout
  15. Spouts, source of streams: Storm-JMS
  16. Spouts, source of streams: Storm-PubSub*
  17. Spouts, source of streams: Storm-Beanstalkd-Spout
  18. Bolts: process input streams and produce new streams
  20. Topologies: network of spouts and bolts
      TextSpout -[sentence]-> SplitSentenceBolt -[word]-> WordCountBolt -[word, count]->
  21. Topologies: network of spouts and bolts
      TextSpout -[sentence]-> SplitSentenceBolt -[word]-> WordCountBolt -[word, count]->
      TextSpout -[sentence]-> xyzBolt
  22. servers architecture

  23. Nimbus: master process responsible for distributing processing across the cluster
  24. Supervisors: worker processes responsible for executing a subset of a topology
  25. ZooKeepers: coordination layer between Nimbus and Supervisors
  26. fail fast: cluster state is stored locally or in ZooKeepers

  27. sample code

  28. Spouts

    public class RandomSentenceSpout extends BaseRichSpout {
        SpoutOutputCollector _collector;
        Random _rand;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            _collector = collector;
            _rand = new Random();
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            String[] sentences = new String[] {
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away",
                "four score and seven years ago",
                "snow white and the seven dwarfs",
                "i am at two with nature" };
            String sentence = sentences[_rand.nextInt(sentences.length)];
            _collector.emit(new Values(sentence));
        }

        @Override
        public void ack(Object id) { }

        @Override
        public void fail(Object id) { }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }
  29. Bolts

    public static class WordCount extends BaseBasicBolt {
        Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) count = 0;
            count++;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }
  30. Bolts

    public static class ExclamationBolt implements IRichBolt {
        OutputCollector _collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        public void execute(Tuple tuple) {
            _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
            _collector.ack(tuple);
        }

        public void cleanup() { }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

        public Map getComponentConfiguration() {
            return null;
        }
    }
  31. Topology

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8)
                   .shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12)
                   .fieldsGrouping("split", new Fields("word"));

            Config conf = new Config();
            conf.setDebug(true);

            if (args != null && args.length > 0) {
                conf.setNumWorkers(3);
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                conf.setMaxTaskParallelism(3);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("word-count", conf, builder.createTopology());
                Thread.sleep(10000);
                cluster.shutdown();
            }
        }
    }
  32. Bolts

    public static class SplitSentence extends ShellBolt implements IRichBolt {
        public SplitSentence() {
            super("python", "splitsentence.py");
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    # splitsentence.py
    import storm

    class SplitSentenceBolt(storm.BasicBolt):
        def process(self, tup):
            words = tup.values[0].split(" ")
            for word in words:
                storm.emit([word])

    SplitSentenceBolt().run()
  33. github.com/nathanmarz/storm-starter

  34. streams grouping

  35. Topology

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8)
                   .shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12)
                   .fieldsGrouping("split", new Fields("word"));

            Config conf = new Config();
            conf.setDebug(true);

            if (args != null && args.length > 0) {
                conf.setNumWorkers(3);
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                conf.setMaxTaskParallelism(3);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("word-count", conf, builder.createTopology());
                Thread.sleep(10000);
                cluster.shutdown();
            }
        }
    }
  36. Grouping: shuffle, fields, all, global, none, direct, local-or-shuffle
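The difference between the two groupings used in the topology above can be illustrated with a small sketch (hypothetical code, not Storm internals): a fields grouping picks the consumer task deterministically from a hash of the grouping field, so the same word always reaches the same WordCount task, while a shuffle grouping distributes tuples across tasks at random.

```java
import java.util.Random;

// Hypothetical illustration of task selection under two groupings
// (not Storm's internal routing code).
public class GroupingSketch {
    // fields grouping: the task is a deterministic function of the
    // grouping field's hash, so equal values always go to the same task.
    public static int fieldsGrouping(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    // shuffle grouping: any task, chosen roughly uniformly at random.
    public static int shuffleGrouping(Random rnd, int numTasks) {
        return rnd.nextInt(numTasks);
    }
}
```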
  37. distributed rpc

  38. Distributed RPC: arguments -> [request-id, arguments] -> topology -> [request-id, results] -> results
  39. Distributed RPC: arguments -> [request-id, arguments] -> topology -> [request-id, results] -> results

    public static class ExclaimBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String input = tuple.getString(1);
            collector.emit(new Values(tuple.getValue(0), input + "!"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
        builder.addBolt(new ExclaimBolt(), 3);

        Config conf = new Config();
        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));
        System.out.println("Results for 'hello':" + drpc.execute("exclamation", "hello"));
        cluster.shutdown();
        drpc.shutdown();
    }
  40. realtime analytics, personalization, search, revenue optimization, monitoring
  41. content search, realtime analytics, generating feeds; integrated with ElasticSearch, HBase, Hadoop and HDFS
  42. realtime scoring, moments generation; integrated with Kafka queues and HDFS storage
  43. Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS
  44. thanks! mail: mariusz@mariuszgil.pl twitter: @mariuszgil