Streams processing with Storm

REVISITED version

Mariusz Gil

July 06, 2013

Transcript

  1. Data streams processing with STORM
    Mariusz Gil


  3. data expire fast. very fast


  5. realtime processing?


  6. Storm is a free and open source distributed realtime
    computation system. Storm makes it easy to reliably
    process unbounded streams of data, doing for realtime
    processing what Hadoop did for batch processing.


  7. Storm is fast: a benchmark clocked it at over a million
    tuples processed per second per node. It is scalable,
    fault-tolerant, guarantees your data will be processed,
    and is easy to set up and operate.

  8. concept architecture


  9. Stream
    unbounded sequence of tuples
    [diagram: a stream of tuples, e.g. (val1, val2), (val3, val4), (val5, val6)]
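
A tuple is simply a named, ordered list of values, and the field names form the schema of the stream. A minimal sketch of that idea (the field names and values are made up for illustration; package names follow the 2013-era backtype.storm API, newer releases use org.apache.storm):

    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    public class TupleSketch {
        public static void main(String[] args) {
            // the schema of a stream: an ordered list of field names
            Fields schema = new Fields("user", "score");
            // one tuple's values, positionally aligned with the schema
            Values tuple = new Values("mariusz", 42);
            // field name to position lookup, which is what getXxxByField(...) relies on
            System.out.println(tuple.get(schema.fieldIndex("score"))); // prints 42
        }
    }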

  10. Spouts
    source of streams
    [diagram: a spout emitting a stream of tuples]

  11. Reliable and unreliable Spouts
    replay or forget about a tuple
    [diagram: a spout emitting a stream of tuples]
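
The difference between the two spout types is whether a tuple is emitted with a message id. When it is, Storm later calls ack or fail with that id, and the spout can forget or replay the message. A minimal sketch under assumptions (the pending map and the hard-coded sentence stand in for a real message source; package names follow the 2013-era backtype.storm API):

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public class ReliableSentenceSpout extends BaseRichSpout {
        SpoutOutputCollector _collector;
        // tuples emitted but not yet acked, keyed by message id
        Map<String, String> _pending;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            _collector = collector;
            _pending = new ConcurrentHashMap<String, String>();
        }

        @Override
        public void nextTuple() {
            String sentence = "the cow jumped over the moon"; // stand-in for a real source
            String msgId = UUID.randomUUID().toString();
            _pending.put(msgId, sentence);
            // emitting with a message id makes the tuple tracked, i.e. reliable
            _collector.emit(new Values(sentence), msgId);
        }

        @Override
        public void ack(Object msgId) {
            _pending.remove(msgId); // fully processed: forget about the tuple
        }

        @Override
        public void fail(Object msgId) {
            String sentence = _pending.get(msgId);
            if (sentence != null) {
                _collector.emit(new Values(sentence), msgId); // replay the tuple
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

An unreliable spout simply calls emit(new Values(sentence)) without a message id and leaves ack and fail empty, as the RandomSentenceSpout later in this deck does.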

  12. Spouts
    source of streams: Storm-Kafka

  13. Spouts
    source of streams: Storm-Kestrel

  14. Spouts
    source of streams: Storm-AMQP-Spout

  15. Spouts
    source of streams: Storm-JMS

  16. Spouts
    source of streams: Storm-PubSub*

  17. Spouts
    source of streams: Storm-Beanstalkd-Spout
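
Wiring one of these ready-made spouts into a topology is a one-liner on the builder. A sketch for Kafka (class names follow the storm-kafka module as later bundled with Apache Storm; early storm-kafka releases used slightly different constructors, and the ZooKeeper address, topic and ids are placeholders):

    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.TopologyBuilder;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class KafkaSpoutWiring {
        public static void main(String[] args) {
            // ZooKeeper ensemble used by the Kafka brokers (placeholder address)
            BrokerHosts hosts = new ZkHosts("zookeeper1:2181");

            // topic to read, ZK root for storing offsets, and a consumer id (placeholders)
            SpoutConfig spoutConfig = new SpoutConfig(hosts, "sentences", "/kafka-offsets", "sentence-reader");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);
            // downstream bolts attach to "kafka-spout" exactly as with any other spout
        }
    }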

  18. Bolts
    process input streams and produce new streams
    [diagram: a bolt consuming a stream of tuples and emitting a new one]

  19. Bolts
    process input streams and produce new streams
    [diagram: several bolts chained together, each consuming and emitting streams of tuples]

  20. Topologies
    network of spouts and bolts
    [diagram: TextSpout -[sentence]-> SplitSentenceBolt -[word]-> WordCountBolt -> [word, count]]

  21. Topologies
    network of spouts and bolts
    [diagram: two TextSpout/SplitSentenceBolt branches emitting [sentence] and [word] tuples;
    one branch feeds a WordCountBolt producing [word, count], the other feeds an xyzBolt]

  22. server architecture

  23. Nimbus
    master process responsible for distributing work across the cluster

  24. Supervisors
    worker processes responsible for executing a subset of a topology

  25. ZooKeeper
    coordination layer between Nimbus and the Supervisors

  26. fail fast
    cluster state is stored locally or in ZooKeeper

  27. sample code


  28. Spouts
    public class RandomSentenceSpout extends BaseRichSpout {
        SpoutOutputCollector _collector;
        Random _rand;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            _collector = collector;
            _rand = new Random();
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            String[] sentences = new String[] {
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away",
                "four score and seven years ago",
                "snow white and the seven dwarfs",
                "i am at two with nature"};
            String sentence = sentences[_rand.nextInt(sentences.length)];
            _collector.emit(new Values(sentence));
        }

        @Override
        public void ack(Object id) {
        }

        @Override
        public void fail(Object id) {
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

  29. Bolts
    public static class WordCount extends BaseBasicBolt {
        Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            if (count == null) count = 0;
            count++;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

  30. Bolts
    public static class ExclamationBolt implements IRichBolt {
        OutputCollector _collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        public void execute(Tuple tuple) {
            // anchor the emitted tuple to its input and ack it manually
            // (BaseBasicBolt does both automatically, a raw IRichBolt does not)
            _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
            _collector.ack(tuple);
        }

        public void cleanup() {
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

        public Map getComponentConfiguration() {
            return null;
        }
    }

  31. Topology
    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8)
                   .shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12)
                   .fieldsGrouping("split", new Fields("word"));

            Config conf = new Config();
            conf.setDebug(true);

            if (args != null && args.length > 0) {
                conf.setNumWorkers(3);
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                conf.setMaxTaskParallelism(3);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("word-count", conf, builder.createTopology());
                Thread.sleep(10000);
                cluster.shutdown();
            }
        }
    }

  32. Bolts
    public static class SplitSentence extends ShellBolt implements IRichBolt {
        public SplitSentence() {
            super("python", "splitsentence.py");
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }

        // required by IRichBolt
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }

    # splitsentence.py
    import storm

    class SplitSentenceBolt(storm.BasicBolt):
        def process(self, tup):
            words = tup.values[0].split(" ")
            for word in words:
                storm.emit([word])

    SplitSentenceBolt().run()

  33. github.com/nathanmarz/storm-starter


  34. stream groupings

  35. Topology
    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8)
                   .shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12)
                   .fieldsGrouping("split", new Fields("word"));

            Config conf = new Config();
            conf.setDebug(true);

            if (args != null && args.length > 0) {
                conf.setNumWorkers(3);
                StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
            } else {
                conf.setMaxTaskParallelism(3);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("word-count", conf, builder.createTopology());
                Thread.sleep(10000);
                cluster.shutdown();
            }
        }
    }

  36. Grouping
    shuffle, fields, all, global, none, direct, local or shuffle
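
Each name on the slide maps to a method on the input declarer returned by setBolt. A sketch reusing the spout and bolts from the word-count example (the bolt ids and parallelism hints are only illustrative, and WordCount is reused for the extra consumers just to keep the sketch self-contained):

    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class GroupingExamples {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new RandomSentenceSpout(), 5);

            // shuffle: tuples are distributed randomly but evenly across the bolt's tasks
            builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");

            // fields: tuples with the same "word" value always go to the same task
            builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

            // all: every task of the bolt receives a copy of every tuple
            builder.setBolt("broadcast", new WordCount(), 2).allGrouping("split");

            // global: the entire stream goes to a single task
            builder.setBolt("total", new WordCount(), 1).globalGrouping("count");

            // local or shuffle: prefer tasks in the same worker process, otherwise shuffle
            builder.setBolt("log", new WordCount(), 4).localOrShuffleGrouping("count");

            // none currently behaves like shuffle, and direct lets the emitter choose
            // the receiving task: noneGrouping(...) / directGrouping(...)
        }
    }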

  37. distributed rpc


  38. Distributed RPC
    [diagram: arguments go in, results come back; inside the topology tuples carry
    [request-id, arguments] on the way in and [request-id, results] on the way out]

  39. Distributed RPC
    public static class ExclaimBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String input = tuple.getString(1);
            collector.emit(new Values(tuple.getValue(0), input + "!"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
        builder.addBolt(new ExclaimBolt(), 3);

        Config conf = new Config();
        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));
        System.out.println("Results for 'hello': " + drpc.execute("exclamation", "hello"));

        cluster.shutdown();
        drpc.shutdown();
    }
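
Outside of the LocalDRPC used above, a remote client invokes the same function by name through the cluster's DRPC server. A sketch under assumptions (the host name is a placeholder, 3772 is the default DRPC port, and DRPCClient(host, port) is the pre-1.0 backtype.storm constructor):

    import backtype.storm.utils.DRPCClient;

    public class ExclamationClient {
        public static void main(String[] args) throws Exception {
            // connect to one of the cluster's DRPC servers (placeholder host, default port)
            DRPCClient client = new DRPCClient("drpc-server.example.com", 3772);

            // blocks until the "exclamation" topology returns the result for this request
            String result = client.execute("exclamation", "hello");
            System.out.println("Results for 'hello': " + result);
        }
    }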

  40. realtime analytics
    personalization
    search
    revenue optimization
    monitoring

  41. content search
    realtime analytics
    generating feeds
    integrated with Elasticsearch, HBase, Hadoop and HDFS

  42. realtime scoring
    moments generation
    integrated with Kafka queues and HDFS storage

  43. Storm-YARN enables Storm applications to utilize the computational
    resources in a Hadoop cluster along with accessing Hadoop storage
    resources such as HBase and HDFS

  44. thanks!
    mail: [email protected]
    twitter: @mariuszgil
