Storm - an overview

Talk on Storm given by Oli Hall, Engineer, at MetaBroadcast on May 15th, 2013

MetaBroadcast

May 15, 2013

Transcript

  1. Into the Storm: an introduction to and overview of Apache Storm
     Oliver Hall, Engineer, MetaBroadcast

  2. What is Storm?
     • "free and open source distributed realtime computation platform"
     • tasks made from nodes, spread over multiple physical hosts
     • at-least-once guarantee for message processing
     • fault tolerant

  3. Who uses it?
     • Twitter
     • Groupon
     • Ooyala
     • Taobao
     • Alibaba
     • and, of course... MetaBroadcast

  4. What are we using it for?
     • labelling
     • statistics
     • impressions counting
     • and potentially much more...

  5. History
     • developed by Nathan Marz at BackType
     • BackType was acquired by Twitter
     • initial release in September 2011
     • currently at version 0.8, still under development

  6. Overview
     • runs on a cluster of machines
     • consists of topologies (continuously running processing tasks)
     • cluster is a series of nodes
       ◦ master
       ◦ one or more workers

  7. Master Node
     • master node runs Nimbus
     • distributes code around the cluster
     • monitors for failures

  8. Worker Node
     • runs a Supervisor
     • listens for work assigned to its machine
     • starts / stops worker processes as necessary
     • each worker process runs a sub-section of a topology
     • a topology is therefore multiple worker processes spread across several machines

  9. Spouts
     • source of data
     • reliable or unreliable
     • can emit tuples to one or more streams
     (diagram: multiple data sources feeding into a spout)

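     Not part of the original deck: a minimal sketch of what "reliable" means for a spout, assuming
     Storm's 0.8-era spout API (BaseRichSpout, SpoutOutputCollector, with imports from backtype.storm.*
     omitted as in the deck's own code). The class name, the in-memory pending map and the id scheme are
     illustrative; the key point is that emitting with a message id enables the ack()/fail() callbacks,
     while emitting without one makes the spout unreliable.

      // Illustrative reliable spout: emit with a message id, replay on fail()
      public class ReliableNumbersSpout extends BaseRichSpout {
          private SpoutOutputCollector collector;
          private long nextId = 0;
          private final Map<Long, Long> pending = new HashMap<Long, Long>();

          @Override
          public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              this.collector = collector;
          }

          @Override
          public void nextTuple() {
              long value = nextId;                       // stand-in for a real data source
              long msgId = nextId++;
              pending.put(msgId, value);                 // remember it until acked
              collector.emit(new Values(value), msgId);  // message id => reliable emit
          }

          @Override
          public void ack(Object msgId) {
              pending.remove(msgId);                     // fully processed downstream
          }

          @Override
          public void fail(Object msgId) {
              Long value = pending.get(msgId);
              if (value != null) {
                  collector.emit(new Values(value), msgId);  // replay on failure
              }
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("number"));
          }
      }
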
  10. Bolts
     • where all Storm processing occurs
     • filters, aggregations, functions, database calls, and more
     (diagram: input tuples → processing → output tuples)

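     Not shown in the deck: a sketch of a bolt that acks its input explicitly, assuming the Storm 0.8
     bolt API (BaseRichBolt / OutputCollector). The BaseBasicBolt used on a later slide acks
     automatically; with BaseRichBolt the ack that drives the at-least-once guarantee from slide 2 is
     visible. The class name and filtering condition are illustrative only.

      // Illustrative bolt with explicit anchoring and acking
      public class LongWordsOnlyBolt extends BaseRichBolt {
          private OutputCollector collector;

          @Override
          public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
              this.collector = collector;
          }

          @Override
          public void execute(Tuple tuple) {
              String word = tuple.getString(0);
              if (word.length() > 3) {
                  // anchor the output to the input so failures are replayed from the spout
                  collector.emit(tuple, new Values(word));
              }
              collector.ack(tuple);  // tell Storm this input has been handled
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }
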
  11. Topologies (again)
     • topologies tell bolts and spouts where to send their data
     • every step can be parallelised
     • N.B. you can define spouts, bolts and topologies in many languages, including non-JVM languages

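     The deck does not show a non-JVM component, but the usual route in this era of Storm was the
     multilang protocol via ShellBolt. This sketch follows the pattern from storm-starter's
     WordCountTopology; splitsentence.py is assumed to be a script shipped in the topology jar's
     resources/ directory that speaks the multilang protocol.

      // JVM side only declares output fields; the work is done by a Python script
      public static class SplitSentence extends ShellBolt implements IRichBolt {
          public SplitSentence() {
              super("python", "splitsentence.py");
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }

          @Override
          public Map<String, Object> getComponentConfiguration() {
              return null;
          }
      }
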
  12. TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("spout", new RandomSentenceSpout(), 5);
      builder.setBolt("split", new SplitSentence(), 8)
             .shuffleGrouping("spout");
      builder.setBolt("count", new WordCount(), 12)
             .fieldsGrouping("split", new Fields("word"));

      Config conf = new Config();
      conf.setMaxTaskParallelism(3);

      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());

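     The snippet above runs everything in-process with LocalCluster. Not from the deck, but for
     contrast, a sketch of submitting the same topology to a real cluster with StormSubmitter; the
     topology name and worker count are arbitrary.

      // Submitting to a real cluster instead of a LocalCluster (illustrative values)
      Config conf = new Config();
      conf.setNumWorkers(4);  // worker processes to spread the topology across, as per slide 8
      StormSubmitter.submitTopology("word-count", conf, builder.createTopology());
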
  13. public class RandomSentenceSpout extends BaseRichSpout {
          ...
          @Override
          public void nextTuple() {
              Utils.sleep(100);
              String[] sentences = new String[] {
                  "the cow jumped over the moon",
                  "an apple a day keeps the doctor away",
                  "four score and seven years ago",
                  "snow white and the seven dwarfs",
                  "i am at two with nature"};
              String sentence = sentences[_rand.nextInt(sentences.length)];
              _collector.emit(new Values(sentence));
          }
          ...
          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
          }
      }

  14. public static class WordCount extends BaseBasicBolt {
          Map<String, Integer> counts = new HashMap<String, Integer>();

          @Override
          public void execute(Tuple tuple, BasicOutputCollector collector) {
              String word = tuple.getString(0);
              Integer count = counts.get(word);
              if (count == null) count = 0;
              count++;
              counts.put(word, count);
              collector.emit(new Values(word, count));
          }

          @Override
          public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word", "count"));
          }
      }

  15. Trident
     • a high-level abstraction on top of Storm
     • combines high throughput with stateful stream processing (think Pig or Cascading)
     • exactly-once message semantics

  16. Only-once
     • tuples are processed in small batches
     • each batch has a unique batch id
     • state updates are ordered

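     A conceptual sketch, not Trident's actual State API, of why unique, ordered batch ids make state
     updates idempotent: store the batch id (txid) alongside the value, and a replayed batch becomes a
     no-op. This relies on a replayed batch containing exactly the same tuples, which is the
     transactional-spout assumption from the later slides.

      // Conceptual only: txid stored next to the value makes a replayed batch a no-op
      public class TransactionalCounter {
          private long lastTxId = -1;
          private long count = 0;

          // called once per batch, with batches arriving in txid order
          public void applyBatch(long txId, long batchCount) {
              if (txId == lastTxId) {
                  return;  // batch already applied before a failure; skip it
              }
              count += batchCount;
              lastTxId = txId;  // remember which batch produced the current value
          }

          public long getCount() {
              return count;
          }
      }
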
  17. Trident Topologies
     • a stream is channelled through a number of processing stages
     • each stage can be a filter, aggregation, function, or other similar process
     • sounds familiar?
     • individual steps are combined into spouts / bolts at runtime

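     The next slide shows a Trident function; for completeness, here is a minimal filter sketch (not
     from the deck), assuming Trident's BaseFilter. The class name and length threshold are
     illustrative.

      // A Trident filter keeps or drops tuples; this one drops short words.
      // Used as: stream.each(new Fields("word"), new LongWordFilter())
      public class LongWordFilter extends BaseFilter {
          @Override
          public boolean isKeep(TridentTuple tuple) {
              return tuple.getString(0).length() > 3;  // keep only words longer than 3 chars
          }
      }
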
  18. What are functions?
     • basic building blocks in Trident
     (diagram: input tuples → processing → output tuples)

      public class Split extends BaseFunction {
          public void execute(TridentTuple tuple, TridentCollector collector) {
              String sentence = tuple.getString(0);
              for (String word : sentence.split(" ")) {
                  collector.emit(new Values(word));
              }
          }
      }

  19. Example Trident Topology

      TridentTopology topology = new TridentTopology();
      TridentState wordCounts = topology.newStream("spout1", spout)
          .each(new Fields("sentence"), new Split(), new Fields("word"))
          .groupBy(new Fields("word"))
          .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
          .parallelismHint(6);

  20. DRPC
     • Distributed Remote Procedure Calls
     • executes an RPC across a Storm cluster
     • the query is transformed into a tuple, then flows through a topology

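     Not in the deck, but the Storm/Trident tutorial pairs DRPC with the word-count state built on the
     previous slide. A sketch along those lines: "words" is the DRPC function name, Split is the
     function from slide 18, wordCounts is the TridentState from slide 19, and MapGet, FilterNull and
     Sum are Trident's built-in operations.

      // DRPC query: arguments arrive as a tuple, get split into words,
      // and each word is looked up in the persisted word counts.
      topology.newDRPCStream("words")
              .each(new Fields("args"), new Split(), new Fields("word"))
              .groupBy(new Fields("word"))
              .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
              .each(new Fields("count"), new FilterNull())
              .aggregate(new Fields("count"), new Sum(), new Fields("sum"));
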
  21. Trident State
     • means of persistence, either in memory or in a store such as Cassandra
     • state updates are idempotent in the face of retries or failures

  22. Trident Spouts
     • can be one of three types
       ◦ Non-transactional
       ◦ Transactional
       ◦ Opaque Transactional

  23. Achieving exactly-once semantics

                                 State type
      Spout type                 Non-transactional   Transactional   Opaque Transactional
      Non-transactional          No                  No              No
      Transactional              No                  Yes             Yes
      Opaque Transactional       No                  No              Yes

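     A conceptual sketch, not Trident's API, of the extra bookkeeping behind the last column: opaque
     transactional state keeps the previous value as well as the current one, so a batch replayed with
     different tuples (which opaque spouts allow) can still be applied exactly once by re-applying it on
     top of the previous value instead of skipping it.

      // Conceptual only: opaque state = txid + previous value + current value
      public class OpaqueCounter {
          private long lastTxId = -1;
          private long prevValue = 0;   // value before the last applied batch
          private long currValue = 0;   // value after the last applied batch

          public void applyBatch(long txId, long batchCount) {
              if (txId == lastTxId) {
                  // same batch replayed, possibly with different contents:
                  // re-apply it on top of the value that preceded it
                  currValue = prevValue + batchCount;
              } else {
                  prevValue = currValue;
                  currValue = currValue + batchCount;
                  lastTxId = txId;
              }
          }

          public long getCount() {
              return currValue;
          }
      }
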
  24. Issues
     • Storm does have some negative points
       ◦ lack of documentation
       ◦ logging issues
       ◦ no testing framework for Trident
       ◦ rapidly changing
     • however, it is still early days

  25. Summary
     • Storm is a realtime, scalable, resilient computation platform
     • Trident offers extremely good message guarantees
     • still an evolving technology - much may change

  26. Thank you! Any questions?
      images from the Storm Tutorial - https://github.com/nathanmarz/storm/wiki/Tutorial