Storm 101

An introduction to Storm (http://storm-project.net) presented at the Data and Visualization Toronto meetup (http://www.meetup.com/Data-Visualization-Toronto/) in Dec 2012.

Abhinav Ajgaonkar

December 12, 2012

Transcript

  1. What is Storm?
    •  Distributed realtime stream-computation framework
    •  Storm = Libraries + Runtime (kind of, not really)
    •  Libraries: storm.jar
    •  Runtime: storm binaries used on cluster nodes
  2. Who uses it and why?
    “We use several Storm topologies to ingest and persist weather data. Each topology is responsible for fetching one dataset from an internal or external network (the Internet), reshaping records for use by our company and persisting the records to a relational database” – The Weather Channel
    Source: https://github.com/nathanmarz/storm/wiki/Powered-By
  3. Who uses it and why?
    “Storm powers Twitter’s publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter’s publisher partners”
    Source: https://github.com/nathanmarz/storm/wiki/Powered-By
  4. Topology
    •  Can be visualized like a graph
    •  Container for application logic
    •  Analogous to a Map Reduce job
    •  Runs forever
  5. Tuple
    •  Main data structure in Storm
    •  Key-Value pairs – keys are strings, values can be any type
    •  Dynamically typed: value types do not need to be declared
  6. Streams
    •  Edges in the topology
    •  Unbounded sequence of tuples
    •  Defined with a schema
      –  Names of “fields” in the tuples being transported by the stream
      –  Values are dynamically typed
    •  Serializers for primitive types are provided
    •  Complex types require custom serializers
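
The tuple and stream-schema ideas above can be sketched in plain Java. This is an illustration only and does not use the Storm API: `SimpleTuple` and `getValueByField` are hypothetical names. The stream's schema is a list of field names, and each tuple carries one dynamically typed value per field.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a tuple on a stream; NOT Storm's own Tuple class.
class SimpleTuple {
    private final List<String> fields;  // schema: field names declared by the stream
    private final List<Object> values;  // dynamically typed values, one per field

    SimpleTuple(List<String> fields, Object... values) {
        if (fields.size() != values.length) {
            throw new IllegalArgumentException("expected " + fields.size() + " values");
        }
        this.fields = fields;
        this.values = Arrays.asList(values);
    }

    // Look a value up by its field name, giving a key-value view of the tuple.
    Object getValueByField(String field) {
        int i = fields.indexOf(field);
        if (i < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(i);
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("author", "text", "retweets");
        SimpleTuple t = new SimpleTuple(schema, "alice", "hello storm", 42);
        System.out.println(t.getValueByField("author") + ": " + t.getValueByField("retweets"));
    }
}
```

Note that the value types (a `String` and an `Integer` here) are never declared in the schema; only the field names are.
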
  7. Spouts
    •  Spouts are like sources in a graph
    •  Entry point for data – connect to data sources
    •  Inject tuples into the topology
    •  Tuples are emitted on streams
    •  Can output more than one stream
    •  Reliable or Unreliable
  8. Bolts
    •  Main workhorse of the topology
    •  Receives tuples from Spouts or other Bolts
    •  Can emit tuples to other Bolts
    •  Can do anything, e.g. filtering, joins, aggregations, read from/write to databases, run arbitrary functions...
    •  All sinks in the topology are bolts but not all bolts are sinks
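
The spout-to-bolt flow can be mimicked in a few lines of plain Java. This is a rough sketch, not Storm code: the `Bolt` interface, `FILTER` and `run` are hypothetical names, tuples are reduced to plain strings, and the "spout" is a finite iterator where a real spout would produce an unbounded stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java sketch of a spout -> bolt -> bolt flow; these are NOT Storm's interfaces.
class MiniPipeline {
    // A bolt receives one tuple (here just a String) and may emit to a downstream consumer.
    interface Bolt {
        void execute(String tuple, Consumer<String> emit);
    }

    // Intermediate bolt: a filter that passes on only tuples mentioning "storm".
    static final Bolt FILTER = (tuple, emit) -> {
        if (tuple.toLowerCase().contains("storm")) {
            emit.accept(tuple);
        }
    };

    // Feeds the spout's tuples through the filter bolt into a sink bolt.
    static List<String> run(Iterator<String> spout) {
        List<String> stored = new ArrayList<>();
        Bolt sink = (tuple, emit) -> stored.add(tuple);  // a sink consumes and never emits
        while (spout.hasNext()) {
            FILTER.execute(spout.next(), t -> sink.execute(t, ignored -> { }));
        }
        return stored;
    }

    public static void main(String[] args) {
        Iterator<String> spout =
            Arrays.asList("Storm is fast", "lunch time", "I like storm").iterator();
        System.out.println(run(spout));
    }
}
```

The filter is a bolt that is not a sink, since it emits downstream; the last bolt is a sink, since nothing subscribes to it.
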
  9. Stream Groupings
    •  Define how a stream should be partitioned amongst the various tasks (threads) of the target bolts
    •  Seven types of Groupings are bundled
    •  CustomStreamGrouping is available
  10. Shuffle Grouping
    •  Tuples are randomly distributed across the bolt’s tasks
    •  Each task is guaranteed to get an equal number of tuples
    •  Why would you use this?
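
One way to be both random and even is to shuffle the list of tasks, hand tuples out in that order, and reshuffle after each full pass. The sketch below assumes that strategy for illustration; it is not taken from the Storm source, and `ShuffleGroupingSketch` and `chooseTask` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only: random order within each pass, equal counts across passes.
class ShuffleGroupingSketch {
    private final List<Integer> tasks = new ArrayList<>();
    private final Random random;
    private int next = 0;

    ShuffleGroupingSketch(int numTasks, Random random) {
        for (int i = 0; i < numTasks; i++) {
            tasks.add(i);
        }
        this.random = random;
        Collections.shuffle(tasks, random);
    }

    // Pick the target task for the next tuple.
    int chooseTask() {
        if (next == tasks.size()) {  // finished one full pass over all tasks
            Collections.shuffle(tasks, random);
            next = 0;
        }
        return tasks.get(next++);
    }

    public static void main(String[] args) {
        ShuffleGroupingSketch grouping = new ShuffleGroupingSketch(4, new Random());
        int[] counts = new int[4];
        for (int i = 0; i < 400; i++) {
            counts[grouping.chooseTask()]++;
        }
        System.out.println(Arrays.toString(counts));  // prints [100, 100, 100, 100]
    }
}
```

Because every pass visits each task exactly once, the counts are equal whenever the number of tuples is a multiple of the number of tasks, and differ by at most one otherwise.
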
  11. Fields Grouping
    •  The stream is partitioned by the fields specified by the grouping
    •  If the stream is partitioned by “user-id” then all tuples with the same user-id will go to the same instance of the bolt
    •  Tuples with different user-ids may go to different instances
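
A common way to get this behavior is to hash the grouping field and take the result modulo the number of tasks. The sketch below assumes that approach for illustration and does not claim to match Storm's internals; `FieldsGroupingSketch` and `taskFor` are hypothetical names.

```java
// Illustrative only: route a tuple by hashing the value of its grouping field.
class FieldsGroupingSketch {
    // Same field value -> same task index, always in the range [0, numTasks).
    static int taskFor(Object fieldValue, int numTasks) {
        // floorMod keeps the index non-negative even when hashCode() is negative.
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        for (String userId : new String[] {"alice", "bob", "alice", "carol"}) {
            System.out.println(userId + " -> task " + taskFor(userId, numTasks));
        }
    }
}
```

Equal user-ids always map to the same task, which is what makes per-user state (counts, joins) safe to keep inside one bolt instance. Different user-ids can still land on the same task, since there are usually far more keys than tasks.
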
  12. Other Groupings
    •  All Grouping
    •  Global Grouping
    •  None Grouping
    •  Direct Grouping
    •  Local or Shuffle Grouping
  13. Sample Topology
    [Annotated code sample; callouts: Component ID, Parallelism, Stream Grouping, Subscribe to this component]
    •  No stream name has been provided, which implies the “default” stream name
  14. The Problem: Find Top Tweets
    •  Create a Spout that connects to the Twitter stream
    •  Create a bolt that receives tweets from the Spout
      –  Initialize top_tweet_retweets = 50
      –  If (retweet_count > top_tweet_retweets)
        •  Print tweet author, tweet text and retweet count
        •  Update top_tweet_retweets
    •  Bonus: Keep an in-memory leaderboard of the most retweeted tweets in the past 5 minutes
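
The core check in this exercise can be sketched in plain Java before wiring it into a bolt. The class below is a hypothetical helper, not part of the starter project: `TopTweetTracker` and `offer` are names invented here, and the output format is an assumption. The bonus leaderboard is left out.

```java
// Hypothetical helper holding the "top tweet" state a bolt instance would keep.
class TopTweetTracker {
    private int topTweetRetweets = 50;  // initial threshold from the slide

    // Returns the line to print if this tweet beats the current top, else null.
    String offer(String author, String text, int retweetCount) {
        if (retweetCount > topTweetRetweets) {
            topTweetRetweets = retweetCount;  // raise the bar for later tweets
            return author + ": " + text + " (" + retweetCount + " retweets)";
        }
        return null;
    }

    int getTopTweetRetweets() {
        return topTweetRetweets;
    }

    public static void main(String[] args) {
        TopTweetTracker tracker = new TopTweetTracker();
        String[][] tweets = {{"alice", "hi", "10"}, {"bob", "storm!", "60"}, {"carol", "meh", "55"}};
        for (String[] t : tweets) {
            String line = tracker.offer(t[0], t[1], Integer.parseInt(t[2]));
            if (line != null) {
                System.out.println(line);
            }
        }
    }
}
```

Inside a bolt, `offer` would be called once per incoming tweet from the bolt's tuple-handling method. Since the state lives in one object, this only finds a single global top tweet if the bolt runs as one task.
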
  15. Quick Start
    •  A starting point
      git clone https://github.com/abh1nav/dvto1
      OR
      wget https://github.com/abh1nav/dvto1/archive/v0.1.zip
    •  What’s in it?
      –  TwitterSampleSpout: connects to the Twitter API and emits tweets
      –  LogBolt: logs tweet author and text to console
      –  Topology: connects 1 TwitterSampleSpout to 1 LogBolt and runs locally
    •  What should I do?
      –  Import project into Eclipse as an existing Maven project
      –  Add your Twitter credentials to Topology.java
      –  Modify LogBolt to complete today’s challenge