
Supercharge your workers with Storm

Carl Lerche

April 28, 2014

Transcript

  1. @carllerche - Shooting for 40 minutes, so no time for
     questions. - Although, pretty nervous, so I might be done in 15... - Ask questions via twitter as they come up - Will have to hand wave over a bunch of stuff due to time constraints; if you want help finding more info about something I mention, ask
  2. - I work for Tilde - Building Skylight, a smart
     profiler for your Rails app. - Used storm - back end data processing - Coordinating writes to cassandra - I will be at the booth tomorrow and Thursday - Come talk about some real world storm usage
  3. WHAT IS STORM? - Well, what is storm? - Let’s

    start with how it describes itself on the website. Monday, April 28, 14
  4. A DISTRIBUTED REALTIME COMPUTATION SYSTEM - Wow, sounds fancy... sounds
     like serious business. - Took me a while to get into it; I didn't think that it applied to me. Once I got to know it, it became obviously useful for many applications. - I was debating whether or not to go over some use cases up front, but decided against it. - I'm hoping to get there by first walking through some examples of using it. - So, for now let's just call it a really, really powerful worker system. Some highlights
  5. DISTRIBUTED - (Really distributed) - As in, the number of

    moving pieces is really high Monday, April 28, 14
  6. [diagram: Rails app, Internet, distributed queues and database, Storm workers, Storm Nimbus, and a Zookeeper ensemble]
     So, this one is kind of a pro and a con, because the operational aspect is not super easy.
  7. FAULT TOLERANT - Is able to recover and continue making

    progress in the event of errors. - A lot of systems claim this. Reality is, handling faults is really hard, but I think Storm is one of the few that handles this well, and I will go into some more detail later. Monday, April 28, 14
  8. REALLY FAST - Pretty low overhead - Pretty significantly threaded

    - Coordination between threads is really good - Useless number: over 1 million messages per sec per node Monday, April 28, 14
  9. LANGUAGE AGNOSTIC (supposedly) So, in THEORY, you can use storm
     with the language of your choosing. I've even seen examples that used bash. In practice, I don't know how well it really works.
  10. JVM - Assume JRuby - Part of the hand waving

    - I can’t really go into the detail of how to set everything up. Monday, April 28, 14
  11. TWITTER TRENDING TOPICS - Straw man - Everybody tweets, sometimes

    they include hashtags. Trending topics bubble up hashtags that are occurring at the highest rates. - Time sensitive (real time) Monday, April 28, 14
  12. EXPONENTIALLY WEIGHTED MOVING AVERAGES - For this example, I am
      going to use exponentially weighted moving averages to calculate the rate at which hashtags are being used. - For each hashtag, count the number of occurrences every 5 seconds, then average that number. - Instead of the normal "sum all the values and divide by the number of values", we are going to weigh older values exponentially less. - Fun fact: linux uses EWMA for calculating the 1m, 5m, and 15m values for CPU load.
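      The EWMA update described above can be sketched in a few lines of plain Ruby. This is a standalone sketch, not code from the talk; the ALPHA value of 0.5 is purely illustrative, while the 5-second interval matches the slides.

      ```ruby
      # Sketch of an exponentially weighted moving average step.
      # ALPHA controls how quickly older intervals decay; 0.5 is an
      # illustrative value, not one from the talk.
      ALPHA = 0.5

      def ewma_step(rate, uncounted, interval = 5)
        instant_rate = uncounted.to_f / interval # occurrences per second this tick
        rate + ALPHA * (instant_rate - rate)     # older values weigh exponentially less
      end

      rate = 0.0
      rate = ewma_step(rate, 50) # 50 occurrences in 5s => instant rate of 10/s
      # rate is now 5.0
      rate = ewma_step(rate, 50)
      # rate is now 7.5, converging toward 10.0
      ```

      Each tick pulls the running rate a fixed fraction of the way toward the interval's instantaneous rate, which is exactly why old intervals fade exponentially.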
  13. Rails app Internet Queue (Redis) Worker DB - Start w/
      how this might be implemented using Resque or Sidekiq. - To be clear, I'm not putting either of these projects down, we use them. I'm just using them to try to illustrate some problems that storm solves.
  14. class TweetWorker
        include Sidekiq::Worker

        # Yes, I know this is naive and the number of
        # queries could be reduced.
        def perform(tweet)
          tags = extract_hashtags(tweet.body)
          tags.each do |hashtag|
            existing = HashTag.find_or_new_by_name(hashtag)
            existing.update_ewma(Time.now)
            existing.save!
          end
        end
      end
  17. class HashTag
        def update_ewma(now)
          catchup(now)
          self.uncounted += 1
        end

        def catchup(now)
          tick until time >= now.to_i
        end

        def tick
          interval = 5 # in seconds
          # Compute the rate this interval (aka the num
          # of occurrences this tick)
          instant_rate = uncounted / interval
          # Reset the count
          self.uncounted = 0
          self.rate += ALPHA * (instant_rate - self.rate)
          self.time += interval
        end
      end
  18. TWEETS FROM A GIVEN HASHTAG STOP - There is a

    problem. Our EWMA algorithm requires us to update the rate value of the hashtag every 5 seconds. - This works as long as there are tweets that arrive containing the hashtag. However, what if that isn’t the case? We need to run another job to ensure that the hashtags keep getting their rate values updated even when no tweets arrive. Monday, April 28, 14
  19. class CleanupWorker
        include Sidekiq::Worker

        def perform
          now = Time.now.to_i
          HashTag.delete_all("time < ?", now - CUTOFF)
          tags = HashTag.where("time < ?", now).all
          tags.each do |hashtag|
            hashtag.catchup(now)
            hashtag.save!
          end
        end
      end
      - Cool, this should conceptually work.
      - Though, I haven't actually run any of this code.
      - There is one more super important question to ask
  20. IS IT WEB SCALE YET? - Always the most important

    question - You got to be sure that when your app goes viral, you can handle the load. - No worries, we can scale out the workers Monday, April 28, 14
  21. Rails app Internet Queue (Redis) Worker DB Worker Worker -

    Alright, now we’re talking. Got 3 workers going, are we ready to handle twitter’s 50k+ ps tweet firehose? - Well.... maybe not quite. But no worries, we got more tricks up our sleeves. - Let’s add some caching. - We’re going to cache the hashtag records in memory in each worker. - Everybody knows caching is easy... Monday, April 28, 14
  22. class TweetWorker
        include Sidekiq::Worker

        def initialize
          @hashtags = {}
        end

        def perform(tweet)
          tags = extract_hashtags(tweet.body)
          tags.each do |hashtag|
            unless existing = @hashtags[hashtag]
              @hashtags[hashtag] = HashTag.new_by_name(hashtag)
              existing = @hashtags[hashtag]
            end
            existing.update_ewma(Time.now)
            existing.save!
          end
        end
      end
  23. Rails app Internet Queue (Redis) Worker DB Worker Worker @carllerche

    On stage at #railsconf. OMG so nervous... Monday, April 28, 14
  24. Rails app Internet Queue (Redis) Worker DB Worker Worker @carllerche

    On stage at #railsconf. OMG so nervous... #railsconf Monday, April 28, 14
  26. Rails app Internet Queue (Redis) Worker DB Worker Worker @carllerche

    On stage at #railsconf. OMG so nervous... count 1 count 1 Monday, April 28, 14
  28. Rails app Internet Queue (Redis) Worker DB Worker Worker @sferik

    What’s for lunch at #railsconf? count 1 count 1 Monday, April 28, 14
  29. Rails app Internet Queue (Redis) Worker DB Worker Worker @sferik

    What’s for lunch at #railsconf? #railsconf count 1 count 1 Monday, April 28, 14
  31. Rails app Internet Queue (Redis) Worker DB Worker Worker count

    1 #railsconf count 1 count 1 @sferik What’s for lunch at #railsconf? Monday, April 28, 14
  33. Rails app Internet Queue (Redis) Worker DB Worker Worker count

    1 count 2 count 1 @sferik What’s for lunch at #railsconf? count 2 Monday, April 28, 14
  35. Rails app Internet Queue (Redis) Worker DB Worker Worker count

    2 count 2 count 1 @tomdale Just landed! #railsconf? count 2 Monday, April 28, 14
  36. Rails app Internet Queue (Redis) Worker DB Worker Worker count

    2 count 2 count 1 @tomdale Just landed! #railsconf? count 2 #railsconf Monday, April 28, 14
  38. Rails app Internet Queue (Redis) Worker DB Worker Worker count

    2 count 2 count 2 @tomdale Just landed! #railsconf? count 2 count 2 Monday, April 28, 14
  40. Rails app Internet Queue (Redis) Worker DB Worker Worker count

    2 count 2 count 2 @tomdale Just landed! #railsconf? count 2 count 2 - Can’t cache hashtags - This has to do with how these systems work. Workers pop the next available message from the queue and process it. - Workers are assumed to bootstrap their state each time. - We could probably reduce each worker to effectively run a single (large) SQL query, but that would still require a SQL query for each tweet. - Punting coordination to the database, and that’s where the bottleneck will end up. Even though we have many parallel workers, the database can only process one update at a time. - There are still many things we can do to fix this up. Monday, April 28, 14
  42. ENTER STORM - Solves all the problems we were having

    - That is what storm is trying to do - Make these sorts of problems easier. - Let’s start diving in. - going to start by going over some more abstract concepts, but hopefully I’ll be able to tie it together with examples Monday, April 28, 14
  43. STREAMS / TUPLES - Storm really is just a series

    of tubes through which data gets piped, but storm calls the pipes streams and the data tuples. - A tuple is just a list of values. The values can be anything you want. Strings, integers, or objects of any complexity. The only limitation is that you can serialize them. You can define custom serializers for any type of object. I’m not going to get too much into the specifics of serialization though. - The bulk of storm is just a set of primitives to transform the streams of data. Monday, April 28, 14
  44. SPOUT / STATE Spouts are the source of the streams.

    They are the entry point into storm. Anything that reads from the outside world - Read from queues (Redis, SQS, etc..) - Read directly from the twitter API - Read from databases - HTTP Get requests - Time of day State objects are the opposite. They are the stream “endpoints” They allow the results of the data transformations to be available outside of storm. - Anything that “writes” outside of storm - Writes to the DB - HTTP POST requests - Pushing to external queues - Sending email Monday, April 28, 14
  45. - Inject data transformations into the stream - But, we

    haven’t done anything interesting yet. - There is no point really to just read the data in one end and write it as is out the other end. Spout State Stream Monday, April 28, 14
  46. TRANSFORMS - I don’t think this is the official name

    - Purely functional operations on data - Reads data in on an input stream, emits results on an output stream. Monday, April 28, 14
  47. Spout State Filter Aggregate - So far, this is where

    we are at. - We have a spout that feeds data in - We can run it through some transforms - The data flows through and ends up at a state, where it will exit storm somehow, usually by being written to a database. Monday, April 28, 14
  48. MORE TRANSFORMS - The fun is just getting started -

    You can model quite complex data flow Monday, April 28, 14
  49. Spout 1 State Filter Aggregate Spout 2 Map Map Aggregate

    Filter Join Join State - Add annotations for filter (1 tweet per user) - Aggregate by hashtag - We’ll look at how to write these and how to hook them all together Monday, April 28, 14
  50. TOPOLOGY - The directed graph of spouts to states via

    transforms is called a topology - Represents the execution - I’m not going to talk much about deployment, but basically, you define this topology Monday, April 28, 14
  51. Spout Filter Let’s break it down Usually, the spout is

    provided via OSS packages - Redis spout, SQS spout, Kestrel, Kafka, etc... - There already are libraries of provided transformations - Transformations can be made generic, packaged up, reused, and shared - Instead of listing all the available spouts, I’m going to show how to implement them Monday, April 28, 14
  52. class MyFilter < BaseFunction
        def execute(tuple, output)
          msg = tuple.get_value_by_field("msg")
          if msg.awesome?
            output.emit(Values.new(msg))
          end
        end
      end
  55. DEFINE THE TOPOLOGY The directed graph of spouts to states

    via transforms is called a topology Monday, April 28, 14
  56. def define_topology(topology)
        topology.
          new_stream("my-spout", MyQueueSpout.new(TOPIC)).
          each(f("bytes"), MyQueueMsgDeserializer.new, f("msg")).
          each(f("msg"), MyFilter.new, f("msg")).
          each(f("msg"), MyLogger.new, f())
      end

      def f(*names)
        Fields.new(*names)
      end

      class MyQueueMsgDeserializer < BaseFunction
        def execute(tuple, output)
          bytes = tuple.get_value_by_field("bytes")
          msg = Msg.new(JSON.parse(bytes))
          output.emit(Values.new(msg))
        end
      end
  57. RUN IT The easiest way to get running is by

    running everything locally. Bundles zookeeper in process Monday, April 28, 14
  58. def run
        cluster = LocalCluster.new
        topology = TridentTopology.new
        config = Config.new

        define_topology(topology)

        cluster.submit_topology("my-topology", config, topology)
      end
  59. WINNING but only a little, not all that interesting yet.

    All our magnificently filtered data is getting dumped into the void Monday, April 28, 14
  60. Spout Filter State - We want to persist all the

    results - Again, I’m going to jump into the low level implementation of a state. There are higher level ones that can automatically persist to memcached or cassandra, or riak, or anything Monday, April 28, 14
  61. class MyBasicState < State
        def begin_commit(transaction_id)
        end

        def commit(transaction_id)
        end

        def persist_awesomely(msg)
          awesome_msg = MyAwesomeMsg.new(msg)
          awesome_msg.save!
        end
      end
      Only begin_commit / commit are required by the state
  62. class MyBasicUpdater < BaseStateUpdater
        def update_state(my_basic_state, input, output)
          msg = input.get_value_by_field("msg")
          my_basic_state.persist_awesomely(msg)
        end
      end
      - Define a state updater.
      - This is what receives the tuples off the stream and writes them to the state.
  63. Tweet Spout Extract Hashtags State Aggregate Let’s go back to

    our trending topics example - Aggregate function is going to get all the hashtag tuples and aggregate them into a count Monday, April 28, 14
  64. class ExtractHashTags < BaseFunction
        def execute(tuple, output)
          tweet = tuple.get_value_by_field("tweet")
          extract_hashtags(tweet).each do |hashtag|
            output.emit(Values.new(hashtag))
          end
        end
      end
  65. class HashTagAggregator < BaseAggregator
        def init(transaction_id, output)
          {} # initial summary
        end

        def aggregate(summary, tuple, output)
          hashtag = tuple.get_value_by_field("hashtag")
          summary[hashtag] ||= 0
          summary[hashtag] += 1
        end

        def complete(summary, output)
          summary.each do |hashtag, count|
            output.emit(Values.new(hashtag, count))
          end
        end
      end
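      Outside of storm, the init / aggregate / complete lifecycle can be exercised directly. This is a plain-Ruby sketch: hashes stand in for tuples, and the method names mirror the aggregator above rather than any real Trident API.

      ```ruby
      # Plain-Ruby sketch of the aggregator lifecycle over one batch.
      # Tuples are mimicked with Hashes; in storm they would be Trident tuples.
      def init_summary
        {} # initial summary, as in HashTagAggregator#init
      end

      def aggregate(summary, tuple)
        hashtag = tuple["hashtag"]
        summary[hashtag] ||= 0
        summary[hashtag] += 1
        summary
      end

      def complete(summary)
        # In storm this would emit (hashtag, count) tuples downstream;
        # here we just return the pairs.
        summary.to_a
      end

      batch = [{ "hashtag" => "#railsconf" },
               { "hashtag" => "#sleep" },
               { "hashtag" => "#railsconf" }]

      summary = batch.reduce(init_summary) { |s, t| aggregate(s, t) }
      complete(summary) # => [["#railsconf", 2], ["#sleep", 1]]
      ```

      The point is that the aggregator only ever emits one (hashtag, count) pair per hashtag per batch, no matter how many tuples came in.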
  66. topology.
        new_stream("my-spout", MyQueueSpout.new(TOPIC)).
        each(
          f("bytes"),
          MyQueueMsgDeserializer.new,
          f("username", "tweet")).
        each(f("tweet"), ExtractHashTags.new, f("hashtag")).
        partition_aggregate(
          f("hashtag"),
          HashTagAggregator.new,
          f("hashtag", "count")).
        partition_persist(
          TrendingTopicState.factory,
          f("hashtag", "count"),
          TrendingTopicUpdater.new)
  67. class TrendingTopicState < State
        def begin_commit(transaction_id)
        end

        def commit(transaction_id)
        end

        def update(hashtag, count)
          existing = HashTag.find_or_new_by_name(hashtag)
          existing.update_ewma(count, Time.now)
          existing.save!
        end
      end

      class TrendingTopicUpdater < BaseStateUpdater
        def update_state(state, input, output)
          hashtag = input.get_value_by_field("hashtag")
          count = input.get_value_by_field("count")
          state.update(hashtag, count)
        end
      end
  68. class HashTagAggregator < BaseAggregator
        def init(transaction_id, output)
          {} # initial summary
        end

        def aggregate(summary, tuple, output)
          hashtag = tuple.get_value_by_field("hashtag")
          summary[hashtag] ||= 0
          summary[hashtag] += 1
        end

        def complete(summary, output)
          summary.each do |hashtag, count|
            output.emit(Values.new(hashtag, count))
          end
        end
      end
      When does this run? Streams are an unbounded sequence of tuples, so when is it "complete"?
  69. EXECUTION IN BATCHES Batch size is determined by the spout. Could be 1
      tuple, could be 1MM tuples. More is generally better. The spout will fetch a number of messages to make the batch and send it downstream. Aggregation completion happens at the end of the batch (after the aggregation transform has seen all tuples in the batch). State begin_commit / commit happen once per batch.
  70. Tweet Spout Extract Hashtags State Aggregate Well, one thing that

    stands out is that it looks like we are sending all tweets down a single stream, seems problematic? Monday, April 28, 14
  71. Spout Transform State Let's get back to basics. This is
      what I showed, but it is only conceptually what is happening.
  72. What really happens is that streams are broken up into
      N partitions [diagram: Spout, Transform, and State each split into multiple partitions]
  73. Spout Aggregate State - Ideally, this is what we want to
      happen - It's important to note that it is OK to send all tuples w/ the same hashtag to the same state partition. The state partition will not be overloaded because we first aggregated. Worst case scenario, the state will receive N tuples per hashtag where N is the number of partitions
  83. each(f("tweet"), ExtractHashTags.new, f("hashtag")).
        partition_aggregate(
          f("hashtag"),
          HashTagAggregator.new,
          f("hashtag", "count")).
        partition(f("hashtag")).
        partition_persist(
          TrendingTopicFactory.new,
          f("hashtag", "count"),
          TrendingTopicUpdater.new)
      partition ensures that all tuples with the same hashtag get persisted on the same server
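      Under the hood, partitioning by a field is essentially hashing. This sketch of the routing idea is my own illustration (the modulo scheme and `partition_for` name are assumptions, not storm's implementation):

      ```ruby
      # Sketch of field-based hash partitioning: tuples with the same
      # hashtag always land on the same partition, so one server ends up
      # owning each hashtag's state.
      NUM_PARTITIONS = 8

      def partition_for(hashtag)
        # String#sum (byte sum) is used here because it is stable across
        # processes, unlike Ruby's String#hash. A real system would use a
        # proper stable hash; the routing idea is the same.
        hashtag.sum % NUM_PARTITIONS
      end
      ```

      Every occurrence of "#railsconf", from any aggregate partition, computes the same partition number, which is exactly the guarantee the `partition` step provides.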
  84. HOW MANY PARTITIONS? Arbitrary number, it’s configurable, so tweak for

    your use case. Maybe 8 is appropriate, maybe 512 is. Knowing how a topology gets executed on the cluster might help Let’s talk about that real quick Monday, April 28, 14
  85. MANY PARTITIONS PER THREAD Arbitrary number, it’s configurable, so tweak

    for your use case. Maybe 8 is appropriate, maybe 512 is. Monday, April 28, 14
  86. [diagram: 3 servers with 3 threads each; the 27 partitions are spread across the threads] Arbitrary number, it's configurable, so tweak for your use case. Maybe 8 is appropriate, maybe 512 is.
  88. Arbitrary number, it's configurable, so tweak for your use case.
      Maybe 8 is appropriate, maybe 512 is. [diagram: one server lost; the same 27 partitions now packed onto the remaining 2 servers]
  89. [diagram: the overloaded cluster again, with a New Server joining] Arbitrary number, it's configurable, so tweak for your use case. Maybe 8 is appropriate, maybe 512 is.
  90. STORM REBALANCES [diagram: partitions redistributed evenly across all four servers] (unlike my slide, which looks pretty off balance)
  91. FAILURE - The question is not will there be failure

    - The question is how do we handle it - Will we recover? - Will our system end up inconsistent? - Will we lose availability? - Handling failure is probably the hardest part of building a robust distributed system. - We have all built distributed systems. - If you have built a rails app, then you have built a distributed system. - The browser talks to the server which talks to the database. - Failure can happen at any stage - Consider a signup form, what happens if the user hits submit and the request failed? - Did the request reach the rails app? - Did the rails app start processing it? - At what point did the request fail? - Was it before writing to the DB? - Was it after? - If the user attempts to signup again, what will happen? - Will the user be successful? - Will the user get an error stating that there already is an account with the given email address? - How can we, as developers, prevent this from happening? - I bring up such a “simple” case, because it just gets more complex from here. Monday, April 28, 14
  92. Input Tuple [diagram: one input tuple fanning out into dozens of derived tuples]
      - This is more what tends to happen - You get one input tuple, and throughout the processing, it spawns off more tuples - There isn't really any upper bound to how many tuples can be fanned out from an original source
  93. Input Tuple [diagram: the same fan-out of tuples]
      - So what do you do in this case? - Do you track the status of every single message? The bookkeeping will be huge. - Things get even messier when you do joins. - AKA, when a tuple has more than one parent. - What happens then? You can't back out of the processing
  94. class TweetWorker
        include Sidekiq::Worker

        # Yes, I know this is naive and the number of
        # queries could be reduced.
        def perform(tweet)
          user = User.find_by_username(tweet.user)
          tags = user.new_hashtags_mentioned_this_hour(tweet.body)
          tags.each do |hashtag|
            existing = HashTag.find_or_new_by_name(hashtag)
            existing.update_ewma(Time.now)
            existing.save!
          end
        end
      end
      BOOM
  95. WHAT NOW? - How do we recover from this? -

    The message is half processed Monday, April 28, 14
  96. MESSAGE PROCESSING GUARANTEES - When evaluating a system, thinking about

    the message processing guarantees is important. - At least once - At most once - Something else? Monday, April 28, 14
  97. MONOTONICALLY INCREASING BATCH IDS - Combine that with the fact
      that storm will not commit a new batch until the previous batch has fully committed
  98. class TrendingTopicState < State
        def begin_commit(transaction_id)
          @txid = transaction_id
          @hashtags ||= load_hashtags_for(partition_ids)
        end

        def commit(transaction_id)
          tick_all_tags(transaction_id)
          delete_stale_hashtags
        end

        def update(name, count, timestamp)
          hashtag = (@hashtags[name] ||= HashTag.new)
          return if hashtag.last_txid == @txid
          hashtag.update_ewma(count, timestamp)
        end
      end
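      The txid check is what makes batch replays safe. Stripped down to plain Ruby, the idempotency trick looks like this (a sketch: the `Tag` struct and `apply_update` are hypothetical stand-ins for HashTag and its updater, and the plain addition stands in for update_ewma):

      ```ruby
      # Sketch: replaying the same batch (same txid) must not double count.
      Tag = Struct.new(:rate, :last_txid)

      def apply_update(tag, count, txid)
        return tag if tag.last_txid == txid # batch already applied: skip
        tag.rate += count                   # stand-in for update_ewma
        tag.last_txid = txid                # remember which batch we last saw
        tag
      end

      tag = Tag.new(0, nil)
      apply_update(tag, 5, 42) # first delivery of batch 42
      apply_update(tag, 5, 42) # replay of batch 42: no effect
      tag.rate # => 5
      ```

      Because batch ids increase monotonically and a batch is only replayed with its original id, comparing the stored txid against the incoming one is enough to detect and skip duplicate work.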
  102. Tweet Spout Extract Hashtags State Aggregate Storm: IT'S COOL, I'LL
       RE-EMIT THE SAME 200 TUPLES
  103. REQUIRES TRANSFORMS TO BE PURELY FUNCTIONAL - So, don't actually
       use Time.now in your transforms. - There are a few options - You can make a spout whose entire job is to get Time.now for a batch ID and save it somewhere, so that if the batch is re-emitted, it re-emits the same value. Another option: make batch time dependent on the input. Each tweet probably has a time associated with it, so write some function of all input tweets to compute a value for "now"
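       The second option above, deriving "now" from the input itself, can be sketched in one function (the tuples-as-hashes shape and the `batch_now` name are my assumptions for illustration):

       ```ruby
       # Sketch: a deterministic "now" for a batch. Re-emitting the
       # identical batch yields the identical timestamp, so any transform
       # using it stays purely functional.
       def batch_now(tweets)
         tweets.map { |t| t["timestamp"] }.max
       end

       batch = [{ "timestamp" => 100 }, { "timestamp" => 130 }, { "timestamp" => 120 }]
       batch_now(batch) # => 130, every time this batch is (re)played
       ```

       Any pure function of the batch works (max, min, median of timestamps); max is a natural choice since it is the most recent event the batch knows about.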
  104. REQUIRES THE SPOUT TO RE-EMIT IDENTICAL BATCHES - This one

    is trickier - Most queues, once you pop from the queue or read the message, it is gone. Monday, April 28, 14
  105. KAFKA - Really awesome - More of a commit log
       than a traditional queue - Send messages to it, it appends them to a log - The application has a cursor, starting at zero, and asks Kafka for messages from that cursor - It's the application's job to manage the cursor - Only one reader per queue, since there is no coordination - Kafka queues are heavily partitioned. Each reader reads a different partition. - Fits in well with storm since storm spouts are partitioned. Each storm partition reads from its own kafka partition - Nice properties: replay
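       The cursor model described above can be sketched with an append-only array standing in for one Kafka partition. All names here (`CommitLog`, `append`, `read`) are illustrative, not Kafka's actual API:

       ```ruby
       # Sketch of a Kafka-style commit log: consumers track their own
       # cursor, and re-reading from the same cursor replays identical
       # messages, which is what lets a spout re-emit identical batches.
       class CommitLog
         def initialize
           @entries = []
         end

         def append(msg)
           @entries << msg
         end

         # Read up to `count` messages starting at `cursor`. The log is
         # never mutated by reads; advancing the cursor is the caller's job.
         def read(cursor, count)
           @entries[cursor, count] || []
         end
       end

       log = CommitLog.new
       %w[t1 t2 t3 t4].each { |m| log.append(m) }

       batch  = log.read(0, 2) # => ["t1", "t2"]
       replay = log.read(0, 2) # => ["t1", "t2"], identical batch on replay
       ```

       Contrast with a traditional queue, where the first pop destroys the message; here replay is free because reads never consume anything.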
  106. GOOD FOR COMPLEX DATA PROCESSING FLOWS - Everything I talked

    about is specifically about the Trident part of storm. - I also entirely focused on what storm calls transactional topologies Monday, April 28, 14