Storm: the Hadoop of Realtime Stream Processing

Twitter's new scalable, fault-tolerant, and simple(ish) stream programming system... with Python!

Gabriel Grant

March 25, 2012

Transcript

  1. STORM Keeping it Real(time) Since 2011

  2. HELLO.

  3. dotCloud.com

  4. DATA

  5. DATA

  6. MEGA-DATA

  7. VERSION ONE

  8. VERSION TWO

  9. VERSION TWO

  10. VERSION THREE

  11. JOY

  12. VERSION FOUR?

  13. ENTER, STORM

  14. REAL-TIME COMPUTATION

  15. DISTRIBUTED RPC & STREAM PROCESSING

  16. HISTORY

  17. STREAM PROCESSING

  18. STORM : REAL-TIME :: HADOOP : BATCH

  19. WOW

  20. HIGH VOLUME

  21. CONTINUOUS

  22. CONTINUOUS

  23. FAULT TOLERANT

  24. DOESN'T

  25. PERSIST

  26. PROCESS BATCHES RELIABLY

  27. PROTECT AGAINST HUMAN ERROR

  28. PROTECT AGAINST HUMAN ERROR

  29. THREE CORE ELEMENTS

  30. SPOUTS

  31. STREAMS

  32. BOLTS

  33. TOPOLOGIES

  34. TASKS

  35. TASKS

  36. OUTPUT ROUTING?

  37. STREAM GROUPINGS

  38. SHUFFLE GROUPING

  39. FIELDS GROUPING

  40. ALL GROUPING

  41. GLOBAL GROUPING

  42. DOWN 'N DIRTY

  43. GATEWAYS

  44. GATEWAYS

  45. REAL-TIME GEOCODE BUCKETED CLIENT UPDATE

  46. THE TOPOLOGY

  47. THE TOPOLOGY

  48. CODE TIME: START ECLIPSE

  49. WAIT, WHAT?!

  50. MULTILANG API

  51. I'VE GOT YOU COVERED

  52. UMBRELLA: IT PROTECTS YOU FROM STORM

  53. THE TOPOLOGY

  54. I'VE GOT YOU COVERED

        class RedisSpout(JVMSpout):
            class Default(Stream):
                fields = 'message'

            jvm_class = 'yieldbot.storm.spout'

  55. I'VE GOT YOU COVERED

        class LogParserBolt(AutoAckBolt):
            class Default(Stream):
                fields = 'ip_address'

            def execute(self, input):
                # pull the client IP out of the raw log line
                ip_address = parse_log(input.message)
                self.emit(ip_address)

  56. I'VE GOT YOU COVERED

        import pygeoip

        class GeolocatorBolt(AutoAckBolt):
            class Default(Stream):
                fields = 'lat', 'long'

            def __init__(self, *args, **kwargs):
                self.geoip = pygeoip.GeoIP('GeoLiteCity.dat')
                super(GeolocatorBolt, self) \
                    .__init__(*args, **kwargs)

            def execute(self, input):
                # the upstream LogParserBolt emits a field named 'ip_address'
                record = self.geoip.record_by_addr(input.ip_address)
                lat = record['latitude']
                long_ = record['longitude']
                self.emit((lat, long_))

  57. I'VE GOT YOU COVERED

        import os
        from time import time

        import zerorpc

        class WSPusherBolt(Bolt):
            def __init__(self, *args, **kwargs):
                self.batcher = TimeBatcher()
                self.wspusher = zerorpc.Client(timeout=None)
                url = os.environ['WSPUSHER_ZERORPC_URL']
                self.wspusher.connect(url)
                super(WSPusherBolt, self).__init__(*args, **kwargs)

            def execute(self, input):
                t = time()
                # send the popped batch, if any, before queuing the new point
                batch = self.batcher.pop_batch(t)
                if batch:
                    self.wspusher.push_list(batch)
                data = input.lat, input.long
                self.batcher.push_item(t, data)

  58. I'VE GOT YOU COVERED

        class GeocoderTopology(Topology):
            # components
            redis = RedisSpout(1)
            parser = LogParserBolt(3)
            geolocator = GeolocatorBolt(2)
            pusher = WSPusherBolt(4)

            # plumbing
            parser.inputs.append(ShuffleGrouping(redis))
            geolocator.inputs.append(ShuffleGrouping(parser))
            pusher.inputs.append(
                FieldsGrouping(geolocator, 'lat', 'long'))

  59. INSIDE THE MACHINE

  60. THREE COMPONENTS

  61. NIMBUS

  62. ZOOKEEPER CLUSTER

  63. WORKER NODES

  64. DETAILS

  65. DEPLOYMENT

  66. EC2?

  67. DOTCLOUD!

  68. DEPLOYING TO DOTCLOUD

        $ git clone \
            https://github.com/gabrielgrant/storm-on-dotcloud.git
        $ dotcloud push mystorm storm-on-dotcloud
        …
        $ dotcloud scale worker=3

  69. TESTING

  70. JAVA

  71. CLOJURE

  72. ANT MAVEN

  73. LEININGEN

  74. SCALING

  75. WHEN

  76. HOW

  77. THE FUTURE: EASY & AUTO

  78. THANKS!

  79. GABRIEL GRANT @gabrielmgrant gabrielgrant.ca
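
Slides 37 to 41 name four stream groupings without showing how each one picks a target task. The following is a minimal Python sketch of the routing idea only, assuming tuples are plain dicts and tasks are numbered from 0. The function names and the CRC32-based hash are illustrative assumptions; they are not Storm's or Umbrella's API, and Storm's own hashing differs.

```python
import random
from zlib import crc32


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick a random task so tuples spread roughly evenly across tasks."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, fields):
    """Hash the chosen fields so equal values always reach the same task."""
    key = repr(tuple(tup[f] for f in fields)).encode("utf-8")
    return [crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send the entire stream to one task (here, the lowest task index)."""
    return [0]


if __name__ == "__main__":
    # Hypothetical tuple shaped like the GeolocatorBolt output on slide 56.
    point = {"lat": 45.5, "long": -73.6}
    print("shuffle:", shuffle_grouping(point, 4))
    print("fields: ", fields_grouping(point, 4, ["lat", "long"]))
    print("all:    ", all_grouping(point, 4))
    print("global: ", global_grouping(point, 4))
```

Under this sketch, the fields grouping on 'lat' and 'long' from slide 58 would send every tuple with the same coordinates to the same pusher task, whereas the shuffle groupings used for the parser and geolocator spread tuples without regard to content.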