Storm: the Hadoop of Realtime Stream Processing

Twitter's new scalable, fault-tolerant, and simple(ish) stream programming system... with Python!

Gabriel Grant

March 25, 2012

Transcript

  1. STORM
    Keeping it Real(time)
    Since 2011

  2. dotCloud.com

  3. VERSION
    THREE

  4. VERSION
    FOUR?

  5. ENTER, STORM

  6. REAL-TIME
    COMPUTATION

  7. DISTRIBUTED RPC
    &
    STREAM PROCESSING

  8. STREAM
    PROCESSING

  9. STORM:REAL-TIME
    HADOOP:BATCH

  10. FAULT
    TOLERANT

  11. PROCESS BATCHES
    RELIABLY

  12. PROTECT AGAINST
    HUMAN ERROR

  13. PROTECT AGAINST
    HUMAN ERROR

  14. THREE CORE
    ELEMENTS

  15. OUTPUT
    ROUTING?

  16. STREAM
    GROUPINGS

  17. SHUFFLE
    GROUPING

  18. FIELDS
    GROUPING

  19. GLOBAL
    GROUPING
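
The three groupings named on slides 17 to 19 decide which downstream task receives each tuple. The sketch below is only an illustration of that routing idea in plain Python, not Storm's implementation: the function names, the dict-shaped tuples, and the CRC32 hash are all assumptions made for the example.

```python
import random
from zlib import crc32


def shuffle_grouping(tup, num_tasks, rng=random):
    """Pick a task at random so load spreads roughly evenly."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, num_tasks, field_names):
    """Hash the chosen fields so equal values always reach the same task."""
    key = repr(tuple(tup[name] for name in field_names)).encode("utf-8")
    return crc32(key) % num_tasks


def global_grouping(tup, num_tasks):
    """Send the whole stream to one task (here, hypothetically, index 0)."""
    return 0


# Hypothetical tuple shaped like the geolocator output later in the deck.
point = {"lat": 45.5, "long": -73.6}
task_a = fields_grouping(point, 4, ["lat", "long"])
task_b = fields_grouping(dict(point), 4, ["lat", "long"])
assert task_a == task_b  # same field values, same task
```

Fields grouping is the one to reach for when a bolt keeps per-key state, since every tuple with the same key lands on the same task; shuffle grouping suits stateless work.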

  20. DOWN 'N
    DIRTY

  21. REAL-TIME GEOCODE
    BUCKETED CLIENT UPDATE

  22. CODE TIME:
    START ECLIPSE

  23. MULTILANG
    API

  24. I'VE GOT YOU
    COVERED

  25. UMBRELLA:
    IT PROTECTS YOU
    FROM STORM

  26. I'VE GOT YOU
    COVERED
    class RedisSpout(JVMSpout):
        class Default(Stream):
            fields = 'message'
        jvm_class = 'yieldbot.storm.spout'

  27. I'VE GOT YOU
    COVERED
    class LogParserBolt(AutoAckBolt):
        class Default(Stream):
            fields = 'ip_address'
        def execute(self, input):
            # parse_log is not shown on the slide; it pulls the client IP out of a log line
            ip_address = parse_log(input.message)
            self.emit(ip_address)
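
The slide calls `parse_log` without showing it. As a rough sketch only, assuming the Redis messages are access-log lines that begin with the client's IPv4 address (the deck does not say what the log format is), such a helper might look like this:

```python
import re

# Assumption: each log line starts with an IPv4 address followed by whitespace.
_LEADING_IP = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")


def parse_log(line):
    """Return the leading IPv4 address of a log line, or None if absent."""
    match = _LEADING_IP.match(line)
    return match.group(1) if match else None


sample = '203.0.113.7 - - [25/Mar/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 512'
assert parse_log(sample) == "203.0.113.7"
```

A bolt using a helper like this would probably want to skip lines where it returns `None` rather than emit them.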

  28. I'VE GOT YOU
    COVERED
    import pygeoip
    class GeolocatorBolt(AutoAckBolt):
        class Default(Stream):
            fields = 'lat', 'long'
        def __init__(self, *args, **kwargs):
            # load the GeoIP database once per bolt instance
            self.geoip = pygeoip.GeoIP('GeoLiteCity.dat')
            super(GeolocatorBolt, self).__init__(*args, **kwargs)
        def execute(self, input):
            # LogParserBolt names its output field 'ip_address'
            record = self.geoip.record_by_addr(input.ip_address)
            lat = record['latitude']
            long_ = record['longitude']
            self.emit((lat, long_))

  29. I'VE GOT YOU
    COVERED
    import os
    from time import time
    import zerorpc
    class WSPusherBolt(Bolt):
        def __init__(self, *args, **kwargs):
            self.batcher = TimeBatcher()
            self.wspusher = zerorpc.Client(timeout=None)
            url = os.environ['WSPUSHER_ZERORPC_URL']
            self.wspusher.connect(url)
            super(WSPusherBolt, self).__init__(*args, **kwargs)
        def execute(self, input):
            t = time()
            # flush any batch that is ready before queueing the new point
            batch = self.batcher.pop_batch(t)
            if batch:
                self.wspusher.push_list(batch)
            data = input.lat, input.long
            self.batcher.push_item(t, data)
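
`TimeBatcher` is used on this slide but never shown. The following is a hypothetical sketch of what such a helper could look like, assuming it collects items into fixed-width time windows and that both `push_item` and `pop_batch` are methods of the batcher; the `interval` parameter and its one-second default are invented for the example.

```python
class TimeBatcher(object):
    """Hypothetical batcher: collects items until a time window elapses."""

    def __init__(self, interval=1.0):
        self.interval = interval   # window width in seconds (assumed)
        self.window_start = None   # timestamp of the first item in the window
        self.items = []

    def push_item(self, t, item):
        """Queue an item, opening a new window if none is active."""
        if self.window_start is None:
            self.window_start = t
        self.items.append(item)

    def pop_batch(self, t):
        """Return the queued items once the window has elapsed, else None."""
        if self.window_start is None or t - self.window_start < self.interval:
            return None
        batch, self.items = self.items, []
        self.window_start = None
        return batch


batcher = TimeBatcher(interval=1.0)
batcher.push_item(0.0, (45.5, -73.6))
batcher.push_item(0.5, (37.8, -122.4))
assert batcher.pop_batch(0.9) is None
assert batcher.pop_batch(1.0) == [(45.5, -73.6), (37.8, -122.4)]
```

Passing the timestamp in, as the bolt does, keeps the batcher free of clock calls, which makes it easy to test. Note that a batch is only flushed when another tuple arrives, so a quiet stream can leave items waiting.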

  30. I'VE GOT YOU
    COVERED
    class GeocoderTopology(Topology):
        # components
        redis = RedisSpout(1)
        parser = LogParserBolt(3)
        geolocator = GeolocatorBolt(2)
        pusher = WSPusherBolt(4)
        # plumbing
        parser.inputs.append(ShuffleGrouping(redis))
        geolocator.inputs.append(ShuffleGrouping(parser))
        pusher.inputs.append(
            FieldsGrouping(geolocator, 'lat', 'long'))

  31. INSIDE
    THE MACHINE

  32. THREE
    COMPONENTS

  33. ZOOKEEPER
    CLUSTER

  34. $ git clone \
    https://github.com/gabrielgrant/storm-on-dotcloud.git
    $ dotcloud push mystorm storm-on-dotcloud

    $ dotcloud scale worker=3

  35. THE FUTURE:
    EASY & AUTO

  36. GABRIEL GRANT
    @gabrielmgrant
    gabrielgrant.ca
