Streaming Your Way to Web Scale: Scaling Bitly via Stream-Based Processing

Peter Herndon
September 25, 2014

In this talk, I will cover how a web site naturally evolves from a simple web application, through the addition of an asynchronous job queue, to a fully stream-based, service-oriented architecture, as implemented at Bitly. We will explore how Bitly arrived at this architecture, weigh the pros and cons of operating a stream-based system, and examine some of the tools we built to make our service possible, particularly NSQ (https://github.com/bitly/nsq/), our message queue system.

Transcript

  1. Streaming Your Way to Web Scale: Scaling Bitly via Stream-Based Processing

    Surge, September 25, 2014, 2:30pm. 85 slides, 23 images, 21 diagrams.
  2. or

  3. Define streaming

    A public API surface that puts events into message queues that are consumed by web services.
  4. Basic Web App

    [Diagram: Web App, Database] Basic web app. In Python: Django + Postgres, Flask + Postgres, or Tornado + Postgres.
  5. Scaling the Mountain (of Load)

    [Diagram: Web App x3, Database] First bottleneck: the web layer.
  6. Cache Rules Everything Around Me

    [Diagram: Web App x3, Cache, Database] With the web layer bottleneck removed, the database is next, so add a caching layer.
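A minimal sketch of the caching layer on this slide, written as a cache-aside read path. The deck does not say which cache Bitly used; here a plain dict stands in for the cache server, and `FAKE_DB`, `load_from_db`, and the key names are hypothetical.

```python
# Cache-aside read path: check the cache first, fall back to the database.
# FAKE_DB and the key names are hypothetical stand-ins for database rows.
FAKE_DB = {"user:1": "alice", "user:2": "bob"}
db_reads = 0   # counts how often the "database" is actually hit

cache = {}     # stand-in for a shared cache server


def load_from_db(key):
    """Simulate a database query and count it."""
    global db_reads
    db_reads += 1
    return FAKE_DB.get(key)


def get(key):
    """Return the value for key, filling the cache on a miss."""
    if key in cache:
        return cache[key]
    value = load_from_db(key)
    if value is not None:
        cache[key] = value
    return value


first = get("user:1")    # miss: goes to the database
second = get("user:1")   # hit: served from the cache
```

Only the first read touches the database; repeated reads of the same key are absorbed by the cache.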
  7. You Want Me to Replicate You

    [Diagram: Web App x3, Cache, Database x2] This works for a while, but database requests still take too long, so replicate.
  8. Shards Here, Shards There

    [Diagram: Web App x3, Cache, Database x6] …and then shard.
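The deck does not describe how keys are assigned to shards. One common scheme, shown here purely as an assumption, is a stable hash of the key modulo the number of shards. The shard names are hypothetical; six are used only because the slide shows six database boxes.

```python
import zlib

# Hypothetical shard names; the slide shows six database boxes.
SHARDS = ["db0", "db1", "db2", "db3", "db4", "db5"]


def shard_for(key):
    """Map a key to one shard with a stable hash (CRC32 modulo shard count)."""
    return SHARDS[zlib.crc32(key.encode("utf-8")) % len(SHARDS)]
```

CRC32 is used instead of Python's built-in `hash()` because string hashes are randomized per process, which would send the same key to different shards after a restart.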
  9. It’s Off to Work I Go!

    [Diagram: Web App, Cache, Database, Queue, Worker] But individual requests still take too long because each one does too much work, so add a message queue and a worker. In Python: Celery.
  10. Working the Event

    [Diagram: Web App, Cache, Database, Queue, Worker] The worker pulls a message off the queue and processes it.
  11. Write Here, Write Now

    [Diagram: Web App, Cache, Database, Queue, Worker] The worker writes results to the database, file system, etc.
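The queue-and-worker pattern from the last three slides can be sketched with the standard library alone. This is not Celery and not Bitly's code: `queue.Queue` stands in for the message queue, a dict stands in for the database, and `handle_request` and the upper-casing "work" are hypothetical placeholders.

```python
import queue
import threading

jobs = queue.Queue()      # stand-in for the message queue
results = {}              # stand-in for the database the worker writes to


def handle_request(job_id, payload):
    """Web app side: enqueue the slow work and return immediately."""
    jobs.put((job_id, payload))
    return "accepted"


def worker():
    """Worker side: pull messages off the queue, process, write results."""
    while True:
        item = jobs.get()
        if item is None:          # sentinel: shut the worker down
            jobs.task_done()
            break
        job_id, payload = item
        results[job_id] = payload.upper()   # placeholder for real work
        jobs.task_done()


thread = threading.Thread(target=worker)
thread.start()
status = handle_request(1, "shorten this link")
jobs.join()               # wait until the worker has processed the job
jobs.put(None)            # ask the worker to exit
thread.join()
```

The request handler returns as soon as the job is queued; the expensive step runs in the worker, outside the request path.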
  12. Write Here, Write Now (redux)

    [Diagram: Web App, Database, Worker, Queue] Instead, imagine the worker writes results…
  13. Sending Out an SMS

    [Diagram: Web App, Database, Worker, Queue] …and the web app writes event messages to a queue local to the web service…
  14. Listen, listen, LISTEN

    [Diagram: Web App, Database, Worker, Queue x2] …but the worker is listening to a queue running on another server.
  15. Workin’ On a Chain(ed) Gang

    [Diagram: three services, each with Web App, Database, Worker, and Queue]
  16. Look it up!

    [Diagram: Web App, Database, Worker, Queue x2] The worker finds the queue that holds its topic via nsqlookupd.
  17. How do I find things?

    /<service>/queuereader_<service>.py

        if __name__ == "__main__":
            tornado.options.parse_command_line()
            logatron_client.setup()

            Reader(
                topic=settings.get('nsqd_output_topic'),
                channel='queuereader_spam_metrics',
                validate_method=validate_message,
                message_handler=count_spam_actions,
                lookupd_http_addresses=settings.get('nsq_lookupd')
            )
            run()
  18. Sending Out an SMS

    [Diagram: Web App, Database, Worker, Queue with topic: 'spam_api'] The first time the app writes to a TOPIC in the local nsqd…
  19. Sending Out an SMS

    [Diagram: Web App, Database, Worker, Queue with topic: 'spam_api', nsqlookupd] …nsqd creates the topic and registers it with nsqlookupd.
  20. Where Am I Again?

    [Diagram: Web App, Database, Worker, nsqd with topic: 'spam_counter', nsqd with topic: 'spam_api', nsqlookupd asked "topic: 'spam_api'?"] A worker in another service that is looking for a topic asks nsqlookupd, which replies with the address.
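The register-then-look-up flow on the last three slides can be modelled with a toy in-memory registry. This is a sketch of the idea only, not nsqlookupd's actual protocol; `ToyLookup`, its method names, and the host:port strings are all hypothetical.

```python
from collections import defaultdict


class ToyLookup:
    """In-memory stand-in for a lookup service: topic -> producer addresses."""

    def __init__(self):
        self._producers = defaultdict(set)

    def register(self, topic, address):
        """Called by a queue daemon when it creates a topic."""
        self._producers[topic].add(address)

    def lookup(self, topic):
        """Called by a worker that wants to know where a topic lives."""
        return sorted(self._producers.get(topic, ()))


lookup = ToyLookup()
# Hypothetical host:port addresses for two services' local queue daemons.
lookup.register("spam_api", "spam-api-host:4150")
lookup.register("spam_counter", "spam-counter-host:4150")

addresses = lookup.lookup("spam_api")
```

The worker never needs a hard-coded queue address; it only needs to know the topic name and where the lookup service is.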
  21. Talkin’ ‘Bout Something

    [Diagram: Web App, Database, Worker, nsqd with topic: 'spam_counter', nsqd with topic: 'spam_api' and channel: 'spam_counter'] The queuereader connects to nsqd and registers a channel.
  22. Cross-Town Traffic

    [Diagram: nsqd with topic: 'spam_api', channel: 'spam_counter', Worker x3] Messages are divided among the subscribers to a channel, which allows horizontal scaling.
  23. Channeling the Ghost

    [Diagram: nsqd with topic: 'spam_api'; channel: 'spam_counter' with Worker x3; channel: 'nsq_to_file' with one Worker] Each channel receives a full copy of all messages.
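The two delivery rules on these slides (every channel sees every message; within a channel, each message goes to only one subscriber) can be illustrated with a toy model. `ToyTopic` is hypothetical and is not how nsqd is implemented; in particular, the round-robin choice of subscriber is an assumption of this sketch.

```python
class ToyTopic:
    """Toy model of a topic with named channels and per-worker inboxes."""

    def __init__(self):
        self._channels = {}   # channel name -> list of worker inboxes
        self._sent = {}       # channel name -> messages delivered so far

    def subscribe(self, channel):
        """Add a worker to a channel and return that worker's inbox."""
        inbox = []
        self._channels.setdefault(channel, []).append(inbox)
        self._sent.setdefault(channel, 0)
        return inbox

    def publish(self, message):
        """Copy the message to every channel, one worker per channel."""
        for channel, inboxes in self._channels.items():
            # Round-robin is an assumption of this sketch.
            target = inboxes[self._sent[channel] % len(inboxes)]
            target.append(message)
            self._sent[channel] += 1


topic = ToyTopic()
counters = [topic.subscribe("spam_counter") for _ in range(3)]
archiver = topic.subscribe("nsq_to_file")
for n in range(6):
    topic.publish(n)
```

The three `spam_counter` workers split the six messages between them, while the single `nsq_to_file` worker receives all six.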
  24. /<service>/queuereader_<service>.py, 1 of 4

        if __name__ == "__main__":
            tornado.options.parse_command_line()
            logatron_client.setup()

            Reader(
                topic=settings.get('nsqd_output_topic'),
                channel='queuereader_spam_metrics',
                validate_method=validate_message,
                message_handler=count_spam_actions,
                lookupd_http_addresses=settings.get('nsq_lookupd')
            )
            run()
  26. /<service>/queuereader_<service>.py, 2 of 4

        def validate_message(message):
            if message.get('o') == '+' and message.get('l'):
                return True
            if message.get('o') == '-' and message.get('l') \
                    and message.get('bl'):
                return True
            return False
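To make the validation rule concrete, here is the same function applied to a few sample messages. Only the keys `'o'`, `'l'`, and `'bl'` and the values `'+'` and `'-'` come from the slide; the deck does not say what the fields mean, and the other values below are hypothetical.

```python
def validate_message(message):
    # Same logic as the slide: '+' needs 'l'; '-' needs both 'l' and 'bl'.
    if message.get('o') == '+' and message.get('l'):
        return True
    if message.get('o') == '-' and message.get('l') and message.get('bl'):
        return True
    return False


# Hypothetical messages; only the key names come from the slide.
samples = [
    {'o': '+', 'l': 'example-label'},                    # accepted
    {'o': '-', 'l': 'example-label', 'bl': 'example'},   # accepted
    {'o': '-', 'l': 'example-label'},                    # rejected: no 'bl'
    {'o': '+'},                                          # rejected: no 'l'
    {'o': '?', 'l': 'example-label'},                    # rejected: other 'o'
]
verdicts = [validate_message(m) for m in samples]
```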
  28. /<service>/queuereader_<service>.py, 3 of 4

        def count_spam_actions(message, nsq_msg):
            key_section = statsd_keys[message['o']]
            key = key_section.get(message['l'], key_section['default'])
            statsd.incr(key)

            if key == 'remove_by_manual':
                key_section = statsd_keys['-manual']
                key = key_section.get(message['bl'], key_section['default'])
                statsd.incr(key)

            return nsq_msg.finish()
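The handler can be exercised without statsd or NSQ by substituting in-memory fakes. The slide does not show `statsd_keys`, so the table below is a guess at its shape: only `'default'`, `'-manual'`, and `'remove_by_manual'` appear in the original code, and every other key and value is made up. `FakeStatsd` and `FakeNsqMessage` are hypothetical test doubles.

```python
from collections import Counter

# Hypothetical key table; only 'default', '-manual' and 'remove_by_manual'
# appear in the slide. Every other name here is made up for illustration.
statsd_keys = {
    '+': {'default': 'add_by_other'},
    '-': {'manual': 'remove_by_manual', 'default': 'remove_by_other'},
    '-manual': {'phishing': 'manual_phishing', 'default': 'manual_other'},
}


class FakeStatsd:
    """Counts increments in memory instead of sending them to statsd."""

    def __init__(self):
        self.counts = Counter()

    def incr(self, key):
        self.counts[key] += 1


class FakeNsqMessage:
    """Records whether the handler marked the message as finished."""
    finished = False

    def finish(self):
        self.finished = True


statsd = FakeStatsd()


def count_spam_actions(message, nsq_msg):
    # Same logic as the slide.
    key_section = statsd_keys[message['o']]
    key = key_section.get(message['l'], key_section['default'])
    statsd.incr(key)
    if key == 'remove_by_manual':
        key_section = statsd_keys['-manual']
        key = key_section.get(message['bl'], key_section['default'])
        statsd.incr(key)
    return nsq_msg.finish()


msg = FakeNsqMessage()
count_spam_actions({'o': '-', 'l': 'manual', 'bl': 'phishing'}, msg)
```

With this table, a manual removal increments two counters (the removal itself and its `'bl'` breakdown) and then finishes the message.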
  30. /<service>/queuereader_<service>.py, 4 of 4

        if __name__ == "__main__":
            tornado.options.parse_command_line()
            logatron_client.setup()

            Reader(
                topic=settings.get('nsqd_output_topic'),
                channel='queuereader_spam_metrics',
                validate_method=validate_message,
                message_handler=count_spam_actions,
                lookupd_http_addresses=settings.get('nsq_lookupd')
            )
            run()
  32. Features & Guarantees (aka Trade-Offs)

    Distributed, no SPOF; horizontally scalable; TLS; statsd integration; easy to deploy; cluster administration.
  33. Streaming Architecture

    Easy to build new services. Easy to scale individual components horizontally. Durable in the face of single component failure. Distributed.
  34. THINGS TO THINK ABOUT

    Monitoring, monitoring, monitoring.
    Failure modes: how can things fail? How does your application as a whole handle the failure of individual components?
    Measurement: metrics show the range.
    Timeouts: connection timeouts, DNS timeouts. A slow network is the same as a failed service.
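The last point, that a slow dependency should be treated like a failed one, can be sketched by giving every call a deadline. This is an illustrative pattern, not code from the talk; `call_with_deadline`, `slow_service`, and `fast_service` are hypothetical names, and the durations are arbitrary.

```python
import concurrent.futures
import time


def call_with_deadline(func, seconds):
    """Run func in a thread; report failure if it misses the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return ("ok", future.result(timeout=seconds))
    except concurrent.futures.TimeoutError:
        return ("failed", None)    # too slow counts as down
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


def slow_service():
    time.sleep(0.5)   # hypothetical dependency that answers too late
    return "late answer"


def fast_service():
    return "answer"


slow = call_with_deadline(slow_service, 0.1)
fast = call_with_deadline(fast_service, 0.1)
```

The caller gets the same "failed" result whether the dependency is down or merely slow, so it needs only one recovery path. Note that the slow call keeps running in its thread; the deadline bounds the caller's wait, not the work itself.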
  35. Photo Credits

    Web Scale - http://www.mongodb-is-web-scale.com
    Waterfall - https://www.flickr.com/photos/desatur8/14949285342
    Tornado - https://www.flickr.com/photos/indigente/798304
    John de Lancie - https://www.flickr.com/photos/cayusa/1394930005
    Ben Whishaw - https://www.flickr.com/photos/rossendalewadey/6032496676
    Command Key - https://www.flickr.com/photos/klash/3175479797
    iPhone6 Event - https://www.flickr.com/photos/notionscapital/15067798867
    Wait for iPhone - https://www.flickr.com/photos/josh_gray/662814907
    NSQ Logo - http://nsq.io
    All other photos by T. Peter Herndon