2016 - William Ting - Self-Healing Systems: The Road to 99.99% Uptime

PyBay
August 21, 2016

Description
Stop firefighting and start fireproofing! There are many tools that make on-call easier and increase availability, but we'll mostly focus on a few principles and design patterns that help make your systems more robust.

Abstract
Feature velocity is typically the higher priority early in a software system's lifecycle, but as the system matures, effort shifts toward fireproofing it. On the Yelp Transactions Platform team we've used a combination of circuit breakers, queues, and idempotent operations to minimize downtime and being woken up in the middle of the night.

We'll take a look at how these design patterns help in a distributed system, when they should be used, and the common pitfalls associated with them.

Bio
William Ting is a longtime FOSS advocate with contributions to various projects (Pelican, autojump, pyramid_swagger, Rust, GNOME). He's currently an infrastructure engineer at Reddit and was previously on the Yelp Transaction Platform team.

https://youtu.be/_nybfUxPhco

Transcript

  1. Maslow’s Hierarchy of Reliability*

    Monitoring
    Incident Response
    Testing + Release Eng
    Product, Capacity

    *Site Reliability Engineering (Beyer, Jones, Petoff, Murphy)
  3. “I personally believe that within the data center, network partitions very rarely happen [..]”

    Shay Banon (2010), primary author of Elasticsearch
    http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-td891925.html
  6. Is Network Reliable?

    Microsoft [1]
    ◎ 5.2 device failures/day
    ◎ 40.8 link failures/day
    ◎ ~59k packet loss per failure

    Google Chubby [2]
    Over 700 days of operation:
    ◎ 4 outages due to network maintenance
    ◎ 2 outages due to “suspected network connectivity problems”

    Jeff Dean [3]
    New Google cluster’s first year typically sees:
    ◎ 40-80 machines have 50% packet loss
    ◎ 8 network maintenances
    ◎ 3 router failures

    [1] Understanding Network Failures in Data Centers (Gill, Jain, Nagappan)
    [2] The Chubby lock service for loosely-coupled distributed systems (Mike Burrows)
    [3] Design Lessons and Advice from Building Large Scale Distributed Systems (Jeff Dean)
  7. “There are only two hard problems in distributed systems:

    2. Exactly-once delivery
    1. Guaranteed order of messages
    2. Exactly-once delivery”

    Mathias Verraes (@mathiasverraes)
    https://twitter.com/mathiasverraes/status/632260618599403520
  9. Message Delivery

    Guarantee       Pros                   Cons                                Example
    At Most Once    easy to implement      message loss                        UDP, Kafka
    At Least Once   no message loss        potential double processing         Kafka, SQS, RabbitMQ
    Exactly Once*   what you really want   significant coordination overhead   ?
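One way to see the difference between the first two rows is the order in which a consumer commits and processes. The sketch below is a toy simulation, not any real broker's API: `consume`, `commit_first`, and `crash_on` are hypothetical names, and the "crash" is simulated once, between the two steps, followed by a restart.

```python
def consume(messages, commit_first, crash_on):
    """Simulate a consumer that crashes once mid-message, then restarts."""
    committed = set()   # offsets the broker believes are done
    processed = []      # side effects that actually happened
    crashed = False

    for _run in range(2):  # first run, then the restart after the crash
        for msg in messages:
            if msg in committed:
                continue  # broker does not redeliver committed messages
            if commit_first:
                first, second = committed.add, processed.append
            else:
                first, second = processed.append, committed.add
            first(msg)
            if msg == crash_on and not crashed:
                crashed = True
                break  # simulated crash between the two steps
            second(msg)
    return processed


# Commit before processing: at most once, the crashed message is lost.
at_most_once = consume(["a", "b", "c"], commit_first=True, crash_on="b")

# Process before committing: at least once, the crashed message repeats.
at_least_once = consume(["a", "b", "c"], commit_first=False, crash_on="b")
```

Under these assumptions the commit-first consumer never processes `"b"`, while the process-first consumer handles it twice, which is why the processing step has to tolerate duplicates.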
  10. Message Processing (At Least Once Delivery)

        msg = client.get_message()
        process(msg)  # This needs to be idempotent!
        client.commit(msg.id)
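A minimal sketch of what an idempotent `process` step could look like, assuming each message carries a unique id. The `IdempotentProcessor` class, the `amount` field, and the in-memory `seen_ids` set are illustrative assumptions; a real system would typically persist the dedup record together with the side effect so the two cannot diverge.

```python
class IdempotentProcessor:
    """Applies each message once, keyed by its id (hypothetical example)."""

    def __init__(self):
        self.seen_ids = set()  # assumption: in-memory dedup store
        self.balance = 0

    def process(self, msg):
        if msg["id"] in self.seen_ids:
            return False  # redelivered duplicate: do nothing
        self.balance += msg["amount"]
        self.seen_ids.add(msg["id"])
        return True


processor = IdempotentProcessor()

# At-least-once delivery: message 1 arrives twice.
deliveries = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": 25},
    {"id": 1, "amount": 50},
]
results = [processor.process(m) for m in deliveries]
```

The duplicate delivery is acknowledged but has no effect, so the balance reflects each message exactly once even though delivery was at least once.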
  11. Circuit breaker with an HTTP fallback

        order_cb = CircuitBreaker(initial_score=100)

        def send_message(msg):
            if order_cb.score > 0:
                with order_cb(score=-10):
                    client.send_message(msg)
            else:
                return http_client.post(msg)
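The slide does not show how `CircuitBreaker` is implemented, so the following is only a guess at a score-based breaker with the same call shape. The penalty-on-exception rule, the one-point recovery on success, `DownQueueClient`, and the `http_sent` list standing in for the HTTP fallback are all assumptions made for illustration.

```python
from contextlib import contextmanager


class CircuitBreaker:
    """Score-based breaker: failures lower the score, successes restore it."""

    def __init__(self, initial_score=100):
        self.initial_score = initial_score
        self.score = initial_score

    @contextmanager
    def _guard(self, penalty):
        try:
            yield
        except Exception:
            self.score += penalty  # penalty is negative, e.g. -10
            raise
        else:
            # Assumed recovery rule: each success earns back one point.
            self.score = min(self.initial_score, self.score + 1)

    def __call__(self, score=-10):
        return self._guard(score)


class DownQueueClient:
    """Hypothetical queue client that is always unreachable."""

    def __init__(self):
        self.attempts = 0

    def send_message(self, msg):
        self.attempts += 1
        raise ConnectionError("queue unavailable")


order_cb = CircuitBreaker(initial_score=100)
queue_client = DownQueueClient()
http_sent = []  # stands in for the HTTP fallback path


def send_message(msg):
    if order_cb.score > 0:
        try:
            with order_cb(score=-10):
                queue_client.send_message(msg)
            return "queue"
        except ConnectionError:
            pass  # fall through to the fallback for this message
    http_sent.append(msg)
    return "http"


routes = [send_message({"order": i}) for i in range(12)]
```

With the queue down, the first ten calls each try the queue and lose ten points; after that the score is zero and later calls skip the queue entirely. This sketch never reopens the circuit once the score reaches zero; a production breaker would usually add a timed probe to test whether the primary path has recovered.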
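  12. 0.01% fail rate compared to 1% for the same connection originally

    P_f(req) = P_f(q) ∩ P_f(http)

    A request fails only when both the queue path and the HTTP fallback fail. If each path fails independently 1% of the time, the combined rate is 1% × 1% = 0.01%. In practice these events are not completely independent.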
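The arithmetic behind this slide can be checked directly. The 1% figure for the HTTP path and the 50% conditional failure rate below are illustrative assumptions, not numbers from the talk; the second case shows how correlation between the two paths erodes the benefit.

```python
p_queue_fail = 0.01  # 1% failure rate on the queue path
p_http_fail = 0.01   # assumed: the HTTP fallback also fails 1% of the time

# Independent failures: both paths must fail for the request to fail.
p_both_independent = p_queue_fail * p_http_fail

# Correlated failures: P(both) = P(queue fails) * P(http fails | queue fails).
# Assumed for illustration: when the queue is down, HTTP fails half the time
# (for example, because both share the same network link).
p_http_fail_given_queue_fail = 0.5
p_both_correlated = p_queue_fail * p_http_fail_given_queue_fail

print(f"independent: {p_both_independent:.4%}")
print(f"correlated:  {p_both_correlated:.4%}")
```

Under the independence assumption the combined failure rate is 0.01%; with the assumed correlation it is 0.5%, still better than 1% but far from the ideal figure.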
  13. Summary

    Unreliable Networks
      But that’s a hardware issue!
    Retry!
      Try again. Fail again. Fail better. (Samuel Beckett)
    Idempotency
      It’s déjà vu all over again.
    Circuit Breakers
      I promise you’re not my fallback option.
    Monitoring
      You can’t fix what you can’t see.
    Queues
      Not just for the Brits.
  14. Performance Optimizations

    ◎ Retries in intermediary layers
    ◎ Multiple concurrent requests to reduce 99th percentile latency [1]

    [1] Spanner: Google’s Globally-Distributed Database
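The second bullet can be sketched as sending the same request to several replicas and keeping whichever answer arrives first. Everything here is an assumption for illustration: `fetch_from` simulates a replica with `time.sleep`, and the replica names and delays are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_from(replica, delay):
    """Stand-in for a network call; the delay simulates replica latency."""
    time.sleep(delay)
    return replica


def hedged_fetch(replicas):
    """Issue the same request to every replica and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fetch_from, name, delay) for name, delay in replicas]
    try:
        for future in as_completed(futures):
            return future.result()  # first request to finish wins
    finally:
        pool.shutdown(wait=False)  # do not block on the slower requests


winner = hedged_fetch([("slow-replica", 0.3), ("fast-replica", 0.01)])
```

Because every replica receives the request, this only suits reads or idempotent operations, and it trades extra load for lower tail latency. The sketch returns the first request to finish even if that request raised an error; a fuller version would skip failures and wait for the next result.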