Self-Healing Systems: The Road to 99.99%

Stop firefighting and start fireproofing! Many tools make on-call easier and increase availability, but we'll mostly focus on a few principles and design patterns that help make your systems more robust.

William Ting

August 22, 2016
Transcript

  1. 6.

    Maslow’s Hierarchy of Reliability*

    Monitoring Incident Response Testing + Release Eng Product, Capacity

    *Site Reliability Engineering (Beyer, Jones, Petoff, Murphy)
  3. 11.

    “I personally believe that within the data center, network partitions very rarely happen [..]”
    Shay Banon (2010), primary author of Elasticsearch
    http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-td891925.html
  6. 14.

    Is Network Reliable?

    Microsoft [1]
    ◎ 5.2 device failures/day
    ◎ 40.8 link failures/day
    ◎ ~59k packets lost per failure

    Google Chubby [2], over 700 days of operation:
    ◎ 4 outages due to network maintenance
    ◎ 2 outages due to “suspected network connectivity problems”

    Jeff Dean [3], a new Google cluster’s first year typically sees:
    ◎ 40-80 machines with 50% packet loss
    ◎ 8 network maintenances
    ◎ 3 router failures

    [1] Understanding Network Failures in Data Centers (Gill, Jain, Nagappan)
    [2] The Chubby lock service for loosely-coupled distributed systems (Mike Burrows)
    [3] Design Lessons and Advice from Building Large Scale Distributed Systems (Jeff Dean)
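With failure rates like these, a caller cannot assume that a single attempt will go through. The sketch below shows one common response, retrying with exponential backoff and jitter. It is a minimal illustration, not the speaker's code: `retry`, `FlakyService`, and all parameter names and defaults are hypothetical.

```python
import random
import time


def retry(func, attempts=5, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call func() until it succeeds or `attempts` tries are used up."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff with full jitter: wait a random time in
            # [0, min(max_delay, base_delay * 2**attempt)] before retrying.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))


class FlakyService:
    """Hypothetical remote call that fails the first `failures` times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated network failure")
        return "ok"


service = FlakyService(failures=2)
# The demo passes a no-op sleep so it finishes instantly.
result = retry(service, sleep=lambda seconds: None)
```

Retrying is only safe when the operation can be repeated without side effects piling up, which is why the summary slide pairs it with idempotency.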
  11. 41.

    “There are only two hard problems in distributed systems:
    2. Exactly-once delivery
    1. Guaranteed order of messages
    2. Exactly-once delivery”
    Mathias Verraes (@mathiasverraes)
    https://twitter.com/mathiasverraes/status/632260618599403520
  13. 43.

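    | Message Delivery | Pros                 | Cons                              | Example              |
    |------------------|----------------------|-----------------------------------|----------------------|
    | At Most Once     | easy to implement    | message loss                      | UDP, Kafka           |
    | At Least Once    | no message loss      | potential double processing       | Kafka, SQS, RabbitMQ |
    | Exactly Once*    | what you really want | significant coordination overhead | ?                    |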
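One way to see the difference between the first two rows is the order in which a consumer acknowledges and processes a message. The sketch below simulates a single consumer crash against a toy in-memory queue. `InMemoryQueue` and `consume` are hypothetical names used only for illustration; they do not model any real broker's API.

```python
class InMemoryQueue:
    """Hypothetical queue that redelivers from the last committed offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to deliver

    def get_message(self):
        return self.committed, self.messages[self.committed]

    def commit(self, offset):
        self.committed = offset + 1


def consume(queue, commit_first, crash_on):
    """Consume every message, simulating one crash while handling `crash_on`."""
    processed = []
    crashed = False
    while queue.committed < len(queue.messages):
        offset, msg = queue.get_message()
        if commit_first:
            queue.commit(offset)       # at most once: acknowledge, then process
        if msg == crash_on and not crashed:
            crashed = True             # the consumer dies here and restarts
            if not commit_first:
                processed.append(msg)  # the work happened, but the commit was lost
            continue
        processed.append(msg)
        if not commit_first:
            queue.commit(offset)       # at least once: process, then acknowledge
    return processed


at_most_once = consume(InMemoryQueue("abc"), commit_first=True, crash_on="b")
at_least_once = consume(InMemoryQueue("abc"), commit_first=False, crash_on="b")
```

In this simulation the commit-first consumer never processes `"b"`, while the process-first consumer handles it twice, matching the "message loss" and "potential double processing" columns.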
  14. 51.

    Message Processing

        msg = client.get_message()
        process(msg)  # This needs to be idempotent!
        client.commit(msg.id)

    At Least Once Delivery
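The slide does not show what an idempotent `process` looks like. One common approach, sketched here under assumptions, is to remember the IDs of messages already applied and skip repeats. `Message`, `Ledger`, and the field names are hypothetical, and a real system would need to keep the seen-ID set in durable storage and update it atomically with the state it protects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    id: str
    account: str
    amount: int


class Ledger:
    """Applies each message's effect once, even if it is delivered twice."""

    def __init__(self):
        self.balances = {}
        self.seen_ids = set()  # assumption: in memory only, for illustration

    def process(self, msg):
        if msg.id in self.seen_ids:
            return False  # duplicate delivery: nothing to do
        self.balances[msg.account] = self.balances.get(msg.account, 0) + msg.amount
        self.seen_ids.add(msg.id)
        return True


ledger = Ledger()
deposit = Message(id="m-1", account="alice", amount=50)
first = ledger.process(deposit)
second = ledger.process(deposit)  # redelivered after a lost commit
```

After both calls the balance reflects a single deposit, so a redelivery under at-least-once semantics is harmless.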
  15. 57.

        order_cb = CircuitBreaker(initial_score=100)

        def send_message(msg):
            if order_cb.score > 0:
                # Breaker still has score left: use the primary path.
                with order_cb(score=-10):
                    client.send_message(msg)
            else:
                # Breaker is exhausted: fall back to HTTP.
                return http_client.post(msg)
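The slide uses a `CircuitBreaker` object without showing how it works. The sketch below is one possible reading, not the speaker's implementation: it assumes the `with` block subtracts the given penalty when the guarded call raises, and that the score is restored after a cool-down so the primary path gets tried again. The `reset_after` and `clock` parameters and the `_Attempt` helper are hypothetical additions.

```python
import time


class CircuitBreaker:
    """Score-based breaker: failures lower the score, callers check `score`."""

    def __init__(self, initial_score=100, reset_after=30.0, clock=time.monotonic):
        self.initial_score = initial_score
        self.reset_after = reset_after  # assumed cool-down, in seconds
        self.clock = clock
        self._score = initial_score
        self._opened_at = None

    @property
    def score(self):
        # Once the cool-down has passed, restore the score so the primary
        # path is attempted again.
        if self._opened_at is not None and self.clock() - self._opened_at >= self.reset_after:
            self._score = self.initial_score
            self._opened_at = None
        return self._score

    def __call__(self, score):
        return _Attempt(self, score)

    def _penalize(self, delta):
        self._score += delta
        if self._score <= 0 and self._opened_at is None:
            self._opened_at = self.clock()


class _Attempt:
    """Context manager that applies the penalty if the guarded block raises."""

    def __init__(self, breaker, delta):
        self.breaker = breaker
        self.delta = delta

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.breaker._penalize(self.delta)
        return False  # let the caller see the original error


now = [0.0]  # fake clock so the demo does not have to wait
cb = CircuitBreaker(initial_score=20, reset_after=30.0, clock=lambda: now[0])
for _ in range(2):
    try:
        with cb(score=-10):
            raise ConnectionError("queue unavailable")
    except ConnectionError:
        pass
open_score = cb.score       # exhausted: callers should use the fallback
now[0] = 31.0
recovered_score = cb.score  # cool-down elapsed: primary path is retried
```

Restoring the full score after a fixed cool-down is the simplest recovery policy; a production breaker would more likely probe the primary path with a few trial requests first.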
  16. 59.

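    0.01% fail rate, compared to 1% for the same connection originally

    $P_f(\mathrm{req}) = P_f(\mathrm{q} \cap \mathrm{http})$

    A request now fails only when both the queue and the HTTP fallback fail. If the two failures were independent, this would equal $P_f(\mathrm{q}) \cdot P_f(\mathrm{http}) = 1\% \times 1\% = 0.01\%$. In practice these events are not completely independent.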
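To get a feel for how much the independence caveat matters, the sketch below compares the independent case with one where a small shared cause (for example, a network partition) takes out both paths at once. The model and the 0.1% shared-failure figure are assumptions chosen for illustration, not numbers from the talk.

```python
def joint_failure(p_queue, p_http, shared=0.0):
    """P(both paths fail) when a common cause fails both with probability `shared`."""
    # Residual per-path failure rates once the shared cause is factored out,
    # chosen so each path's overall failure rate stays at p_queue and p_http.
    r_queue = (p_queue - shared) / (1 - shared)
    r_http = (p_http - shared) / (1 - shared)
    return shared + (1 - shared) * r_queue * r_http


independent = joint_failure(0.01, 0.01)               # no common cause
correlated = joint_failure(0.01, 0.01, shared=0.001)  # assumed 0.1% shared outages

print(f"independent: {independent:.4%}")
print(f"correlated:  {correlated:.4%}")
```

Under these assumed numbers the combined failure rate is roughly ten times worse than the independent estimate, even though each path on its own still fails 1% of the time.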
  17. 60.

    Summary

    ◎ Unreliable Networks: But that’s a hardware issue!
    ◎ Retry!: Try again. Fail again. Fail better. (Samuel Beckett)
    ◎ Idempotency: It’s déjà vu all over again.
    ◎ Circuit Breakers: I promise you’re not my fallback option.
    ◎ Monitoring: You can’t fix what you can’t see.
    ◎ Queues: Not just for the Brits.
  19. 63.

    Performance Optimizations

    ◎ Retries in intermediary layers
    ◎ Multiple concurrent requests to reduce 99th percentile latency [1]

    [1] Spanner: Google’s Globally-Distributed Database
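The second bullet can be sketched as sending the same request to several replicas at once and using whichever answer arrives first. This is a minimal illustration built on the standard library's `concurrent.futures`; `first_response`, `make_replica`, and the latencies are hypothetical.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def first_response(replicas, request):
    """Send `request` to every replica at once and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    try:
        futures = [pool.submit(replica, request) for replica in replicas]
        done, _pending = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()
    finally:
        # Do not block on the slower replicas; their results are discarded.
        pool.shutdown(wait=False)


def make_replica(name, latency):
    """Hypothetical replica that answers after `latency` seconds."""
    def handle(request):
        time.sleep(latency)
        return f"{name}:{request}"
    return handle


replicas = [make_replica("slow", 0.5), make_replica("fast", 0.01)]
start = time.monotonic()
answer = first_response(replicas, "GET /orders/1")
elapsed = time.monotonic() - start
```

The caller sees the latency of the fastest replica rather than the slowest, at the cost of extra load. Because every replica still executes the request, this only suits operations that are idempotent, and a fuller version would also handle the case where the first response is an error.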