
Self-Healing Systems: The Road to 99.99%

Stop firefighting and start fireproofing! Many tools make on-call easier and increase availability, but this talk focuses mostly on a few principles and design patterns that help make your systems more robust.

William Ting

August 22, 2016

Transcript

  1. Self-Healing Systems The Road to 99.99%

  2. Hello! I am William Ting, reddit engineer. w@reddit.com, u/wting, wting, @_wting
  4. Self-Healing Systems photo by Elijah Hail

  5. 99.99% availability: at most 00:04:23 of downtime per month, 00:52:35 per year
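
The downtime figures on this slide follow directly from the availability target. A minimal sketch of the arithmetic, assuming a 365.25-day year and a month defined as one twelfth of it (both period lengths are assumptions, as is the helper name `downtime_budget`):

```python
from datetime import timedelta


def downtime_budget(availability: float, period: timedelta) -> timedelta:
    """Return the downtime allowed within `period` at the given availability."""
    return timedelta(seconds=period.total_seconds() * (1.0 - availability))


# Assumed period lengths: a 365.25-day year, and a month as one twelfth of it.
YEAR = timedelta(days=365.25)
MONTH = YEAR / 12

per_month = downtime_budget(0.9999, MONTH)
per_year = downtime_budget(0.9999, YEAR)
print("per month:", per_month)
print("per year:", per_year)
```

Under these assumptions the budget comes to roughly four and a half minutes per month and a little under 53 minutes per year, in line with the slide.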

  6. Maslow’s Hierarchy of Reliability*: Monitoring, Incident Response, Testing + Release Eng Product, Capacity

    *Site Reliability Engineering (Beyer, Jones, Petoff, Murphy)
  8. Transactions

  11. “I personally believe that within the data center, network partitions very rarely happen [..]”

    Shay Banon (2010), primary author of Elasticsearch
    http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-td891925.html
  14. Is Network Reliable?

    Microsoft [1]:
    ◎ 5.2 device failures/day
    ◎ 40.8 link failures/day
    ◎ ~59k packet loss per failure
    Google Chubby [2], over 700 days of operation:
    ◎ 4 outages due to network maintenance
    ◎ 2 outages due to “suspected network connectivity problems”
    Jeff Dean [3], a new Google cluster’s first year typically sees:
    ◎ 40-80 machines have 50% packet loss
    ◎ 8 network maintenances
    ◎ 3 router failures
    [1] Understanding Network Failures in Data Centers (Gill, Jain, Nagappan)
    [2] The Chubby lock service for loosely-coupled distributed systems (Mike Burrows)
    [3] Design Lessons and Advice from Building Large Scale Distributed Systems (Jeff Dean)
  15. Transactions

  16. Transactions: If a single node fails, then the entire call fails.
  17. Transactions: five nodes, each with a 1% fail rate (*artificial numbers)

  18. 4.9% fail rate. Way too high!

    $P_f(\text{system}) = 1 - (1 - P_f(\text{node}))^n$
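
The 4.9% figure can be checked by plugging the slide's artificial numbers into the formula. A small sketch, assuming five nodes in the call path that each fail independently with probability 1% (the function name is illustrative):

```python
def system_failure_rate(node_failure_rate: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - node_failure_rate) ** nodes


rate = system_failure_rate(0.01, 5)
print(f"{rate:.1%}")  # 4.9%
```

The rate grows with every node added to the chain, which is why a long call path made of individually reliable services can still fail often.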
  19. Transactions: 15 ms, 5 ms, 10 ms, 300 ms, 20 ms *artificial numbers
  20. Transactions: five nodes, each with a 1% fail rate (*artificial numbers)

  21. Retry All The Things!

        @retry(tries=sys.maxint, delay=0, backoff=0)
        def call_endpoint(..):
            ..

  22. Idempotent Operations f(f(x)) = f(x) photo by Maico Amorim
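
To make f(f(x)) = f(x) concrete, the sketch below contrasts an idempotent operation with a non-idempotent one. Both functions and the order and balance data are hypothetical examples, not taken from the talk:

```python
def cancel_order(orders: dict, order_id: str) -> dict:
    """Idempotent: sets an absolute state, so repeating it changes nothing."""
    updated = dict(orders)
    updated[order_id] = "cancelled"
    return updated


def add_credit(balances: dict, user: str) -> dict:
    """Not idempotent: every call adds another 10 credits."""
    updated = dict(balances)
    updated[user] = updated.get(user, 0) + 10
    return updated


orders = {"o1": "open"}
once = cancel_order(orders, "o1")
twice = cancel_order(once, "o1")
print(once == twice)  # True: safe to retry

balances = {"alice": 0}
credited_once = add_credit(balances, "alice")
credited_twice = add_credit(credited_once, "alice")
print(credited_once == credited_twice)  # False: a retry double-credits
```

Only the first kind of operation is safe to retry blindly; the second needs deduplication or a redesign before retries are added.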

  23. ∞ Retries ∞ ∞ ∞ ∞ ∞

  24. 3 Retries 3 3 3 3 3

  25. 3 Retries 3 3 9 3 27 3
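
The 3, 9, 27 progression comes from retries multiplying at each hop. A rough sketch of the worst case, assuming a linear chain of services where every layer makes up to three attempts and every attempt fails (the function name is illustrative):

```python
from typing import List


def attempts_per_layer(tries: int, depth: int) -> List[int]:
    """Worst-case number of calls reaching each layer when every layer retries."""
    return [tries ** level for level in range(1, depth + 1)]


print(attempts_per_layer(3, 3))  # [3, 9, 27]
```

A failing service at the bottom of the chain therefore sees the most traffic exactly when it is least able to handle it.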

  26. Retry At Top 3 ≦3 ≦3 ≦3 ≦3

  27. Retry At Top 3

        @retry(tries=3, delay=0.3, backoff=0)
        def call_endpoint(..):
            ..

  28. Retry At Top 3 <=3 <=3 <=3 <=3

  29. Retry At Top 3

        @retry(tries=5, delay=60, backoff=2)
        def call_endpoint(..):
            ..
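
The slides do not show how `retry` is implemented, so the following is only a sketch of one possible decorator with the same keyword names. It assumes `backoff` is a multiplier applied to the delay after each failed attempt; the `sleep` parameter is an addition for this example so the waits can be recorded instead of actually slept:

```python
import functools
import time


def retry(tries=3, delay=0.0, backoff=1.0, sleep=time.sleep):
    """Retry the wrapped function, multiplying the wait by `backoff` each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


waits = []
calls = {"count": 0}


# Record the waits in a list instead of sleeping for real.
@retry(tries=5, delay=60, backoff=2, sleep=waits.append)
def call_endpoint():
    calls["count"] += 1
    if calls["count"] < 4:
        raise ConnectionError("transient failure")
    return "ok"


result = call_endpoint()
print(result, waits)  # ok [60, 120, 240]
```

With these settings the waits double after every failure, which gives a struggling dependency progressively more room to recover.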

  31. How Much Time? 3

        @retry(tries=3, delay=0.3, backoff=0)
        def call_endpoint(..):
            client.get_orders(timeout=10)
            client.post_order(timeout=10)
  32. Time Budget 3

        @retry(budget=15, delay=0.3, backoff=0)
        def call_endpoint(..):
            client.get_orders(timeout=10)
            client.post_order(timeout=10)
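
A time budget bounds the total time spent rather than the number of attempts. The sketch below is one way this could work and is not the decorator from the slides: `call_with_budget`, `BudgetExceeded`, and the injectable `clock` and `sleep` parameters are all hypothetical names, and the wrapped function is assumed to accept a `timeout` keyword.

```python
import time


class BudgetExceeded(Exception):
    """Raised when the time budget runs out before a call succeeds."""


def call_with_budget(func, budget, delay=0.3, clock=time.monotonic, sleep=time.sleep):
    """Retry `func` until it succeeds or `budget` seconds have elapsed."""
    deadline = clock() + budget
    last_error = None
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            raise BudgetExceeded(f"no time left in {budget}s budget") from last_error
        try:
            # Never give a single attempt more time than the budget has left.
            return func(timeout=remaining)
        except Exception as error:
            last_error = error
            # Pause between attempts, but never past the deadline.
            sleep(min(delay, max(deadline - clock(), 0.0)))
```

Passing the remaining time down as the per-call timeout keeps a slow first attempt from consuming more than the caller is willing to wait overall.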

  33. Retry Strategies: ◎ Retry at the Top ◎ Exponential Backoff ◎ Time Budget

  34. Queues photo by NASA

  35. HTTP

  36. Queues

  37. Queues: 15 ms, 5 ms, 10 ms, 300 ms, 20 ms
  38. Queues

  39. Queues 1k req

  40. Message Delivery photo by Jimmy Musto

  41. “There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery”

    Mathias Verraes, @mathiasverraes
    https://twitter.com/mathiasverraes/status/632260618599403520
  43. Message Delivery

    Delivery      | Pros                 | Cons                              | Example
    At Most Once  | easy to implement    | message loss                      | UDP, Kafka
    At Least Once | no message loss      | potential double processing       | Kafka, SQS, RabbitMQ
    Exactly Once* | what you really want | significant coordination overhead | ?
  46. Message Processing: At Most Once Delivery

        msg = client.get_message()
        client.commit(msg.id)
        process(msg)  # What if this fails?
  49. Message Processing: At Least Once Delivery

        msg = client.get_message()
        process(msg)
        client.commit(msg.id)  # What if this fails?
  50. Message Processing

  51. Message Processing: At Least Once Delivery

        msg = client.get_message()
        process(msg)  # This needs to be idempotent!
        client.commit(msg.id)
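
One common way to make processing idempotent under at-least-once delivery is to remember which message IDs have already been handled. The sketch below is an illustration under assumptions, not the talk's implementation: `IdempotentConsumer` is a hypothetical class, and it keeps the IDs in memory, whereas a real consumer would need a durable store updated atomically with the side effect.

```python
class IdempotentConsumer:
    """Skips messages whose ID has already been processed."""

    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()  # assumption: in-memory only, for illustration

    def handle(self, msg_id, payload):
        """Run the handler once per message ID; return True if it ran."""
        if msg_id in self.processed_ids:
            return False  # redelivery: already processed, safe to commit again
        self.handler(payload)
        self.processed_ids.add(msg_id)
        return True


shipped = []
consumer = IdempotentConsumer(shipped.append)

# The broker redelivers message 42 because the first commit was lost.
first = consumer.handle(42, "ship order 42")
second = consumer.handle(42, "ship order 42")
print(first, second, shipped)  # True False ['ship order 42']
```

The duplicate delivery is acknowledged but has no second effect, which is what the slide's comment asks of `process(msg)`.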
  53. Exactly once delivery isn’t achievable, but exactly once processing is!*

  54. Circuit Breakers photo by Israel Sundseth

  55. order_cb = CircuitBreaker(initial_score=100)

      @order_cb(score=-10)
      def send_message(msg):
          ..

  56. Redundancy *artificial numbers

  57. order_cb = CircuitBreaker(initial_score=100)

      def send_message(msg):
          if order_cb.score > 0:
              with order_cb(score=-10):
                  client.send_message(msg)
          else:
              return http_client.post(msg)
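
The slides use a `CircuitBreaker` without showing its internals. The sketch below is a guess at a minimal score-based version that supports the same `with order_cb(score=-10):` usage: it assumes the score argument is added to the breaker's score whenever the guarded block raises, and it leaves out any recovery of the score over time. The `queue_send` and `http_post` parameters are hypothetical stand-ins for the two clients.

```python
from contextlib import contextmanager


class CircuitBreaker:
    """Score-based breaker: failures lower the score; at 0 or below it is open."""

    def __init__(self, initial_score=100):
        self.score = initial_score

    @contextmanager
    def __call__(self, score=-10):
        try:
            yield
        except Exception:
            self.score += score  # assumption: apply the penalty only on failure
            raise


def send_message(msg, breaker, queue_send, http_post):
    """Use the queue while the breaker is closed, otherwise fall back to HTTP."""
    if breaker.score > 0:
        with breaker(score=-10):
            return queue_send(msg)
    return http_post(msg)


def broken_queue(msg):
    raise ConnectionError("queue unavailable")


breaker = CircuitBreaker(initial_score=20)
outcomes = []
for _ in range(3):
    try:
        outcomes.append(send_message("hi", breaker, broken_queue, lambda m: "sent via http"))
    except ConnectionError:
        outcomes.append("failed")
print(outcomes)  # ['failed', 'failed', 'sent via http']
```

After two failures the score reaches zero and later calls skip the broken queue entirely instead of waiting on it.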
  58. Redundancy *artificial numbers 1% 1%

  59. 0.01% fail rate, compared to 1% for the same connection originally

    $P_f(\text{req}) = P_f(\text{q}) \cdot P_f(\text{http})$ when the two failures are independent; in practice these events are not completely independent
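
With a fallback path, a request fails only if every path fails. A short sketch of that arithmetic, assuming the queue and HTTP paths fail independently at 1% each (the slide's artificial numbers; the function name is illustrative):

```python
def redundant_failure_rate(*path_failure_rates: float) -> float:
    """Probability that all independent redundant paths fail together."""
    rate = 1.0
    for path_rate in path_failure_rates:
        rate *= path_rate
    return rate


rate = redundant_failure_rate(0.01, 0.01)
print(f"{rate:.2%}")  # 0.01%
```

Correlated failures, such as both paths sharing a network link, push the real figure above this bound.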
  60. Summary

    ◎ Unreliable Networks: But that’s a hardware issue!
    ◎ Retry! “Try again. Fail again. Fail better.” (Samuel Beckett)
    ◎ Idempotency: It’s déjà vu all over again.
    ◎ Circuit Breakers: I promise you’re not my fallback option.
    ◎ Monitoring: You can’t fix what you can’t see.
    ◎ Queues: Not just for the Brits.
  61. Thanks! Any questions? w@reddit.com u/wting wting @_wting

  62. Presentation Design Template by Slides Carnival, photos from Unsplash (CC0 1.0). This presentation uses the following typographies and colors:

    ◎ Titles: Roboto Slab
    ◎ Body copy: Source Sans Pro
    ◎ Blue #0091ea
    ◎ Dark gray #263238
    ◎ Medium gray #607d8b
    ◎ Light gray #cfd8dc
  63. Performance Optimizations

    ◎ Retries in intermediary layers
    ◎ Multiple concurrent requests to reduce 99th percentile latency [1]
    [1] Spanner: Google’s Globally-Distributed Database
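
Sending the same request to several replicas and taking the first successful answer trims tail latency at the cost of extra load. A minimal sketch using threads, assuming the requests are idempotent and safe to duplicate; `first_response` and the two stand-in replica functions are hypothetical:

```python
import concurrent.futures
import time


def first_response(callables, timeout=None):
    """Run all callables concurrently and return the first successful result."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=len(callables))
    futures = [executor.submit(fn) for fn in callables]
    try:
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("all requests failed")
    finally:
        # Do not block on the slower requests; they finish in the background.
        executor.shutdown(wait=False)


def slow_replica():
    time.sleep(0.2)
    return "slow replica"


def fast_replica():
    time.sleep(0.01)
    return "fast replica"


winner = first_response([slow_replica, fast_replica])
print(winner)
```

The slower request still runs to completion here, so in practice this pattern is usually paired with cancellation or a delay before the second request is sent.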
  64. Distributed Systems’ Reliability photo by Martin Ezequiel