2016 - William Ting - Self-Healing Systems: The Road to 99.99% Uptime

Self-Healing Systems The Road to 99.99%

Hello! I am William Ting, reddit engineer
w at reddit.com, u/wting, wting

Self-Healing Systems photo by Elijah Hail

99.99% 00:04:23 / month, 00:52:35 / yr downtime
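The downtime figures above follow from simple arithmetic. A minimal sketch, assuming an average year of 365.25 days and a month of one twelfth of that (the function name `downtime_budget` is made up for illustration):

```python
from datetime import timedelta

def downtime_budget(availability: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability."""
    return timedelta(seconds=period.total_seconds() * (1.0 - availability))

YEAR = timedelta(days=365.25)  # assumption: average year length
MONTH = YEAR / 12              # assumption: average month length

print("per month:", downtime_budget(0.9999, MONTH))
print("per year: ", downtime_budget(0.9999, YEAR))
```

Under these assumptions the budget is roughly 263 seconds per month and 3156 seconds per year, in line with the slide's 00:04:23 and 00:52:35.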

Maslow’s Hierarchy of Reliability*: Monitoring, Incident Response, Testing + Release Eng, Product, Capacity
*Site Reliability Engineering (Beyer, Jones, Petoff, Murphy)

Transactions

“I personally believe that within the data center, network partitions very rarely happen [..]”
- Shay Banon (2010), primary author of Elasticsearch
http://elasticsearch-users.115913.n3.nabble.com/CAP-theorem-td891925.html

Is Network Reliable?

Microsoft[1]
◎ 5.2 device failures/day
◎ 40.8 link failures/day
◎ ~59k packet loss per failure

Google Chubby[2], over 700 days of operation:
◎ 4 outages due to network maintenance
◎ 2 outages due to “suspected network connectivity problems”

Jeff Dean[3], a new Google cluster’s first year typically sees:
◎ 40-80 machines have 50% packet loss
◎ 8 network maintenances
◎ 3 router failures

[1] Understanding Network Failures in Data Centers (Gill, Jain, Nagappan)
[2] The Chubby lock service for loosely-coupled distributed systems (Mike Burrows)
[3] Design Lessons and Advice from Building Large Scale Distributed Systems (Jeff Dean)

Transactions: If a single node fails, then the entire call fails.

Transactions: per-node failure rates of 1%, 1%, 1%, 1%, 1% (*artificial numbers)

4.9% fail rate. Way too high!

$P_f(\text{system}) = 1 - (1 - P_f(\text{node}))^n$
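Plugging the slide's artificial numbers into the formula above (5 nodes, 1% failure rate each) reproduces the 4.9% figure. A minimal sketch; the function name is an illustrative assumption, and the formula itself assumes node failures are independent:

```python
def system_failure_rate(node_failure_rate: float, n: int) -> float:
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - node_failure_rate) ** n

rate = system_failure_rate(0.01, 5)
print(f"{rate:.1%}")  # 4.9%
```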

Transactions: per-node latencies of 15 ms, 5 ms, 10 ms, 300 ms, 20 ms (*artificial numbers)

Retry All The Things!

    @retry(tries=sys.maxint, delay=0, backoff=0)
    def call_endpoint(..):
        ..

Idempotent Operations f(f(x)) = f(x) photo by Maico Amorim
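To make the property f(f(x)) = f(x) concrete, here is a small sketch contrasting an idempotent operation with a non-idempotent one. The order dictionary and both function names are hypothetical, chosen only for illustration:

```python
def set_status_shipped(order: dict) -> dict:
    """Idempotent: applying it twice gives the same result as applying it once."""
    return {**order, "status": "shipped"}

def increment_quantity(order: dict) -> dict:
    """Not idempotent: every retry changes the result again."""
    return {**order, "quantity": order["quantity"] + 1}

order = {"id": 1, "status": "new", "quantity": 1}

once = set_status_shipped(order)
twice = set_status_shipped(once)
print(once == twice)  # True: safe to retry

print(increment_quantity(increment_quantity(order)) == increment_quantity(order))  # False
```

Only the first kind of operation is safe to retry blindly.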

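∞ Retries: every layer retries without limit (∞ ∞ ∞ ∞ ∞)

3 Retries: every layer retries 3 times (3 3 3 3 3)

3 Retries: nested retries multiply down the call chain (3, 9, 27)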
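The 3, 9, 27 progression is what happens when every layer in a call chain makes up to 3 attempts on its own. A rough sketch of that worst case, assuming each attempt at one layer triggers the full set of attempts at the next (the function name is hypothetical):

```python
def worst_case_calls(attempts_per_layer: int, depth: int) -> int:
    """Worst-case calls reaching the layer `depth` hops down when every layer above retries."""
    return attempts_per_layer ** depth

for depth in (1, 2, 3):
    print(depth, worst_case_calls(3, depth))  # 3, then 9, then 27
```

Retrying only at the top keeps the load on every lower layer bounded by the top-level attempt count.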

Retry At Top: only the top layer retries (3), so each layer below sees ≦3 attempts

Retry At Top

    @retry(tries=3, delay=0.3, backoff=0)
    def call_endpoint(..):
        ..

Retry At Top

    @retry(tries=5, delay=60, backoff=2)
    def call_endpoint(..):
        ..
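The slides do not show how `retry` is implemented. The following is one possible sketch of a decorator with the same `tries`, `delay` and `backoff` parameters, assuming `backoff` is a multiplier applied to the delay after each failure (the decorator in the talk may behave differently). The `exceptions` and `sleep` parameters are additions for illustration; `sleep` is injectable so the example runs without actually waiting.

```python
import functools
import time

def retry(tries=3, delay=0.3, backoff=2, exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call up to `tries` times, multiplying the wait by `backoff` each time."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator

# Usage: record the waits instead of sleeping so the example finishes instantly.
waits = []
calls = {"count": 0}

@retry(tries=5, delay=60, backoff=2, sleep=waits.append)
def call_endpoint():
    calls["count"] += 1
    if calls["count"] < 4:
        raise ConnectionError("transient failure")  # simulated outage
    return "ok"

result = call_endpoint()
print(result, waits)
```

With these settings the simulated endpoint succeeds on the fourth attempt after waits of 60, 120 and 240 seconds.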

How Much Time?

    @retry(tries=3, delay=0.3, backoff=0)
    def call_endpoint(..):
        client.get_orders(timeout=10)
        client.post_order(timeout=10)

Time Budget

    @retry(budget=15, delay=0.3, backoff=0)
    def call_endpoint(..):
        client.get_orders(timeout=10)
        client.post_order(timeout=10)
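A time budget caps the total time spent retrying instead of the number of attempts. The sketch below is one way this could work, not the talk's implementation: `retry_with_budget`, `FakeClock`, and the injectable `clock` and `sleep` parameters are all hypothetical names introduced here so the example can run without real waiting.

```python
import functools
import time

def retry_with_budget(budget, delay=0.3, clock=time.monotonic, sleep=time.sleep):
    """Keep retrying until another attempt would start after `budget` seconds have elapsed."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = clock() + budget
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if clock() + delay >= deadline:
                        raise  # no time left for another attempt
                    sleep(delay)
        return wrapper
    return decorator

class FakeClock:
    """Simulated clock so the example does not really wait."""
    def __init__(self):
        self.now = 0.0
    def time(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

fake = FakeClock()
attempts = []

@retry_with_budget(budget=15, delay=0.3, clock=fake.time, sleep=fake.sleep)
def call_endpoint():
    attempts.append(fake.now)
    fake.now += 10  # pretend the attempt burned a full 10 second timeout
    raise TimeoutError("upstream timed out")

gave_up = False
try:
    call_endpoint()
except TimeoutError:
    gave_up = True
print(len(attempts), gave_up)
```

This sketch only checks the budget between attempts, so the last attempt can still overrun it. A fuller version would presumably pass the remaining time down as the `timeout` of each client call.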

Retry Strategies
◎ Retry at the Top
◎ Exponential Backoff
◎ Time Budget

Queues photo by NASA

Queues: 15 ms, 5 ms, 10 ms, 300 ms, 20 ms

Queues 1k req

Message Delivery photo by Jimmy Musto

“There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery”
- Mathias Verraes (@mathiasverraes), https://twitter.com/mathiasverraes/status/632260618599403520

Message Delivery

                 Pros                   Cons                                Example
At Most Once     easy to implement      message loss                        UDP, Kafka
At Least Once    no message loss        potential double processing         Kafka, SQS, RabbitMQ
Exactly Once*    what you really want   significant coordination overhead   ?

Message Processing: At Most Once Delivery

    msg = client.get_message()
    client.commit(msg.id)
    process(msg)  # What if this fails?

Message Processing: At Least Once Delivery

    msg = client.get_message()
    process(msg)
    client.commit(msg.id)  # What if this fails?

Message Processing: At Least Once Delivery

    msg = client.get_message()
    process(msg)  # This needs to be idempotent!
    client.commit(msg.id)

Exactly once delivery isn’t achievable, but exactly once processing is!*
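One common way to approximate exactly-once processing on top of at-least-once delivery is to remember which message IDs were already handled and skip redeliveries. The sketch below illustrates that idea only; `Message`, `DedupingConsumer`, and the in-memory ID set are assumptions made for this example and are not from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    id: str
    body: str

class DedupingConsumer:
    """Skip messages whose ID was already processed (hypothetical helper)."""
    def __init__(self, handler):
        self.handler = handler
        # Assumption: a real system would keep this in durable storage.
        self.processed_ids = set()

    def handle(self, msg: Message) -> bool:
        if msg.id in self.processed_ids:
            return False  # redelivery: skip the side effects
        self.handler(msg)
        self.processed_ids.add(msg.id)
        return True

orders = []
consumer = DedupingConsumer(lambda m: orders.append(m.body))

# At-least-once delivery may hand us message "1" twice.
deliveries = [Message("1", "order A"), Message("2", "order B"), Message("1", "order A")]
results = [consumer.handle(m) for m in deliveries]
print(results, orders)
```

A crash between running the handler and recording the ID would still cause reprocessing, so in practice the two steps need to be atomic or the handler itself needs to be idempotent.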

Circuit Breakers photo by Israel Sundseth

    order_cb = CircuitBreaker(initial_score=100)

    @order_cb(score=-10)
    def send_message(msg):
        ..

Redundancy *artificial numbers

    order_cb = CircuitBreaker(initial_score=100)

    def send_message(msg):
        if order_cb.score > 0:
            with order_cb(score=-10):
                client.send_message(msg)
        else:
            return http_client.post(msg)
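The talk does not show the `CircuitBreaker` class itself. Below is a guess at how a score-based breaker with this interface might behave, assuming each failure inside the `with` block adds the (negative) `score` to the running total and the circuit counts as open once the total reaches zero. Recovery of the score over time is not modelled, and `queue_send` and `http_send` are stand-ins for the real clients. Unlike the slide, this version also falls back to HTTP when the queue call itself fails.

```python
from contextlib import contextmanager

class CircuitBreaker:
    """Hypothetical score-based breaker: failures lower the score; at 0 or below it is open."""
    def __init__(self, initial_score=100):
        self.score = initial_score

    @contextmanager
    def __call__(self, score=-10):
        try:
            yield
        except Exception:
            self.score += score  # penalize the failing dependency
            raise

order_cb = CircuitBreaker(initial_score=20)
queue_attempts = []
sent_via = []

def queue_send(msg):
    queue_attempts.append(msg)
    raise ConnectionError("queue unavailable")  # simulate a broken queue

def http_send(msg):
    sent_via.append("http")

def send_message(msg):
    if order_cb.score > 0:
        try:
            with order_cb(score=-10):
                queue_send(msg)
                sent_via.append("queue")
                return
        except ConnectionError:
            pass  # fall through to the fallback path
    http_send(msg)

for i in range(3):
    send_message(f"msg {i}")
print(order_cb.score, len(queue_attempts), sent_via)
```

After two failures the score reaches zero, so the third message skips the queue entirely and goes straight to the HTTP fallback.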

Redundancy *artificial numbers 1% 1%

0.01% fail rate, compared to 1% for the same connection originally.

$P_f(\text{req}) = P_f(\text{q}) \cap P_f(\text{http})$

In practice these events are not completely independent.
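If the two paths did fail independently, the combined failure rate would be the product of the individual rates, which is where the 0.01% comes from. A minimal sketch with the slide's artificial 1% figures (the function name is illustrative):

```python
def combined_failure_rate(*path_failure_rates: float) -> float:
    """Probability that every redundant path fails, assuming independent failures."""
    result = 1.0
    for rate in path_failure_rates:
        result *= rate
    return result

both_fail = combined_failure_rate(0.01, 0.01)
print(f"{both_fail:.2%}")  # 0.01%
```

Correlated failures, such as both paths sharing one network link, make the real number worse than this estimate.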

Summary
◎ Unreliable Networks: But that’s a hardware issue!
◎ Retry!: Try again. Fail again. Fail better. (Samuel Beckett)
◎ Idempotency: It’s déjà vu all over again.
◎ Circuit Breakers: I promise you’re not my fallback option.
◎ Monitoring: You can’t fix what you can’t see.
◎ Queues: Not just for the Brits.

Thanks! Any questions? w at reddit.com, u/wting, wting

Performance Optimizations
◎ Retries in intermediary layers
◎ Multiple concurrent requests to reduce 99th percentile latency[1]

[1] Spanner: Google’s Globally-Distributed Database
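The second bullet can be illustrated by sending the same idempotent request to several replicas and taking whichever answers first. This is only a sketch using the standard library's thread pool; `hedged_call`, `fetch`, the replica names and their latencies are all made up for the example.

```python
import concurrent.futures
import time

def hedged_call(func, replicas):
    """Send the same idempotent request to every replica and return the first success."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(func, replica) for replica in replicas]
    try:
        for future in concurrent.futures.as_completed(futures):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("all replicas failed")
    finally:
        pool.shutdown(wait=False)  # do not block on the slower replicas

LATENCY = {"replica-a": 0.3, "replica-b": 0.02}  # artificial numbers, in seconds

def fetch(replica):
    time.sleep(LATENCY[replica])  # simulate network latency
    return replica

winner = hedged_call(fetch, ["replica-a", "replica-b"])
print(winner)
```

The trade-off is extra load on the backends in exchange for lower tail latency, so this only makes sense for requests that are safe to execute more than once.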


Distributed Systems’ Reliability photo by Martin Ezequiel

More Decks by PyBay
