Slide 1

Resilience patterns (server edition) @slok69 https://slok.dev github.com/slok

Slide 2

HELLO! I am Xabier Larrakoetxea. Let me be your guide in the resilience world

Slide 3

Agenda ✘ Introduction ✘ Problem/solution ✘ Experiments ✘ Conclusions

Slide 4

Resilience: the ability to absorb or avoid damage without suffering complete failure. (Wikipedia)

Slide 5

Queueing theory: the mathematical study of waiting lines. (Wikipedia)

Slide 8

(Diagram: requests (R) flow through a queue into the concurrency pool.)

Slide 9

Little’s law: a theorem by John Little which states that the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system. (Wikipedia)

Slide 11

L = λW, where L is the average number of inflight requests, λ the average arrival rate, and W the average processing duration.

Slide 12

(Diagram: the queue and concurrency pool from before, annotated with Little's law: L is the inflight requests, λ the arrival rate, W the processing duration, and L = λW.)

Slide 13

Example: our server can handle ~100 concurrent requests (limited by any of these: CPU, memory, fixed concurrency, threads, file descriptors…).

Slide 14

50 = 100RPS * 0.5s

Slide 15

100 = 100RPS * 1s

Slide 16

140 = 700RPS * 200ms
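The three calculations above can be checked with a couple of lines of Go; `inflight` is an illustrative helper name, not part of any library:

```go
package main

import "fmt"

// inflight applies Little's law, L = λ * W: the average number of
// in-flight requests equals the average arrival rate multiplied by
// the average processing duration.
func inflight(arrivalRPS, durationSec float64) float64 {
	return arrivalRPS * durationSec
}

func main() {
	fmt.Println(inflight(100, 0.5)) // 100 RPS * 0.5s  -> 50 in flight
	fmt.Println(inflight(100, 1))   // 100 RPS * 1s    -> 100 in flight
	fmt.Println(inflight(700, 0.2)) // 700 RPS * 200ms -> 140 in flight
}
```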

Slide 17

(Chart: queueing vs. processing. Some queueing is good: it means optimal usage of capacity. Too much queueing is bad.)

Slide 18

Problems of bad queueing ✘ Latency increase ✘ Collapse ✘ Cascading failure ✘ Crash ✘ ...

Slide 19

You can’t control your clients; protect your server

Slide 20

Solutions ✘ Autoscaling ✘ Rate limits ✘ Caches ✘ Load balancing ✘ Load shedding ✘ ...

Slide 21

Load shedding: a technique used in information systems, especially web services, to avoid overloading the system and making it unavailable for all users. The idea is to ignore some requests rather than crashing the system and making it fail to serve any request. (Wikipedia)

Slide 23

Resilience patterns

Slide 24

The tests ✘ Handler with random latency (250-600ms) ✘ 15-25 RPS is the regular capacity ✘ Test 1: 15 RPS for 3m (with a 50 RPS spike for 1m in the middle) ✘ Test 2: 60 RPS for 15m ✘ Limits of 0.1 CPU and 50MB of memory ✘ Goresilience library (github.com/slok/goresilience) The test demos can be found at https://github.com/slok/resilience-demo

Slide 25

Naked: a server without protection.

Slide 26

Exp1: Naked server ✘ Accepts everything. ✘ No protection against spikes/bursts. ✘ The least recommended one. ✘ The most used one.

Slide 27

15 RPS with middle spike (50 RPS) ✘ Success: 61% ✘ P95: 15.8s ✘ P99: 24.7s (Chart annotation: unresponsive during the spike)

Slide 28

60 RPS (15m) ✘ Success: 9.62% ✘ P95: 24s ✘ P99: 29s (max) (Chart annotations: max latencies, 503s, OOM)

Slide 29

Naked
15RPS: Success 61%, P95 15.8s, P99 24.7s
60RPS: Success 9.62%, P95 24s, P99 29s

Slide 30

Good ✘ Simple Bad ✘ Cascading failure (latency) ✘ Crash (OOM)

Slide 31

Bulkhead Bulkhead for concurrency control

Slide 33

Bulkhead pattern: isolate elements of an application into pools so that if one fails, the others will continue to function. (Diagram: an app without bulkheads calling Service A and Service B through one shared pool, vs. an app with a separate bulkhead per service.)

Slide 35

Exp2: Bulkhead ✘ The bulkhead will limit the number of concurrently handled requests. ✘ A queue timeout will clean out long-queued requests. ✘ Needs to be configured. ✘ Will protect us from bursts/spikes.

Slide 36

15 RPS with middle spike (50 RPS) ✘ Success: 83% ✘ P95: 6.5s ✘ P99: 8.5s

Slide 37

60 RPS (15m) ✘ Success: 38.75% ✘ P95: 9.3s ✘ P99: 12.2s (Chart annotations: P90 < 10s, controlled failures (429))

Slide 38

Naked
15RPS: Success 61%, P95 15.8s, P99 24.7s
60RPS: Success 9.62%, P95 24s, P99 29s
Bulkhead
15RPS: Success 83%, P95 6.5s, P99 8.5s
60RPS: Success 38.75%, P95 9.3s, P99 12.2s

Slide 39

Good ✘ Simple ✘ Load shedding Bad ✘ Static configuration (requires load tests): ✗ timeouts ✗ number of workers (concurrency) ✘ A wrong configuration could waste capacity or overload the server

Slide 40

Bulkhead + circuit breaker Bulkhead for concurrency control and circuit breaker to control the constant overload

Slide 41

Circuit breaker pattern: used to detect failures and encapsulate the logic of preventing a failure from constantly recurring during maintenance, temporary external system failure, or unexpected system difficulties. States: Closed (regular flow) → Open (fail fast) when the error limit is exceeded; Open → Half open (regular flow for test requests) after a timeout; Half open → Closed when the tests succeed, or back to Open when they fail.

Slide 42

Exp3: Bulkhead + circuit breaker ✘ The bulkhead will limit the number of concurrently handled requests. ✘ The circuit breaker will release the load (fast). ✘ Hystrix-style pattern. ✘ Needs to be configured. ✘ Will protect us from bursts/spikes. ✘ The circuit breaker wraps the bulkhead.

Slide 43

15 RPS with middle spike (50 RPS) ✘ Success: 76.96% ✘ P95: 5.2s ✘ P99: 7.5s

Slide 44

60 RPS (15m) ✘ Success: 30% ✘ P95: 6.5s ✘ P99: 8s (Chart annotations: less latency than the plain bulkhead; the circuit breaker opened, failing fast on spikes)

Slide 45

Naked
15RPS: Success 61%, P95 15.8s, P99 24.7s
60RPS: Success 9.62%, P95 24s, P99 29s
Bulkhead
15RPS: Success 83%, P95 6.5s, P99 8.5s
60RPS: Success 38.75%, P95 9.3s, P99 12.2s
Bulkhead + circuit breaker
15RPS: Success 76.96%, P95 5.2s, P99 7.5s
60RPS: Success 30%, P95 6.5s, P99 8s

Slide 46

Good ✘ Load shedding ✘ Recovers faster than the bulkhead Bad ✘ Static configuration (requires load tests) ✘ Fails requests even though the server is OK

Slide 47

Adaptive resilience patterns

Slide 49

LIFO + CoDel (Facebook) Unfair request handling and aggressive timeouts on congestion Explanation and algorithm: https://queue.acm.org/detail.cfm?id=2839461 Original CoDel: https://queue.acm.org/detail.cfm?id=2209336 Airbnb also uses it: https://medium.com/airbnb-engineering/building-services-at-airbnb-part-3-ac6d4972fc2d

Slide 50

CoDel (Controlled delay) We have two kinds of in-queue timeouts: ✘ Regular (interval): 100ms by default ✘ Aggressive (target): 5ms by default By default, requests get the interval timeout on the queue. Measure when the queue was last empty; if the duration since then is greater than the interval, congestion is detected. When congested, requests get the target timeout on the queue.

Slide 51

CoDel (Controlled delay)
// Enqueue request.
if (queue.timeSinceEmpty() > interval) {
    // Congestion detected: use the aggressive timeout.
    timeout = target
} else {
    timeout = interval
}
queue.enqueue(req, timeout)

Slide 52

Adaptive LIFO We have two kinds of in-queue priorities: ✘ FIFO: in regular mode, first in first out. ✘ LIFO: in congestion mode, last in first out (unfair). When CoDel detects congestion it changes the dequeue priority so that the last requests in are served first (CoDel will clean out the old queued requests). The algorithm assumes that the clients of long-delayed queued requests are already gone, and that the new ones have a higher probability of being served successfully. Dropbox’s proxy uses adaptive LIFO: https://blogs.dropbox.com/tech/2018/03/meet-bandaid-the-dropbox-service-proxy
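The FIFO/LIFO switch can be sketched as a dequeue policy. This is an illustrative sketch (`next` is a made-up helper), not the Facebook or Dropbox implementation:

```go
package main

import "fmt"

// next picks which queued request to serve: FIFO under normal load,
// LIFO (unfair) under congestion, so the freshest request, whose client
// is most likely still waiting, gets served first.
func next(queue []string, congested bool) (string, []string) {
	if congested {
		last := len(queue) - 1
		return queue[last], queue[:last] // LIFO: newest first
	}
	return queue[0], queue[1:] // FIFO: oldest first
}

func main() {
	q := []string{"req1", "req2", "req3"} // req1 was queued first
	served, _ := next(q, false)
	fmt.Println(served) // req1: normal FIFO order
	served, _ = next(q, true)
	fmt.Println(served) // req3: congestion flips the queue to LIFO
}
```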

Slide 53

Adaptive LIFO

Slide 54

Adaptive LIFO (Diagram: under low load, first in first out; under high load, last in first out, and CoDel will drain the queue if congestion persists.)

Slide 55

(Diagram: no congestion; incoming requests (R) pass through a short queue into the concurrency pool.)

Slide 56

(Diagram: congestion; the queue in front of the concurrency pool fills up with requests (R).)

Slide 57

Exp4: Adaptive LIFO + CoDel ✘ Based on the CoDel TCP congestion-control algorithm, as adapted by Facebook. ✘ Dynamic timeout and queue priority (will adapt and change policies on congestion). ✘ Almost no configuration required (safe defaults). ✘ Very aggressive timeouts on congestion.

Slide 58

(Diagram: requests (R) in the queue and concurrency pool.)

Slide 59

15 RPS with middle spike (50 RPS) ✘ Success: 78% ✘ P95: 3.1s ✘ P99: 5.4s (Chart annotation: the spike triggered congestion)

Slide 60

60 RPS (15m) ✘ Success: 30% ✘ P95: 6s ✘ P99: 8s

Slide 61

Naked
15RPS: Success 61%, P95 15.8s, P99 24.7s
60RPS: Success 9.62%, P95 24s, P99 29s
Bulkhead
15RPS: Success 83%, P95 6.5s, P99 8.5s
60RPS: Success 38.75%, P95 9.3s, P99 12.2s
Bulkhead + circuit breaker
15RPS: Success 76.96%, P95 5.2s, P99 7.5s
60RPS: Success 30%, P95 6.5s, P99 8s
Adaptive LIFO + CoDel
15RPS: Success 78%, P95 3.1s, P99 5.4s
60RPS: Success 30%, P95 6s, P99 8s

Slide 62

Good ✘ Load shedding. ✘ Recovers faster than the circuit breaker. ✘ Unfair serving (serves the new requests, drops the old ones). ✘ No configuration Bad ✘ Static concurrency config (doesn’t affect much).

Slide 63

Concurrency limits (Netflix) Adapt concurrency Explanation and algorithms: https://medium.com/@NetflixTechBlog/performance-under-load-3e6fa9a60581 Concurrency-limits: https://github.com/Netflix/concurrency-limits

Slide 64

Concurrency limits: implements and integrates concepts from TCP congestion control to auto-detect concurrency limits for services in order to achieve optimal throughput with optimal latency. (Chart: concurrent requests over time; starting from an initial limit, the discovered limit converges to the real limit.)

Slide 65

Exp5: Adaptive concurrency (limits) ✘ Different algorithms (this example uses AIMD, but there are more, like Vegas, Gradient...). ✘ Adaptive concurrency. ✘ Static queue timeout and priority. ✘ No configuration required. ✘ Adapts based on execution results (errors and latency).
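An AIMD limit update can be sketched in a few lines of Go. The increase step and the backoff factor here are illustrative choices, not the defaults of Netflix's concurrency-limits library:

```go
package main

import "fmt"

// aimdLimit adjusts a concurrency limit the AIMD way: grow it
// additively while requests succeed, cut it multiplicatively (here by
// half) on errors or timeouts, never dropping below one slot.
func aimdLimit(limit float64, ok bool) float64 {
	if ok {
		return limit + 1 // additive increase: probe for more capacity
	}
	l := limit * 0.5 // multiplicative decrease: back off quickly
	if l < 1 {
		l = 1
	}
	return l
}

func main() {
	limit := 10.0
	limit = aimdLimit(limit, true) // a success grows the limit
	fmt.Println(limit)             // 11
	limit = aimdLimit(limit, false) // a failure halves it
	fmt.Println(limit)              // 5.5
}
```

The asymmetry is the point: capacity is discovered slowly but released fast, so the limit tracks the real capacity of the environment.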

Slide 66

(Diagram: requests (R) in the queue and concurrency pool.)

Slide 67

15 RPS with middle spike (50 RPS) ✘ Success: 82% ✘ P95: 4s ✘ P99: 6s

Slide 68

60 RPS (15m) ✘ Success: 36% ✘ P95: 4.8s ✘ P99: 6.5s

Slide 69

60 RPS (15m)

Slide 70

Naked
15RPS: Success 61%, P95 15.8s, P99 24.7s
60RPS: Success 9.62%, P95 24s, P99 29s
Bulkhead
15RPS: Success 83%, P95 6.5s, P99 8.5s
60RPS: Success 38.75%, P95 9.3s, P99 12.2s
Bulkhead + circuit breaker
15RPS: Success 76.96%, P95 5.2s, P99 7.5s
60RPS: Success 30%, P95 6.5s, P99 8s
Adaptive LIFO + CoDel
15RPS: Success 78%, P95 3.1s, P99 5.4s
60RPS: Success 30%, P95 6s, P99 8s
Concurrency limits
15RPS: Success 82%, P95 4s, P99 6s
60RPS: Success 36%, P95 4.8s, P99 6.5s

Slide 71

Good ✘ Load shedding. ✘ Recovers faster. ✘ No configuration. ✘ Adapts to any environment (hardware, autoscaling, noisy neighbors...) Bad ✘ Depending on the load and the algorithm, it can be slow to adapt (or fail to adapt).

Slide 72

Conclusions

Slide 73

Conclusions ✘ There is no winner; it depends on the app and context. ✘ There is a loser: no protection (the naked server). ✘ Adaptive algorithms add complexity (use libraries like goresilience) but are better for dynamic environments like cloud native. ✘ A bulkhead or circuit breaker can be enough. ✘ You can use a front proxy (or the sidecar pattern). ✘ Don’t trust your clients; protect yourself.

Slide 74

THANKS! Any questions? You can find me at ✘ @slok69 ✘ https://slok.dev ✘ github.com/slok http://bit.ly/resilience-form