Resilience patterns (Server edition)

Resilience patterns (server edition) @slok69 https://slok.dev github.com/slok

HELLO! I am Xabier Larrakoetxea.
Let me be your guide in the resilience world.

Agenda
✘ Introduction
✘ Problem/solution
✘ Experiments
✘ Conclusions

Resilience
The ability to absorb or avoid damage without suffering complete failure. (Wikipedia)

Queueing theory
The mathematical study of waiting lines, or queues. (Wikipedia)

[Diagram: requests (R), Queue, Concurrency]

Little’s law
A theorem by John Little which states that the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system. (Wikipedia)

L = λ W
✘ L: inflight requests
✘ λ: arrival rate (avg)
✘ W: processing duration (avg)

[Diagram: Queue annotated with L, W and λ, where L = λ W]

Example
Our server can handle ~100 concurrent requests (due to any of these: CPU, memory, fixed concurrency, threads, file descriptors...).

50 = 100RPS * 0.5s

100 = 100RPS * 1s

140 = 700RPS * 200ms
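
The three examples above are Little's law applied to the ~100 concurrent request capacity. A minimal Go sketch of that arithmetic follows; the function name `inflight` and the `capacity` constant are illustrative assumptions, not part of the talk or of goresilience.

```go
package main

import "fmt"

// inflight applies Little's law, L = λ * W.
// rps is the average arrival rate (requests per second) and
// avgSeconds is the average time a request spends in the server.
func inflight(rps, avgSeconds float64) float64 {
	return rps * avgSeconds
}

func main() {
	// Assumed capacity, taken from the "~100 concurrent requests" example.
	const capacity = 100.0

	cases := []struct{ rps, seconds float64 }{
		{100, 0.5}, // 50 inflight: below capacity
		{100, 1},   // 100 inflight: at capacity
		{700, 0.2}, // 140 inflight: above capacity, requests start to queue
	}
	for _, c := range cases {
		l := inflight(c.rps, c.seconds)
		fmt.Printf("%.0f RPS * %.1fs = %.0f inflight (over capacity: %t)\n",
			c.rps, c.seconds, l, l > capacity)
	}
}
```

Only the third case exceeds the assumed capacity, which is where queueing (and its problems) begins.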

[Diagram: Processing and Queueing around the point of optimal usage. Queueing is... good, is... bad]

Problems of bad queueing
✘ Latency increase
✘ Collapse
✘ Cascading failure
✘ Crash
✘ ...

You can’t control your clients, protect your server

Solutions
✘ Autoscaling
✘ Rate limits
✘ Caches
✘ Load balancing
✘ Load shedding
✘ ...

Load shedding
A technique used in information systems, especially web services, to avoid overloading the system and making it unavailable for all users. The idea is to ignore some requests rather than crashing a system and making it fail to serve any request. (Wikipedia)
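
To make the idea concrete, here is a minimal Go sketch of load shedding as HTTP middleware. It is an illustration under assumptions, not the goresilience API: the `Shedder` type, its methods and the choice of a 429 response are all hypothetical names and decisions made for this example.

```go
package main

import (
	"fmt"
	"net/http"
)

// Shedder refuses new work once a fixed number of requests are in flight.
// (Hypothetical type for illustration.)
type Shedder struct {
	slots chan struct{} // one token per in-flight request
}

func NewShedder(limit int) *Shedder {
	return &Shedder{slots: make(chan struct{}, limit)}
}

// TryAcquire takes a slot without blocking; false means "shed this request".
func (s *Shedder) TryAcquire() bool {
	select {
	case s.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot taken by a successful TryAcquire.
func (s *Shedder) Release() { <-s.slots }

// Middleware rejects requests with 429 instead of letting them pile up.
func (s *Shedder) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !s.TryAcquire() {
			http.Error(w, "server overloaded", http.StatusTooManyRequests)
			return
		}
		defer s.Release()
		next.ServeHTTP(w, r)
	})
}

func main() {
	s := NewShedder(2)
	// With a limit of 2, the third concurrent acquisition is shed.
	fmt.Println(s.TryAcquire(), s.TryAcquire(), s.TryAcquire())
}
```

Rejecting immediately keeps the rejected request cheap; the patterns in the experiments differ mostly in how they decide when to reject.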

Resilience patterns

The tests
✘ Handler with random latency (250-600ms)
✘ 15-25RPS is the regular capacity
✘ Test1: 15RPS for 3m (with a 50RPS spike for 1m in the middle)
✘ Test2: 60RPS for 15m
✘ Limited to 0.1 CPU and 50MB of memory
✘ Goresilience library (github.com/slok/goresilience)
The test demo can be found at https://github.com/slok/resilience-demo

Naked
Server without protection.

Exp1: Naked server
✘ Accepts everything.
✘ No protection against spikes/bursts.
✘ The least recommended one.
✘ The most used one.

15 RPS with middle spike (50RPS)
✘ Success: 61%
✘ P95: 15.8s
✘ P99: 24.7s
Graph annotation: unresponsive on spike

60 RPS (15m)
✘ Success: 9.62%
✘ P95: 24s
✘ P99: 29s (Max)
Graph annotations: max latencies, 503, OOM

Load    Pattern    Success   P95     P99
15RPS   Naked      61%       15.8s   24.7s
60RPS   Naked      9.62%     24s     29s

Good
✘ Simple
Bad
✘ Cascading failure (latency)
✘ Crash (OOM)

Bulkhead
Bulkhead for concurrency control

Bulkhead pattern
Isolate elements of an application into pools so that if one fails, the others will continue to function.
[Diagram: App (without bulkhead) calling Service A and Service B; App (with bulkheads) calling Service A and Service B]

Exp2: Bulkhead
✘ The bulkhead limits the number of requests handled concurrently.
✘ A queue timeout drops requests that have been queued for too long.
✘ Needs to be configured.
✘ Will protect us from bursts/spikes.
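
The two knobs in this experiment (number of workers and queue timeout) can be sketched in a few lines of Go. This is a simplified illustration and not the goresilience implementation; `Bulkhead`, `NewBulkhead`, `Run` and `ErrQueueTimeout` are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrQueueTimeout is returned when a call waited too long for a free worker.
var ErrQueueTimeout = errors.New("bulkhead: timed out waiting in queue")

// Bulkhead limits concurrent executions and bounds the time spent queued.
// (Hypothetical type for illustration.)
type Bulkhead struct {
	workers chan struct{} // one token per concurrent execution
	maxWait time.Duration // queue timeout
}

func NewBulkhead(workers int, maxWait time.Duration) *Bulkhead {
	return &Bulkhead{workers: make(chan struct{}, workers), maxWait: maxWait}
}

// Run executes f if a worker slot frees up within maxWait,
// otherwise it sheds the call with ErrQueueTimeout.
func (b *Bulkhead) Run(f func() error) error {
	timer := time.NewTimer(b.maxWait)
	defer timer.Stop()

	select {
	case b.workers <- struct{}{}: // got a worker slot
		defer func() { <-b.workers }()
		return f()
	case <-timer.C: // queued for too long
		return ErrQueueTimeout
	}
}

func main() {
	b := NewBulkhead(1, 50*time.Millisecond)

	busy := make(chan struct{})
	done := make(chan struct{})
	// Occupy the only worker until done is closed.
	go b.Run(func() error { close(busy); <-done; return nil })
	<-busy

	// The second call cannot get a worker and is dropped after 50ms.
	fmt.Println(b.Run(func() error { return nil }))
	close(done)
}
```

Both values are static here, which is the drawback this experiment runs into: they have to be found with load tests.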

15 RPS with middle spike (50RPS)
✘ Success: 83%
✘ P95: 6.5s
✘ P99: 8.5s

60 RPS (15m)
✘ Success: 38.75%
✘ P95: 9.3s
✘ P99: 12.2s
Graph annotations: P90 < 10s, controlled failures (429)

Load    Pattern    Success   P95     P99
15RPS   Naked      61%       15.8s   24.7s
15RPS   Bulkhead   83%       6.5s    8.5s
60RPS   Naked      9.62%     24s     29s
60RPS   Bulkhead   38.75%    9.3s    12.2s

Good
✘ Simple
✘ Load shedding
Bad
✘ Static configuration (requires load tests)
  ✗ Timeouts
  ✗ Number of workers (concurrency)
✘ Wrong configuration could waste capacity or overload the server

Bulkhead + circuit breaker
Bulkhead for concurrency control and circuit breaker to control the constant overload

Circuit breaker pattern
Used to detect failures and to encapsulate the logic of preventing a failure from constantly recurring during maintenance, temporary external system failure or unexpected system difficulties.
States:
✘ Closed: regular flow. Moves to Open when the error limit is exceeded.
✘ Open: fail fast. Moves to Half open after a timeout.
✘ Half open: regular flow. Moves to Closed if the tests succeed, or back to Open if the tests fail.
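
The three states can be sketched as a small Go state machine. This is a simplified illustration, not the goresilience or Hystrix implementation: it counts consecutive failures, and while half open it lets every call through as a test. All names (`Breaker`, `ErrOpen`, `ErrBackend`, and so on) are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type State int

const (
	Closed   State = iota // regular flow
	Open                  // fail fast
	HalfOpen              // regular flow, used as a test
)

var (
	ErrOpen    = errors.New("circuit open: failing fast")
	ErrBackend = errors.New("backend failure") // sample error for the demo
)

// Breaker is a hypothetical, simplified circuit breaker.
type Breaker struct {
	mu        sync.Mutex
	state     State
	failures  int           // consecutive failures
	maxErrors int           // error limit before opening
	openFor   time.Duration // time to stay open before testing again
	openedAt  time.Time
}

func NewBreaker(maxErrors int, openFor time.Duration) *Breaker {
	return &Breaker{maxErrors: maxErrors, openFor: openFor}
}

func (b *Breaker) State() State {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state
}

// Run executes f unless the circuit is open, and records the result.
func (b *Breaker) Run(f func() error) error {
	b.mu.Lock()
	if b.state == Open {
		if time.Since(b.openedAt) < b.openFor {
			b.mu.Unlock()
			return ErrOpen // fail fast, f is never called
		}
		b.state = HalfOpen // timeout elapsed: let a test call through
	}
	b.mu.Unlock()

	err := f()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.state, b.failures = Closed, 0 // tests succeeded
		return nil
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.maxErrors {
		b.state, b.openedAt = Open, time.Now() // limit exceeded or test failed
	}
	return err
}

func main() {
	b := NewBreaker(2, time.Second)
	fail := func() error { return ErrBackend }
	fmt.Println(b.Run(fail)) // backend failure
	fmt.Println(b.Run(fail)) // backend failure, error limit reached
	fmt.Println(b.Run(fail)) // circuit open: fail is not called
}
```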

Exp3: Bulkhead + circuit breaker
✘ The bulkhead limits the number of requests handled concurrently.
✘ The circuit breaker releases the load (fast).
✘ Hystrix-style pattern.
✘ Needs to be configured.
✘ Will protect us from bursts/spikes.
✘ The circuit breaker wraps the bulkhead.

15 RPS with middle spike (50RPS)
✘ Success: 76.96%
✘ P95: 5.2s
✘ P99: 7.5s

60 RPS (15m)
✘ Success: 30%
✘ P95: 6.5s
✘ P99: 8s
Graph annotations: less latency than plain bulkhead, circuit breaker opened (fails in spikes)

Load    Pattern    Success   P95     P99
15RPS   Naked      61%       15.8s   24.7s
15RPS   Bulkhead   83%       6.5s    8.5s
15RPS   Bk + CB    76.96%    5.2s    7.5s
60RPS   Naked      9.62%     24s     29s
60RPS   Bulkhead   38.75%    9.3s    12.2s
60RPS   Bk + CB    30%       6.5s    8s

Good
✘ Load shedding
✘ Recovers faster than bulkhead
Bad
✘ Static configuration (requires load tests)
✘ Fails requests even though the server is OK

Adaptive resilience patterns

LIFO + CoDel (Facebook)
Unfair request handling and aggressive timeouts on congestion
Explanation and algorithm: https://queue.acm.org/detail.cfm?id=2839461
Original CoDel: https://queue.acm.org/detail.cfm?id=2209336
Airbnb also uses it: https://medium.com/airbnb-engineering/building-services-at-airbnb-part-3-ac6d4972fc2d

CoDel (Controlled delay)
We have 2 kinds of in-queue timeouts:
✘ Regular (interval): 100ms by default
✘ Aggressive (target): 5ms by default
By default, queued requests get the interval timeout. We track when the queue was last empty. If the time since then is greater than the interval, congestion is detected. While congested, queued requests get the target timeout.

CoDel (Controlled delay)
// Enqueue request.
if (queue.timeSinceEmpty() > interval) {
    // Congestion.
    timeout = target
} else {
    timeout = interval
}
queue.enqueue(req, timeout)
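
The pseudocode above translates almost directly to Go. The sketch below only covers the timeout selection, using the 100ms and 5ms defaults from the previous slide; the `CoDel` type and its method names are assumptions made for this example, not the goresilience API.

```go
package main

import (
	"fmt"
	"time"
)

// CoDel holds the two in-queue timeouts. (Hypothetical type for illustration.)
type CoDel struct {
	Interval time.Duration // regular timeout
	Target   time.Duration // aggressive timeout used on congestion
}

// DefaultCoDel uses the defaults mentioned in the slides.
func DefaultCoDel() CoDel {
	return CoDel{Interval: 100 * time.Millisecond, Target: 5 * time.Millisecond}
}

// Congested reports whether the queue has gone longer than Interval
// without being empty.
func (c CoDel) Congested(sinceEmpty time.Duration) bool {
	return sinceEmpty > c.Interval
}

// QueueTimeout picks the timeout a newly enqueued request should get.
func (c CoDel) QueueTimeout(sinceEmpty time.Duration) time.Duration {
	if c.Congested(sinceEmpty) {
		return c.Target
	}
	return c.Interval
}

func main() {
	c := DefaultCoDel()
	fmt.Println(c.QueueTimeout(20 * time.Millisecond))  // 100ms
	fmt.Println(c.QueueTimeout(300 * time.Millisecond)) // 5ms
}
```

A real queue would also have to record the moment it was last empty and enforce the chosen timeout on each queued request.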

Adaptive LIFO
We have 2 kinds of in-queue priorities:
✘ FIFO: in regular mode, first in, first out.
✘ LIFO: in congestion mode, last in, first out (unfair).
When CoDel detects congestion, it changes the dequeue priority so that the most recent requests are served first (CoDel will clean up old queued requests). The algorithm assumes that the clients of long-queued requests have already given up, so newer requests are more likely to be served successfully.
Dropbox’s proxy uses adaptive LIFO: https://blogs.dropbox.com/tech/2018/03/meet-bandaid-the-dropbox-service-proxy
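
The switch in dequeue order can be sketched with a slice-backed queue in Go. This is a minimal, single-goroutine illustration with hypothetical names (`Queue`, `Push`, `Pop`); a real implementation would need locking and would take the congestion signal from CoDel.

```go
package main

import "fmt"

// Queue holds request IDs in arrival order. (Hypothetical type for illustration.)
type Queue struct {
	items []string
}

// Push enqueues a request at the back.
func (q *Queue) Push(id string) { q.items = append(q.items, id) }

// Pop dequeues the oldest request in regular mode (FIFO) and the newest
// one when congested (LIFO). The bool is false when the queue is empty.
func (q *Queue) Pop(congested bool) (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	if congested {
		last := len(q.items) - 1
		id := q.items[last]
		q.items = q.items[:last]
		return id, true
	}
	id := q.items[0]
	q.items = q.items[1:]
	return id, true
}

func main() {
	q := &Queue{}
	for _, id := range []string{"r1", "r2", "r3", "r4"} {
		q.Push(id)
	}
	first, _ := q.Pop(false) // regular mode: oldest request
	next, _ := q.Pop(true)   // congestion: newest request
	fmt.Println(first, next) // r1 r4
}
```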

Adaptive LIFO
✘ Low load: first in, first out
✘ High load: last in, first out (CoDel will drain the queue if congestion persists)

[Diagram: queue input (In) with no congestion and with congestion]

Exp4: Adaptive LIFO + CoDel
✘ Based on the CoDel algorithm from network congestion control, adapted by Facebook.
✘ Dynamic timeout and queue priority (adapts and changes policies on congestion).
✘ No configuration required (almost; safe defaults).
✘ Very aggressive timeouts on congestion.


15 RPS with middle spike (50RPS)
✘ Success: 78%
✘ P95: 3.1s
✘ P99: 5.4s
Graph annotation: triggered congestion

60 RPS (15m)
✘ Success: 30%
✘ P95: 6s
✘ P99: 8s

Load    Pattern    Success   P95     P99
15RPS   Naked      61%       15.8s   24.7s
15RPS   Bulkhead   83%       6.5s    8.5s
15RPS   Bk + CB    76.96%    5.2s    7.5s
15RPS   CoDel      78%       3.1s    5.4s
60RPS   Naked      9.62%     24s     29s
60RPS   Bulkhead   38.75%    9.3s    12.2s
60RPS   Bk + CB    30%       6.5s    8s
60RPS   CoDel      30%       6s      8s

Good
✘ Load shedding.
✘ Recovers faster than circuit breaker.
✘ Unfair serving (serve new ones, leave old ones).
✘ No configuration.
Bad
✘ Static concurrency config (doesn’t affect much).

Concurrency limits (Netflix)
Adapt concurrency
Explanation and algorithms: https://medium.com/@NetflixTechBlog/performance-under-load-3e6fa9a60581
concurrency-limits: https://github.com/Netflix/concurrency-limits

Concurrency limits
Implements and integrates concepts from TCP congestion control to auto-detect concurrency limits for services in order to achieve optimal throughput with optimal latency.
[Graph: concurrent requests over time, showing the initial limit, the discovered limit and the real limit]

Exp5: Adaptive concurrency (limits)
✘ Different algorithms (AIMD in this example, but there are more, like Vegas, Gradient...).
✘ Adaptive concurrency.
✘ Static queue timeout and priority.
✘ No configuration required.
✘ Adapts based on execution results (errors and latency).
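
AIMD (additive increase, multiplicative decrease) is the simplest of these algorithms: grow the limit by one on success, cut it by a factor on a failed or slow execution. The Go sketch below is a minimal illustration with hypothetical names and parameters; it is not the Netflix concurrency-limits or goresilience implementation, both of which handle latency measurement and more edge cases.

```go
package main

import "fmt"

// AIMD adapts a concurrency limit. (Hypothetical type for illustration.)
type AIMD struct {
	limit    int
	minLimit int
	maxLimit int
	backoff  float64 // multiplicative decrease factor, e.g. 0.5
}

func NewAIMD(initial, minLimit, maxLimit int, backoff float64) *AIMD {
	return &AIMD{limit: initial, minLimit: minLimit, maxLimit: maxLimit, backoff: backoff}
}

// Limit returns the current concurrency limit.
func (a *AIMD) Limit() int { return a.limit }

// OnSuccess is the additive increase: one more concurrent request allowed.
func (a *AIMD) OnSuccess() {
	if a.limit < a.maxLimit {
		a.limit++
	}
}

// OnDrop is the multiplicative decrease, applied on errors or timeouts.
func (a *AIMD) OnDrop() {
	a.limit = int(float64(a.limit) * a.backoff)
	if a.limit < a.minLimit {
		a.limit = a.minLimit
	}
}

func main() {
	l := NewAIMD(10, 1, 100, 0.5)
	for i := 0; i < 5; i++ {
		l.OnSuccess()
	}
	fmt.Println(l.Limit()) // 15
	l.OnDrop()
	fmt.Println(l.Limit()) // 7
}
```

The limit climbs slowly and drops quickly, which makes the discovered limit oscillate around the real one.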


15 RPS with middle spike (50RPS)
✘ Success: 82%
✘ P95: 4s
✘ P99: 6s

60 RPS (15m)
✘ Success: 36%
✘ P95: 4.8s
✘ P99: 6.5s

Load    Pattern    Success   P95     P99
15RPS   Naked      61%       15.8s   24.7s
15RPS   Bulkhead   83%       6.5s    8.5s
15RPS   Bk + CB    76.96%    5.2s    7.5s
15RPS   CoDel      78%       3.1s    5.4s
15RPS   CL         82%       4s      6s
60RPS   Naked      9.62%     24s     29s
60RPS   Bulkhead   38.75%    9.3s    12.2s
60RPS   Bk + CB    30%       6.5s    8s
60RPS   CoDel      30%       6s      8s
60RPS   CL         36%       4.8s    6.5s

Good
✘ Load shedding.
✘ Recovers faster.
✘ No configuration.
✘ Adapts to any environment (hardware, autoscaling, noisy neighbor...).
Bad
✘ Depending on the load and the algorithm, it can be slow to adapt (or not adapt at all).

Conclusions

Conclusions
✘ There is no winner; it depends on the app and context.
✘ There is a loser: no protection (naked server).
✘ Adaptive algorithms add complexity (use libraries like goresilience) but are better for dynamic environments like cloud native.
✘ A bulkhead or circuit breaker can be enough.
✘ You can use a front proxy (or the sidecar pattern).
✘ Don’t trust your clients, protect yourself.

THANKS! Any questions?
You can find me at
✘ @slok69
✘ https://slok.dev
✘ github.com/slok
http://bit.ly/resilience-form
