Resilience
It is the ability to absorb or avoid damage without
suffering complete failure.
Wikipedia
Slide 5
Queueing theory
Is the mathematical study of waiting lines
Wikipedia
Slide 6
No content
Slide 7
No content
Slide 8
[Diagram: incoming requests (R) wait in a queue before being handled with limited concurrency]
Slide 9
Little’s law
Is a theorem by John Little which states that the
long-term average number L of customers in a
stationary system is equal to the long-term average
effective arrival rate λ multiplied by the average
time W that a customer spends in the system
Wikipedia
Slide 10
Slide 11
L = λ W
L: in-flight requests (avg)
λ: arrival rate (avg)
W: processing duration (avg)
Slide 12
[Diagram: the queue and concurrent processing annotated with Little's law terms: L (in flight), λ (arrival rate), W (time in system); L = λ W]
Slide 13
Our server can handle ~100 concurrent requests
(due to any of these: CPU, memory, fixed
concurrency, threads, file descriptors…)
Example
Slide 14
50 = 100RPS * 0.5s
Slide 15
100 = 100RPS * 1s
Slide 16
140 = 700RPS * 200ms
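To double-check these numbers, here is a tiny Go sketch (illustrative only) that applies Little's law to the three cases above:

```go
package main

import "fmt"

// littlesLaw returns L, the average number of in-flight requests,
// given the average arrival rate (requests per second) and the
// average time a request spends in the system (seconds): L = λW.
func littlesLaw(arrivalRPS, durationSeconds float64) float64 {
	return arrivalRPS * durationSeconds
}

func main() {
	fmt.Println(littlesLaw(100, 0.5)) // 50  in flight
	fmt.Println(littlesLaw(100, 1))   // 100 in flight
	fmt.Println(littlesLaw(700, 0.2)) // 140 in flight
}
```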
Slide 17
[Diagram: queueing vs. processing — a little queueing is good (optimal usage), too much queueing is bad]
Slide 18
Problems of bad queueing
✘ Latency increase
✘ Collapse
✘ Cascading failure
✘ Crash
✘ ...
Slide 19
You can’t control your clients, protect your server
Load shedding
Is a technique used in information systems,
especially web services, to avoid overloading the
system and making it unavailable for all users. The
idea is to ignore some requests rather than crashing
a system and making it fail to serve any request.
Wikipedia
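To make the idea concrete, here is a minimal Go sketch of load shedding at the HTTP layer (not from the talk; the 100 in-flight limit echoes the earlier example and the handler is a placeholder): once the server is at capacity, extra requests are rejected with 503 instead of being queued.

```go
package main

import "net/http"

// shedLoad rejects requests with 503 once maxInflight requests are
// already being handled, instead of letting them pile up in a queue.
func shedLoad(maxInflight int, next http.Handler) http.Handler {
	inflight := make(chan struct{}, maxInflight) // counting semaphore

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case inflight <- struct{}{}: // got a slot
			defer func() { <-inflight }()
			next.ServeHTTP(w, r)
		default: // server is full: shed the request
			http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", shedLoad(100, handler))
}
```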
Slide 22
No content
Slide 23
Resilience patterns
Slide 24
The tests
✘ Handler with random latency (250-600ms)
✘ 15-25 RPS is the regular capacity
✘ Test 1: 15 RPS for 3m (with a 50 RPS spike for 1m in the middle)
✘ Test 2: 60 RPS for 15m
✘ Limits of 0.1 CPU and 50 MB of memory
✘ Goresilience library (github.com/slok/goresilience)
The demo for these tests can be found at https://github.com/slok/resilience-demo
Slide 25
Naked
Server without protection.
Slide 26
Exp1: Naked server
✘ Accepts everything.
✘ No protection against spikes/bursts.
✘ The least recommended one.
✘ The most used one.
Slide 27
✘ Success: 61%
✘ P95: 15.8s
✘ P99: 24.7s
15 RPS with middle spike (50RPS)
Unresponsive on spike
Good
✘ Simple
Bad
✘ Cascading failure (latency)
✘ Crash (OOM)
Slide 31
Bulkhead
Bulkhead for concurrency control
Slide 32
No content
Slide 33
Bulkhead pattern
Isolate elements of an application into pools so that if one fails, the
others will continue to function.
[Diagram: an app without bulkheads shares one pool for Service A and Service B; an app with bulkheads gives each service its own isolated pool]
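A minimal Go sketch of a bulkhead (illustrative, not the goresilience implementation; the worker count and queue timeout are assumed example values): a fixed number of worker slots caps concurrency, and work that waits in the queue longer than the timeout is dropped.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Bulkhead limits concurrent executions to the number of workers and
// drops work that has waited in the queue longer than queueTimeout.
type Bulkhead struct {
	slots        chan struct{}
	queueTimeout time.Duration
}

func NewBulkhead(workers int, queueTimeout time.Duration) *Bulkhead {
	return &Bulkhead{
		slots:        make(chan struct{}, workers),
		queueTimeout: queueTimeout,
	}
}

var ErrQueueTimeout = errors.New("bulkhead: timed out waiting in queue")

// Run executes f if a worker slot is free, waiting at most queueTimeout.
func (b *Bulkhead) Run(ctx context.Context, f func(context.Context) error) error {
	select {
	case b.slots <- struct{}{}: // acquired a worker slot
		defer func() { <-b.slots }()
		return f(ctx)
	case <-time.After(b.queueTimeout): // waited too long: shed the request
		return ErrQueueTimeout
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	bh := NewBulkhead(2, 100*time.Millisecond) // example values

	for i := 0; i < 5; i++ {
		go func(i int) {
			err := bh.Run(context.Background(), func(ctx context.Context) error {
				time.Sleep(300 * time.Millisecond) // simulate slow work
				return nil
			})
			fmt.Println("request", i, "->", err)
		}(i)
	}
	time.Sleep(time.Second)
}
```

The queue timeout is what turns the bulkhead into a load shedder: without it, waiting requests would pile up exactly as they do in the naked server.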
Slide 34
No content
Slide 35
Exp2: Bulkhead
✘ Bulkhead will limit the concurrent handlings.
✘ Queue timeout will clean long time queued requests.
✘ Needs to be configured.
✘ Will protect us from bursts/spikes.
Good
✘ Simple
✘ Load shedding
Bad
✘ Static configuration (requires load tests)
✗ Timeouts
✗ Number of workers (concurrency)
✘ Wrong configuration could waste capacity
or overload the server
Slide 40
Bulkhead + circuit breaker
Bulkhead for concurrency control and circuit
breaker to control the constant overload
Slide 41
Circuit breaker pattern
It is used to detect failures and encapsulates the logic of preventing a failure
from constantly recurring, during maintenance, temporary external system
failure or unexpected system difficulties.
[State diagram: Closed (regular flow) → Open (fail fast) when the error limit is exceeded; Open → Half open after a timeout; Half open → Closed (regular flow) when test requests succeed, back to Open when they fail]
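A minimal sketch of that state machine in Go (illustrative; the failure threshold and open timeout are assumed example values, and a production breaker would also use a sliding error window and limit concurrent test requests while half open):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Circuit breaker states.
type state int

const (
	closed   state = iota // regular flow
	open                  // fail fast
	halfOpen              // let a test request through
)

// ErrOpen is returned while the breaker is open and failing fast.
var ErrOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	mu          sync.Mutex
	state       state
	failures    int
	maxFailures int           // error limit before opening
	openTimeout time.Duration // how long to fail fast before half-opening
	openedAt    time.Time
}

func New(maxFailures int, openTimeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{maxFailures: maxFailures, openTimeout: openTimeout}
}

// Run executes f according to the breaker state.
func (cb *CircuitBreaker) Run(f func() error) error {
	cb.mu.Lock()
	if cb.state == open {
		if time.Since(cb.openedAt) < cb.openTimeout {
			cb.mu.Unlock()
			return ErrOpen // open: fail fast
		}
		cb.state = halfOpen // timeout elapsed: allow a test request
	}
	cb.mu.Unlock()

	err := f()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		// Test failed while half open, or error limit exceeded: open the circuit.
		if cb.state == halfOpen || cb.failures >= cb.maxFailures {
			cb.state = open
			cb.openedAt = time.Now()
			cb.failures = 0
		}
		return err
	}
	// Success: back to regular flow.
	cb.state = closed
	cb.failures = 0
	return nil
}

func main() {
	cb := New(3, 2*time.Second) // example: open after 3 errors, retry after 2s
	for i := 0; i < 5; i++ {
		err := cb.Run(func() error { return errors.New("backend error") })
		fmt.Println(i, err)
	}
}
```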
Slide 42
Exp3: Bulkhead + circuit breaker
✘ Bulkhead will limit the concurrent handlings.
✘ Circuit breaker will release the load (fast).
✘ Hystrix style pattern.
✘ Needs to be configured.
✘ Will protect us from bursts/spikes.
✘ Circuit breaker wraps bulkhead.
Good
✘ Load shedding
✘ Recover faster than bulkhead
Bad
✘ Static configuration (requires load tests)
✘ Fails requests even though the server is healthy
Slide 47
Adaptive resilience
patterns
Slide 48
No content
Slide 49
LIFO + CoDel
(Facebook)
Unfair request handling and aggressive timeouts on
congestion
Explanation and algorithm: https://queue.acm.org/detail.cfm?id=2839461
Original CoDel: https://queue.acm.org/detail.cfm?id=2209336
Airbnb also uses it: https://medium.com/airbnb-engineering/building-services-at-airbnb-part-3-ac6d4972fc2d
Slide 50
CoDel (Controlled delay)
There are two kinds of in-queue timeouts:
✘ Regular (interval): 100ms by default
✘ Aggressive (target): 5ms by default
By default, queued requests get the interval timeout.
Track when the queue was last empty: if the time since then is greater than
the interval, congestion is detected.
While congested, queued requests get the aggressive target timeout instead.
Adaptive LIFO
There are two dequeue priorities:
✘ FIFO: in regular mode, first in, first out.
✘ LIFO: in congestion mode, last in, first out (unfair).
When CoDel detects congestion it changes the dequeue priority so that the
newest requests are served first (while CoDel drops old queued requests).
The algorithm assumes that the clients of long-queued requests have already
given up, so the newest requests are the most likely to still be worth serving.
Dropbox's proxy also uses adaptive LIFO: https://blogs.dropbox.com/tech/2018/03/meet-bandaid-the-dropbox-service-proxy
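A minimal Go sketch of the congestion detection described above (illustrative, not Facebook's or goresilience's code; only the 100ms interval and 5ms target defaults come from the text):

```go
package main

import (
	"fmt"
	"time"
)

// codelQueue picks the in-queue timeout and the dequeue order based on
// whether the queue has recently managed to drain completely.
type codelQueue struct {
	interval  time.Duration // regular in-queue timeout (100ms by default)
	target    time.Duration // aggressive timeout under congestion (5ms by default)
	lastEmpty time.Time     // last time the queue was seen empty
}

func newCodelQueue() *codelQueue {
	return &codelQueue{
		interval:  100 * time.Millisecond,
		target:    5 * time.Millisecond,
		lastEmpty: time.Now(),
	}
}

// observe records the last moment the queue was empty.
func (q *codelQueue) observe(queueLen int) {
	if queueLen == 0 {
		q.lastEmpty = time.Now()
	}
}

// congested reports congestion when the queue has not been empty
// for longer than one interval.
func (q *codelQueue) congested() bool {
	return time.Since(q.lastEmpty) > q.interval
}

// queueTimeout returns the timeout newly queued requests should get.
func (q *codelQueue) queueTimeout() time.Duration {
	if q.congested() {
		return q.target // aggressive: drain the backlog quickly
	}
	return q.interval
}

// dequeueLIFO reports whether the queue should switch to (unfair)
// last-in-first-out dequeueing.
func (q *codelQueue) dequeueLIFO() bool {
	return q.congested()
}

func main() {
	q := newCodelQueue()
	q.observe(0) // queue drained: regular mode
	fmt.Println(q.queueTimeout(), q.dequeueLIFO())

	q.lastEmpty = time.Now().Add(-500 * time.Millisecond) // pretend it hasn't drained
	fmt.Println(q.queueTimeout(), q.dequeueLIFO())
}
```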
Slide 53
Adaptive LIFO
Slide 54
Adaptive LIFO
[Diagram: under low load the queue drains first in, first out; under high load it switches to last in, first out, and CoDel will drain the queue if congestion persists]
Slide 55
[Diagram: no congestion — incoming requests (R) flow through the queue into the concurrency pool]
Slide 56
[Diagram: congestion — incoming requests (R) back up in the queue in front of the concurrency pool]
Slide 57
Exp4: Adaptive LIFO + CoDel
✘ Based on the CoDel network congestion-control algorithm,
adapted by Facebook.
✘ Dynamic timeout and queue priority (will adapt and change
policies on congestion).
✘ No configuration required (almost, safe defaults).
✘ Very aggressive timeouts on congestion.
Good
✘ Load shedding.
✘ Recover faster than circuit breaker.
✘ Unfair serving (serve new ones, leave old ones).
✘ No configuration
Bad
✘ Static concurrency config (doesn’t affect much).
Concurrency limits
Implements and integrates concepts from TCP congestion control to
auto-detect concurrency limits for services in order to achieve optimal
throughput with optimal latency.
[Graph: concurrent requests over time — the limit starts at an initial value and the discovered limit adapts toward the real limit]
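As one example of how such a limit can adapt, here is a minimal Go sketch of AIMD (additive increase, multiplicative decrease), the algorithm used in the next experiment (illustrative; the initial limit, bounds, and 0.75 backoff are assumed values, not taken from any specific library):

```go
package main

import "fmt"

// aimdLimit adapts a concurrency limit: grow it slowly while requests
// succeed, and cut it sharply when they fail or time out
// (additive increase, multiplicative decrease).
type aimdLimit struct {
	limit    float64
	minLimit float64
	maxLimit float64
	backoff  float64 // multiplier applied on failure, e.g. 0.75
}

func newAIMDLimit() *aimdLimit {
	return &aimdLimit{limit: 10, minLimit: 1, maxLimit: 1000, backoff: 0.75}
}

// update adjusts the limit after one request finishes.
// inflight is how many requests were running at the time.
func (l *aimdLimit) update(success bool, inflight int) {
	if success {
		// Only grow when we are actually using the current limit.
		if float64(inflight) >= l.limit/2 && l.limit < l.maxLimit {
			l.limit++
		}
		return
	}
	// Failure or timeout: back off multiplicatively.
	l.limit *= l.backoff
	if l.limit < l.minLimit {
		l.limit = l.minLimit
	}
}

func (l *aimdLimit) Limit() int { return int(l.limit) }

func main() {
	l := newAIMDLimit()
	for i := 0; i < 20; i++ {
		l.update(true, l.Limit()) // a run of successes grows the limit
	}
	fmt.Println("after successes:", l.Limit())

	l.update(false, l.Limit()) // one failure shrinks it sharply
	fmt.Println("after a failure:", l.Limit())
}
```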
Slide 65
Exp5: Adaptive concurrency (limits)
✘ Different algorithms (in this example AIMD but there are
more like Vegas, Gradient...).
✘ Adaptive concurrency.
✘ Static queue timeout and priority.
✘ No configuration required.
✘ Adapts based on execution results (errors and latency).
Good
✘ Load shedding.
✘ Recover faster.
✘ No configuration.
✘ Adapts to any environment (Hardware, autoscaling,
noisy neighbor...)
Bad
✘ Depending on the load and the algorithm, it can be slow to
adapt (or fail to adapt).
Slide 72
Conclusions
Slide 73
Conclusions
✘ There is no winner, depends on the app and context.
✘ There is a loser: no protection (naked server).
✘ Adaptive algorithms add complexity (use libraries like
goresilience) but are a better fit for dynamic environments
like cloud native.
✘ A bulkhead or circuit breaker can be enough.
✘ You can use a front proxy (or the sidecar pattern).
✘ Don’t trust your clients, protect yourself.
Slide 74
THANKS!
Any questions?
You can find me at
✘ @slok69
✘ https://slok.dev
✘ github.com/slok
http://bit.ly/resilience-form