
Resilience patterns (Server edition)


A talk about server-side resilience patterns.

Xabier Larrakoetxea

March 20, 2019

Transcript

  1. Resilience: the ability to absorb or avoid damage without suffering
    complete failure. (Wikipedia)
  2. Little's law: a theorem by John Little which states that the long-term
    average number L of customers in a stationary system is equal to the
    long-term average effective arrival rate λ multiplied by the average time
    W that a customer spends in the system. (Wikipedia)
  4. [Diagram: requests (R) arriving into a queue and a pool of concurrent
    workers, labelled with L, W and λ, illustrating L = λW]
  5. Example: our server can handle ~100 concurrent requests (limited by any
    of: CPU, memory, fixed concurrency, threads, file descriptors...).
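    For illustration only (the average handling time here is assumed): with
    L ≈ 100 concurrent requests in flight and an average time in the system of
    W = 0.5s, Little's law gives a sustainable arrival rate of roughly
    λ = L / W = 100 / 0.5 = 200 requests per second; above that rate the
    queue, and therefore the latency, grows without bound.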
  6. Load shedding: a technique used in information systems, especially web
    services, to avoid overloading the system and making it unavailable for
    all users. The idea is to ignore some requests rather than crashing the
    system and making it fail to serve any request. (Wikipedia)
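    As a rough sketch of load shedding at the HTTP layer (illustrative only,
    not the goresilience implementation; the limit and names are made up), a
    handler can be wrapped so that requests beyond a fixed concurrency limit
    are rejected immediately with 429 instead of piling up:

      package main

      import "net/http"

      // loadShed wraps a handler and rejects requests once maxConcurrent
      // requests are already in flight, instead of letting them queue forever.
      func loadShed(maxConcurrent int, next http.Handler) http.Handler {
          sem := make(chan struct{}, maxConcurrent) // counting semaphore

          return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
              select {
              case sem <- struct{}{}: // got a free slot: handle the request
                  defer func() { <-sem }()
                  next.ServeHTTP(w, r)
              default: // no slot free: shed the load with a controlled failure
                  http.Error(w, "too many requests", http.StatusTooManyRequests)
              }
          })
      }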
  7. The tests
    ✘ Handler with random latency (250-600ms), sketched below.
    ✘ 15-25 RPS is the regular capacity.
    ✘ Test 1: 15 RPS for 3m (with a 50 RPS spike for 1m in the middle).
    ✘ Test 2: 60 RPS for 15m.
    ✘ Limits of 0.1 CPU and 50 MB of memory.
    ✘ Goresilience library (github.com/slok/goresilience).
    The tests demo can be found at https://github.com/slok/resilience-demo
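    The handler described above could look roughly like this (a sketch; the
    real demo lives at https://github.com/slok/resilience-demo):

      package main

      import (
          "math/rand"
          "net/http"
          "time"
      )

      func main() {
          // Handler that simulates work by sleeping a random 250-600ms.
          handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
              latency := time.Duration(250+rand.Intn(351)) * time.Millisecond
              time.Sleep(latency)
              w.WriteHeader(http.StatusOK)
          })
          http.ListenAndServe(":8080", handler)
      }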
  8. Exp1: Naked server
    ✘ Accepts everything.
    ✘ No protection against spikes/bursts.
    ✘ The least recommended one.
    ✘ The most used one.
  9. 15 RPS with middle spike (50 RPS): unresponsive on the spike.
    ✘ Success: 61%
    ✘ P95: 15.8s
    ✘ P99: 24.7s
  10. Bulkhead pattern: isolate elements of an application into pools so that
    if one fails, the others will continue to function.
    [Diagram: app without bulkheads vs. app with bulkheads calling Service A
    and Service B]
  11. Exp2: Bulkhead
    ✘ The bulkhead limits the number of concurrent handlings (see the sketch below).
    ✘ A queue timeout cleans up requests that have been queued for too long.
    ✘ Needs to be configured.
    ✘ Protects us from bursts/spikes.
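    A minimal sketch of such a bulkhead (illustrative only, not the
    goresilience implementation; names are made up): a buffered channel bounds
    the concurrent executions, and callers wait at most a queue timeout for a
    free slot before being rejected:

      package bulkhead

      import (
          "context"
          "errors"
          "time"
      )

      var ErrQueueTimeout = errors.New("bulkhead: timed out waiting in queue")

      // Bulkhead limits concurrent executions and rejects callers that have
      // been waiting in the queue for longer than queueTimeout.
      type Bulkhead struct {
          slots        chan struct{}
          queueTimeout time.Duration
      }

      func New(maxConcurrent int, queueTimeout time.Duration) *Bulkhead {
          return &Bulkhead{
              slots:        make(chan struct{}, maxConcurrent),
              queueTimeout: queueTimeout,
          }
      }

      // Run executes f in one of the bulkhead slots, or fails fast after
      // queueTimeout.
      func (b *Bulkhead) Run(ctx context.Context, f func(context.Context) error) error {
          timer := time.NewTimer(b.queueTimeout)
          defer timer.Stop()

          select {
          case b.slots <- struct{}{}: // got a slot
              defer func() { <-b.slots }()
              return f(ctx)
          case <-timer.C: // queued for too long: shed the request
              return ErrQueueTimeout
          case <-ctx.Done(): // the caller gave up
              return ctx.Err()
          }
      }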
  12. 15 RPS with middle spike (50 RPS):
    ✘ Success: 83%
    ✘ P95: 6.5s
    ✘ P99: 8.5s
  13. 60 RPS (15m): P90 < 10s, controlled failures (429).
    ✘ Success: 38.75%
    ✘ P95: 9.3s
    ✘ P99: 12.2s
  14. Naked vs. Bulkhead
    15RPS: Naked    Success: 61%,    P95: 15.8s, P99: 24.7s
           Bulkhead Success: 83%,    P95: 6.5s,  P99: 8.5s
    60RPS: Naked    Success: 9.62%,  P95: 24s,   P99: 29s
           Bulkhead Success: 38.75%, P95: 9.3s,  P99: 12.2s
  15. Good
    ✘ Simple.
    ✘ Load shedding.
    Bad
    ✘ Static configuration (requires load tests):
      ✗ Timeouts.
      ✗ Number of workers (concurrency).
    ✘ Wrong configuration could waste capacity or overload the server.
  16. Circuit breaker pattern: used to detect failures and encapsulate the
    logic of preventing a failure from constantly recurring, during
    maintenance, temporary external system failure or unexpected system
    difficulties.
    [State diagram: Closed (regular flow) -> Open (fail fast) when the error
    limit is exceeded; Open -> Half open after a timeout; Half open -> Closed
    when test requests succeed; Half open -> Open when test requests fail]
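    A very reduced sketch of that state machine (illustrative only, not the
    Hystrix or goresilience implementation; thresholds and names are made up):

      package breaker

      import (
          "errors"
          "sync"
          "time"
      )

      type state int

      const (
          closed state = iota // regular flow
          open                // fail fast
          halfOpen            // let a test request through
      )

      var ErrOpen = errors.New("circuit breaker is open")

      type Breaker struct {
          mu          sync.Mutex
          st          state
          failures    int           // consecutive failures seen while closed
          maxFailures int           // error limit before opening
          openTimeout time.Duration // how long to stay open before probing
          openedAt    time.Time
      }

      func New(maxFailures int, openTimeout time.Duration) *Breaker {
          return &Breaker{maxFailures: maxFailures, openTimeout: openTimeout}
      }

      // Run executes f, failing fast while the breaker is open.
      func (b *Breaker) Run(f func() error) error {
          b.mu.Lock()
          if b.st == open {
              if time.Since(b.openedAt) < b.openTimeout {
                  b.mu.Unlock()
                  return ErrOpen // fail fast, don't touch the server
              }
              b.st = halfOpen // timeout expired: allow a test request
          }
          b.mu.Unlock()

          err := f()

          b.mu.Lock()
          defer b.mu.Unlock()
          if err != nil {
              b.failures++
              if b.st == halfOpen || b.failures >= b.maxFailures {
                  b.st = open // test failed or error limit exceeded
                  b.openedAt = time.Now()
              }
              return err
          }
          b.st = closed // test succeeded: back to the regular flow
          b.failures = 0
          return nil
      }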
  17. Exp3: Bulkhead + circuit breaker
    ✘ The bulkhead limits the number of concurrent handlings.
    ✘ The circuit breaker releases the load (fast).
    ✘ Hystrix-style pattern.
    ✘ Needs to be configured.
    ✘ Protects us from bursts/spikes.
    ✘ The circuit breaker wraps the bulkhead.
  18. 60 RPS (15m): less latency than the plain bulkhead; the circuit breaker
    opened (failing fast on spikes).
    ✘ Success: 30%
    ✘ P95: 6.5s
    ✘ P99: 8s
  19. Naked vs. Bulkhead vs. Bulkhead + circuit breaker (Bk + CB)
    15RPS: Naked    Success: 61%,    P95: 15.8s, P99: 24.7s
           Bulkhead Success: 83%,    P95: 6.5s,  P99: 8.5s
           Bk + CB  Success: 76.96%, P95: 5.2s,  P99: 7.5s
    60RPS: Naked    Success: 9.62%,  P95: 24s,   P99: 29s
           Bulkhead Success: 38.75%, P95: 9.3s,  P99: 12.2s
           Bk + CB  Success: 30%,    P95: 6.5s,  P99: 8s
  20. Good
    ✘ Load shedding.
    ✘ Recovers faster than the bulkhead alone.
    Bad
    ✘ Static configuration (requires load tests).
    ✘ Fails requests even though the server is OK.
  21. LIFO + CoDel (Facebook): unfair request handling and aggressive timeouts
    on congestion.
    Explanation and algorithm: https://queue.acm.org/detail.cfm?id=2839461
    Original CoDel: https://queue.acm.org/detail.cfm?id=2209336
    Airbnb also uses it: https://medium.com/airbnb-engineering/building-services-at-airbnb-part-3-ac6d4972fc2d
  22. CoDel (Controlled delay). There are two kinds of in-queue timeouts:
    ✘ Regular (interval): 100ms by default.
    ✘ Aggressive (target): 5ms by default.
    By default, requests get the interval timeout while queued. Measure when
    the queue was last empty; if the time since then is greater than the
    interval, congestion is detected. While congested, requests get the
    target timeout while queued.
  23. CoDel (Controlled delay)
    // Enqueue request.
    if (queue.timeSinceEmpty() > interval) {
        // Congestion: use the aggressive timeout.
        timeout = target
    } else {
        timeout = interval
    }
    queue.enqueue(req, timeout)
  24. Adaptive LIFO. There are two kinds of in-queue priorities:
    ✘ FIFO: in regular mode, first in, first out.
    ✘ LIFO: in congestion mode, last in, first out (unfair).
    When CoDel detects congestion it changes the dequeue priority and the
    newest requests are served first (CoDel cleans up old queued requests).
    The algorithm assumes that long-delayed queued requests have already been
    given up on, so the newest ones have a higher probability of still being
    useful when served (sketched below).
    Dropbox's proxy uses adaptive LIFO:
    https://blogs.dropbox.com/tech/2018/03/meet-bandaid-the-dropbox-service-proxy
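    Putting the two previous slides together, a rough sketch of an adaptive
    LIFO + CoDel queue (illustrative only; the real algorithms are in the
    linked articles, and everything here except the 100ms/5ms defaults from
    the slides is made up, including the absence of locking):

      package adaptivelifo

      import "time"

      const (
          interval = 100 * time.Millisecond // regular in-queue timeout
          target   = 5 * time.Millisecond   // aggressive timeout under congestion
      )

      type request struct {
          deadline time.Time // when the in-queue timeout expires
      }

      type queue struct {
          reqs      []request
          lastEmpty time.Time // last time the queue was seen empty
      }

      func newQueue() *queue {
          return &queue{lastEmpty: time.Now()}
      }

      // congested reports whether the queue has not been empty for longer
      // than the interval (the CoDel congestion signal).
      func (q *queue) congested(now time.Time) bool {
          if len(q.reqs) == 0 {
              q.lastEmpty = now
          }
          return now.Sub(q.lastEmpty) > interval
      }

      // enqueue assigns the in-queue timeout depending on congestion.
      func (q *queue) enqueue(now time.Time) {
          timeout := interval
          if q.congested(now) {
              timeout = target // aggressive timeout while congested
          }
          q.reqs = append(q.reqs, request{deadline: now.Add(timeout)})
      }

      // dequeue drops expired requests and then serves FIFO or LIFO
      // depending on congestion (adaptive LIFO).
      func (q *queue) dequeue(now time.Time) (request, bool) {
          alive := q.reqs[:0]
          for _, r := range q.reqs {
              if now.Before(r.deadline) { // drop requests that timed out in queue
                  alive = append(alive, r)
              }
          }
          q.reqs = alive

          if len(q.reqs) == 0 {
              q.lastEmpty = now
              return request{}, false
          }

          if q.congested(now) {
              r := q.reqs[len(q.reqs)-1] // LIFO: serve the newest request first
              q.reqs = q.reqs[:len(q.reqs)-1]
              return r, true
          }
          r := q.reqs[0] // FIFO: serve the oldest request first
          q.reqs = q.reqs[1:]
          return r, true
      }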
  25. [Diagram: Adaptive LIFO. Low load: first in, first out. High load: last
    in, first out; CoDel drains the queue if congestion persists]
  26. [Diagram: requests (R) entering the queue and the concurrent workers,
    no congestion]
  27. [Diagram: requests (R) entering the queue and the concurrent workers,
    congestion]
  28. Exp4: Adaptive LIFO + CoDel
    ✘ Based on the TCP CoDel congestion-control algorithm, adapted by Facebook.
    ✘ Dynamic timeout and queue priority (adapts and changes policies on congestion).
    ✘ (Almost) no configuration required; safe defaults.
    ✘ Very aggressive timeouts on congestion.
  29. 15 RPS with middle spike (50 RPS): triggered congestion.
    ✘ Success: 78%
    ✘ P95: 3.1s
    ✘ P99: 5.4s
  30. Naked vs. Bulkhead vs. Bulkhead + circuit breaker (Bk + CB) vs. CoDel
    15RPS: Naked    Success: 61%,    P95: 15.8s, P99: 24.7s
           Bulkhead Success: 83%,    P95: 6.5s,  P99: 8.5s
           Bk + CB  Success: 76.96%, P95: 5.2s,  P99: 7.5s
           CoDel    Success: 78%,    P95: 3.1s,  P99: 5.4s
    60RPS: Naked    Success: 9.62%,  P95: 24s,   P99: 29s
           Bulkhead Success: 38.75%, P95: 9.3s,  P99: 12.2s
           Bk + CB  Success: 30%,    P95: 6.5s,  P99: 8s
           CoDel    Success: 30%,    P95: 6s,    P99: 8s
  31. Good
    ✘ Load shedding.
    ✘ Recovers faster than the circuit breaker.
    ✘ Unfair serving (serves the new requests, drops the old ones).
    ✘ No configuration.
    Bad
    ✘ Static concurrency config (doesn't affect much).
  32. Concurrency limits: implements and integrates concepts from TCP
    congestion control to auto-detect concurrency limits for services, in
    order to achieve optimal throughput with optimal latency.
    [Graph: concurrent requests over time, showing the initial limit, the
    discovered limit and the real limit]
  33. Exp5: Adaptive concurrency (limits)
    ✘ Different algorithms (AIMD in this example, sketched below; there are
      more, like Vegas, Gradient...).
    ✘ Adaptive concurrency.
    ✘ Static queue timeout and priority.
    ✘ No configuration required.
    ✘ Adapts based on execution results (errors and latency).
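    A bare-bones illustration of AIMD (additive increase, multiplicative
    decrease) applied to a concurrency limit; this is not the goresilience or
    Netflix concurrency-limits code, and the constants are made up:

      package aimd

      import "sync"

      // Limit adjusts a concurrency limit with AIMD: grow it by one on every
      // success, cut it multiplicatively on failure (error or excessive latency).
      type Limit struct {
          mu       sync.Mutex
          current  float64
          min, max float64
      }

      func New(initial, min, max float64) *Limit {
          return &Limit{current: initial, min: min, max: max}
      }

      // Current returns the concurrency limit to enforce right now.
      func (l *Limit) Current() int {
          l.mu.Lock()
          defer l.mu.Unlock()
          return int(l.current)
      }

      // Observe updates the limit from the result of one request execution.
      func (l *Limit) Observe(failed bool) {
          l.mu.Lock()
          defer l.mu.Unlock()
          if failed {
              l.current *= 0.9 // multiplicative decrease: back off quickly
              if l.current < l.min {
                  l.current = l.min
              }
              return
          }
          l.current++ // additive increase: probe for more capacity slowly
          if l.current > l.max {
              l.current = l.max
          }
      }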
  34. 15 RPS with middle spike (50 RPS):
    ✘ Success: 82%
    ✘ P95: 4s
    ✘ P99: 6s
  35. Naked vs. Bulkhead vs. Bulkhead + circuit breaker (Bk + CB) vs. CoDel
    vs. Concurrency limits (CL)
    15RPS: Naked    Success: 61%,    P95: 15.8s, P99: 24.7s
           Bulkhead Success: 83%,    P95: 6.5s,  P99: 8.5s
           Bk + CB  Success: 76.96%, P95: 5.2s,  P99: 7.5s
           CoDel    Success: 78%,    P95: 3.1s,  P99: 5.4s
           CL       Success: 82%,    P95: 4s,    P99: 6s
    60RPS: Naked    Success: 9.62%,  P95: 24s,   P99: 29s
           Bulkhead Success: 38.75%, P95: 9.3s,  P99: 12.2s
           Bk + CB  Success: 30%,    P95: 6.5s,  P99: 8s
           CoDel    Success: 30%,    P95: 6s,    P99: 8s
           CL       Success: 36%,    P95: 4.8s,  P99: 6.5s
  36. Good
    ✘ Load shedding.
    ✘ Recovers faster.
    ✘ No configuration.
    ✘ Adapts to any environment (hardware, autoscaling, noisy neighbors...).
    Bad
    ✘ Depending on the load and the algorithm, it can be slow to adapt (or not adapt at all).
  37. Conclusions
    ✘ There is no winner; it depends on the app and the context.
    ✘ There is a loser: no protection (naked server).
    ✘ Adaptive algorithms add complexity (use libraries like goresilience)
      but work better in dynamic environments such as cloud native.
    ✘ A bulkhead or a circuit breaker can be enough.
    ✘ You can use a front proxy (or the sidecar pattern).
    ✘ Don't trust your clients; protect yourself.
  38. THANKS! Any questions? You can find me at:
    ✘ @slok69
    ✘ https://slok.dev
    ✘ github.com/slok
    http://bit.ly/resilience-form