Slide 1

Slide 1 text

10 patterns for more resilient applications
A gentle start into resilient software design
Uwe Friedrichsen – codecentric AG – 2014-2022

Slide 2

Slide 2 text

Uwe Friedrichsen Works @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/

Slide 3

Slide 3 text

What is that “resilience” thing?

Slide 4

Slide 4 text

re·sil·ience (rĭ-zĭl′yəns) n. 1. The ability to recover quickly from illness, change, or misfortune; buoyancy. 2. The property of a material that enables it to resume its original shape or position after being bent, stretched, or compressed; elasticity. American Heritage® Dictionary of the English Language, Fifth Edition. Copyright © 2016 by Houghton Mifflin Harcourt Publishing Company. Published by Houghton Mifflin Harcourt Publishing Company. All rights reserved. https://www.thefreedictionary.com/resilience

Slide 5

Slide 5 text

What does it mean for IT systems?

Slide 6

Slide 6 text

re·sil·ience (of IT systems) n.
The ability of a system to handle unexpected situations
• without the user noticing it (ideal case)
• with a graceful degradation of service and quick recovery to normal operations (non-ideal case)
A cautious attempt to provide a useful definition of resilience in the context of software systems. No copyright attached, but also no guarantee that this definition is sufficient for all relevant purposes.

Slide 7

Slide 7 text

Can’t we just leave it to ops as we did in the past?

Slide 8

Slide 8 text

What is the problem? Let ops run our software on some HA infrastructure or the like and everything will be fine.

Slide 9

Slide 9 text

Sorry, not that easy anymore

Slide 10

Slide 10 text

But why?

Slide 11

Slide 11 text

For a single, monolithic, isolated system this might indeed work, but …

Slide 12

Slide 12 text

(Almost) every system is a distributed system. -- Chas Emerick http://www.infoq.com/presentations/problems-distributed-systems

Slide 13

Slide 13 text

The software you develop and maintain is most likely part of a (big) distributed system landscape

Slide 14

Slide 14 text

Distributed systems in a nutshell

Slide 15

Slide 15 text

Everything fails, all the time. -- Werner Vogels

Slide 16

Slide 16 text

Failures in distributed systems ...
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure

Slide 17

Slide 17 text

... lead to a variety of effects …
• Lost messages
• Incomplete messages
• Duplicate messages
• Distorted messages
• Delayed messages
• Out-of-order message arrival
• Partial, out-of-sync local memory
• ...

Slide 18

Slide 18 text

... turning seemingly simple issues into very hard ones
• Time & Ordering: Leslie Lamport, "Time, clocks, and the ordering of events in distributed systems"
• Consensus: Leslie Lamport, "The part-time parliament" (Paxos)
• CAP: Eric A. Brewer, "Towards robust distributed systems"
• Faulty processes: Leslie Lamport, Robert Shostak, Marshall Pease, "The Byzantine generals problem"
• Consensus: Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson, "Impossibility of distributed consensus with one faulty process" (FLP)
• Impossibility: Nancy A. Lynch, "A hundred impossibility proofs for distributed computing"

Slide 19

Slide 19 text

Embracing distributed systems
• Distributed systems introduce non-determinism regarding
  • Execution completeness
  • Message ordering
  • Communication timing
• You will be affected by this at the application level
  • Don’t expect your infrastructure to hide all effects from you
  • Better know how to detect if it hit you and how to respond

Slide 20

Slide 20 text

Okay, I buy it. But how do I start?

Slide 21

Slide 21 text

Let us start simple … *
* which often improves the situation a great deal

Slide 22

Slide 22 text

Let us create our starter’s toolbox

Slide 23

Slide 23 text

Resilience starter’s toolbox

Slide 24

Slide 24 text

Accessing other systems

Slide 25

Slide 25 text

Resilience starter’s toolbox Accessing other systems (downstream)

Slide 26

Slide 26 text

from urllib3 import PoolManager

URL = <…>

http = PoolManager()
r = http.request('GET', URL)

https://github.com/urllib3/urllib3

Slide 27

Slide 27 text

Resilience starter’s toolbox: Accessing other systems (downstream)
Failure type
• No response (crash failure): The other system does not respond at all
• Brittle connection (omission failure): The other system does not respond reliably
• Slow response (timing failure): It takes too long until the other system responds
• Wrong response (response failure): The other system responds, but the response is not okay

Slide 28

Slide 28 text

Resilience starter’s toolbox: Accessing other systems (downstream)
Failure type
• No response (crash failure)
• Brittle connection (omission failure)
• Slow response (timing failure)
• Wrong response (response failure)
Detection
• Error checking

Slide 29

Slide 29 text

Error checking
• Most basic error detection pattern
• Yet too often neglected
• Multiple implementation variants
  • Exception handling (Java, C++, …)
  • Return code checking (C, …)
  • Extra error return value (Go, …)
• Thorough error checking tends to make code harder to read

Slide 30

Slide 30 text

Resilience starter’s toolbox: Accessing other systems (downstream)
Failure type
• No response (crash failure)
• Brittle connection (omission failure)
• Slow response (timing failure)
• Wrong response (response failure)
Detection
• Error checking
• Timeout

Slide 31

Slide 31 text

Timeout
• Preserve responsiveness independent of downstream latency
• Essential error detection pattern
  • Crucial if using synchronous communication
  • Also needed if using asynchronous request/response style
• Good library support in most programming languages

Slide 32

Slide 32 text

Resilience starter’s toolbox: Accessing other systems (downstream)
Failure type
• No response (crash failure)
• Brittle connection (omission failure)
• Slow response (timing failure)
• Wrong response (response failure)
Detection
• Error checking
• Timeout
• Circuit breaker

Slide 33

Slide 33 text

Circuit breaker
• Probably the most often cited resilience pattern
• Extension of the timeout pattern
• Takes downstream unit offline if calls fail multiple times
• Can be used for most failure types
  • Crash failures, omission failures, timing failures
• Many implementations available
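
The idea of taking a downstream unit offline after repeated failures can be sketched in a few lines. This is a minimal, single-threaded illustration and not the implementation of any particular library; the names CircuitBreaker, CircuitOpenError, max_failures and reset_after are assumptions made up for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    # Hypothetical helper: all names and defaults are illustrative assumptions.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # consecutive failures before opening
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock                  # injectable, which simplifies testing
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError('downstream unit is offline')
            # Reset period is over: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None                   # a success closes the circuit
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of waiting for another timeout, which is the point where a fallback would typically take over.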

Slide 34

Slide 34 text

Adding error and timeout detection

Slide 35

Slide 35 text

from urllib3 import PoolManager

URL = <…>

http = PoolManager()
r = http.request('GET', URL)

https://github.com/urllib3/urllib3

Slide 36

Slide 36 text

from concurrent.futures import ThreadPoolExecutor, TimeoutError
from urllib3 import PoolManager
from urllib3.exceptions import HTTPError

URL = 'http://httpbin.org/delay/2'

def get_url(http, url):
    return http.request('GET', url)

http = PoolManager()
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(get_url, http, URL)
    try:
        r = future.result(timeout=0.5)
    except TimeoutError:
        print('Request timed out')
        # Note: cancel() cannot stop a call that is already running
        future.cancel()
    except HTTPError:
        print('An error occurred')
    else:
        print('Received:', r.data)

Slide 37

Slide 37 text

from urllib3 import PoolManager
from urllib3.exceptions import HTTPError

URL = 'http://httpbin.org/delay/2'

http = PoolManager()
try:
    r = http.request('GET', URL, timeout=0.5)
except HTTPError:
    print('An error occurred or request timed out')
else:
    print('Received:', r.data)

Slide 38

Slide 38 text

Resilience starter’s toolbox: Accessing other systems (downstream)
Failure type
• No response (crash failure)
• Brittle connection (omission failure)
• Slow response (timing failure)
• Wrong response (response failure)
Detection
• Error checking
• Timeout
• Circuit breaker
• Response value checking

Slide 39

Slide 39 text

Response value checking
• As obvious as it sounds, yet often neglected
• Protection from broken/malicious return values
• Especially do not forget to check for Null values
• Quite good library support
  • But libraries often do not cover all checks needed
  • Consider specific data types
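
A small sketch of what such checks might look like for a JSON response. It assumes, purely for illustration, a body of the form {"result": <integer>}; the function name parse_square_response and the individual rules are hypothetical.

```python
import json


def parse_square_response(body):
    """Return the integer result, or raise ValueError if the body is not okay."""
    # Assumption for this sketch: the downstream service answers with
    # a JSON object like {"result": 49}.
    try:
        data = json.loads(body)
    except (TypeError, ValueError) as exc:
        raise ValueError('response is not valid JSON') from exc
    if not isinstance(data, dict):
        raise ValueError('response is not a JSON object')
    result = data.get('result')
    if result is None:
        raise ValueError('result is missing or null')
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(result, bool) or not isinstance(result, int):
        raise ValueError('result is not an integer')
    if result < 0:
        raise ValueError('a square cannot be negative')
    return result
```

The last check is the kind of domain-specific rule that generic validation libraries usually cannot know about.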

Slide 40

Slide 40 text

Resilience starter’s toolbox: Accessing other systems (downstream)
Failure type
• No response (crash failure)
• Brittle connection (omission failure)
• Slow response (timing failure)
• Wrong response (response failure)
Detection
• Error checking
• Timeout
• Circuit breaker
• Response value checking
Response w/o redundancy
• Retry

Slide 41

Slide 41 text

Retry
• Basic recovery pattern for downstream calls
• Recover from omission or other transient errors
• Limit retries to minimize extra load on an overloaded resource
• Limit retries to avoid recurring errors
• Some library support available
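
A limited retry with a growing pause between attempts might look roughly like this. The helper name call_with_retry, its parameters and the choice of exponential backoff are assumptions for this sketch, not something the slides prescribe.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=0.1,
                    retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func(); retry on transient errors, at most max_attempts times in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise                       # give up: retries are limited
            # Wait longer after each failure to avoid adding load
            # to a resource that may already be overloaded.
            sleep(base_delay * 2 ** (attempt - 1))
```

Only the exception types listed in retry_on are retried; anything else is treated as non-transient and propagates immediately.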

Slide 42

Slide 42 text

Response w/o redundancy Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Slow response (timing failure) Error checking Timeout Response value checking Accessing other systems (downstream) Wrong response (response failure) Retry Circuit breaker Backup request

Slide 43

Slide 43 text

Backup request
• Send request to multiple workers (usually with some delay)
• Use quickest reply and discard all other responses
• Prevents slow responses (or at least reduces their probability)
• Requires redundancy – trades resources for availability
See also: J. Dean, L. A. Barroso, “The tail at scale”, Communications of the ACM, Vol. 56 No. 2
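
One possible way to sketch a backup request with Python's standard library is shown below. The function name backup_request and the parameter backup_delay are illustrative assumptions; primary and backup stand for calls to two redundant workers.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def backup_request(primary, backup, backup_delay=0.05):
    """Call primary(); if it has not answered after backup_delay seconds,
    also call backup() and return whichever reply arrives first."""
    executor = ThreadPoolExecutor(max_workers=2)
    try:
        futures = [executor.submit(primary)]
        done, _ = wait(futures, timeout=backup_delay)
        if not done:
            # Primary is slow: send the backup request as well.
            futures.append(executor.submit(backup))
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # If the quickest reply is an exception, it is re-raised here.
        return next(iter(done)).result()
    finally:
        # Do not block on the slower call; its reply is simply discarded.
        executor.shutdown(wait=False)
```

The slower call still runs to completion in the background, which is the resource cost the pattern trades for lower tail latency.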

Slide 44

Slide 44 text

Response w/o redundancy Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Slow response (timing failure) Error checking Timeout Response value checking Accessing other systems (downstream) Wrong response (response failure) Retry Caching Circuit breaker Backup request

Slide 45

Slide 45 text

Caching
• Re-use responses from prior calls to downstream resources
• Can bridge temporary unavailability of resources
• Use with caution
  • Requires extra resources to store cached data
  • Leaves you with potentially stale data and all consistency issues associated with it
• Good tool and library support
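
To illustrate how a cache can bridge a temporary outage, here is a minimal sketch that serves a stale entry when the downstream call fails. The class name StaleOkCache, the ttl parameter and the decision to serve stale data are assumptions for this example; whether stale data is acceptable depends on the use case.

```python
import time


class StaleOkCache:
    """Cache downstream responses; serve stale entries if the resource is unavailable."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self.fetch = fetch      # function key -> value, may raise
        self.ttl = ttl          # seconds an entry counts as fresh
        self.clock = clock
        self.entries = {}       # key -> (value, stored_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]                     # fresh hit, no downstream call
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]                 # bridge the outage with stale data
            raise                               # nothing cached: surface the error
        self.entries[key] = (value, self.clock())
        return value
```

This sketch never evicts entries, so the extra memory mentioned above grows without bound; a real cache would need a size limit.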

Slide 46

Slide 46 text

Response w/o redundancy Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Slow response (timing failure) Error checking Timeout Response value checking Accessing other systems (downstream) Wrong response (response failure) Retry Fallback Caching Circuit breaker Backup request

Slide 47

Slide 47 text

Fallback
• Execute an alternative action if the original action fails
• Basis for most mitigation patterns
• Widespread simple variants
  • Fail silently: silently ignore error and continue processing
  • Default value: return predefined default value if error occurs
• Note that fallback action is a business decision

Slide 48

Slide 48 text

Adding retry and fallback

Slide 49

Slide 49 text

from urllib3 import PoolManager
from urllib3.exceptions import HTTPError

URL = 'http://httpbin.org/delay/2'

http = PoolManager()
try:
    r = http.request('GET', URL, timeout=0.5)
except HTTPError:
    print('An error occurred or request timed out')
else:
    print('Received:', r.data)

Slide 50

Slide 50 text

from urllib3 import PoolManager
from urllib3.exceptions import HTTPError

URL = 'http://httpbin.org/delay/2'

http = PoolManager()

def get_url(http, url):
    try:
        r = http.request('GET', url, timeout=0.5)
    except HTTPError:
        return None  # None means something went wrong
    else:
        return r.data

d = get_url(http, URL)
if d is None:
    d = get_url(http, URL)  # Retry once
if d is None:
    d = 42  # Execute fallback
print('Received:', d)

Slide 51

Slide 51 text

from urllib3 import PoolManager
from urllib3.exceptions import HTTPError

URL = 'http://httpbin.org/delay/2'

http = PoolManager()
try:
    r = http.request('GET', URL, timeout=0.5, retries=1)
except HTTPError:
    d = 42  # Execute fallback
else:
    d = r.data
print('Received:', d)

Slide 52

Slide 52 text

Response w/ redundancy Response w/o redundancy Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Slow response (timing failure) Error checking Timeout Response value checking Accessing other systems (downstream) Wrong response (response failure) Retry Fallback Caching Failover Circuit breaker Backup request

Slide 53

Slide 53 text

Failover
• Used if simpler recovery measures fail or take too long
• Many implementation variants available
• Good support on the infrastructure level
  • Recovery and state replication usually not covered
• Mind the business case
  • Requires redundancy – trades resources for availability
  • Added value needs to justify added costs

Slide 54

Slide 54 text

Response w/ redundancy Response w/o redundancy Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Slow response (timing failure) Error checking Timeout Response value checking Accessing other systems (downstream) Wrong response (response failure) Retry Fallback Caching Failover Custom response Circuit breaker Postel’s law Backup request

Slide 55

Slide 55 text

Remember Postel’s law
“Be conservative in what you do, be liberal in what you accept from others”
(Often reworded as: “Be conservative in what you send, be liberal in what you accept”)
See also: https://en.wikipedia.org/wiki/Robustness_principle

Slide 56

Slide 56 text

Being accessed by other systems

Slide 57

Slide 57 text

Being accessed by other systems (upstream) Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Slow response (timing failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Accessing other systems (downstream) Response w/ redundancy Wrong response (response failure) Custom response Circuit breaker Postel’s law Backup request

Slide 58

Slide 58 text

from fastapi import FastAPI

app = FastAPI()

@app.get("/square/{number}")
def read_square(number):
    n = int(number)
    return {"result": n*n}

https://fastapi.tiangolo.com/

Slide 59

Slide 59 text

Resilience starter’s toolbox
Columns: Failure type, Detection, Response w/o redundancy, Response w/ redundancy
Accessing other systems (downstream)
• No response (crash failure)
• Brittle connection (omission failure)
• Slow response (timing failure)
• Wrong response (response failure)
Being accessed by other systems (upstream)
• Wrong request (response failure): The request parameters are not okay
• Overload (timing failure): The other systems send too many requests
Patterns so far: Error checking, Timeout, Circuit breaker, Response value checking, Retry, Fallback, Caching, Backup request, Failover, Custom response, Postel’s law

Slide 60

Slide 60 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Wrong response (response failure) Custom response Circuit breaker Postel’s law Request parameter checking Backup request

Slide 61

Slide 61 text

Request parameter checking
• As obvious as it sounds, yet often neglected
• Protection from broken/malicious request parameters
• Especially do not forget to check for Null values
• Quite good library support
  • But libraries often do not cover all checks needed
  • Consider specific data types

Slide 62

Slide 62 text

Adding parameter checking

Slide 63

Slide 63 text

from fastapi import FastAPI

app = FastAPI()

@app.get("/square/{number}")
def read_square(number):
    n = int(number)
    return {"result": n*n}

https://fastapi.tiangolo.com/

Slide 64

Slide 64 text

from fastapi import FastAPI, Path

app = FastAPI()

@app.get("/square/{number}")
def read_square(number: int = Path(..., gt=0, lt=100)):
    return {"result": number*number}

Slide 65

Slide 65 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Wrong response (response failure) Custom response Monitoring Circuit breaker Postel’s law Request parameter checking Backup request

Slide 66

Slide 66 text

Monitoring
• Indispensable when running distributed systems
• Good tool support available
• Usually needs application-level support for best performance
  • Application-level and business-level metrics
• Should be combined with self-healing measures
  • Alarms should only be sent if self-healing fails

Slide 67

Slide 67 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Monitoring Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Shed load Circuit breaker Postel’s law Backup request

Slide 68

Slide 68 text

Shed load
• Limit load to keep throughput of resource acceptable
  • Reject (shed) requests (“rate limiting”)
• Best shed load at periphery
  • Minimize impact on resource itself
• Good tool support available
  • Usually requires monitoring data to watch load of resource
• Try not to break ongoing multi-request sessions
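
Rate limiting is often explained with a token bucket. The following is a minimal, single-threaded sketch of that technique; the slides do not name a specific algorithm, so the token bucket itself and the names TokenBucket, rate and capacity are assumptions chosen for illustration.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        # Refill according to the time elapsed since the last call.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A request handler at the periphery would call allow() first and reject the request right away (for HTTP, typically with status 429) when it returns False.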

Slide 69

Slide 69 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Monitoring Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Shed load Share load Circuit breaker Postel’s law Backup request

Slide 70

Slide 70 text

Share load
• Share load between resources to keep throughput good
• Use if additional resources for load sharing can be used
• Can be implemented statically or dynamically (“auto-scaling”)
• Very good tool support available
• Minimize synchronization needed between resources
  • Synchronization needs kill scalability

Slide 71

Slide 71 text

Useful complementing patterns

Slide 72

Slide 72 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Complement Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Shed load Share load Monitoring Idempotency Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Circuit breaker Postel’s law Monitoring Backup request

Slide 73

Slide 73 text

Idempotency
• Non-idempotent calls become very complicated if they fail
• Idempotent calls can be repeated without problems
  • Always return the same result
  • Do not trigger any cumulative side effects
• Reduces coupling between nodes
• Simplifies responding to most failure types a lot
• Very fundamental resilience and scalability pattern
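
One common way to make a state-changing call idempotent is to let the caller send a unique request ID and to remember the result of the first execution. The sketch below assumes this approach; PaymentService, deposit and request_id are hypothetical names, and the in-memory dictionary stands in for durable storage.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by request ID."""

    def __init__(self):
        self.balance = 0
        self.processed = {}     # request ID -> result of the first execution

    def deposit(self, request_id, amount):
        if request_id in self.processed:
            # Repeated call: same result, no additional side effect.
            return self.processed[request_id]
        self.balance += amount
        result = {'request_id': request_id, 'balance': self.balance}
        self.processed[request_id] = result
        return result
```

With this in place, a caller that timed out can simply retry with the same request ID without risking a double deposit.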

Slide 74

Slide 74 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Complement Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Shed load Share load Monitoring Idempotency Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Temporal decoupling Circuit breaker Postel’s law Monitoring Backup request

Slide 75

Slide 75 text

Temporal decoupling
• Request, processing and response are temporally decoupled
• Simplifies responding to timing failures a lot
  • Not necessary to recover from failures within caller’s response time expectations
• Functional design issue
  • Technology only augments it
• Enables simpler and more robust communication types
  • E.g., batch processing
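
The separation of request, processing and response can be illustrated with a tiny in-process sketch: the caller gets a ticket immediately, the work happens later, and the result is picked up at some other time. JobBoard and its methods are hypothetical names; a real system would typically use a message queue or a batch job instead of in-memory structures.

```python
import queue
import uuid


class JobBoard:
    def __init__(self):
        self.pending = queue.Queue()    # accepted but not yet processed requests
        self.results = {}               # ticket -> result

    def submit(self, payload):
        """Accept the request and return a ticket right away."""
        ticket = str(uuid.uuid4())
        self.pending.put((ticket, payload))
        return ticket

    def process_pending(self, handler):
        """Process everything queued so far, e.g. from a periodic batch run."""
        while True:
            try:
                ticket, payload = self.pending.get_nowait()
            except queue.Empty:
                return
            self.results[ticket] = handler(payload)

    def result(self, ticket):
        """Return the result, or None if it is not available yet."""
        return self.results.get(ticket)
```

Because the caller does not wait, a slow or briefly unavailable processor only delays the result instead of causing a timing failure.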

Slide 76

Slide 76 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Complement Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Shed load Share load Monitoring Idempotency Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Temporal decoupling Circuit breaker Quorum based reads & writes Postel’s law Monitoring Backup request

Slide 77

Slide 77 text

Quorum-based reads and writes
• Became popular with the rise of NoSQL databases
• Useful pattern for distributed, replicated data stores
• Relaxes consistency constraints while writing
• Detects inconsistencies due to a (temporarily) failed prior write
• Not a replacement for response value checking
• Not to be confused with ACID transactions
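
The core rule behind quorum-based access is that the read quorum R and the write quorum W overlap, i.e. R + W > N for N replicas, so every read sees at least one replica that took part in the latest successful write. The toy model below simulates this in memory; QuorumStore, the caller-supplied version numbers and the up flags are assumptions for illustration only.

```python
class QuorumStore:
    """Toy replicated store with N replicas, write quorum W and read quorum R."""

    def __init__(self, n=3, w=2, r=2):
        if r + w <= n:
            raise ValueError('read and write quorums must overlap (R + W > N)')
        self.replicas = [dict() for _ in range(n)]
        self.up = [True] * n     # simulated availability per replica
        self.w, self.r = w, r

    def write(self, key, value, version):
        acks = 0
        for replica, up in zip(self.replicas, self.up):
            if up:
                replica[key] = (version, value)
                acks += 1
        if acks < self.w:
            raise RuntimeError('write quorum not reached')

    def read(self, key):
        answers = [rep.get(key, (0, None))
                   for rep, up in zip(self.replicas, self.up) if up]
        if len(answers) < self.r:
            raise RuntimeError('read quorum not reached')
        # Replicas that missed a prior write answer with an older version;
        # the highest version among the answers wins.
        return max(answers, key=lambda a: a[0])[1]
```

Differing versions among the answers are exactly the inconsistency signal mentioned above; a real store would also repair the outdated replicas.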

Slide 78

Slide 78 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Complement Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Shed load Share load Monitoring Idempotency Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Temporal decoupling Circuit breaker Quorum based reads & writes Postel’s law Graceful startup Monitoring Backup request

Slide 79

Slide 79 text

Graceful startup
• Implement graceful startup mode
  • Wait until all required resources and services are available before switching to runtime mode
• Makes application startup order interchangeable
• Crucial for quick recovery after bigger failures
• Simple and powerful, but often neglected pattern
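
A startup mode of this kind can be as simple as polling a set of readiness checks before accepting work. The sketch below assumes each dependency can be probed with a function returning True or False; wait_until_ready and its parameters are hypothetical names.

```python
import time


def wait_until_ready(checks, timeout=60.0, interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Block until every check in `checks` (name -> callable returning bool)
    passes, or raise TimeoutError naming the resources still missing."""
    deadline = clock() + timeout
    pending = dict(checks)
    while True:
        for name, check in list(pending.items()):
            try:
                if check():
                    del pending[name]
            except Exception:
                pass                # treat errors as "not ready yet"
        if not pending:
            return                  # all dependencies are up: switch to runtime mode
        if clock() >= deadline:
            raise TimeoutError('not ready: ' + ', '.join(sorted(pending)))
        sleep(interval)
```

Since every application waits for what it needs, the order in which applications are started no longer matters.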

Slide 80

Slide 80 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Complement Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Shed load Share load Monitoring Idempotency Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Temporal decoupling Circuit breaker Quorum based reads & writes Postel’s law Graceful startup Monitoring Fail fast Backup request

Slide 81

Slide 81 text

Fail fast
• “If you know you’re going to fail, you better fail fast”
• Usually implemented in front of costly actions
• Saves time and resources by avoiding foreseeable failures
• Useful in normal operations mode
• Can be counterproductive in startup mode
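
As a rough illustration, failing fast means running the cheap checks before the expensive work starts. The function generate_report, its parameters and the summing step that stands in for the costly action are all made up for this sketch.

```python
def generate_report(orders, database_available):
    """Hypothetical costly action guarded by cheap up-front checks."""
    # Fail fast: check cheap preconditions before doing any expensive work.
    if not orders:
        raise ValueError('no orders given')
    if not database_available():
        raise RuntimeError('database unavailable, not starting report')
    # Stands in for the costly part of the action.
    total = sum(order['amount'] for order in orders)
    return {'count': len(orders), 'total': total}
```

During startup the same availability check would be a reason to wait rather than to fail, which is why the pattern can be counterproductive there.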

Slide 82

Slide 82 text

What can we delegate to the infrastructure level?

Slide 83

Slide 83 text

Resilience starter’s toolbox No response (crash failure) Brittle connection (omission failure) Failure type Detection Response w/o redundancy Complement Slow response (timing failure) Wrong request (response failure) Error checking Timeout Fallback Caching Failover Retry Response value checking Shed load Share load Monitoring Idempotency Accessing other systems (downstream) Being accessed by other systems (upstream) Overload (timing failure) Response w/ redundancy Request parameter checking Wrong response (response failure) Custom response Temporal decoupling Circuit breaker Quorum based reads & writes Postel’s law Graceful startup Monitoring Fail fast Coarse-grained Generic Needs application-level metrics and interaction Coarse-grained Backup request

Slide 84

Slide 84 text

But that is still a lot to implement

Slide 85

Slide 85 text

But what should be the alternative?

Slide 86

Slide 86 text

Should we let the application crash whenever something goes wrong?

Slide 87

Slide 87 text

Always keep in mind …

Slide 88

Slide 88 text

The question is no longer if failures will hit you. The only question left is when and how badly they will hit you.

Slide 89

Slide 89 text

Thus, look for library and framework support … but do the work!

Slide 90

Slide 90 text

Everyone loves resilient applications

Slide 91

Slide 91 text

Wrap-up

Slide 92

Slide 92 text

Wrap-up
• Distribution makes resilient software design mandatory
• Failures will hit you at the application level
• The starter’s toolbox
• Delegate to the infrastructure what is possible
  • ... but consider the limitations
• Look for library and framework support

Slide 94

Slide 94 text

Recommended readings
• Release It! Design and Deploy Production-Ready Software, Michael Nygard, 2nd edition, Pragmatic Bookshelf, 2018
• Patterns for Fault Tolerant Software, Robert S. Hanmer, Wiley, 2007
• Distributed Systems – Principles and Paradigms, Andrew Tanenbaum, Marten van Steen, 3rd Edition, 2017, https://www.distributed-systems.net/index.php/books/ds3/
• On Designing and Deploying Internet-Scale Services, James Hamilton, 21st LISA Conference 2007
• Site Reliability Engineering, Betsy Beyer et al., O’Reilly, 2016
