Slide 1

Slide 1 text

Resilient functional service design The usually forgotten parts of resilient software design (updated 2021 edition) Uwe Friedrichsen – codecentric AG – 2015-2021

Slide 2

Slide 2 text

Uwe Friedrichsen CTO @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/

Slide 3

Slide 3 text

What is that “resilience” thing?

Slide 4

Slide 4 text

re·sil·ience (rĭ-zĭl′yəns) n. 1. The ability to recover quickly from illness, change, or misfortune; buoyancy. 2. The property of a material that enables it to resume its original shape or position after being bent, stretched, or compressed; elasticity. American Heritage® Dictionary of the English Language, Fifth Edition. Copyright © 2016 by Houghton Mifflin Harcourt Publishing Company. Published by Houghton Mifflin Harcourt Publishing Company. All rights reserved. https://www.thefreedictionary.com/resilience

Slide 5

Slide 5 text

What has that got to do with IT?

Slide 6

Slide 6 text

Well, it depends on what you are looking at

Slide 7

Slide 7 text

The comprehensive view 1

Slide 8

Slide 8 text

Resilience is a huge topic with many aspects …

Slide 9

Slide 9 text

… which can also be applied to IT

Slide 10

Slide 10 text

Resilience in IT
Impact area: • Infrastructure • Application • IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …)
Failure modes: • Known failure modes • Unknown failure modes
Value: • Avoid adversity • Withstand adversity • Recover from adversity
Types [1]: • Rebound • Robustness • Graceful extensibility • Sustained adaptability
Levels [2]: • Resilience • Adaptability • Transformability
Affected system: • Technical (IT) • Socio-technical • Social (humans)
[1] according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015)
[2] according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004)

Slide 11

Slide 11 text

Resilience in IT
Impact area: • Infrastructure • Application • IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …)
Value: • Avoid adversity • Withstand adversity • Recover from adversity
Affected system: • Technical (IT) • Socio-technical • Social (humans)
Types [1]: • Rebound • Robustness • Graceful extensibility • Sustained adaptability
Levels [2]: • Resilience • Adaptability • Transformability
Failure modes: • Known failure modes • Unknown failure modes
[1] according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015)
[2] according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004)
Focus of this presentation

Slide 12

Slide 12 text

Resilience in IT
[Diagram: axes “Failures” (known, unknown) and “System” (technical, social); labeled areas: chaos engineering; other types of resilience and increased impact area; fault-tolerant software design, marked as focus of this presentation]

Slide 13

Slide 13 text

The systems perspective 2

Slide 14

Slide 14 text

It is all about production!

Slide 15

Slide 15 text

Business value is delivered by systems in production, but only if the systems are available

Slide 16

Slide 16 text

What is the problem? Let us install our software on some HA hardware or infrastructure and everything will be fine.

Slide 17

Slide 17 text

For a single, monolithic, isolated system this might indeed work, but …

Slide 18

Slide 18 text

(Almost) every system is a distributed system. -- Chas Emerick http://www.infoq.com/presentations/problems-distributed-systems

Slide 19

Slide 19 text

Distributed systems in a nutshell

Slide 20

Slide 20 text

Everything fails, all the time. -- Werner Vogels

Slide 21

Slide 21 text

Failures in distributed systems ... • Crash failure • Omission failure • Timing failure • Response failure • Byzantine failure

Slide 22

Slide 22 text

... lead to a variety of effects … • Lost messages • Incomplete messages • Duplicate messages • Distorted messages • Delayed messages • Out-of-order message arrival • Partial, out-of-sync local memory • ...

Slide 23

Slide 23 text

... turning seemingly simple issues into very hard ones
• Time & Ordering: Leslie Lamport, “Time, clocks, and the ordering of events in distributed systems”
• Consensus: Leslie Lamport, “The part-time parliament” (Paxos)
• CAP: Eric A. Brewer, “Towards robust distributed systems”
• Faulty processes: Leslie Lamport, Robert Shostak, Marshall Pease, “The Byzantine generals problem”
• Consensus: Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson, “Impossibility of distributed consensus with one faulty process” (FLP)
• Impossibility: Nancy A. Lynch, “A hundred impossibility proofs for distributed computing”

Slide 24

Slide 24 text

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport

Slide 25

Slide 25 text

Failures in today’s complex, distributed and interconnected systems are not the exception. • They are the normal case • They are not predictable

Slide 26

Slide 26 text

… and it’s getting “worse” • Cloud-based systems • Service-based architectures • Zero Downtime • Mobile & IoT • Social Web → Ever-increasing complexity and connectivity

Slide 27

Slide 27 text

Do not try to avoid failures Embrace them

Slide 28

Slide 28 text

re·sil·ience (of IT systems) n. The ability of a system to handle unexpected situations • without the user noticing it (ideal case) • with a graceful degradation of service (non-ideal case) The cautious attempt to provide a useful definition for resilience in the context of software systems. No copyright attached, but also no guarantee that this definition is sufficient for all relevant purposes.

Slide 29

Slide 29 text

Typical measures
• High-availability (HA) hardware/software
  • Only applicable for very small installations
  • Usually not available in cloud environments
• Delegate failure handling to infrastructure level
  • Partial relief, will not solve all problems
• Implement resilient software design patterns
  • Very important, still will not fix a bad design
• Minimize number of predetermined breaking points
  • Minimize problem surface by design

Slide 30

Slide 30 text

Typical measures
• High-availability (HA) hardware/software
  • Only applicable for very small installations
  • Usually not available in cloud environments
• Delegate failure handling to infrastructure level
  • Partial relief, will not solve all problems
• Implement resilient software design patterns
  • Very important, still will not fix a bad design
• Minimize number of predetermined breaking points
  • Minimize problem surface by design
Focus of this presentation

Slide 31

Slide 31 text

Designing for resilience …

Slide 32

Slide 32 text

First you learn about resilience …

Slide 33

Slide 33 text

Core Detect Treat Prevent Recover Mitigate Complement

Slide 34

Slide 34 text

Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement

Slide 35

Slide 35 text

… then you digest the stuff you have learned

Slide 36

Slide 36 text

Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement Cool! Give me more! More cool stuff Nice! Sounds like theory Boring! One-off? Whatever! Well … Later! Well … Maybe!

Slide 37

Slide 37 text

Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as perceived by engineer Actual relevance for system robustness

Slide 38

Slide 38 text

Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as perceived by engineer Actual relevance for system robustness Ye be warned! If you do not get this part right, nothing else matters ‘ere be dragons! This part is extremely hard and poorly understood

Slide 39

Slide 39 text

We have a problem!

Slide 40

Slide 40 text

Let us take a closer look at the core parts

Slide 41

Slide 41 text

Core Detect Treat Prevent Recover Mitigate Complement

Slide 42

Slide 42 text

Core Detect Treat Prevent Recover Mitigate Complement Isolation

Slide 43

Slide 43 text

Isolation • System must not fail as a whole • Split system in parts and isolate parts against each other • Avoid cascading failures • Foundation of resilient software design

Slide 44

Slide 44 text

Core Detect Treat Prevent Recover Mitigate Complement Isolation Failure unit

Slide 45

Slide 45 text

Failure unit • Core isolation pattern • a.k.a. “units of mitigation” • Functional slices, isolated via bulkheads • Diverse implementation choices available • (Micro-)service, actor, CSP, SCS, ... • Choice impacts system and resilience design a lot • Shaping good failure units is extremely hard • Pure design issue

Slide 46

Slide 46 text

Really? Sounds easy! Where is the problem?

Slide 47

Slide 47 text

How do we avoid this …
[Diagram: a client request reaches Service A, which calls Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design

Slide 48

Slide 48 text

… and this …
[Diagram: a single client request fans out to a large number of services]
Due to functional design we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design

Slide 49

Slide 49 text

… without ending up with this?
[Diagram: a client request reaches a single Mothership Service (a.k.a. Monolith)]
By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design

Slide 50

Slide 50 text

Let us apply the well-known design best practices!

Slide 51

Slide 51 text

Well-known design best practices • Divide & conquer a.k.a. functional decomposition • DRY (Don’t Repeat Yourself) • Design for reusability • Layered architecture • …

Slide 52

Slide 52 text

Unfortunately …

Slide 53

Slide 53 text

… this usually leads to this …
[Diagram: a client request reaches Service A, which calls Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design

Slide 54

Slide 54 text

… and this …
[Diagram: a single client request fans out to a large number of services]
Due to functional design we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design

Slide 55

Slide 55 text

… and in the end also often to this.
[Diagram: a client request reaches a single Mothership Service (a.k.a. Monolith)]
By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design

Slide 56

Slide 56 text

Welcome to distributed hell!

Slide 57

Slide 57 text

Caches to the rescue!

Slide 58

Slide 58 text

Break tight service coupling by caching data/responses of downstream service
[Diagram: Request, Service A, Cache of B, Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design
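To make the caching idea concrete, here is a minimal sketch of what a “cache of B” inside Service A could look like. Everything in it is an assumption for illustration and not taken from the talk: the names CachedDownstream and DownstreamUnavailable, the max-age policy, and the choice to serve a stale entry when Service B cannot be reached. Note that the caller gets an explicit stale flag, because serving stale data is a trade-off the caller has to be able to see.

```python
import time


class DownstreamUnavailable(Exception):
    """Raised when the (hypothetical) call to Service B fails."""


class CachedDownstream:
    """Wraps a call to a downstream service with a simple in-memory cache."""

    def __init__(self, fetch, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch              # function that really calls Service B
        self._max_age = max_age_seconds  # how long an entry counts as fresh
        self._clock = clock              # injectable for testing
        self._entries = {}               # key -> (value, stored_at)

    def get(self, key):
        """Return (value, is_stale) for the given key."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] <= self._max_age:
            return entry[0], False       # fresh hit, no remote call
        try:
            value = self._fetch(key)
        except DownstreamUnavailable:
            if entry is not None:
                return entry[0], True    # degrade: serve stale data
            raise                        # nothing cached, cannot answer
        self._entries[key] = (value, now)
        return value, False
```

The sketch only hides the dependency for keys that were requested before. A request for an uncached key still fails when Service B is down, so the coupling between the two services is reduced, not removed.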

Slide 59

Slide 59 text

Do you really think that copying stale data all over your system is a suitable measure to fix an inherently broken design? * * Caches are a great measure in several places, but they do not make up for a messed-up design

Slide 60

Slide 60 text

Sorry, not that easy

Slide 61

Slide 61 text

We must re-learn functional design for distributed systems!

Slide 62

Slide 62 text

Okay, where is the silver bullet?

Slide 63

Slide 63 text

Sorry, out of stock * * as always

Slide 64

Slide 64 text

Maybe some 5-minute crash course, some checklist, …?

Slide 65

Slide 65 text

Sorry, not that easy

Slide 66

Slide 66 text

It is a lot of hard work …

Slide 67

Slide 67 text

… and there is no shortcut

Slide 68

Slide 68 text

Yet, a few guiding thoughts regarding bulkhead design * * not to be confused with silver bullets – it will still be hard work

Slide 69

Slide 69 text

Foundations of design
• “High cohesion, low coupling” & “separation of concerns”
  • Crucial across process boundaries
  • Still poorly understood issue
• Start with
  • Understanding organizational boundaries
  • Understanding use cases and flows
  • Identifying functional domains (→ DDD)
  • Finding areas that change independently
• Do not start with a data model!

Slide 70

Slide 70 text

Short activation paths
• Long activation paths affect availability
  • Increase likelihood of failures
• Minimize remote calls per request
• Need to balance opposing forces
  • Avoid monolith → clear separation of concerns
  • Minimize requests → cluster functionality & data
• Caches can sometimes help, but stale data as trade-off
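The effect of long activation paths can be illustrated with simple arithmetic. The sketch below assumes that the services on a path fail independently and that a request succeeds only if every one of them answers, so the individual availabilities multiply. The function name and the 99.9% figure are illustrative assumptions, not numbers from the talk.

```python
import math


def path_availability(service_availabilities):
    """Availability of a request that needs every service on its path.

    Assumes independent failures (a simplification): the request
    succeeds only if all services are up, so availabilities multiply.
    """
    return math.prod(service_availabilities)


# Hypothetical figures: each service is available 99.9% of the time.
for hops in (1, 5, 13):
    availability = path_availability([0.999] * hops)
    print(f"{hops:2d} services on the path: {availability:.4%}")
```

Under these assumptions every additional remote call on the path lowers the availability of the whole request, which is why the number of remote calls per request is worth minimizing.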

Slide 71

Slide 71 text

For good service design, look at the behavior first, not the data (see https://speakerdeck.com/ufried/getting-service-design-right for more details)

Slide 72

Slide 72 text

Be (extremely) wary of reusability • Reusability increases coupling • Reusability usually leads to bad service design • Reusability compromises availability • Reusability rarely pays • Do not strive for reusability in distributed systems • Strive for replaceability instead • Try to tackle reusability issues with libraries

Slide 73

Slide 73 text

Aiming for reusability in a distributed system compromises system availability and response times by design

Slide 74

Slide 74 text

The value of reuse unfolds inside a process boundary. Across process boundaries it should be avoided as it maximizes coupling (see https://speakerdeck.com/ufried/the-reusability-fallacy for more details)

Slide 75

Slide 75 text

Extending the options

Slide 76

Slide 76 text

Core Detect Treat Prevent Recover Mitigate Complement Isolation Failure unit Communication paradigm

Slide 77

Slide 77 text

Communication paradigm • Request-response <-> messaging <-> events <-> … • Heavily influences resilience patterns to be used • Also heavily influences functional failure unit design • Very fundamental decision which is often underestimated

Slide 78

Slide 78 text

Request/Response: Horizontal slicing (Flow / Process)
Event-driven: Vertical slicing (Flow / Process)

Slide 79

Slide 79 text

Request/Response vs. Event-driven
• Complexity: a draw …
• Coordination: built-in logic & orchestration (request/response) vs. event chains & choreography (event-driven)
• Decomposition: vertically, divide & conquer (request/response) vs. horizontally, go-with-the-flow (event-driven)
• Error handling: built-in error handling (request/response) vs. escalation/supervision strategy (event-driven)
• Separation of concerns: multiple responsibilities / service (request/response) vs. single responsibilities / service (event-driven)
• Transactions: built-in transaction handling (request/response) vs. external supervision (event-driven)
• Encapsulation: distributed domain logic / reuse (request/response) vs. local domain logic / replaceability (event-driven)

Slide 80

Slide 80 text

Workshop example: Order fulfillment

Slide 81

Slide 81 text

The request/response communication sample solution

Slide 82

Slide 82 text

Online Shop Warehouse System Coupon Management Campaign Management Account Service PayPal Loyalty Management Accounts Receivables Music Library E-Book Library E-Mail Server Order Fulfillment Service Coordinate Shipment Service Warehouse Coordinate Assets Notify Cust. Payment Provider Credit Card PayPal Payment Service Promotion Loyalty Coupon Coordinate Video Library Credit Card Provider
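As a rough illustration of the request/response style, the following sketch shows an order fulfillment service that coordinates other services itself. It is a strongly simplified assumption, not the workshop solution: only three of the services from the diagram appear, they are in-memory stand-ins, and all class and method names are hypothetical. The point to notice is that flow logic and error handling live inside the coordinating service.

```python
class PaymentError(Exception):
    """Raised by the hypothetical payment service when a charge fails."""


class PaymentService:
    def __init__(self, fail=False):
        self.fail = fail

    def charge(self, order_id):
        if self.fail:
            raise PaymentError(order_id)


class ShipmentService:
    def __init__(self):
        self.shipped = []

    def ship(self, order_id):
        self.shipped.append(order_id)


class NotificationService:
    def __init__(self):
        self.sent = []

    def notify(self, order_id, message):
        self.sent.append((order_id, message))


class OrderFulfillmentService:
    """Orchestrator: calls the other services in a fixed order."""

    def __init__(self, payment, shipment, notification):
        self.payment = payment
        self.shipment = shipment
        self.notification = notification

    def fulfill(self, order_id):
        try:
            self.payment.charge(order_id)
        except PaymentError:
            # Error handling is built into the coordinating service.
            self.notification.notify(order_id, "payment failed")
            return "rejected"
        self.shipment.ship(order_id)
        self.notification.notify(order_id, "order shipped")
        return "fulfilled"
```

In this style the orchestrator needs every called service to be reachable while it handles the request, which ties its availability to theirs.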

Slide 83

Slide 83 text

The event-based communication sample solution

Slide 84

Slide 84 text

[Diagram: event-based order fulfillment]
Events: Order confirmed, Payment authorized, Payment failed, Digital asset provisioned
Services: Credit Card Service, PayPal Service, Warehouse Service, Promotion Service, Bonus Card Service, Coupon Service, Music Library Service, Video Library Service, E-Book Library Service, Notification Service
Connected systems: Online Shop, Credit Card Provider, PayPal, Warehouse System, Coupon Management, Campaign Management, Account service, Loyalty Management, Accounts Receivables, Music Library, E-Book Library, Video Library, E-Mail Server
Order fulfillment supervisor • Track flow of events • Reschedule events in case of failure
Services are responsible for eventually succeeding or failing for good, usually incorporating a supervision/escalation hierarchy for that
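The event-based style can be sketched in a similar way. The example below is a toy under stated assumptions, not the workshop solution: the EventBus is an in-memory stand-in for a real broker, the two services are hypothetical, and the supervisor reschedules a failed payment simply by publishing “order confirmed” again, up to a fixed number of attempts, before it escalates. Only the event names are taken from the slide.

```python
from collections import defaultdict, deque

EVENTS = ("order_confirmed", "payment_authorized",
          "payment_failed", "digital_asset_provisioned")


class EventBus:
    """Minimal in-memory stand-in for a message broker (assumption)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order_id):
        self.queue.append((event_type, order_id))

    def run(self):
        # Deliver queued events until no service publishes anything new.
        while self.queue:
            event_type, order_id = self.queue.popleft()
            for handler in self.handlers[event_type]:
                handler(order_id)


class OrderFulfillmentSupervisor:
    """Tracks the flow of events and reschedules after a failure."""

    def __init__(self, bus, max_attempts=3):
        self.bus = bus
        self.max_attempts = max_attempts
        self.attempts = defaultdict(int)
        self.history = defaultdict(list)
        for event_type in EVENTS:
            bus.subscribe(
                event_type,
                lambda order_id, e=event_type: self.track(e, order_id))

    def track(self, event_type, order_id):
        self.history[order_id].append(event_type)
        if event_type == "payment_failed":
            self.attempts[order_id] += 1
            if self.attempts[order_id] < self.max_attempts:
                # Reschedule: replay the triggering event (assumed policy).
                self.bus.publish("order_confirmed", order_id)
            else:
                self.history[order_id].append("escalated")


def make_payment_service(bus, failures_before_success):
    """Hypothetical payment service that fails a fixed number of times."""
    remaining = {"failures": failures_before_success}

    def on_order_confirmed(order_id):
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            bus.publish("payment_failed", order_id)
        else:
            bus.publish("payment_authorized", order_id)

    return on_order_confirmed


def make_asset_service(bus):
    """Hypothetical service that provisions a digital asset after payment."""
    def on_payment_authorized(order_id):
        bus.publish("digital_asset_provisioned", order_id)

    return on_payment_authorized


def run_demo(failures_before_success):
    bus = EventBus()
    supervisor = OrderFulfillmentSupervisor(bus)
    bus.subscribe("order_confirmed",
                  make_payment_service(bus, failures_before_success))
    bus.subscribe("payment_authorized", make_asset_service(bus))
    bus.publish("order_confirmed", "order-1")
    bus.run()
    return supervisor.history["order-1"]
```

Calling run_demo(1) walks through one failed and one successful payment attempt, while run_demo(5) ends with the supervisor escalating. No service calls another service directly; each one only reacts to events and publishes new ones.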

Slide 85

Slide 85 text

The communication paradigm influences the functional service design a lot and the resilience patterns to be used, too

Slide 86

Slide 86 text

Do not limit your design options upfront without an important reason

Slide 87

Slide 87 text

Wrapping up

Slide 88

Slide 88 text

Wrapping up
• Today’s systems are distributed
• Failures are neither avoidable nor predictable
• Resilient software design needed
• Functional design of failure units (“services”) is
  • crucial for application robustness
  • poorly understood
  • massively underrated
  • very different from traditional design best practices
• Communication paradigms extend the design options

Slide 89

Slide 89 text

We must re-learn functional design for distributed systems!