
Resilient functional service design

This slide deck addresses the importance of proper functional design for creating resilient distributed systems (including, but not limited to, microservice-based systems).

It starts with a short introduction of how resilient software design, as discussed in this presentation, maps into the much broader topic of resilience, and why it is a topic that IT development departments can no longer afford to ignore.

Next, the slide deck explains the pitfall many developers fall into when getting started with resilience: quite often, the technical fault-tolerance implementation patterns are overrated, while the foundations, i.e., designing failure units (these days typically in the form of microservices) on a functional level and choosing the communication paradigm, are underrated.

Then, it is explained why this inverted prioritization is a problem: all the implementation patterns will not help if you do not get the functional design right, including the distribution of functionality across the failure units. Additionally, it is explained why the widespread design "best practices" do not help in this context but instead make things even worse.

The core observation of the slide deck is that we basically need to re-learn functional design if we want to create robust distributed systems, because the existing "best practices" that work great inside process boundaries are usually counterproductive across process boundaries.

The slide deck does not offer any silver bullets to solve the problem (and I believe there is no silver bullet). Instead, it offers a few guiding principles. Additionally, it shows how much the choice of communication paradigm influences the failure unit design, thereby creating more options for a good service design that also supports resilience on a functional level.

As always, this slide deck comes without the voice track, i.e., most of the information is probably missing. Still, I hope the slides on their own provide some helpful hints.

PS: This is an edited and updated version of an older presentation. The core message is the same, but quite a few details have been changed and updated.

Uwe Friedrichsen

July 29, 2021

Transcript

  1. Resilient functional service design The usually forgotten parts of resilient

    software design (updated 2021 edition) Uwe Friedrichsen – codecentric AG – 2015-2021
  2. Uwe Friedrichsen CTO @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/

  3. What is that “resilience” thing?

  4. re·sil·ience (rĭ-zĭl′yəns) n. 1. The ability to recover quickly from

    illness, change, or misfortune; buoyancy. 2. The property of a material that enables it to resume its original shape or position after being bent, stretched, or compressed; elasticity. American Heritage® Dictionary of the English Language, Fifth Edition. Copyright © 2016 by Houghton Mifflin Harcourt Publishing Company. Published by Houghton Mifflin Harcourt Publishing Company. All rights reserved. https://www.thefreedictionary.com/resilience
  5. What has that got to do with IT?

  6. Well, it depends on what you are looking at

  7. The comprehensive view 1

  8. Resilience is a huge topic with many aspects …

  9. … which can also be applied to IT

  10. Resilience in IT Impact area • Infrastructure • Application •

    IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …) Failure modes • Known failure modes • Unknown failure modes Value • Avoid adversity • Withstand adversity • Recover from adversity Types 1 • Rebound • Robustness • Graceful extensibility • Sustained adaptability 1 according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015) Levels 2 • Resilience • Adaptability • Transformability 2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004) Affected system • Technical (IT) • Socio-technical • Social (humans)
  11. Resilience in IT Impact area • Infrastructure • Application •

    IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …) Value • Avoid adversity • Withstand adversity • Recover from adversity Affected system • Technical (IT) • Socio-technical • Social (humans) Types 1 • Rebound • Robustness • Graceful extensibility • Sustained adaptability Levels 2 • Resilience • Adaptability • Transformability Failure modes • Known failure modes • Unknown failure modes 1 according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015) 2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004) Focus of this presentation
  12. [Diagram: resilience in IT as a matrix of failures (known/unknown)

    vs. affected system (technical/social). Fault-tolerant software design covers known failures in technical systems (focus of this presentation); Chaos Engineering covers unknown failures in technical systems; other types of resilience cover the remaining and increased impact area towards social systems]
  13. The systems perspective 2

  14. It is all about production!

  15. Business value is delivered by systems in production …

    but only if the systems are available
  16. What is the problem? Let us install our software on

    some HA hardware or infrastructure and everything will be fine.
  17. For a single, monolithic, isolated system this might indeed work,

    but …
  18. (Almost) every system is a distributed system. -- Chas Emerick

    http://www.infoq.com/presentations/problems-distributed-systems
  19. Distributed systems in a nutshell

  20. Everything fails, all the time. -- Werner Vogels

  21. Failures in distributed systems ... • Crash failure • Omission

    failure • Timing failure • Response failure • Byzantine failure
  22. ... lead to a variety of effects … • Lost

    messages • Incomplete messages • Duplicate messages • Distorted messages • Delayed messages • Out-of-order message arrival • Partial, out-of-sync local memory • ...
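As a concrete illustration of what these effects force upon application code: duplicate and redelivered messages mean that message handlers have to be idempotent (a pattern the deck names later). A minimal, illustrative sketch in plain Java, with hypothetical names, not taken from the deck:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative idempotent-consumer sketch: since duplicate and
    // redelivered messages are the normal case in distributed systems,
    // the handler remembers processed message ids and applies each
    // message's effect at most once.
    class IdempotentConsumer {
        private final Set<String> processed = ConcurrentHashMap.newKeySet();

        void onMessage(String messageId, Runnable effect) {
            // add() returns false if the id was seen before -> drop duplicate
            if (processed.add(messageId)) {
                effect.run();
            }
        }
    }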
  23. ... turning seemingly simple issues into very hard ones Time

    & Ordering Leslie Lamport "Time, clocks, and the ordering of events in distributed systems" Consensus Leslie Lamport ”The part-time parliament” (Paxos) CAP Eric A. Brewer "Towards robust distributed systems" Faulty processes Leslie Lamport, Robert Shostak, Marshall Pease "The Byzantine generals problem" Consensus Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson "Impossibility of distributed consensus with one faulty process” (FLP) Impossibility Nancy A. Lynch ”A hundred impossibility proofs for distributed computing"
  24. A distributed system is one in which the failure of

    a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport
  25. Failures in today's complex, distributed and interconnected systems are not

    the exception. • They are the normal case • They are not predictable
  26. … and it’s getting “worse” • Cloud-based systems • Service-based

    architectures • Zero Downtime • Mobile & IoT • Social Web → Ever-increasing complexity and connectivity
  27. Do not try to avoid failures Embrace them

  28. re·sil·ience (of IT systems) n. The ability of a system

    to handle unexpected situations • without the user noticing it (ideal case) • with a graceful degradation of service (non-ideal case) The cautious attempt to provide a useful definition for resilience in the context of software systems. No copyright attached, but also no guarantee that this definition is sufficient for all relevant purposes.
  29. Typical measures • High-availability (HA) hardware/software • Only applicable for

    very small installations • Usually not available in cloud environments • Delegate failure handling to infrastructure level • Partial relief, will not solve all problems • Implement resilient software design patterns • Very important, still will not fix a bad design • Minimize number of predetermined breaking points • Minimize problem surface by design
  30. Typical measures • High-availability (HA) hardware/software • Only applicable for

    very small installations • Usually not available in cloud environments • Delegate failure handling to infrastructure level • Partial relief, will not solve all problems • Implement resilient software design patterns • Very important, still will not fix a bad design • Minimize number of predetermined breaking points • Minimize problem surface by design Focus of this presentation
  31. Designing for resilience …

  32. First you learn about resilience …

  33. Core Detect Treat Prevent Recover Mitigate Complement

  34. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy

    Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement
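To give a flavor of what the node-level patterns in this overview look like in code, here is a minimal, illustrative sketch of two of them, timeout and circuit breaker, in plain Java. All names are hypothetical and this is a simplification, not a production implementation:

    import java.util.concurrent.*;
    import java.util.function.Supplier;

    // Simplified circuit breaker with a per-call timeout. After too many
    // consecutive failures it "opens" and fails fast via the fallback,
    // then probes the remote call again once the open period has passed.
    public class CircuitBreaker {
        private enum State { CLOSED, OPEN }

        private final int failureThreshold;
        private final long openMillis;     // how long to fail fast when open
        private final long timeoutMillis;  // per-call timeout
        private final ExecutorService executor = Executors.newCachedThreadPool();

        private State state = State.CLOSED;
        private int failures = 0;
        private long openedAt = 0;

        public CircuitBreaker(int failureThreshold, long openMillis, long timeoutMillis) {
            this.failureThreshold = failureThreshold;
            this.openMillis = openMillis;
            this.timeoutMillis = timeoutMillis;
        }

        public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
            if (state == State.OPEN) {
                if (System.currentTimeMillis() - openedAt < openMillis) {
                    return fallback.get();            // fail fast while open
                }
                state = State.CLOSED;                 // half-open: probe again
            }
            Future<T> future = executor.submit(remoteCall::get);
            try {
                T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);  // timeout
                failures = 0;
                return result;
            } catch (Exception e) {                   // timeout or remote failure
                future.cancel(true);
                if (++failures >= failureThreshold) {
                    state = State.OPEN;               // trip the breaker
                    openedAt = System.currentTimeMillis();
                }
                return fallback.get();                // mitigate via fallback
            }
        }
    }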
  35. … then you digest the stuff you have learned

  36. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy

    Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement Cool! Give me more! More cool stuff Nice! Sounds like theory Boring! One-off? Whatever! Well … Later! Well … Maybe!
  37. Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as

    perceived by engineer Actual relevance for system robustness
  38. Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as

    perceived by engineer Actual relevance for system robustness Ye be warned! If you do not get this part right, nothing else matters ‘ere be dragons! This part is extremely hard and poorly understood
  39. We have a problem!

  40. Let us take a closer look at the core parts

  41. Core Detect Treat Prevent Recover Mitigate Complement

  42. Core Detect Treat Prevent Recover Mitigate Complement Isolation

  43. Isolation • System must not fail as a whole •

    Split the system into parts and isolate the parts against each other • Avoid cascading failures • Foundation of resilient software design
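To make the isolation idea more tangible: a common technical means of isolation inside a process is the bulkhead, giving each downstream dependency its own bounded resource pool. A minimal, illustrative sketch in plain Java (hypothetical names and pool sizes, not from the deck):

    import java.util.concurrent.*;

    // Illustrative bulkhead sketch: each downstream dependency gets its
    // own small, bounded thread pool. If the warehouse system hangs,
    // only the warehouse pool fills up; payment calls still get threads,
    // so the failure cannot cascade through the whole process.
    public class Bulkheads {
        private final ExecutorService paymentPool   = Executors.newFixedThreadPool(4);
        private final ExecutorService warehousePool = Executors.newFixedThreadPool(4);

        public Future<String> authorizePayment(String orderId) {
            return paymentPool.submit(() -> callPaymentService(orderId));
        }

        public Future<String> reserveItems(String orderId) {
            return warehousePool.submit(() -> callWarehouseService(orderId));
        }

        // Placeholders for the actual remote calls (HTTP/RPC in a real system).
        private String callPaymentService(String orderId)   { return "payment-ok:" + orderId; }
        private String callWarehouseService(String orderId) { return "reserved:" + orderId; }
    }

Note that such technical bulkheads only limit the blast radius; which functionality ends up inside which failure unit remains a pure design decision, as the following slides show.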
  44. Core Detect Treat Prevent Recover Mitigate Complement Isolation Failure unit

  45. Failure unit • Core isolation pattern • a.k.a. “units of

    mitigation” • Functional slices, isolated via bulkheads • Diverse implementation choices available • (Micro-)service, actor, CSP, SCS, ... • Choice impacts system and resilience design a lot • Shaping good failure units is extremely hard • Pure design issue
  46. Really? Sounds easy! Where is the problem?

  47. How do we avoid this …

    [Diagram: a client request hits Service A, which calls Service B] Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design
  48. … and this …

    [Diagram: a client request fans out to a dozen services] Due to functional design we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design
  49. … without ending up with this?

    [Diagram: one big Mothership Service (a.k.a. Monolith) handling the request] By trying to avoid the aforementioned issues, we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design
  50. Let us apply the well-known design best practices!

  51. Well-known design best practices • Divide & conquer a.k.a. functional

    decomposition • DRY (Don’t Repeat Yourself) • Design for reusability • Layered architecture • …
  52. Unfortunately …

  53. … this usually leads to this …

    [Diagram: a client request hits Service A, which calls Service B] Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design
  54. … and this …

    [Diagram: a client request fans out to a dozen services] Due to functional design we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design
  55. … and in the end also often to this.

    [Diagram: one big Mothership Service (a.k.a. Monolith) handling the request] By trying to avoid the aforementioned issues, we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design
  56. Welcome to distributed hell!

  57. Caches to the rescue!

  58. [Diagram: the Service A → Service B setup from before, now with

    a "Cache of B" inside Service A] Break tight service coupling by caching data/responses of the downstream service
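For illustration, a minimal sketch (hypothetical names, plain Java) of the cache-of-B workaround this slide describes: Service A keeps the last successful response from Service B and serves that possibly stale copy when B is unavailable. The next slide explains why this does not fix the underlying design problem:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative cache-of-B sketch: keep the last known good response
    // per key and fall back to it when Service B is down. Availability
    // improves, but the data served may be arbitrarily stale.
    public class CachedServiceBClient {
        private final Map<String, String> lastKnownGood = new ConcurrentHashMap<>();

        public String getFromB(String key) {
            try {
                String fresh = callServiceB(key);       // remote call that may fail
                lastKnownGood.put(key, fresh);
                return fresh;
            } catch (RuntimeException serviceBDown) {
                String stale = lastKnownGood.get(key);
                if (stale == null) throw serviceBDown;  // nothing cached yet
                return stale;                           // serve stale data
            }
        }

        // Placeholder for the actual remote call to Service B.
        private String callServiceB(String key) { return "value-of-" + key; }
    }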
  59. Do you really think that copying stale data all over

    your system is a suitable measure to fix an inherently broken design? * * Caches are a great measure in several places, but they do not make up for a messed-up design
  60. Sorry, not that easy

  61. We must re-learn functional design for distributed systems!

  62. Okay, where is the silver bullet?

  63. Sorry, out of stock * * as always

  64. Maybe some 5-minute crash course, some checklist, …?

  65. Sorry, not that easy

  66. It is a lot of hard work …

  67. … and there is no shortcut

  68. Yet, a few guiding thoughts regarding bulkhead design * *

    not to be confused with silver bullets – it will still be hard work
  69. Foundations of design • “High cohesion, low coupling” & “separation

    of concerns” • Crucial across process boundaries • Still poorly understood issue • Start with • Understanding organizational boundaries • Understanding use cases and flows • Identifying functional domains (→ DDD) • Finding areas that change independently • Do not start with a data model!
  70. Short activation paths • Long activation paths affect availability •

    Increase likelihood of failures • Minimize remote calls per request • Need to balance opposing forces • Avoid monolith → clear separation of concerns • Minimize requests → cluster functionality & data • Caches can sometimes help, but stale data as trade-off
  71. For good service design, look at the behavior first, not

    the data (see https://speakerdeck.com/ufried/getting-service-design-right for more details)
  72. Be (extremely) wary of reusability • Reusability increases coupling •

    Reusability usually leads to bad service design • Reusability compromises availability • Reusability rarely pays • Do not strive for reusability in distributed systems • Strive for replaceability instead • Try to tackle reusability issues with libraries
  73. Aiming for reusability in a distributed system compromises system availability

    and response times by design
  74. The value of reuse unfolds inside a process boundary. Across

    process boundaries it should be avoided as it maximizes coupling (see https://speakerdeck.com/ufried/the-reusability-fallacy for more details)
  75. Extending the options

  76. Core Detect Treat Prevent Recover Mitigate Complement Isolation Failure unit

    Communication paradigm
  77. Communication paradigm • Request-response <-> messaging <-> events <-> …

    • Heavily influences resilience patterns to be used • Also heavily influences functional failure unit design • Very fundamental decision which is often underestimated
  78. Request/Response : Horizontal slicing Flow / Process Event-driven : Vertical

    slicing Flow / Process
  79. Request/Response vs. Event-driven

    • Complexity: a draw … • Coordination: built-in logic & orchestration vs. event chains & choreography • Decomposition: vertically (divide & conquer) vs. horizontally (go-with-the-flow) • Error handling: built-in error handling vs. escalation/supervision strategy • Separation of concerns: multiple responsibilities per service vs. single responsibility per service • Transactions: built-in transaction handling vs. external supervision • Encapsulation: distributed domain logic / reuse vs. local domain logic / replaceability
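To make the contrast concrete, a minimal, illustrative sketch (hypothetical names, plain Java) of the same step in both paradigms: in the request/response variant the order service orchestrates the payment call and must handle its failure inline; in the event-driven variant it only publishes an event, and the payment service reacts whenever it is able to:

    // Request/response: the caller orchestrates and has to handle the
    // downstream failure right here (timeout, retry, fallback, ...).
    class OrderServiceRpc {
        private final PaymentClient payment;

        OrderServiceRpc(PaymentClient payment) { this.payment = payment; }

        void placeOrder(String orderId) {
            payment.authorize(orderId);  // synchronous call; blocks and may fail
            // ... only now can the order be confirmed
        }
    }

    interface PaymentClient { void authorize(String orderId); }

    // Event-driven: the order service records the fact and publishes it.
    // The payment service subscribes and reacts when it can; temporal
    // coupling is relaxed, at the price of choreography and supervision
    // living elsewhere (see the following slides).
    class OrderServiceEvents {
        private final EventBus bus;

        OrderServiceEvents(EventBus bus) { this.bus = bus; }

        void placeOrder(String orderId) {
            bus.publish(new OrderConfirmed(orderId));
        }
    }

    record OrderConfirmed(String orderId) {}
    interface EventBus { void publish(Object event); }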
  80. Workshop example: Order fulfillment

  81. The request/response communication sample solution

  82. [Diagram: the request/response sample solution. An Order Fulfillment

    Service coordinates services for payment (credit card, PayPal), shipment, digital assets (music, video, e-book libraries), promotions (loyalty, coupon) and customer notification, integrating own services such as the Account Service with foreign services such as the Online Shop, Credit Card Provider, PayPal, Warehouse System, Coupon/Campaign/Loyalty Management, Accounts Receivables and E-Mail Server]
  83. The event-based communication sample solution

  84. [Diagram: the event-based sample solution. Own services (Account

    Service, Credit Card Service, PayPal Service, Warehouse Service, Promotion Service, Bonus Card Service, Coupon Service, Music/Video/E-Book Library Services, Notification Service) react to events such as "Order confirmed", "Payment authorized", "Payment failed" and "Digital asset provisioned", integrating foreign services such as the Online Shop, Credit Card Provider, PayPal, Warehouse System, Coupon/Campaign/Loyalty Management, Accounts Receivables and E-Mail Server] An order fulfillment supervisor tracks the flow of events and reschedules events in case of failure. Services are responsible for eventually succeeding or failing for good, usually incorporating a supervision/escalation hierarchy for that.
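A minimal, illustrative sketch (hypothetical names, plain Java, not the deck's implementation) of the supervisor idea annotated on this slide: it tracks which follow-up event each published event should produce and republishes the original event if the follow-up does not arrive in time:

    import java.util.Map;
    import java.util.concurrent.*;

    // Illustrative order-fulfillment supervisor: tracks the flow of events
    // and reschedules an event if the expected follow-up does not arrive
    // within a deadline. Simplified: a real supervisor would limit retries
    // and eventually escalate instead of re-arming forever.
    class FulfillmentSupervisor {
        interface EventBus { void publish(Object event); }

        private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        private final Map<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
        private final EventBus bus;

        FulfillmentSupervisor(EventBus bus) { this.bus = bus; }

        // Called when an event should trigger a follow-up, e.g. "Order
        // confirmed" should eventually lead to "Payment authorized".
        void expectFollowUp(String correlationId, Object originalEvent, long timeoutSeconds) {
            ScheduledFuture<?> timeout = scheduler.schedule(() -> {
                bus.publish(originalEvent);                                    // republish ...
                expectFollowUp(correlationId, originalEvent, timeoutSeconds);  // ... and wait again
            }, timeoutSeconds, TimeUnit.SECONDS);
            pending.put(correlationId, timeout);
        }

        // Called when the expected follow-up event arrives in time.
        void followUpArrived(String correlationId) {
            ScheduledFuture<?> timeout = pending.remove(correlationId);
            if (timeout != null) timeout.cancel(false);
        }
    }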
  85. The communication paradigm influences the functional service design a lot

    and the resilience patterns to be used, too
  86. Do not limit your design options upfront without an important

    reason
  87. Wrapping up

  88. Wrapping up • Today’s systems are distributed • Failures are

    neither avoidable nor predictable • Resilient software design needed • Functional design of failure units (“services”) is • crucial for application robustness • poorly understood • massively underrated • very different from traditional design best practices • Communication paradigms extend the design options
  89. We must re-learn functional design for distributed systems!

  90. Uwe Friedrichsen CTO @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/