
Resilient functional service design

This slide deck addresses the importance of proper functional design for creating resilient, distributed systems (including, but not limited to, microservice-based systems).

It starts with a short introduction to how resilient software design, as discussed in this presentation, maps to the much broader topic of resilience, and why resilient software design is a relevant topic that IT development departments can no longer afford to ignore.

Next, the slide deck explains the pitfall many developers fall into when getting started with resilience: quite often, the technical fault-tolerance implementation patterns are overrated, while the foundations, i.e., designing failure units (these days typically in the form of microservices) on a functional level and choosing the communication paradigm, are underrated.

Then, it explains why this inverted prioritization is a problem: none of the implementation patterns help if you do not get the functional design right, including the distribution of functionality across the failure units. It also explains why the widespread design "best practices" do not help in this context, but instead make things even worse.

The core observation of the slide deck is that we basically need to re-learn functional design if we want to create robust distributed systems because the existing "best practices" that are great inside process boundaries are usually counterproductive across process boundaries.

The slide deck does not offer any silver bullets to solve the problem (and I believe there is no silver bullet). Instead, it offers a few guiding principles. Additionally, it shows how much the choice of communication paradigm influences the failure unit design, which creates more options for a good service design that also supports resilience on a functional level.
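To make the influence of the communication paradigm a little more concrete, here is a minimal sketch of the event-driven style with a supervisor, loosely modeled on the "Order confirmed" and "Payment authorized" events that appear in the transcript. It is not taken from the deck: `EventBus`, `PaymentService`, and `Supervisor` are hypothetical names, the bus is a toy in-process stand-in for a real message broker, and an event published while the payment service is down is simply lost in this simplified model.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process stand-in for a message broker (hypothetical, not a real library)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, order_id):
        for handler in list(self._handlers[event_type]):
            handler(order_id)


class PaymentService:
    """Reacts to 'order_confirmed'; the shop never calls it directly."""

    def __init__(self, bus):
        self.up = True
        self._bus = bus
        bus.subscribe("order_confirmed", self.on_order_confirmed)

    def on_order_confirmed(self, order_id):
        if self.up:  # while down, the event is simply lost in this toy model
            self._bus.publish("payment_authorized", order_id)


class Supervisor:
    """Tracks the flow of events and reschedules orders that got stuck."""

    def __init__(self, bus):
        self._bus = bus
        self._open = set()
        bus.subscribe("order_confirmed", self._open.add)
        bus.subscribe("payment_authorized", self._open.discard)

    def open_orders(self):
        return sorted(self._open)

    def reschedule(self):
        # Iterate over a sorted copy, because handlers modify the set.
        for order_id in self.open_orders():
            self._bus.publish("order_confirmed", order_id)


bus = EventBus()
supervisor = Supervisor(bus)    # subscribes first, so it sees every order
payments = PaymentService(bus)

payments.up = False
bus.publish("order_confirmed", "order-1")  # the shop publishes and moves on
print(supervisor.open_orders())            # the order is still open

payments.up = True
supervisor.reschedule()                    # supervisor re-emits the stuck event
print(supervisor.open_orders())            # the order is no longer open
```

In this sketch the shop does not depend on the payment service being available at the moment the order is confirmed. The failure is handled by the supervisor later, which is one way the communication paradigm can widen the options for cutting failure units.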

As always, this slide deck comes without the voice track, i.e., most of the information is probably missing. But I hope that the slides on their own also provide some helpful hints.

PS: This is an edited and updated version of an older presentation. The core message is the same, but quite a few details were changed and updated.

Uwe Friedrichsen

July 29, 2021

Transcript

  1. Resilient functional service design The usually forgotten parts of resilient

    software design (updated 2021 edition) Uwe Friedrichsen – codecentric AG – 2015-2021
  2. re·sil·ience (rĭ-zĭl′yəns) n. 1. The ability to recover quickly from

    illness, change, or misfortune; buoyancy. 2. The property of a material that enables it to resume its original shape or position after being bent, stretched, or compressed; elasticity. American Heritage® Dictionary of the English Language, Fifth Edition. Copyright © 2016 by Houghton Mifflin Harcourt Publishing Company. Published by Houghton Mifflin Harcourt Publishing Company. All rights reserved. https://www.thefreedictionary.com/resilience
  3. Resilience in IT Impact area • Infrastructure • Application •

    IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …) Failure modes • Known failure modes • Unknown failure modes Value • Avoid adversity • Withstand adversity • Recover from adversity Types 1 • Rebound • Robustness • Graceful extensibility • Sustained adaptability 1 according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015) Levels 2 • Resilience • Adaptability • Transformability 2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004) Affected system • Technical (IT) • Socio-technical • Social (humans)
  4. Resilience in IT Impact area • Infrastructure • Application •

    IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …) Value • Avoid adversity • Withstand adversity • Recover from adversity Affected system • Technical (IT) • Socio-technical • Social (humans) Types 1 • Rebound • Robustness • Graceful extensibility • Sustained adaptability Levels 2 • Resilience • Adaptability • Transformability Failure modes • Known failure modes • Unknown failure modes 1 according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015) 2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004) Focus of this presentation
  5. Known Unknown System Technical Social Failures Resilience in IT Chaos

    Engineering Other types of resilience and increased impact area Fault-tolerant software design Focus of this presentation
  6. What is the problem? Let us install our software on

    some HA hardware or infrastructure and everything will be fine.
  7. (Almost) every system is a distributed system. -- Chas Emerick

    http://www.infoq.com/presentations/problems-distributed-systems
  8. Failures in distributed systems ... • Crash failure • Omission

    failure • Timing failure • Response failure • Byzantine failure
  9. ... lead to a variety of effects … • Lost

    messages • Incomplete messages • Duplicate messages • Distorted messages • Delayed messages • Out-of-order message arrival • Partial, out-of-sync local memory • ...
  10. ... turning seemingly simple issues into very hard ones

    • Time & Ordering: Leslie Lamport, "Time, clocks, and the ordering of events in distributed systems"
    • Consensus: Leslie Lamport, "The part-time parliament" (Paxos)
    • CAP: Eric A. Brewer, "Towards robust distributed systems"
    • Faulty processes: Leslie Lamport, Robert Shostak, Marshall Pease, "The Byzantine generals problem"
    • Consensus: Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson, "Impossibility of distributed consensus with one faulty process" (FLP)
    • Impossibility: Nancy A. Lynch, "A hundred impossibility proofs for distributed computing"
  11. A distributed system is one in which the failure of

    a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport
  12. Failures in today's complex, distributed and interconnected systems are not

    the exception. • They are the normal case • They are not predictable
  13. … and it’s getting “worse” • Cloud-based systems • Service-based

    architectures • Zero Downtime • Mobile & IoT • Social Web → Ever-increasing complexity and connectivity
  14. re·sil·ience (of IT systems) n. The ability of a system

    to handle unexpected situations • without the user noticing it (ideal case) • with a graceful degradation of service (non-ideal case) The cautious attempt to provide a useful definition for resilience in the context of software systems. No copyright attached, but also no guarantee that this definition is sufficient for all relevant purposes.
  15. Typical measures • High-availability (HA) hardware/software • Only applicable for

    very small installations • Usually not available in cloud environments • Delegate failure handling to infrastructure level • Partial relief, will not solve all problems • Implement resilient software design patterns • Very important, still will not fix a bad design • Minimize number of predetermined breaking points • Minimize problem surface by design
  16. Typical measures • High-availability (HA) hardware/software • Only applicable for

    very small installations • Usually not available in cloud environments • Delegate failure handling to infrastructure level • Partial relief, will not solve all problems • Implement resilient software design patterns • Very important, still will not fix a bad design • Minimize number of predetermined breaking points • Minimize problem surface by design Focus of this presentation
  17. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy

    Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement
  18. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy

    Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement Cool! Give me more! More cool stuff Nice! Sounds like theory Boring! One-off? Whatever! Well … Later! Well … Maybe!
  19. Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as

    perceived by engineer Actual relevance for system robustness
  20. Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as

    perceived by engineer Actual relevance for system robustness Ye be warned! If you do not get this part right, nothing else matters ‘ere be dragons! This part is extremely hard and poorly understood
  21. Isolation • System must not fail as a whole •

    Split system into parts and isolate the parts from each other • Avoid cascading failures • Foundation of resilient software design
  22. Failure unit • Core isolation pattern • a.k.a. “units of

    mitigation” • Functional slices, isolated via bulkheads • Diverse implementation choices available • (Micro-)service, actor, CSP, SCS, ... • Choice impacts system and resilience design a lot • Shaping good failure units is extremely hard • Pure design issue
  23. Service A Service B Request Due to functional design, Service

    A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design How do we avoid this …
  24. … and this … Service Request Due to functional design

    we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design Service Service Service Service Service Service Service Service Service Service Service Service
  25. Mothership Service (a.k.a. Monolith) Request By trying to avoid the

    aforementioned issues, we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design … without ending up with this?
  26. Well-known design best practices • Divide & conquer a.k.a. functional

    decomposition • DRY (Don’t Repeat Yourself) • Design for reusability • Layered architecture • …
  27. Service A Service B Request Due to functional design, Service

    A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design … this usually leads to this …
  28. … and this … Service Request Due to functional design

    we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design Service Service Service Service Service Service Service Service Service Service Service Service
  29. Mothership Service (a.k.a. Monolith) Request By trying to avoid the

    aforementioned issues, we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design … and in the end also often to this.
  30. Service A Service B Request Due to functional design, Service

    A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design Cache of B Break tight service coupling by caching data/responses of downstream service
  31. Do you really think that copying stale data all over

    your system is a suitable measure to fix an inherently broken design? * * Caches are a great measure in several places, but they do not make up for a messed-up design
  32. Yet, a few guiding thoughts regarding bulkhead design * *

    not to be confused with silver bullets – it will still be hard work
  33. Foundations of design • “High cohesion, low coupling” & “separation

    of concerns” • Crucial across process boundaries • Still poorly understood issue • Start with • Understanding organizational boundaries • Understanding use cases and flows • Identifying functional domains (→ DDD) • Finding areas that change independently • Do not start with a data model!
  34. Short activation paths • Long activation paths affect availability •

    Increase likelihood of failures • Minimize remote calls per request • Need to balance opposing forces • Avoid monolith → clear separation of concerns • Minimize requests → cluster functionality & data • Caches can sometimes help, but stale data as trade-off
  35. For good service design, look at the behavior first, not

    the data (see https://speakerdeck.com/ufried/getting-service-design-right for more details)
  36. Be (extremely) wary of reusability • Reusability increases coupling •

    Reusability usually leads to bad service design • Reusability compromises availability • Reusability rarely pays • Do not strive for reusability in distributed systems • Strive for replaceability instead • Try to tackle reusability issues with libraries
  37. The value of reuse unfolds inside a process boundary. Across

    process boundaries it should be avoided as it maximizes coupling (see https://speakerdeck.com/ufried/the-reusability-fallacy for more details)
  38. Communication paradigm • Request-response <-> messaging <-> events <-> …

    • Heavily influences resilience patterns to be used • Also heavily influences functional failure unit design • Very fundamental decision which is often underestimated
  39. Request/Response vs. Event-driven (each entry: aspect, then request/response vs. event-driven)

    • Complexity: a draw …
    • Coordination: built-in logic & orchestration vs. event chains & choreography
    • Decomposition: vertically (divide & conquer) vs. horizontally (go-with-the-flow)
    • Error handling: built-in error handling vs. escalation/supervision strategy
    • Separation of concerns: multiple responsibilities / service vs. single responsibilities / service
    • Transactions: built-in transaction handling vs. external supervision
    • Encapsulation: distributed domain logic / reuse vs. local domain logic / replaceability
  40. Online Shop Warehouse System <Foreign Service> <Own Service> Coupon Management

    Campaign Management Account Service PayPal Loyalty Management Accounts Receivables Music Library E-Book Library E-Mail Server Order Fulfillment Service Coordinate Shipment Service Warehouse Coordinate Assets Notify Cust. Payment Provider Credit Card PayPal Payment Service Promotion Loyalty Coupon Coordinate Video Library Credit Card Provider
  41. Order confirmed Online Shop Credit Card Provider Warehouse System <Foreign

    Service> <Own Service> Coupon Management Campaign Management Account service Credit Card Service Loyalty Management Accounts Receivables Music Library E-Book Library Video Library E-Mail Server PayPal PayPal Service Warehouse Service Promotion Service Bonus Card Service Coupon Service Music Library Service Video Library Service E-Book Library Service Notification Service Payment authorized Digital asset provisioned Payment failed <Event> Order fulfillment supervisor Track flow of events Reschedule events in case of failure Services are responsible to eventually succeed or fail for good, usually incorporating a supervision/escalation hierarchy for that
  42. Wrapping up • Today’s systems are distributed • Failures are

    neither avoidable nor predictable • Resilient software design needed • Functional design of failure units (“services”) is • crucial for application robustness • poorly understood • massively underrated • very different from traditional design best practices • Communication paradigms extend the design options
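
The claim on slides 24 and 34 that long activation paths hurt availability can be illustrated with a little arithmetic. The following is a minimal sketch, not taken from the deck: it assumes that a request succeeds only if every service on its path responds, that services fail independently, and that each one is available 99.9% of the time. The function name and all numbers are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request that needs every listed service to respond.

    Assumes independent failures, which real systems rarely guarantee.
    """
    return prod(availabilities)


# Hypothetical figure: every service is available 99.9% of the time.
single = serial_availability([0.999])
chain_of_12 = serial_availability([0.999] * 12)

print(f"1 service:   {single:.4%}")
print(f"12 services: {chain_of_12:.4%}")
```

Under these assumptions, a request that touches twelve services is available roughly 98.8% of the time, compared with 99.9% for a single service. That is about twelve times the expected unavailability, and no implementation pattern was involved: the loss comes from the functional design alone.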