
Resilient functional service design

This slide deck addresses the importance of proper functional design for creating resilient distributed systems (including, but not limited to, microservice-based systems).

It starts with a short introduction of how resilient software design, as discussed in this presentation, maps into the much broader topic of resilience, and why it is a topic that IT development departments can no longer afford to ignore.

Next, the slide deck explains the pitfall many developers fall into when getting started with resilience: quite often, the technical fault-tolerance implementation patterns are overrated, while the foundations, i.e., designing failure units (these days typically in the form of microservices) on a functional level and choosing the communication paradigm, are underrated.

Then, it is explained why this inverted prioritization is a problem: all the implementation patterns will not help if you do not get the functional design right, including the distribution of functionality across the failure units. Additionally, it is explained why the widespread design "best practices" do not help in this context but instead make things even worse.

The core observation of the slide deck is that we basically need to re-learn functional design if we want to create robust distributed systems, because the existing "best practices" that work great inside process boundaries are usually counterproductive across process boundaries.

The slide deck does not offer any silver bullets to solve the problem (and I believe there is no silver bullet). Instead, it offers a few guiding principles. Additionally, it shows how much the choice of communication paradigm influences the failure unit design, thereby creating more options for a good service design that also supports resilience on a functional level.

As always, this slide deck comes without the voice track, i.e., most of the information is probably missing. Still, I hope the slides on their own provide some helpful hints.

PS: This is an edited and updated version of an older presentation. The core message is the same, but quite a few details have been changed and updated.

Uwe Friedrichsen

July 29, 2021

Transcript

  1. Resilient functional service design The usually forgotten parts of resilient

    software design (updated 2021 edition) Uwe Friedrichsen – codecentric AG – 2015-2021
  2. Uwe Friedrichsen CTO @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/

  3. What is that “resilience” thing?

  4. re·sil·ience (rĭ-zĭl′yəns) n. 1. The ability to recover quickly from

    illness, change, or misfortune; buoyancy. 2. The property of a material that enables it to resume its original shape or position after being bent, stretched, or compressed; elasticity. American Heritage® Dictionary of the English Language, Fifth Edition. Copyright © 2016 by Houghton Mifflin Harcourt Publishing Company. Published by Houghton Mifflin Harcourt Publishing Company. All rights reserved. https://www.thefreedictionary.com/resilience
  5. What has that got to do with IT?

  6. Well, it depends on what you are looking at

  7. The comprehensive view 1

  8. Resilience is a huge topic with many aspects …

  9. … which can also be applied to IT

  10. Resilience in IT Impact area • Infrastructure • Application •

    IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …) Failure modes • Known failure modes • Unknown failure modes Value • Avoid adversity • Withstand adversity • Recover from adversity Types 1 • Rebound • Robustness • Graceful extensibility • Sustained adaptability 1 according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015) Levels 2 • Resilience • Adaptability • Transformability 2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004) Affected system • Technical (IT) • Socio-technical • Social (humans)
  11. Resilience in IT Impact area • Infrastructure • Application •

    IT organization • Product organization • Company • Corporate • Beyond corp (people, society, ecology, …) Value • Avoid adversity • Withstand adversity • Recover from adversity Affected system • Technical (IT) • Socio-technical • Social (humans) Types 1 • Rebound • Robustness • Graceful extensibility • Sustained adaptability Levels 2 • Resilience • Adaptability • Transformability Failure modes • Known failure modes • Unknown failure modes 1 according to David D. Woods, “Four concepts for resilience and the implications for the future of resilience engineering”, Reliability Engineering and System Safety (2015) 2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig, “Resilience, adaptability and transformability in social–ecological systems”, Ecology and Society 9(2): 5 (2004) Focus of this presentation
  12. [Diagram: resilience in IT as a matrix of failures (known/unknown)

    vs. affected system (technical/social). Fault-tolerant software design covers known failures in technical systems (focus of this presentation); Chaos Engineering covers unknown failures in technical systems; other types of resilience cover the remaining and increased impact area towards social systems]
  13. The systems perspective 2

  14. It is all about production!

  15. Business value is delivered by systems in production …

    but only if the systems are available
  16. What is the problem? Let us install our software on

    some HA hardware or infrastructure and everything will be fine.
  17. For a single, monolithic, isolated system this might indeed work,

    but …
  18. (Almost) every system is a distributed system. -- Chas Emerick

    http://www.infoq.com/presentations/problems-distributed-systems
  19. Distributed systems in a nutshell

  20. Everything fails, all the time. -- Werner Vogels

  21. Failures in distributed systems ... • Crash failure • Omission

    failure • Timing failure • Response failure • Byzantine failure
  22. ... lead to a variety of effects … • Lost

    messages • Incomplete messages • Duplicate messages • Distorted messages • Delayed messages • Out-of-order message arrival • Partial, out-of-sync local memory • ...
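As a concrete illustration of what these effects force upon application code: duplicate and redelivered messages mean that message handlers have to be idempotent (a pattern the deck names later). A minimal, illustrative sketch in plain Java, with hypothetical names, not taken from the deck:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative idempotent-consumer sketch: since duplicate and
    // redelivered messages are the normal case in distributed systems,
    // the handler remembers processed message ids and applies each
    // message's effect at most once.
    class IdempotentConsumer {
        private final Set<String> processed = ConcurrentHashMap.newKeySet();

        void onMessage(String messageId, Runnable effect) {
            // add() returns false if the id was seen before -> drop duplicate
            if (processed.add(messageId)) {
                effect.run();
            }
        }
    }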
  23. ... turning seemingly simple issues into very hard ones Time

    & Ordering Leslie Lamport "Time, clocks, and the ordering of events in distributed systems" Consensus Leslie Lamport ”The part-time parliament” (Paxos) CAP Eric A. Brewer "Towards robust distributed systems" Faulty processes Leslie Lamport, Robert Shostak, Marshall Pease "The Byzantine generals problem" Consensus Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson "Impossibility of distributed consensus with one faulty process” (FLP) Impossibility Nancy A. Lynch ”A hundred impossibility proofs for distributed computing"
  24. A distributed system is one in which the failure of

    a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport
  25. Failures in today's complex, distributed and interconnected systems are not

    the exception. • They are the normal case • They are not predictable
  26. … and it’s getting “worse” • Cloud-based systems • Service-based

    architectures • Zero Downtime • Mobile & IoT • Social Web → Ever-increasing complexity and connectivity
  27. Do not try to avoid failures Embrace them

  28. re·sil·ience (of IT systems) n. The ability of a system

    to handle unexpected situations • without the user noticing it (ideal case) • with a graceful degradation of service (non-ideal case) The cautious attempt to provide a useful definition for resilience in the context of software systems. No copyright attached, but also no guarantee that this definition is sufficient for all relevant purposes.
  29. Typical measures • High-availability (HA) hardware/software • Only applicable for

    very small installations • Usually not available in cloud environments • Delegate failure handling to infrastructure level • Partial relief, will not solve all problems • Implement resilient software design patterns • Very important, still will not fix a bad design • Minimize number of predetermined breaking points • Minimize problem surface by design
  30. Typical measures • High-availability (HA) hardware/software • Only applicable for

    very small installations • Usually not available in cloud environments • Delegate failure handling to infrastructure level • Partial relief, will not solve all problems • Implement resilient software design patterns • Very important, still will not fix a bad design • Minimize number of predetermined breaking points • Minimize problem surface by design Focus of this presentation
  31. Designing for resilience …

  32. First you learn about resilience …

  33. Core Detect Treat Prevent Recover Mitigate Complement

  34. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy

    Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement
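To give a flavor of what the node-level patterns in this overview look like in code, here is a minimal, illustrative sketch of two of them, timeout and circuit breaker, in plain Java. All names are hypothetical and this is a simplification, not a production implementation:

    import java.util.concurrent.*;
    import java.util.function.Supplier;

    // Simplified circuit breaker with a per-call timeout. After too many
    // consecutive failures it "opens" and fails fast via the fallback,
    // then probes the remote call again once the open period has passed.
    public class CircuitBreaker {
        private enum State { CLOSED, OPEN }

        private final int failureThreshold;
        private final long openMillis;     // how long to fail fast when open
        private final long timeoutMillis;  // per-call timeout
        private final ExecutorService executor = Executors.newCachedThreadPool();

        private State state = State.CLOSED;
        private int failures = 0;
        private long openedAt = 0;

        public CircuitBreaker(int failureThreshold, long openMillis, long timeoutMillis) {
            this.failureThreshold = failureThreshold;
            this.openMillis = openMillis;
            this.timeoutMillis = timeoutMillis;
        }

        public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
            if (state == State.OPEN) {
                if (System.currentTimeMillis() - openedAt < openMillis) {
                    return fallback.get();            // fail fast while open
                }
                state = State.CLOSED;                 // half-open: probe again
            }
            Future<T> future = executor.submit(remoteCall::get);
            try {
                T result = future.get(timeoutMillis, TimeUnit.MILLISECONDS);  // timeout
                failures = 0;
                return result;
            } catch (Exception e) {                   // timeout or remote failure
                future.cancel(true);
                if (++failures >= failureThreshold) {
                    state = State.OPEN;               // trip the breaker
                    openedAt = System.currentTimeMillis();
                }
                return fallback.get();                // mitigate via fallback
            }
        }
    }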
  35. … then you digest the stuff you have learned

  36. Core Detect Treat Prevent Recover Mitigate Complement Supporting patterns Redundancy

    Stateless Idempotency Escalation Zero downtime deployment Location transparency Relaxed temporal constraints Fallback Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale Deferrable work Communication paradigm Isolation Failure unit System level Monitor Watchdog Heartbeat Either level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast Let sleeping dogs lie Small releases Hot deployments Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy Backpressure Retry Limit retries Rollback Roll-forward Checkpoint Safe point Failover Read repair Error handler Reset Restart Reconnect Fail silently Default value Node level Timeout Circuit breaker Complete parameter checking Checksum Statically Dynamically Confinement Acknowledgement Cool! Give me more! More cool stuff Nice! Sounds like theory Boring! One-off? Whatever! Well … Later! Well … Maybe!
  37. Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as

    perceived by engineer Actual relevance for system robustness
  38. Core Detect Relevance Recover Mitigate Treat Prevent Complement Relevance as

    perceived by engineer Actual relevance for system robustness Ye be warned! If you do not get this part right, nothing else matters ‘ere be dragons! This part is extremely hard and poorly understood
  39. We have a problem!

  40. Let us take a closer look at the core parts

  41. Core Detect Treat Prevent Recover Mitigate Complement

  42. Core Detect Treat Prevent Recover Mitigate Complement Isolation

  43. Isolation • System must not fail as a whole •

    Split the system into parts and isolate the parts against each other • Avoid cascading failures • Foundation of resilient software design
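To make the isolation idea more tangible: a common technical means of isolation inside a process is the bulkhead, giving each downstream dependency its own bounded resource pool. A minimal, illustrative sketch in plain Java (hypothetical names and pool sizes, not from the deck):

    import java.util.concurrent.*;

    // Illustrative bulkhead sketch: each downstream dependency gets its
    // own small, bounded thread pool. If the warehouse system hangs,
    // only the warehouse pool fills up; payment calls still get threads,
    // so the failure cannot cascade through the whole process.
    public class Bulkheads {
        private final ExecutorService paymentPool   = Executors.newFixedThreadPool(4);
        private final ExecutorService warehousePool = Executors.newFixedThreadPool(4);

        public Future<String> authorizePayment(String orderId) {
            return paymentPool.submit(() -> callPaymentService(orderId));
        }

        public Future<String> reserveItems(String orderId) {
            return warehousePool.submit(() -> callWarehouseService(orderId));
        }

        // Placeholders for the actual remote calls (HTTP/RPC in a real system).
        private String callPaymentService(String orderId)   { return "payment-ok:" + orderId; }
        private String callWarehouseService(String orderId) { return "reserved:" + orderId; }
    }

Note that such technical bulkheads only limit the blast radius; which functionality ends up inside which failure unit remains a pure design decision, as the following slides show.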
  44. Core Detect Treat Prevent Recover Mitigate Complement Isolation Failure unit

  45. Failure unit • Core isolation pattern • a.k.a. “units of

    mitigation” • Functional slices, isolated via bulkheads • Diverse implementation choices available • (Micro-)service, actor, CSP, SCS, ... • Choice impacts system and resilience design a lot • Shaping good failure units is extremely hard • Pure design issue
  46. Really? Sounds easy! Where is the problem?

  47. How do we avoid this …

    [Diagram: a client request hits Service A, which calls Service B] Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design
  48. … and this …

    [Diagram: a client request fans out to a dozen services] Due to functional design we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design
  49. … without ending up with this?

    [Diagram: one big Mothership Service (a.k.a. Monolith) handling the request] By trying to avoid the aforementioned issues, we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design
  50. Let us apply the well-known design best practices!

  51. Well-known design best practices • Divide & conquer a.k.a. functional

    decomposition • DRY (Don’t Repeat Yourself) • Design for reusability • Layered architecture • …
  52. Unfortunately …

  53. … this usually leads to this …

    [Diagram: a client request hits Service A, which calls Service B] Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e., the isolation is broken by design
  54. … and this …

    [Diagram: a client request fans out to a dozen services] Due to functional design we need to call a lot of services to be able to answer a client request, i.e., availability is broken by design
  55. … and in the end also often to this.

    [Diagram: one big Mothership Service (a.k.a. Monolith) handling the request] By trying to avoid the aforementioned issues, we ended up cramming all required functionality into one big service, i.e., the isolation is broken by design
  56. Welcome to distributed hell!

  57. Caches to the rescue!

  58. [Diagram: the Service A → Service B setup from before, now with

    a "Cache of B" inside Service A] Break tight service coupling by caching data/responses of the downstream service
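For illustration, a minimal sketch (hypothetical names, plain Java) of the cache-of-B workaround this slide describes: Service A keeps the last successful response from Service B and serves that possibly stale copy when B is unavailable. The next slide explains why this does not fix the underlying design problem:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative cache-of-B sketch: keep the last known good response
    // per key and fall back to it when Service B is down. Availability
    // improves, but the data served may be arbitrarily stale.
    public class CachedServiceBClient {
        private final Map<String, String> lastKnownGood = new ConcurrentHashMap<>();

        public String getFromB(String key) {
            try {
                String fresh = callServiceB(key);       // remote call that may fail
                lastKnownGood.put(key, fresh);
                return fresh;
            } catch (RuntimeException serviceBDown) {
                String stale = lastKnownGood.get(key);
                if (stale == null) throw serviceBDown;  // nothing cached yet
                return stale;                           // serve stale data
            }
        }

        // Placeholder for the actual remote call to Service B.
        private String callServiceB(String key) { return "value-of-" + key; }
    }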
  59. Do you really think that copying stale data all over

    your system is a suitable measure to fix an inherently broken design? * * Caches are a great measure in several places, but they do not make up for a messed-up design
  60. Sorry, not that easy

  61. We must re-learn functional design for distributed systems!

  62. Okay, where is the silver bullet?

  63. Sorry, out of stock * * as always

  64. Maybe some 5-minute crash course, some checklist, …?

  65. Sorry, not that easy

  66. It is a lot of hard work …

  67. … and there is no shortcut

  68. Yet, a few guiding thoughts regarding bulkhead design * *

    not to be confused with silver bullets – it will still be hard work
  69. Foundations of design • “High cohesion, low coupling” & “separation

    of concerns” • Crucial across process boundaries • Still poorly understood issue • Start with • Understanding organizational boundaries • Understanding use cases and flows • Identifying functional domains (→ DDD) • Finding areas that change independently • Do not start with a data model!
  70. Short activation paths • Long activation paths affect availability •

    Increase likelihood of failures • Minimize remote calls per request • Need to balance opposing forces • Avoid monolith → clear separation of concerns • Minimize requests → cluster functionality & data • Caches can sometimes help, but stale data as trade-off
  71. For good service design, look at the behavior first, not

    the data (see https://speakerdeck.com/ufried/getting-service-design-right for more details)
  72. Be (extremely) wary of reusability • Reusability increases coupling •

    Reusability usually leads to bad service design • Reusability compromises availability • Reusability rarely pays • Do not strive for reusability in distributed systems • Strive for replaceability instead • Try to tackle reusability issues with libraries
  73. Aiming for reusability in a distributed system compromises system availability

    and response times by design
  74. The value of reuse unfolds inside a process boundary. Across

    process boundaries it should be avoided as it maximizes coupling (see https://speakerdeck.com/ufried/the-reusability-fallacy for more details)
  75. Extending the options

  76. Core Detect Treat Prevent Recover Mitigate Complement Isolation Failure unit

    Communication paradigm
  77. Communication paradigm • Request-response <-> messaging <-> events <-> …

    • Heavily influences resilience patterns to be used • Also heavily influences functional failure unit design • Very fundamental decision which is often underestimated
  78. Request/Response : Horizontal slicing Flow / Process Event-driven : Vertical

    slicing Flow / Process
  79. Request/Response vs. Event-driven

    • Complexity: a draw … • Coordination: built-in logic & orchestration vs. event chains & choreography • Decomposition: vertically (divide & conquer) vs. horizontally (go-with-the-flow) • Error handling: built-in error handling vs. escalation/supervision strategy • Separation of concerns: multiple responsibilities per service vs. single responsibility per service • Transactions: built-in transaction handling vs. external supervision • Encapsulation: distributed domain logic / reuse vs. local domain logic / replaceability
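To make the contrast concrete, a minimal, illustrative sketch (hypothetical names, plain Java) of the same step in both paradigms: in the request/response variant the order service orchestrates the payment call and must handle its failure inline; in the event-driven variant it only publishes an event, and the payment service reacts whenever it is able to:

    // Request/response: the caller orchestrates and has to handle the
    // downstream failure right here (timeout, retry, fallback, ...).
    class OrderServiceRpc {
        private final PaymentClient payment;

        OrderServiceRpc(PaymentClient payment) { this.payment = payment; }

        void placeOrder(String orderId) {
            payment.authorize(orderId);  // synchronous call; blocks and may fail
            // ... only now can the order be confirmed
        }
    }

    interface PaymentClient { void authorize(String orderId); }

    // Event-driven: the order service records the fact and publishes it.
    // The payment service subscribes and reacts when it can; temporal
    // coupling is relaxed, at the price of choreography and supervision
    // living elsewhere (see the following slides).
    class OrderServiceEvents {
        private final EventBus bus;

        OrderServiceEvents(EventBus bus) { this.bus = bus; }

        void placeOrder(String orderId) {
            bus.publish(new OrderConfirmed(orderId));
        }
    }

    record OrderConfirmed(String orderId) {}
    interface EventBus { void publish(Object event); }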
  80. Workshop example: Order fulfillment

  81. The request/response communication sample solution

  82. [Diagram: the request/response sample solution. An Order Fulfillment

    Service coordinates services for payment (credit card, PayPal), shipment, digital assets (music, video, e-book libraries), promotions (loyalty, coupon) and customer notification, integrating own services such as the Account Service with foreign services such as the Online Shop, Credit Card Provider, PayPal, Warehouse System, Coupon/Campaign/Loyalty Management, Accounts Receivables and E-Mail Server]
  83. The event-based communication sample solution

  84. [Diagram: the event-based sample solution. Own services (Account

    Service, Credit Card Service, PayPal Service, Warehouse Service, Promotion Service, Bonus Card Service, Coupon Service, Music/Video/E-Book Library Services, Notification Service) react to events such as "Order confirmed", "Payment authorized", "Payment failed" and "Digital asset provisioned", integrating foreign services such as the Online Shop, Credit Card Provider, PayPal, Warehouse System, Coupon/Campaign/Loyalty Management, Accounts Receivables and E-Mail Server] An order fulfillment supervisor tracks the flow of events and reschedules events in case of failure. Services are responsible for eventually succeeding or failing for good, usually incorporating a supervision/escalation hierarchy for that.
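A minimal, illustrative sketch (hypothetical names, plain Java, not the deck's implementation) of the supervisor idea annotated on this slide: it tracks which follow-up event each published event should produce and republishes the original event if the follow-up does not arrive in time:

    import java.util.Map;
    import java.util.concurrent.*;

    // Illustrative order-fulfillment supervisor: tracks the flow of events
    // and reschedules an event if the expected follow-up does not arrive
    // within a deadline. Simplified: a real supervisor would limit retries
    // and eventually escalate instead of re-arming forever.
    class FulfillmentSupervisor {
        interface EventBus { void publish(Object event); }

        private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        private final Map<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
        private final EventBus bus;

        FulfillmentSupervisor(EventBus bus) { this.bus = bus; }

        // Called when an event should trigger a follow-up, e.g. "Order
        // confirmed" should eventually lead to "Payment authorized".
        void expectFollowUp(String correlationId, Object originalEvent, long timeoutSeconds) {
            ScheduledFuture<?> timeout = scheduler.schedule(() -> {
                bus.publish(originalEvent);                                    // republish ...
                expectFollowUp(correlationId, originalEvent, timeoutSeconds);  // ... and wait again
            }, timeoutSeconds, TimeUnit.SECONDS);
            pending.put(correlationId, timeout);
        }

        // Called when the expected follow-up event arrives in time.
        void followUpArrived(String correlationId) {
            ScheduledFuture<?> timeout = pending.remove(correlationId);
            if (timeout != null) timeout.cancel(false);
        }
    }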
  85. The communication paradigm influences the functional service design a lot

    and the resilience patterns to be used, too
  86. Do not limit your design options upfront without an important

    reason
  87. Wrapping up

  88. Wrapping up • Today’s systems are distributed • Failures are

    neither avoidable nor predictable • Resilient software design needed • Functional design of failure units (“services”) is • crucial for application robustness • poorly understood • massively underrated • very different from traditional design best practices • Communication paradigms extend the design options
  89. We must re-learn functional design for distributed systems!

  90. Uwe Friedrichsen CTO @ codecentric https://twitter.com/ufried https://www.speakerdeck.com/ufried https://ufried.com/