Resilient functional service design

This slide deck addresses the importance of proper functional design for creating resilient, distributed systems (not only, but also microservice-based systems).

It starts with a short introduction to how resilient software design, as discussed in this presentation, fits into the much broader topic of resilience, and why resilient software design is a topic that IT development departments can no longer afford to ignore.

Next, the slide deck explains the pitfall many developers fall into when getting started with resilience: quite often, the technical fault-tolerance implementation patterns are overrated, while the foundations, i.e., designing failure units (these days typically in the form of microservices) on a functional level and choosing the communication paradigm, are underrated in turn.

Then, it explains why this inverted prioritization is a problem: none of the implementation patterns help if you do not get the functional design right, including the distribution of functionality across the failure units. Additionally, it explains why the widespread design "best practices" do not help in this context, but instead make things even worse.

The core observation of the slide deck is that we basically need to re-learn functional design if we want to create robust distributed systems because the existing "best practices" that are great inside process boundaries are usually counterproductive across process boundaries.

The slide deck does not offer any silver bullets to solve the problem (and I believe there is no silver bullet). Instead, it offers a few guiding principles. Additionally, it shows how strongly the choice of communication paradigm influences the failure unit design, which creates more options for a good service design that also supports resilience on a functional level.

As always, this slide deck comes without the voice track, i.e., most of the information is probably missing. But I hope that the slides on their own also provide some helpful hints.

PS: This is an edited and updated version of an older presentation. The core message is the same, but quite a few details were changed and updated.

Uwe Friedrichsen

July 29, 2021

Transcript

  1. Resilient functional service design
    The usually forgotten parts of resilient software design (updated 2021 edition)
    Uwe Friedrichsen – codecentric AG – 2015-2021

  2. Uwe Friedrichsen
    CTO @ codecentric
    https://twitter.com/ufried
    https://www.speakerdeck.com/ufried
    https://ufried.com/

  3. What is that “resilience” thing?

  4. re·sil·ience (rĭ-zĭl′yəns)
    n.
    1. The ability to recover quickly from illness, change, or misfortune; buoyancy.
    2. The property of a material that enables it to resume its original shape or position after being
    bent, stretched, or compressed; elasticity.
    American Heritage® Dictionary of the English Language, Fifth Edition. Copyright © 2016 by Houghton Mifflin Harcourt Publishing
    Company. Published by Houghton Mifflin Harcourt Publishing Company. All rights reserved.
    https://www.thefreedictionary.com/resilience

  5. What has that got to do with IT?

  6. Well, it depends on what you are looking at

  7. The comprehensive view
    1

  8. Resilience is a huge topic with many aspects …

  9. … which can also be applied to IT

  10. Resilience in IT
    Impact area
    • Infrastructure
    • Application
    • IT organization
    • Product organization
    • Company
    • Corporate
    • Beyond corp
    (people, society, ecology, …)
    Failure modes
    • Known failure modes
    • Unknown failure modes
    Value
    • Avoid adversity
    • Withstand adversity
    • Recover from adversity
    Types 1
    • Rebound
    • Robustness
    • Graceful extensibility
    • Sustained adaptability
    1 according to David D. Woods, “Four concepts for resilience and the
    implications for the future of resilience engineering”, Reliability
    Engineering and System Safety (2015)
    Levels 2
    • Resilience
    • Adaptability
    • Transformability
    2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig,
    “Resilience, adaptability and transformability in social–ecological
    systems”, Ecology and Society 9(2): 5 (2004)
    Affected system
    • Technical (IT)
    • Socio-technical
    • Social (humans)

  11. Resilience in IT
    Impact area
    • Infrastructure
    • Application
    • IT organization
    • Product organization
    • Company
    • Corporate
    • Beyond corp
    (people, society, ecology, …)
    Value
    • Avoid adversity
    • Withstand adversity
    • Recover from adversity
    Affected system
    • Technical (IT)
    • Socio-technical
    • Social (humans)
    Types 1
    • Rebound
    • Robustness
    • Graceful extensibility
    • Sustained adaptability
    Levels 2
    • Resilience
    • Adaptability
    • Transformability
    Failure modes
    • Known failure modes
    • Unknown failure modes
    1 according to David D. Woods, “Four concepts for resilience and the
    implications for the future of resilience engineering”, Reliability
    Engineering and System Safety (2015)
    2 according to B. Walker, C. S. Holling, S. R. Carpenter, A. Kinzig,
    “Resilience, adaptability and transformability in social–ecological
    systems”, Ecology and Society 9(2): 5 (2004)
    Focus of this presentation

  12. Resilience in IT
    [Diagram: matrix of failures (known, unknown) versus system (technical, social).
    Labeled areas: fault-tolerant software design (focus of this presentation),
    chaos engineering, other types of resilience and increased impact area]

  13. The systems perspective
    2

  14. It is all about production!

  15. Business value
    is delivered by
    Systems in production
    but only if
    Systems are available

  16. What is the problem?
    Let us install our software on some
    HA hardware or infrastructure
    and everything will be fine.

  17. For a single, monolithic, isolated system
    this might indeed work, but …

  18. (Almost) every system is a distributed system.
    -- Chas Emerick
    http://www.infoq.com/presentations/problems-distributed-systems

  19. Distributed systems in a nutshell

  20. Everything fails, all the time.
    -- Werner Vogels

  21. Failures in distributed systems ...
    • Crash failure
    • Omission failure
    • Timing failure
    • Response failure
    • Byzantine failure

  22. ... lead to a variety of effects …
    • Lost messages
    • Incomplete messages
    • Duplicate messages
    • Distorted messages
    • Delayed messages
    • Out-of-order message arrival
    • Partial, out-of-sync local memory
    • ...
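
To make one of these effects concrete: duplicate messages are commonly handled by making the receiver idempotent. The following is a minimal sketch, not something taken from the slides. The class name `IdempotentReceiver`, the message IDs, and the in-memory set are all assumptions made for illustration; a real system would need bounded or persistent storage for the IDs it has seen.

```python
class IdempotentReceiver:
    """Processes each message ID at most once (hypothetical helper)."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: unbounded in-memory set, for illustration only

    def receive(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark the ID as seen only after the handler succeeded,
        # so that a failed attempt can be redelivered.
        self._seen.add(message_id)
        return True


processed = []
receiver = IdempotentReceiver(processed.append)
results = [
    receiver.receive("m-1", "reserve item"),
    receiver.receive("m-2", "charge card"),
    receiver.receive("m-1", "reserve item"),  # duplicate delivery of m-1
]
```

The duplicate delivery is dropped by the receiver, so the handler's effect happens only once even though the message arrived twice.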

  23. ... turning seemingly simple issues into very hard ones
    • Time & Ordering: Leslie Lamport, “Time, clocks, and the ordering of events in distributed systems”
    • Consensus: Leslie Lamport, “The part-time parliament” (Paxos)
    • CAP: Eric A. Brewer, “Towards robust distributed systems”
    • Faulty processes: Leslie Lamport, Robert Shostak, Marshall Pease, “The Byzantine generals problem”
    • Consensus: Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson, “Impossibility of distributed consensus with one faulty process” (FLP)
    • Impossibility: Nancy A. Lynch, “A hundred impossibility proofs for distributed computing”

  24. A distributed system is one in which the failure
    of a computer you didn't even know existed
    can render your own computer unusable.
    -- Leslie Lamport

  25. Failures in today’s complex, distributed and
    interconnected systems are not the exception.
    • They are the normal case
    • They are not predictable

  26. … and it’s getting “worse”
    • Cloud-based systems
    • Service-based architectures
    • Zero Downtime
    • Mobile & IoT
    • Social Web
    → Ever-increasing complexity and connectivity

  27. Do not try to avoid failures
    Embrace them

  28. re·sil·ience (of IT systems)
    n.
    The ability of a system to handle unexpected situations
    • without the user noticing it (ideal case)
    • with a graceful degradation of service (non-ideal case)
    The cautious attempt to provide a useful definition for resilience in the context of software systems.
    No copyright attached, but also no guarantee that this definition is sufficient for all relevant purposes.

  29. Typical measures
    • High-availability (HA) hardware/software
        • Only applicable for very small installations
        • Usually not available in cloud environments
    • Delegate failure handling to infrastructure level
        • Partial relief, will not solve all problems
    • Implement resilient software design patterns
        • Very important, still will not fix a bad design
    • Minimize number of predetermined breaking points
        • Minimize problem surface by design

  30. Typical measures
    • High-availability (HA) hardware/software
        • Only applicable for very small installations
        • Usually not available in cloud environments
    • Delegate failure handling to infrastructure level
        • Partial relief, will not solve all problems
    • Implement resilient software design patterns
        • Very important, still will not fix a bad design
    • Minimize number of predetermined breaking points
        • Minimize problem surface by design
    Focus of this presentation

  31. Designing for resilience …

  32. First you learn about resilience …

  33. Core
    Detect Treat
    Prevent
    Recover
    Mitigate Complement

  34. [Diagram: pattern map arranged around Core, Detect, Treat, Prevent, Recover, Mitigate and Complement]
    Labels shown in the diagram, in transcript order:
    Supporting patterns, Redundancy, Stateless, Idempotency, Escalation,
    Zero downtime deployment, Location transparency, Relaxed temporal constraints,
    Fallback, Shed load, Share load, Marked data, Queue for resources, Bounded queue,
    Finish work in progress, Fresh work before stale, Deferrable work,
    Communication paradigm, Isolation, Failure unit, System level, Monitor, Watchdog,
    Heartbeat, Either level, Voting, Synthetic transaction, Leaky bucket,
    Routine checks, Health check, Fail fast, Let sleeping dogs lie, Small releases,
    Hot deployments, Routine maintenance, Backup request, Anti-fragility, Diversity,
    Jitter, Error injection, Spread the news, Anti-entropy, Backpressure, Retry,
    Limit retries, Rollback, Roll-forward, Checkpoint, Safe point, Failover,
    Read repair, Error handler, Reset, Restart, Reconnect, Fail silently,
    Default value, Node level, Timeout, Circuit breaker, Complete parameter checking,
    Checksum, Statically, Dynamically, Confinement, Acknowledgement

  35. … then you digest the stuff you have learned

  36. [Same pattern map as on slide 34, now overlaid with reactions:]
    Cool!
    Give me more!
    More cool stuff
    Nice!
    Sounds like
    theory
    Boring!
    One-off?
    Whatever!
    Well …
    Later!
    Well …
    Maybe!

  37. [Chart: relevance per pattern category (Core, Detect, Recover, Mitigate, Treat, Prevent, Complement),
    comparing relevance as perceived by engineer with actual relevance for system robustness]

  38. [Chart: relevance per pattern category (Core, Detect, Recover, Mitigate, Treat, Prevent, Complement),
    comparing relevance as perceived by engineer with actual relevance for system robustness]
    Ye be warned!
    If you do not get this part
    right, nothing else matters
    ‘ere be dragons!
    This part is extremely hard
    and poorly understood

  39. We have a problem!

  40. Let us take a closer look
    at the core parts

  41. Core
    Detect Treat
    Prevent
    Recover
    Mitigate Complement

  42. Core
    Detect Treat
    Prevent
    Recover
    Mitigate Complement
    Isolation

  43. Isolation
    • System must not fail as a whole
    • Split system in parts and isolate parts against each other
    • Avoid cascading failures
    • Foundation of resilient software design

  44. Core
    Detect Treat
    Prevent
    Recover
    Mitigate Complement
    Isolation
    Failure unit

  45. Failure unit
    • Core isolation pattern
    • a.k.a. “units of mitigation”
    • Functional slices, isolated via bulkheads
    • Diverse implementation choices available
        • (Micro-)service, actor, CSP, SCS, ...
        • Choice impacts system and resilience design a lot
    • Shaping good failure units is extremely hard
        • Pure design issue
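
The technical side of a bulkhead is the easy part and can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the deck's implementation: the names `Bulkhead` and `BulkheadFullError` are hypothetical, and it only limits how many calls may run concurrently against one failure unit. It says nothing about where the functional boundaries should be drawn, which is the hard design question raised above.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free slot (hypothetical name)."""


class Bulkhead:
    """Limits the number of concurrent calls into one failure unit."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Non-blocking acquire: reject immediately instead of letting
        # callers pile up behind a slow or failing unit.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot, rejecting call")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Rejecting instead of queueing is a deliberate choice in this sketch: a caller that is turned away quickly can degrade gracefully, while a caller that waits may itself become part of a cascading failure.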

  46. Really? Sounds easy!
    Where is the problem?

  47. How do we avoid this …
    [Diagram: Request → Service A → Service B]
    Due to functional design, Service A
    always needs backing from Service B
    to be able to answer a client request,
    i.e., the isolation is broken by design

  48. … and this …
    [Diagram: a request to one service fans out to many other services]
    Due to functional design we need
    to call a lot of services to be able
    to answer a client request,
    i.e., availability is broken by design

  49. … without ending up with this?
    [Diagram: Request → Mothership Service (a.k.a. Monolith)]
    By trying to avoid the aforementioned
    issues, we ended up cramming all
    required functionality into one big service,
    i.e., the isolation is broken by design

  50. Let us apply the well-known design best practices!

  51. Well-known design best practices
    • Divide & conquer a.k.a. functional decomposition
    • DRY (Don’t Repeat Yourself)
    • Design for reusability
    • Layered architecture
    • …

  52. Unfortunately …

  53. … this usually leads to this …
    [Diagram: Request → Service A → Service B]
    Due to functional design, Service A
    always needs backing from Service B
    to be able to answer a client request,
    i.e., the isolation is broken by design

  54. … and this …
    [Diagram: a request to one service fans out to many other services]
    Due to functional design we need
    to call a lot of services to be able
    to answer a client request,
    i.e., availability is broken by design

  55. … and in the end also often to this.
    [Diagram: Request → Mothership Service (a.k.a. Monolith)]
    By trying to avoid the aforementioned
    issues, we ended up cramming all
    required functionality into one big service,
    i.e., the isolation is broken by design

  56. Welcome to distributed hell!

  57. Caches to the rescue!

  58. [Diagram: Request → Service A → Service B, with a cache of B added]
    Due to functional design, Service A
    always needs backing from Service B
    to be able to answer a client request,
    i.e., the isolation is broken by design
    Break tight service coupling
    by caching data/responses
    of downstream service
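
What such a cache amounts to in code can be sketched roughly as follows. This is an illustration under assumptions, not the deck's implementation: `StaleFallbackCache` and its `fetch` callable are hypothetical names, and the sketch deliberately returns the age of the data so that the staleness trade-off stays visible to the caller.

```python
import time


class StaleFallbackCache:
    """Serves the last known response of a downstream call if the call fails."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch      # callable that queries the downstream service
        self._clock = clock      # injectable clock, which makes the sketch testable
        self._entries = {}       # key -> (value, time the value was stored)

    def get(self, key):
        """Return (value, age_in_seconds); the age is 0.0 for a fresh response."""
        try:
            value = self._fetch(key)
        except Exception:
            if key not in self._entries:
                raise  # nothing cached: the dependency on the downstream service remains
            value, stored_at = self._entries[key]
            return value, self._clock() - stored_at
        self._entries[key] = (value, self._clock())
        return value, 0.0
```

Note what the sketch cannot do: for a key that was never fetched, Service A still fails together with Service B, and for every other key the caller has to decide how old an answer is still acceptable.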

  59. Do you really think that copying stale data all over your system
    is a suitable measure to fix an inherently broken design? *
    * Caches are a great measure in several places, but they do not make up for a messed-up design

  60. Sorry, not that easy

  61. We must re-learn functional design
    for distributed systems!

  62. Okay, where is the silver bullet?

  63. Sorry, out of stock *
    * as always

  64. Maybe some 5-minute crash course, some checklist, …?

  65. Sorry, not that easy

  66. It is a lot of hard work …

  67. … and there is no shortcut

  68. Yet, a few guiding thoughts regarding bulkhead design *
    * not to be confused with silver bullets – it will still be hard work

  69. Foundations of design
    • “High cohesion, low coupling” & “separation of concerns”
        • Crucial across process boundaries
        • Still poorly understood issue
    • Start with
        • Understanding organizational boundaries
        • Understanding use cases and flows
        • Identifying functional domains (→ DDD)
        • Finding areas that change independently
    • Do not start with a data model!

  70. Short activation paths
    • Long activation paths affect availability
        • Increase likelihood of failures
    • Minimize remote calls per request
    • Need to balance opposing forces
        • Avoid monolith → clear separation of concerns
        • Minimize requests → cluster functionality & data
    • Caches can sometimes help, but stale data as trade-off
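
The effect of long activation paths can be estimated with simple arithmetic. The sketch below assumes that a request succeeds only if every service on its path is available and that the services fail independently of each other; both are simplifying assumptions, and the 99.9% figure is just an example value. The function name `path_availability` is made up for this illustration.

```python
import math


def path_availability(availabilities):
    """Availability of a request that needs every service on its path.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(availabilities)


# Example: every service on the path is assumed to be 99.9% available.
for path_length in (1, 5, 10, 20):
    combined = path_availability([0.999] * path_length)
    print(f"{path_length:2d} services on the path: {combined:.4f}")
```

Under these assumptions, ten services in a row already drop the request availability to roughly 99.0%, which is why minimizing remote calls per request matters.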

  71. For good service design, look at the behavior first, not the data
    (see https://speakerdeck.com/ufried/getting-service-design-right for more details)

  72. Be (extremely) wary of reusability
    • Reusability increases coupling
    • Reusability usually leads to bad service design
    • Reusability compromises availability
    • Reusability rarely pays
    • Do not strive for reusability in distributed systems
    • Strive for replaceability instead
    • Try to tackle reusability issues with libraries

  73. Aiming for reusability in a distributed system compromises
    system availability and response times by design

  74. The value of reuse unfolds inside a process boundary
    Across process boundaries it should be avoided as it maximizes coupling
    (see https://speakerdeck.com/ufried/the-reusability-fallacy for more details)

  75. Extending the options

  76. Core
    Detect Treat
    Prevent
    Recover
    Mitigate Complement
    Isolation
    Failure unit
    Communication
    paradigm

  77. Communication paradigm
    • Request-response <-> messaging <-> events <-> …
    • Heavily influences resilience patterns to be used
    • Also heavily influences functional failure unit design
    • Very fundamental decision which is often underestimated

  78. Request/Response : Horizontal slicing
    Flow / Process
    Event-driven : Vertical slicing
    Flow / Process

  79. Request/Response vs. event-driven
    Aspect                 | Request/Response                    | Event-driven
    Complexity             | A draw …                            | A draw …
    Coordination           | Built-in logic & orchestration      | Event chains & choreography
    Decomposition          | Vertically (divide & conquer)       | Horizontally (go-with-the-flow)
    Error handling         | Built-in error handling             | Escalation/supervision strategy
    Separation of concerns | Multiple responsibilities / service | Single responsibilities / service
    Transactions           | Built-in transaction handling       | External supervision
    Encapsulation          | Distributed domain logic / reuse    | Local domain logic / replaceability

  80. Workshop example: Order fulfillment

  81. The request/response communication sample solution

  82. Online Shop
    Warehouse
    System


    Coupon
    Management
    Campaign
    Management
    Account
    Service
    PayPal
    Loyalty
    Management
    Accounts
    Receivables
    Music Library
    E-Book Library
    E-Mail Server
    Order Fulfillment
    Service
    Coordinate
    Shipment
    Service
    Warehouse
    Coordinate
    Assets
    Notify Cust.
    Payment
    Provider
    Credit Card
    PayPal
    Payment
    Service
    Promotion
    Loyalty
    Coupon
    Coordinate
    Video Library
    Credit Card
    Provider

  83. The event-based communication sample solution

  84. Order confirmed
    Online Shop
    Credit Card
    Provider
    Warehouse
    System


    Coupon
    Management
    Campaign
    Management
    Account
    service
    Credit Card
    Service
    Loyalty
    Management
    Accounts
    Receivables
    Music Library
    E-Book Library
    Video Library E-Mail Server
    PayPal
    PayPal
    Service
    Warehouse
    Service
    Promotion
    Service
    Bonus Card
    Service
    Coupon
    Service
    Music Library
    Service
    Video Library
    Service
    E-Book Library
    Service
    Notification
    Service
    Payment authorized Digital asset provisioned
    Payment failed

    Order fulfillment
    supervisor
    Track flow of events
    Reschedule events
    in case of failure
    Services are responsible for
    eventually succeeding or failing for
    good, usually incorporating a
    supervision/escalation
    hierarchy for that
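
The supervisor role named on this slide (track the flow of events, reschedule events in case of failure) might look roughly like the sketch below. Everything in it is an assumption for illustration: the class `FlowSupervisor`, the `publish` callable, the mapping from a trigger event to its expected follow-up, and the attempt limit are not taken from the slides. A real supervisor would also work with timeouts rather than an explicit `check()` call and would treat events such as "Payment failed" as outcomes too.

```python
class FlowSupervisor:
    """Tracks which follow-up event each order still waits for (hypothetical sketch)."""

    def __init__(self, publish, expected, max_attempts=3):
        self._publish = publish          # callable(event_name, order_id), assumed to exist
        self._expected = expected        # trigger event -> expected follow-up event
        self._max_attempts = max_attempts
        self._pending = {}               # (order_id, follow_up) -> [trigger, attempts]

    def on_event(self, name, order_id):
        """Record an observed event."""
        # The event settles whatever was waiting for it ...
        self._pending.pop((order_id, name), None)
        # ... and may start a new expectation. setdefault keeps the attempt
        # count if the trigger is seen again after a reschedule.
        if name in self._expected:
            key = (order_id, self._expected[name])
            self._pending.setdefault(key, [name, 1])

    def check(self):
        """Reschedule triggers whose follow-up is missing; return escalated flows."""
        escalated = []
        for key, entry in list(self._pending.items()):
            trigger, attempts = entry
            if attempts >= self._max_attempts:
                escalated.append(key)    # give up and hand over to escalation
                del self._pending[key]
            else:
                entry[1] = attempts + 1
                self._publish(trigger, key[0])
        return escalated
```

The services themselves stay free of coordination logic in this style: they react to events, and only the supervisor knows which follow-up is still outstanding for which order.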

  85. The communication paradigm influences
    the functional service design a lot
    and the resilience patterns to be used, too

  86. Do not limit your design options upfront without an important reason

  87. Wrapping up

  88. Wrapping up
    • Today’s systems are distributed
    • Failures are neither avoidable nor predictable
    • Resilient software design needed
    • Functional design of failure units (“services”) is
        • crucial for application robustness
        • poorly understood
        • massively underrated
        • very different from traditional design best practices
    • Communication paradigms extend the design options

  89. We must re-learn functional design
    for distributed systems!

  90. Uwe Friedrichsen
    CTO @ codecentric
    https://twitter.com/ufried
    https://www.speakerdeck.com/ufried
    https://ufried.com/
