
Beyond the saga pattern

This slide deck discusses the "fine print" often omitted when presenting the saga pattern as a way to coordinate updates between several services.

First, a summary of the popular explanations is provided as you can find them on the Internet (just search for "saga pattern microservices"), and an initial omission is pointed out.

Then I go back to the original sources, as I usually do when I find contemporary sources unsatisfying, to understand the ideas of the original authors. Here, this means first revisiting the original "Sagas" paper by Hector Garcia-Molina and Kenneth Salem from 1987. After learning that sagas were originally designed for a very different context, the slide deck briefly discusses the contemporary context – distributed systems.

With that background knowledge, we move on to two presentations by Caitie McCaffrey from 2015 and 2017, who originally introduced the idea of using sagas to coordinate updates across multiple microservices. As this still does not fill all the gaps, we finally look at "Building on Quicksand" by Pat Helland and Dave Campbell from 2009.

Along the way, we fill these gaps and learn about all the "fine print" the popular explanations usually leave out. We also learn a lot about the intricacies of distributed systems and why approaching them with an enterprise software development mindset is almost always a bad idea.

As always, the voice track is missing. Still, I hope the slide deck provides a few useful ideas and insights ...

Uwe Friedrichsen

February 28, 2024

Transcript

  1. Beyond the saga pattern Exploring the constraints and how to

    overcome them Uwe Friedrichsen – codecentric AG – 2016-2024
  2. The default explanation: a saga as a sequence of local

    transactions, one each in Service A, Service B, and Service C
  3. The extended explanation: local transactions in Service A,

    Service B, and Service C; if one fails, the saga aborts and compensates the already completed local transactions with compensating transactions
  4. A few more hints … • Orchestration and choreography as

    variants • Orchestration-based • Requires a centralized controller • Makes the solution more complex • Choreography-based • Services respond to events • The local transaction and sending the event need to happen atomically • Compensating transactions may be hard to design (a sketch of the orchestration-based variant follows)
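A minimal Python sketch of the orchestration-based variant (all names are hypothetical; durable logging and the distributed failure modes discussed later in the deck are omitted):

```python
class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action              # the local transaction
        self.compensation = compensation  # semantically undoes the action

def run_saga(steps):
    completed = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # abort: compensate the completed steps in reverse order
            for done in reversed(completed):
                done.compensation()
            raise

# usage sketch (hypothetical business functions):
# run_saga([SagaStep("reserve", reserve_seat, release_seat),
#           SagaStep("charge", charge_card, refund_card)])
```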
  5. The same saga with a failing compensating transaction: do

    we need a compensating compensating transaction?
  6. “Let us use the term saga to refer to a

    LLT (long lived transaction) that can be broken up into a collection of sub-transactions that can be interleaved in any way with other transactions. Each sub-transaction in this case is a real transaction in the sense that it preserves database consistency. However, unlike other transactions, the transactions in a saga are related to each other and should be executed as a (non-atomic) unit: any partial executions of the saga are undesirable, and if they occur, must be compensated for.” [Gar 1987]
  7. “The compensating transaction [Ci ] undoes, from a semantic point

    of view, any of the actions performed by Ti , but does not necessarily return the database to the state that existed when the execution of Ti began.” [Gar 1987]
  8. “[In some situations] it may always be possible to move

    forward and finish the LLT. In this case, it may not be necessary to ever compensate for an unfinished LLT.” [Gar 1987]
  9. “Two ingredients are necessary to make the ideas we have

    presented feasible: a DBMS that supports sagas, and LLTs that are broken into sequences of transactions. […]” [Gar 1987]
  10. “We initially assume that compensating transactions can only encounter system

    failures. Later on […] we study the effects of other failures (e.g. program bugs) in compensating transactions.” [Gar 1987]
  11. “Note that it is possible to have each transaction store

    in the database the parameters that its compensating transaction may need in the future.” [Gar 1987]
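A minimal sketch of this idea with SQLite standing in for the DBMS (schema and names are hypothetical): the business update and the parameters for the future compensating transaction are written in one atomic local transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE compensation_params "
             "(saga_id TEXT, step INTEGER, account_id INTEGER, amount INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

def debit(saga_id, step, account_id, amount):
    # One atomic transaction: the update plus the parameters its
    # compensating transaction may need in the future.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, account_id))
        conn.execute("INSERT INTO compensation_params VALUES (?, ?, ?, ?)",
                     (saga_id, step, account_id, amount))

def compensate_debit(saga_id, step):
    # The compensating transaction reads back the stored parameters.
    account_id, amount = conn.execute(
        "SELECT account_id, amount FROM compensation_params "
        "WHERE saga_id = ? AND step = ?", (saga_id, step)).fetchone()
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, account_id))
```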
  12. “In some cases it may be desirable to let the

    application programmer indicate through the save-point command where saga check points should be taken. This command can be issued between transactions. It forces the system to save the state of the running application program and returns a save-point identifier for future reference. The save points could then be useful in reducing the amount of work after a saga failure or a system crash: instead of compensating for all of the outstanding transactions, the system could compensate for transactions executed since the last save point, and then restart the saga.” [Gar 1987]
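A sketch of how save points could cut down recovery work, under the stated assumptions (names are hypothetical): on failure, only the transactions executed since the last save point are compensated, and the saga restarts from there.

```python
def run_saga_with_savepoints(steps, save_points, max_restarts=3):
    """steps: list of (action, compensation) pairs;
    save_points: indices after which a save point is taken."""
    last_sp = 0       # index of the first step after the last save point
    restarts = 0
    i = 0
    while i < len(steps):
        action, _ = steps[i]
        try:
            action()
        except Exception:
            # compensate only the transactions since the last save point ...
            for j in range(i - 1, last_sp - 1, -1):
                steps[j][1]()
            restarts += 1
            if restarts > max_restarts:
                raise              # persistent failure: escalate
            i = last_sp            # ... and restart the saga from there
            continue
        if i in save_points:
            last_sp = i + 1
        i += 1
```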
  13. “Within the DBMS, a saga execution component (SEC) manages sagas.

    This component calls on the conventional transaction execution component (TEC), which manages the execution of the individual transactions. The operation of the SEC is similar to that of the TEC: the SEC executes a series of transactions as a unit, while the TEC executes a series of actions as an (atomic) unit. Both components require a log to record the activities of sagas and transactions. […] All saga commands and database actions are channeled through the SEC.” [Gar 1987]
  14. “But what happens if a compensating transaction cannot be successfully

    completed due to errors (e.g., it tries to read a file that does not exist, or there is a bug in the code)? The transaction could be aborted, but if it were run again it would probably encounter the same error. In this case, the system is stuck: it cannot abort the transaction nor can it complete it.” [Gar 1987]
  15. “One possible solution is to make use of software fault

    tolerant techniques along the lines of recovery blocks. A recovery block is an alternate or secondary block of code that is provided in case a failure is detected in the primary block. The other possible solution to this problem is manual intervention. The erroneous transaction is first aborted. Then it is given to an application programmer who, given a description of the error, can correct it. The SEC (or the application) then reruns the transaction and continues processing the saga.” [Gar 1987]
  16. “In our discussion of saga management we have assumed that

    the SEC is part of the DBMS and has direct access to the log. However, in some cases it may be desirable to run sagas on an existing DBMS that does not directly support them.” [Gar 1987]
  17. “There are basically two things to do. First, the saga

    commands embedded in the application code become subroutine calls […]. Each subroutine stores within the database all the information that the SEC would have stored in the log.” [Gar 1987]
  18. “Second, a special process must exist to implement the rest

    of the SEC functions. This process, the saga daemon (SD) would always be active. It would be restarted after a crash by the operating system. After a crash it would scan the saga tables to discover the status of pending sagas.” [Gar 1987]
  19. “Our model for sequential transaction execution within a saga can

    be extended to include parallel transactions.” [Gar 1987]
  20. “To identify potential sub-transactions within a LLT, one must search

    for natural divisions of the work being performed. In many cases, the LLT models a series of real world actions, and each of these actions is a candidate for a saga transaction. In other cases, it is the database itself that is naturally partitioned into relatively independent components, and the actions on each component can be grouped into a saga transaction.” [Gar 1987]
  21. “It may even be possible to compensate for actions that

    are harder to undo, like sending a letter or printing a check. For example, to compensate for the letter, send a second letter explaining the problem. To compensate for the check, send a stop-payment message to the bank.” [Gar 1987]
  22. “Also recall that pure forward recovery does not require compensating

    transactions. So if compensating transactions are hard to write, then one has the choice of tailoring the application so that LLTs do not have user initiated aborts. Without these aborts, pure forward recovery is feasible and compensation is never needed.” [Gar 1987]
  23. “As has become clear from our discussion, the structure of

    the database plays an important role in the design of sagas. Thus, it is best not to study each LLT in isolation, but to design the entire database with LLTs and sagas in mind. That is, if the database can be laid out into a set of loosely-coupled components (with few and simple inter-component consistency constraints), then it is likely that the LLT will naturally break up into sub-transactions that can be interleaved.” [Gar 1987]
  24. Concepts and definitions • Original intention is to mitigate the

    issues of LLTs • Saga is an LLT broken up into interleavable sub-transactions • Compensation is a semantic rollback • Forward completion • Very powerful concept, but usually ignored • Can be used if no business-level errors are raised • Requires save points for efficient execution
  25. Assumptions and constraints • Built into DBMS using its transaction

    manager • Saga coordinator built into DBMS • Crash failures only • Single DBMS • Transactions do not need to be idempotent → Simplified implementation and coordination conditions
  26. Generalization • Persistent errors usually require human intervention • Implementation

    outside DBMS much harder • Even without considering distributed failure modes • Central coordinator for edge-case fault tolerance needed
  27. Design hints • Think holistically • Do not design sagas

    in isolation • High cohesion, low coupling at all levels • Split transactions along loosely coupled boundaries
  28. Failures in distributed systems ... • Crash failure • Omission

    failure • Timing failure • Response failure • Byzantine failure
  29. ... lead to a variety of effects … • Lost

    messages • Incomplete messages • Duplicate messages • Distorted messages • Delayed messages • Out-of-order message arrival • Partial, out-of-sync local memory • ...
  30. ... turning seemingly simple issues into very hard ones Time

    & Ordering Leslie Lamport “Time, clocks, and the ordering of events in distributed systems” Consensus Leslie Lamport “The part-time parliament” (Paxos) CAP Eric A. Brewer “Towards robust distributed systems” Faulty processes Leslie Lamport, Robert Shostak, Marshall Pease “The Byzantine generals problem” Consensus Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson “Impossibility of distributed consensus with one faulty process” (FLP) Impossibility Nancy A. Lynch “A hundred impossibility proofs for distributed computing”
  31. A distributed system is one in which the failure of

    a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport
  32. Continuous partial failure is the normal state of affairs. --

    Michael Nygard Source: https://www.cognitect.com/blog/2016/2/3/the-new-normal-failure-is-a-good-thing
  33. The question is no longer if failures will hit you.

    The only question left is when and how bad they will hit you.
  34. Failures in today's complex, distributed and interconnected systems are not

    the exception. They are unavoidable. They are the normal case. They are not predictable.
  35. An example You: “How do you handle the situation if

    the service you call does not respond (or does not respond in time)?” Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.” Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
  36. Another example You: “How should the application respond if a

    technical failure occurs?” Business owner: “This must not happen! It is your responsibility to make sure this will not happen.”
  37. More variants of the trap • Infrastructure components will never

    fail • E.g., OS, schedulers, routers, switches, … • Middleware components will never fail • E.g., message queues, databases, … • All encompassing applications and services will never fail • No message loss, latency, response failures, …
  38. The “100% available” trap in a nutshell “Everything works perfectly,

    all the time. Nothing ever fails.” Successor of the “Ops is responsible for availability” mindset
  39. Continuous partial failure is the normal state of affairs. --

    Michael Nygard Source: https://www.cognitect.com/blog/2016/2/3/the-new-normal-failure-is-a-good-thing
  40. The question is no longer if failures will hit you.

    The only question left is when and how bad they will hit you.
  41. Regarding the original saga paper • Consequences of interleaving sub-transactions

    • Sub-transactions cannot depend on one another • E.g., TX 2 cannot use the output of TX 1 as input • Compensating transactions also interleave • Sagas trade atomicity for availability
  42. Sagas in a distributed world • Saga Execution Coordinator (SEC)

    required • Saga log needs to be durable and distributed • Sub-transactions become service requests • Consistency guarantees of a service request may vary • ACID consistency cannot be assumed in general • Service requests do not need to be idempotent • Saga metadata must be available to SEC
  43. Sequence diagram (based on [McC 2015]): a successful saga run. The

    SEC receives Start saga_t(d), reads Metadata(Saga_t), and writes the saga log while calling the services: Start saga_t,i(d), Start R_i,1, End R_i,1(c_1), Start R_i,2, End R_i,2(c_2), Start R_i,3, End R_i,3(c_3), End saga_i; the calls Start R_i,2(d) and Start R_i,3(d) overlap. Key: t: Saga type ID, d: Saga instance data, i: Saga instance ID, c: Compensation data
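A minimal sketch of this happy path (names are hypothetical; a Python list stands in for the durable, distributed saga log):

```python
saga_log = []   # stand-in for a durable, distributed saga log

def log(entry):
    # a real SEC durably persists the entry before proceeding
    saga_log.append(entry)

def sec_run(saga_id, requests, data):
    """requests: list of callables, each taking the saga instance data
    and returning the compensation data for its compensating request."""
    log(("start saga", saga_id, data))
    for k, request in enumerate(requests, start=1):
        log(("start request", saga_id, k))
        compensation_data = request(data)          # call the service
        log(("end request", saga_id, k, compensation_data))
    log(("end saga", saga_id))
```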
  44. Sagas in a distributed world • Saga may fail in

    3 ways • Service response indicates failure (business or technical) • Calling the service fails, including timeout (technical failure) • SEC crashes and needs to be restarted (technical failure)
  45. Sequence diagram (based on [McC 2015]): an aborted saga run. After

    Start saga_t,i(d), Start R_i,1, End R_i,1(c_1), and Start R_i,2, the SEC logs Abort saga_i and issues the compensating requests in reverse order: Start C_i,2(d,∅), End C_i,2, Start C_i,1(d,c_1), End C_i,1, then End saga_i; C_i,2 receives ∅ because R_i,2 never returned compensation data. Key: t: Saga type ID, d: Saga instance data, i: Saga instance ID, c: Compensation data
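And a sketch of this abort path, reusing the hypothetical log entries from the previous sketch: compensating requests run in reverse order, with None standing in for the diagram's ∅ when a request never completed.

```python
def abort_saga(saga_id, saga_log, compensations):
    """compensations: dict mapping request index k to the compensating
    request C_k; it is called with the logged compensation data, or with
    None (the diagram's ∅) if request k never logged an end entry."""
    saga_log.append(("abort saga", saga_id))
    started = [e[2] for e in saga_log
               if e[0] == "start request" and e[1] == saga_id]
    ended = {e[2]: e[3] for e in saga_log
             if e[0] == "end request" and e[1] == saga_id}
    for k in reversed(started):                 # reverse order
        saga_log.append(("start compensation", saga_id, k))
        compensations[k](ended.get(k))          # None stands in for ∅
        saga_log.append(("end compensation", saga_id, k))
    saga_log.append(("end saga", saga_id))
```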
  46. The same aborted run as before (based on [McC 2015]), asking: what

    if a compensating request fails? Log: Start saga_t,i(d), Start R_i,1, End R_i,1(c_1), Start R_i,2, Abort saga_i, Start C_i,2(d,∅), End C_i,2, Start C_i,1(d,c_1), End C_i,1, End saga_i. Key: t: Saga type ID, d: Saga instance data, i: Saga instance ID, c: Compensation data
  47. Handling compensating requests • Compensating requests • must be idempotent

    • must not fail due to business-level errors • SEC replays compensating requests until they succeed • Only works for transient technical errors • Persistent technical errors still require human intervention
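A sketch of the replay loop (names are hypothetical). It is safe only because compensating requests are idempotent and cannot fail at the business level, so every failure seen here is a technical one:

```python
import time

def replay_until_success(compensating_request, compensation_data,
                         delay=1.0, max_attempts=100):
    for attempt in range(1, max_attempts + 1):
        try:
            # idempotent: executing a successful attempt twice is harmless
            return compensating_request(compensation_data)
        except Exception:
            if attempt == max_attempts:
                raise          # persistent technical error: escalate to a human
            time.sleep(delay)  # transient technical error: back off and replay
```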
  48. Handling SEC crashes • SEC must be restarted automatically •

    Sagas in an unsafe state must be rolled back • Start request log entry w/o corresponding end request entry • Required because requests are not idempotent • Non-idempotent requests require at-most-once semantics Side note: Compensating requests cannot rely on the execution of the corresponding original request
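A sketch of the restart scan, reusing the hypothetical log entries from above: any saga with a start-request entry lacking its end-request entry, and without a terminal entry, is rolled back.

```python
def recover(saga_log, rollback):
    """rollback: callback that compensates a saga, e.g. via abort_saga;
    safe to re-run because compensating requests are idempotent."""
    open_requests = {}   # saga_id -> request indices without an end entry
    finished = set()     # sagas that reached a terminal log entry
    for entry in saga_log:
        kind, saga_id = entry[0], entry[1]
        if kind == "start request":
            open_requests.setdefault(saga_id, set()).add(entry[2])
        elif kind == "end request":
            open_requests[saga_id].discard(entry[2])
        elif kind == "end saga":
            finished.add(saga_id)
    for saga_id, still_open in open_requests.items():
        if still_open and saga_id not in finished:
            rollback(saga_id)   # unsafe state: roll the saga back
```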
  49. Idempotent requests • Simplifies saga implementation • Unsafe requests can

    simply be replayed after SEC crash • Idempotent requests allow for at-least-once semantics • May be hard to design depending on the given context
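One common way to get idempotency, sketched with hypothetical names: deduplicate on a unique request ID and return the recorded result on replay, so at-least-once delivery has the business impact of exactly one execution. In a real service the dedup check and the business update must happen atomically.

```python
processed = {}                # request_id -> result of the first execution
balances = {"alice": 100}

def handle_debit(request_id, account, amount):
    if request_id in processed:       # replay after a crash or retry
        return processed[request_id]  # no second business effect
    balances[account] -= amount       # the effect happens exactly once
    result = {"ok": True, "account": account, "amount": amount}
    processed[request_id] = result
    return result

assert handle_debit("req-1", "alice", 30) == handle_debit("req-1", "alice", 30)
assert balances["alice"] == 70        # debited once despite the replay
```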
  50. Forward recovery • Prerequisites • Requests must be idempotent •

    Requests must not fail due to business-level errors • Compensating requests not required • Repeat requests until they eventually succeed • Allows for further simplifications and optimizations, e.g. • Execute all requests in parallel • Replay all requests if any of them failed • Only possible if the given business-level context allows for it
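A sketch of forward recovery under these prerequisites (names are hypothetical): since requests are idempotent and cannot fail at the business level, the SEC simply replays them, here in parallel, until all of them have succeeded.

```python
from concurrent.futures import ThreadPoolExecutor

def attempt(request):
    try:
        request()
        return True
    except Exception:      # technical failure: will be replayed
        return False

def run_forward(requests, max_rounds=100):
    pending = list(requests)
    for _ in range(max_rounds):
        if not pending:
            return                          # saga completed
        with ThreadPoolExecutor() as pool:  # requests may run in parallel
            outcomes = list(pool.map(attempt, pending))
        # replaying all requests would also be safe, since they are idempotent
        pending = [r for r, ok in zip(pending, outcomes) if not ok]
    raise RuntimeError("persistent failure, human intervention required")
```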
  51. Distributed sagas – Summary • Distributed sagas require • Distributed,

    durable log • SEC process that is restarted if it crashes • Idempotent compensating requests (must not fail due to business-level errors) • Idempotent requests enable simpler implementations
  52. Distributed saga protocol • Prerequisites • Idempotent requests, may fail

    at business level • Idempotent compensating requests, must not fail at business level • Requests and compensating requests are commutative • Fault-tolerant and highly available distributed log • SEC that is automatically restarted after crash • Allows implementing the distributed saga protocol
  53. Distributed saga protocol • Distributed saga protocol guarantees • either

    all requests eventually complete successfully • or a subset of the requests and their corresponding compensating requests are eventually executed
  54. Distributed saga protocol • Helps to avoid feral concurrency control

    (see [Bai 2015]) • Coordination logic built into services • Mixed with business logic • Needs to cover distributed systems edge cases • Logic often hardcoded into services multiple times
  55. Summary • Avoid feral concurrency control • Do not pollute

    services with coordination logic • Build a coordinator instead that all services can simply use • Makes service developers’ life a lot easier
  56. Unfortunately, most software is written the opposite way: The software

    is unreliable and expects the underlying components to make the solution reliable
  57. What if … • a service fails while processing a

    message? • a service is stalled indefinitely while processing a message? • a message is lost between services? • a message is processed more than once? • an old message re-emerges? • a message contains garbage? • a split-brain situation occurs? • …
  58. How do we detect such failures (without a coordinator)? How

    do we handle such failures? How do we reconcile (global) state after such failures occurred?
  59. Most popular saga pattern explanations ignore these failure modes, blindly

    relying on the infrastructure and middleware to always guarantee perfect conditions
  60. “The deeper observation is that two things are coupled: 1)

    The change from synchronous checkpointing to asynchronous to save latency, and 2) The loss of a notion of an authoritative truth.” [Hel 2009]
  61. “When the backup system that participates in the enforcement of

    these business rules is asynchronously tied to the primary, the enforcement of these [business] rules inevitably becomes probabilistic! It is the cost/benefit analysis that values lower latency from asynchronous checkpointing higher than the risk of slippage in the business rule that generalizes the fault tolerant algorithm.” [Hel 2009]
  62. “Arguably, all computing really falls into three categories: memories, guesses,

    and apologies. The idea is that everything is done locally with a subset of the global knowledge. You know what you know when an action is performed. Since you have only a subset of the knowledge, your actions are really only guesses. When your knowledge as a replica increases, you may have an “Oh, crap!” moment. Reconciling your actions (as a replica) with the actions of an evil-twin of yours may result in recognition that there’s a mess to clean up. That may involve apologizing for your behavior (or the behavior of a replica).” [Hel 2009]
  63. Diagram: memories, guesses, and apologies in one picture. Events

    arrive incomplete and out of order and update the current knowledge (memories); activities trigger decisions that use this knowledge (guesses): this part we usually implement, assuming a perfect global knowledge in each node. Detection observes events and knowledge and causes a compensation decision (apologies): this part we usually do not implement
  64. “In a loosely coupled world choosing some level of availability

    over consistency, it is best to think of all computing as memories, guesses, and apologies.” [Hel 2009]
  65. “In many applications, it is possible to express the business

    rules of the application and allow different operations to be reordered in their execution. When the operations occur on separate systems and wind their way to each other, it is essential to preserve these business rules.” [Hel 2009]
  66. “The layering of an arbitrary application atop a storage subsystem

    inhibits reordering. Only when commutative operations are used can we achieve the desired loose coupling. Application operations can be commutative. WRITE is not commutative. Storage (i.e. READ and WRITE) is an annoying abstraction...” [Hel 2009]
  67. “It is essential to ensure that the work of a

    single operation is idempotent. This is an important design consideration in the creation of an application that can handle asynchrony as it tolerates faults and as it allows loose-coupling for latency, scale, and offline. Each operation must occur once (i.e. have the business impact of a single execution) even as it is processed (or simply logged) at multiple replicas.” [Hel 2009]
  68. “Consider the new ACID (or ACID2.0). The letters stand for:

    Associative, Commutative, Idempotent, and Distributed. The goal for ACID2.0 is to succeed if the pieces of the work happen: • At least once, • Anywhere in the system, • In any order. This defines a new KIND of consistency. The individual steps happen at one or more system. The application is explicitly tolerant of work happening out of order. It is tolerant of the work happening more than once per machine, too.” [Hel 2009]
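A tiny illustration of the ACID 2.0 idea: adding an element to a set is associative, commutative, and idempotent, so two replicas converge even when the operations arrive duplicated and reordered, while a last-writer-wins WRITE would not.

```python
def apply_add(state: set, item) -> set:
    state.add(item)   # idempotent: applying twice equals applying once
    return state

replica_a, replica_b = set(), set()
ops = [("add", "booking-17"), ("add", "booking-42")]

for _, item in ops:                   # replica A: each op once, in order
    apply_add(replica_a, item)
for _, item in reversed(ops * 2):     # replica B: duplicated and reordered
    apply_add(replica_b, item)

assert replica_a == replica_b         # both replicas converge anyway
```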
  69. “When the application is constrained to the additional requirements of

    commutativity and associativity, the world gets a LOT easier. No longer must the state be checkpointed across failure units in a synchronous fashion. Instead, it is possible to be very lazy about the sharing of information. This opens up offline, slow links, low quality datacenters, and more.” [Hel 2009]
  70. “The major point is that availability (and its cousins offline

    and latency-reduction) may be traded off with classic notions of consistency. This tradeoff may frequently be applied across many different aspects at many levels of granularity within a single application.” [Hel 2009]
  71. “The best model for coping with violations of the business

    rule is: 1. Send the problem to a human (via email or something else), 2. If that’s too expensive, write some business specific software to reduce the probability that a human needs to be involved.” [Hel 2009]
  72. “Whenever the authors struggle with explaining how to implement loosely-coupled

    solutions, we look to how things were done before computers. In almost every case, we can find inspiration in paper forms, pneumatic tubes, and forms filed in triplicate.” [Hel 2009]
  73. Summary • Reliability must be built into the application •

    Asynchronous communication means the loss of a notion of a global shared truth • If using asynchronous coordination • Go for ACID 2.0 • Think in terms of memories, guesses and apologies • Generalization of the saga pattern
  74. Pattern generalization: a matrix comparing how far each source covers

    the relevant aspects: crash failures; distributed system failures (omission failures, latency, network disconnect / split brain, …); persistent failures (bugs, corrupted environment, …); business-level failures; compensating actions; forward completion; transaction splitting; effects of asynchronicity; hints for implementation of special cases; hints for saga design. Sources compared: Sagas (original paper), Applying the Saga Pattern, Distributed Sagas, Building on Quicksand, and the popular explanations
  75. Design considerations • Distributed failure modes must be considered •

    Persistent failures must be considered • Asynchronous communication changes everything • Coordination and failure handling must be implemented at the application level (cannot be “left to middleware”) • Forward recovery offers additional design options
  76. Communication styles • Synchronous communication • Tighter temporal coupling but

    (relatively) easy to implement • Compensating actions need to be idempotent and must not raise business-level errors • Idempotent actions allow for optimized implementations • Asynchronous communication • Loose temporal coupling but challenging to implement • Make operations ACID 2.0 • Think in terms of memories, guesses and apologies
  77. Coordination styles • Central coordinator • Avoids feral concurrency control

    • Keeps services free from coordination & fault tolerance logic • Additional component improving separation of concerns • No central coordinator • Requires feral concurrency control • Coordination and fault tolerance logic spread across services • Also needed for asynchronous communication
  78. Complementing thoughts • Avoid the “100% available” trap • Plan

    for (rare) manual intervention • Especially to handle persistent failures • Check if forward recovery is an option • Requires that actions do not raise business-level errors • Check if context-specific implementation is an option • Allows for much simpler implementations • Look at non-digital implementations for loose coupling
  79. If you need transactions that span multiple services, it

    often is a design smell. Check first whether a service redesign would lead to a better solution before resorting to sagas
  80. Wrap-up • Most popular saga pattern explanations are insufficient •

    Do not take distributed failure modes into account • Fall for the “100% available” trap • Ignore challenges of asynchronous communication • Additional considerations needed for robust implementation • Carefully ponder communication and coordination style • Think about idempotency, ACID 2.0, forward recovery, … • Avoid feral concurrency control
  81. References (1/2) [Bai 2015] Peter Bailis et al., “Feral Concurrency

    Control: An Empirical Investigation of Modern Application Integrity”, SIGMOD ’15, 2015 [Gar 1987] Hector Garcia-Molina, Kenneth Salem, “Sagas”, SIGMOD ’87, 1987 [Hel 2007] Pat Helland, “Life beyond distributed transactions”, CIDR ’07, 2007
  82. References (2/2) [Hel 2009] Pat Helland & Dave Campbell, “Building

    on Quicksand”, CIDR ’09 Perspectives, 2009 [McC 2015] Caitie McCaffrey, “Applying the Saga Pattern”, GOTO Chicago Conference 2015, https://www.youtube.com/watch?v=xDuwrtwYHu8 [McC 2017] Caitie McCaffrey, “Distributed Sagas”, J On The Beach 2017, https://www.youtube.com/watch?v=0UTOLRTwOX0