Slide 1

Beyond the saga pattern
Exploring the constraints and how to overcome them
Uwe Friedrichsen – codecentric AG – 2016-2024

Slide 2

Uwe Friedrichsen
Works @ codecentric
https://twitter.com/ufried
https://www.speakerdeck.com/ufried
https://ufried.com/

Slide 3


The story usually told

Slide 4


Sometimes you need to coordinate updates across multiple microservices

Slide 5


Do not use distributed transactions!

Slide 6


Use the saga pattern instead!

Slide 7

The default explanation

Saga: Service A (local transaction) → Service B (local transaction) → Service C (local transaction)

Slide 8

The extended explanation

Saga: Service A (local transaction) → Service B (local transaction) → Service C (local transaction)
On failure: abort & compensate — run a compensating transaction for each already completed local transaction
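The abort-and-compensate flow can be sketched as a minimal orchestration-style saga runner: execute the local transactions in order and, on the first failure, run the compensations of all completed steps in reverse order. All names (`Step`, `run_saga`) are illustrative, not a real library:

```python
class Step:
    """One saga step: a local transaction plus its compensation."""
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action              # performs the local transaction
        self.compensation = compensation  # semantically undoes the action

def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):  # abort & compensate
                done.compensation()
            return False
    return True
```

If Service C's local transaction fails after A and B have committed, the runner executes B's and then A's compensating transactions.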

Slide 9

A few more hints …
• Orchestration and choreography as variants
• Orchestration-based
  • Requires a centralized controller
  • Makes the solution more complex
• Choreography-based
  • Services respond to events
  • The transaction and sending the event need to be done atomically
• Compensating transactions may be hard to design

Slide 10


… and we are done.

Slide 11


Yay!

Slide 12


But is it really that easy?

Slide 13


What if a compensating transaction fails?

Slide 14

Do we need a compensating compensating transaction?

Saga: Service A (local transaction) → Service B (local transaction) → Service C (local transaction)
Abort & compensate: a compensating transaction runs — and what if it fails?

Slide 15


And if a compensating compensating transaction fails? (to be continued)

Slide 16


Hmm, maybe it is not that easy …

Slide 17


Best re-start from scratch

Slide 18


SAGAS by Hector Garcia-Molina & Kenneth Salem [Gar 1987]

Slide 19


“Let us use the term saga to refer to a LLT (long lived transaction) that can be broken up into a collection of sub-transactions that can be interleaved in any way with other transactions. Each sub-transaction in this case is a real transaction in the sense that it preserves database consistency. However, unlike other transactions, the transactions in a saga are related to each other and should be executed as a (non-atomic) unit: any partial executions of the saga are undesirable, and if they occur, must be compensated for.” [Gar 1987]

Slide 20


“The compensating transaction [Ci ] undoes, from a semantic point of view, any of the actions performed by Ti , but does not necessarily return the database to the state that existed when the execution of Ti began.” [Gar 1987]

Slide 21


“[In some situations] it may always be possible to move forward and finish the LLT. In this case, it may not be necessary to ever compensate for an unfinished LLT.” [Gar 1987]

Slide 22

“Two ingredients are necessary to make the ideas we have presented feasible: a DBMS that supports sagas, and LLTs that are broken into sequences of transactions. […]” [Gar 1987]

Slide 23


“We initially assume that compensating transactions can only encounter system failures. Later on […] we study the effects of other failures (e.g. program bugs) in compensating transactions.” [Gar 1987]

Slide 24


“Note that it is possible to have each transaction store in the database the parameters that its compensating transaction may need in the future.” [Gar 1987]

Slide 25


“In some cases it may be desirable to let the application programmer indicate through the save-point command where saga check points should be taken. This command can be issued between transactions. It forces the system to save the state of the running application program and returns a save-point identifier for future reference. The save points could then be useful in reducing the amount of work after a saga failure or a system crash: instead of compensating for all of the outstanding transactions, the system could compensate for transactions executed since the last save point, and then restart the saga.” [Gar 1987]

Slide 26

“Within the DBMS, a saga execution component (SEC) manages sagas. This component calls on the conventional transaction execution component (TEC), which manages the execution of the individual transactions. The operation of the SEC is similar to that of the TEC: the SEC executes a series of transactions as a unit, while the TEC executes a series of actions as an (atomic) unit. Both components require a log to record the activities of sagas and transactions. […] All saga commands and database actions are channeled through the SEC.” [Gar 1987]

Slide 27


“But what happens if a compensating transaction cannot be successfully completed due to errors (e.g., it tries to read a file that does not exist, or there is a bug in the code)? The transaction could be aborted, but if it were run again it would probably encounter the same error. In this case, the system is stuck: it cannot abort the transaction nor can it complete it.” [Gar 1987]

Slide 28


“One possible solution is to make use of software fault tolerant techniques along the lines of recovery blocks. A recovery block is an alternate or secondary block of code that is provided in case a failure is detected in the primary block. The other possible solution to this problem is manual intervention. The erroneous transaction is first aborted. Then it is given to an application programmer who, given a description of the error, can correct it. The SEC (or the application) then reruns the transaction and continues processing the saga.” [Gar 1987]

Slide 29


“In our discussion of saga management we have assumed that the SEC is part of the DBMS and has direct access to the log. However, in some cases it may be desirable to run sagas on an existing DBMS that does not directly support them.” [Gar 1987]

Slide 30


“There are basically two things to do. First, the saga commands embedded in the application code become subroutine calls […]. Each subroutine stores within the database all the information that the SEC would have stored in the log.” [Gar 1987]

Slide 31


“Second, a special process must exist to implement the rest of the SEC functions. This process, the saga daemon (SD) would always be active. It would be restarted after a crash by the operating system. After a crash it would scan the saga tables to discover the status of pending sagas.” [Gar 1987]

Slide 32


“Our model for sequential transaction execution within a saga can be extended to include parallel transactions.” [Gar 1987]

Slide 33


“To identify potential sub-transactions within a LLT, one must search for natural divisions of the work being performed. In many cases, the LLT models a series of real world actions, and each of these actions is a candidate for a saga transaction. In other cases, it is the database itself that is naturally partitioned into relatively independent components, and the actions on each component can be grouped into a saga transaction.” [Gar 1987]

Slide 34


“It may even be possible to compensate for actions that are harder to undo, like sending a letter or printing a check. For example, to compensate for the letter, send a second letter explaining the problem. To compensate for the check, send a stop-payment message to the bank.” [Gar 1987]

Slide 35


“Also recall that pure forward recovery does not require compensating transactions. So if compensating transactions are hard to write, then one has the choice of tailoring the application so that LLTs do not have user initiated aborts. Without these aborts, pure forward recovery is feasible and compensation is never needed.” [Gar 1987]

Slide 36


“As has become clear from our discussion, the structure of the database plays an important role in the design of sagas. Thus, it is best not to study each LLT in isolation, but to design the entire database with LLTs and sagas in mind. That is, if the database can be laid out into a set of loosely-coupled components (with few and simple inter-component consistency constraints), then it is likely that the LLT will naturally break up into sub-transactions that can be interleaved.” [Gar 1987]

Slide 37


Whoa, that’s been a lot!

Slide 38


What have we got?

Slide 39

Concepts and definitions
• Original intention is to mitigate the issues of LLTs
• A saga is an LLT broken up into interleavable sub-transactions
• Compensation is a semantic rollback
• Forward completion
  • Very powerful concept, but usually ignored
  • Can be used if no business-level errors are raised
  • Requires save points for efficient execution

Slide 40

Assumptions and constraints
• Built into the DBMS using its transaction manager
• Saga coordinator built into the DBMS
• Crash failures only
• Single DBMS
• Transactions do not need to be idempotent
→ Simplified implementation and coordination conditions

Slide 41

Generalization
• Persistent errors usually require human intervention
• Implementation outside the DBMS is much harder
  • Even without considering distributed failure modes
• Central coordinator needed for edge-case fault tolerance

Slide 42

Design hints
• Think holistically
  • Do not design sagas in isolation
• High cohesion, low coupling at all levels
  • Split transactions along loosely coupled boundaries

Slide 43


But what about distributed systems?

Slide 44


Distributed systems in a nutshell

Slide 45


Everything fails, all the time. -- Werner Vogels

Slide 46

Failures in distributed systems ...
• Crash failure
• Omission failure
• Timing failure
• Response failure
• Byzantine failure

Slide 47

... lead to a variety of effects …
• Lost messages
• Incomplete messages
• Duplicate messages
• Distorted messages
• Delayed messages
• Out-of-order message arrival
• Partial, out-of-sync local memory
• ...

Slide 48

... turning seemingly simple issues into very hard ones
• Time & ordering: Leslie Lamport, “Time, clocks, and the ordering of events in distributed systems”
• Consensus: Leslie Lamport, “The part-time parliament” (Paxos)
• CAP: Eric A. Brewer, “Towards robust distributed systems”
• Faulty processes: Leslie Lamport, Robert Shostak, Marshall Pease, “The Byzantine generals problem”
• Consensus: Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson, “Impossibility of distributed consensus with one faulty process” (FLP)
• Impossibility: Nancy A. Lynch, “A hundred impossibility proofs for distributed computing”

Slide 49


A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport

Slide 50


Continuous partial failure is the normal state of affairs. -- Michael Nygard Source: https://www.cognitect.com/blog/2016/2/3/the-new-normal-failure-is-a-good-thing

Slide 51


The question is no longer if failures will hit you. The only question left is when and how bad they will hit you.

Slide 52

Failures in today’s complex, distributed and interconnected systems are not the exception.
They are unavoidable.
They are the normal case.
They are not predictable.

Slide 53


And how do we usually prepare for this new normal?

Slide 54


Falling in the “100% available” trap

Slide 55

An example
You: “How do you handle the situation if the service you call does not respond (or does not respond timely)?”
Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.”
Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”

Slide 56

Another example
You: “How should the application respond if a technical failure occurs?”
Business owner: “This must not happen! It is your responsibility to make sure this will not happen.”

Slide 57

More variants of the trap
• Infrastructure components will never fail
  • E.g., OS, schedulers, routers, switches, …
• Middleware components will never fail
  • E.g., message queues, databases, …
• All encompassing applications and services will never fail
  • No message loss, latency, response failures, …

Slide 58

The “100% available” trap in a nutshell
“Everything works perfectly, all the time. Nothing ever fails.”
Successor of the “Ops is responsible for availability” mindset

Slide 59


Continuous partial failure is the normal state of affairs. -- Michael Nygard Source: https://www.cognitect.com/blog/2016/2/3/the-new-normal-failure-is-a-good-thing

Slide 60


The question is no longer if failures will hit you. The only question left is when and how bad they will hit you.

Slide 61


Avoid the “100% available” trap

Slide 62


Applying the Saga Pattern by Caitie McCaffrey [McC 2015]

Slide 63

Regarding the original saga paper
• Consequences of interleaving sub-transactions
  • Sub-transactions cannot depend on one another
    • E.g., TX 2 cannot use the output of TX 1 as input
  • Compensating transactions are also interleaving
• Sagas trade atomicity for availability

Slide 64


“Sagas are a failure management pattern” [McC 2015]

Slide 65

Sagas in a distributed world
• Saga Execution Coordinator (SEC) required
• Saga log needs to be durable and distributed
• Sub-transactions become service requests
  • Consistency guarantees of a service request may vary
  • ACID consistency cannot be assumed in general
  • Service requests do not need to be idempotent
• Saga metadata must be available to the SEC

Slide 66

[Sequence diagram, based on [McC 2015]: the SEC reads the saga metadata, writes “Start saga_t,i(d)” to the saga log, then for each service request R_i,k logs “Start R_i,k(d)”, calls the service, and logs “End R_i,k(c_k)”; after the last request it logs “End saga_i”. Key: t = saga type ID, i = saga instance ID, d = saga instance data, c = compensation data.]
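The logging discipline shown here can be sketched as the SEC's happy path: every request is bracketed by durable start/end log entries, so a restarted SEC can later tell which requests are in an unsafe state. A plain list stands in for the distributed, durable saga log; all names are illustrative:

```python
def execute_saga(saga_id, requests, log):
    """Happy path: bracket each service request with start/end log entries.

    requests is a list of (name, call) pairs; each call returns the
    compensation data c that a later rollback would need.
    """
    log.append(("start-saga", saga_id))
    for name, call in requests:
        log.append(("start", saga_id, name))   # durably logged before the call
        compensation_data = call()             # the service request R_i,k
        log.append(("end", saga_id, name, compensation_data))
    log.append(("end-saga", saga_id))
```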

Slide 67

Sagas in a distributed world
• A saga may fail in 3 ways
  • Service response indicates failure (business or technical)
  • Calling the service fails, including timeout (technical failure)
  • SEC crashes and needs to be restarted (technical failure)

Slide 68

[Sequence diagram, based on [McC 2015]: after R_i,1 completes and R_i,2 fails, the SEC logs “Abort saga_i” and rolls back by executing the compensating requests — logging “Start C_i,2(d,∅)” / “End C_i,2” and “Start C_i,1(d,c_1)” / “End C_i,1” — before logging “End saga_i”. Key: t = saga type ID, i = saga instance ID, d = saga instance data, c = compensation data.]

Slide 69

What if a compensating request fails?

[Same abort-and-rollback sequence diagram as before, based on [McC 2015], with the failure now occurring in one of the compensating requests C_i,1 or C_i,2.]

Slide 70

Handling compensating requests
• Compensating requests
  • must be idempotent
  • must not fail due to business-level errors
• SEC replays compensating requests until they succeed
  • Only works for transient technical errors
  • Persistent technical errors still require human intervention
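The replay rule can be sketched as a bounded retry loop: because compensating requests are idempotent and cannot fail at the business level, blindly re-running them is safe for transient technical errors. The names and the retry bound are illustrative:

```python
import time

def run_compensation(compensate, max_attempts=5, delay=0.0):
    """Replay an idempotent compensating request until it succeeds."""
    for _ in range(max_attempts):
        try:
            compensate()
            return True          # compensation took effect
        except Exception:
            time.sleep(delay)    # transient technical error: retry
    return False                 # persistent error: escalate to a human
```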

Slide 71

Handling SEC crashes
• SEC must be restarted automatically
• Sagas in an unsafe state must be rolled back
  • Start request log entry w/o corresponding end request entry
  • Required because requests are not idempotent
  • Non-idempotent requests require at-most-once semantics

Side note: Compensating requests cannot rely on the execution of the corresponding original request
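After a restart, the SEC scans the log for exactly this unsafe state: a start entry without a matching end entry. Since plain requests are not idempotent, such requests may or may not have executed, so the saga must be aborted and compensated rather than replayed. A sketch with an illustrative log-entry shape:

```python
def unsafe_requests(log, saga_id):
    """Return requests with a start log entry but no matching end entry."""
    started = {e[2] for e in log if e[0] == "start" and e[1] == saga_id}
    ended = {e[2] for e in log if e[0] == "end" and e[1] == saga_id}
    return started - ended  # may or may not have executed: roll back
```

This is also why compensating requests cannot rely on the original request having executed: for an unsafe request, the SEC simply does not know.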

Slide 72

Idempotent requests
• Simplify the saga implementation
  • Unsafe requests can simply be replayed after an SEC crash
  • Idempotent requests allow for at-least-once semantics
• May be hard to design depending on the given context
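One common way to make a request idempotent is a deduplication store keyed by a caller-supplied request ID: a replay after an SEC crash returns the recorded result instead of executing the effect twice. All names are illustrative:

```python
class IdempotentService:
    """Executes each request ID at most once; replays return the stored result."""
    def __init__(self):
        self._results = {}  # request_id -> recorded result

    def handle(self, request_id, operation):
        if request_id not in self._results:
            self._results[request_id] = operation()  # first execution only
        return self._results[request_id]
```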

Slide 73

Forward recovery
• Prerequisites
  • Requests must be idempotent
  • Requests must not fail due to business-level errors
• Compensating requests not required
  • Repeat requests until they eventually succeed
• Allows for further simplifications and optimizations, e.g.
  • Execute all requests in parallel
  • Replay all requests if any of them failed
• Only possible if the given business-level context allows for it
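With those prerequisites, forward recovery reduces to “retry until everything has succeeded”, and the parallel optimization becomes safe. A sketch; the names and the round bound are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def forward_recover(requests, max_rounds=5):
    """Run idempotent requests in parallel; retry failed ones until all succeed."""
    pending = dict(requests)  # name -> callable
    for _ in range(max_rounds):
        if not pending:
            return True
        with ThreadPoolExecutor() as pool:
            futures = {name: pool.submit(call) for name, call in pending.items()}
        for name, future in futures.items():
            if future.exception() is None:
                del pending[name]  # done; idempotency makes retrying the rest safe
    return not pending
```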

Slide 74

Distributed sagas – Summary
• Distributed sagas require
  • Distributed, durable log
  • SEC process that is restarted if it crashes
  • Idempotent compensating requests (must not fail due to business-level errors)
• Idempotent requests enable simpler implementations

Slide 75


Note that communication is still synchronous

Slide 76


Distributed Sagas by Caitie McCaffrey [McC 2017]

Slide 77

Distributed saga protocol
• Prerequisites
  • Idempotent requests, may fail at the business level
  • Idempotent compensating requests, may not fail at the business level
  • Requests and compensating requests are commutative
  • Fault-tolerant and highly available distributed log
  • SEC that is automatically restarted after a crash
• Allows implementing the distributed saga protocol

Slide 78

Distributed saga protocol
• The distributed saga protocol guarantees that
  • either all requests eventually complete successfully
  • or a subset of the requests and their corresponding compensating requests are eventually executed

Slide 79

Distributed saga protocol
• Helps to avoid feral concurrency control (see [Bai 2015])
  • Coordination logic built into services
  • Mixed with business logic
  • Needs to cover distributed systems edge cases
  • Logic often hardcoded into services multiple times

Slide 80

Summary
• Avoid feral concurrency control
• Do not pollute services with coordination logic
• Build a coordinator instead that all services can simply use
• Makes service developers’ lives a lot easier

Slide 81


Note that asynchronous communication without a coordinator requires feral concurrency control

Slide 82


Building on Quicksand by Pat Helland & Dave Campbell [Hel 2009]

Slide 83


“Reliable systems have always been built out of unreliable components.” [Hel 2009]

Slide 84


Unfortunately, most software is written the opposite way: The software is unreliable and expects the underlying components to make the solution reliable

Slide 85


Going asynchronous …

Slide 86

What if …
• a service fails while processing a message?
• a service is stalled indefinitely while processing a message?
• a message is lost between services?
• a message is processed more than once?
• an old message re-emerges?
• a message contains garbage?
• a split-brain situation occurs?
• …

Slide 87

How do we detect such failures (without a coordinator)?
How do we handle such failures?
How do we reconcile (global) state after such failures occurred?

Slide 88

Most popular saga pattern explanations ignore these failure modes, blindly relying on the infrastructure and middleware to always guarantee perfect conditions

Slide 89

Welcome to the “100% available” trap!

Slide 90


“The deeper observation is that two things are coupled: 1) The change from synchronous checkpointing to asynchronous to save latency, and 2) The loss of a notion of an authoritative truth.” [Hel 2009]

Slide 91


“When the backup system that participates in the enforcement of these business rules is asynchronously tied to the primary, the enforcement of these [business] rules inevitably becomes probabilistic! It is the cost/benefit analysis that values lower latency from asynchronous checkpointing higher than the risk of slippage in the business rule that generalizes the fault tolerant algorithm.” [Hel 2009]

Slide 92


“Arguably, all computing really falls into three categories: memories, guesses, and apologies. The idea is that everything is done locally with a subset of the global knowledge. You know what you know when an action is performed. Since you have only a subset of the knowledge, your actions are really only guesses. When your knowledge as a replica increases, you may have an “Oh, crap!” moment. Reconciling your actions (as a replica) with the actions of an evil-twin of yours may result in recognition that there’s a mess to clean up. That may involve apologizing for your behavior (or the behavior of a replica).” [Hel 2009]

Slide 93

[Diagram: events arrive incomplete and out of order and update the current knowledge (memory); activities and decisions are triggered by and use that partial knowledge (guesses); detecting a violation causes a compensation decision (apologies). The memory/guess part is what we usually implement — assuming perfect global knowledge in each node; the apology part is what we usually do not implement.]

Slide 94


“In a loosely coupled world choosing some level of availability over consistency, it is best to think of all computing as memories, guesses, and apologies.” [Hel 2009]

Slide 95


Recommendation #1: Design operations as memories, guesses and apologies.

Slide 96


“In many applications, it is possible to express the business rules of the application and allow different operations to be reordered in their execution. When the operations occur on separate systems and wind their way to each other, it is essential to preserve these business rules.” [Hel 2009]

Slide 97


Recommendation #2: Design operations to be commutative and associative.

Slide 98


“The layering of an arbitrary application atop a storage subsystem inhibits reordering. Only when commutative operations are used can we achieve the desired loose coupling. Application operations can be commutative. WRITE is not commutative. Storage (i.e. READ and WRITE) is an annoying abstraction...” [Hel 2009]

Slide 99


Addendum: WRITES are neither commutative nor associative.
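The difference can be made concrete with a toy ledger: relative operations (“add 10”, “subtract 3”) commute, while absolute WRITEs (“set to 10”, “set to 7”) do not — their final state depends on arrival order. Purely illustrative:

```python
def apply_deltas(balance, deltas):
    """Commutative: the result is the same in any arrival order."""
    for delta in deltas:
        balance += delta
    return balance

def apply_writes(initial, writes):
    """Not commutative: the last writer wins."""
    value = initial
    for write in writes:
        value = write
    return value
```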

Slide 100


“It is essential to ensure that the work of a single operation is idempotent. This is an important design consideration in the creation of an application that can handle asynchrony as it tolerates faults and as it allows loose-coupling for latency, scale, and offline. Each operation must occur once (i.e. have the business impact of a single execution) even as it is processed (or simply logged) at multiple replicas.” [Hel 2009]

Slide 101


Recommendation #3: Design operations to be idempotent.

Slide 102

“Consider the new ACID (or ACID2.0). The letters stand for: Associative, Commutative, Idempotent, and Distributed. The goal for ACID2.0 is to succeed if the pieces of the work happen:
• At least once,
• Anywhere in the system,
• In any order.
This defines a new KIND of consistency. The individual steps happen at one or more system. The application is explicitly tolerant of work happening out of order. It is tolerant of the work happening more than once per machine, too.” [Hel 2009]

Slide 103


Combining recommendations #2 and #3: Design for ACID 2.0.
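A minimal ACID 2.0 style operation is a set union over observed items: it is associative, commutative, and idempotent, so replicas converge even when updates arrive at least once, anywhere, in any order. An illustrative sketch, not a full CRDT implementation:

```python
def merge(state, update):
    """Set union: associative, commutative, and idempotent."""
    return state | update

def converge(updates):
    """Apply updates in whatever order (and multiplicity) they arrive."""
    state = frozenset()
    for update in updates:
        state = merge(state, update)
    return state
```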

Slide 104


“When the application is constrained to the additional requirements of commutativity and associativity, the world gets a LOT easier. No longer must the state be checkpointed across failure units in a synchronous fashion. Instead, it is possible to be very lazy about the sharing of information. This opens up offline, slow links, low quality datacenters, and more.” [Hel 2009]

Slide 105


Some additional advice

Slide 106

“The major point is that availability (and its cousins offline and latency-reduction) may be traded off with classic notions of consistency. This tradeoff may frequently be applied across many different aspects at many levels of granularity within a single application.” [Hel 2009]

Slide 107


“The best model for coping with violations of the business rule is: 1. Send the problem to a human (via email or something else), 2. If that’s too expensive, write some business specific software to reduce the probability that a human needs to be involved.” [Hel 2009]

Slide 108


“Whenever the authors struggle with explaining how to implement loosely-coupled solutions, we look to how things were done before computers. In almost every case, we can find inspiration in paper forms, pneumatic tubes, and forms filed in triplicate.” [Hel 2009]

Slide 109

Summary
• Reliability must be built into the application
• Asynchronous communication means the loss of a notion of a global shared truth
• If using asynchronous coordination
  • Go for ACID 2.0
  • Think in terms of memories, guesses and apologies
  • Generalization of the saga pattern

Slide 110


Putting it all together

Slide 111

Pattern generalization

[Comparison table (✓/✗): rows — Sagas (original paper), Building on Quicksand, Applying the Saga Pattern, Distributed Sagas, popular explanations; columns — crash failures, distributed system failures (omission failures, latency, network disconnect / split brain, …), business-level failures, persistent failures (bugs, corrupted environment, …), compensating actions, forward completion, transaction splitting, effects of asynchronicity, hints for implementation of special cases, hints for saga design]

Slide 112

Design considerations
• Distributed failure modes must be considered
• Persistent failures must be considered
• Asynchronous communication changes everything
• Coordination and failure handling must be implemented at the application level (cannot be “left to middleware”)
• Forward recovery offers additional design options

Slide 113

Communication styles
• Synchronous communication
  • Tighter temporal coupling, but (relatively) easy to implement
  • Compensating actions need to be idempotent and must not raise business-level errors
  • Idempotent actions allow for optimized implementations
• Asynchronous communication
  • Loose temporal coupling, but challenging to implement
  • Make operations ACID 2.0
  • Think in terms of memories, guesses and apologies

Slide 114

Coordination styles
• Central coordinator
  • Avoids feral concurrency control
  • Keeps services free from coordination & fault tolerance logic
  • Additional component improving separation of concerns
• No central coordinator
  • Requires feral concurrency control
  • Coordination and fault tolerance logic spread across services
  • Also needed for asynchronous communication

Slide 115

Complementing thoughts
• Avoid the “100% available” trap
• Plan for (rare) manual intervention
  • Especially to handle persistent failures
• Check if forward recovery is an option
  • Requires that actions do not raise business-level errors
• Check if a context-specific implementation is an option
  • Allows for much simpler implementations
• Look at non-digital implementations for loose coupling

Slide 116


A final hint

Slide 117

If you need transactions that span multiple services, it is often a design smell.
Check first if a service redesign would lead to a better solution before resorting to sagas.

Slide 118


Wrap-up

Slide 119

Wrap-up
• Most popular saga pattern explanations are insufficient
  • Do not take distributed failure modes into account
  • Fall for the “100% available” trap
  • Ignore challenges of asynchronous communication
• Additional considerations needed for robust implementation
  • Carefully ponder communication and coordination style
  • Think about idempotency, ACID 2.0, forward recovery, …
  • Avoid feral concurrency control

Slide 120


“Reliable systems have always been built out of unreliable components.” [Hel 2009]

Slide 121

References (1/2)
[Bai 2015] Peter Bailis et al., “Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity”, SIGMOD ’15, 2015
[Gar 1987] Hector Garcia-Molina, Kenneth Salem, “Sagas”, SIGMOD ’87, 1987
[Hel 2007] Pat Helland, “Life beyond Distributed Transactions”, CIDR ’07, 2007

Slide 122

References (2/2)
[Hel 2009] Pat Helland & Dave Campbell, “Building on Quicksand”, CIDR ’09 Perspectives, 2009
[McC 2015] Caitie McCaffrey, “Applying the Saga Pattern”, GOTO Chicago 2015, https://www.youtube.com/watch?v=xDuwrtwYHu8
[McC 2017] Caitie McCaffrey, “Distributed Sagas”, J On The Beach 2017, https://www.youtube.com/watch?v=0UTOLRTwOX0

Slide 123

Uwe Friedrichsen
Works @ codecentric
https://twitter.com/ufried
https://www.speakerdeck.com/ufried
https://ufried.com/