Slide 1

Slide 1 text

@tyler_treat Distributed Systems Are a UX Problem
 Tyler Treat / O’Reilly Software Architecture Conference / October 30, 2018

Slide 2

Slide 2 text

@tyler_treat Tyler Treat

Slide 3

Slide 3 text

@tyler_treat I like distributed systems.

Slide 4

Slide 4 text

@tyler_treat

Slide 5

Slide 5 text

@tyler_treat

Slide 6

Slide 6 text

@tyler_treat Disclaimer:
 I know approximately nothing about UX…

Slide 7

Slide 7 text

@tyler_treat …other than when I’m the user, I know when my experience is good and when it’s bad.

Slide 8

Slide 8 text

@tyler_treat

Slide 9

Slide 9 text

@tyler_treat UX

Slide 10

Slide 10 text

@tyler_treat UX Systems

Slide 11

Slide 11 text

@tyler_treat UX Systems

Slide 12

Slide 12 text

@tyler_treat UX Systems Business

Slide 13

Slide 13 text

@tyler_treat UX Systems Business This
 Talk

Slide 14

Slide 14 text

@tyler_treat The Yin and Yang of UX and Architecture

Slide 15

Slide 15 text

@tyler_treat Monolith

Slide 16

Slide 16 text

@tyler_treat Monolith

Slide 17

Slide 17 text

@tyler_treat Service Service Service Service Service Service Service Service Service

Slide 18

Slide 18 text

@tyler_treat Service Service Service Service Service Service Service Service Service

Slide 19

Slide 19 text

@tyler_treat Service Service Service Service Service Service Service Service Service

Slide 20

Slide 20 text

@tyler_treat Implications

Slide 21

Slide 21 text

@tyler_treat

Slide 22

Slide 22 text

@tyler_treat book trip Trip Service Trip Database transaction Good old days

Slide 23

Slide 23 text

@tyler_treat book trip Microservices Airline Service Hotel Service Car Service Trip Service transaction transaction transaction

Slide 24

Slide 24 text

@tyler_treat book trip Microservices Airline Service Hotel Service Car Service Trip Service transaction transaction transaction ACID ACID ACID

Slide 25

Slide 25 text

@tyler_treat UX Implications of Microservices
 • Data consistency

Slide 26

Slide 26 text

@tyler_treat Service Service Service Service Service Service Service Service Service

Slide 27

Slide 27 text

@tyler_treat Service Service Service Service Service Service Service Service Service

Slide 28

Slide 28 text

@tyler_treat UX Implications of Microservices
 • Data consistency
 • Race conditions

Slide 29

Slide 29 text

@tyler_treat

Slide 30

Slide 30 text

@tyler_treat UX Implications of Microservices
 • Data consistency
 • Race conditions
 • Performance

Slide 31

Slide 31 text

@tyler_treat book trip Microservices Airline Service Hotel Service Car Service Trip Service transaction transaction transaction

Slide 32

Slide 32 text

@tyler_treat book trip Microservices Airline Service Hotel Service Car Service Trip Service transaction transaction transaction

Slide 33

Slide 33 text

@tyler_treat UX Implications of Microservices
 • Data consistency
 • Race conditions
 • Performance
 • Partial failure

Slide 34

Slide 34 text

@tyler_treat So are microservices bad?

Slide 35

Slide 35 text

@tyler_treat Microservices are about
 people scale.

Slide 36

Slide 36 text

@tyler_treat Transparency

Slide 37

Slide 37 text

@tyler_treat A Study of Transparency and Adaptability of Heterogeneous Computer Networks with TCP/IP and IPv6 Protocols
 Das, 2012
 “Any change in a computing system, such as a new feature or new component, is transparent if the system after change adheres to previous external interface as much as possible while changing its internal behavior.”

Slide 38

Slide 38 text

@tyler_treat System

Slide 39

Slide 39 text

@tyler_treat System

Slide 40

Slide 40 text

@tyler_treat High Transparency Low Transparency

Slide 41

Slide 41 text

@tyler_treat NFS High Transparency Low Transparency

Slide 42

Slide 42 text

@tyler_treat NFS FTP High Transparency Low Transparency

Slide 43

Slide 43 text

@tyler_treat Types of Transparencies
 • Access transparency
 • Location transparency
 • Migration transparency
 • Relocation transparency
 • Replication transparency
 • Concurrent transparency
 • Failure transparency
 • Persistence transparency
 • Security transparency

Slide 44

Slide 44 text

@tyler_treat Transparency is about usability.

Slide 45

Slide 45 text

@tyler_treat Usability Control

Slide 46

Slide 46 text

@tyler_treat Usability Control

Slide 47

Slide 47 text

@tyler_treat Usability Control

Slide 48

Slide 48 text

@tyler_treat Simplicity Flexibility, Performance,
 Correctness RPC

Slide 49

Slide 49 text

@tyler_treat Simplicity Flexibility, Performance,
 Correctness Erlang Message Passing

Slide 50

Slide 50 text

@tyler_treat RPC Erlang
 Message Passing High Transparency Low Transparency

Slide 51

Slide 51 text

@tyler_treat Translating UX for developers: APIs

Slide 52

Slide 52 text

@tyler_treat Transparencies simplify the API of a system.

Slide 53

Slide 53 text

@tyler_treat UX is about deciding what knobs to expose.

Slide 54

Slide 54 text

@tyler_treat The Truth is Prohibitively Expensive
 Balancing Consistency and UX

Slide 55

Slide 55 text

@tyler_treat book trip Trip Service Trip Database transaction Good old days

Slide 56

Slide 56 text

@tyler_treat book trip Trip Service Trip Database transaction Good old days Transparency

Slide 57

Slide 57 text

@tyler_treat book trip Microservices Airline Service Hotel Service Car Service Trip Service transaction transaction transaction Transparency

Slide 58

Slide 58 text

@tyler_treat book trip Microservices Airline Service Hotel Service Car Service Trip Service transaction transaction transaction ACID ACID ACID Transparency

Slide 59

Slide 59 text

@tyler_treat

Slide 60

Slide 60 text

@tyler_treat

Slide 61

Slide 61 text

@tyler_treat

Slide 62

Slide 62 text

@tyler_treat Spreadsheet service

Slide 63

Slide 63 text

@tyler_treat Spreadsheet service Document service

Slide 64

Slide 64 text

@tyler_treat Spreadsheet service Document service Presentation service

Slide 65

Slide 65 text

@tyler_treat Spreadsheet service Document service Presentation service IAM service

Slide 66

Slide 66 text

@tyler_treat Spreadsheet service Document service Presentation service IAM service consistent

Slide 67

Slide 67 text

@tyler_treat Consistency is about ordering of events in a distributed system.

Slide 68

Slide 68 text

@tyler_treat Why is this hard?

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

@tyler_treat So what can we do?

Slide 71

Slide 71 text

@tyler_treat Coordinate

Slide 72

Slide 72 text

@tyler_treat Two-Phase Commit

Slide 73

Slide 73 text

@tyler_treat book trip 2PC Prepare Airline Service Hotel Service Car Service Trip Service propose propose propose

Slide 74

Slide 74 text

@tyler_treat book trip 2PC Prepare Airline Service Hotel Service Car Service Trip Service vote vote vote

Slide 75

Slide 75 text

@tyler_treat book trip 2PC Commit Airline Service Hotel Service Car Service Trip Service commit/abort commit/abort commit/abort

Slide 76

Slide 76 text

@tyler_treat book trip 2PC Commit Airline Service Hotel Service Car Service Trip Service done done done

Slide 77

Slide 77 text

@tyler_treat Problems with 2PC
 • Chatty protocol: beholden to network latency
 • Limited throughput
 • Transaction coordinator: single point of failure
 • Blocking protocol: susceptible to deadlock
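
The prepare and commit rounds from the preceding slides can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not a production protocol: `Participant`, `two_phase_commit`, and the `can_commit` flag are hypothetical names, and there is no network, timeout, or crash recovery. Every step is a synchronous call made by one coordinator, which hints at why the protocol is chatty and why a stalled coordinator leaves prepared participants waiting.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical stand-in for a service such as Airline, Hotel, or Car."""
    name: str
    can_commit: bool = True
    state: str = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote yes only if the local work could be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (prepare): propose to every participant and collect the votes.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (commit): send the same decision to every participant.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

A single "no" vote aborts the whole trip, even for participants that were ready to commit.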

Slide 78

Slide 78 text

@tyler_treat book trip 2PC Prepare Airline Service Hotel Service Car Service Trip Service propose propose propose

Slide 79

Slide 79 text

@tyler_treat book trip 2PC Prepare Airline Service Hotel Service Car Service Trip Service propose propose propose

Slide 80

Slide 80 text

@tyler_treat book trip 2PC Prepare Airline Service Hotel Service Car Service Trip Service propose propose propose

Slide 81

Slide 81 text

@tyler_treat Add more phases!

Slide 82

Slide 82 text

@tyler_treat Three-Phase Commit

Slide 83

Slide 83 text

@tyler_treat

Slide 84

Slide 84 text

@tyler_treat atomic clocks NTP GPS TrueTime

Slide 85

Slide 85 text

@tyler_treat Good news:
 we solved physics.

Slide 86

Slide 86 text

@tyler_treat Bad news:
 it costs all the money.

Slide 87

Slide 87 text

@tyler_treat Not exactly…

Slide 88

Slide 88 text

@tyler_treat Spanner: Google’s Globally-Distributed Database
 Corbett et al.

Slide 89

Slide 89 text

@tyler_treat TrueTime forces that uncertainty to the surface, and Spanner provides a transparency over it.

Slide 90

Slide 90 text

@tyler_treat Spanner doesn’t avoid trade-offs, it just minimizes their probability.

Slide 91

Slide 91 text

@tyler_treat Spanner is expensive and proprietary.

Slide 92

Slide 92 text

@tyler_treat But it’s not the end of the story…

Slide 93

Slide 93 text

@tyler_treat Unless every service is backed by the same database, you probably still have to deal with consistency problems.

Slide 94

Slide 94 text

@tyler_treat Challenges to Adopting Stronger Consistency at Scale
 Ajoux et al., 2015
 “The biggest barrier to providing stronger consistency guarantees…is that the consistency mechanism must integrate consistency across many stateful services.”

Slide 95

Slide 95 text

@tyler_treat Coordination is expensive because processes can’t make progress independently.

Slide 96

Slide 96 text

@tyler_treat

Slide 97

Slide 97 text

@tyler_treat

Slide 98

Slide 98 text

@tyler_treat Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design

Slide 99

Slide 99 text

@tyler_treat And what about partial failure?

Slide 100

Slide 100 text

@tyler_treat

Slide 101

Slide 101 text

@tyler_treat

Slide 102

Slide 102 text

@tyler_treat

Slide 103

Slide 103 text

@tyler_treat

Slide 104

Slide 104 text

@tyler_treat

Slide 105

Slide 105 text

@tyler_treat Memories, Guesses, and Apologies
 Dealing with Partial Knowledge

Slide 106

Slide 106 text

@tyler_treat The cost of knowing the “truth” can be prohibitively expensive.

Slide 107

Slide 107 text

@tyler_treat And partial failure means the “truth” is also fragile.

Slide 108

Slide 108 text

@tyler_treat Where does this leave us?

Slide 109

Slide 109 text

@tyler_treat We could go back to the monolith.

Slide 110

Slide 110 text

@tyler_treat We could build expensive data centers with fancy hardware…

Slide 111

Slide 111 text

@tyler_treat …or we could rethink our transparencies.

Slide 112

Slide 112 text

@tyler_treat

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

@tyler_treat Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

Slide 115

Slide 115 text

@tyler_treat Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

Slide 116

Slide 116 text

@tyler_treat Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

Slide 117

Slide 117 text

@tyler_treat Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

Slide 118

Slide 118 text

@tyler_treat Exception Handling in Asynchronous Systems

Slide 119

Slide 119 text

@tyler_treat

Slide 120

Slide 120 text

@tyler_treat Exception Handling in Asynchronous Systems
 • Write-off

Slide 121

Slide 121 text

@tyler_treat

Slide 122

Slide 122 text

@tyler_treat Exception Handling in Asynchronous Systems
 • Write-off
 • Retry

Slide 123

Slide 123 text

@tyler_treat

Slide 124

Slide 124 text

@tyler_treat Exception Handling in Asynchronous Systems
 • Write-off
 • Retry
 • Compensating action
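
The three strategies can be combined in one small helper. The sketch below is illustrative only: `run_with_recovery`, its parameters, and the choice of `ConnectionError` as the retryable failure are assumptions, not something the slides specify. It retries first, then either runs a compensating action or, if none is supplied, writes the failure off.

```python
from typing import Callable, Optional


def run_with_recovery(
    operation: Callable[[], str],
    compensate: Optional[Callable[[], None]] = None,
    attempts: int = 3,
) -> tuple[str, Optional[str]]:
    """Try an operation a few times, then compensate or write it off."""
    for _ in range(attempts):
        try:
            return ("ok", operation())
        except ConnectionError:
            # Retry: assumes the operation is safe to repeat.
            continue
    if compensate is None:
        # Write-off: accept the loss and move on.
        return ("written_off", None)
    # Compensating action: undo whatever already happened.
    compensate()
    return ("compensated", None)
```

Which branch is acceptable is a business decision: a write-off is cheap for a low-value operation, while a compensating action such as a refund suits cases where the customer has already paid.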

Slide 125

Slide 125 text

@tyler_treat Revisiting Two-Phase Commit

Slide 126

Slide 126 text

@tyler_treat Sagas

Slide 127

Slide 127 text

@tyler_treat Sagas
 Garcia-Molina & Salem, 1987
 “A long-lived transaction is a saga if it can be written as a sequence of transactions that can be interleaved with other transactions…Either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.”

Slide 128

Slide 128 text

@tyler_treat Sagas
 Garcia-Molina & Salem, 1987
 “A long-lived transaction is a saga if it can be written as a sequence of transactions that can be interleaved with other transactions…Either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.”

Slide 129

Slide 129 text

@tyler_treat Sagas split long-lived transactions into individual, interleaved sub-transactions:
 T = T1, T2, ..., Tn

Slide 130

Slide 130 text

@tyler_treat And each sub-transaction has a compensating transaction:
 C1, C2, ..., Cn

Slide 131

Slide 131 text

@tyler_treat Sagas guarantee one of two execution sequences:
 T1, T2, ..., Tn
 T1, T2, ..., Tj, Cj, ..., C2, C1

Slide 132

Slide 132 text

@tyler_treat book trip Airline Service Hotel Service Car Service Trip Service transaction transaction transaction

Slide 133

Slide 133 text

@tyler_treat T = T1, T2, ..., Tn
 • Book flight
 • Book hotel
 • Book car
 • Charge money

Slide 134

Slide 134 text

@tyler_treat C1, C2, ..., Cn
 • Cancel flight
 • Cancel hotel
 • Cancel car
 • Refund money
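
The booking and cancellation steps above map onto a small saga runner. This is a minimal sketch, assuming each step is a plain callable that raises on failure; `run_saga` and the `Step` tuple are hypothetical names, and a real implementation would also persist its progress so that it could resume after a crash. On failure at step j+1, it runs the compensations for the completed steps in reverse order, matching the sequence T1, ..., Tj, Cj, ..., C1.

```python
from typing import Callable

# (name, action, compensation); all three are illustrative assumptions.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run each step in order; on failure, compensate completed steps."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            # Undo the completed steps, most recent first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: done")
    return log
```

Between a sub-transaction and its compensation, other requests can observe the intermediate state, for example a booked flight that is later cancelled.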

Slide 135

Slide 135 text

@tyler_treat Compensating transactions must be idempotent.
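
One common way to get idempotence is to remember which compensations have already been applied, so that a retried request becomes a no-op. The sketch below assumes a hypothetical `RefundLedger` keyed by booking ID; the class, its methods, and the in-memory set are illustrative only, and a real system would store the applied IDs durably.

```python
class RefundLedger:
    """Hypothetical ledger that makes refunds safe to retry."""

    def __init__(self) -> None:
        self.balance = 0
        self._refunded: set[str] = set()

    def charge(self, booking_id: str, amount: int) -> None:
        self.balance += amount

    def refund(self, booking_id: str, amount: int) -> bool:
        # A repeated refund for the same booking changes nothing.
        if booking_id in self._refunded:
            return False
        self._refunded.add(booking_id)
        self.balance -= amount
        return True
```

If a refund times out and the caller cannot tell whether it was applied, it can simply send the refund again without risking a double payout.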

Slide 136

Slide 136 text

@tyler_treat Sagas trade off isolation for availability.

Slide 137

Slide 137 text

@tyler_treat Event-Driven

Slide 138

Slide 138 text

@tyler_treat book trip Airline Service Hotel Service Car Service Trip Service transaction transaction transaction

Slide 139

Slide 139 text

@tyler_treat event Airline Service Hotel Service Car Service Trip Service event event event

Slide 140

Slide 140 text

@tyler_treat event Airline Service Hotel Service Car Service Trip Service event event event

Slide 141

Slide 141 text

@tyler_treat System Properties Business Rules

Slide 142

Slide 142 text

@tyler_treat “People don’t want distributed transactions, they just want the guarantees that distributed transactions give them.”
 Sean T. Allen

Slide 143

Slide 143 text

@tyler_treat CAP theorem

Slide 144

Slide 144 text

@tyler_treat CAP Theorem
 • Consistency, Availability, Partition Tolerance
 • When a partition occurs, do we:
   • Choose availability and give up consistency?
   - or -
   • Choose consistency and give up availability?

Slide 145

Slide 145 text

@tyler_treat CAP Theorem
 • Consistency, Availability, Partition Tolerance
 • When a partition occurs, do we:
   • Choose availability and give up consistency?
   - or -
   • Choose consistency and give up availability?
   (or YOLO it)

Slide 146

Slide 146 text

@tyler_treat The CAP theorem is a UX question…

Slide 147

Slide 147 text

@tyler_treat When a partial failure occurs, how do you want the application to behave?

Slide 148

Slide 148 text

@tyler_treat

Slide 149

Slide 149 text

@tyler_treat

Slide 150

Slide 150 text

@tyler_treat We can choose consistency and sacrifice availability…

Slide 151

Slide 151 text

@tyler_treat …or we can choose availability by making local decisions with the knowledge at hand and designing the UX accordingly.

Slide 152

Slide 152 text

@tyler_treat Managing partial failure is a matter of dealing with partial knowledge…

Slide 153

Slide 153 text

@tyler_treat …and managing risk.

Slide 154

Slide 154 text

@tyler_treat Our risk appetite can drive business rules.
 Check value < $10,000?
 • yes: Clear locally
 • no: Double check with all replicas before clearing
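
The check-clearing rule can be written as a single function. This is a sketch under stated assumptions: `clear_check`, the `Replica` callable type, and the idea that each replica answers with a boolean are all hypothetical; only the $10,000 threshold comes from the slide.

```python
from typing import Callable, Sequence

# A replica is modelled as a function that approves or rejects an amount.
Replica = Callable[[float], bool]

LOCAL_CLEARING_LIMIT = 10_000  # dollars, the threshold shown on the slide


def clear_check(amount: float, local: Replica, others: Sequence[Replica]) -> bool:
    """Clear small checks with local knowledge; coordinate for large ones."""
    if amount < LOCAL_CLEARING_LIMIT:
        # Low stakes: guess using only the local replica and stay available.
        return local(amount)
    # High stakes: pay the coordination cost and ask every replica.
    return local(amount) and all(replica(amount) for replica in others)
```

Raising or lowering `LOCAL_CLEARING_LIMIT` trades availability against the cost of an occasional apology, which makes it a business setting as much as a technical one.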

Slide 155

Slide 155 text

@tyler_treat Memories, guesses, and apologies

Slide 156

Slide 156 text

@tyler_treat Computers operate with partial knowledge.

Slide 157

Slide 157 text

@tyler_treat Either there’s a disconnect with the “real world”…

Slide 158

Slide 158 text

@tyler_treat …or there’s a disconnect between systems.

Slide 159

Slide 159 text

@tyler_treat Systems don’t make decisions, they make guesses.

Slide 160

Slide 160 text

@tyler_treat Systems have memory.

Slide 161

Slide 161 text

@tyler_treat Memories help systems make better guesses in the future.

Slide 162

Slide 162 text

@tyler_treat Forgetfulness is a business decision.

Slide 163

Slide 163 text

@tyler_treat Sometimes the system guesses wrong.

Slide 164

Slide 164 text

@tyler_treat Systems need the capacity to apologize.

Slide 165

Slide 165 text

@tyler_treat Customers judge you not by your failures, but by how you handle your failures.

Slide 166

Slide 166 text

@tyler_treat Are you building systems that never fail or systems that fail gracefully?

Slide 167

Slide 167 text

@tyler_treat

Slide 168

Slide 168 text

@tyler_treat Businesses need both code and people to manage apologies.

Slide 169

Slide 169 text

@tyler_treat It becomes less about trying to build the perfect system and more about how we cope with an imperfect one.

Slide 170

Slide 170 text

@tyler_treat Wrapping Up
 Summary and Observations

Slide 171

Slide 171 text

@tyler_treat

Slide 172

Slide 172 text

@tyler_treat

Slide 173

Slide 173 text

@tyler_treat System Properties
 • ACID
 • distributed transactions
 • exactly-once delivery
 • ordered delivery
 • serializable isolation
 • linearizability

Slide 174

Slide 174 text

@tyler_treat System Properties
 • ACID
 • distributed transactions
 • exactly-once delivery
 • ordered delivery
 • serializable isolation
 • linearizability
 Business Rules / Application Invariants
 • negative account balance
 • two users sharing same ID
 • room double-booked
 • balance reconciles

Slide 175

Slide 175 text

@tyler_treat

Slide 176

Slide 176 text

@tyler_treat We put ourselves at the mercy of our infrastructure and hope it makes good on its promises.

Slide 177

Slide 177 text

@tyler_treat It often doesn’t.
 Kyle Kingsbury, 2015 http://jepsen.io

Slide 178

Slide 178 text

@tyler_treat When do we actually need consistency?

Slide 179

Slide 179 text

@tyler_treat

Slide 180

Slide 180 text

@tyler_treat We can use consistency when the stakes are high and the cost is worth it.

Slide 181

Slide 181 text

@tyler_treat And design our transparencies accordingly.

Slide 182

Slide 182 text

@tyler_treat We could try to build perfect systems.

Slide 183

Slide 183 text

@tyler_treat Should we build perfect systems or pragmatic systems?

Slide 184

Slide 184 text

@tyler_treat Systems that can compensate.

Slide 185

Slide 185 text

@tyler_treat Systems that can recover.

Slide 186

Slide 186 text

@tyler_treat Systems that can apologize.

Slide 187

Slide 187 text

@tyler_treat UX Systems Business

Slide 188

Slide 188 text

@tyler_treat Data Consistency Race Conditions Performance Partial Failure

Slide 189

Slide 189 text

@tyler_treat Data Consistency Race Conditions Performance Partial Failure Transparency Informs

Slide 190

Slide 190 text

@tyler_treat Thank You bravenewgeek.com
 realkinetic.com

Slide 191

Slide 191 text

@tyler_treat References
 • https://gotocon.com/dl/goto-chicago-2015/slides/CaitieMcCaffrey_ApplyingTheSagaPattern.pdf
 • http://ijcsits.org/papers/vol2no62012/42vol2no6.pdf
 • http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf
 • https://queue.acm.org/detail.cfm?id=2745385
 • https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf
 • http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf
 • https://bravenewgeek.com/distributed-systems-are-a-ux-problem/
 • http://www.cs.princeton.edu/~wlloyd/papers/challenges-hotos15.pdf
 • https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf
 • https://www.youtube.com/watch?v=lsKaNDj4TrE
 • Starbucks photo - https://www.geekwire.com/2015/starbucks-mobile-ordering-now-blankets-the-u-s-with-coverage-in-san-francisco-new-york-and-more-coming-today/
 • Friction image - https://byjus.com/physics/friction-in-automobiles/
 • Carbon copy forms - http://www.rainiercopy.com/forms.html
 • Rosetta Stone photo - https://en.wikipedia.org/wiki/Rosetta_Stone#/media/File:Rosetta_Stone.JPG