Slide 1

Slide 1 text

The Road to Akka Cluster and Beyond… Jonas Bonér, CTO Typesafe, @jboner

Slide 2

Slide 2 text

What is a Distributed System?

Slide 3

Slide 3 text

What is a Distributed System? and Why would You Need one?

Slide 4

Slide 4 text

Distributed Computing is the New normal

Slide 5

Slide 5 text

Distributed Computing is the New normal you already have a distributed system, WHETHER you want it or not

Slide 6

Slide 6 text

Distributed Computing is the New normal. you already have a distributed system, WHETHER you want it or not: Mobile, NoSQL Databases, Cloud & REST Services, SQL Replication

Slide 7

Slide 7 text

What is the essence of distributed computing?

Slide 8

Slide 8 text

What is the essence of distributed computing? It’s to try to overcome: 1. Information travels at the speed of light 2. Independent things fail independently

Slide 9

Slide 9 text

Why do we need it?

Slide 10

Slide 10 text

Why do we need it? Elasticity When you outgrow the resources of a single node

Slide 11

Slide 11 text

Why do we need it? Elasticity When you outgrow the resources of a single node Availability Providing resilience if one node fails

Slide 12

Slide 12 text

Why do we need it? Elasticity When you outgrow the resources of a single node Availability Providing resilience if one node fails Rich stateful clients

Slide 13

Slide 13 text

So, what’s the problem?

Slide 14

Slide 14 text

So, what’s the problem? It is still Very Hard

Slide 15

Slide 15 text

The network is Inherently Unreliable

Slide 16

Slide 16 text

You can’t tell the DIFFERENCE Between a Slow NODE and a Dead NODE

Slide 17

Slide 17 text

Fallacies: Peter Deutsch’s 8 Fallacies of Distributed Computing

Slide 18

Slide 18 text

Fallacies: Peter Deutsch’s 8 Fallacies of Distributed Computing: 1. The network is reliable 2. Latency is zero 3. Bandwidth is infinite 4. The network is secure 5. Topology doesn't change 6. There is one administrator 7. Transport cost is zero 8. The network is homogeneous

Slide 19

Slide 19 text

So, oh yes…

Slide 20

Slide 20 text

So, oh yes… It is still Very Hard

Slide 21

Slide 21 text

Graveyard of distributed systems: 1. Guaranteed Delivery 2. Synchronous RPC 3. Distributed Objects 4. Distributed Shared Mutable State 5. Serializable Distributed Transactions

Slide 22

Slide 22 text

General strategies: Divide & Conquer. Partition for scale. Replicate for resilience

Slide 23

Slide 23 text

General strategies: Asynchronous Message-Passing. WHICH Requires SHARE NOTHING Designs

Slide 24

Slide 24 text

General strategies: Asynchronous Message-Passing, Location Transparency, Isolation & Containment. WHICH Requires SHARE NOTHING Designs

Slide 25

Slide 25 text

theoretical Models

Slide 26

Slide 26 text

A model for distributed Computation Should Allow explicit reasoning about: 1. Concurrency 2. Distribution 3. Mobility - Carlos Varela 2013

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Lambda Calculus Alonzo Church 1930

Slide 29

Slide 29 text

Lambda Calculus (Alonzo Church 1930). State: immutable state, managed through functional application, referentially transparent

Slide 30

Slide 30 text

Lambda Calculus (Alonzo Church 1930). State: immutable state, managed through functional application, referentially transparent. Order: β-reduction—can be performed in any order: Normal order, Applicative order, Call-by-name order, Call-by-value order, Call-by-need order

Slide 31

Slide 31 text

Lambda Calculus (Alonzo Church 1930). State: immutable state, managed through functional application, referentially transparent. Order: β-reduction—can be performed in any order: Normal order, Applicative order, Call-by-name order, Call-by-value order, Call-by-need order. Even in parallel

Slide 32

Slide 32 text

Lambda Calculus (Alonzo Church 1930). State: immutable state, managed through functional application, referentially transparent. Order: β-reduction—can be performed in any order: Normal order, Applicative order, Call-by-name order, Call-by-value order, Call-by-need order. Even in parallel. Supports Concurrency

Slide 33

Slide 33 text

Lambda Calculus (Alonzo Church 1930). State: immutable state, managed through functional application, referentially transparent. Order: β-reduction—can be performed in any order: Normal order, Applicative order, Call-by-name order, Call-by-value order, Call-by-need order. Even in parallel. Supports Concurrency. No model for Distribution

Slide 34

Slide 34 text

Lambda Calculus (Alonzo Church 1930). State: immutable state, managed through functional application, referentially transparent. Order: β-reduction—can be performed in any order: Normal order, Applicative order, Call-by-name order, Call-by-value order, Call-by-need order. Even in parallel. Supports Concurrency. No model for Distribution. No model for Mobility
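The order-independence above is easy to see in code. A tiny Scala sketch (illustrative, not from the talk; uses the Scala 2.12-era parallel collections): pure, referentially transparent functions give the same result under any evaluation order, even in parallel.

def square(x: Int): Int = x * x                     // pure: no mutable state, referentially transparent
val sequential = List(3, 4, 5).map(square).sum      // applicative order, left to right
val parallel   = List(3, 4, 5).par.map(square).sum  // reduced in any order, even in parallel
assert(sequential == parallel)                      // always 50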

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

Von Neumann machine (John von Neumann 1945) [diagram: Memory, Control Unit, Arithmetic Logic Unit, Accumulator, Input, Output]

Slide 37

Slide 37 text

Von Neumann machine, John von Neumann 1945

Slide 38

Slide 38 text

Von Neumann machine (John von Neumann 1945). State: mutable state, in-place updates

Slide 39

Slide 39 text

Von Neumann machine (John von Neumann 1945). State: mutable state, in-place updates. Order: total order—list of instructions, array of memory

Slide 40

Slide 40 text

Von Neumann machine (John von Neumann 1945). State: mutable state, in-place updates. Order: total order—list of instructions, array of memory. No model for Concurrency

Slide 41

Slide 41 text

Von Neumann machine (John von Neumann 1945). State: mutable state, in-place updates. Order: total order—list of instructions, array of memory. No model for Concurrency. No model for Distribution

Slide 42

Slide 42 text

Von Neumann machine (John von Neumann 1945). State: mutable state, in-place updates. Order: total order—list of instructions, array of memory. No model for Concurrency. No model for Distribution. No model for Mobility

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

transactions Jim Gray 1981

Slide 45

Slide 45 text

Transactions (Jim Gray 1981). State: isolation of updates, atomicity

Slide 46

Slide 46 text

Transactions (Jim Gray 1981). State: isolation of updates, atomicity. Order: serializability—disorder across transactions, illusion of order within transactions

Slide 47

Slide 47 text

Transactions (Jim Gray 1981). State: isolation of updates, atomicity. Order: serializability—disorder across transactions, illusion of order within transactions. Concurrency Works Well

Slide 48

Slide 48 text

Transactions (Jim Gray 1981). State: isolation of updates, atomicity. Order: serializability—disorder across transactions, illusion of order within transactions. Concurrency Works Well. Distribution Does Not Work Well

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

actors Carl HEWITT 1973

Slide 51

Slide 51 text

Actors (Carl Hewitt 1973). State: share nothing, atomicity within the actor

Slide 52

Slide 52 text

Actors (Carl Hewitt 1973). State: share nothing, atomicity within the actor. Order: async message passing, non-determinism in message delivery

Slide 53

Slide 53 text

Actors (Carl Hewitt 1973). State: share nothing, atomicity within the actor. Order: async message passing, non-determinism in message delivery. Great model for Concurrency

Slide 54

Slide 54 text

Actors (Carl Hewitt 1973). State: share nothing, atomicity within the actor. Order: async message passing, non-determinism in message delivery. Great model for Concurrency. Great model for Distribution

Slide 55

Slide 55 text

Actors (Carl Hewitt 1973). State: share nothing, atomicity within the actor. Order: async message passing, non-determinism in message delivery. Great model for Concurrency. Great model for Distribution. Great model for Mobility
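To make this concrete, a minimal Akka actor (a sketch against the Akka 2.3-era API; the Counter example is made up for illustration): state stays private to the actor, and all interaction is asynchronous message passing.

import akka.actor._

class Counter extends Actor {
  private var count = 0                    // share nothing: state is never exposed directly
  def receive = {
    case "increment" => count += 1         // atomicity within the actor:
    case "get"       => sender() ! count   // one message is processed at a time
  }
}

val system  = ActorSystem("demo")
val counter = system.actorOf(Props[Counter], "counter")
counter ! "increment"  // async, fire-and-forget; ordering across senders is non-deterministic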

Slide 56

Slide 56 text

Other interesting models that are suitable for distributed systems: 1. Pi Calculus 2. Ambient Calculus 3. Join Calculus

Slide 57

Slide 57 text

The State of the Art

Slide 58

Slide 58 text

Impossibility Theorems

Slide 59

Slide 59 text

Impossibility of Distributed Consensus with One Faulty Process

Slide 60

Slide 60 text

FLP: Impossibility of Distributed Consensus with One Faulty Process. Fischer, Lynch & Paterson 1985

Slide 61

Slide 61 text

FLP: Impossibility of Distributed Consensus with One Faulty Process. Fischer, Lynch & Paterson 1985. Consensus is impossible

Slide 62

Slide 62 text

FLP: Impossibility of Distributed Consensus with One Faulty Process. Fischer, Lynch & Paterson 1985. Consensus is impossible: “The FLP result shows that in an asynchronous setting, where only one processor might crash, there is no distributed algorithm that solves the consensus problem” - The Paper Trail

Slide 63

Slide 63 text

FLP: Impossibility of Distributed Consensus with One Faulty Process. Fischer, Lynch & Paterson 1985

Slide 64

Slide 64 text

FLP: Impossibility of Distributed Consensus with One Faulty Process. Fischer, Lynch & Paterson 1985. “These results do not show that such problems cannot be “solved” in practice; rather, they point up the need for more refined models of distributed computing” - FLP paper

Slide 65

Slide 65 text

No content

Slide 66

Slide 66 text

CAP Theorem

Slide 67

Slide 67 text

CAP Theorem: Linearizability is impossible

Slide 68

Slide 68 text

CAP Theorem: Linearizability is impossible. Conjecture by Eric Brewer 2000, proof by Lynch & Gilbert 2002

Slide 69

Slide 69 text

CAP Theorem: Linearizability is impossible. Conjecture by Eric Brewer 2000, proof by Lynch & Gilbert 2002: “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services”

Slide 70

Slide 70 text

linearizability

Slide 71

Slide 71 text

Linearizability: “Under linearizable consistency, all operations appear to have executed atomically in an order that is consistent with the global real-time ordering of operations.” - Herlihy & Wing 1991

Slide 72

Slide 72 text

Linearizability: “Under linearizable consistency, all operations appear to have executed atomically in an order that is consistent with the global real-time ordering of operations.” - Herlihy & Wing 1991. Less formally: a read will return the last completed write (made on any replica)

Slide 73

Slide 73 text

dissecting CAP

Slide 74

Slide 74 text

dissecting CAP 1. Very influential—but very NARROW scope

Slide 75

Slide 75 text

dissecting CAP 1. Very influential—but very NARROW scope 2. “[CAP] has led to confusion and misunderstandings regarding replica consistency, transactional isolation and high availability” - Bailis et al. in the HAT paper

Slide 76

Slide 76 text

dissecting CAP 1. Very influential—but very NARROW scope 2. “[CAP] has led to confusion and misunderstandings regarding replica consistency, transactional isolation and high availability” - Bailis et al. in the HAT paper 3. Linearizability is very often NOT required

Slide 77

Slide 77 text

dissecting CAP 1. Very influential—but very NARROW scope 2. “[CAP] has led to confusion and misunderstandings regarding replica consistency, transactional isolation and high availability” - Bailis et al. in the HAT paper 3. Linearizability is very often NOT required 4. Ignores LATENCY—but in practice latency & partitions are deeply related

Slide 78

Slide 78 text

dissecting CAP 1. Very influential—but very NARROW scope 2. “[CAP] has led to confusion and misunderstandings regarding replica consistency, transactional isolation and high availability” - Bailis et al. in the HAT paper 3. Linearizability is very often NOT required 4. Ignores LATENCY—but in practice latency & partitions are deeply related 5. Partitions are RARE—so why sacrifice C or A ALL the time?

Slide 79

Slide 79 text

dissecting CAP 1. Very influential—but very NARROW scope 2. “[CAP] has led to confusion and misunderstandings regarding replica consistency, transactional isolation and high availability” - Bailis et al. in the HAT paper 3. Linearizability is very often NOT required 4. Ignores LATENCY—but in practice latency & partitions are deeply related 5. Partitions are RARE—so why sacrifice C or A ALL the time? 6. NOT black and white—can be fine-grained and dynamic

Slide 80

Slide 80 text

dissecting CAP 1. Very influential—but very NARROW scope 2. “[CAP] has led to confusion and misunderstandings regarding replica consistency, transactional isolation and high availability” - Bailis et al. in the HAT paper 3. Linearizability is very often NOT required 4. Ignores LATENCY—but in practice latency & partitions are deeply related 5. Partitions are RARE—so why sacrifice C or A ALL the time? 6. NOT black and white—can be fine-grained and dynamic 7. Read ‘CAP Twelve Years Later’ - Eric Brewer

Slide 81

Slide 81 text

consensus

Slide 82

Slide 82 text

Consensus: “The problem of reaching agreement among remote processes is one of the most fundamental problems in distributed computing and is at the core of many algorithms for distributed data processing, distributed file management, and fault-tolerant distributed applications.” - Fischer, Lynch & Paterson 1985

Slide 83

Slide 83 text

Consistency models

Slide 84

Slide 84 text

Consistency models Strong

Slide 85

Slide 85 text

Consistency models Strong Weak

Slide 86

Slide 86 text

Consistency models Strong Weak Eventual

Slide 87

Slide 87 text

Time & Order

Slide 88

Slide 88 text

Last write wins global clock timestamp

Slide 89

Slide 89 text

Last write wins global clock timestamp

Slide 90

Slide 90 text

Lamport Clocks (Leslie Lamport 1978): logical clock, causal consistency

Slide 91

Slide 91 text

Lamport Clocks (Leslie Lamport 1978): logical clock, causal consistency. 1. When a process does work, increment the counter

Slide 92

Slide 92 text

Lamport Clocks (Leslie Lamport 1978): logical clock, causal consistency. 1. When a process does work, increment the counter 2. When a process sends a message, include the counter

Slide 93

Slide 93 text

Lamport Clocks (Leslie Lamport 1978): logical clock, causal consistency. 1. When a process does work, increment the counter 2. When a process sends a message, include the counter 3. When a message is received, merge the counter (set the counter to max(local, received) + 1)
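The three rules translate almost directly into code. A minimal Scala sketch (illustrative; this LamportClock shape is made up and is not Akka's implementation):

final case class LamportClock(time: Long = 0L) {
  // 1. when a process does work, increment the counter
  def tick: LamportClock = copy(time + 1)
  // 2. when a process sends a message, include the counter
  def stamp: Long = time
  // 3. when a message is received, merge: max(local, received) + 1
  def merge(received: Long): LamportClock = copy(math.max(time, received) + 1)
}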

Slide 94

Slide 94 text

Vector Clocks (Colin Fidge 1988): extends Lamport Clocks

Slide 95

Slide 95 text

Vector Clocks (Colin Fidge 1988): extends Lamport Clocks. 1. Each node owns and increments its own Lamport Clock

Slide 96

Slide 96 text

Vector Clocks (Colin Fidge 1988): extends Lamport Clocks. 1. Each node owns and increments its own Lamport Clock [node -> lamport clock]

Slide 97

Slide 97 text

Vector Clocks (Colin Fidge 1988): extends Lamport Clocks. 1. Each node owns and increments its own Lamport Clock [node -> lamport clock]

Slide 98

Slide 98 text

Vector Clocks (Colin Fidge 1988): extends Lamport Clocks. 1. Each node owns and increments its own Lamport Clock [node -> lamport clock] 2. Always keep the full history of all increments

Slide 99

Slide 99 text

Vector Clocks (Colin Fidge 1988): extends Lamport Clocks. 1. Each node owns and increments its own Lamport Clock [node -> lamport clock] 2. Always keep the full history of all increments 3. Merges by calculating the max—monotonic merge
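A minimal Scala sketch of the same idea (hypothetical shape, simplified compared to real implementations such as Akka's internal VectorClock):

final case class VectorClock(entries: Map[String, Long] = Map.empty) {
  // 1. each node owns and increments its own Lamport clock [node -> lamport clock]
  def increment(node: String): VectorClock =
    copy(entries.updated(node, entries.getOrElse(node, 0L) + 1))
  // 3. monotonic merge: pairwise max over the union of all entries
  def merge(that: VectorClock): VectorClock =
    copy((entries.keySet ++ that.entries.keySet).map { n =>
      n -> math.max(entries.getOrElse(n, 0L), that.entries.getOrElse(n, 0L))
    }.toMap)
  // happened-before: dominated in every entry, and not equal
  def isBefore(that: VectorClock): Boolean =
    this != that && entries.forall { case (n, t) => t <= that.entries.getOrElse(n, 0L) }
}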

Slide 100

Slide 100 text

Quorum

Slide 101

Slide 101 text

Quorum Strict majority vote

Slide 102

Slide 102 text

Quorum Strict majority vote Sloppy partial vote

Slide 103

Slide 103 text

Quorum: Strict majority vote / Sloppy partial vote • Most use R + W > N ⇒ R & W overlap

Slide 104

Slide 104 text

Quorum: Strict majority vote / Sloppy partial vote • Most use R + W > N ⇒ R & W overlap • If N / 2 + 1 is still alive ⇒ all good

Slide 105

Slide 105 text

Quorum: Strict majority vote / Sloppy partial vote • Most use R + W > N ⇒ R & W overlap • If N / 2 + 1 is still alive ⇒ all good • Most use N = 3
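Worked example: with N = 3 a common choice is R = 2, W = 2. Then R + W = 4 > 3, so every read quorum overlaps the last successful write quorum, and the cluster stays available with N / 2 + 1 = 2 replicas alive. As code (a hypothetical helper, just stating the rule):

def overlaps(n: Int, r: Int, w: Int): Boolean = r + w > n  // strict quorum condition
assert(overlaps(n = 3, r = 2, w = 2))   // reads are guaranteed to see the last write
assert(!overlaps(n = 3, r = 1, w = 1))  // no overlap: reads may return stale values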

Slide 106

Slide 106 text

failure Detection

Slide 107

Slide 107 text

Failure detection Formal model

Slide 108

Slide 108 text

Failure detection, formal model: Strong completeness

Slide 109

Slide 109 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process

Slide 110

Slide 110 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows)

Slide 111

Slide 111 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness

Slide 112

Slide 112 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process

Slide 113

Slide 113 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process (someone knows)

Slide 114

Slide 114 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process (someone knows). Strong accuracy

Slide 115

Slide 115 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process (someone knows). Strong accuracy: no correct process is suspected ever

Slide 116

Slide 116 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process (someone knows). Strong accuracy: no correct process is suspected ever (no false positives)

Slide 117

Slide 117 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process (someone knows). Strong accuracy: no correct process is suspected ever (no false positives). Weak accuracy

Slide 118

Slide 118 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process (someone knows). Strong accuracy: no correct process is suspected ever (no false positives). Weak accuracy: some correct process is never suspected

Slide 119

Slide 119 text

Failure detection, formal model: Strong completeness: every crashed process is eventually suspected by every correct process (everyone knows). Weak completeness: every crashed process is eventually suspected by some correct process (someone knows). Strong accuracy: no correct process is suspected ever (no false positives). Weak accuracy: some correct process is never suspected (some false positives)

Slide 120

Slide 120 text

Accrual Failure detector Hayashibara et. al. 2004

Slide 121

Slide 121 text

Accrual Failure Detector (Hayashibara et al. 2004). Keeps history of heartbeat statistics

Slide 122

Slide 122 text

Accrual Failure Detector (Hayashibara et al. 2004). Keeps history of heartbeat statistics. Decouples monitoring from interpretation

Slide 123

Slide 123 text

Accrual Failure Detector (Hayashibara et al. 2004). Keeps history of heartbeat statistics. Decouples monitoring from interpretation. Calculates a likelihood (phi value) that the process is down

Slide 124

Slide 124 text

Accrual Failure Detector (Hayashibara et al. 2004). Keeps history of heartbeat statistics. Decouples monitoring from interpretation. Calculates a likelihood (phi value) that the process is down, not YES or NO

Slide 125

Slide 125 text

Accrual Failure Detector (Hayashibara et al. 2004). Keeps history of heartbeat statistics. Decouples monitoring from interpretation. Calculates a likelihood (phi value) that the process is down, not YES or NO. Takes network hiccups into account

Slide 126

Slide 126 text

Accrual Failure Detector (Hayashibara et al. 2004). Keeps history of heartbeat statistics. Decouples monitoring from interpretation. Calculates a likelihood (phi value) that the process is down, not YES or NO. Takes network hiccups into account. phi = -log10(1 - F(timeSinceLastHeartbeat)), where F is the cumulative distribution function of a normal distribution with mean and standard deviation estimated from historical heartbeat inter-arrival times
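A rough Scala sketch of that formula (illustrative: the normal CDF is replaced by a standard logistic approximation, similar in spirit to what Akka's PhiAccrualFailureDetector does internally):

// phi = -log10(1 - F(timeSinceLastHeartbeat))
def phi(timeSinceLastHeartbeat: Double, mean: Double, stdDev: Double): Double = {
  val y = (timeSinceLastHeartbeat - mean) / stdDev    // how unusual is this silence?
  val e = math.exp(-y * (1.5976 + 0.070566 * y * y))  // logistic approximation: e / (1 + e) ≈ 1 - F
  -math.log10(e / (1.0 + e))
}
// phi ≈ 1 means roughly a 10% chance of a false positive, phi ≈ 2 means 1%, and so on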

Slide 127

Slide 127 text

SWIM Failure detector das et. al. 2002

Slide 128

Slide 128 text

SWIM Failure Detector (Das et al. 2002). Separates heartbeats from cluster dissemination

Slide 129

Slide 129 text

SWIM Failure Detector (Das et al. 2002). Separates heartbeats from cluster dissemination. Quarantine: suspected ⇒ time window ⇒ faulty

Slide 130

Slide 130 text

SWIM Failure Detector (Das et al. 2002). Separates heartbeats from cluster dissemination. Quarantine: suspected ⇒ time window ⇒ faulty. Delegated heartbeat to bridge network splits

Slide 131

Slide 131 text

byzantine Failure detector liskov et. al. 1999

Slide 132

Slide 132 text

Byzantine Failure Detector (Liskov et al. 1999). Supports misbehaving processes

Slide 133

Slide 133 text

Byzantine Failure Detector (Liskov et al. 1999). Supports misbehaving processes. Omission failures

Slide 134

Slide 134 text

Byzantine Failure Detector (Liskov et al. 1999). Supports misbehaving processes. Omission failures: crash failures, failing to receive a request, or failing to send a response

Slide 135

Slide 135 text

Byzantine Failure Detector (Liskov et al. 1999). Supports misbehaving processes. Omission failures: crash failures, failing to receive a request, or failing to send a response. Commission failures

Slide 136

Slide 136 text

Byzantine Failure Detector (Liskov et al. 1999). Supports misbehaving processes. Omission failures: crash failures, failing to receive a request, or failing to send a response. Commission failures: processing a request incorrectly, corrupting local state, and/or sending an incorrect or inconsistent response to a request

Slide 137

Slide 137 text

Byzantine Failure Detector (Liskov et al. 1999). Supports misbehaving processes. Omission failures: crash failures, failing to receive a request, or failing to send a response. Commission failures: processing a request incorrectly, corrupting local state, and/or sending an incorrect or inconsistent response to a request. Very expensive, not practical

Slide 138

Slide 138 text

replication

Slide 139

Slide 139 text

Types of replication: Active (Push) vs Passive (Pull), Asynchronous vs Synchronous

Slide 140

Slide 140 text

master/slave Replication

Slide 141

Slide 141 text

Tree replication

Slide 142

Slide 142 text

master/master Replication

Slide 143

Slide 143 text

buddy Replication

Slide 144

Slide 144 text

buddy Replication

Slide 145

Slide 145 text

analysis of replication consensus strategies Ryan Barrett 2009

Slide 146

Slide 146 text

Strong Consistency

Slide 147

Slide 147 text

Distributed transactions Strikes Back

Slide 148

Slide 148 text

Highly Available Transactions (HAT, not CAP), Peter Bailis et al. 2013

Slide 149

Slide 149 text

Highly Available Transactions (HAT, not CAP), Peter Bailis et al. 2013. Executive Summary

Slide 150

Slide 150 text

Highly Available Transactions (HAT, not CAP), Peter Bailis et al. 2013. Executive Summary: • Most SQL DBs do not provide Serializability, but weaker guarantees—for performance reasons

Slide 151

Slide 151 text

Highly Available Transactions (HAT, not CAP), Peter Bailis et al. 2013. Executive Summary: • Most SQL DBs do not provide Serializability, but weaker guarantees—for performance reasons • Some weaker transaction guarantees are possible to implement in a HA manner

Slide 152

Slide 152 text

Highly Available Transactions (HAT, not CAP), Peter Bailis et al. 2013. Executive Summary: • Most SQL DBs do not provide Serializability, but weaker guarantees—for performance reasons • Some weaker transaction guarantees are possible to implement in a HA manner • What transaction semantics can be provided with HA?

Slide 153

Slide 153 text

HAT

Slide 154

Slide 154 text

HAT. Unavailable: • Serializable • Snapshot Isolation • Repeatable Read • Cursor Stability • etc. Highly Available: • Read Committed • Read Uncommitted • Read Your Writes • Monotonic Atomic View • Monotonic Read/Write • etc.

Slide 155

Slide 155 text

Other scalable or Highly Available Transactional Research

Slide 156

Slide 156 text

Other scalable or Highly Available Transactional Research Bolt-On Consistency Bailis et. al. 2013

Slide 157

Slide 157 text

Other scalable or Highly Available Transactional Research: Bolt-On Consistency, Bailis et al. 2013. Calvin, Thomson et al. 2012

Slide 158

Slide 158 text

Other scalable or Highly Available Transactional Research: Bolt-On Consistency, Bailis et al. 2013. Calvin, Thomson et al. 2012. Spanner (Google), Corbett et al. 2012

Slide 159

Slide 159 text

consensus Protocols

Slide 160

Slide 160 text

Specification

Slide 161

Slide 161 text

Specification Properties

Slide 162

Slide 162 text

Specification. Events: 1. Request(v) 2. Decide(v). Properties:

Slide 163

Slide 163 text

Specification. Events: 1. Request(v) 2. Decide(v). Properties: 1. Termination: every process eventually decides on a value v

Slide 164

Slide 164 text

Specification. Events: 1. Request(v) 2. Decide(v). Properties: 1. Termination: every process eventually decides on a value v 2. Validity: if a process decides v, then v was proposed by some process

Slide 165

Slide 165 text

Specification. Events: 1. Request(v) 2. Decide(v). Properties: 1. Termination: every process eventually decides on a value v 2. Validity: if a process decides v, then v was proposed by some process 3. Integrity: no process decides twice

Slide 166

Slide 166 text

Specification. Events: 1. Request(v) 2. Decide(v). Properties: 1. Termination: every process eventually decides on a value v 2. Validity: if a process decides v, then v was proposed by some process 3. Integrity: no process decides twice 4. Agreement: no two correct processes decide differently

Slide 167

Slide 167 text

Consensus Algorithms CAP

Slide 168

Slide 168 text

Consensus Algorithms CAP

Slide 169

Slide 169 text

Consensus Algorithms: VR, Oki & Liskov 1988. CAP

Slide 170

Slide 170 text

Consensus Algorithms: VR, Oki & Liskov 1988. Paxos, Lamport 1989. CAP

Slide 171

Slide 171 text

Consensus Algorithms: VR, Oki & Liskov 1988. Paxos, Lamport 1989. ZAB, Reed & Junqueira 2008. CAP

Slide 172

Slide 172 text

Consensus Algorithms: VR, Oki & Liskov 1988. Paxos, Lamport 1989. ZAB, Reed & Junqueira 2008. Raft, Ongaro & Ousterhout 2013. CAP

Slide 173

Slide 173 text

Event Log

Slide 174

Slide 174 text

Immutability: Immutable Data, Share Nothing Architecture. “Immutability Changes Everything” - Pat Helland

Slide 175

Slide 175 text

Immutability (Immutable Data, Share Nothing Architecture) is the path towards TRUE Scalability. “Immutability Changes Everything” - Pat Helland

Slide 176

Slide 176 text

Think In Facts. “The database is a cache of a subset of the log” - Pat Helland

Slide 177

Slide 177 text

Think In Facts. “The database is a cache of a subset of the log” - Pat Helland. Never delete data. Knowledge only grows. Append-Only Event Log. Use Event Sourcing and/or CQRS

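A tiny Scala sketch of the idea (hypothetical Account/Event types; a real system would persist the log with a journal such as Akka Persistence): state is a fold, a "cache", of the append-only event log.

sealed trait Event                                        // facts: immutable, never deleted
final case class Deposited(amount: BigDecimal) extends Event
final case class Withdrawn(amount: BigDecimal) extends Event

final case class Account(balance: BigDecimal = 0) {
  def applyEvent(e: Event): Account = e match {
    case Deposited(a) => copy(balance = balance + a)
    case Withdrawn(a) => copy(balance = balance - a)
  }
}

val log   = Vector[Event](Deposited(100), Withdrawn(30))  // the append-only event log
val state = log.foldLeft(Account())(_ applyEvent _)       // replaying the log yields Account(70)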
Slide 178

Slide 178 text

Aggregate Roots Can wrap multiple Entities Aggregate Root is the Transactional Boundary

Slide 179

Slide 179 text

Aggregate Roots Can wrap multiple Entities. Aggregate Root is the Transactional Boundary: Strong Consistency Within Aggregate, Eventual Consistency Between Aggregates

Slide 180

Slide 180 text

Aggregate Roots Can wrap multiple Entities. Aggregate Root is the Transactional Boundary: Strong Consistency Within Aggregate, Eventual Consistency Between Aggregates. No limit to scalability

Slide 181

Slide 181 text

eventual Consistency

Slide 182

Slide 182 text

Dynamo (Vogels et al. 2007). Very influential. CAP

Slide 183

Slide 183 text

Dynamo (Vogels et al. 2007). Very influential. CAP. Popularized: • Eventual consistency • Epidemic gossip • Consistent hashing • Hinted handoff • Read repair • Anti-Entropy w/ Merkle trees

Slide 184

Slide 184 text

Consistent Hashing Karger et. al. 1997

Slide 185

Slide 185 text

Consistent Hashing (Karger et al. 1997). Supports elasticity: easier to scale up and down. Avoids hotspots. Enables partitioning and replication

Slide 186

Slide 186 text

Consistent Hashing (Karger et al. 1997). Supports elasticity: easier to scale up and down. Avoids hotspots. Enables partitioning and replication. Only K/N keys need to be remapped when adding or removing a node (K = #keys, N = #nodes)
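A minimal consistent hash ring sketch in Scala (illustrative only, with virtual nodes to smooth the key distribution; not Akka's ConsistentHash):

import scala.collection.immutable.TreeMap

final class HashRing(nodes: Set[String], virtualNodes: Int = 100) {
  // place virtualNodes points per node on a sorted ring of hashes
  private val ring: TreeMap[Int, String] =
    TreeMap((for (n <- nodes.toSeq; v <- 0 until virtualNodes)
      yield (n + "#" + v).hashCode -> n): _*)
  // walk clockwise to the first point at or after the key's hash, wrapping around
  def nodeFor(key: String): String = {
    val tail = ring.from(key.hashCode)
    (if (tail.nonEmpty) tail.head else ring.head)._2
  }
}

Only the arcs adjacent to a joining or leaving node's points get remapped, which is where the K/N bound above comes from.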

Slide 187

Slide 187 text

How eventual is

Slide 188

Slide 188 text

How eventual is Eventual consistency?

Slide 189

Slide 189 text

How eventual is Eventual consistency? How consistent is Eventual consistency?

Slide 190

Slide 190 text

How eventual is Eventual consistency? How consistent is Eventual consistency? PBS: Probabilistically Bounded Staleness, Peter Bailis et al. 2012

Slide 191

Slide 191 text

How eventual is Eventual consistency? How consistent is Eventual consistency? PBS: Probabilistically Bounded Staleness, Peter Bailis et al. 2012

Slide 192

Slide 192 text

epidemic Gossip

Slide 193

Slide 193 text

Node ring & Epidemic Gossip (CHORD, Stoica et al 2001)

Slide 194

Slide 194 text

Node ring & Epidemic Gossip (CHORD, Stoica et al 2001) [diagram: ring of gossiping member nodes]

Slide 195

Slide 195 text

Node ring & Epidemic Gossip (CHORD, Stoica et al 2001) [diagram: ring of gossiping member nodes]

Slide 196

Slide 196 text

Node ring & Epidemic Gossip (CHORD, Stoica et al 2001) [diagram: ring of gossiping member nodes]

Slide 197

Slide 197 text

Node ring & Epidemic Gossip (CHORD, Stoica et al 2001) [diagram: ring of gossiping member nodes]. CAP

Slide 198

Slide 198 text

Benefits of Epidemic Gossip: Decentralized P2P. No SPOF or SPOB. Very Scalable. Fully Elastic. Requires minimal administration. Often used with VECTOR CLOCKS

Slide 199

Slide 199 text

Some Standard Optimizations to Epidemic Gossip: 1. Separation of failure detection heartbeat and dissemination of data - Das et al. 2002 (SWIM) 2. Push/Pull gossip - Khambatti et al. 2003 (1. hash and compare data 2. use a single hash or Merkle Trees)

Slide 200

Slide 200 text

disorderly Programming

Slide 201

Slide 201 text

ACID 2.0

Slide 202

Slide 202 text

ACID 2.0 Associative Batch-insensitive (grouping doesn't matter) a+(b+c)=(a+b)+c

Slide 203

Slide 203 text

ACID 2.0 Associative Batch-insensitive (grouping doesn't matter) a+(b+c)=(a+b)+c Commutative Order-insensitive (order doesn't matter) a+b=b+a

Slide 204

Slide 204 text

ACID 2.0 Associative Batch-insensitive (grouping doesn't matter) a+(b+c)=(a+b)+c Commutative Order-insensitive (order doesn't matter) a+b=b+a Idempotent Retransmission-insensitive (duplication does not matter) a+a=a

Slide 205

Slide 205 text

ACID 2.0 Associative Batch-insensitive (grouping doesn't matter) a+(b+c)=(a+b)+c Commutative Order-insensitive (order doesn't matter) a+b=b+a Idempotent Retransmission-insensitive (duplication does not matter) a+a=a Eventually Consistent
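Set union is the canonical example of an operation with all three properties; a quick Scala check (illustrative):

// union is Associative, Commutative and Idempotent: replicas can merge
// in any grouping, any order, any number of times and still converge
val (a, b, c) = (Set(1), Set(2), Set(3))
assert(((a union b) union c) == (a union (b union c)))  // associative
assert((a union b) == (b union a))                      // commutative
assert((a union a) == a)                                // idempotent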

Slide 206

Slide 206 text

Convergent & Commutative Replicated Data Types Shapiro et. al. 2011

Slide 207

Slide 207 text

Convergent & Commutative Replicated Data Types (CRDT), Shapiro et al. 2011

Slide 208

Slide 208 text

Convergent & Commutative Replicated Data Types (CRDT), Shapiro et al. 2011. Join Semilattice: monotonic merge function

Slide 209

Slide 209 text

Convergent & Commutative Replicated Data Types (CRDT), Shapiro et al. 2011. Join Semilattice: monotonic merge function. Data types: Counters, Registers, Sets, Maps, Graphs

Slide 210

Slide 210 text

Convergent & Commutative Replicated Data Types (CRDT), Shapiro et al. 2011. Join Semilattice: monotonic merge function. Data types: Counters, Registers, Sets, Maps, Graphs. CAP

Slide 211

Slide 211 text

2 TYPES of CRDTs CvRDT Convergent State-based CmRDT Commutative Ops-based

Slide 212

Slide 212 text

2 TYPES of CRDTs: CvRDT (Convergent, State-based): self-contained, holds all history. CmRDT (Commutative, Ops-based)

Slide 213

Slide 213 text

2 TYPES of CRDTs: CvRDT (Convergent, State-based): self-contained, holds all history. CmRDT (Commutative, Ops-based): needs a reliable broadcast channel
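A minimal CvRDT sketch in Scala, a grow-only counter in the spirit of the akka-crdt project (simplified, hypothetical shape): each node increments only its own entry, and merge is a monotonic pairwise max, i.e. a join on the semilattice.

final case class GCounter(state: Map[String, Long] = Map.empty) {
  def increment(node: String): GCounter =   // grow-only: no decrements
    copy(state.updated(node, state.getOrElse(node, 0L) + 1))
  def value: Long = state.values.sum        // the counter's current value
  def merge(that: GCounter): GCounter =     // associative, commutative, idempotent
    copy((state.keySet ++ that.state.keySet).map { n =>
      n -> math.max(state.getOrElse(n, 0L), that.state.getOrElse(n, 0L))
    }.toMap)
}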

Slide 214

Slide 214 text

CALM theorem Consistency As Logical Monotonicity Hellerstein et. al. 2011

Slide 215

Slide 215 text

CALM theorem: Consistency As Logical Monotonicity (Hellerstein et al. 2011). Bloom Language: compiler helps to detect & encapsulate non-monotonicity

Slide 216

Slide 216 text

CALM theorem: Consistency As Logical Monotonicity (Hellerstein et al. 2011). Distributed Logic: Datalog/Dedalus. Monotonic functions: just add facts to the system. Model state as Lattices, similar to CRDTs (without the scope problem). Bloom Language: compiler helps to detect & encapsulate non-monotonicity

Slide 217

Slide 217 text

The Akka Way

Slide 218

Slide 218 text

Akka Actors

Slide 219

Slide 219 text

Akka Actors Akka IO

Slide 220

Slide 220 text

Akka Actors Akka IO Akka REMOTE

Slide 221

Slide 221 text

Akka Actors Akka IO Akka REMOTE Akka CLUSTER

Slide 222

Slide 222 text

Akka Actors Akka IO Akka REMOTE Akka CLUSTER Akka CLUSTER EXTENSIONS

Slide 223

Slide 223 text

What is Akka CLUSTER all about? • Cluster Membership • Leader & Singleton • Cluster Sharding • Clustered Routers (adaptive, consistent hashing, …) • Clustered Supervision and Deathwatch • Clustered Pub/Sub • and more

Slide 224

Slide 224 text

cluster membership in Akka

Slide 225

Slide 225 text

cluster membership in Akka • Dynamo-style master-less decentralized P2P

Slide 226

Slide 226 text

cluster membership in Akka • Dynamo-style master-less decentralized P2P • Epidemic Gossip—Node Ring

Slide 227

Slide 227 text

cluster membership in Akka • Dynamo-style master-less decentralized P2P • Epidemic Gossip—Node Ring • Vector Clocks for causal consistency

Slide 228

Slide 228 text

cluster membership in Akka • Dynamo-style master-less decentralized P2P • Epidemic Gossip—Node Ring • Vector Clocks for causal consistency • Fully elastic with no SPOF or SPOB

Slide 229

Slide 229 text

cluster membership in Akka • Dynamo-style master-less decentralized P2P • Epidemic Gossip—Node Ring • Vector Clocks for causal consistency • Fully elastic with no SPOF or SPOB • Very scalable—2400 nodes (on GCE)

Slide 230

Slide 230 text

cluster membership in Akka • Dynamo-style master-less decentralized P2P • Epidemic Gossip—Node Ring • Vector Clocks for causal consistency • Fully elastic with no SPOF or SPOB • Very scalable—2400 nodes (on GCE) • High throughput—1000 nodes in 4 min (on GCE)
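Subscribing to the membership events produced by this gossip looks roughly like this (a sketch against the Akka 2.3-era Cluster extension API; see the Akka Cluster docs in the references):

import akka.actor._
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

class ClusterListener extends Actor with ActorLogging {
  val cluster = Cluster(context.system)
  override def preStart(): Unit =
    cluster.subscribe(self, classOf[MemberEvent], classOf[UnreachableMember])
  override def postStop(): Unit = cluster.unsubscribe(self)
  def receive = {
    case MemberUp(member)          => log.info("Member up: {}", member.address)
    case UnreachableMember(member) => log.info("Unreachable: {}", member.address)
    case _: MemberEvent            => // other membership transitions
  }
}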

Slide 231

Slide 231 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock)

Slide 232

Slide 232 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) Is a CRDT

Slide 233

Slide 233 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) Is a CRDT Ordered node ring

Slide 234

Slide 234 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) Is a CRDT Ordered node ring Seen set for convergence

Slide 235

Slide 235 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) Is a CRDT Ordered node ring Seen set for convergence Unreachable set

Slide 236

Slide 236 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) Is a CRDT Ordered node ring Seen set for convergence Unreachable set Version

Slide 237

Slide 237 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) 1. Picks random node with older/newer version Is a CRDT Ordered node ring Seen set for convergence Unreachable set Version

Slide 238

Slide 238 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) 1. Picks random node with older/newer version 2. Gossips in a request/reply fashion Is a CRDT Ordered node ring Seen set for convergence Unreachable set Version

Slide 239

Slide 239 text

State Gossip GOSSIPING case class Gossip( members: SortedSet[Member], seen: Set[Member], unreachable: Set[Member], version: VectorClock) 1. Picks random node with older/newer version 2. Gossips in a request/reply fashion 3. Updates internal state and adds himself to ‘seen’ set Is a CRDT Ordered node ring Seen set for convergence Unreachable set Version
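The same case class with the callouts above attached as comments (no behavior change):

case class Gossip(
  members: SortedSet[Member],  // ordered node ring
  seen: Set[Member],           // seen set for convergence
  unreachable: Set[Member],    // unreachable set
  version: VectorClock)        // version
// Gossip is itself a CRDT: states from different nodes merge monotonically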

Slide 240

Slide 240 text

Cluster Convergence

Slide 241

Slide 241 text

Cluster Convergence Reached when: 1. All nodes are represented in the seen set 2. No members are unreachable, or 3. All unreachable members have status down or exiting

Slide 242

Slide 242 text

GOSSIP BIASED

Slide 243

Slide 243 text

GOSSIP BIASED 80% bias to nodes not in seen table Up to 400 nodes, then reduced

Slide 244

Slide 244 text

PUSH/PULL GOSSIP

Slide 245

Slide 245 text

PUSH/PULL GOSSIP Variation

Slide 246

Slide 246 text

PUSH/PULL GOSSIP Variation case class Status(version: VectorClock)

Slide 247

Slide 247 text

LEADER ROLE

Slide 248

Slide 248 text

LEADER ROLE. Any node can be the leader

Slide 249

Slide 249 text

LEADER ROLE. Any node can be the leader. 1. No election, but deterministic

Slide 250

Slide 250 text

LEADER ROLE. Any node can be the leader. 1. No election, but deterministic 2. Can change after cluster convergence

Slide 251

Slide 251 text

LEADER ROLE. Any node can be the leader. 1. No election, but deterministic 2. Can change after cluster convergence 3. Leader has special duties

Slide 252

Slide 252 text

Node Lifecycle in Akka

Slide 253

Slide 253 text

Failure Detection

Slide 254

Slide 254 text

Failure Detection Hashes the node ring Picks 5 nodes Request/Reply heartbeat

Slide 255

Slide 255 text

Failure Detection Hashes the node ring Picks 5 nodes Request/Reply heartbeat To increase likelihood of bridging racks and data centers

Slide 256

Slide 256 text

Failure Detection. Hashes the node ring, picks 5 nodes, request/reply heartbeat. To increase likelihood of bridging racks and data centers. Used by: Cluster Membership, Remote Death Watch, Remote Supervision

Slide 257

Slide 257 text

Failure Detection Is an Accrual Failure Detector

Slide 258

Slide 258 text

Failure Detection Is an Accrual Failure Detector Does not help much in practice

Slide 259

Slide 259 text

Failure Detection Is an Accrual Failure Detector Does not help much in practice Need to add delay to deal with Garbage Collection

Slide 260

Slide 260 text

Failure Detection Is an Accrual Failure Detector Does not help much in practice Instead of this Need to add delay to deal with Garbage Collection

Slide 261

Slide 261 text

Failure Detection Is an Accrual Failure Detector Does not help much in practice Instead of this It often looks like this Need to add delay to deal with Garbage Collection

Slide 262

Slide 262 text

Network Partitions

Slide 263

Slide 263 text

Network Partitions • Failure Detector can mark an unavailable member Unreachable

Slide 264

Slide 264 text

Network Partitions • Failure Detector can mark an unavailable member Unreachable • If one node is Unreachable then no cluster Convergence

Slide 265

Slide 265 text

Network Partitions • Failure Detector can mark an unavailable member Unreachable • If one node is Unreachable then no cluster Convergence • This means that the Leader can no longer perform its duties

Slide 266

Slide 266 text

Network Partitions • Failure Detector can mark an unavailable member Unreachable • If one node is Unreachable then no cluster Convergence • This means that the Leader can no longer perform its duties: Split Brain

Slide 267

Slide 267 text

Network Partitions • Failure Detector can mark an unavailable member Unreachable • If one node is Unreachable then no cluster Convergence • This means that the Leader can no longer perform its duties • Member can come back from Unreachable—Else: Split Brain

Slide 268

Slide 268 text

Network Partitions • Failure Detector can mark an unavailable member Unreachable • If one node is Unreachable then no cluster Convergence • This means that the Leader can no longer perform its duties • Member can come back from Unreachable—Else: Split Brain • The node needs to be marked as Down—either through:

Slide 269

Slide 269 text

Network Partitions • Failure Detector can mark an unavailable member Unreachable • If one node is Unreachable then no cluster Convergence • This means that the Leader can no longer perform its duties • Member can come back from Unreachable—Else: Split Brain • The node needs to be marked as Down—either through: 1. auto-down 2. Manual down

Slide 270

Slide 270 text

Potential FUTURE Optimizations

Slide 271

Slide 271 text

Potential FUTURE Optimizations • Vector Clock HISTORY pruning

Slide 272

Slide 272 text

Potential FUTURE Optimizations • Vector Clock HISTORY pruning • Delegated heartbeat

Slide 273

Slide 273 text

Potential FUTURE Optimizations • Vector Clock HISTORY pruning • Delegated heartbeat • “Real” push/pull gossip

Slide 274

Slide 274 text

Potential FUTURE Optimizations • Vector Clock HISTORY pruning • Delegated heartbeat • “Real” push/pull gossip • More out-of-the-box auto-down patterns

Slide 275

Slide 275 text

Akka Modules For Distribution

Slide 276

Slide 276 text

Akka Modules For Distribution Akka Cluster Akka Remote Akka HTTP Akka IO

Slide 277

Slide 277 text

Akka Modules For Distribution Akka Cluster Akka Remote Akka HTTP Akka IO Clustered Singleton Clustered Routers Clustered Pub/Sub Cluster Client Consistent Hashing

Slide 278

Slide 278 text

…and Beyond

Slide 279

Slide 279 text

Akka & The Road Ahead Akka HTTP Akka Streams Akka CRDT Akka Raft

Slide 280

Slide 280 text

Akka & The Road Ahead Akka HTTP Akka Streams Akka CRDT Akka Raft Akka 2.4

Slide 281

Slide 281 text

Akka & The Road Ahead Akka HTTP Akka Streams Akka CRDT Akka Raft Akka 2.4 Akka 2.4

Slide 282

Slide 282 text

Akka & The Road Ahead Akka HTTP Akka Streams Akka CRDT Akka Raft Akka 2.4 Akka 2.4 ?

Slide 283

Slide 283 text

Akka & The Road Ahead: Akka HTTP (Akka 2.4), Akka Streams (Akka 2.4), Akka CRDT (?), Akka Raft (?)

Slide 284

Slide 284 text

Eager for more?

Slide 285

Slide 285 text

Try AKKA out akka.io

Slide 286

Slide 286 text

Join us at React Conf San Francisco Nov 18-21 reactconf.com

Slide 287

Slide 287 text

Join us at React Conf San Francisco Nov 18-21 reactconf.com Early Registration ends tomorrow

Slide 288

Slide 288 text

References
• General Distributed Systems
  • Summary of network reliability post-mortems—more terrifying than the most horrifying Stephen King novel: http://aphyr.com/posts/288-the-network-is-reliable
  • A Note on Distributed Computing: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.7628
  • On the problems with RPC: http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf
  • 8 Fallacies of Distributed Computing: https://blogs.oracle.com/jag/resource/Fallacies.html
  • 6 Misconceptions of Distributed Computing: www.dsg.cs.tcd.ie/~vjcahill/sigops98/papers/vogels.ps
  • Distributed Computing Systems—A Foundational Approach: http://www.amazon.com/Programming-Distributed-Computing-Systems-Foundational/dp/0262018985
  • Introduction to Reliable and Secure Distributed Programming: http://www.distributedprogramming.net/
  • Nice short overview on Distributed Systems: http://book.mixu.net/distsys/
  • Meta list of distributed systems readings: https://gist.github.com/macintux/6227368

Slide 289

Slide 289 text

References
• Actor Model
  • Great discussion between Erik Meijer & Carl Hewitt on the essence of the Actor Model: http://channel9.msdn.com/Shows/Going+Deep/Hewitt-Meijer-and-Szyperski-The-Actor-Model-everything-you-wanted-to-know-but-were-afraid-to-ask
  • Carl Hewitt’s 1973 paper defining the Actor Model: http://worrydream.com/refs/Hewitt-ActorModel.pdf
  • Gul Agha’s Doctoral Dissertation: https://dspace.mit.edu/handle/1721.1/6952

Slide 290

Slide 290 text

References
• FLP
  • Impossibility of Distributed Consensus with One Faulty Process: http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf
  • A Brief Tour of FLP: http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/
• CAP
  • Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services: http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
  • You Can’t Sacrifice Partition Tolerance: http://codahale.com/you-cant-sacrifice-partition-tolerance/
  • Linearizability: A Correctness Condition for Concurrent Objects: http://courses.cs.vt.edu/~cs5204/fall07-kafura/Papers/TransactionalMemory/Linearizability.pdf
  • CAP Twelve Years Later: How the "Rules" Have Changed: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
  • Consistency vs. Availability: http://www.infoq.com/news/2008/01/consistency-vs-availability

Slide 291

Slide 291 text

References
• Time & Order
  • Post on the problems with Last Write Wins in Riak: http://aphyr.com/posts/285-call-me-maybe-riak
  • Time, Clocks, and the Ordering of Events in a Distributed System: http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf
  • Vector Clocks: http://zoo.cs.yale.edu/classes/cs426/2012/lab/bib/fidge88timestamps.pdf
• Failure Detection
  • Unreliable Failure Detectors for Reliable Distributed Systems: http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf
  • The ϕ Accrual Failure Detector: http://ddg.jaist.ac.jp/pub/HDY+04.pdf
  • SWIM Failure Detector: http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf
  • Practical Byzantine Fault Tolerance: http://www.pmg.lcs.mit.edu/papers/osdi99.pdf

Slide 292

Slide 292 text

References
• Transactions
  • Jim Gray’s classic book: http://www.amazon.com/Transaction-Processing-Concepts-Techniques-Management/dp/1558601902
  • Highly Available Transactions: Virtues and Limitations: http://www.bailis.org/papers/hat-vldb2014.pdf
  • Bolt on Consistency: http://db.cs.berkeley.edu/papers/sigmod13-bolton.pdf
  • Calvin: Fast Distributed Transactions for Partitioned Database Systems: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf
  • Spanner: Google's Globally-Distributed Database: http://research.google.com/archive/spanner.html
  • Life beyond Distributed Transactions: an Apostate’s Opinion: https://cs.brown.edu/courses/cs227/archives/2012/papers/weaker/cidr07p15.pdf
  • Immutability Changes Everything—Pat Helland’s talk at Ricon: http://vimeo.com/52831373
  • Unshackle Your Domain (Event Sourcing): http://www.infoq.com/presentations/greg-young-unshackle-qcon08
  • CQRS: http://martinfowler.com/bliki/CQRS.html

Slide 293

Slide 293 text

References
• Consensus
  • Paxos Made Simple: http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf
  • Paxos Made Moderately Complex: http://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf
  • A simple totally ordered broadcast protocol (ZAB): labs.yahoo.com/files/ladis08.pdf
  • In Search of an Understandable Consensus Algorithm (Raft): https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
  • Replication strategy comparison diagram: http://snarfed.org/transactions_across_datacenters_io.html
  • Distributed Snapshots: Determining Global States of Distributed Systems: http://www.cs.swarthmore.edu/~newhall/readings/snapshots.pdf

Slide 294

Slide 294 text

References
• Eventual Consistency
  • Dynamo: Amazon’s Highly Available Key-value Store: http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf
  • Consistency vs. Availability: http://www.infoq.com/news/2008/01/consistency-vs-availability
  • Consistent Hashing and Random Trees: http://thor.cs.ucsb.edu/~ravenben/papers/coreos/kll+97.pdf
  • PBS: Probabilistically Bounded Staleness: http://pbs.cs.berkeley.edu/

Slide 295

Slide 295 text

References
• Epidemic Gossip
  • Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications: http://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf
  • Gossip-style Failure Detector: http://www.cs.cornell.edu/home/rvr/papers/GossipFD.pdf
  • GEMS: http://www.hcs.ufl.edu/pubs/GEMS2005.pdf
  • Efficient Reconciliation and Flow Control for Anti-Entropy Protocols: http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
  • 2400 Akka nodes on GCE: http://typesafe.com/blog/running-a-2400-akka-nodes-cluster-on-google-compute-engine
  • Starting 1000 Akka nodes in 4 min: http://typesafe.com/blog/starting-up-a-1000-node-akka-cluster-in-4-minutes-on-google-compute-engine
  • Push Pull Gossiping: http://khambatti.com/mujtaba/ArticlesAndPapers/pdpta03.pdf
  • SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol: http://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf

Slide 296

Slide 296 text

References
• Conflict-Free Replicated Data Types (CRDTs)
  • A comprehensive study of Convergent and Commutative Replicated Data Types: http://hal.upmc.fr/docs/00/55/55/88/PDF/techreport.pdf
  • Marc Shapiro talks about CRDTs at Microsoft: http://research.microsoft.com/apps/video/dl.aspx?id=153540
  • Akka CRDT project: https://github.com/jboner/akka-crdt
• CALM
  • Dedalus: Datalog in Time and Space: http://db.cs.berkeley.edu/papers/datalog2011-dedalus.pdf
  • CALM: http://www.cs.berkeley.edu/~palvaro/cidr11.pdf
  • Logic and Lattices for Distributed Programming: http://db.cs.berkeley.edu/papers/UCB-lattice-tr.pdf
  • Bloom Language website: http://bloom-lang.net
  • Joe Hellerstein talks about CALM: http://vimeo.com/53904989

Slide 297

Slide 297 text

References
• Akka Cluster
  • My Akka Cluster Implementation Notes: https://gist.github.com/jboner/7692270
  • Akka Cluster Specification: http://doc.akka.io/docs/akka/snapshot/common/cluster.html
  • Akka Cluster Docs: http://doc.akka.io/docs/akka/snapshot/scala/cluster-usage.html
  • Akka Failure Detector Docs: http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#Failure_Detector
  • Akka Roadmap: https://docs.google.com/a/typesafe.com/document/d/18W9-fKs55wiFNjXL9q50PYOnR7-nnsImzJqHOPPbM4E/mobilebasic?pli=1&hl=en_US
  • Where Akka Came From: http://letitcrash.com/post/40599293211/where-akka-came-from

Slide 298

Slide 298 text

any Questions?

Slide 299

Slide 299 text

The Road to Akka Cluster and Beyond… Jonas Bonér, CTO Typesafe, @jboner