distributed key-value store.
three nodes.
assume no failures, all operations succeed.
userx
PUT { key: v }
datastore’s
timeline
Slide 5
distributed key-value store.
three nodes.
assume no failures, all operations succeed.
userx
PUT { key: v }
userx
PUT { key: v2 }
datastore’s
timeline
Slide 6
distributed key-value store.
three nodes.
assume no failures, all operations succeed.
userx
PUT { key: v }
userx
PUT { key: v2 }
usery
GET key ?
datastore’s
timeline
Slide 7
distributed key-value store.
three nodes.
assume no failures, all operations succeed.
value depends on
data store’s consistency model
userx
PUT { key: v }
userx
PUT { key: v2 }
usery
GET key ?
datastore’s
timeline
Slide 8
consistency model
set of guarantees the system makes about
what events will be visible, and when.
These guarantees are informed and enforced by the
timekeeping mechanisms used by the system.
set of valid timelines of events
Slide 9
computer clocks
the system clock, NTP, UNIX time.
stepping back
other timekeeping mechanisms
Spanner, Riak
Slide 10
computer clocks
Slide 11
the model
a distributed datastore:
multiple nodes, for fault tolerance, scalability, performance.
nodes are logical (processes) or physical (machines);
they are sequential, and
communicate by message-passing, i.e.
connected by an unreliable network, no shared memory.
data may be replicated, partitioned.
Slide 12
computers have clocks…
func measureX() {
start := time.Now()
x()
end := time.Now()
// Time x takes.
elapsed := end.Sub(start)
}
…can we use them?
Slide 13
computers have clocks…
func measureX() {
start := time.Now()
x()
end := time.Now()
// Time x takes.
elapsed := end.Sub(start)
}
…can we use them?
hardware clocks drift.
NTP is slow etc.
the system clock keeps Unix time.
?
Slide 14
Details vary by language, OS, architecture, hardware.
…but the details don’t matter today.
That said, we will be assuming Linux on an x86 processor.
a caveat
Slide 15
computer clocks are not hardware clocks, but
are “run” by hardware and the OS kernel.
time.Now() [MONOTONIC]
clock_gettime(CLOCK_REALTIME)
syscall to get the value of a particular computer clock:
the system clock or wall clock.
Gives the current UNIX timestamp.
hardware clocks drift
Slide 16
the system clock is a counter kept by hardware, the OS kernel.
at system boot:
set from the hardware clock
(or an external source like NTP);
the Real Time Clock (RTC) keeps UTC time.
subsequently:
incremented using a hardware ticker:
“hey HPET, interrupt me in 10ms”;
then when interrupted, knows to increment by 10ms.
“tickless” kernel: the interrupt interval (“tick”) is dynamically calculated.
Slide 17
the system clock is a counter kept by hardware, the OS kernel:
at system boot, set from the hardware clock
(or an external source like NTP);
subsequently, incremented using a hardware ticker.
these are the hardware clocks that drift,
causing the system clocks of different computers to change at different rates.
Slide 18
NTP is slow etc.
synchronizes the system clock to a highly accurate clock network:
gradually adjusts the clock rate (“skew”), or
sets a new value (“step”) if the differential is too large.
but:
need trusted, reachable NTP servers.
NTP is slow, up to hundreds of ms over the public internet.
stepping results in discontinuous jumps in time.
Slide 19
The system clock keeps UNIX time:
“number of seconds since epoch” (midnight UTC, 01.01.1970).
increases by exactly 86,400 seconds per day.
So, the 1000th day after the epoch = 86,400,000 seconds, etc.
…but a UTC day is not a constant 86,400 seconds!
Slide 20
interlude: UTC
messy compromise between:
atomic time: measured using atomic clocks.
very stable; this is what we want to use
(e.g. the (SI) second).
astronomical time: based on the Earth’s rotation.
matches the Earth’s position; sometimes useful (we’re told).
So, UTC:
based on atomic time,
adjusted to be in sync with the Earth’s rotational period.
Slide 21
interlude: UTC
messy compromise between:
atomic time: measured using atomic clocks.
very stable; this is what we want to use
(e.g. the (SI) second).
astronomical time: based on the Earth’s rotation.
matches the Earth’s position; sometimes useful (we’re told).
but problem…
Slide 22
the Earth’s rotation slows down over time.
To compensate for this drift, UTC periodically adds a second.
So, an astronomical day “takes longer” in
absolute (atomic) terms.
…so a UTC day may be 86,400 or 86,401 seconds!
23:59:59 → 23:59:60 (leap second) → 00:00:00
Slide 23
Unix time can’t represent the extra second, but
want the computer’s “current time” to be aligned with UTC
(in the long run):
The system clock keeps UNIX time
Unix: 23:59:59 → 23:59:59 (repeats!) → 00:00:00
UTC: 23:59:59 → 23:59:60 (leap second) → 00:00:00
Unix time is not monotonic.
Slide 24
not synchronized, not monotonic across nodes
hardware clocks drift.
NTP is slow etc.
the system clock keeps Unix time.
timestampA = 150
A
userX
PUT { k: v }
N1
N2
example:
Slide 25
not synchronized, not monotonic across nodes
hardware clocks drift.
NTP is slow etc.
the system clock keeps Unix time.
timestampA = 150
A
userX
PUT { k: v2 }
timestampB = 50
B
userX
PUT { k: v }
N1
N2
example:
Slide 26
not synchronized, not monotonic across nodes
hardware clocks drift.
NTP is slow etc.
the system clock keeps Unix time.
timestampA = 150
A
userX
PUT { k: v2 }
timestampB = 50
B
userX
PUT { k: v }
N1
N2
example:
ruh roh.
Slide 27
other timekeeping mechanisms
Slide 28
prelude
the timekeeping mechanism used by a system depends on:
desired consistency model:
what the valid timelines of events are.
desired availability:
how “responsive” the system is.
desired performance:
read and write latency, and so throughput.
availability and performance are the costs of
higher consistency
(CAP theorem, etc.)
Slide 29
spanner
• Distributed relational database
supports distributed transactions
• Horizontally scalable
data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent:
“a globally consistent ordering of
transactions that matches the observed
commit order.”
Slide 30
spanner
• Distributed relational database
supports distributed transactions
• Horizontally scalable
data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent:
“a globally consistent ordering of
transactions that matches the observed
commit order.”
savings
N1
checking
N2
Slide 31
spanner
• Distributed relational database
supports distributed transactions
• Horizontally scalable
data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent:
“a globally consistent ordering of
transactions that matches the observed
commit order.”
savings
N1
N1
G1
N2
checking
N2
G2
Slide 32
spanner
• Distributed relational database
supports distributed transactions
• Horizontally scalable
data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent:
“a globally consistent ordering of
transactions that matches the observed
commit order.”
Slide 33
spanner
• Distributed relational database
supports distributed transactions
• Horizontally scalable
data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent:
“a globally consistent ordering of
transactions that matches the observed
commit order.”
need:
desired consistency guarantees.
desired performance: reads from replicas, consistent snapshot reads.
consistent timeline across replicas: consensus.
need to order transactions across the system as well, and
the order to correspond to the observed commit order.
want:
reads to never contain T2 if they don’t also contain T1.
“globally consistent transaction order that corresponds to observed commit order”.
performant.
Slide 36
if T1 commits before T2 starts to commit, T1 is ordered before T2.
Can we enforce ordering using commit timestamps?
order of transactions == observed order
even if T1, T2 are across the globe!
Yes, with perfectly synchronized clocks.
…or, if you can know clock uncertainty perfectly,
and account for it.
Slide 37
TrueTime
tracks and exposes the uncertainty about perceived
time across system clocks.
explicitly represents time as an interval, not a point:
TT.now() → [earliest, latest]
interval that contains “true now”.
earliest is the earliest time that could be
“true now”; latest is the latest.
Slide 38
commit_ts(T1) = TT.now().latest
waits for one full uncertainty window
i.e. until commit_ts < TT.now().earliest
then, commits and replies.
if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
[diagram: G1 leader assigns T1 its commit ts]
Slide 39
commit_ts(T1) = TT.now().latest
waits for one full uncertainty window
i.e. until commit_ts < TT.now().earliest
then, commits and replies.
G1 leader
if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
[diagram: T1 takes its commit ts, then waits out the commit wait]
Slide 40
commit_ts(T1) = TT.now().latest
waits for one full uncertainty window
i.e. until commit_ts < TT.now().earliest
then, commits and replies.
G1 leader
if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
T1 commits.
this guarantees the commit_ts for the next transaction is higher, despite different clocks.
Slide 41
commit_ts(T2) = TT.now().latest
wait for one full uncertainty window
i.e. until commit_ts < TT.now().earliest
then, commit and reply.
G2 leader
[diagram: T1’s commit ts]
if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
[diagram: T2 takes its commit ts, waits out the commit wait, then commits]
Slide 42
TrueTime provides externally consistent
transaction commit timestamps,
so enables external consistency without coordination.
Spanner leverages the uncertainty window to provide
strongly consistent reads too.
…this is neat.
Slide 43
The uncertainty window affects commit wait time, and so
write latency and throughput.
Google uses impressive (and expensive!) infrastructure
to keep this small: ~7ms as of 2012.
but note
Slide 44
riak
• Distributed key-value database:
// A data item =
{"uuid1234": {"name": "ada"}}
• Highly available:
data partitioned and replicated,
decentralized, i.e. all replicas serve reads and writes.
• Eventually consistent:
“if no new updates are made to an object,
eventually all accesses will return the last
updated value.”
three replicas.
read_quorum = write_quorum = 1.
{ cart : [ A ] }
N1
N2
N3
userX
{ cart : [ A ]}
userX
{ cart : [ D ]}
cart: [ ]
Slide 47
three replicas.
read_quorum = write_quorum = 1.
{ cart : [ A ] }
{ cart : [ A ] }
N1
N2
N3
userX
{ cart : [ A ]}
userX
{ cart : [ D ]}
cart: [ ]
Slide 48
three replicas.
read_quorum = write_quorum = 1.
{ cart : [ A ] }
{ cart : [ A ] }
N1
N2
N3
userX
{ cart : [ A ]}
userX
{ cart : [ D ]}
cart: [ ]
Slide 49
three replicas.
read_quorum = write_quorum = 1.
{ cart : [ D ] }
{ cart : [ A ] }
N1
N2
N3
userX
{ cart : [ A ]}
userX
{ cart : [ D ]}
cart: [ ]
Slide 50
if no new updates are made to an object,
eventually all accesses will return the last updated value.
timekeeping
want:
any node serves reads and writes for availability
need:
determine causal updates for convergence to latest.
determine conflicting updates.
vector clocks
logical clocks that use versions as “timestamps”.
means to establish causal ordering.
{ cart : [ A ] }
N1
N2
N3
userY
{ cart : [ B ] }
userX
{ cart : [ A ]}
userX
{ cart : [ D ]}
[diagram: events A, B, C, D across nodes n1, n2, n3; each node’s vector clock starts at (0, 0, 0). A on n1 → (1, 0, 0); D on n1 → (2, 0, 0); B on n2 → (0, 1, 0); C on n3 → (0, 0, 1). userX’s PUT { cart: [ D ] } carries (2, 0, 0).]
vector clocks
Slide 59
[diagram: n2 receives userX’s PUT { cart: [ D ] }, stamped (2, 0, 0), and merges it with its own clock: max((2, 0, 0), (0, 1, 0)) = (2, 1, 0).]
vector clocks
Slide 60
versions and their vector clocks, across n1, n2, n3:
{ cart : [ A ] } = (2, 0, 0); { cart : [ D ] } = (2, 1, 0); { cart : [ B ] } = (0, 0, 1)
VCx ≺ VCy indicates x precedes y.
means to establish causal ordering.
(2, 0, 0) ≺ (2, 1, 0):
{ cart : [ A ] } precedes { cart : [ D ] }
vector clocks
Slide 61
versions and their vector clocks, across n1, n2, n3:
{ cart : [ A ] } = (2, 0, 0); { cart : [ D ] } = (2, 1, 0); { cart : [ B ] } = (0, 0, 1)
If that doesn’t hold for x and y, they conflict
VCx ≺ VCy indicates x precedes y.
means to establish causal ordering.
(2, 1, 0) and (0, 0, 1) are incomparable:
{ cart : [ D ] } conflicts with { cart : [ B ] }
vector clocks
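The precedes/conflict rules can be sketched as a toy vector clock in Go (illustrative types and names, not Riak’s implementation), using the clocks from the slides:

```go
package main

import "fmt"

// VC is a minimal vector clock: one counter per node.
type VC []int

// Incr bumps the entry for node i, on a local event at that node.
func (a VC) Incr(i int) { a[i]++ }

// Merge takes the element-wise max, used when a node receives a version.
func (a VC) Merge(b VC) {
	for i := range a {
		if b[i] > a[i] {
			a[i] = b[i]
		}
	}
}

// Precedes reports a ≺ b: a <= b element-wise, and a != b.
func (a VC) Precedes(b VC) bool {
	strictly := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			strictly = true
		}
	}
	return strictly
}

// Concurrent: neither precedes the other — a conflict.
func Concurrent(a, b VC) bool {
	return !a.Precedes(b) && !b.Precedes(a)
}

func main() {
	A := VC{2, 0, 0} // { cart: [ A ] }
	D := VC{2, 1, 0} // { cart: [ D ] }
	B := VC{0, 0, 1} // { cart: [ B ] }
	fmt.Println(A.Precedes(D))   // true: A is a causal ancestor of D
	fmt.Println(Concurrent(D, B)) // true: D and B conflict
}
```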
Slide 62
but logical clocks:
need to be passed around, and
are divorced from physical time.
logical clocks are a clever proxy for physical time.
vector clocks,
dotted version vectors,
a more precise form that Riak uses.
…this is pretty neat too.
Slide 63
stepping back…
Slide 64
TrueTime: augmented physical time.
timestamps that correspond to wall-clock time;
requires globally synchronized clocks.
vector clocks: logical time.
causality relations;
divorced from physical time.
Slide 65
“A person with a watch knows
what time it is. A person with
two watches is never sure.”
- Segal’s Law, reworded.
@kavya719
speakerdeck.com/kavya719/keeping-time-in-real-systems
Special thanks to Eben Freeman for reading drafts of this.
Slide 66
Spanner
Original paper:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/
spanner-osdi2012.pdf
Brewer’s 2017 paper:
https://static.googleusercontent.com/media/research.google.com/en//pubs/
archive/45855.pdf
Dynamo
http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Logical Clocks
http://amturing.acm.org/p558-lamport.pdf
Dotted Version Vectors
https://arxiv.org/abs/1011.5808
Hybrid Logical Clocks
https://www.cse.buffalo.edu//tech-reports/2014-04.pdf
Slide 67
timestampA = 100
userX
PUT { k: v2 }
timestampB = 50
B
userX
PUT { k: v }
accurate N1
slow N2
Hybrid Logical Clocks
augmented logical clocks:
ruh roh.
Slide 68
timestampA = <100, 1>
userX
PUT { k: v2 }
timestampB = <100, 2>
B
userX
PUT { k: v }
accurate N1
slow N2
<100, 1>
Hybrid Logical Clocks
augmented logical clocks:
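The hybrid logical clock update rules can be sketched in Go, roughly following the Kulkarni et al. paper in the references. The HLC type and method names are made up, and this sketch starts counters at 0 where the slide starts at 1:

```go
package main

import "fmt"

// HLC is a hybrid logical clock: L tracks the max physical time seen,
// C breaks ties between events sharing the same L.
type HLC struct {
	L int64
	C int64
}

// Tick updates the clock for a local or send event, given the node's
// physical timestamp pt.
func (h *HLC) Tick(pt int64) (int64, int64) {
	if pt > h.L {
		h.L, h.C = pt, 0
	} else {
		h.C++
	}
	return h.L, h.C
}

// Recv updates the clock on receiving a message stamped <ml, mc>.
func (h *HLC) Recv(pt, ml, mc int64) (int64, int64) {
	switch {
	case pt > h.L && pt > ml:
		h.L, h.C = pt, 0 // physical time dominates
	case ml > h.L:
		h.L, h.C = ml, mc+1 // message's clock dominates
	case h.L > ml:
		h.C++ // our clock dominates
	default: // h.L == ml: take the larger counter, then increment
		if mc > h.C {
			h.C = mc
		}
		h.C++
	}
	return h.L, h.C
}

func main() {
	// The slide's scenario: N1's clock reads 100 (accurate),
	// N2's reads 50 (slow).
	n1, n2 := &HLC{}, &HLC{}
	l1, c1 := n1.Tick(100)        // A: PUT { k: v2 } on N1
	l2, c2 := n2.Recv(50, l1, c1) // B: PUT { k: v } on N2, after N1's message
	fmt.Printf("A=<%d,%d> B=<%d,%d>\n", l1, c1, l2, c2)
	// B's timestamp orders after A's, despite N2's slow physical clock.
}
```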
Slide 69
replicas must agree on the order of transactions.
consistent timeline across replicas
consensus is a logical proxy for physical time:
provides a unified timeline across nodes.
leader proposes write to other replicas,
write commits iff n replicas ACK it.
Spanner uses Paxos, 2PC
(other protocols are 3PC, Raft, Zab).
consensus
Slide 70
compromises availability —
if n replicas are not available to ACK writes.
compromises performance —
increases write latency, decreases throughput;
multiple coordination rounds until a write commits.
but consensus
… so, don’t want to use consensus
to order transactions across partitions.
e.g. T1, T2.
Slide 71
happens-before
X ≺ Y IF one of:
— same actor
— are a synchronization pair
— X ≺ E ≺ Y (transitivity)
IF X not ≺ Y and Y not ≺ X, concurrent!
orders events across actors (threads or nodes);
establishes causality and concurrency.
Formulated in Lamport’s Time, Clocks, and the Ordering of Events paper in 1978.
Slide 72
A ≺ C (same actor)
C ≺ D (synchronization pair)
So, A ≺ D (transitivity)
causality and concurrency
A B
C D
N1
N2
N3
Slide 73
…but B not ≺ D, and D not ≺ B.
So, B, D concurrent!
A B
C D
N1
N2
N3
causality and concurrency
Slide 74
A B
C D
N1
N2
N3
{ cart : [ A ] }
{ cart : [ B ] }
{ cart : [ A ]} { cart : [ D ]}
A ≺ D
D should update A
B, D concurrent
B, D need resolution
Slide 75
GET, PUT operations on a key pass around a causal context object
that contains the vector clocks.
Riak stores a vector clock with each version of the data
(a more precise form, the “dotted version vector”).
Therefore, it is able to determine causal updates versus conflicts.
Slide 76
conflict resolution in riak
Behavior is configurable.
Assuming vector clock analysis enabled:
• last-write-wins
i.e. version with higher timestamp picked.
• merge, iff the underlying data type is a CRDT
• return conflicting versions to application
Riak stores “siblings”, i.e. conflicting versions,
returned to the application for resolution.
Slide 77
return conflicting versions to application:
D: { cart: [ “date crepe” ] }, clock (2, 1, 0)
B: { cart: [ “blueberry crepe” ] }, clock (0, 0, 1)
Riak stores both versions
next op returns both to application
application must resolve conflict
{ cart: [ “blueberry crepe”, “date crepe” ] }, clock (2, 1, 1),
which creates a causal update.
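The application-side resolution can be sketched in Go (hypothetical helper functions, not Riak client API): union the sibling values, and merge their clocks element-wise.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeCarts unions two conflicting cart values, dropping duplicates.
// This is one application-level policy; others are possible.
func mergeCarts(a, b []string) []string {
	seen := map[string]bool{}
	var merged []string
	for _, item := range append(a, b...) {
		if !seen[item] {
			seen[item] = true
			merged = append(merged, item)
		}
	}
	sort.Strings(merged)
	return merged
}

// mergeClocks takes the element-wise max, so the merged version
// causally descends from both siblings.
func mergeClocks(a, b []int) []int {
	out := make([]int, len(a))
	for i := range a {
		out[i] = a[i]
		if b[i] > out[i] {
			out[i] = b[i]
		}
	}
	return out
}

func main() {
	d := []string{"date crepe"}      // sibling D, clock (2, 1, 0)
	b := []string{"blueberry crepe"} // sibling B, clock (0, 0, 1)
	cart := mergeCarts(d, b)
	clock := mergeClocks([]int{2, 1, 0}, []int{0, 0, 1})
	fmt.Println(cart, clock) // [blueberry crepe date crepe] [2 1 1]
}
```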
Slide 78
…what about resolving those conflicts?
riak doesn’t (default behavior);
instead, it exposes the happens-before graph
to the application for conflict resolution.