keeping-time-velocity.pdf

Keeping Time in Real Systems @kavya719

timekeeping mechanisms

distributed key-value store. three nodes. assume no failures, all operations succeed.
datastore’s timeline: userX PUT { key: v }, then userX PUT { key: v2 }, then userY GET key ?
value depends on the data store’s consistency model.

consistency model: set of guarantees the system makes about what events will be visible, and when.
i.e. the set of valid timelines of events.
These guarantees are informed and enforced by the timekeeping mechanisms used by the system.

computer clocks: the system clock, NTP, UNIX time.
other timekeeping mechanisms: Spanner, Riak.
stepping back: bridging the gap.

computer clocks

the model: a distributed datastore.
multiple nodes for fault tolerance, scalability, performance.
nodes are logical (processes) or physical (machines). are sequential.
communicate by message-passing, i.e. connected by unreliable network, no shared memory.
data may be replicated, partitioned.

computers have clocks…

    func measureX() {
        start := time.Now()
        x()
        end := time.Now()
        // Time x takes.
        elapsed := end.Sub(start)
    }

…can we use them?
hardware clocks drift.
NTP is slow etc.
the system clock keeps Unix time.

a caveat: Details vary by language, OS, architecture, hardware. …but the details don’t matter today.
That said, we will be assuming Linux on an x86 processor.

computer clocks are not hardware clocks, but are “run” by hardware and the OS kernel.
time.Now() → clock_gettime(CLOCK_REALTIME): sys call to get the value of a particular computer clock (MONOTONIC is another such clock).
CLOCK_REALTIME is the system clock or wall clock. Gives the current UNIX timestamp.
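A minimal Go sketch of the two kinds of reading, assuming Go 1.9 or later, where time.Now() records a monotonic reading next to the wall-clock one and Sub/Since prefer the monotonic reading. measure and measuredMillis are hypothetical helper names; the sleep stands in for x().

```go
package main

import (
	"fmt"
	"time"
)

// measure runs fn and returns how long it took.
// Both time values carry a monotonic reading, so the difference is
// not affected if the wall clock is stepped while fn runs.
func measure(fn func()) time.Duration {
	start := time.Now()
	fn()
	return time.Since(start)
}

// measuredMillis times a sleep of ms milliseconds and returns the
// measured duration in whole milliseconds.
func measuredMillis(ms int) int64 {
	return measure(func() {
		time.Sleep(time.Duration(ms) * time.Millisecond)
	}).Milliseconds()
}

func main() {
	// Wall clock: the current UNIX timestamp, in seconds.
	fmt.Println("wall clock, UNIX timestamp:", time.Now().Unix())
	// Elapsed time: computed from the monotonic readings.
	fmt.Println("x took about", measuredMillis(20), "ms")
}
```

Elapsed time on one machine is the easy case; the rest of the deck is about comparing readings across machines.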

the system clock is a counter kept by hardware, OS kernel.
at system boot: set from the hardware clock, the Real Time Clock (RTC), which keeps UTC time (or from an external source like NTP).
subsequently: incr using a hardware ticker. “hey HPET, interrupt me in 10ms”; then when interrupted, knows to increment by 10ms.
“tickless” kernel: the interrupt interval (“tick”) is dynamically calculated.
these are the hardware clocks that drift. causes system clocks of different computers to change at different rates.

NTP is slow etc.
NTP synchronizes the system clock to a highly accurate clock network:
gradually adjusts the clock rate (“slew”), or sets a new value (“step”) if the differential is too large.
but: need trusted, reachable NTP servers.
NTP is slow, up to hundreds of ms over public internet.
stepping results in discontinuous jumps in time.
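A rough Go sketch of the slew-or-step decision. The two constants are assumptions modeled on ntpd's documented defaults (step above 128 ms, slew at up to 500 ppm); the deck itself gives no numbers, and plan and correction are hypothetical names.

```go
package main

import "fmt"

// Assumed values, modeled on ntpd's documented defaults.
const (
	stepThresholdMicros = 128_000 // offsets above 128 ms are stepped
	maxSlewPPM          = 500     // slewing corrects at most 500 microseconds per second
)

// correction describes how a clock offset would be removed.
type correction struct {
	Step        bool  // true: set a new value, so time jumps
	SlewSeconds int64 // otherwise: seconds needed to slew the offset away
}

// plan decides between slewing and stepping for an offset in microseconds.
func plan(offsetMicros int64) correction {
	if offsetMicros < 0 {
		offsetMicros = -offsetMicros
	}
	if offsetMicros > stepThresholdMicros {
		return correction{Step: true}
	}
	// Round up: a partial second of slewing still takes a full second.
	secs := (offsetMicros + maxSlewPPM - 1) / maxSlewPPM
	return correction{SlewSeconds: secs}
}

func main() {
	fmt.Println(plan(100_000)) // 100 ms off: slewed
	fmt.Println(plan(300_000)) // 300 ms off: stepped
}
```

Under these assumptions a 100 ms offset takes 200 seconds to slew away, which is one sense in which the correction is slow.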

The system clock keeps UNIX time: “number of seconds since epoch” (midnight UTC, 01.01.1970).
increases by exactly 86,400 seconds per day. So, 1000th day after the epoch = 86400000 etc.
…but a UTC day is not a constant 86,400 seconds!
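The arithmetic above can be checked with a few lines of Go. unixAtDay and dateOf are hypothetical helper names; the only assumption is the fixed 86,400-second day that UNIX time uses.

```go
package main

import (
	"fmt"
	"time"
)

const secondsPerDay = 86400

// unixAtDay returns the UNIX timestamp at midnight UTC, n days after the epoch.
func unixAtDay(n int64) int64 {
	return n * secondsPerDay
}

// dateOf formats a UNIX timestamp as a UTC date and time.
func dateOf(unix int64) string {
	return time.Unix(unix, 0).UTC().Format("2006-01-02 15:04:05")
}

func main() {
	ts := unixAtDay(1000)
	fmt.Println(ts, dateOf(ts)) // 86400000 1972-09-27 00:00:00
}
```

UTC inserted a leap second at the end of June 1972, so slightly more than 86,400,000 atomic seconds had elapsed by that midnight; the UNIX count does not include it.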

interlude: UTC. messy compromise between:
atomic time: measured using atomic clocks. very stable; this is what we want to use e.g. the (SI) second.
astronomical time: based on the Earth’s rotation. matches the Earth’s position; sometimes useful (we’re told).
So, UTC: based on atomic time, adjusted to be in sync with the Earth’s rotational period.
but problem…

the Earth’s rotation slows down over time. So, an astronomical day “takes longer” in absolute (atomic) terms.
To compensate for this drift, UTC periodically adds a second, the leap second: 23:59:59, 23:59:60, 00:00:00.
…so a UTC day may be 86,400 or 86,401 seconds!

The system clock keeps UNIX time.
Unix time can’t represent the extra second, but want the computer’s “current time” to be aligned with UTC (in the long run):
UTC:  23:59:59, 23:59:60 (leap second), 00:00:00
Unix: 23:59:59, 23:59:59 (repeats!), 00:00:00
Unix time is not monotonic.
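A small Go illustration that the leap second has no timestamp of its own. It relies on time.Date normalizing out-of-range fields, so second 60 rolls into the next minute; that places the collision at 00:00:00 rather than at the repeated 23:59:59 shown on the slide, but either way two distinct UTC seconds share one UNIX timestamp. unixOf is a hypothetical helper, and 2016-12-31 is used because a leap second was inserted at the end of that day.

```go
package main

import (
	"fmt"
	"time"
)

// unixOf returns the UNIX timestamp Go computes for a UTC wall-clock reading.
// time.Date normalizes out-of-range fields, so ss = 60 becomes
// second 0 of the following minute.
func unixOf(y, m, d, hh, mm, ss int) int64 {
	return time.Date(y, time.Month(m), d, hh, mm, ss, 0, time.UTC).Unix()
}

func main() {
	leap := unixOf(2016, 12, 31, 23, 59, 60) // the leap second in UTC
	next := unixOf(2017, 1, 1, 0, 0, 0)      // the second after it
	fmt.Println(leap, next, leap == next)    // 1483228800 1483228800 true
}
```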

system clocks are not synchronized or monotonic across nodes:
hardware clocks drift. NTP is slow etc. the system clock keeps Unix time.
example: two nodes, N1 and N2; N1’s clock is fast.
A: userX PUT { k: v } at N1, timestampA = 150.
B: userX PUT { k: v2 } at N2, timestampB = 50.
the later write gets the smaller timestamp. ruh roh.
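A minimal Go sketch of why this matters if writes are ordered by system-clock timestamp. Ordering by last-write-wins is an assumption made for illustration; write and lastWriteWins are hypothetical names, and the timestamps are the ones from the example.

```go
package main

import "fmt"

// write is a PUT tagged with the system clock of the node that accepted it.
type write struct {
	Value     string
	Timestamp int64
}

// lastWriteWins keeps the write with the larger timestamp.
func lastWriteWins(a, b write) write {
	if b.Timestamp > a.Timestamp {
		return b
	}
	return a
}

func main() {
	a := write{Value: "v", Timestamp: 150} // A: issued first, stamped by the fast clock
	b := write{Value: "v2", Timestamp: 50} // B: issued second, stamped by the other node
	fmt.Println(lastWriteWins(a, b).Value) // v
}
```

The user wrote v2 last, but the store keeps v.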

other timekeeping mechanisms

prelude: timekeeping mechanism used by a system depends on:
desired consistency model: what the valid timelines of events are.
desired availability: how “responsive” the system is.
desired performance: read and write latency and so, throughput.
availability and performance are the costs of higher consistency (CAP theorem, etc.)

spanner
• Distributed relational database: supports distributed transactions
• Horizontally scalable: data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”

example: savings is on partition N1 (US partition), checking is on partition N2 (EU partition). each partition is replicated as a group: G1, G2.
minimum total balance requirement = 200. total balance = 200.
T1: deposit 100, at G1. T2: debit 100, at G2.

want: desired consistency guarantees, desired performance.
reads from replicas, consistent snapshot reads.
want reads to never contain T2, if they don’t also contain T1.
need: consistent timeline across replicas (consensus).
to order transactions across the system as well, and the order to correspond to the observed commit order: “globally consistent transaction order that corresponds to observed commit order”.
and to be performant.

order of transactions == observed order:
if T1 commits before T2 starts to commit, T1 is ordered before T2. even if T1, T2 across the globe!
Can we enforce ordering using commit timestamps?
Yes, if perfectly synchronized clocks.
…or, if you can know clock uncertainty perfectly, and account for it.

TrueTime tracks and exposes the uncertainty about perceived time across system clocks.
explicitly represents time as an interval, not a point.
TT.now() returns [earliest, latest], an interval that contains “true now”.
earliest is the earliest time that could be “true now”; latest is the latest.

if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
T1, at G1 leader: commit_ts(T1) = TT.now().latest.
waits for one full uncertainty window (commit wait), i.e. until commit_ts < TT.now().earliest. then, commits and replies.
commit wait guarantees commit_ts for next transaction is higher, despite different clocks.
T2, at G2 leader: commit_ts(T2) = TT.now().latest.
waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest. then, commits and replies.
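A simulated Go sketch of commit wait. This is not Spanner's API: world, node, Offset and Epsilon are hypothetical, time is an integer count of milliseconds, and the only assumption is that each node's clock error stays within Epsilon, so its interval always contains true time.

```go
package main

import "fmt"

// interval is what a TrueTime-like now() returns: true time lies in [Earliest, Latest].
type interval struct{ Earliest, Latest int64 }

// world holds the simulated true time in milliseconds, shared by all nodes.
type world struct{ TrueNow int64 }

// node is a group leader with an imperfect local clock.
type node struct {
	W       *world
	Offset  int64 // local clock error; assumed |Offset| <= Epsilon
	Epsilon int64 // uncertainty bound exposed by the clock
}

// Now returns an interval around the local clock that contains true time.
func (n node) Now() interval {
	local := n.W.TrueNow + n.Offset
	return interval{local - n.Epsilon, local + n.Epsilon}
}

// Commit picks commit_ts = now().latest, then commit-waits until
// commit_ts < now().earliest before acknowledging.
func (n node) Commit() int64 {
	commitTS := n.Now().Latest
	for !(commitTS < n.Now().Earliest) {
		n.W.TrueNow++ // simulated passage of 1 ms; a real system would sleep
	}
	return commitTS
}

func main() {
	w := &world{TrueNow: 1000}
	g1 := node{W: w, Offset: 7, Epsilon: 7}  // fast clock
	g2 := node{W: w, Offset: -7, Epsilon: 7} // slow clock
	ts1 := g1.Commit()                       // T1 commits and replies
	ts2 := g2.Commit()                       // T2 starts to commit afterwards
	fmt.Println(ts1, ts2, ts1 < ts2)         // 1014 1015 true
}
```

Even with the clocks as far apart as the bound allows, T2's timestamp comes out larger, at the price of each commit waiting roughly twice Epsilon.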

TrueTime provides externally consistent transaction commit timestamps, so enables external consistency without coordination.
…this is neat.
but note: The uncertainty window affects commit wait time, and so write latency and throughput.
Google uses impressive (and expensive!) infrastructure to keep this small; ~7ms as of 2012.

riak
• Distributed key-value database: a data item = <key: blob>, e.g. {“uuid1234”: {“name”: “ada”}}
• Highly available: data partitioned and replicated, decentralized i.e. all replicas serve reads, writes.
• Eventually consistent: “if no new updates are made to an object, eventually all accesses will return the last updated value.”

timekeeping
want: if no new updates are made to an object, eventually all accesses will return the last updated value.
need: determine causal updates for convergence to latest.

causal updates converge.
three replicas: N1, N2, N3. read_quorum = write_quorum = 1. cart: [ ]
userX PUT { cart : [ A ] }. userX reads back { cart : [ A ] }, then userX PUT { cart : [ D ] }.

timekeeping
want: any node serves reads and writes for availability.
need: determine conflicting updates.

concurrent updates conflict.
three replicas: N1, N2, N3. read_quorum = write_quorum = 1. cart: [ ]
userX PUT { cart : [ A ] } while userY PUT { cart : [ B ] }.

vector clocks: logical clocks that use versions as “timestamps”. means to establish causal ordering.
events on N1, N2, N3:
A: userX PUT { cart : [ A ] }
B: userY PUT { cart : [ B ] }
C: userX GET cart, returns { cart : [ A ] }
D: userX PUT { cart : [ D ] }

vector clocks (figure): each node keeps a vector with one entry per node (n1, n2, n3), initially (0, 0, 0).
A: userX PUT { cart : [ A ] } at n1. n1’s clock: (1, 0, 0).
B: userY PUT { cart : [ B ] } at n3. n3’s clock: (0, 0, 1).
C: userX GET cart at n1. n1’s clock: (2, 0, 0). returns: (2, 0, 0).
D: userX PUT { cart : [ D ] } with (2, 0, 0), at n2. n2’s clock: (0, 1, 0), then max((2, 0, 0), (0, 1, 0)) = (2, 1, 0).

vector clocks: means to establish causal ordering.
{ cart : [ A ] }: (2, 0, 0). { cart : [ D ] }: (2, 1, 0). { cart : [ B ] }: (0, 0, 1).
VCx ≺ VCy indicates x precedes y.
(2, 0, 0) ≺ (2, 1, 0), so { cart : [ A ] } precedes { cart : [ D ] }.
If that doesn’t hold for x and y, they conflict.
(2, 1, 0) and (0, 0, 1): { cart : [ D ] } conflicts with { cart : [ B ] }.
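The comparison rule can be sketched in Go as follows. This is a plain fixed-size vector clock for three nodes, not Riak's implementation; VC, Precedes, Concurrent and Merge are hypothetical names.

```go
package main

import "fmt"

// VC is a vector clock with one counter per node: (n1, n2, n3).
type VC [3]int

// Precedes reports whether a ≺ b: no entry of a exceeds the matching
// entry of b, and at least one entry is strictly smaller.
func (a VC) Precedes(b VC) bool {
	strictlySmaller := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			strictlySmaller = true
		}
	}
	return strictlySmaller
}

// Concurrent reports whether a and b differ and neither precedes the other.
func Concurrent(a, b VC) bool {
	return a != b && !a.Precedes(b) && !b.Precedes(a)
}

// Merge returns the entry-wise maximum of a and b.
func Merge(a, b VC) VC {
	out := a
	for i := range b {
		if b[i] > out[i] {
			out[i] = b[i]
		}
	}
	return out
}

func main() {
	vcA, vcD, vcB := VC{2, 0, 0}, VC{2, 1, 0}, VC{0, 0, 1}
	fmt.Println(vcA.Precedes(vcD))               // true: A precedes D
	fmt.Println(Concurrent(vcD, vcB))            // true: D conflicts with B
	fmt.Println(Merge(VC{2, 0, 0}, VC{0, 1, 0})) // [2 1 0]
}
```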

the earlier example, with vector clocks: N1, N2; one clock is faster by 10ms.
A: userX PUT { k: v }, timestampA = (1, 0).
B: userX PUT { k: v2 }, timestampB = (1, 1).

logical clocks are a clever proxy for physical time.
vector clocks; dotted version vectors, a more precise form that Riak uses.
…this is pretty neat too.
but logical clocks: need to be passed around. are divorced from physical time.

stepping back…

TrueTime
+ timestamps that correspond to wall-clock time.
- specialized infrastructure.
logical clocks
+ we can all do causality tracking!
- timestamps don’t correspond to wall-clock time.

hybrid logical clocks: augmented logical clocks. <max_physical_time_seen, logical clock>
example (figure): N1, N2; one clock is faster by 20ms. physical clock readings shown: 50 and 40, then 55 and 45.
A: userX PUT { k: v }, timestampA = <50, 1>. <50, 1> is passed along.
B: userX PUT { k: v2 }, timestampB = <100, 2>.
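A compact Go sketch of a hybrid logical clock, following the update rules of the Hybrid Logical Clocks tech report as I understand them. It is an illustration, not any database's implementation: Timestamp, Clock, Tick and Recv are hypothetical names, counters start at 0, and the values do not try to reproduce the figure.

```go
package main

import "fmt"

// Timestamp is <max physical time seen, logical counter>.
type Timestamp struct {
	L int64 // max physical time seen so far
	C int64 // logical counter; breaks ties when L does not advance
}

// Less orders timestamps by L, then by C.
func (a Timestamp) Less(b Timestamp) bool {
	return a.L < b.L || (a.L == b.L && a.C < b.C)
}

// Clock is one node's hybrid logical clock.
type Clock struct{ ts Timestamp }

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// Tick stamps a local event or a send; pt is the node's physical clock reading.
func (c *Clock) Tick(pt int64) Timestamp {
	if pt > c.ts.L {
		c.ts = Timestamp{L: pt, C: 0}
	} else {
		c.ts.C++
	}
	return c.ts
}

// Recv stamps the receipt of a message that carries timestamp m.
func (c *Clock) Recv(m Timestamp, pt int64) Timestamp {
	prev := c.ts
	l := max64(max64(prev.L, m.L), pt)
	switch {
	case l == prev.L && l == m.L:
		c.ts = Timestamp{L: l, C: max64(prev.C, m.C) + 1}
	case l == prev.L:
		c.ts = Timestamp{L: l, C: prev.C + 1}
	case l == m.L:
		c.ts = Timestamp{L: l, C: m.C + 1}
	default:
		c.ts = Timestamp{L: l, C: 0}
	}
	return c.ts
}

func main() {
	var n1, n2 Clock
	a := n1.Tick(50)    // A on N1, physical clock reads 50
	b := n2.Recv(a, 45) // B on N2, whose physical clock reads only 45
	fmt.Println(a, b, a.Less(b)) // {50 0} {50 1} true
}
```

B is ordered after A even though N2's physical clock is behind, and both timestamps stay close to wall-clock time.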

“A person with a watch knows what time it is. A person with two watches is never sure.”
- Segal’s Law, reworded.
@kavya719
speakerdeck.com/kavya719/keeping-time-in-real-systems
Special thanks to Eben Freeman for reading drafts of this.

Spanner
Original paper: http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf
Brewer’s 2017 paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Logical Clocks: http://amturing.acm.org/p558-lamport.pdf
Dotted Version Vectors: https://arxiv.org/abs/1011.5808
Hybrid Logical Clocks: https://www.cse.buffalo.edu//tech-reports/2014-04.pdf

consistent timeline across replicas: replicas must agree on the order of transactions.
consensus: provides a unified timeline across nodes. …is logical proxy for physical time.
leader proposes write to other replicas, write commits iff n replicas ACK it.
Spanner uses Paxos, 2PC (other protocols are 3PC, Raft, Zab).

but consensus …
compromises availability: if n replicas are not available to ACK writes.
compromises performance: increases write latency, decreases throughput; multiple coordination rounds until a write commits.
so, don’t want to use consensus to order transactions across partitions, e.g. T1, T2.

happens-before: orders events across actors (threads or nodes). establishes causality and concurrency.
Formulated in Lamport’s Time, Clocks, and the Ordering of Events paper in 1978.
X ≺ Y IF one of:
- same actor
- are a synchronization pair
- X ≺ E ≺ Y
IF X not ≺ Y and Y not ≺ X, concurrent!

causality and concurrency: events A, B, C, D on N1, N2, N3.
A ≺ C (same actor)
C ≺ D (synchronization pair)
So, A ≺ D (transitivity)
…but B ? D, D ? B
So, B, D concurrent!

events A, B, C, D on N1, N2, N3:
A: { cart : [ A ] }. B: { cart : [ B ] }. C: { cart : [ A ] }. D: { cart : [ D ] }.
A ≺ D: D should update A.
B, D concurrent: B, D need resolution.

Riak stores a vector clock with each version of the data (a more precise form, “dotted version vector”).
GET, PUT operations on a key pass around a causal context object, that contains the vector clocks.
Therefore, able to determine causal updates versus conflicts.

conflict resolution in riak
Behavior is configurable. Assuming vector clock analysis enabled:
• last-write-wins, i.e. version with higher timestamp picked.
• merge, iff the underlying data type is a CRDT.
• return conflicting versions to application: riak stores “siblings” or conflicting versions, returned to application for resolution.

return conflicting versions to application:
D: { cart: [ “date crepe” ] } with (2, 1, 0). B: { cart: [ “blueberry crepe” ] } with (0, 0, 1).
Riak stores both versions. next op returns both to application.
application must resolve conflict: { cart: [ “blueberry crepe”, “date crepe” ] }
which creates a causal update: { cart: [ “blueberry crepe”, “date crepe” ] } with (2, 1, 1).

…what about resolving those conflicts?
Riak doesn’t (default behavior). instead, exposes happens-before graph to the application for conflict resolution.
