A distributed key-value store with three nodes. Assume no failures; all operations succeed.
The datastore’s timeline:
• userX: PUT { key: v }
• userX: PUT { key: v2 }
• userY: GET key → ?
The value userY reads depends on the data store’s consistency model.
consistency model
The set of guarantees the system makes about what events will be visible, and when; equivalently, the set of valid timelines of events.
These guarantees are informed and enforced by the timekeeping mechanisms used by the system.
the model: a distributed datastore
• multiple nodes, for fault tolerance, scalability, performance.
• nodes are logical (processes) or physical (machines), and are sequential.
• nodes communicate by message-passing, i.e. they are connected by an unreliable network, with no shared memory.
• data may be replicated, partitioned.
computers have clocks…

    // measureX returns the time x takes.
    func measureX() time.Duration {
        start := time.Now()
        x()
        end := time.Now()
        return end.Sub(start)
    }

…can we use them?
• hardware clocks drift.
• NTP is slow etc.
• the system clock keeps Unix time.
a caveat
Details vary by language, OS, architecture, hardware… but the details don’t matter today.
That said, we will be assuming Linux on an x86 processor.
hardware clocks drift
Computer clocks are not hardware clocks, but are “run” by hardware and the OS kernel.
• time.Now() → clock_gettime(CLOCK_REALTIME)
• clock_gettime: the system call to get the value of a particular computer clock (e.g. CLOCK_REALTIME, CLOCK_MONOTONIC).
• CLOCK_REALTIME: the system clock or wall clock. Gives the current Unix timestamp.
the system clock is a counter kept by hardware and the OS kernel.
• at system boot: set from the hardware clock (or an external source like NTP). The Real Time Clock (RTC) keeps UTC time.
• subsequently: incremented using a hardware ticker. The kernel asks, “hey HPET, interrupt me in 10ms”; then, when interrupted, it knows to increment by 10ms. In a “tickless” kernel, the interrupt interval (“tick”) is dynamically calculated.
• these tickers are the hardware clocks that drift. This causes the system clocks of different computers to change at different rates.
NTP is slow etc.
NTP synchronizes the system clock to a highly accurate clock network. It either:
• gradually adjusts the clock rate (“slew”), or
• sets a new value (“step”), if the differential is too large.
but:
• need trusted, reachable NTP servers.
• NTP is slow: up to hundreds of ms over the public internet.
• stepping results in discontinuous jumps in time.
the system clock keeps Unix time
Unix time is the “number of seconds since epoch”, where the epoch is midnight UTC, 01.01.1970.
It increases by exactly 86,400 seconds per day. So, the 1000th day after the epoch = 86,400,000, etc.
…but a UTC day is not a constant 86,400 seconds!
interlude: UTC
A messy compromise between:
• atomic time: measured using atomic clocks. Very stable; this is what we want to use, e.g. the (SI) second.
• astronomical time: based on the Earth’s rotation. Matches the Earth’s position; sometimes useful (we’re told).
So, UTC is based on atomic time, adjusted to be in sync with the Earth’s rotational period.
but there is a problem…
…the Earth’s rotation slows down over time. So, an astronomical day “takes longer” in absolute (atomic) terms.
To compensate for this drift, UTC periodically adds a second (a leap second):
23:59:59 → 23:59:60 → 00:00:00
…so a UTC day may be 86,400 or 86,401 seconds!
the system clock keeps Unix time
Unix time can’t represent the extra second, but we want the computer’s “current time” to be aligned with UTC (in the long run). So, during a leap second, Unix time repeats:
UTC:  23:59:59 → 23:59:60 → 00:00:00
Unix: 23:59:59 → 23:59:59 → 00:00:00
Unix time is not monotonic.
system clocks are not synchronized, or monotonic, across nodes
• hardware clocks drift.
• NTP is slow etc.
• the system clock keeps Unix time.
example: N1’s clock runs fast.
• A: userX PUT { k: v } at N1, timestampA = 150
• B: userX PUT { k: v2 } at N2, timestampB = 50
Ordered by timestamp, B precedes A, so the later write (v2) looks older than v. ruh roh.
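The failure in this example can be sketched as a last-write-wins comparison over system-clock timestamps. The write type and lastWriteWins function are hypothetical, not taken from any particular datastore.

```go
package main

import "fmt"

// write is a hypothetical record of a PUT, stamped with the local system clock.
type write struct {
	value     string
	timestamp int64
}

// lastWriteWins returns the write with the higher timestamp.
func lastWriteWins(a, b write) write {
	if b.timestamp > a.timestamp {
		return b
	}
	return a
}

func main() {
	a := write{value: "v", timestamp: 150} // issued first, stamped by the fast clock
	b := write{value: "v2", timestamp: 50} // issued second, stamped by the other clock
	fmt.Println(lastWriteWins(a, b).value) // v, although v2 was written later
}
```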
prelude
The timekeeping mechanism used by a system depends on:
• desired consistency model: what the valid timelines of events are.
• desired availability: how “responsive” the system is.
• desired performance: read and write latency and so, throughput.
Availability and performance are the costs of higher consistency (CAP theorem, etc.).
desired consistency guarantees, desired performance
want:
• reads from replicas, consistent snapshot reads.
• reads to never contain T2, if they don’t also contain T1.
need:
• a consistent timeline across replicas: consensus.
• to order transactions across the system as well, with the order corresponding to the observed commit order: a “globally consistent transaction order that corresponds to observed commit order”, and performant.
order of transactions == observed order, even if T1, T2 are across the globe!
If T1 commits before T2 starts to commit, T1 is ordered before T2.
Can we enforce ordering using commit timestamps?
Yes, if clocks are perfectly synchronized… or, if you can know clock uncertainty perfectly, and account for it.
TrueTime
Tracks and exposes the uncertainty about perceived time across system clocks.
Explicitly represents time as an interval, not a point:
TT.now() returns [earliest, latest], an interval that contains “true now”.
earliest is the earliest time that could be “true now”; latest is the latest.
T1 at the G1 leader:
• picks commit_ts(T1) = TT.now().latest
• waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest (the “commit wait”)
• then, commits and replies.
The commit wait guarantees that the commit_ts for the next transaction is higher, despite different clocks:
if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
T2 at the G2 leader:
• picks commit_ts(T2) = TT.now().latest
• waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest (the “commit wait”)
• then, commits and replies.
If T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
TrueTime provides externally consistent transaction commit timestamps, and so enables external consistency without coordination. …this is neat.
but note: the uncertainty window affects commit wait time, and so write latency and throughput.
Google uses impressive (and expensive!) infrastructure to keep this small; ~7ms as of 2012.
riak
• Distributed key-value database:
  // A data item = {"uuid1234": {"name":"ada"}}
• Highly available: data partitioned and replicated; decentralized, i.e. all replicas serve reads, writes.
• Eventually consistent: “if no new updates are made to an object, eventually all accesses will return the last updated value.”
timekeeping
“if no new updates are made to an object, eventually all accesses will return the last updated value.”
want:
• convergence to the latest value.
• any node serves reads and writes, for availability.
need:
• to determine causal updates (for convergence).
• to determine conflicting updates (since any node accepts writes).
vector clocks
Logical clocks that use versions as “timestamps”; a means to establish causal ordering.
example, across nodes N1, N2, N3:
• userX: { cart : [ A ] }
• userY: { cart : [ B ] }
• userX: { cart : [ D ] }
vector clocks
A means to establish causal ordering.
                    n1  n2  n3
{ cart : [ A ] }     2   0   0
{ cart : [ D ] }     2   1   0
{ cart : [ B ] }     0   0   1
VCx ≺ VCy indicates x precedes y. If neither VCx ≺ VCy nor VCy ≺ VCx holds for x and y, they conflict.
{ cart : [ D ] } (2 1 0) conflicts with { cart : [ B ] } (0 0 1).
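The comparison rule can be sketched with fixed-size vectors. The VC type and function names are invented for this example, and the element-wise definition of ≺ used here (no counter larger, at least one smaller) is the standard one rather than something stated on the slide.

```go
package main

import "fmt"

// VC is a vector clock with one counter per node (n1, n2, n3).
type VC [3]int

// Precedes reports whether a ≺ b: no counter in a exceeds the matching
// counter in b, and at least one is strictly smaller.
func (a VC) Precedes(b VC) bool {
	strictlyLess := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			strictlyLess = true
		}
	}
	return strictlyLess
}

// Conflict reports whether the clocks differ and neither precedes the other.
func Conflict(a, b VC) bool {
	return a != b && !a.Precedes(b) && !b.Precedes(a)
}

func main() {
	cartA := VC{2, 0, 0}
	cartD := VC{2, 1, 0}
	cartB := VC{0, 0, 1}
	fmt.Println(cartA.Precedes(cartD))  // true: D causally follows A
	fmt.Println(Conflict(cartD, cartB)) // true: D and B are concurrent
}
```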
logical clocks are a clever proxy for physical time.
e.g. vector clocks; dotted version vectors, a more precise form that Riak uses.
…this is pretty neat too.
but logical clocks:
• need to be passed around.
• are divorced from physical time.
TrueTime
+ timestamps that correspond to wall-clock time.
- specialized infrastructure.
logical clocks
+ we can all do causality tracking!
- timestamps don’t correspond to wall-clock time.
“A person with a watch knows what time it is. A person with two watches is never sure.” - Segal’s Law, reworded.
@kavya719
speakerdeck.com/kavya719/keeping-time-in-real-systems
Special thanks to Eben Freeman for reading drafts of this.
consensus
For a consistent timeline across replicas, replicas must agree on the order of transactions.
• the leader proposes a write to the other replicas; the write commits iff n replicas ACK it.
• Spanner uses Paxos, 2PC (other protocols are 3PC, Raft, Zab).
…is a logical proxy for physical time: it provides a unified timeline across nodes.
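The commit rule alone, stripped of everything else a consensus protocol does, can be sketched as an ACK count. The commits function is hypothetical; real protocols such as Paxos or Raft also handle leader election, retries, and log ordering, none of which appear here.

```go
package main

import "fmt"

// commits reports whether a proposed write commits: at least n replicas
// must have acknowledged it.
func commits(acks []bool, n int) bool {
	count := 0
	for _, ok := range acks {
		if ok {
			count++
		}
	}
	return count >= n
}

func main() {
	// Three replicas, n = 2: one unreachable replica does not block the write.
	fmt.Println(commits([]bool{true, true, false}, 2)) // true
	// Two unreachable replicas do.
	fmt.Println(commits([]bool{true, false, false}, 2)) // false
}
```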
but consensus…
• compromises availability, if n replicas are not available to ACK writes.
• compromises performance: increases write latency, decreases throughput; multiple coordination rounds until a write commits.
…so, don’t want to use consensus to order transactions across partitions, e.g. T1, T2.
happens-before
Orders events; establishes causality and concurrency.
X ≺ Y IF one of:
• X and Y are on the same actor (thread or node), and X occurs first
• X and Y are a synchronization pair
• X ≺ E ≺ Y, across actors.
IF X not ≺ Y and Y not ≺ X, they are concurrent!
Formulated in Lamport’s “Time, Clocks, and the Ordering of Events in a Distributed System” paper in 1978.
GET, PUT operations on a key pass around a causal context object that contains the vector clocks (or a more precise form, the “dotted version vector”).
Riak stores a vector clock with each version of the data.
Therefore, it is able to determine causal updates versus conflicts.
conflict resolution in riak
Behavior is configurable. Assuming vector clock analysis is enabled:
• last-write-wins, i.e. the version with the higher timestamp is picked.
• merge, iff the underlying data type is a CRDT.
• return conflicting versions to the application: riak stores “siblings”, or conflicting versions, which are returned to the application for resolution.
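Two of these strategies can be sketched together: causal updates are resolved by the vector clocks, and only true conflicts fall through to last-write-wins or siblings. This is an illustration, not Riak's API; the version type, the resolve function, and the lastWriteWins flag are all hypothetical, and CRDT merging is left out.

```go
package main

import "fmt"

// version is a hypothetical stored value with its causal metadata.
type version struct {
	value     string
	clock     [3]int // vector clock, one counter per node
	timestamp int64  // wall-clock timestamp, used only for last-write-wins
}

// precedes reports whether clock a causally precedes clock b.
func precedes(a, b [3]int) bool {
	less := false
	for i := range a {
		if a[i] > b[i] {
			return false
		}
		if a[i] < b[i] {
			less = true
		}
	}
	return less
}

// resolve returns the versions that survive: the descendant if the updates
// are causally ordered, otherwise one winner or both siblings.
func resolve(a, b version, lastWriteWins bool) []version {
	switch {
	case precedes(a.clock, b.clock):
		return []version{b}
	case precedes(b.clock, a.clock), a.clock == b.clock:
		return []version{a}
	case lastWriteWins:
		if b.timestamp > a.timestamp {
			return []version{b}
		}
		return []version{a}
	default:
		return []version{a, b} // siblings, left for the application
	}
}

func main() {
	d := version{value: "D", clock: [3]int{2, 1, 0}, timestamp: 150}
	b := version{value: "B", clock: [3]int{0, 0, 1}, timestamp: 50}
	fmt.Println(len(resolve(d, b, false))) // 2
	fmt.Println(resolve(d, b, true)[0].value) // D
}
```

Last-write-wins drops one of the concurrent versions based on wall-clock timestamps, so it inherits the clock problems described earlier; returning siblings keeps both.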