Slide 1


Keeping Time in Real Systems @kavya719

Slide 2


kavya

Slide 3


timekeeping mechanisms

Slide 4


a distributed key-value store: three nodes. Assume no failures; all operations succeed.
datastore’s timeline: userX PUT { key: v }

Slide 5


a distributed key-value store: three nodes. Assume no failures; all operations succeed.
datastore’s timeline: userX PUT { key: v }, then userX PUT { key: v2 }

Slide 6


a distributed key-value store: three nodes. Assume no failures; all operations succeed.
datastore’s timeline: userX PUT { key: v }, then userX PUT { key: v2 }, then userY GET key → ?

Slide 7


a distributed key-value store: three nodes. Assume no failures; all operations succeed.
datastore’s timeline: userX PUT { key: v }, then userX PUT { key: v2 }, then userY GET key → ?
The value returned depends on the data store’s consistency model.

Slide 8


consistency model: the set of guarantees the system makes about what events will be visible, and when; that is, the set of valid timelines of events. These guarantees are informed and enforced by the timekeeping mechanisms the system uses.

Slide 9


computer clocks: the system clock, NTP, Unix time.
other timekeeping mechanisms: Spanner, Riak.
stepping back

Slide 10


computer clocks

Slide 11


the model
a distributed datastore: multiple nodes, for fault tolerance, scalability, performance. Nodes are logical (processes) or physical (machines), and are sequential. They communicate by message-passing, i.e. they are connected by an unreliable network and share no memory. Data may be replicated and partitioned.

Slide 12


computers have clocks…

func measureX() {
    start := time.Now()
    x()
    end := time.Now()
    // Time x takes.
    elapsed := end.Sub(start)
}

…can we use them?

Slide 13


computers have clocks…

func measureX() {
    start := time.Now()
    x()
    end := time.Now()
    // Time x takes.
    elapsed := end.Sub(start)
}

…can we use them? Three problems: hardware clocks drift; NTP is slow, etc.; the system clock keeps Unix time.

Slide 14


a caveat: details vary by language, OS, architecture, and hardware… but the details don’t matter today. That said, we will be assuming Linux on an x86 processor.

Slide 15


computer clocks are not hardware clocks, but are “run” by hardware and the OS kernel.
time.Now() → clock_gettime(CLOCK_REALTIME), a syscall to get the value of a particular computer clock: the system clock, or wall clock. It gives the current Unix timestamp. (Go’s time.Now() reads the MONOTONIC clock as well.)
hardware clocks drift

Slide 16


the system clock is a counter kept by hardware and the OS kernel.
at system boot: set from the hardware clock, the Real Time Clock (RTC), which keeps UTC time (or from an external source like NTP).
subsequently: incremented using a hardware ticker: “hey HPET, interrupt me in 10ms”; when interrupted, the kernel knows to increment the counter by 10ms. In a “tickless” kernel, the interrupt interval (“tick”) is dynamically calculated.

Slide 17


the system clock is a counter kept by hardware and the OS kernel.
at system boot: set from the hardware clock (or an external source like NTP).
subsequently: incremented using a hardware ticker. These are the hardware clocks that drift, which causes the system clocks of different computers to change at different rates.

Slide 18


NTP is slow etc.
NTP synchronizes the system clock to a highly accurate clock network: you need trusted, reachable NTP servers, and NTP is slow, up to hundreds of ms over the public internet.
NTP either gradually adjusts the clock rate (“skew”), or sets a new value (“step”) if the differential is too large. Stepping results in discontinuous jumps in time.

Slide 19


The system clock keeps Unix time: the “number of seconds since the epoch”, midnight UTC, 01.01.1970. It increases by exactly 86,400 seconds per day. So, the 1000th day after the epoch = 86,400,000 seconds, etc. …but a UTC day is not a constant 86,400 seconds!

Slide 20


interlude: UTC. A messy compromise between:
atomic time: measured using atomic clocks. Very stable; this is what we want to use, e.g. for the (SI) second.
astronomical time: based on the Earth’s rotation. Matches the Earth’s position; sometimes useful (we’re told).
So, UTC: based on atomic time, adjusted to stay in sync with the Earth’s rotational period.

Slide 21


interlude: UTC. A messy compromise between:
atomic time: measured using atomic clocks. Very stable; this is what we want to use, e.g. for the (SI) second.
astronomical time: based on the Earth’s rotation. Matches the Earth’s position; sometimes useful (we’re told).
but problem…

Slide 22


the Earth’s rotation slows down over time, so an astronomical day “takes longer” in absolute (atomic) terms. To compensate for this drift, UTC periodically adds a leap second: 23:59:59 → 23:59:60 → 00:00:00. …so a UTC day may be 86,400 or 86,401 seconds!

Slide 23


The system clock keeps Unix time. Unix time can’t represent the extra second, but we want the computer’s “current time” to be aligned with UTC (in the long run):
UTC: 23:59:59 → 23:59:60 (leap second) → 00:00:00
Unix: 23:59:59 → 23:59:59 (repeats!) → 00:00:00
Unix time is not monotonic.

Slide 24


system clocks are not synchronized, not monotonic across nodes: hardware clocks drift, NTP is slow etc., the system clock keeps Unix time.
example: userX PUT { k: v } on N1 → timestampA = 150

Slide 25


system clocks are not synchronized, not monotonic across nodes: hardware clocks drift, NTP is slow etc., the system clock keeps Unix time.
example: userX PUT { k: v } on N1 → timestampA = 150; then userX PUT { k: v2 } on N2 → timestampB = 50

Slide 26


system clocks are not synchronized, not monotonic across nodes: hardware clocks drift, NTP is slow etc., the system clock keeps Unix time.
example: userX PUT { k: v } on N1 → timestampA = 150; then userX PUT { k: v2 } on N2 → timestampB = 50. ruh roh.
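The "ruh roh" can be made concrete: if the store naively orders writes by node-local timestamps (last-write-wins), the skewed clocks above make the stale value win. A hypothetical sketch, with invented names:

```go
package main

import "fmt"

// write is a hypothetical record stamped with its node's local clock.
type write struct {
	value string
	ts    int64 // node-local Unix timestamp
}

// lastWriteWins picks the write with the larger timestamp.
func lastWriteWins(a, b write) write {
	if b.ts > a.ts {
		return b
	}
	return a
}

func main() {
	// N1's clock reads 150 when it accepts { k: v };
	// N2's slow clock reads 50 when it later accepts { k: v2 }.
	v := write{value: "v", ts: 150}
	v2 := write{value: "v2", ts: 50}
	// The earlier write wins on timestamp: the real order is lost.
	fmt.Println(lastWriteWins(v, v2).value) // v
}
```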

Slide 27


other timekeeping mechanisms

Slide 28


prelude
The timekeeping mechanism a system uses depends on:
desired consistency model: what the valid timelines of events are.
desired availability: how “responsive” the system is.
desired performance: read and write latency, and so throughput.
(the latter two being the costs of higher consistency: CAP theorem, etc.)

Slide 29


spanner
• Distributed relational database; supports distributed transactions
• Horizontally scalable: data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”

Slide 30


spanner
• Distributed relational database; supports distributed transactions
• Horizontally scalable: data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”
(partitions: savings → N1; checking → N2)

Slide 31


spanner
• Distributed relational database; supports distributed transactions
• Horizontally scalable: data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”
(savings → replica group G1; checking → replica group G2)

Slide 32


spanner
• Distributed relational database; supports distributed transactions
• Horizontally scalable: data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”

Slide 33


spanner
• Distributed relational database; supports distributed transactions
• Horizontally scalable: data is partitioned
• Geo-replicated for fault tolerance
• Performant
• Externally consistent: “a globally consistent ordering of transactions that matches the observed commit order.”

Slide 34


savings (G1), checking (G2): minimum total balance requirement = 200; total balance = 200. T1: deposit 100 (G1). T2: debit 100 (G2).

Slide 35


need:
• a consistent timeline across replicas: consensus.
• to order transactions across the system as well, with the order corresponding to the observed commit order: want reads to never contain T2 if they don’t also contain T1; a “globally consistent transaction order that corresponds to observed commit order”.
• desired consistency guarantees with desired performance: reads from replicas, consistent snapshot reads; performant.
Slide 36


if T1 commits before T2 starts to commit, T1 is ordered before T2: the order of transactions == the observed order, even if T1, T2 are across the globe! Can we enforce this ordering using commit timestamps? Yes, with perfectly synchronized clocks… or, if you can know the clock uncertainty perfectly, and account for it.

Slide 37


TrueTime tracks and exposes the uncertainty about perceived time across system clocks. It explicitly represents time as an interval, not a point:
TT.now() → [earliest, latest], an interval that contains “true now”. earliest is the earliest time that could be “true now”; latest is the latest.

Slide 38


if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
The G1 leader sets commit_ts(T1) = TT.now().latest, waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest, and then commits and replies.

Slide 39


if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
The G1 leader sets commit_ts(T1) = TT.now().latest, waits for one full uncertainty window (the commit wait), i.e. until commit_ts < TT.now().earliest, and then commits and replies.

Slide 40


if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
The G1 leader sets commit_ts(T1) = TT.now().latest, waits for one full uncertainty window (the commit wait), i.e. until commit_ts < TT.now().earliest, and then commits and replies. T1 commits. This guarantees that the commit_ts for the next transaction is higher, despite different clocks.

Slide 41


if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
The G2 leader sets commit_ts(T2) = TT.now().latest, waits for one full uncertainty window (the commit wait), i.e. until commit_ts < TT.now().earliest, and then commits and replies. Since T1 had already committed, commit_ts(T2) > commit_ts(T1).

Slide 42


TrueTime provides externally consistent transaction commit timestamps, and so enables external consistency without coordination. Spanner leverages the uncertainty window to provide strongly consistent reads too. …this is neat.

Slide 43


but note: the uncertainty window affects commit wait time, and so write latency and throughput. Google uses impressive (and expensive!) infrastructure to keep this window small; ~7ms as of 2012.

Slide 44


riak
• Distributed key-value database:
 // A data item =
 {“uuid1234”: {“name”:”ada”}}
• Highly available: data partitioned and replicated, decentralized, i.e. all replicas serve reads and writes.
• Eventually consistent: “if no new updates are made to an object, eventually all accesses will return the last updated value.”

Slide 45


three replicas. read_quorum = write_quorum = 1. { cart : [ A ] } N1 N2 N3 userX cart: [ ]

Slide 46


three replicas. read_quorum = write_quorum = 1. { cart : [ A ] } N1 N2 N3 userX { cart : [ A ]} userX { cart : [ D ]} cart: [ ]

Slide 47


three replicas. read_quorum = write_quorum = 1. { cart : [ A ] } { cart : [ A ] } N1 N2 N3 userX { cart : [ A ]} userX { cart : [ D ]} cart: [ ]

Slide 48


three replicas. read_quorum = write_quorum = 1. { cart : [ A ] } { cart : [ A ] } N1 N2 N3 userX { cart : [ A ]} userX { cart : [ D ]} cart: [ ]

Slide 49


three replicas. read_quorum = write_quorum = 1. { cart : [ D ] } { cart : [ A ] } N1 N2 N3 userX { cart : [ A ]} userX { cart : [ D ]} cart: [ ]

Slide 50


“if no new updates are made to an object, eventually all accesses will return the last updated value.”
timekeeping: want any node to serve reads and writes, for availability. need to determine causal updates, for convergence to the latest value, and to determine conflicting updates.

Slide 51


{ cart : [ A ] } N1 N2 N3 userX cart: [ ]

Slide 52


{ cart : [ A ] } N1 N2 N3 userY { cart : [ B ] } userX cart: [ ] concurrent updates conflict

Slide 53


vector clocks: logical clocks that use versions as “timestamps”; a means to establish causal ordering.
(events across N1, N2, N3: A = userX { cart : [ A ] }, B = userY { cart : [ B ] }, C = userX’s read, D = userX { cart : [ D ] })

Slide 54


vector clocks: each node keeps a vector of counters, one per node (n1, n2, n3), initially (0, 0, 0).

Slide 55


vector clocks: userX’s write { cart : [ A ] } (event A) is stamped (1, 0, 0); userY’s write { cart : [ B ] } (event B) is stamped (0, 0, 1).

Slide 56


vector clocks: userX GET cart (event C) returns { cart : [ A ] } with vector clock (2, 0, 0).

Slide 57


vector clocks: userX then writes { cart : [ D ] }, passing along the clock (2, 0, 0) it read.

Slide 58


vector clocks: the write { cart : [ D ] } (event D) is handled by n2, which stamps it (0, 1, 0).

Slide 59


vector clocks: D’s clock is merged with the passed-along context: max((2, 0, 0), (0, 1, 0)) = (2, 1, 0).

Slide 60


vector clocks: a means to establish causal ordering. VCx ≺ VCy indicates x precedes y.
clocks: { cart : [ A ] } = (2, 0, 0); { cart : [ D ] } = (2, 1, 0); { cart : [ B ] } = (0, 0, 1).
(2, 0, 0) ≺ (2, 1, 0), so { cart : [ A ] } precedes { cart : [ D ] }.

Slide 61


vector clocks: VCx ≺ VCy indicates x precedes y; if that doesn’t hold for x and y, they conflict.
(2, 1, 0) and (0, 0, 1) are incomparable, so { cart : [ D ] } conflicts with { cart : [ B ] }.
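The comparison rules above can be sketched in Go (a minimal illustration, not Riak’s implementation; clocks are fixed-length vectors with one slot per node):

```go
package main

import "fmt"

// VC is a vector clock: one logical counter per node.
type VC []int

// merge returns the element-wise max of two clocks.
func merge(a, b VC) VC {
	out := make(VC, len(a))
	for i := range a {
		out[i] = a[i]
		if b[i] > out[i] {
			out[i] = b[i]
		}
	}
	return out
}

// precedes reports whether x causally precedes y:
// x <= y element-wise, and x != y.
func precedes(x, y VC) bool {
	strict := false
	for i := range x {
		if x[i] > y[i] {
			return false
		}
		if x[i] < y[i] {
			strict = true
		}
	}
	return strict
}

// concurrent: neither precedes the other, so the versions conflict.
func concurrent(x, y VC) bool { return !precedes(x, y) && !precedes(y, x) }

func main() {
	a := VC{2, 0, 0} // { cart : [ A ] }
	d := VC{2, 1, 0} // { cart : [ D ] }, written after reading A
	b := VC{0, 0, 1} // { cart : [ B ] }, written independently
	fmt.Println(precedes(a, d))        // true: A precedes D
	fmt.Println(concurrent(d, b))      // true: D and B conflict
	fmt.Println(merge(a, VC{0, 1, 0})) // [2 1 0]
}
```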

Slide 62


logical clocks are a clever proxy for physical time. But logical clocks are divorced from physical time, and need to be passed around. vector clocks; dotted version vectors, a more precise form that Riak uses. …this is pretty neat too.

Slide 63


stepping back…

Slide 64


TrueTime: augmented physical time. Timestamps correspond to wall-clock time; requires globally synchronized clocks.
vector clocks: logical time. Capture causality relations; divorced from physical time.

Slide 65


“A person with a watch knows what time it is. A person with two watches is never sure.” - Segal’s Law, reworded. @kavya719 speakerdeck.com/kavya719/keeping-time-in-real-systems Special thanks to Eben Freeman for reading drafts of this.

Slide 66


Spanner
 Original paper: http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf
 Brewer’s 2017 paper: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45855.pdf
Dynamo
 http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Logical Clocks
 http://amturing.acm.org/p558-lamport.pdf
Dotted Version Vectors
 https://arxiv.org/abs/1011.5808
Hybrid Logical Clocks
 https://www.cse.buffalo.edu//tech-reports/2014-04.pdf

Slide 67


Hybrid Logical Clocks: augmented logical clocks.
example: userX PUT { k: v } on N1 (accurate clock) → timestampA = 100; then userX PUT { k: v2 } on N2 (slow clock) → timestampB = 50. ruh roh.

Slide 68


Hybrid Logical Clocks: augmented logical clocks.
example: userX PUT { k: v } on N1 (accurate clock) → timestampA = <100, 1>; then userX PUT { k: v2 } on N2 (slow clock), which has seen <100, 1> → timestampB = <100, 2>. The later write is ordered after the earlier one, despite the slow clock.
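A sketch of the HLC update rules (after Kulkarni et al.’s Hybrid Logical Clocks paper; the logical component here starts at 0, so the exact stamps differ from the slide’s):

```go
package main

import "fmt"

// HLC is a hybrid logical clock: a physical component pt plus a
// logical counter l that breaks ties when physical clocks lag.
type HLC struct {
	pt int64 // max physical time seen so far
	l  int64 // logical counter
}

// Tick advances the clock for a local or send event, given the node's
// physical clock reading now.
func (c *HLC) Tick(now int64) (int64, int64) {
	if now > c.pt {
		c.pt, c.l = now, 0
	} else {
		c.l++
	}
	return c.pt, c.l
}

// Recv merges a received timestamp (mpt, ml) into the clock.
func (c *HLC) Recv(now, mpt, ml int64) (int64, int64) {
	switch {
	case now > c.pt && now > mpt:
		c.pt, c.l = now, 0
	case mpt > c.pt:
		c.pt, c.l = mpt, ml+1
	case c.pt > mpt:
		c.l++
	default: // c.pt == mpt
		if ml > c.l {
			c.l = ml
		}
		c.l++
	}
	return c.pt, c.l
}

func main() {
	// N1's accurate clock reads 100 when it stamps { k: v }.
	var n1 HLC
	pt, l := n1.Tick(100)
	fmt.Println(pt, l) // 100 0

	// N2's slow clock reads 50, but it has seen N1's stamp <100, 0>,
	// so its write { k: v2 } is ordered after it: <100, 1>.
	var n2 HLC
	pt, l = n2.Recv(50, 100, 0)
	fmt.Println(pt, l) // 100 1
}
```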

Slide 69


consensus
consistent timeline across replicas: replicas must agree on the order of transactions. Consensus provides a unified timeline across nodes; it is a logical proxy for physical time. A leader proposes a write to the other replicas, and the write commits iff n replicas ACK it. Spanner uses Paxos and 2PC (other protocols: 3PC, Raft, Zab).

Slide 70


but consensus…
compromises availability: if n replicas are not available to ACK, writes fail.
compromises performance: it increases write latency and decreases throughput, with multiple coordination rounds until a write commits.
…so we don’t want to use consensus to order transactions across partitions, e.g. T1, T2.

Slide 71


happens-before
orders events across actors (threads or nodes). Formulated in Lamport’s Time, Clocks, and the Ordering of Events paper in 1978; establishes causality and concurrency.
X ≺ Y IF one of:
 • same actor
 • X, Y are a synchronization pair
 • X ≺ E ≺ Y (transitivity)
IF X not ≺ Y and Y not ≺ X, they are concurrent!

Slide 72


causality and concurrency
(events A, B, C, D across N1, N2, N3)
A ≺ C (same actor); C ≺ D (synchronization pair); so, A ≺ D (transitivity).

Slide 73


causality and concurrency
…but neither B ≺ D nor D ≺ B. So, B and D are concurrent!

Slide 74


(events A, B, C, D across N1, N2, N3: { cart : [ A ] }, { cart : [ B ] }, { cart : [ D ] })
A ≺ D, so D should update A.
B and D are concurrent, so B and D need resolution.

Slide 75


Riak stores a vector clock with each version of the data, and is therefore able to determine causal updates versus conflicts. GET and PUT operations on a key pass around a causal context object that contains the vector clocks. (Riak actually uses a more precise form, the “dotted version vector”.)

Slide 76


conflict resolution in riak
Behavior is configurable. Assuming vector clock analysis is enabled:
• last-write-wins: the version with the higher timestamp is picked.
• merge, iff the underlying data type is a CRDT.
• return conflicting versions to the application: riak stores “siblings”, or conflicting versions, and returns them to the application for resolution.
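The third option, application-side resolution, might look like this for the cart (a sketch; set union is one reasonable merge for a shopping cart, not Riak’s API):

```go
package main

import (
	"fmt"
	"sort"
)

// resolve merges conflicting cart versions ("siblings") by set union,
// deduplicating items and sorting for a stable result.
func resolve(siblings ...[]string) []string {
	seen := map[string]bool{}
	var out []string
	for _, s := range siblings {
		for _, item := range s {
			if !seen[item] {
				seen[item] = true
				out = append(out, item)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	d := []string{"date crepe"}
	b := []string{"blueberry crepe"}
	fmt.Println(resolve(d, b)) // [blueberry crepe date crepe]
}
```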

Slide 77


return conflicting versions to the application:
D: { cart: [ “date crepe” ] } with clock (2, 1, 0); B: { cart: [ “blueberry crepe” ] } with clock (0, 0, 1). Riak stores both versions; the next operation returns both to the application; the application must resolve the conflict, e.g. as { cart: [ “blueberry crepe”, “date crepe” ] }, which creates a causal update with clock (2, 1, 1).

Slide 78


…what about resolving those conflicts? Riak doesn’t (by default). Instead, it exposes the happens-before graph to the application for conflict resolution.