Keeping Time in Real Systems

kavya
September 30, 2017

Time, or a proxy for the notion of time, is crucial in any distributed system. From hardware clocks and NTP to interval clocks and logical clocks, this talk will tour the fascinating timekeeping mechanisms used in real systems. We will explore the different expressions of time in the context of practical systems that use them, and ponder over how the timekeeping mechanism affects the properties of the entire system.


Transcript

  1. Keeping Time
    in Real Systems
    @kavya719


  2. kavya


  3. timekeeping
    mechanisms


  4. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    userx
    PUT { key: v }
    datastore’s
    timeline


  5. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    userx
    PUT { key: v }
    userx
    PUT { key: v2 }
    datastore’s
    timeline


  6. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    userx
    PUT { key: v }
    userx
    PUT { key: v2 }
    usery
    GET key ?
    datastore’s
    timeline


  7. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    value depends on
    data store’s consistency model
    userx
    PUT { key: v }
    userx
    PUT { key: v2 }
    usery
    GET key ?
    datastore’s
    timeline


  8. consistency model
    set of guarantees the system makes about
    what events will be visible, and when.
    These guarantees are informed and enforced by the
    timekeeping mechanisms used by the system.
    set of valid timelines of events


  9. computer clocks
    the system clock, NTP, UNIX time.
    stepping back
    other timekeeping mechanisms
    Spanner, Riak


  10. computer clocks


  11. the model
    a distributed datastore:
    multiple nodes for fault tolerance, scalability, performance;
    logical (processes) or physical (machines); are sequential.
    communicate by message-passing i.e.
    connected by unreliable network, no shared memory.
    data may be replicated, partitioned.


  12. computers have clocks…
    func measureX() {
        start := time.Now()
        x()
        end := time.Now()

        // Time x takes.
        elapsed := end.Sub(start)
    }
    …can we use them?


  13. computers have clocks…
    func measureX() {
        start := time.Now()
        x()
        end := time.Now()

        // Time x takes.
        elapsed := end.Sub(start)
    }
    …can we use them?
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    ?


  14. a caveat
    Details vary by language, OS, architecture, hardware…
    but the details don’t matter today.
    That said, we will be assuming Linux on an x86 processor.


  15. hardware clocks drift
    computer clocks are not hardware clocks, but
    are “run” by the hardware and the OS kernel.
    time.Now() → clock_gettime(CLOCK_REALTIME):
    a sys call to get the value of a particular computer clock (another is MONOTONIC).
    The system clock or wall clock.
    Gives the current UNIX timestamp.
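
    As a concrete illustration, here is a minimal sketch of reading two of these clocks on Linux
    through the clock_gettime syscall. It assumes the golang.org/x/sys/unix package and is not
    part of the talk:

    // Sketch: reading two computer clocks on Linux via clock_gettime.
    package main

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    func main() {
        var realtime, monotonic unix.Timespec

        // CLOCK_REALTIME: the system (wall) clock, i.e. the current UNIX timestamp.
        unix.ClockGettime(unix.CLOCK_REALTIME, &realtime)

        // CLOCK_MONOTONIC: counts up from an arbitrary point and never jumps backwards.
        unix.ClockGettime(unix.CLOCK_MONOTONIC, &monotonic)

        fmt.Printf("realtime:  %d.%09d s since the epoch\n", realtime.Sec, realtime.Nsec)
        fmt.Printf("monotonic: %d.%09d s since an arbitrary start\n", monotonic.Sec, monotonic.Nsec)
    }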


  16. the system clock is a counter kept by the hardware and OS kernel.
    at system boot:
    set from the hardware clock (the Real Time Clock (RTC), which keeps UTC time),
    or an external source like NTP.
    subsequently:
    incremented using a hardware ticker.
    “hey HPET, interrupt me in 10ms”;
    then when interrupted, knows to increment by 10ms.
    “tickless” kernel:
    the interrupt interval (“tick”) is dynamically calculated.


  17. the system clock is a counter kept by the hardware and OS kernel.
    at system boot: set from the hardware clock (or an external source like NTP).
    subsequently: incremented using a hardware ticker.
    these are the hardware clocks that drift;
    this causes the system clocks of different computers to change at different rates.


  18. NTP is slow etc.
    NTP synchronizes the system clock to a highly accurate clock network:
    gradually adjusts the clock rate (“skew”), or
    sets a new value (“step”) if the differential is too large.
    but: need trusted, reachable NTP servers.
    NTP is slow, up to hundreds of ms over the public internet.
    stepping results in discontinuous jumps in time.


  19. The system clock keeps UNIX time:
    “number of seconds since the epoch” (midnight UTC, 01.01.1970).
    increases by exactly 86,400 seconds per day.
    So, the 1000th day after the epoch = 86,400,000 etc.
    …but a UTC day is not a constant 86,400 seconds!
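
    A quick check of that arithmetic (an illustrative snippet, not from the talk):

    // 1000 days after the epoch in Unix time: exactly 1000 * 86,400 seconds,
    // because Unix time pretends every day is 86,400 seconds long.
    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        ts := int64(1000 * 86400)               // 86,400,000
        fmt.Println(ts, time.Unix(ts, 0).UTC()) // 86400000 1972-09-27 00:00:00 +0000 UTC
    }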


  20. interlude: UTC
    a messy compromise between:
    atomic time: measured using atomic clocks;
    very stable; this is what we want to use, e.g. the (SI) second.
    astronomical time: based on the Earth’s rotation;
    matches the Earth’s position; sometimes useful (we’re told).
    So, UTC:
    based on atomic time,
    adjusted to be in sync with the Earth’s rotational period.


  21. interlude: UTC
    a messy compromise between:
    atomic time: measured using atomic clocks;
    very stable; this is what we want to use, e.g. the (SI) second.
    astronomical time: based on the Earth’s rotation;
    matches the Earth’s position; sometimes useful (we’re told).
    but problem…


  22. the Earth’s rotation slows down over time,
    so an astronomical day “takes longer” in absolute (atomic) terms.
    To compensate for this drift, UTC periodically adds a second:
    23:59:59 → 23:59:60 (leap second) → 00:00:00
    …so a UTC day may be 86,400 or 86,401 seconds!


  23. The system clock keeps UNIX time
    Unix time can’t represent the extra second, but
    we want the computer’s “current time” to be aligned with UTC (in the long run):
    UTC:  23:59:59 → 23:59:60 (leap second) → 00:00:00
    Unix: 23:59:59 → 23:59:59 (repeats!) → 00:00:00
    Unix time is not monotonic.


  24. not synchronized, not monotonic across nodes
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    timestampA
    = 150
    A
    userX
    PUT { k: v }
    N1
    N2
    example:


  25. not synchronized, not monotonic across nodes
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    timestampA
    = 150
    A
    userX
    PUT { k: v2 }
    timestampB
    = 50
    B
    userX
    PUT { k: v }
    N1
    N2
    example:


  26. not synchronized, not monotonic across nodes
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    timestampA
    = 150
    A
    userX
    PUT { k: v2 }
    timestampB
    = 50
    B
    userX
    PUT { k: v }
    N1
    N2
    example:
    ruh roh.


  27. other
    timekeeping
    mechanisms


  28. prelude
    the timekeeping mechanism used by a system depends on:
    desired consistency model:
    what the valid timelines of events are.
    desired availability:
    how “responsive” the system is.
    desired performance:
    read and write latency, and so throughput.
    (availability and performance are the costs of
    higher consistency; CAP theorem, etc.)


  29. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”


  30. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”
    savings
    N1
    checking
    N2


  31. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”
    savings
    N1
    N1
    G1
    N2
    checking
    N2
    G2


  32. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”


  33. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”


  34. [diagram: savings, replicas N1, N1, group G1; checking, replicas N2, N2, group G2]
    minimum total balance requirement = 200; total balance = 200.
    T1: deposit 100 (at G1); T2: debit 100 (at G2).


  35. want:
    reads to never contain T2 if they don’t also contain T1:
    a “globally consistent transaction order that corresponds to observed commit order”.
    performant.
    need:
    a consistent timeline across replicas: consensus.
    to order transactions across the system as well,
    with the order corresponding to the observed commit order.
    desired consistency guarantees;
    desired performance: reads from replicas, consistent snapshot reads.


  36. if T1 commits before T2 starts to commit, T1 is ordered before T2.
    order of transactions == observed order, even if T1, T2 are across the globe!
    Can we enforce ordering using commit timestamps?
    Yes, if perfectly synchronized clocks…
    or, if you can know clock uncertainty perfectly, and account for it.


  37. TrueTime
    tracks and exposes the uncertainty about perceived time across system clocks.
    explicitly represents time as an interval, not a point:
    TT.now() → [earliest, latest], an interval that contains “true now”.
    earliest is the earliest time that could be “true now”; latest is the latest.


  38. if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
    G1 leader:
    commit_ts(T1) = TT.now().latest
    waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest;
    then, commits and replies.
    [diagram: T1’s commit ts on G1’s timeline]


  39. if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
    G1 leader:
    commit_ts(T1) = TT.now().latest
    waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest;
    then, commits and replies.
    [diagram: T1’s commit ts, then the commit wait, on G1’s timeline]


  40. if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
    G1 leader:
    commit_ts(T1) = TT.now().latest
    waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest;
    then, commits and replies.
    this guarantees the commit_ts for the next transaction is higher, despite different clocks.
    [diagram: T1’s commit ts, commit wait, then T1 commits, on G1’s timeline]


  41. if T1 commits before T2 starts to commit, T1’s commit timestamp is smaller than T2’s.
    G2 leader:
    commit_ts(T2) = TT.now().latest
    waits for one full uncertainty window, i.e. until commit_ts < TT.now().earliest;
    then, commits and replies.
    [diagram: T2’s commit ts and commit wait start on G2’s timeline after T1’s commit ts]
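
    To make the commit-wait rule concrete, here is a minimal sketch. Only the rule itself
    (commit_ts = TT.now().latest, then wait until commit_ts < TT.now().earliest) is from the
    talk; the Interval type, the nowTT helper, and the hard-coded 7ms uncertainty are
    illustrative stand-ins, not Spanner’s actual API.

    // Sketch of TrueTime-style commit wait: take the commit timestamp at the top of
    // the uncertainty interval, then wait until that timestamp is definitely in the
    // past before acknowledging the commit.
    package main

    import (
        "fmt"
        "time"
    )

    type Interval struct {
        Earliest, Latest time.Time
    }

    // nowTT stands in for TT.now(): the local clock padded by an assumed uncertainty
    // bound. Real TrueTime derives this bound from GPS and atomic clock infrastructure.
    func nowTT() Interval {
        const epsilon = 7 * time.Millisecond
        t := time.Now()
        return Interval{Earliest: t.Add(-epsilon), Latest: t.Add(epsilon)}
    }

    func commit(txn string) time.Time {
        commitTS := nowTT().Latest // commit_ts(T) = TT.now().latest

        // Commit wait: one full uncertainty window, i.e. until commit_ts < TT.now().earliest.
        for !commitTS.Before(nowTT().Earliest) {
            time.Sleep(time.Millisecond)
        }

        fmt.Println(txn, "commits at", commitTS)
        return commitTS
    }

    func main() {
        t1 := commit("T1")
        t2 := commit("T2")        // starts to commit only after T1 committed
        fmt.Println(t2.After(t1)) // true: T2's commit timestamp is larger
    }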


  42. TrueTime provides externally consistent
    transaction commit timestamps,
    so enables external consistency without coordination.
    Spanner leverages the uncertainty window to provide strongly consistent reads too.
    …this is neat.


  43. The uncertainty window affects commit wait time, and so

    write latency and throughput.
    Google uses impressive (and expensive!) infrastructure
    to keep this small; ~7ms as of 2012.
    but note


  44. riak
    • Distributed key-value database:

    // A data item = 

    {“uuid1234”: {“name”:”ada”}}

    • Highly available:

    data partitioned and replicated,

    decentralized i.e. all replicas serve reads,
    writes.
    • Eventually consistent:

    “if no new updates are made to an object,
    eventually all accesses will return the last
    updated value.”


  45. three replicas.
    read_quorum = write_quorum = 1.
    { cart : [ A ] }
    N1
    N2
    N3
    userX
    cart: [ ]


  46. three replicas.
    read_quorum = write_quorum = 1.
    { cart : [ A ] }
    N1
    N2
    N3
    userX
    { cart : [ A ]}
    userX
    { cart : [ D ]}
    cart: [ ]


  47. three replicas.
    read_quorum = write_quorum = 1.
    { cart : [ A ] }
    { cart : [ A ] }
    N1
    N2
    N3
    userX
    { cart : [ A ]}
    userX
    { cart : [ D ]}
    cart: [ ]


  48. three replicas.
    read_quorum = write_quorum = 1.
    { cart : [ A ] }
    { cart : [ A ] }
    N1
    N2
    N3
    userX
    { cart : [ A ]}
    userX
    { cart : [ D ]}
    cart: [ ]


  49. three replicas.
    read_quorum = write_quorum = 1.
    { cart : [ D ] }
    { cart : [ A ] }
    N1
    N2
    N3
    userX
    { cart : [ A ]}
    userX
    { cart : [ D ]}
    cart: [ ]


  50. if no new updates are made to an object,
    eventually all accesses will return the last updated value.
    timekeeping
    want:
    any node serves reads and writes for availability
    need:
    determine causal updates for convergence to latest.
    determine conflicting updates.


  51. { cart : [ A ] }
    N1
    N2
    N3
    userX
    cart: [ ]


  52. { cart : [ A ] }
    N1
    N2
    N3
    userY
    { cart : [ B ] }
    userX
    cart: [ ]
    concurrent updates conflict


  53. vector clocks
    logical clocks that use versions as “timestamps”.
    means to establish causal ordering.
    { cart : [ A ] }
    N1
    N2
    N3
    userY
    { cart : [ B ] }
    userX
    { cart : [ A ]}
    userX
    { cart : [ D ]}
    A B
    C D


  54. vector clocks
    [diagram: node timelines n1, n2, n3, each starting with vector clock (0, 0, 0)]


  55. vector clocks
    n1, n2, n3 start at (0, 0, 0).
    userX { cart : [ A ] } → event A at n1: (1, 0, 0)
    userY { cart : [ B ] } → event B at n3: (0, 0, 1)


  56. vector clocks
    userX GET cart at n1 → event C: n1 is now at (2, 0, 0); returns: (2, 0, 0).
    [node clocks: n1 (1, 0, 0) → (2, 0, 0); n3 (0, 0, 1)]


  57. vector clocks
    userX { cart : [ D ] }, with context (2, 0, 0).
    [node clocks: n1 (2, 0, 0); n3 (0, 0, 1)]


  58. vector clocks
    userX { cart : [ D ] } (2, 0, 0) → event D at n2: n2 increments its own entry, (0, 1, 0).
    [node clocks: n1 (2, 0, 0); n2 (0, 1, 0); n3 (0, 0, 1)]


  59. vector clocks
    n2 merges D’s clock with the incoming context: max((2, 0, 0), (0, 1, 0)) = (2, 1, 0).
    [node clocks: n1 (2, 0, 0); n2 (2, 1, 0); n3 (0, 0, 1)]


  60. vector clocks
    a means to establish causal ordering: VCx ≺ VCy indicates x precedes y.
    n1: { cart : [ A ] } at (2, 0, 0); n2: { cart : [ D ] } at (2, 1, 0); n3: { cart : [ B ] } at (0, 0, 1).
    (2, 0, 0) ≺ (2, 1, 0), so { cart : [ A ] } precedes { cart : [ D ] }.


  61. vector clocks
    a means to establish causal ordering: VCx ≺ VCy indicates x precedes y.
    If that doesn’t hold for x and y, they conflict.
    n1: { cart : [ A ] } at (2, 0, 0); n2: { cart : [ D ] } at (2, 1, 0); n3: { cart : [ B ] } at (0, 0, 1).
    neither (2, 1, 0) ≺ (0, 0, 1) nor (0, 0, 1) ≺ (2, 1, 0): { cart : [ D ] } conflicts with { cart : [ B ] }.
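
    A minimal vector-clock sketch of those two checks (illustrative; not Riak’s implementation,
    which uses dotted version vectors):

    // Compare two vector clocks to decide precedes / conflicts, and merge with max().
    package main

    import "fmt"

    type VC []int // one counter per node; fixed length here for simplicity

    // precedes reports whether a ≺ b: no entry of a exceeds b, and at least one is smaller.
    func precedes(a, b VC) bool {
        strictly := false
        for i := range a {
            if a[i] > b[i] {
                return false
            }
            if a[i] < b[i] {
                strictly = true
            }
        }
        return strictly
    }

    // merge returns the element-wise max, combining two causal histories.
    func merge(a, b VC) VC {
        out := make(VC, len(a))
        for i := range a {
            out[i] = a[i]
            if b[i] > a[i] {
                out[i] = b[i]
            }
        }
        return out
    }

    func main() {
        vA, vD, vB := VC{2, 0, 0}, VC{2, 1, 0}, VC{0, 0, 1}

        fmt.Println(precedes(vA, vD))                   // true:  { cart: [A] } precedes { cart: [D] }
        fmt.Println(precedes(vD, vB), precedes(vB, vD)) // false false: D and B conflict
        fmt.Println(merge(vD, vB))                      // [2 1 1]
    }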


  62. logical clocks are a clever proxy for physical time:
    vector clocks; and dotted version vectors, a more precise form that Riak uses.
    but logical clocks:
    need to be passed around.
    are divorced from physical time.
    …this is pretty neat too.


  63. stepping back…


  64. TrueTime: augmented physical time.
    timestamps that correspond to wall-clock time;
    requires globally synchronized clocks.
    vector clocks: logical time.
    causality relations;
    divorced from physical time.


  65. “A person with a watch knows
    what time it is. A person with
    two watches is never sure.”
    - Segal’s Law, reworded.
    @kavya719
    speakerdeck.com/kavya719/keeping-time-in-real-systems
    Special thanks to Eben Freeman for reading drafts of this.


  66. Spanner

    Original paper:
    http://static.googleusercontent.com/media/research.google.com/en/us/archive/
    spanner-osdi2012.pdf


    Brewer’s 2017 paper:
    https://static.googleusercontent.com/media/research.google.com/en//pubs/
    archive/45855.pdf
    Dynamo
    http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

    Logical Clocks

    http://amturing.acm.org/p558-lamport.pdf

    Dotted Version Vectors
    https://arxiv.org/abs/1011.5808

    Hybrid Logical Clocks
    https://www.cse.buffalo.edu//tech-reports/2014-04.pdf



  67. timestampA
    = 100
    userX
    PUT { k: v2 }
    timestampB
    = 50
    B
    userX
    PUT { k: v }
    accurate N1
    slow N2
    Hybrid Logical Clocks
    augmented logical clocks:
    ruh roh.


  68. timestampA
    = <100, 1>
    userX
    PUT { k: v2 }
    timestampB = <100, 2>
    B
    userX
    PUT { k: v }
    accurate N1
    slow N2
    <100, 1>
    Hybrid Logical Clocks

    augmented logical clocks:
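
    Below is a simplified sketch of the hybrid logical clock update rules (after the Kulkarni
    et al. paper in the references; illustrative, not any particular database’s implementation).
    The 100/50 physical readings mirror the accurate/slow nodes above; the counters here start
    at 0 rather than the 1, 2 shown on the slide.

    // A timestamp is <l, c>: l tracks the largest physical time seen so far, and
    // c is a logical counter that breaks ties, so ordering survives skewed clocks.
    package main

    import "fmt"

    type HLC struct{ l, c int64 }

    // tick advances the clock for a local or send event, given physical time pt.
    func (h *HLC) tick(pt int64) HLC {
        if pt > h.l {
            h.l, h.c = pt, 0
        } else {
            h.c++
        }
        return *h
    }

    // recv advances the clock for an event that has seen the remote timestamp m.
    func (h *HLC) recv(m HLC, pt int64) HLC {
        switch {
        case pt > h.l && pt > m.l: // physical time is ahead of both clocks
            h.l, h.c = pt, 0
        case h.l == m.l: // equal l: take the larger counter, then increment
            if m.c > h.c {
                h.c = m.c
            }
            h.c++
        case h.l > m.l: // our l wins: just bump our counter
            h.c++
        default: // remote l wins: adopt it, move the counter past the remote's
            h.l, h.c = m.l, m.c+1
        }
        return *h
    }

    func main() {
        accurate := &HLC{} // N1: physical clock reads 100
        slow := &HLC{}     // N2: physical clock reads only 50

        a := accurate.tick(100) // first write: <100, 0>
        b := slow.recv(a, 50)   // later write on the slow node, having seen a: <100, 1>
        fmt.Println(a, b)       // {100 0} {100 1}: still ordered correctly
    }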


  69. consensus
    replicas must agree on the order of transactions:
    a consistent timeline across replicas.
    …is a logical proxy for physical time; provides a unified timeline across nodes.
    the leader proposes a write to the other replicas;
    the write commits iff n replicas ACK it.
    Spanner uses Paxos, 2PC (other protocols are 3PC, Raft, Zab).


  70. but consensus
    compromises availability —
    if n replicas are not available to ACK writes.
    compromises performance —
    increases write latency, decreases throughput;
    multiple coordination rounds until a write commits.
    …so, we don’t want to use consensus
    to order transactions across partitions, e.g. T1, T2.


  71. happens-before
    orders events across actors (threads or nodes);
    establishes causality and concurrency.
    X ≺ Y IF one of:
    — same actor
    — are a synchronization pair
    — X ≺ E ≺ Y
    IF X not ≺ Y and Y not ≺ X, concurrent!
    Formulated in Lamport’s
    Time, Clocks, and the Ordering of Events paper in 1978.
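
    In Go terms, a channel send/receive is one such synchronization pair; a minimal sketch
    (illustrative, not from the talk):

    // The send (C) happens-before the receive (D), and the write (A) happens-before
    // the send on the same goroutine, so A ≺ D and the write is visible after <-ch.
    package main

    import "fmt"

    func main() {
        var data string
        ch := make(chan struct{})

        go func() {
            data = "written by A" // A: write, on one actor
            ch <- struct{}{}      // C: send, half of a synchronization pair
        }()

        <-ch              // D: receive; C ≺ D, so A ≺ D by transitivity
        fmt.Println(data) // guaranteed to observe A's write
    }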


  72. A ≺ C (same actor)
    C ≺ D (synchronization pair)
    So, A ≺ D (transitivity)
    causality and concurrency
    A B
    C D
    N1
    N2
    N3


  73. …but B ? D

    D ? B
    So, B, D concurrent!
    A B
    C D
    N1
    N2
    N3
    causality and concurrency


  74. A B
    C D
    N1
    N2
    N3
    { cart : [ A ] }
    { cart : [ B ] }
    { cart : [ A ]} { cart : [ D ]}
    A ≺ D

    D should update A

    B, D concurrent
    B, D need resolution


  75. Riak stores a vector clock with each version of the data
    (a more precise form, the “dotted version vector”).
    GET, PUT operations on a key pass around a causal context object
    that contains the vector clocks.
    Therefore, Riak is able to determine causal updates versus conflicts.


  76. conflict resolution in riak
    Behavior is configurable.
    Assuming vector clock analysis is enabled:
    • last-write-wins,
    i.e. the version with the higher timestamp is picked.
    • merge, iff the underlying data type is a CRDT.
    • return conflicting versions to the application:
    Riak stores “siblings”, or conflicting versions,
    which are returned to the application for resolution.


  77. return conflicting versions to application:
    D: { cart: [ “date crepe” ] } at (2, 1, 0)
    B: { cart: [ “blueberry crepe” ] } at (0, 0, 1)
    Riak stores both versions;
    the next op returns both to the application;
    the application must resolve the conflict:
    { cart: [ “blueberry crepe”, “date crepe” ] } at (2, 1, 1),
    which creates a causal update.
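
    A sketch of that application-side resolution step (hypothetical types, not the Riak
    client API): merge the sibling carts by taking the union of their items.

    // resolve merges sibling versions of a cart into one, deduplicating items.
    package main

    import (
        "fmt"
        "sort"
    )

    type Cart []string

    func resolve(siblings []Cart) Cart {
        seen := map[string]bool{}
        var merged Cart
        for _, sibling := range siblings {
            for _, item := range sibling {
                if !seen[item] {
                    seen[item] = true
                    merged = append(merged, item)
                }
            }
        }
        sort.Strings(merged)
        return merged
    }

    func main() {
        d := Cart{"date crepe"}
        b := Cart{"blueberry crepe"}
        fmt.Println(resolve([]Cart{d, b})) // [blueberry crepe date crepe]
    }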


  78. …what about resolving those conflicts?
    Riak doesn’t (default behavior).
    instead, it exposes the happens-before graph
    to the application for conflict resolution.
