Upgrade to Pro — share decks privately, control downloads, hide ads and more …

keeping-time-velocity.pdf

kavya
October 20, 2017
160

 keeping-time-velocity.pdf

kavya

October 20, 2017
Tweet

Transcript

  1. Keeping Time
    in Real Systems
    @kavya719

    View full-size slide

  2. timekeeping
    mechanisms

    View full-size slide

  3. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    userx
    PUT { key: v }
    datastore’s
    timeline

    View full-size slide

  4. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    userx
    PUT { key: v }
    userx
    PUT { key: v2 }
    datastore’s
    timeline

    View full-size slide

  5. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    userx
    PUT { key: v }
    userx
    PUT { key: v2 }
    usery
    GET key ?
    datastore’s
    timeline

    View full-size slide

  6. distributed key-value store.

    three nodes.
    assume no failures, all operations succeed.
    value depends on
    data store’s consistency model
    userx
    PUT { key: v }
    userx
    PUT { key: v2 }
    usery
    GET key ?
    datastore’s
    timeline

    View full-size slide

  7. consistency model
    set of guarantees the system makes about
    what events will be visible, and when.
    set of valid timelines of events
    These guarantees are informed and enforced by the
    timekeeping mechanisms used by the system.

    View full-size slide

  8. computer clocks
    the system clock, NTP, UNIX time.
    stepping back
    bridging the gap
    other timekeeping mechanisms
    Spanner, Riak

    View full-size slide

  9. computer clocks

    View full-size slide

  10. the model
    multiple nodes for fault tolerance, scalability, performance.
    logical (processes) or
    physical (machines).
    are sequential.
    communicate by message-passing i.e.

    connected by unreliable network, 

    no shared memory.
    data may be replicated, partitioned
    a distributed datastore:

    View full-size slide

  11. computers have clocks…
    func measureX() {
    start := time.Now()
    x()
    end := time.Now()
    // Time x takes.
    elapsed := end.Sub(start)
    }
    …can we use them?

    View full-size slide

  12. computers have clocks…
    func measureX() {
    start := time.Now()
    x()
    end := time.Now()
    // Time x takes.
    elapsed := end.Sub(start)
    }
    …can we use them?
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    ?

    View full-size slide

  13. Details vary by language, OS, architecture, hardware.
    …but the details don’t matter today.
    That said, we will be assuming Linux on an x86 processor.
    a caveat

    View full-size slide

  14. computer clocks are not hardware clocks, but
    are “run” by hardware, the OS kernel.
    time.Now()
    MONOTONIC
    clock_gettime(CLOCK_REALTIME)
    sys call to get the value of a
    particular computer clock
    The system clock or wall clock.
    Gives the current UNIX timestamp.
    hardware clocks drift

    View full-size slide

  15. set from the hardware clock.
    (or external source like NTP).
    Real Time Clock (RTC)
    keeps UTC time
    at system boot
    “hey HPET,
    interrupt me in 10ms”
    then when interrupted,
    knows to increment by 10ms.
    “tickless” kernel:

    the interrupt interval (“tick”)
    is dynamically calculated.
    incr using a hardware ticker.
    subsequently
    the system clock is a counter kept by hardware, OS kernel.

    View full-size slide

  16. set from the hardware clock.
    (or external source like NTP).
    incr using a hardware ticker.
    these are the hardware clocks that drift.
    causes system clocks
    of different computers
    to change at different rates.
    at system boot subsequently
    the system clock is a counter kept by hardware, OS kernel.

    View full-size slide

  17. NTP is slow etc.
    synchronizes the system clock to a

    highly accurate clock network:
    need trusted, reachable NTP servers.
    NTP is slow, up to hundreds of ms over public internet.
    stepping results in discontinuous jumps in time.
    }
    gradually adjusts clock rate (“skew”)
    sets a new value (“step”)

    if differential is too large.

    View full-size slide

  18. The system clock keeps UNIX time
    increases by exactly 86, 400 seconds per day.
    So,1000th day after the epoch = 86400000 etc.
    …but a UTC day is not a constant 86, 400 seconds!
    “number of seconds since epoch”
    midnight UTC, 01.01.1970

    View full-size slide

  19. interlude: UTC
    messy compromise between:
    measured using
    atomic clocks
    atomic time
    based on
    the Earth’s rotation
    astronomical time
    very stable;
    this is what we want to use
    e.g. the (SI) second
    matches the Earth’s position;
    sometimes useful (we’re told)
    So, UTC:
    based on atomic time
    adjusted to be in sync with the Earth’s rotational period.

    View full-size slide

  20. interlude: UTC
    messy compromise between:
    based on
    the Earth’s rotation
    measured using
    atomic clocks
    atomic time astronomical time
    very stable;
    this is what we want to use
    e.g. the (SI) second
    matches the Earth’s position;
    sometimes useful (we’re told)
    but problem…

    View full-size slide

  21. the Earth’s rotation slows down over time.
    To compensate for this drift, UTC periodically adds a second.
    So, an astronomical day “takes longer” in
    absolute (atomic) terms.
    …so a UTC day may be 86, 400 or 86, 401 seconds!
    23:59:59
    23:59:60
    00:00:00
    leap second

    View full-size slide

  22. Unix time can’t represent the extra second, but

    want the computer’s “current time” to be aligned with UTC
    (in the long run):
    The system clock keeps UNIX time
    23:59:59
    23:59:59
    00:00:00
    repeats
    !
    Unix
    Unix time is not monotonic.
    23:59:59
    23:59:60
    00:00:00
    leap second
    UTC

    View full-size slide

  23. not synchronized, monotonic across nodes
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    timestampA
    = 150
    A
    userX
    PUT { k: v }
    N1
    N2
    example:
    fast

    View full-size slide

  24. not synchronized, monotonic across nodes
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    timestampA
    = 150
    A
    userX
    PUT { k: v2 }
    timestampB
    = 50
    B
    userX
    PUT { k: v }
    N1
    N2
    example:
    fast

    View full-size slide

  25. not synchronized, monotonic across nodes
    hardware clocks drift.
    NTP is slow etc.
    the system clock keeps Unix time.
    timestampA
    = 150
    A
    userX
    PUT { k: v2 }
    timestampB
    = 50
    B
    userX
    PUT { k: v }
    N1
    N2
    example:
    ruh roh.
    fast

    View full-size slide

  26. other
    timekeeping
    mechanisms

    View full-size slide

  27. prelude
    timekeeping mechanism used by a system depends on:
    desired consistency model

    what the valid timelines of events are
    desired availability

    how “responsive” the system is
    desired performance 

    read and write latency and so, throughput
    ] costs of
    higher consistency
    (CAP theorem, etc.)

    View full-size slide

  28. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”

    View full-size slide

  29. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”
    savings
    N1
    checking
    N2

    View full-size slide

  30. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”
    savings
    N1
    N1
    G1
    N2
    checking
    N2
    G2

    View full-size slide

  31. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”

    View full-size slide

  32. spanner
    • Distributed relational database

    supports distributed transactions
    • Horizontally scalable

    data is partitioned

    • Geo-replicated for fault tolerance
    • Performant
    • Externally consistent:

    “a globally consistent ordering of
    transactions that matches the observed
    commit order.”

    View full-size slide

  33. savings
    N1
    N1
    G1
    N2
    checking
    N2
    G2
    minimum total balance requirement = 200
    total balance = 200
    G1
    G2
    deposit 100
    T1
    debit 100
    T2
    US partition EU partition

    View full-size slide

  34. need
    desired consistency guarantees,
    desired performance: reads from replicas, 

    consistent snapshot reads
    consistent timeline across replicas.
    to order transactions across the system as well.
    the order to correspond to the observed commit order.
    want
    reads to never contain T2,
    if they don’t also contain T1.
    “globally consistent transaction order that corresponds to 

    observed commit order“.

    performant
    consensus.

    View full-size slide

  35. if T1
    commits before T2
    starts to commit,
    T1
    is ordered before T2.
    Can we enforce ordering using commit timestamps?
    order of transactions == observed order
    even if T1,
    T2
    across the globe!
    Yes, if perfectly synchronized clocks.
    …or, if you can know clock uncertainty perfectly,
    and account for it.
    }

    View full-size slide

  36. TrueTime
    tracks and exposes the uncertainty about perceived
    time across system clocks.
    t
    tt
    }
    explicitly represents time as an interval, not a point.
    TT.now() [earliest, latest]
    interval that contains “true now”.

    earliest is the earliest time that could be 

    “true now”; latest is the latest.

    View full-size slide

  37. commit_ts(T1) = TT.now().latest
    waits for one full uncertainty window
    i.e. until commit_ts < TT.now().earliest
    then, commits and replies.
    if T1
    commits before T2
    starts to commit,
    T1
    ’s commit timestamps is smaller than T2
    ’s.
    T1
    commit
    ts
    G1
    leader
    T1

    View full-size slide

  38. commit_ts(T1) = TT.now().latest
    waits for one full uncertainty window
    i.e. until commit_ts < TT.now().earliest
    then, commits and replies.
    G1
    leader
    if T1
    commits before T2
    starts to commit,
    T1
    ’s commit timestamps is smaller than T2
    ’s.
    T1
    commit
    wait
    T1
    commit
    ts

    View full-size slide

  39. commit_ts(T1) = TT.now().latest
    waits for one full uncertainty window
    i.e. until commit_ts < TT.now().earliest
    then, commits and replies.
    G1
    leader
    if T1
    commits before T2
    starts to commit,
    T1
    ’s commit timestamps is smaller than T2
    ’s.
    T1
    commits
    guarantees
    commit_ts for next
    transaction is
    higher,
    despite different clocks.
    ]
    commit
    wait
    T1
    commit
    ts

    View full-size slide

  40. commit_ts(T2) = TT.now().latest
    wait for one full uncertainty window
    i.e. until commit_ts < TT.now().earliest
    then, commit and reply.
    G2
    leader
    T1
    commit
    ts
    if T1
    commits before T2
    starts to commit,
    T1
    ’s commit timestamps is smaller than T2
    ’s.
    T2
    T2
    commit
    ts
    commit
    wait
    commits

    View full-size slide

  41. TrueTime provides externally consistent transaction
    commit timestamps,
    so enables external consistency without coordination.
    …this is neat.
    The uncertainty window affects commit wait time, and so

    write latency and throughput.
    Google uses impressive and expensive! infrastructure
    to keep this small; ~7ms as of 2012.
    but note

    View full-size slide

  42. riak
    • Distributed key-value database:

    // A data item = 

    {“uuid1234”: {“name”:”ada”}}

    • Highly available:

    data partitioned and replicated,

    decentralized i.e. all replicas serve reads,
    writes.
    • Eventually consistent:

    “if no new updates are made to an object,
    eventually all accesses will return the last
    updated value.”

    View full-size slide

  43. if no new updates are made to an object,
    eventually all accesses will return the last updated value.
    timekeeping
    want:
    need:
    determine causal updates for convergence to latest.

    View full-size slide

  44. three replicas.
    read_quorum = write_quorum = 1.
    cart: [ ]
    { cart : [ A ] }
    N1
    N2
    N3
    userX
    { cart : [ A ]}
    userX
    { cart : [ D ]}
    causal updates converge

    View full-size slide

  45. if no new updates are made to an object,
    eventually all accesses will return the last updated value.
    timekeeping
    want:
    need:
    determine causal updates for convergence to latest.
    any node serves reads and writes for availability
    determine conflicting updates.

    View full-size slide

  46. three replicas.
    read_quorum = write_quorum = 1.
    cart: [ ]
    { cart : [ A ] }
    N1
    N2
    N3
    userY
    { cart : [ B ] }
    userX
    concurrent updates conflict

    View full-size slide

  47. vector clocks
    logical clocks that use versions as “timestamps”.
    means to establish causal ordering.
    { cart : [ A ] }
    N1
    N2
    N3
    userY
    { cart : [ B ] }
    userX
    { cart : [ A ]}
    userX
    { cart : [ D ]}
    A B
    C D

    View full-size slide

  48. 0 0 0
    0 0 0 0 0 0
    n1
    n2
    n3 n1
    n2
    n3
    n1
    n2
    n3
    n1 n2 n3
    vector clocks

    View full-size slide

  49. 0 0 0
    0 0 0 0 0 0
    n1
    n2
    n3 n1
    n2
    n3
    n1
    n2
    n3
    n1 n2 n3
    1 0 0
    userX
    { cart : [ A ] }
    A
    0 0 1
    { cart : [ B ] }
    userY
    B
    vector clocks

    View full-size slide

  50. 0 0 0
    0 0 0
    1 0 0
    2 0 0
    0 0 0
    0 0 1
    n1 n2 n3
    userX
    GET cart
    A
    C
    B
    n1
    n2
    n3 n1
    n2
    n3
    n1
    n2
    n3
    (2, 0, 0)
    returns:
    vector clocks

    View full-size slide

  51. 0 0 0
    0 0 0
    1 0 0
    2 0 0
    0 0 0
    0 0 1
    n1 n2 n3
    A
    C
    B
    n1
    n2
    n3 n1
    n2
    n3
    n1
    n2
    n3
    userX
    { cart : [ D ] }
    (2, 0, 0)
    vector clocks

    View full-size slide

  52. 0 0 0
    0 0 0
    1 0 0
    2 0 0
    0 0 0
    0 0 1
    0 1 0
    n1 n2 n3
    A
    C
    D
    B
    n1
    n2
    n3 n1
    n2
    n3
    n1
    n2
    n3
    userX
    { cart : [ D ] }
    (2, 0, 0)
    vector clocks

    View full-size slide

  53. 0 0 0
    2 1 0
    n1
    n2
    n3
    0 0 0
    1 0 0
    2 0 0
    0 0 0
    0 0 1
    n1 n2 n3
    n1
    n2
    n3 n1
    n2
    n3
    max ((2, 0, 0),
    (0, 1, 0))
    A
    C
    D
    B
    userX
    { cart : [ D ] }
    (2, 0, 0)
    vector clocks

    View full-size slide

  54. 2 1 0
    2 0 0 0 0 1
    n1 n2 n3
    { cart : [ A ] } { cart : [ D ] } { cart : [ B ] }
    VCx
    ≺ VCy
    indicates x precedes y
    means to establish causal ordering.
    2 0 0
    2 1 0
    { cart : [ A ] } precedes { cart : [ D ] }
    vector clocks

    View full-size slide

  55. 2 1 0
    2 0 0 0 0 1
    n1 n2 n3
    { cart : [ A ] } { cart : [ D ] } { cart : [ B ] }
    If that doesn’t hold for x and y, they conflict
    VCx
    ≺ VCy
    indicates x precedes y
    means to establish causal ordering.
    { cart : [ D ] } conflicts with { cart : [ B ] }
    0 0 1
    2 1 0
    vector clocks

    View full-size slide

  56. timestampA
    = (1, 0)
    A
    userX
    PUT { k: v2 }
    timestampB
    = (1, 1)
    B
    userX
    PUT { k: v }
    N1
    N2
    faster by 10ms

    View full-size slide

  57. need to passed around.
    are divorced from physical time.
    but logical clocks
    logical clocks are a clever proxy for physical time.
    vector clocks,
    dotted version vectors, 

    a more precise form that Riak uses.
    …this is pretty neat too.

    View full-size slide

  58. stepping back…

    View full-size slide

  59. TrueTime

    + timestamps that correspond to wall-clock time.
    - specialized infrastructure.
    logical clocks

    + we can all do causality tracking! 

    - timestamps don’t correspond to wall-clock time.

    View full-size slide

  60. hybrid logical clocks

    augmented logical clocks:

    View full-size slide

  61. hybrid logical clocks

    augmented logical clocks:
    A
    N1
    N2
    faster by 20ms
    userX
    PUT { k: v }
    50
    40
    timestampB
    = <100, 2>
    timestampA
    = <50, 1>
    <50, 1>
    userX
    PUT { k: v2 }
    B
    55
    45
    <50, 1>

    View full-size slide

  62. “A person with a watch knows
    what time it is. A person with
    two watches is never sure.”
    - Segal’s Law, reworded.
    @kavya719
    speakerdeck.com/kavya719/keeping-time-in-real-systems
    Special thanks to Eben Freeman for reading drafts of this.

    View full-size slide

  63. Spanner

    Original paper:
    http://static.googleusercontent.com/media/research.google.com/en/us/archive/
    spanner-osdi2012.pdf


    Brewer’s 2017 paper:
    https://static.googleusercontent.com/media/research.google.com/en//pubs/
    archive/45855.pdf
    Dynamo
    http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

    Logical Clocks

    http://amturing.acm.org/p558-lamport.pdf

    Dotted Version Vectors
    https://arxiv.org/abs/1011.5808

    Hybrid Logical Clocks
    https://www.cse.buffalo.edu//tech-reports/2014-04.pdf

    View full-size slide

  64. replicas must agree on the order of transactions.
    consistent timeline across replicas
    N1
    N1
    G
    …is logical proxy for physical time.
    provides a unified timeline across nodes.
    leader proposes write to other replicas,

    write commits iff n replicas ACK it.
    Spanner uses Paxos, 2PC
    (other protocols are 3PC, Raft, Zab).
    consensus

    View full-size slide

  65. compromises availability —
    if n replicas are not be available to ACK writes.
    compromises performance — 

    increases write latency, decreases throughput;

    multiple coordination rounds until a write commits.
    but consensus
    … so, don’t want to use consensus
    to order transactions across partitions.
    e.g. T1,
    T2

    View full-size slide

  66. happens-before
    X ≺ Y IF one of:
    — same actor
    — are a synchronization pair
    — X ≺ E ≺ Y
    across actors.
    IF X not ≺ Y and Y not ≺ X ,
    concurrent!
    orders events
    Formulated in Lamport’s 

    Time, Clocks, and the
    Ordering of Events paper
    in 1978.
    establishes causality and
    concurrency.
    (threads or nodes)

    View full-size slide

  67. A ≺ C (same actor)
    C ≺ D (synchronization pair)
    So, A ≺ D (transitivity)
    causality and concurrency
    A B
    C D
    N1
    N2
    N3

    View full-size slide

  68. …but B ? D

    D ? B
    So, B, D concurrent!
    A B
    C D
    N1
    N2
    N3
    causality and concurrency

    View full-size slide

  69. A B
    C D
    N1
    N2
    N3
    { cart : [ A ] }
    { cart : [ B ] }
    { cart : [ A ]} { cart : [ D ]}
    A ≺ D

    D should update A

    B, D concurrent
    B, D need resolution

    View full-size slide

  70. GET, PUT operations on a key pass around a casual context object,
    that contains the vector clocks.
    a more precise form,

    “dotted version vector”
    Riak stores a vector clock with each version of the data.
    Therefore, able to determine causal updates versus conflicts.

    View full-size slide

  71. conflict resolution in riak
    Behavior is configurable.

    Assuming vector clock analysis enabled:

    • last-write-wins

    i.e. version with higher timestamp picked.
    • merge, iff the underlying data type is a CRDT
    • return conflicting versions to application

    riak stores “siblings” or conflicting versions,

    returned to application for resolution.

    View full-size slide

  72. return conflicting versions to application:
    0 0 1
    2 1 0
    D: { cart: [ “date crepe” ] }
    B: { cart: [ “blueberry crepe” ] }
    Riak stores both versions
    next op returns both to application
    application must resolve conflict
    { cart: [ “blueberry crepe”, “date crepe” ] }
    2 1 1
    which creates a causal update
    { cart: [ “blueberry crepe”, “date crepe” ] }

    View full-size slide

  73. …what about resolving those conflicts?
    doesn’t
    (default behavior).
    instead, exposes happens-before graph
    to the application for conflict resolution.

    View full-size slide