
Progressive Systems


An early discussion of Progressive Systems, CAP and CALM, using RAMP and Bw-Trees as examples of Progressive design. Presented at LinkedIn NYC, 9/30/2015.

Joe Hellerstein

September 30, 2015

Transcript

  1. Progressive Systems
    Joe Hellerstein Berkeley/Trifacta


  2. What slows us down?


  3. What slows us down?


  4. What slows us down?


  5. What slows us down?


  6. What slows us down?


  7. What slows us down?
    Coordination
    Signals
    Barriers
    Communication


  8. But we need coordination, right?


  9. This is familiar...
    Coordination
    Locks
    Latches
    Mutexes
    Semaphores
    Compute Barriers
    Distributed Coordination


  10. The CAP Theorem
    {CA} {AP} {CP}
    •  Consistency, Availability, Partition tolerance.
    – Choose 2.
    •  Why?
    – Consistency requires coordination!
    •  And you can’t coordinate without communication.


  11. Partitions don’t happen very often


  12. But coordination still slows us
    down


  13. The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.
    —James Hamilton (IBM, MS, Amazon)
    System Poetry


  14. The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.
    —James Hamilton (IBM, MS, Amazon)
    System Poetry


  15. The first principle of successful scalability is to batter the consistency mechanisms down to a minimum, move them off the critical path, hide them in a rarely visited corner of the system, and then make it as hard as possible for application developers to get permission to use them.
    —James Hamilton (IBM, MS, Amazon)
    coordination
    System Poetry


  16. The CAP Theorem
    {CA} {AP} {CP}
    •  Consistency, Availability, Partition tolerance.
    – Choose 2.
    •  Why?
    – Consistency requires coordination!
    •  And you can’t coordinate without communication.
    Latency


  17. The CAP Theorem
    {CA} {AP} {CP}
    Coordination is too expensive.


  18. The CAP Theorem
    {CA} {AP} {CP}
    We have to sacrifice Consistency!
    Closed Closed


  19. Mayhem Ensues


  20. So, when is coordination required?


  21. The CALM Theorem
    KEEP CALM


  22. CALM
    Consistency
    As
    Logical
    Monotonicity


  23. CALM
    Consistency
    As
    Logical
    Monotonicity
    All processes
    respect invariants
    and agree on outcomes
    regardless of message
    ordering.


  24. CALM
    Consistency
    As
    Logical
    Monotonicity
    Program logic ensures
    that all state always
    makes progress in one
    direction.
    Once a fact is known, it
    never changes.

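To make the slide’s point concrete, here is a minimal sketch (plain Python, with invented names) of a monotone program: facts only accumulate, so every delivery order of the same messages yields the same answer.

```python
import itertools

# Hypothetical illustration of a monotone program: facts are never retracted,
# so the query answer is independent of message delivery order.
class MonotoneFriendGraph:
    def __init__(self):
        self.edges = set()              # grow-only set of facts

    def on_message(self, edge):
        self.edges.add(edge)            # add is commutative, associative, idempotent

    def friends_of(self, user):
        return {b for (a, b) in self.edges if a == user}

msgs = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
answers = set()
for perm in itertools.permutations(msgs):
    g = MonotoneFriendGraph()
    for m in perm:
        g.on_message(m)
    answers.add(frozenset(g.friends_of("alice")))
assert len(answers) == 1                # same outcome under every ordering
```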

  25. CALM
    FORMALLY
    Theorem (CALM):
    A program specification has a
    consistent, coordination-free
    implementation if and only if
    its logic is monotone.
    Avoids coordination
    Monotone


  26. CALM
    NOTE
    CALM precisely answers
    the question of when one
    can get Consistency
    without Coordination*.
    It does not tell you how
    to achieve this goal!
    *i.e. when CAP does not hold


  27. Progressive Systems
    Systems built upon
    monotonically growing state.
    Logs
    Counters
    Vector Clocks
    Immutable Variables
    Deltas

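As one concrete instance of the monotonically growing state listed above, here is a small hypothetical vector-clock sketch: each entry only moves upward and merge is an element-wise max, so merges can be applied in any order.

```python
# Hypothetical vector clock: entries only increase; merge is element-wise max,
# so merging is commutative and order-insensitive.
def tick(clock, node):
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(a, b):
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

c1 = tick(tick({}, "n1"), "n1")         # {"n1": 2}
c2 = tick({}, "n2")                     # {"n2": 1}
assert merge(c1, c2) == merge(c2, c1) == {"n1": 2, "n2": 1}
```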

  28. Two Recent Examples
    •  RAMP Transactions (Global-scale system)
    •  Bw-Tree (In-Memory Index)


  29. RAMP
    Scalable Atomic Visibility with RAMP
    Transactions. P Bailis, A Fekete, A Ghodsi,
    JM Hellerstein, I Stoica. SIGMOD 2014.
    Slides courtesy Peter Bailis


  30. Social Graph
    1.2B+ vertices
    420B+ edges
    Facebook


  31. Social Graph
    User   Adjacency List
    1      2, 3, 5
    2      1, 3, 5
    3      1, 5, 6
    4      6
    5      1, 2, 3, 6
    6      3, 4, 5
    1.2B+ vertices
    420B+ edges
    Facebook


  32. User 1: 2, 3, 5 (+6)    User 6: 3, 4, 5 (+1)
    To preserve graph,
    should observe either:
    »  Both links
    »  Neither link
    Atomic Visibility!


  33. Atomic Visibility
    One transaction writes X = 1 and Y = 1; another transaction reads X and Y.
    Allowed: the reader observes both writes, OR neither.
    Either all or none of each transaction’s updates should be visible to other transactions.


  34. Atomic Visibility, BUT NOT:
    the reader observes one write without the other (e.g. X = 1 but Y = 0, or Y = 1 but X = 0): “FRACTURED READS”.
    Either all or none of each transaction’s updates should be visible to other transactions.

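A tiny sketch of the fractured read ruled out above, with made-up names: the writer’s two updates land on different servers, and a reader that interleaves between them observes a state no atomic execution could produce.

```python
# Hypothetical fractured read: the writer updates X on one server, then Y on
# another; a reader that runs in between sees X=1 but Y=0.
servers = {"X": 0, "Y": 0}

def write_txn():
    servers["X"] = 1     # first write reaches its server...
    yield                # ...the reader interleaves here...
    servers["Y"] = 1     # ...before the second write lands

w = write_txn()
next(w)                  # apply only the first write, then read
snapshot = (servers["X"], servers["Y"])
assert snapshot == (1, 0)   # fractured: neither "both" nor "neither"
```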

  35. Atomic Visibility
    is pretty useful:
    Maintain an index: each patient Seen By an attending doctor.


  36. RAMP: Basic State
    On each node:
    •  Every transaction has a (one-way) ready bit at each node.
    •  Every node has a (monotonically increasing) highest timestamp committed.
    •  Immutable data with (monotonically increasing) timestamps.
    •  Every transaction is assigned a timestamp from a (monotonically increasing) counter.
    Example: versions X=0 @ T10 and X=1 @ T13; transaction T13’s ready bit is set (✓).

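A minimal sketch of the per-node state just listed, using invented names (versions, ready, highest_committed); the actual data structures are those of the RAMP paper cited on the earlier slide.

```python
# Hypothetical per-node RAMP state, mirroring the slide:
#   - immutable versions keyed by (item, timestamp)
#   - a one-way ready bit per transaction timestamp
#   - a monotonically increasing highest-committed timestamp
class RampNode:
    def __init__(self):
        self.versions = {}            # (item, ts) -> (value, write_set metadata)
        self.ready = set()            # txn timestamps whose ready bit is set
        self.highest_committed = 0    # only ever increases

    def prepare(self, item, ts, value, write_set):
        self.versions[(item, ts)] = (value, write_set)   # never overwritten

    def commit(self, ts):
        self.ready.add(ts)                               # one-way bit
        self.highest_committed = max(self.highest_committed, ts)
```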

  37. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002


  38. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    X=0 Y=0
    Server 1002
    X=1
    X = ?
    R
    Y = ?
    R
    X = 1
    Y = 0
    via intention metadata


  39. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    Server 1001
    Y=0
    Server 1002
    X=1
    via intention metadata


  40. value
    Y=0 T0 {}
    intention
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · T0
    intention
    ·
    via intention metadata
    “A transaction called T1 wrote this and also wrote to Y”


  41. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    · value
    Y=0 T0 {}
    intention
    ·
    via intention metadata
    X = ?
    R
    Y = ?
    R


  42. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    value
    X=1 T1 {Y}
    intention
    ·
    via intention metadata
    X = ?
    R
    Y = ?
    R
    X = 1
    Y = 0
    Where is T1’s write to Y?
    value
    Y=0 T0 {}
    intention
    ·
    “A transaction called T1 wrote this and also wrote to Y”
    via multi-versioning, ready bit


  43. Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    X = 1
    W
    Y = 1
    W
    value
    X=1 T1 {Y}
    intention
    ·
    via intention metadata
    via multi-versioning, ready bit
    value
    Y=0 T0 {}
    intention
    ·


  44. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    1.) Place each write on its server.
    2.) Set the ready bit for each write on its server.
    via multi-versioning, ready bit
    Ready bit monotonicity: once ready bit is set, all writes in
    transaction are present on their respective servers

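A sketch of the two-round write above, building on the hypothetical RampNode class from the “RAMP: Basic State” sketch: round 1 places each write with its write set as intention metadata, round 2 flips the ready bits.

```python
# Hypothetical RAMP write (assumes the RampNode sketch from the
# "RAMP: Basic State" slide above). Once any ready bit is visible, every write
# in the transaction is already present on its server.
def ramp_write(servers, ts, writes):
    write_set = set(writes)                    # intention metadata, e.g. {"X", "Y"}
    for item, value in writes.items():         # round 1: place the writes
        servers[item].prepare(item, ts, value, write_set)
    for item in writes:                        # round 2: set the ready bits
        servers[item].commit(ts)

servers = {"X": RampNode(), "Y": RampNode()}
ramp_write(servers, ts=1, writes={"X": 1, "Y": 1})
```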

  45. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    via multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = 1
    W
    Y = 1
    W
    X = ?
    R
    Y = ?
    R
    Ready bit monotonicity: once ready bit is set, all writes in
    transaction are present on their respective servers


  46. Y=1 T1 {X}
    ·
    X=1 T1 {Y}
    ·
    Atomic Visibility via RAMP Transactions
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    via multi-versioning
    value intention
    X=0 T0 {}
    · value intention
    Y=0 T0 {}
    ·
    X = ?
    R
    Y = ?
    R
    1.) Fetch “highest” ready versions.
    2.) Fetch any missing writes
    using metadata.
    X = 1
    Y = 0
    Y = 1
    Ready bit monotonicity: once ready bit is set, all writes in
    transaction are present on their respective servers

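And a sketch of the two-round read above, again using the hypothetical RampNode and servers from the previous sketches: fetch the highest ready version of each item, then use the intention metadata to fetch any missing sibling writes by exact timestamp.

```python
# Hypothetical RAMP-Fast read (assumes RampNode and servers from above).
#   round 1: latest ready version of each item
#   round 2: if metadata shows a newer sibling write we missed, fetch that
#            exact version by timestamp.
def ramp_read(servers, items):
    first = {}
    for item in items:
        node = servers[item]
        ts = max((t for (i, t) in node.versions if i == item and t in node.ready),
                 default=0)
        first[item] = (ts, node.versions.get((item, ts), (None, set())))

    required = {}                                   # item -> timestamp we must see
    for item, (ts, (_, write_set)) in first.items():
        for sibling in write_set:
            if sibling in items:
                required[sibling] = max(required.get(sibling, 0), ts)

    result = {}
    for item, (ts, (value, _)) in first.items():
        need = required.get(item, 0)
        if need > ts:                               # round 2: repair the fracture
            value, _ = servers[item].versions[(item, need)]
        result[item] = value
    return result

print(ramp_read(servers, ["X", "Y"]))               # {'X': 1, 'Y': 1}
```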

  47. RAMP Variants
    Algorithm     Write RTTs   Read RTTs (best case)   Read RTTs (worst case)   Metadata
    RAMP-Fast     2            1                       2                        O(txn len): write-set summary
    RAMP-Small    2            2                       2                        O(1): timestamp
    RAMP-Hybrid   2            1+ε                     2                        O(B(ε)): Bloom filter
    REPAIR
    ATOMICITY
    DETECT
    RACES
    via intention metadata
    via multi-versioning, ready bit


  48. Throughput comparison: YCSB Workload A, 95% reads, 1M items, 4 items/txn.
    Compared: No Concurrency Control, RAMP-Fast, RAMP-Hybrid, RAMP-Small, Write Locks Only, Serializable 2PL.


  49. No Coordination On
    This RAMP


  50. Bw-Trees
    The Bw-Tree: A B-tree for New Hardware
    Platforms. JJ Levandoski, DB Lomet, S
    Sengupta. ICDE 2013.


  51. In-Memory SQL Performance Analysis
    •  Improve CPI?  < 2x benefit
    •  Improving multi-core scalability?  < 2x benefit
    Solution: reduce # of instructions per transaction. By a LOT!
      10x faster?  90% fewer instructions
      100x faster?  99% fewer instructions
    Q: Where are the inner-loop instructions?
    A: Index access
    •  especially latching and locking
    Answer: no latches, no locks, i.e. Avoid Coordination.
    Diaconu, et al. “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”. SIGMOD 2013.


  52. The Bw-Tree: What is it?
    A Latch-free, Log-structured B-tree for
    Multi-core Machines with Large Main
    Memories and Flash Storage
    Bw = Buzz Word
    No coordination: Progressive!


  53. Bw-Tree Delta Updates
    Mapping Table: PID → Physical Address
    Page P: Δ Insert record 50, Δ Delete record 48, Δ Update record 35, Δ Insert record 60
    Consolidated Page P
    •  Each page update produces a new address (the delta).
    •  Install the new page address in the map using compare-and-swap.
    •  Only one winner on concurrent update to the same address.
    •  Eventually install a new consolidated page with the deltas applied.
    •  Single-page updates are easy; node splits and deletes are solved too.
    Coordination happens here, via the CAS instruction.
    A monotonic log of updates.
    A monotonic accumulation of versions.

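A sketch of the delta-update scheme above, with invented names; a real Bw-Tree uses a hardware compare-and-swap on the mapping-table slot, which this plain-Python sketch imitates with a lock-protected compare_and_swap.

```python
import threading

# Hypothetical mapping-table slot: the only mutable cell. Pages are immutable;
# an "update" prepends a delta record that points at the previous state.
class Slot:
    def __init__(self, page):
        self._addr = page
        self._lock = threading.Lock()        # stand-in for a hardware CAS

    def load(self):
        return self._addr

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._addr is expected:
                self._addr = new
                return True
            return False                      # another updater won the race

class Delta:
    def __init__(self, op, next_node):
        self.op, self.next = op, next_node    # monotonic chain of updates

def install_delta(slot, op):
    while True:                               # losers re-read and retry
        old = slot.load()
        if slot.compare_and_swap(old, Delta(op, old)):
            return

slot = Slot(page={"base": [35, 48, 50]})
install_delta(slot, ("insert", 60))
install_delta(slot, ("delete", 48))
```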

  54. Page Splits
    From the Bw-Tree paper (consolidation):
    “…consolidation that creates a new “re-organized” base page containing all the entries from the original base page as modified by the updates from the delta chain. We trigger consolidation if an accessor thread, during a page search, notices a delta chain length has exceeded a system threshold. The thread performs consolidation after attempting its update (or read) operation. When consolidating, the thread first creates a new base page (a new block of memory). It then populates the base page with a sorted vector containing the most recent version of a record from either the delta chain or old base page (deleted records are discarded). The thread then installs the new address of the consolidated page in the mapping table with a CAS. If it succeeds, the thread requests garbage collection (memory reclamation) of the old page state. Figure 2(b) provides an example depicting the consolidation of page P that incorporates deltas into a new “Consolidated Page P”. If this CAS fails, the thread abandons the operation by deallocating the new page. The thread does not retry, as a subsequent thread will eventually perform a successful consolidation.”
    Fig. 3 (from the paper). Split example: (a) creating sibling page Q; (b) installing split delta (CAS); (c) installing index entry delta (CAS). Dashed arrows represent logical pointers, while solid arrows represent physical pointers.
    Page “updates” are actually appends to a progressively growing log. Only Ptrs are mutating (via CAS instruction).

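A sketch of the consolidation step described in the excerpt, continuing the hypothetical Slot/Delta classes above: fold the delta chain into a new base page, then try to install it with a single CAS; on failure, abandon the new page rather than retry.

```python
# Hypothetical consolidation (continues the Slot/Delta sketch above): replay the
# delta chain onto a fresh base page and swing the slot to it with one CAS.
def consolidate(slot):
    top = slot.load()
    ops, node = [], top
    while isinstance(node, Delta):            # walk the chain, newest first
        ops.append(node.op)
        node = node.next
    records = set(node["base"])               # node is now the old base page
    for action, key in reversed(ops):         # replay oldest -> newest
        if action == "insert":
            records.add(key)
        else:
            records.discard(key)
    new_base = {"base": sorted(records)}
    slot.compare_and_swap(top, new_base)      # on failure: abandon, don't retry

consolidate(slot)
print(slot.load())                            # {'base': [35, 50, 60]}
```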

  55. Fig. 6. Bw-tree and BerkeleyDB, operations/sec (M), Bw-Tree vs. BerkeleyDB:
    Xbox: 10.40 vs. 0.56    Synthetic: 3.83 vs. 0.66    Deduplication: 2.84 vs. 0.33
    From the paper: “…over linked delta chains are good for branch prediction and prefetching in general, the Xbox workload has large 100-byte records, meaning fewer deltas will fit into the L1 cache during a scan. The synthetic workload contains small 8-byte keys, which are more amenable to prefetching and caching. Thus, delta chain lengths can grow longer (to about eight deltas) without performance consequences.”


  56. Reflection
    •  CAP? CALM.
    – Nothing in PTime requires coordination
    •  Wow
    – But CALM only tells us what’s possible
    •  Not how to do it.
    •  How do we get good at designing
    progressive systems?


  57. Getting Progressive
    1.  Design patterns
    –  Use a log as ground truth
    •  Derive data structures via “queries” over the streaming log
    –  Use versions, not mutable state
    –  ACID 2.0: Associative, Commutative, Idempotent
    –  Your ideas go here...
    2.  Libraries and Languages
    –  CRDTs are monotonic data types
    •  Have to link them together carefully
    –  Bloom and Eve are languages whose compilers can test for
    monotonicity

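A tiny sketch of the “ACID 2.0” idea from the list above: a hypothetical grow-only counter CRDT whose merge is associative, commutative, and idempotent, so replicas converge without coordination.

```python
# Hypothetical G-Counter CRDT: one monotone slot per replica, merged by
# element-wise max. Merge is associative, commutative, and idempotent.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        for r, c in other.counts.items():
            self.counts[r] = max(self.counts.get(r, 0), c)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a); a.merge(b)            # repeated merges are harmless
assert a.value() == b.value() == 3
```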

  58. More?
    Declarative Networking: Recent Theoretical Work on Coordination,
    Correctness, and Declarative Semantics. T Ameloot. SIGMOD Record
    2014.
    Scalable Atomic Visibility with RAMP Transactions. P Bailis, A Fekete,
    A Ghodsi, JM Hellerstein, I Stoica. SIGMOD 2014.
    The Bw-Tree: A B-tree for New Hardware Platforms. JJ Levandoski,
    DB Lomet, S Sengupta. ICDE 2013.
    http://boom.cs.berkeley.edu
    http://bit.ly/progressiveseminar


  59. Backup Slides


  60. Spanner?
    Table 6: F1-perceived operation latencies (ms) measured over the course of 24 hours.
    operation            mean    std dev   count
    all reads            8.7     376.4     21.5B
    single-site commit   72.3    112.8     31.2M
    multi-site commit    103.0   52.2      32.1M
    “…of such tables are extremely uncommon. The F1 team has only seen such behavior when they do untuned bulk data loads as transactions.”
    10 TPS!
    [Corbett, et al. “Spanner: …”, OSDI 2012]


  61. Speed of light
    7 global round-trips per sec

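A back-of-the-envelope check of the figure on this slide, assuming a round-the-world path of roughly 40,000 km and the vacuum speed of light; real fiber is slower and routes are longer, so the practical number is lower.

```python
# Rough arithmetic behind "7 global round-trips per sec" (assumed figures).
earth_circumference_km = 40_000
speed_of_light_km_s = 300_000
print(speed_of_light_km_s / earth_circumference_km)   # 7.5 trips per second
```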

  62. Facebook Tao
    Google Megastore
    LinkedIn Espresso
    Due to coordination overheads…
    Amazon DynamoDB
    Apache Cassandra
    Basho Riak
    Yahoo! PNUTS
    …consciously choose to
    violate atomic visibility
    “[Tao] explicitly favors efficiency and availability over consistency… [an edge] may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job.”
    Google App Engine


  63. Atomic Visibility is not serializability!
    Two concurrent transactions: T1 does r(x)=0, w(y←1); T2 does r(y)=0, w(x←1).
    If T1 ran first, T2 should have r(y)→1; if T2 ran first, T1 should have r(x)→1.
    CONCURRENT EXECUTION IS NOT SERIALIZABLE!
    …but it respects Atomic Visibility!
