BDI SIG tech talk

UENISHI Kota
December 18, 2015


  1. Replication in Basho Products,
    and the latest one: Machi
    2015/12/18 BDI SIG
    Kota UENISHI
    Basho Japan KK


  2. Replication:
    Never Lose Data
    •Hold multiple copies to tolerate
    media failure
    •If a copy is lost, recover it from
    remaining copies
    •Keep all of them up to date,
    preventing implicit overwrites


  3. History of Replication
    …as of Basho
    • Riak 0.x: Quorum-based replication with
    vector clocks based on Client Ids
    • Riak 1.x: Quorum-based replication with
    vector clocks based on Vnode Ids
    • Riak 2.0: CRDTs and DVVs
    • Riak 2.x: MultiPaxos
    • Machi: Chain Replication


  4. “High” Availability
    • One implication of the CAP theorem:
    • No database can keep consistency (of atomic
    objects) and availability under network partition
    • One implication of the FLP impossibility result:
    • No consensus: atomic objects (nodes) cannot be
    kept consistent when a majority may fail or be
    delayed
    • The network is not reliable [7], and no database before
    Dynamo and Riak tried to tolerate these failures


  5. Causal Consistency
    • Causally related writes must be seen in the same
    order, while concurrent writes may be seen in
    different orders by different observers
    • An alternative guarantee for an available system
    that tolerates any kind of network partition
    • Works under network partitions, failures, delays,
    whatever, because it needs no coordination (unlike
    Strong Consistency)
    • Allows multiple values, but tracking the causality of
    those values matters
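    The happens-before relation these bullets rely on can be sketched with
    plain version vectors. This is a hypothetical Python illustration of the
    comparison rule, not Riak's code; the names `descends` and `concurrent`
    are my own.

    ```python
    # Minimal version-vector comparison (illustrative sketch, not Riak code).
    # A version vector maps an actor id to a counter of its updates.

    def descends(a, b):
        """True if vector a has seen everything in vector b."""
        return all(a.get(actor, 0) >= count for actor, count in b.items())

    def concurrent(a, b):
        """Two writes are concurrent (siblings) if neither descends the other."""
        return not descends(a, b) and not descends(b, a)

    v1 = {"r1": 2, "r2": 1}   # seen twice by replica r1, once by r2
    v2 = {"r1": 1, "r2": 1}   # an earlier write
    v3 = {"r1": 1, "r3": 1}   # a write coordinated by a different replica

    print(descends(v1, v2))   # True: v1 causally follows v2, v2 can be dropped
    print(concurrent(v1, v3)) # True: neither saw the other, keep both as siblings
    ```

    An available store cannot pick a winner between concurrent values without
    losing data, which is why both are kept until a later write subsumes them.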


  6. Replication in Riak; track the causality
    (Diagram: replicas R1–R3 with clients C1 and C2; version V1.0 is
    updated concurrently to V1.1 and V2.0, and both are kept as
    siblings [V1.1, V2.0])


  7. Real Issues
    • Deletion, or KV679
    • Sibling Explosion (Not VC explosion)
    • Read-before-write performance to track
    causality
    • Application semantics are unclear


  8. Resolutions
    • Deletion, or KV679 -> Epoch, (GVV)
    • Sibling Explosion -> DVV
    • Read-before-write performance to track
    causality -> Write Once bucket
    • Application semantics are unclear ->
    CRDTs


  9. Sibling Explosion
    • All data from concurrent PUTs must be kept as siblings
    • Caused by hot keys, slow clients, network partitions, failures,
    bugs, retries, etc.
    • Happens even with just two clients!
    • Better than losing data, but it screws up the VM, GC, backend,
    etc.


  10. Sibling Explosion Example
    (Diagram: replicas R1–R3, clients C1 and C2; successive concurrent
    puts of B, C, D, E, F never overwrite one another, so the sibling
    set only grows: [B,C] → [B,C,D] → [B,C,D,E] → [B,C,D,E,F])


  11. Causality graph
    with vanilla VV
    (Diagram: every put of B, C, D, E, F is concurrent with the existing
    siblings, so the set grows monotonically:
    [B,C] → [B,C,D] → [B,C,D,E] → [B,C,D,E,F])


  12. Dotted Version Vectors
    • Mark each sibling with a “dot” that indicates which actor
    added that sibling.
    • B can be overwritten by D, D by F, and C by E


  13. With DVV
    (Diagram: same workload as slide 10, but each put's dot lets the
    replica discard the sibling it supersedes:
    [B,C] → [C,D] → [D,E] → [E,F])


  14. Causality graph with
    DVV
    (Diagram: dots make B → D → F and C → E causal chains, so the sibling
    set stays at two values: [B,C] → [C,D] → [D,E] → [E,F])


  15. Microbenchmark
    (Two plots: number of siblings vs. duration [sec] on 4 nodes at
    concurrency 8, 16, and 32, one with DVV=false and one with DVV=true)
    With sibling_benchmark.erl: http://bit.ly/1TPcwwc


  16. Riak CS
    • A highly available dumb-file service that exposes the
    Amazon S3 API
    • Breaks large objects into 1MB chunks and stores them as
    write-once registers under Riak keys
    • Chunks are write-once registers, and each manifest
    points to a single version identified by a UUID
    • ~10^3 TB or more


  17. (image-only slide)

  18. Real Issues in Riak CS
    •Real issues stemming from building on an AP database
    •Read I/O traffic caused by…
    •AAE tree builds
    •Garbage collection and chunk
    deletion that keep track of causality
    •Slow disk-space reclaim


  19. Making Write Once
    Registers on top of Riak
    •Real issues stemming from building on an AP database
    •Not a good design :P
    (State diagram: none →PUT→ data →DELETE→ tombstone →reap→ dead;
    PUT and reap each need a read-before, DELETE needs two; merge
    leads back to none)


  20. Machi
    • Chain replication with write-once registers; keys are
    assigned by the system
    • No implicit overwrite ever happens
    • No need to track causality
    • Every byte has a simple state transition:
    • unwritten => written => trimmed
    • Choose between AP (eventual consistency) and CP (strong
    consistency) modes with a single replication protocol


  21. Machi API (Simplified)
    • append(Prefix, Bytes) => {Filename, Offset}
    • write(Filename, Offset, Bytes) => ok/error
    • read(Filename, Offset, Size) => Bytes
    • trim(Filename, Offset, Size) => ok/error
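    The semantics of the first three calls can be mocked in a few lines.
    This is a hypothetical in-memory sketch to show the write-once rule
    and the system-assigned location; the filename scheme and return
    values are assumptions, not Machi's actual behavior.

    ```python
    # In-memory mock of the simplified Machi API (illustrative only).
    # Every byte is unwritten until exactly one write lands on it.

    class MachiMock:
        def __init__(self):
            self.files = {}   # filename -> {offset: byte value}
            self.seq = 0

        def append(self, prefix, data):
            """The system, not the client, assigns the location."""
            self.seq += 1
            fname = f"{prefix}.{self.seq}"   # naming scheme is an assumption
            self.files[fname] = {}
            self.write(fname, 0, data)
            return fname, 0

        def write(self, fname, offset, data):
            f = self.files.setdefault(fname, {})
            if any((offset + i) in f for i in range(len(data))):
                return "error"               # no implicit overwrite, ever
            for i, b in enumerate(data):
                f[offset + i] = b
            return "ok"

        def read(self, fname, offset, size):
            return bytes(self.files[fname][offset + i] for i in range(size))

    m = MachiMock()
    fname, off = m.append("chunk", b"hello")
    print(m.read(fname, off, 5))        # b'hello'
    print(m.write(fname, off, b"bye"))  # 'error' -> those bytes exist already
    ```

    Because a written byte can never change, replicas can repair each other
    by simple copying, with no version vectors to reconcile.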


  22. Chain Replication in Machi
    (Diagram: chain R1 → R2 → R3; client C1's append(x) enters at the
    head and flows down the chain; client C2 issues read(x) and trim(x))
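    The write path in the diagram can be sketched as a toy chain. This is
    an illustrative sketch of the classic chain replication flow (writes
    applied at the head and forwarded down the chain), not Machi's code.

    ```python
    # Toy chain replication: a write is applied locally, then forwarded to
    # the successor; once it reaches the tail it is fully replicated.

    class Replica:
        def __init__(self, name):
            self.name = name
            self.store = {}
            self.next = None   # successor in the chain, None for the tail

        def write(self, key, value):
            self.store[key] = value        # apply locally...
            if self.next:
                self.next.write(key, value)  # ...then push down the chain

    r1, r2, r3 = Replica("R1"), Replica("R2"), Replica("R3")
    r1.next, r2.next = r2, r3              # chain: R1 (head) -> R2 -> R3 (tail)

    r1.write("x", b"data")                 # clients append at the head
    print(r3.store["x"])                   # the tail has it: b'data'
    ```

    Serving reads from the node that applied the write last is what makes
    the chain's strong mode simple: anything the tail returns is already on
    every replica.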


  23. Making Write Once
    Registers on top of Machi
    • No need to read-before-write
    • Need to remember the assigned filename
    (State diagram: unwritten →append→ written →trim→ trimmed)
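    The per-byte lifecycle above is small enough to write out as a state
    table (an illustrative sketch; the transition names follow the slide):

    ```python
    # unwritten => written => trimmed, with no way back: any other
    # operation is illegal, which is why no read-before-write is needed.

    TRANSITIONS = {
        ("unwritten", "append"): "written",
        ("written", "trim"): "trimmed",
    }

    def step(state, op):
        try:
            return TRANSITIONS[(state, op)]
        except KeyError:
            raise ValueError(f"illegal transition: {op} in state {state}")

    s = "unwritten"
    s = step(s, "append")   # -> written
    s = step(s, "trim")     # -> trimmed
    print(s)                # trimmed
    # step(s, "append") would raise: trimmed bytes can never be rewritten
    ```

    Contrast this with the four-state diagram on the Riak side, where each
    transition needed one or two reads just to establish causality.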


  24. AP and CP mode
    • In case of network partition…
    • AP mode
    • Split the chain and manipulate
    heads
    • Assign different names from the
    same prefix
    • CP mode
    • Replicate only while the live nodes in the
    chain are still a majority
    (Diagram: chain [H, M1, M2, T] splitting into [H, M1, T], [M2]
    and into [H, M1], [M2, T])


  25. Questions?


  26. Important Resources
    • [1] Riak source code: https://github.com/basho/riak
    • [2] Machi source code: https://github.com/basho/machi
    • [3] CORFU: A Distributed Shared Log for Flash Clusters
    • http://research.microsoft.com/apps/pubs/default.aspx?id=157204
    • [4] A Brief History of Time in Riak
    • https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak
    • https://www.youtube.com/watch?v=3SWSw3mKApM
    • [5] Managing Chain Replication with Humming Consensus
    • http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf
    • https://www.youtube.com/watch?v=yR5kHL1bu1Q
    • [6] Version Vectors are not Vector Clocks
    • [7] The Network is Reliable
