
BDI SIG tech talk

UENISHI Kota
December 18, 2015



  1. Replication in Basho Products, and the latest one: Machi

    2015/12/18, BDI SIG
    Kota UENISHI, Basho Japan KK
  2. Replication: Never Lose Data

    • Hold multiple copies to tolerate media failure
    • If a copy is lost, recover it from the remaining copies
    • Keep all copies up to date and prevent implicit overwrites
  3. History of Replication …as of Basho

    • Riak 0.x: Quorum-based replication with vector clocks based on client IDs
    • Riak 1.x: Quorum-based replication with vector clocks based on vnode IDs
    • Riak 2.0: CRDTs and DVVs
    • Riak 2.x: MultiPaxos
    • Machi: Chain Replication
  4. “High” Availability

    • One implication of the CAP theorem:
      • No database can keep both consistency (of atomic objects) and availability under network partition
    • One implication of the FLP impossibility result:
      • No consensus; atomic objects (nodes) cannot be kept consistent under failure of a majority and delay
    • The network is not reliable [7], and no database tried to tolerate these failures before Dynamo and Riak
  5. Causal Consistency

    • Causally related writes must be seen in the same order, while concurrent writes may be seen in different orders by different observers
    • An alternative guarantee for an available system that tolerates any kind of network partition
    • Works under network partition, failures, delay, whatever, because it does not need coordination (vs. Strong Consistency)
    • Multiple values are allowed, but tracking the causality of those values matters
  6. Replication in Riak; track the causality

    [Diagram: replicas R1, R2, R3 and clients C1, C2; versions V1.0, V1.1 and V2.0, ending with the sibling set [V1.1, V2.0]]
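As a rough illustration of what tracking causality means here, the following Python sketch compares two version vectors and reports whether one write happened before the other or whether the two are concurrent (in which case both values have to be kept as siblings). The function names and the example clocks are hypothetical, chosen only for illustration; this is not Riak's actual implementation.

```python
from typing import Dict

# A version vector maps an actor id (e.g. a vnode) to a counter.
VClock = Dict[str, int]


def descends(a: VClock, b: VClock) -> bool:
    """True if `a` has seen every event recorded in `b`."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: VClock, b: VClock) -> str:
    """Classify the causal relation of `a` with respect to `b`."""
    a_ge_b = descends(a, b)
    b_ge_a = descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"
    if b_ge_a:
        return "before"
    return "concurrent"  # neither has seen the other: keep both as siblings


def merge(a: VClock, b: VClock) -> VClock:
    """Pairwise maximum: the clock that descends both inputs."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


# Hypothetical clocks loosely following the labels on the slide.
v1_0 = {"R1": 1}
v1_1 = {"R1": 2}            # written after having seen V1.0
v2_0 = {"R1": 1, "R2": 1}   # also written after V1.0, but on another replica
```

With these values, `compare(v1_1, v1_0)` classifies V1.1 as a successor of V1.0, while V1.1 and V2.0 come out as concurrent, which is why both survive as siblings.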
  7. Real Issues

    • Deletion, or KV679
    • Sibling Explosion (not VC explosion)
    • Read-before-write performance to track causality
    • Application semantics are unclear
  8. Resolutions

    • Deletion, or KV679 -> Epoch, (GVV)
    • Sibling Explosion -> DVV
    • Read-before-write performance to track causality -> Write Once bucket
    • Application semantics are unclear -> CRDTs
  9. Sibling Explosion

    • All data from concurrent PUTs must be kept as siblings
    • Caused by hot keys, slow clients, network partition, failure, bugs, retries, etc.
    • Even just by two clients!
    • Better than losing data, but screws up the VM, GC, backend, etc.
  10. Sibling Explosion Example

    [Diagram: replicas R1, R2, R3 and clients C1, C2; sibling set after each put]
    • put(B) -> B
    • put(C) -> [B,C]
    • put(D) -> [B,C,D]
    • put(E) -> [B,C,D,E]
    • put(F) -> [B,C,D,E,F]
  11. Causality graph with vanilla VV

    [Figure: causality graph over the values B, C, D, E, F and the sibling sets [B,C], [B,C,D], [B,C,D,E], [B,C,D,E,F]]
  12. Dotted Version Vectors

    • Mark each sibling with a “Dot” that indicates which actor added that sibling
    • B can be overwritten by D, D by F, and C by E
  13. With DVV

    [Diagram: replicas R1, R2, R3 and clients C1, C2; sibling set after each put]
    • put(B) -> B
    • put(C) -> [B,C]
    • put(D) -> [C,D]
    • put(E) -> [D,E]
    • put(F) -> [E,F]
  14. Causality graph with DVV

    [Figure: causality graph over the values B, C, D, E, F and the sibling sets [B,C], [C,D], [D,E], [E,F]]
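The two put sequences in the sibling explosion example and the DVV example can be replayed with a small Python model. This is a deliberately simplified sketch, not Riak's code: it assumes a single coordinating vnode (called `R1` here), assumes that B, D and F come from one client and C and E from the other (inferred from "B can be overwritten by D, D by F, and C by E"), and assumes each put returns the object's clock as the client's next context. All class and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Clock = Dict[str, int]
ACTOR = "R1"  # assumption: one vnode coordinates every put


def descends(a: Clock, b: Clock) -> bool:
    return all(a.get(k, 0) >= n for k, n in b.items())


def merge(a: Clock, b: Clock) -> Clock:
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


class PlainVVObject:
    """One version vector for the whole sibling set (vanilla VV)."""

    def __init__(self) -> None:
        self.clock: Clock = {}
        self.siblings: List[str] = []

    def put(self, value: str, ctx: Clock) -> Clock:
        if descends(ctx, self.clock):
            self.siblings = [value]      # client saw everything: overwrite
        else:
            self.siblings.append(value)  # stale context: cannot tell what it replaces
        self.clock = merge(self.clock, ctx)
        self.clock[ACTOR] = self.clock.get(ACTOR, 0) + 1
        return dict(self.clock)

    def values(self) -> List[str]:
        return list(self.siblings)


class DottedObject:
    """Each sibling carries a dot (actor, counter) naming the event that added it."""

    def __init__(self) -> None:
        self.clock: Clock = {}
        self.siblings: List[Tuple[Tuple[str, int], str]] = []

    def put(self, value: str, ctx: Clock) -> Clock:
        # Drop exactly the siblings whose dot is covered by the client's context.
        survivors = [(dot, v) for dot, v in self.siblings
                     if dot[1] > ctx.get(dot[0], 0)]
        self.clock = merge(self.clock, ctx)
        self.clock[ACTOR] = self.clock.get(ACTOR, 0) + 1
        self.siblings = survivors + [((ACTOR, self.clock[ACTOR]), value)]
        return dict(self.clock)

    def values(self) -> List[str]:
        return [v for _, v in self.siblings]


def replay(obj) -> List[List[str]]:
    """Two clients alternate puts, each reusing only its own last context."""
    history = []
    ctx1 = obj.put("B", {}); history.append(obj.values())
    ctx2 = obj.put("C", {}); history.append(obj.values())
    ctx1 = obj.put("D", ctx1); history.append(obj.values())
    ctx2 = obj.put("E", ctx2); history.append(obj.values())
    ctx1 = obj.put("F", ctx1); history.append(obj.values())
    return history
```

Calling `replay(PlainVVObject())` grows the sibling set by one value per put, while `replay(DottedObject())` keeps it at two values, matching the sequences on the slides.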
  15. Microbenchmark

    [Figure: two plots with axes "Number of Siblings" and "Duration [sec]" on 4 nodes, one for DVV=false and one for DVV=true, each with series for concurrency=8, 16 and 32]
    With sibling_benchmark.erl: http://bit.ly/1TPcwwc
  16. Riak CS

    • A highly available dumb-file service that exposes the Amazon S3 API
    • Breaks large objects into 1MB chunks and stores them as write-once registers under Riak keys
    • Chunks are write-once registers, and each manifest points to a single version described by a UUID
    • ~10^3 TB or more
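The chunking scheme can be sketched in a few lines of Python. This is only an illustration of the idea (1MB chunks keyed by a per-version UUID, plus a manifest pointing at that version); the key layout, field names and helper functions are hypothetical and do not reflect Riak CS's actual data model.

```python
import uuid
from typing import Dict, List, Tuple

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as on the slide

ChunkKey = Tuple[str, int]  # hypothetical key: (version UUID, chunk index)


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Cut an object into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_object(bucket: str, key: str, data: bytes):
    """Return (manifest, chunks); each chunk key is written exactly once."""
    version = uuid.uuid4().hex  # a fresh UUID per upload, so chunks never collide
    pieces = split_into_chunks(data)
    chunks: Dict[ChunkKey, bytes] = {(version, i): p for i, p in enumerate(pieces)}
    manifest = {
        "bucket": bucket,
        "key": key,
        "uuid": version,          # the single version this manifest points to
        "size": len(data),
        "chunk_count": len(pieces),
    }
    return manifest, chunks


def load_object(manifest: dict, chunks: Dict[ChunkKey, bytes]) -> bytes:
    """Reassemble the object by following the manifest's UUID."""
    return b"".join(chunks[(manifest["uuid"], i)]
                    for i in range(manifest["chunk_count"]))
```

Because every upload gets its own UUID, an overwrite of the same bucket/key produces new chunk keys instead of mutating existing ones, which is what lets the chunks behave as write-once registers.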
  17. None
  18. Real Issues in Riak CS

    • Real issues based on AP database
    • Read IO traffic by…
      • AAE tree build
      • Garbage Collection and chunk deletion keeping track of causality
    • Slow disk space reclaim
  19. Making Write Once Registers on top of Riak

    • Real issues based on AP database
    • Not a good design :P
    [Diagram: state transitions among none, data, tombstone and dead, with edges labeled PUT, DELETE, reap, merge, read-before and read-before x2]
  20. Machi

    • Chain replication with write-once registers, keys assigned by the system
    • No implicit overwrite happens
    • No need to track causality
    • Every byte has a simple state transition:
      • unwritten => written => trimmed
    • Choose between AP (eventual consistency) and CP (strong consistency) modes with a single replication protocol
  21. Machi API (Simplified)

    • append(Prefix, Bytes) => {Filename, Offset}
    • write(Filename, Offset, Bytes) => ok/error
    • read(Filename, Offset, Size) => Bytes
    • trim(Filename, Offset, Size) => ok/error
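To make the write-once semantics of these four calls concrete, here is an in-memory Python toy model. It is a sketch under stated assumptions, not the real Machi server or client: the class name, the file naming (`prefix + ".data"`), returning `None` from a failed read, and refusing to trim bytes that were never written are all choices made for this example, guided only by the per-byte transition unwritten => written => trimmed.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class ByteState(Enum):
    UNWRITTEN = "unwritten"
    WRITTEN = "written"
    TRIMMED = "trimmed"


class MachiSketch:
    """Toy single-node model of the simplified API (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[str, bytearray] = {}
        self.state: Dict[str, List[ByteState]] = {}

    def append(self, prefix: str, payload: bytes) -> Tuple[str, int]:
        # The system, not the client, picks the filename and offset.
        filename = prefix + ".data"  # hypothetical naming scheme
        offset = len(self.data.get(filename, b""))
        self.write(filename, offset, payload)
        return filename, offset

    def write(self, filename: str, offset: int, payload: bytes) -> str:
        end = offset + len(payload)
        existing = self.state.get(filename, [])[offset:end]
        if any(s is not ByteState.UNWRITTEN for s in existing):
            return "error"  # no implicit overwrite, ever
        buf = self.data.setdefault(filename, bytearray())
        st = self.state.setdefault(filename, [])
        missing = end - len(buf)
        if missing > 0:  # extend the file with unwritten bytes
            buf.extend(bytes(missing))
            st.extend([ByteState.UNWRITTEN] * missing)
        buf[offset:end] = payload
        st[offset:end] = [ByteState.WRITTEN] * len(payload)
        return "ok"

    def read(self, filename: str, offset: int, size: int) -> Optional[bytes]:
        window = self.state.get(filename, [])[offset:offset + size]
        if len(window) < size or any(s is not ByteState.WRITTEN for s in window):
            return None  # unwritten or trimmed bytes cannot be read
        return bytes(self.data[filename][offset:offset + size])

    def trim(self, filename: str, offset: int, size: int) -> str:
        st = self.state.get(filename, [])
        window = st[offset:offset + size]
        if len(window) < size or any(s is not ByteState.WRITTEN for s in window):
            return "error"
        st[offset:offset + size] = [ByteState.TRIMMED] * size
        return "ok"
```

A client would call `append("pfx", b"hello")`, remember the returned filename and offset, and use them for later `read` and `trim` calls; a second `write` to the same range returns `"error"` rather than silently replacing the data.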
  22. Chain Replication in Machi

    [Diagram: chain of replicas R1, R2, R3 and clients C1, C2; operations append(x), read(x) and trim(x)]
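For readers unfamiliar with chain replication, the following Python sketch shows the textbook form of the protocol: a write enters at the head and is applied to each replica in chain order, and reads are served by the tail, so a value that can be read is already on every replica. It assumes a healthy chain and ignores failure handling and repair; the class names are hypothetical and this is not Machi's implementation.

```python
from typing import Dict, List, Optional


class Replica:
    """One node in the chain holding write-once registers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, bytes] = {}

    def write_once(self, key: str, value: bytes) -> bool:
        if key in self.store:
            return False  # already written: refuse to overwrite
        self.store[key] = value
        return True


class Chain:
    """Replicas ordered head first, tail last."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.replicas = replicas
        self.trace: List[str] = []  # order in which replicas applied writes

    def write(self, key: str, value: bytes) -> str:
        for replica in self.replicas:  # propagate head -> ... -> tail
            if not replica.write_once(key, value):
                return "error"
            self.trace.append(replica.name)
        return "ok"  # acknowledged only after the tail has the value

    def read(self, key: str) -> Optional[bytes]:
        # The tail only holds values that every earlier replica also holds.
        return self.replicas[-1].store.get(key)


chain = Chain([Replica("R1"), Replica("R2"), Replica("R3")])
```

Writing a key through `chain.write` and then calling `chain.read` for it returns the value from R3, the tail; a second write to the same key is rejected at the head.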
  23. Making Write Once Registers on top of Machi

    • No need to read-before-write
    • Need to remember the assigned filename
    [Diagram: unwritten -> (append) -> written -> (trim) -> trimmed]
  24. AP and CP mode

    • In case of network partition…
    • AP mode
      • Split the chain and manipulate heads
      • Assign different names from the same prefix
    • CP mode
      • Keep replicating as long as the live nodes in the chain still form a majority
    [Diagram: chain [H, M1, M2, T]; example partitions [H, M1, T], [M2] and [H, M1], [M2, T]]
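The difference between the two modes under partition can be sketched as a simple majority test over the chain membership. The Python below is a hedged illustration using the example chains from the slide; the function names and the string mode flags are hypothetical, and the real chain management in Machi (see [5]) is considerably more involved.

```python
from typing import List, Sequence


def has_majority(live: Sequence[str], chain: Sequence[str]) -> bool:
    """True if the live nodes are a strict majority of the full chain."""
    return 2 * len(live) > len(chain)


def serving_partitions(partitions: Sequence[Sequence[str]],
                       chain: Sequence[str],
                       mode: str) -> List[List[str]]:
    """Which sides of a partition keep accepting writes in the given mode."""
    if mode == "AP":
        # Every non-empty side carries on as its own shorter chain.
        return [list(p) for p in partitions if p]
    if mode == "CP":
        # Only a side that still holds a majority may continue.
        return [list(p) for p in partitions if has_majority(p, chain)]
    raise ValueError("mode must be 'AP' or 'CP'")


full_chain = ["H", "M1", "M2", "T"]
split_3_1 = [["H", "M1", "T"], ["M2"]]
split_2_2 = [["H", "M1"], ["M2", "T"]]
```

Under this model, the 3/1 split leaves `[H, M1, T]` serving in CP mode, while the 2/2 split leaves no side with a majority, so CP mode stops accepting writes and AP mode keeps both halves running.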
  25. Questions?

  26. Important Resources

    • [1] Riak source code: https://github.com/basho/riak
    • [2] Machi source code: https://github.com/basho/machi
    • [3] CORFU: A Distributed Shared Log for Flash Clusters
      • http://research.microsoft.com/apps/pubs/default.aspx?id=157204
    • [4] A Brief History of Time in Riak
      • https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak
      • https://www.youtube.com/watch?v=3SWSw3mKApM
    • [5] Managing Chain Replication with Humming Consensus
      • http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf
      • https://www.youtube.com/watch?v=yR5kHL1bu1Q
    • [6] Version Vectors are not Vector Clocks
    • [7] The Network is Reliable