Slide 1


Replication in Basho Products, and the latest one: Machi
2015/12/18 BDI SIG
Kota UENISHI, Basho Japan KK

Slide 2


Replication: Never Lose Data
• Hold multiple copies to tolerate media failure
• If a copy is lost, recover it from the remaining copies
• Keep all copies up to date, preventing implicit overwrites

Slide 3


History of Replication …as of Basho
• Riak 0.x: quorum-based replication with vector clocks based on client IDs
• Riak 1.x: quorum-based replication with vector clocks based on vnode IDs
• Riak 2.0: CRDTs and DVVs
• Riak 2.x: Multi-Paxos
• Machi: chain replication

Slide 4


“High” Availability
• One implication of the CAP theorem:
• No database can keep both consistency (of atomic objects) and availability under network partition
• One implication of the FLP impossibility result:
• No consensus: atomic objects (nodes) cannot be kept consistent under delay and failure of a majority
• The network is not reliable [7], and no database before Dynamo and Riak tried to tolerate these failures

Slide 5


Causal Consistency
• Causally related writes must be seen in the same order, while concurrent writes may be seen in different orders by different observers
• An alternative guarantee for an available system that tolerates any kind of network partition
• Works under network partition, failures, delay, whatever, because it needs no coordination (vs. strong consistency)
• Allows multiple values, but tracking the causality of those values matters
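The happens-before check behind this guarantee can be sketched with plain version vectors. A minimal sketch for illustration (an assumption, not Riak's implementation):

```python
# Compare two version vectors (maps of actor -> counter) to decide
# whether writes are causally related or concurrent. Concurrent values
# must both be kept, which is where siblings come from.

def descends(a, b):
    """True if clock `a` has seen every event recorded in clock `b`."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a after b"    # causally related: b happened before a
    if descends(b, a):
        return "b after a"
    return "concurrent"       # neither dominates: keep both as siblings

print(compare({"r1": 2, "r2": 1}, {"r1": 1}))  # a after b
print(compare({"r1": 1}, {"r2": 1}))           # concurrent
```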

Slide 6


Replication in Riak: track the causality
(diagram: replicas R1–R3 with clients C1 and C2; V1.0 and V1.1 are causally ordered, while a concurrent V2.0 ends up in the sibling set [V1.1, V2.0])

Slide 7


Real Issues
• Deletion, or KV679
• Sibling explosion (not VC explosion)
• Read-before-write performance to track causality
• Application semantics are unclear

Slide 8


Resolutions
• Deletion, or KV679 -> epoch, (GVV)
• Sibling explosion -> DVV
• Read-before-write performance to track causality -> write-once buckets
• Application semantics are unclear -> CRDTs

Slide 9


Sibling Explosion
• All data from concurrent PUTs must be kept as siblings
• Caused by hot keys, slow clients, network partition, failure, bugs, retries, etc.
• Even with just two clients!
• Better than losing data, but it screws up the VM, GC, backend, etc.

Slide 10


Sibling Explosion Example
(diagram: clients C1 and C2 alternate put(B) … put(F) against replicas R1–R3; the sibling set grows with every write: B → [B,C] → [B,C,D] → [B,C,D,E] → [B,C,D,E,F])

Slide 11


Causality graph with vanilla VV
(diagram: B → [B,C] → [B,C,D] → [B,C,D,E] → [B,C,D,E,F]; each PUT of C, D, E, and F adds a sibling that is never pruned)
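The growth above can be reproduced with a toy model. This is an assumed simplification, not Riak code: each interleaved PUT carries a stale context, so the server sees it as concurrent with the whole stored sibling set and must keep everything.

```python
# Vanilla version vectors: the server knows the incoming PUT is
# concurrent with the stored value, but not *which* sibling it was
# meant to replace, so every old sibling survives.

def put_vanilla(siblings, value):
    return siblings + [value]   # keep all old siblings, append new one

siblings = []
for v in ["B", "C", "D", "E", "F"]:
    siblings = put_vanilla(siblings, v)
print(siblings)  # ['B', 'C', 'D', 'E', 'F'] -- one sibling per PUT
```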

Slide 12


Dotted Version Vectors
• Mark each sibling with a “dot” that indicates which actor added that sibling
• B can be overwritten by D, D by F, and C by E

Slide 13


With DVV
(diagram: the same interleaved put(B) … put(F) by C1 and C2 against R1–R3, but the sibling set stays bounded: B → [B,C] → [C,D] → [D,E] → [E,F])

Slide 14


Causality graph with DVV
(diagram: B → [B,C] → [C,D] → [D,E] → [E,F]; dots let each PUT supersede the sibling its client had already seen)
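The pruning above can be sketched as a toy model (an assumed simplification of DVV, not Riak's implementation): each sibling carries a dot, and a PUT's context names the dots its client has already seen, so exactly those siblings are replaced.

```python
# Toy DVV: a "dot" is (actor, counter). A PUT removes only the siblings
# whose dots the writing client had seen, instead of keeping all of them.

def put_dvv(siblings, actor, counter, value, seen_dots):
    survivors = [s for s in siblings if s["dot"] not in seen_dots]
    return survivors + [{"dot": (actor, counter), "value": value}]

siblings = []
# C1 and C2 alternate; each PUT's context covers that client's last write.
siblings = put_dvv(siblings, "c1", 1, "B", set())            # [B]
siblings = put_dvv(siblings, "c2", 1, "C", set())            # [B, C]
siblings = put_dvv(siblings, "c1", 2, "D", {("c1", 1)})      # [C, D]
siblings = put_dvv(siblings, "c2", 2, "E", {("c2", 1)})      # [D, E]
siblings = put_dvv(siblings, "c1", 3, "F", {("c1", 2)})      # [E, F]
print([s["value"] for s in siblings])  # ['E', 'F']
```

The sibling count stays bounded by the number of concurrently writing actors, which matches the shrinking sets on the previous slides.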

Slide 15


Microbenchmark
(two plots: number of siblings vs. duration [sec] on 4 nodes at concurrency 8, 16, and 32, one with DVV=false and one with DVV=true)
With sibling_benchmark.erl: http://bit.ly/1TPcwwc

Slide 16


Riak CS
• A highly available dumb-file service that exposes the Amazon S3 API
• Breaks large objects into 1MB chunks and stores them as write-once registers under Riak keys
• Chunks are write-once registers, and each manifest points to a single version identified by a UUID
• Scales to ~10^3 TB or more
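The chunking scheme can be sketched roughly as follows; `store`, `fetch`, and the key layout here are illustrative assumptions, not Riak CS's actual schema.

```python
import uuid

CHUNK = 1024 * 1024  # 1 MB

def store(name, data, backend):
    version = uuid.uuid4().hex               # the manifest pins this version
    blocks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    for n, block in enumerate(blocks):
        backend[(version, n)] = block        # write-once chunk keys
    return {"name": name, "uuid": version, "blocks": len(blocks)}

def fetch(manifest, backend):
    return b"".join(backend[(manifest["uuid"], n)]
                    for n in range(manifest["blocks"]))

backend = {}
payload = b"x" * (2 * CHUNK + 10)
m = store("photo.jpg", payload, backend)
print(m["blocks"])                   # 3
print(fetch(m, backend) == payload)  # True
```

Because chunk keys embed a fresh UUID, a new upload of the same object never overwrites old chunks; it only swings the manifest to a new version.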

Slide 17


No content

Slide 18


Real Issues in Riak CS
• Real issues that stem from building on an AP database
• Read I/O traffic caused by…
• AAE tree builds
• Garbage collection and chunk deletion that keep track of causality
• Slow disk-space reclamation

Slide 19


Making Write Once Registers on top of Riak
• Real issues that stem from building on an AP database
• Not a good design :P
(state diagram: none → data via PUT, data → tombstone via DELETE, tombstone → dead via reap; PUT costs a read-before ×2, DELETE a read-before plus merge)
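Why emulating write-once on a causally tracked KV store hurts can be shown in a few lines. This is an assumed toy model, not Riak CS code: every write must be preceded by a read to fetch the existing value (and, in the real system, its causal context).

```python
store = {}  # key -> value; a real store would also return a causal context

def put_once(key, value):
    existing = store.get(key)   # read-before-write: an extra round trip
    if existing is not None:
        return False            # already written: reject the overwrite
    store[key] = value
    return True

print(put_once("chunk-1", b"data"))   # True
print(put_once("chunk-1", b"other"))  # False: write-once is enforced
```

The register semantics come for free only by paying a read on every write path, which is exactly the overhead the slide complains about.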

Slide 20


Machi
• Chain replication over write-once registers; keys are assigned by the system
• No implicit overwrite ever happens
• No need to track causality
• Every byte has a simple state transition:
• unwritten => written => trimmed
• Choose between AP (eventual consistency) and CP (strong consistency) modes with a single replication protocol

Slide 21


Machi API (Simplified)
• append(Prefix, Bytes) => {Filename, Offset}
• write(Filename, Offset, Bytes) => ok/error
• read(Filename, Offset, Size) => Bytes
• trim(Filename, Offset, Size) => ok/error
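An in-memory sketch of this API with per-byte write-once semantics (a toy under stated assumptions, not Machi's implementation; `MachiToy` and its internals are invented for illustration):

```python
class MachiToy:
    def __init__(self):
        self.data = {}        # (filename, offset) -> byte value
        self.trimmed = set()  # cells that reached the trimmed state
        self.sizes = {}       # filename -> next free offset

    def append(self, prefix, payload):
        filename = prefix + ".1"             # the system assigns the name
        offset = self.sizes.get(filename, 0)
        for i, b in enumerate(payload):
            self.data[(filename, offset + i)] = b
        self.sizes[filename] = offset + len(payload)
        return filename, offset

    def write(self, filename, offset, payload):
        cells = [(filename, offset + i) for i in range(len(payload))]
        if any(c in self.data or c in self.trimmed for c in cells):
            return "error"                   # write-once: never overwrite
        for c, b in zip(cells, payload):
            self.data[c] = b
        return "ok"

    def read(self, filename, offset, size):
        return bytes(self.data[(filename, offset + i)] for i in range(size))

    def trim(self, filename, offset, size):
        for i in range(size):                # written -> trimmed, one way
            self.data.pop((filename, offset + i), None)
            self.trimmed.add((filename, offset + i))
        return "ok"

m = MachiToy()
f, off = m.append("pre", b"abc")
print(m.read(f, off, 3))        # b'abc'
print(m.write(f, off, b"xyz"))  # error -- those bytes are already written
```

Because a cell can never move back from written or trimmed, no write ever conflicts with another, so there is no causality to track.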

Slide 22


Chain Replication in Machi
(diagram: replicas R1–R3 in a chain; client C1 issues append(x), client C2 issues read(x) and trim(x))
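The data path in the diagram follows textbook chain replication; a minimal sketch (an assumption for illustration, not Machi's code):

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}

def chain_write(chain, key, value):
    for node in chain:                 # head first, then down the chain
        node.store[key] = value        # ack only after the tail has it
    return "ok"

def chain_read(chain, key):
    return chain[-1].store.get(key)    # the tail holds only committed writes

chain = [Node("R1"), Node("R2"), Node("R3")]
chain_write(chain, "x", b"v1")
print(chain_read(chain, "x"))  # b'v1'
```

Serving reads from the tail is what makes the read path strongly consistent: a value is visible only once it has reached every replica in the chain.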

Slide 23


Making Write Once Registers on top of Machi
• No need to read-before-write
• Need to remember the assigned filename
(state diagram: unwritten → written via append, written → trimmed via trim)

Slide 24


AP and CP mode
• In case of network partition…
• AP mode
• Split the chain so each side operates with its own head
• Assign different names from the same prefix
• CP mode
• Keep replicating only while the live nodes in the chain still form a majority
(chain examples from the slide: [H, M1, M2, T]; [H, M1, T] with [M2] isolated; [H, M1] and [M2, T] split)

Slide 25


Questions?

Slide 26


Important Resources
• [1] Riak source code: https://github.com/basho/riak
• [2] Machi source code: https://github.com/basho/machi
• [3] CORFU: A Distributed Shared Log for Flash Clusters
• http://research.microsoft.com/apps/pubs/default.aspx?id=157204
• [4] A Brief History of Time in Riak
• https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak
• https://www.youtube.com/watch?v=3SWSw3mKApM
• [5] Managing Chain Replication with Humming Consensus
• http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf
• https://www.youtube.com/watch?v=yR5kHL1bu1Q
• [6] Version Vectors are not Vector Clocks
• [7] The Network is Reliable