BDI SIG tech talk

Replication in Basho Products, and the latest one: Machi
2015/12/18, BDI SIG
Kota UENISHI, Basho Japan KK

Replication: Never Lose Data
• Hold multiple copies to tolerate media failure
• If a copy is lost, recover it from the remaining copies
• Keep all of them up to date, preventing implicit overwrites

History of Replication … as of Basho
• Riak 0.x: Quorum-based replication with vector clocks based on client IDs
• Riak 1.x: Quorum-based replication with vector clocks based on vnode IDs
• Riak 2.0: CRDTs and DVVs
• Riak 2.x: MultiPaxos
• Machi: Chain Replication

“High” Availability
• One implication of the CAP theorem:
  • No database can keep both consistency (of atomic objects) and availability under a network partition
• One implication of the FLP impossibility result:
  • No consensus: atomic objects (nodes) cannot be kept consistent under failure of a majority and message delay
• The network is not reliable [7], and no database tried to tolerate these failures before Dynamo and Riak

Causal Consistency
• Causally related writes must be seen in the same order by all observers, while concurrent writes may be seen in different orders by different observers
• An alternative guarantee for available systems that tolerate any kind of network partition
• Works under network partitions, failures, delays, whatever, because it needs no coordination (unlike strong consistency)
• Multiple values are allowed, but tracking the causality of those values matters
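Tracking causality between values is commonly done by comparing version vectors. The following is a minimal sketch of that comparison, not Riak's implementation; the function names and the dict-of-counters representation are assumptions made for illustration.

```python
def descends(a, b):
    """True if clock `a` has seen every event that clock `b` has seen."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a, b):
    """Classify the causal relation of clock `a` relative to clock `b`."""
    a_ge_b, b_ge_a = descends(a, b), descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "after"       # a causally follows b, so a may overwrite b
    if b_ge_a:
        return "before"      # a is stale with respect to b
    return "concurrent"      # neither saw the other: keep both as siblings


def merge(a, b):
    """Per-actor maximum: the clock that has seen everything in a and b."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0))
            for actor in a.keys() | b.keys()}


print(compare({"r1": 2}, {"r1": 1}))            # causally related
print(compare({"r1": 1, "r2": 1}, {"r1": 2}))   # concurrent writes
```

Only the "concurrent" case forces the store to keep more than one value, which is why no coordination between replicas is needed at write time.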

Replication in Riak: track the causality
[Figure: clients C1 and C2 writing to replicas R1, R2, R3; versions V1.0, V1.1 and V2.0, ending with the sibling set [V1.1, V2.0]]

Real Issues
• Deletion, or KV679
• Sibling explosion (not VC explosion)
• Read-before-write performance cost to track causality
• Application semantics are unclear

Resolutions
• Deletion, or KV679 -> Epoch, (GVV)
• Sibling explosion -> DVV
• Read-before-write performance cost to track causality -> Write Once bucket
• Application semantics are unclear -> CRDTs

Sibling Explosion
• All data from concurrent PUTs must be kept as siblings
• Caused by hot keys, slow clients, network partitions, failures, bugs, retries, etc.
• Can happen with just two clients!
• Better than losing data, but it screws up the VM, GC, backend, etc.

Sibling Explosion Example
[Figure: clients C1 and C2 issue PUTs against replicas R1, R2, R3. Sibling set after each PUT: put(B) -> B; put(C) -> [B,C]; put(D) -> [B,C,D]; put(E) -> [B,C,D,E]; put(F) -> [B,C,D,E,F]]

Causality graph with vanilla VV
[Figure: causality graph over values B, C, D, E, F with sibling sets [B,C], [B,C,D], [B,C,D,E], [B,C,D,E,F]]

Dotted Version Vectors
• Mark each sibling with a “dot” that indicates which actor added that sibling
• B can be overwritten by D, D by F, and C by E

With DVV
[Figure: clients C1 and C2 issue PUTs against replicas R1, R2, R3. Sibling set after each PUT: put(B) -> B; put(C) -> [B,C]; put(D) -> [C,D]; put(E) -> [D,E]; put(F) -> [E,F]]

Causality graph with DVV
[Figure: causality graph over values B, C, D, E, F with sibling sets [B,C], [C,D], [D,E], [E,F]]
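The difference between the two PUT sequences can be reproduced with a small simulation. This is a sketch under strong assumptions, not Riak's code: a single coordinating actor "a", two clients that each reuse the context returned by their own previous PUT, and hypothetical class names VanillaStore and DottedStore.

```python
def descends(a, b):
    """True if clock `a` has seen every event that clock `b` has seen."""
    return all(a.get(actor, 0) >= n for actor, n in b.items())


class VanillaStore:
    """Plain version vector: one clock for the whole sibling set."""

    def __init__(self, actor="a"):
        self.actor, self.clock, self.siblings = actor, {}, []

    def put(self, value, context):
        if descends(context, self.clock):
            self.siblings = [value]          # client saw everything: overwrite
        else:
            self.siblings.append(value)      # stale context: cannot tell what to drop
        self.clock[self.actor] = self.clock.get(self.actor, 0) + 1
        return dict(self.clock)              # context handed back to the client


class DottedStore:
    """DVV-style: each sibling carries the dot (actor, counter) that added it."""

    def __init__(self, actor="a"):
        self.actor, self.clock, self.siblings = actor, {}, []

    def put(self, value, context):
        counter = self.clock.get(self.actor, 0) + 1
        self.clock[self.actor] = counter
        # Drop only the siblings whose dot is covered by the client's context.
        self.siblings = [(dot, v) for dot, v in self.siblings
                         if context.get(dot[0], 0) < dot[1]]
        self.siblings.append(((self.actor, counter), value))
        return dict(self.clock)

    def values(self):
        return [v for _, v in self.siblings]


def run(store):
    ctx = {"C1": {}, "C2": {}}
    for client, value in [("C1", "B"), ("C2", "C"), ("C1", "D"),
                          ("C2", "E"), ("C1", "F")]:
        ctx[client] = store.put(value, ctx[client])
    return store


print(run(VanillaStore()).siblings)   # grows with every PUT
print(run(DottedStore()).values())    # stays at two siblings
```

With the dots, D replaces B, E replaces C and F replaces D, so the sibling set stays bounded by the number of concurrent writers.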

Microbenchmark
[Figure: two plots, “Number of siblings in DVV=false, 4 nodes” and “Number of siblings in DVV=true, 4 nodes”; axes: Number of Siblings and Duration [sec]; series: concurrency=8, concurrency=16, concurrency=32]
With sibling_benchmark.erl: http://bit.ly/1TPcwwc

Riak CS
• A highly available dumb-file service that exposes the Amazon S3 API
• Breaks large objects into 1 MB chunks and stores them as write-once registers under Riak keys
• Chunks are write-once registers, and each manifest points to a single version identified by a UUID
• ~10^3 TB or more
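The chunking scheme can be illustrated with a few lines of Python. This is only a sketch of the idea: the key layout (UUID, sequence number), the manifest fields and the function names are hypothetical, and a plain dict stands in for Riak.

```python
import uuid

CHUNK_SIZE = 1024 * 1024  # 1 MB, as on the slide


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_object(kv, name, data):
    """Write every chunk exactly once under a fresh UUID and return a manifest."""
    version = str(uuid.uuid4())
    chunks = split_into_chunks(data)
    for seq, chunk in enumerate(chunks):
        key = (version, seq)              # hypothetical key layout
        if key in kv:
            raise KeyError("write-once register already written: %r" % (key,))
        kv[key] = chunk
    # The manifest points at exactly one version of the object.
    return {"name": name, "uuid": version, "size": len(data), "chunks": len(chunks)}


def read_object(kv, manifest):
    """Reassemble the object from the chunks the manifest points to."""
    return b"".join(kv[(manifest["uuid"], seq)] for seq in range(manifest["chunks"]))


kv = {}
manifest = store_object(kv, "movie.bin", b"x" * (2 * CHUNK_SIZE + 5))
print(manifest["chunks"])
```

Because every upload gets a new UUID, chunk keys are never rewritten; only the manifest changes when an object is replaced.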

Real Issues in Riak CS
• Real issues based on an AP database
• Read IO traffic by…
  • AAE tree build
  • Garbage collection and chunk deletion, keeping track of causality
• Slow disk space reclaim

Making Write Once Registers on top of Riak
• Real issues based on an AP database
• Not a good design :P
[State diagram: states none, data, tombstone, dead; transitions PUT, DELETE, reap, merge; annotations “read-before x2” and “read-before”]

Machi
• Chain replication with write-once registers, keys assigned by the system
• No implicit overwrite happens
• No need to track causality
• Every byte has a simple state transition:
  • unwritten => written => trimmed
• Choose AP (eventual consistency) or CP (strong consistency) mode with a single replication protocol

Machi API (Simplified)
• append(Prefix, Bytes) => {Filename, Offset}
• write(Filename, Offset, Bytes) => ok/error
• read(Filename, Offset, Size) => Bytes
• trim(Filename, Offset, Size) => ok/error
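To make the write-once semantics of these four calls concrete, here is a toy single-node model in Python. It is not Machi's implementation (Machi is written in Erlang and replicates across a chain); the class name ToyMachi, the "<prefix>.0" file naming, the string return values and the rule that only written bytes can be trimmed are assumptions for illustration.

```python
UNWRITTEN, WRITTEN, TRIMMED = "unwritten", "written", "trimmed"


class ToyMachi:
    def __init__(self):
        self.data = {}    # filename -> bytearray
        self.state = {}   # filename -> per-byte state list

    def append(self, prefix, payload):
        """The system, not the client, picks the filename and offset."""
        filename = prefix + ".0"                      # hypothetical naming
        offset = len(self.state.get(filename, []))    # next unused byte
        self.write(filename, offset, payload)
        return filename, offset

    def write(self, filename, offset, payload):
        end = offset + len(payload)
        states = self.state.setdefault(filename, [])
        if any(s != UNWRITTEN for s in states[offset:end]):
            return "error"                            # never overwrite a byte
        buf = self.data.setdefault(filename, bytearray())
        if len(buf) < end:                            # grow with unwritten bytes
            states.extend([UNWRITTEN] * (end - len(buf)))
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = payload
        states[offset:end] = [WRITTEN] * len(payload)
        return "ok"

    def _all_written(self, filename, offset, size):
        span = self.state.get(filename, [])[offset:offset + size]
        return len(span) == size and all(s == WRITTEN for s in span)

    def read(self, filename, offset, size):
        if not self._all_written(filename, offset, size):
            raise ValueError("range is not fully written")
        return bytes(self.data[filename][offset:offset + size])

    def trim(self, filename, offset, size):
        if not self._all_written(filename, offset, size):
            return "error"
        self.state[filename][offset:offset + size] = [TRIMMED] * size
        return "ok"


machi = ToyMachi()
name, off = machi.append("photos", b"hello")
print(machi.read(name, off, 5))
print(machi.write(name, off, b"HELLO"))   # a second write to the same bytes is refused
```

A client only has to remember the {Filename, Offset} pair returned by append; there is nothing to read back before writing.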

Chain Replication in Machi
[Figure: clients C1 and C2 issuing append(x), read(x) and trim(x) against the chain of replicas R1, R2, R3]
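For readers new to the technique, the textbook form of chain replication can be sketched as follows: a write travels from the head to the tail and is acknowledged only once the tail has it, and reads are served by the tail. This is a generic sketch, not Machi's actual protocol; the Chain class and the fail_after parameter (used here to simulate a write that stops partway) are hypothetical.

```python
class Chain:
    def __init__(self, names):
        self.names = list(names)                     # head first, tail last
        self.stores = {name: {} for name in self.names}

    def write(self, key, value, fail_after=None):
        """Apply the write replica by replica, head to tail."""
        for hops, name in enumerate(self.names):
            if fail_after is not None and hops >= fail_after:
                return "incomplete"                  # never reached the tail
            self.stores[name][key] = value
        return "ok"                                  # tail has it: acknowledged

    def read(self, key):
        """Reads go to the tail, so they only see fully replicated writes."""
        return self.stores[self.names[-1]].get(key)


chain = Chain(["R1", "R2", "R3"])
print(chain.write("x", b"data"), chain.read("x"))
print(chain.write("y", b"half", fail_after=1), chain.read("y"))
```

Since the tail is the last replica to receive a write, anything a reader sees there is already stored on every replica before it.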

Making Write Once Registers on top of Machi
• No need to read-before-write
• Need to remember the assigned filename
[State diagram: unwritten -> (append) -> written -> (trim) -> trimmed]

AP and CP mode
• In case of a network partition…
• AP mode
  • Split the chain and manipulate heads
  • Assign different names from the same prefix
• CP mode
  • Keep replicating as long as the live nodes in the chain are still a majority
[Figure: example chains: [H, M1, M2, T]; split into [H, M1, T] and [M2]; split into [H, M1] and [M2, T]]
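The two modes can be contrasted on the example partitions with a short sketch. It assumes that "majority" means strictly more than half of the original chain and that in AP mode every side of the partition keeps serving with its first node as head; both function names are hypothetical and this is not how Machi itself decides.

```python
def cp_can_replicate(side, chain):
    """CP mode: a side keeps accepting writes only if it holds a strict majority."""
    return 2 * len(side) > len(chain)


def ap_heads(sides):
    """AP mode: every non-empty side becomes its own chain with its own head."""
    return [side[0] for side in sides if side]


chain = ["H", "M1", "M2", "T"]

for sides in ([["H", "M1", "T"], ["M2"]], [["H", "M1"], ["M2", "T"]]):
    writable = [side for side in sides if cp_can_replicate(side, chain)]
    print("partition:", sides)
    print("  CP: sides that may keep replicating:", writable)
    print("  AP: heads after the split:", ap_heads(sides))
```

In the even split [H, M1] / [M2, T] neither side has a majority, so under these assumptions CP mode stops accepting writes, while AP mode keeps both halves writable and relies on each half assigning different file names from the same prefix.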

Questions?

Important Resources
• [1] Riak source code: https://github.com/basho/riak
• [2] Machi source code: https://github.com/basho/machi
• [3] CORFU: A Distributed Shared Log for Flash Clusters
  • http://research.microsoft.com/apps/pubs/default.aspx?id=157204
• [4] A Brief History of Time in Riak
  • https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak
  • https://www.youtube.com/watch?v=3SWSw3mKApM
• [5] Managing Chain Replication with Humming Consensus
  • http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf
  • https://www.youtube.com/watch?v=yR5kHL1bu1Q
• [6] Version Vectors are not Vector Clocks
• [7] The Network is Reliable
