Replication: Never Lose Data
• Hold multiple copies to tolerate media failure
• If a copy is lost, recover it from the remaining copies (sketched below)
• Keep all copies up to date, preventing implicit overwrites
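A toy Python sketch of those three bullets, assuming nothing about Riak's actual code: N in-memory replicas, a write that updates all of them, and a read that repairs any replica that lost its copy.

    # Toy replication: keep N copies, read-repair a lost one (illustrative only).
    N = 3
    replicas = [dict() for _ in range(N)]      # N independent "disks"

    def put(key, value):
        for r in replicas:                     # keep all copies up to date
            r[key] = value

    def get(key):
        survivors = [r[key] for r in replicas if key in r]
        if not survivors:
            raise KeyError(key)                # every copy lost: data is gone
        for r in replicas:                     # recover lost copies on read
            r.setdefault(key, survivors[0])
        return survivors[0]

    put("k", b"v1")
    del replicas[1]["k"]                       # simulate media failure of one copy
    assert get("k") == b"v1"                   # still readable; lost copy repaired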
History of Replication, as of Basho
• Riak 0.x: Quorum-based replication with vector clocks based on client IDs
• Riak 1.x: Quorum-based replication with vector clocks based on vnode IDs
• Riak 2.0: CRDTs and DVVs (Dotted Version Vectors)
• Riak 2.x: Multi-Paxos
• Machi: Chain Replication
“High” Availability
• One implication of the CAP theorem:
  • No database can preserve both consistency (of atomic objects) and availability under network partition
• One implication of the FLP impossibility result:
  • No consensus: atomic objects (nodes) cannot be kept consistent under failure of a majority and unbounded delay
• The network is not reliable [7], and no database tried to tolerate these failures before Dynamo and Riak
Causal Consistency
• Causally related writes must be seen in the same order by every observer, while concurrent writes may be seen in different orders by different observers
• An alternative guarantee for an available system that tolerates any kind of network partition
• Works under network partitions, failures, delays, whatever, because it needs no coordination (vs. strong consistency)
• Allows multiple values, but tracking the causality of those values matters (see the sketch below)
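A minimal version-vector sketch of that last bullet, in Python rather than Riak's Erlang; the dict representation and the client IDs are mine, not Riak's wire format. Comparing two vectors tells us whether one write causally descends from the other, or whether the two are concurrent and must both be kept.

    # Version vectors: {actor: counter}. descends(a, b) means b happened
    # before (or equals) a, so a may safely replace b.
    def descends(a: dict, b: dict) -> bool:
        return all(a.get(actor, 0) >= n for actor, n in b.items())

    def concurrent(a: dict, b: dict) -> bool:
        return not descends(a, b) and not descends(b, a)

    a = {"clientA": 1}              # client A writes from an empty ancestor
    b = {"clientB": 1}              # client B does the same, concurrently
    assert concurrent(a, b)         # neither write may overwrite the other

    # After reading both siblings and resolving, the merged vector dominates:
    merged = {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}
    assert descends(merged, a) and descends(merged, b)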
Sibling Explosion
• All values of concurrent PUTs must be kept as siblings
• Caused by hot keys, slow clients, network partitions, failures, bugs, retries, etc.
• Even just two clients can trigger it (simulated below)!
• Better than losing data, but it wrecks the VM, GC, the backend, etc.
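A hypothetical simulation of how just two clients pile up siblings: each keeps PUTting with a stale causal context (e.g. after a timeout and retry), so the store can never discard the other's value and must append a sibling instead. The `Store` class is an illustration, not Riak's API.

    def descends(a, b):               # same helper as the previous sketch
        return all(a.get(k, 0) >= n for k, n in b.items())

    # Every PUT whose context doesn't dominate a stored sibling adds a new one.
    class Store:
        def __init__(self):
            self.siblings = []        # list of (version_vector, value)

        def put(self, ctx: dict, vv: dict, value):
            # drop only the siblings the incoming write causally dominates
            self.siblings = [(v, x) for v, x in self.siblings
                             if not descends(ctx, v)]
            self.siblings.append((vv, value))

    s = Store()
    for i in range(5):
        # both clients write with an empty (stale) context, so no existing
        # sibling is ever dominated and dropped
        s.put({}, {"clientA": i + 1}, f"a{i}")
        s.put({}, {"clientB": i + 1}, f"b{i}")
    print(len(s.siblings))            # 10 siblings and growing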
Riak CS
• A highly available dumb-file service that exposes the Amazon S3 API
• Breaks each large object into 1MB chunks and stores them under Riak keys (chunking sketched below)
• Chunks are write-once registers; each manifest points to a single version, identified by a UUID
• Scales to ~10^3 TB or more
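A sketch of that chunking scheme; the key layout and the plain-dict `kv` store are made up for illustration and are not Riak CS's actual storage format.

    import uuid

    CHUNK_SIZE = 1024 * 1024                       # 1MB chunks

    def store_object(kv: dict, bucket: str, name: str, data: bytes):
        version = uuid.uuid4().hex                 # one UUID per object version
        chunks = []
        for i in range(0, len(data), CHUNK_SIZE):
            key = (bucket, version, i // CHUNK_SIZE)
            kv[key] = data[i:i + CHUNK_SIZE]       # written once, never mutated
            chunks.append(key)
        # the manifest is the only key that changes: it names one version
        kv[(bucket, "manifest", name)] = {"version": version, "chunks": chunks}

    kv = {}
    store_object(kv, "b", "movie.mp4", b"x" * (3 * CHUNK_SIZE + 1))
    print(len(kv[("b", "manifest", "movie.mp4")]["chunks"]))   # 4 chunks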
Real Issues in Riak CS
• Real issues that stem from building on an AP database
• Read I/O traffic caused by:
  • AAE (Active Anti-Entropy) tree builds
  • Garbage collection and chunk deletion, which must keep tracking causality
• Slow disk-space reclaim
Making Write-Once Registers on top of Riak
• Real issues based on an AP database
• Not a good design :P (modeled below)
• [State diagram: none → data (PUT) → tombstone (DELETE) → dead (reap) → none, with read-before steps (x2) and a merge annotated on the transitions]
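A toy model of that lifecycle. The transition table is my reading of the slide's diagram; "read-before" marks operations that must first read the key to obtain its causal context, which is exactly the overhead that makes emulating write-once semantics on a mutable AP store painful.

    # Lifecycle of a "write-once" key emulated on a mutable store.
    TRANSITIONS = {
        ("none", "PUT"):       ("data", "read-before"),   # verify unwritten first
        ("data", "DELETE"):    ("tombstone", "read-before"),
        ("tombstone", "reap"): ("none", "read-before + merge"),
    }
    # (the slide also shows a "dead" state between tombstone and none;
    #  collapsed here for brevity)

    def step(state: str, op: str) -> str:
        if (state, op) not in TRANSITIONS:
            raise ValueError(f"illegal transition: {op} from {state}")
        new_state, cost = TRANSITIONS[(state, op)]
        print(f"{state} --{op}--> {new_state}  (extra work: {cost})")
        return new_state

    s = "none"
    for op in ("PUT", "DELETE", "reap"):
        s = step(s, op)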
Machi
• Chain replication with write-once registers; keys are assigned by the system
• No implicit overwrite ever happens
• No need to track causality
• Every byte has a simple state transition (a register enforcing it is sketched below):
  • unwritten => written => trimmed
• Choose between AP (eventual consistency) and CP (strong consistency) modes with a single replication protocol
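A minimal write-once register enforcing the slide's per-byte states. Illustrative only: Machi's real registers track byte ranges inside append-only files, and this class is not its API.

    class WriteOnceRegister:
        def __init__(self, size: int):
            self.state = ["unwritten"] * size   # per byte: unwritten/written/trimmed
            self.data = bytearray(size)

        def write(self, offset: int, payload: bytes):
            span = range(offset, offset + len(payload))
            if any(self.state[i] != "unwritten" for i in span):
                raise ValueError("refusing implicit overwrite")   # the whole point
            self.data[offset:offset + len(payload)] = payload
            for i in span:
                self.state[i] = "written"

        def trim(self, offset: int, length: int):
            for i in range(offset, offset + length):
                if self.state[i] == "unwritten":
                    raise ValueError("cannot trim unwritten bytes")
                self.state[i] = "trimmed"       # terminal: never written again

    r = WriteOnceRegister(8)
    r.write(0, b"abcd")
    r.trim(0, 4)
    # r.write(2, b"zz") would raise: no byte's state ever goes backwards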
AP and CP Mode
• In case of network partition… (both reactions are sketched below)
• AP mode:
  • Split the chain and let each fragment run with its own head
  • Assign different names from the same prefix
  • [H, M1, M2, T] splits into [H, M1] and [M2, T]
• CP mode:
  • Keep replicating as long as the live nodes in the chain are still a majority
  • [H, M1, M2, T] shrinks to [H, M1, T], with [M2] dropped from the chain
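A sketch of the two partition reactions; the function names and the majority rule as written are my reading of the bullets, not Machi's actual chain-manager code.

    # `chain` is head-to-tail order; `partitions` are connected groups of nodes.
    def ap_mode(chain, partitions):
        # AP: every fragment keeps serving as its own chain with its own head;
        # values written on both sides are merged after the partition heals.
        return [[n for n in chain if n in part] for part in partitions]

    def cp_mode(chain, live):
        # CP: keep accepting writes only while the live nodes are still a
        # majority of the original chain; otherwise halt for consistency.
        survivors = [n for n in chain if n in live]
        return survivors if len(survivors) * 2 > len(chain) else None

    chain = ["H", "M1", "M2", "T"]
    print(ap_mode(chain, [{"H", "M1"}, {"M2", "T"}]))  # [['H','M1'], ['M2','T']]
    print(cp_mode(chain, {"H", "M1", "T"}))            # ['H','M1','T'], M2 dropped
    print(cp_mode(chain, {"M2"}))                      # None: minority must stop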
Important Resources
• [1] Riak source code: https://github.com/basho/riak
• [2] Machi source code: https://github.com/basho/machi
• [3] CORFU: A Distributed Shared Log for Flash Clusters: http://research.microsoft.com/apps/pubs/default.aspx?id=157204
• [4] A Brief History of Time in Riak: https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak / https://www.youtube.com/watch?v=3SWSw3mKApM
• [5] Managing Chain Replication with Humming Consensus: http://ricon.io/speakers/slides/Scott_Fritchie_Ricon_2015.pdf / https://www.youtube.com/watch?v=yR5kHL1bu1Q
• [6] Version Vectors are not Vector Clocks
• [7] The Network is Reliable