Everything is Okay: Database Edition

"A distributed system is one in which the failure of
a computer you didn't even know existed can render your own computer unusable." — Leslie Lamport, 1987

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous
Everything is awesome

Consistency: all nodes see the same data
Availability: all requests get a response
Partition tolerance: nodes survive arbitrary message loss

CP / AP / CA

CP / AP

YOU CAN'T SACRIFICE PARTITION TOLERANCE.

A typical first year for a new Google cluster:
40-80 machines with 50% packet loss
dozens of DNS blips
12 router restarts
8 network maintenance events
3 router failures
1 network rewiring
...and that's just LAN incidents

YOU CAN'T SACRIFICE PARTITION TOLERANCE.

"Despite your best efforts, your system will experience enough faults
that it will have to make a choice between reducing yield (i.e., stop answering requests) and reducing harvest (i.e., giving answers based on incomplete data). This decision should be based on business requirements."

CP: When faced with a network partition, stop answering requests
AP: When faced with a network partition, answer requests using incomplete data
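
The two choices above can be sketched with a toy replica. This is only an illustration under stated assumptions: the `Replica` class, the `Unavailable` exception, and the `partitioned` flag are hypothetical names invented for this sketch, and a real database detects partitions and coordinates replicas in far more involved ways.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised by a CP replica that refuses to answer during a partition."""


@dataclass
class Replica:
    mode: str                      # "CP" or "AP" (hypothetical labels for this sketch)
    data: dict = field(default_factory=dict)
    partitioned: bool = False      # True when this replica cannot reach its peers

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # CP: reduce yield. Refuse rather than risk a stale answer.
            raise Unavailable(f"cannot confirm latest value for {key!r}")
        # AP (or a healthy network): answer from local state, which may be
        # stale during a partition. This reduces harvest instead of yield.
        return self.data.get(key)


cp = Replica(mode="CP", data={"balance": 100})
ap = Replica(mode="AP", data={"balance": 100})

# A partition starts. Assume the balance is updated to 50 on the other side,
# but neither replica hears about it.
cp.partitioned = True
ap.partitioned = True

try:
    cp_result = cp.read("balance")
except Unavailable:
    cp_result = "unavailable"

ap_result = ap.read("balance")

print(cp_result)  # unavailable
print(ap_result)  # 100, which is stale if the other side now holds 50
```

Neither outcome is pleasant: the CP replica gives the caller nothing, and the AP replica gives the caller an answer that may be wrong.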

Make your choice, but it might still suck.

Consistency: really freaking hard
Availability: okay I guess
Partition tolerance: really freaking hard

PAXOS (1998)

PAXOS & CHUBBY (2007)

"There are significant gaps between the description of the Paxos
algorithm and the needs of a real-world system. In order to build a real- world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol."

"The fault-tolerance computing community has not developed the tools to
make it easy to implement their algorithms."

"The fault-tolerance computing community has not paid enough attention to
testing, a key ingredient for building fault-tolerant systems."

PAXOS & MEGASTORE (2011)

"While many systems use Paxos solely for locking, master election,
or replication of metadata and configurations, we believe that Megastore is the largest system deployed that uses Paxos to replicate primary user data across datacenters on every write."

"The original Paxos algorithm is ill-suited for high-latency network links
because it demands multiple rounds of communication."

"Megastore is perhaps the first large-scale storage system to implement
Paxos-based replication across datacenters while satisfying the scalability and performance requirements of scalable web applications in the cloud."

RAFT (2014)

"In Search of an Understandable Consensus Algorithm"

JEPSEN (2013—)

PostgreSQL (CP) "...we cannot assert the state of the system for these writes."

Redis (CP) "Redis threw away 56% of the writes it told us succeeded."

MongoDB (CP and AP???) "MongoDB is neither AP nor CP. The defaults can cause significant loss of acknowledged writes. The strongest consistency offered has bugs which cause false acknowledgements..."

Riak (AP) "Riak’s last-write-wins is fundamentally unsafe in the presence of network partitions."
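
Why last-write-wins loses data during a partition can be shown with a toy merge function. This is a minimal sketch, not Riak's actual implementation: `lww_merge`, the timestamps, and the cart values are all made up for illustration.

```python
def lww_merge(a, b):
    """Keep whichever (timestamp, value) pair has the larger timestamp."""
    return a if a[0] >= b[0] else b


# Two clients on opposite sides of a partition update the same key.
# Each write is acknowledged by the replica that received it.
# Timestamps come from each node's local clock (values are hypothetical).
side_a = (1000.002, {"cart": ["book"]})
side_b = (1000.001, {"cart": ["laptop"]})

# When the partition heals, the replicas reconcile by timestamp alone.
merged = lww_merge(side_a, side_b)

print(merged[1])  # {'cart': ['book']}
```

The laptop write was acknowledged to its client, yet it disappears on merge without any error. Which write survives depends only on whose clock read higher.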

Zookeeper (CP) "Use Zookeeper...wherever possible [taking] advantage of tested recipes and client libraries..."

NuoDB (CAP counterexample???) "...most, but not all, writes made during the partition failed...and were not retried"

Kafka (CA???) "Kafka’s replication claimed to be CA, but in the presence of a partition, threw away an arbitrarily large volume of committed writes."

Cassandra (AP) "Cassandra lightweight transactions are not even close to correct."

RabbitMQ (CP or AP) "...in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor."

etcd and Consul (CP) "consistent reads...allow stale reads"

Elasticsearch (CP) "Elasticsearch appears to lose writes...during asymmetric partitions, symmetric partitions, overlapping partitions, disjoint partitions, and even partitions which only isolate a single node once. Its convergence times are slow and the cluster can repeatably deadlock, forcing an administrator to intervene before recovery...If you are an Elasticsearch user (as I am): good luck."

Join a monastery
Pay Google
Hope your site never gets popular
Choose CP or AP, read aphyr.com, and hope for the best

http://aphyr.com
http://codahale.com/you-cant-sacrifice-partition-tolerance
http://research.google.com (Chubby, Megastore)
https://ramcloud.stanford.edu/~ongaro/thesis.pdf
http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods

More Decks by Dylan Vassallo
