Slide 1

Slide 1 text

What Ever Happened to Durability?
Tom Lyon
Founder & Chief Scientist, DriveScale
@aka_pugs

Slide 2

Slide 2 text

Durability & ACID
- Atomicity
- Consistency
- Isolation
- Durability

Slide 3

Slide 3 text

Durability Formalized

Slide 4

Slide 4 text

Durability Defined
- If written data is acknowledged, it must be forever readable
- If written data is read once [before it is acknowledged], it must be forever readable

Slide 5

Slide 5 text

Nothing is Forever
- Hardware eventually fails
- Software eventually (?) works
- Durability is a matter of degree
- What is good enough?

Slide 6

Slide 6 text

Estimating Durability
https://www.backblaze.com/blog/cloud-storage-durability/
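To make "a matter of degree" concrete, here is a deliberately crude back-of-the-envelope model. It is not the calculation from the linked post. It assumes independent drive failures, a fixed repair window, and that data is lost only when every copy fails inside the same window. The function names and the sample inputs are hypothetical.

```python
import math


def annual_loss_probability(afr: float, repair_days: float, copies: int) -> float:
    """Rough chance of losing an object within one year.

    afr          annual failure rate of one drive (0.01 means 1%)
    repair_days  time needed to rebuild a failed copy
    copies       number of independent full replicas
    """
    # Chance that one given drive fails during a single repair window.
    p_window = afr * repair_days / 365.0
    # Chance that every copy fails during the same window (independence assumed).
    p_loss_window = p_window ** copies
    windows_per_year = 365.0 / repair_days
    # 1 - (1 - p)^n, computed with log1p/expm1 to keep precision for tiny p.
    return -math.expm1(windows_per_year * math.log1p(-p_loss_window))


def nines(loss_probability: float) -> float:
    """Express a loss probability as a count of nines of durability."""
    return -math.log10(loss_probability)


if __name__ == "__main__":
    for copies in (1, 2, 3):
        loss = annual_loss_probability(afr=0.01, repair_days=7, copies=copies)
        print(copies, "copies:", round(nines(loss), 1), "nines")
```

The independence assumption is the weak point of any such estimate: correlated failures can dominate the result long before the arithmetic says they should.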

Slide 7

Slide 7 text

Performance is the Enemy
- “The only good write is an O_SYNC write”
- Write-behind, caching, and background compaction/migration can all lead to hidden errors
- fsync(2) can and should return errors, but misses some
  - See https://wiki.postgresql.org/wiki/Fsync_Errors
  - PostgreSQL: Caring about durability since 1986
- “commit intervals”?
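As a minimal sketch of what taking fsync(2) seriously looks like from an application, the helper below writes a file and returns only after both the file and its parent directory have been synced. It assumes a POSIX system (opening a directory for fsync does not work on Windows), and `durable_write` is a hypothetical name, not an existing API.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path; return only if fsync reported success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Raises OSError if the kernel reports a write-back failure.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Sync the parent directory so the directory entry itself is on
    # stable storage (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If `os.fsync` raises, the safest reading of the PostgreSQL notes linked above is to treat the file contents as unknown rather than to retry fsync and trust a later success.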

Slide 8

Slide 8 text

Can’t trust a File System
“We analyze 11 applications, and find 60 vulnerabilities, some of which result in severe consequences like corruption or data loss.”

Slide 9

Slide 9 text

Can’t trust an SSD
“Surprisingly, we find that 13 out of the 15 devices, including the supposedly ‘enterprise-class’ devices, exhibit failure behavior contrary to our expectations”

Slide 10

Slide 10 text

Servers and Mayflies
- Back in the day, when “the” computer crashed, you just waited for repair
- Now you remove or re-image the server – with the drives
- Local durability is really hard, but no longer adequate

Slide 11

Slide 11 text

Replication
- Backups? Not timely
- Synchronous mirroring? Very expensive
- Just use the network! Make copies! Go forth and replicate!
- Losing a disk or server no longer causes lost data. Right? Who needs fsync?

Slide 12

Slide 12 text

Correlated Failures
- AWS can lose a data center, you can too
- Rack power problems are common
- The smaller your cluster, the more vulnerable it is
https://xkcd.com/1737
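One way to see why small clusters are more exposed: if replicas are placed on randomly chosen nodes with no rack awareness, the chance that every copy shares a single rack (and so a single power feed) climbs quickly as the rack count shrinks. The sketch below assumes equal-sized racks and uniform random placement on distinct nodes; `same_rack_probability` is a hypothetical helper.

```python
import math


def same_rack_probability(racks: int, nodes_per_rack: int, copies: int) -> float:
    """Chance that all copies of an object land in one rack.

    Assumes copies are placed on distinct nodes chosen uniformly at
    random, with no rack-aware placement policy.
    """
    total_nodes = racks * nodes_per_rack
    if copies > total_nodes:
        raise ValueError("more copies than nodes")
    # Placements that fit entirely inside any single rack.
    within_one_rack = racks * math.comb(nodes_per_rack, copies)
    all_placements = math.comb(total_nodes, copies)
    return within_one_rack / all_placements


if __name__ == "__main__":
    for racks in (1, 2, 5, 20):
        print(racks, "racks:", same_rack_probability(racks, 10, 3))
```

Rack-aware placement removes this particular risk, but only when the cluster has at least as many racks as copies.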

Slide 13

Slide 13 text

It’s a Distributed System!

Slide 14

Slide 14 text

CAP Theorem
- You will have Partitioning.
- You must choose between Availability and Consistency.
- Your users will hate your choice.
- Availability can be improved by brute force and $$$ - to reduce partitioning.
- Consistency requires consensus.

Slide 15

Slide 15 text

Consensus is Hard

Slide 16

Slide 16 text

Jepsen breaks everything
“Use Zookeeper. It’s mature, well-designed, and battle-tested.”
“The etcd and Consul teams both take consistency seriously…”
Kyle Kingsbury, https://jepsen.io

Slide 17

Slide 17 text

Logs & Journals
- Application first writes to log, then to where the data “really lives”
- FS writes to journal, then to where the data “really lives”
- Device writes to log, then to where the data “really lives”
- What if “the truth” “really lived” in the log?
- The other places become read caches
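The "log as truth" idea can be sketched in a few lines: every write is appended and synced to the log before it is acknowledged, and the in-memory dictionary is only a read cache that can be rebuilt by replaying the log. The class name `LogBackedStore` and the JSON-lines record format are assumptions made for illustration.

```python
import json
import os


class LogBackedStore:
    """Key-value store whose only durable state is an append-only log."""

    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.cache = {}  # read cache, rebuilt from the log on startup
        if os.path.exists(log_path):
            with open(log_path, "r", encoding="utf-8") as log:
                for line in log:
                    # A real log would checksum records and truncate a torn
                    # tail; this sketch assumes every line is complete.
                    record = json.loads(line)
                    self.cache[record["k"]] = record["v"]
        self._log = open(log_path, "a", encoding="utf-8")

    def put(self, key: str, value) -> None:
        # Append and sync first; only then update the cache and return.
        self._log.write(json.dumps({"k": key, "v": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.cache[key] = value

    def get(self, key: str):
        return self.cache.get(key)

    def close(self) -> None:
        self._log.close()
```

Throwing away the cache loses nothing: constructing a new `LogBackedStore` on the same path replays the log and arrives at the same state.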

Slide 18

Slide 18 text

Table and Stream Duality
- “A table is just a cache of the latest value for each key in a stream” – P. Helland
- Logs are great for streaming data
- What if the log itself is distributed and allows many writers and readers?
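The quoted duality is easy to show in code: folding a stream of (key, value) events, oldest first, yields the table. The use of `None` as a delete marker is an assumption added for this sketch, and `materialize` is a hypothetical name.

```python
from typing import Any, Dict, Iterable, Optional, Tuple


def materialize(stream: Iterable[Tuple[str, Optional[Any]]]) -> Dict[str, Any]:
    """Build a table holding the latest value for each key in the stream."""
    table: Dict[str, Any] = {}
    for key, value in stream:
        if value is None:
            # Assumption for this sketch: None marks a deleted key.
            table.pop(key, None)
        else:
            # Later events overwrite earlier ones for the same key.
            table[key] = value
    return table


if __name__ == "__main__":
    events = [("alice", 10), ("bob", 5), ("alice", 12), ("bob", None)]
    print(materialize(events))
```

Replaying a prefix of the stream gives the table as it stood at that point, which a table on its own cannot offer.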

Slide 19

Slide 19 text

Streaming Systems
- Apache Kafka
  - 60 second “commit interval?”
- Apache Pulsar
  - Uses Apache BookKeeper
- Distributed Logs:
  - Apache DistributedLog – uses BookKeeper
  - Facebook LogDevice

Slide 20

Slide 20 text

Apache BookKeeper™
- “A scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads”
- Guarantees:
  - “If an entry has been acknowledged, it must be readable”
  - “If an entry has been read once, it must always be readable”

Slide 21

Slide 21 text

BookKeeper Components
- Client-side library
- Distributed Ledger Abstraction
- “Bookie” – very simple storage nodes
- Bookies do NOT talk to each other
- ZooKeeper for coordination, consensus, cluster membership, and quorums
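Because bookies never talk to each other, replication has to be driven by the client library. The toy simulation below illustrates that shape only: the client sends each entry to every bookie in its ensemble and acknowledges the write once an ack quorum has stored it. `Bookie`, `LedgerWriter`, and their methods are hypothetical in-memory stand-ins, not the BookKeeper client API, and the sketch leaves out the separate write quorum, ensemble changes, and recovery.

```python
from typing import Dict, List, Tuple


class Bookie:
    """Very simple storage node; knows nothing about other bookies."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy
        self.entries: Dict[Tuple[int, int], bytes] = {}

    def add_entry(self, ledger_id: int, entry_id: int, payload: bytes) -> bool:
        if not self.healthy:
            return False  # simulated failure: no acknowledgement
        self.entries[(ledger_id, entry_id)] = payload
        return True


class LedgerWriter:
    """Client-side writer that fans each entry out to its ensemble."""

    def __init__(self, ledger_id: int, ensemble: List[Bookie], ack_quorum: int) -> None:
        if not 1 <= ack_quorum <= len(ensemble):
            raise ValueError("ack_quorum must be between 1 and the ensemble size")
        self.ledger_id = ledger_id
        self.ensemble = ensemble
        self.ack_quorum = ack_quorum
        self.next_entry_id = 0

    def append(self, payload: bytes) -> int:
        entry_id = self.next_entry_id
        acks = sum(
            1 for bookie in self.ensemble
            if bookie.add_entry(self.ledger_id, entry_id, payload)
        )
        if acks < self.ack_quorum:
            # Not acknowledged: the caller must not treat the entry as durable.
            raise IOError(f"only {acks} of {self.ack_quorum} required acks")
        self.next_entry_id += 1
        return entry_id
```

With three bookies and an ack quorum of two, one failed bookie does not block writes, while two failed bookies cause `append` to raise instead of acknowledging.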

Slide 22

Slide 22 text

BookKeeper Data Flow
[Diagram: data flow between Apps and Bookies]

Slide 23

Slide 23 text

Planet Java
- ZooKeeper and BookKeeper are both from Planet Java
- How about something more friendly to Planet Linux?
- Use etcd, rewrite BookKeeper like ScyllaDB did for Cassandra?

Slide 24

Slide 24 text

Take-aways
- Durability is Hard
- Distributed Durability is Very Hard
- Be Up-Front about your durability model
- Logs as Truth & Streaming are the future
- Apache BookKeeper is awesome
- Don’t re-invent the wheel!

Slide 25

Slide 25 text

Q & A
Software Composable Infrastructure for modern workloads and commodity hardware.