Upgrade to Pro — share decks privately, control downloads, hide ads and more …

The Computer Wants to Lose Your Data

Avatar for Chris Sinjakli Chris Sinjakli
October 08, 2025
38

The Computer Wants to Lose Your Data

Storing data is something we expect computers to just do. When your application writes data to a database, you trust it to give you that data back later, but what does it take to make that reliable?

In this session, we'll explore the ways that computers can surprise us by failing to save or corrupting our data. We'll do this through the lens of databases, with a focus on MySQL and Postgres.

Specifically, we'll cover:

- The MySQL doublewrite buffer: the mechanism MySQL uses to guarantee writes make it safely to disk
- The Postgres fsyncgate incident: where the Postgres team realised that the guarantees around Linux's fsync syscall weren't as strong as they thought
- Write-through caches on disks: how manufacturers win benchmarks at the cost of data safety

We'll also look at how database replication can partially paper over these problems for us and the limits of what it can do.

Avatar for Chris Sinjakli

Chris Sinjakli

October 08, 2025
Tweet

Transcript

  1. INSERT INTO products VALUES ("sunglasses", 27.99); SELECT * FROM products;

    id | name | price ------------------------ 1 | sunglasses | 27.99
  2. INSERT INTO products VALUES ("sunglasses", 27.99); SELECT * FROM products;

    id | name | price ------------------------ 1 | sunglasses | 27.99
  3. INSERT INTO products VALUES ("sunglasses", 27.99); SELECT * FROM products;

    id | name | price ------------------------ 1 | sunglasses | 27.99 ? ? ?
  4. Hi

  5. Case studies - MySQL: doublewrite bu ff er - Postgres:

    fsyncgate - Disk: write-through caches
  6. Case studies - MySQL: doublewrite bu ff er - Postgres:

    fsyncgate - Disk: write-through caches
  7. Case studies - MySQL: doublewrite bu ff er - Postgres:

    fsyncgate - Disk: write-through caches
  8. Case studies - MySQL: doublewrite bu ff er - Postgres:

    fsyncgate - Disk: write-back caches
  9. The promise BEGIN; INSERT INTO products VALUES ("sunglasses", 27.99); INSERT

    INTO products VALUES ("jorts", 10.99); COMMIT; SELECT * FROM products; id | name | price ------------------------ 1 | sunglasses | 27.99 2 | jorts | 10.99
  10. The promise BEGIN; INSERT INTO products VALUES ("sunglasses", 27.99); INSERT

    INTO products VALUES ("jorts", 10.99); COMMIT; SELECT * FROM products; id | name | price ------------------------ 1 | sunglasses | 27.99 2 | jorts | 10.99
  11. The promise BEGIN; INSERT INTO products VALUES ("sunglasses", 27.99); INSERT

    INTO products VALUES ("jorts", 10.99); COMMIT; SELECT * FROM products; id | name | price ------------------------ 1 | sunglasses | 27.99 2 | jorts | 10.99 Data must be durable
  12. BEGIN; INSERT INTO products VALUES ("sunglasses", 27.99); INSERT INTO products

    VALUES ("jorts", 10.99); COMMIT; SELECT * FROM products; id | name | price ------------------------ 1 | sunglasses | 27.99 2 | jorts | 10.99 The promise Data must be durable All or nothing
  13. Table id name price 1 sunglasses 27.99 BEGIN; INSERT ("sunglasses",

    27.99); INSERT ("jorts", 10.99); COMMIT; SQL
  14. Table id name price 1 sunglasses 27.99 2 jorts 10.99

    BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL
  15. Table id name price 1 sunglasses 27.99 2 jorts 10.99

    BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL
  16. Table id name price 1 sunglasses 27.99 2 💥 💥

    BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL
  17. Table id name price 1 sunglasses 27.99 2 BEGIN; INSERT

    ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL Now what?
  18. Table id name price SQL WAL BEGIN TX 1 INS

    1 (1, ”sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1 BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT;
  19. Table id name price 1 sunglasses 27.99 BEGIN; INSERT ("sunglasses",

    27.99); INSERT ("jorts", 10.99); COMMIT; SQL WAL BEGIN TX 1 INS 1 (1, "sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1
  20. Table id name price 1 sunglasses 27.99 2 jorts 10.99

    BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL WAL BEGIN TX 1 INS 1 (1, "sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1
  21. Table id name price 1 sunglasses 27.99 2 jorts 10.99

    BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL WAL BEGIN TX 1 INS 1 (1, "sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1
  22. Table id name price 1 sunglasses 27.99 2 💥 💥

    BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL WAL BEGIN TX 1 INS 1 (1, "sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1
  23. Table id name price WAL BEGIN TX 1 INS 1

    (1, "sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1 Never committed No partial data
  24. Table id name price 1 sunglasses 27.99 2 jorts 10.99

    Up-to-date version of data WAL BEGIN TX 1 INS 1 (1, "sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1 Log of every operation
  25. . . . 16kB 16kB vs id name price 1

    sunglasses 27.99 2 jorts 10.99 16kB 16kB Logical Physical
  26. Old HDD 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB

    4kB 4kB RAM SSD 512b Pages vs Sectors
  27. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM SSD Pages vs Sectors 💥
  28. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  29. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  30. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  31. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  32. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  33. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  34. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  35. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  36. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  37. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table 💥
  38. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  39. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table 💥
  40. 16kB 4kB 4kB 4kB 4kB 16kB 4kB 4kB 4kB 4kB

    RAM Buf Doublewrite bu ff er 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB Table
  41. On restart: - Read doublewrite bu ff er pages -

    Copy to table if checksum good - Ignore if checksum bad
  42. On restart: - Read doublewrite bu ff er pages -

    Copy to table if checksum good - Ignore if checksum bad
  43. On restart: - Read doublewrite bu ff er pages -

    Copy to table if checksum good - Ignore if checksum bad
  44. On restart: - Read doublewrite bu ff er pages -

    Copy to table if checksum good - Ignore if checksum bad
  45. 16kB 16kB 16kB RAM ZFS (16kB recordsize) 4kB 4kB 4kB

    4kB 4kB 4kB 4kB 4kB SSD 16kB ZFS
  46. On Linux $ sudo nvme id-ctrl /dev/nvme0n1 | grep awupf

    awupf : 0 $ # Not a fancy drive $ # 1 sector -> 4kB atomicity $
  47. On Linux $ sudo nvme id-ctrl /dev/nvme0n1 | grep awupf

    awupf : 3 $ # Fancy drive :D $ # 4 sectors -> 16kB atomicity $
  48. man 2 fsync fsync(2) System Calls Manual fsync(2) NAME fsync,

    fdatasync - synchronize a file's in-core state with storage device DESCRIPTION fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other perma ‐ nent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This in ‐ cludes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
  49. man 2 fsync (simpli fi ed) fsync(2) System Calls Manual

    fsync(2) NAME fsync, fdatasync - synchronize a file's in-core state with storage device DESCRIPTION fsync() transfers all modified data of the file to the disk device so that all changed information can be retrieved even if the system crashes or is rebooted. The call blocks until the device reports that the transfer has completed.
  50. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync fsync 8kB 8kB 8kB 8kB
  51. Postgres mailing list From: Craig Ringer <craig(at)2ndquadrant(dot)com> To: PostgreSQL Hackers

    <pgsql-hackers(at)postgresql(dot)org> Subject: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS Date: 2018-03-28 02:23:46 Message-ID: CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com Lists: pgsql-hackers Hi all Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing checkpoint what should've been a fatal error. TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk". ...
  52. Postgres mailing list From: Craig Ringer <craig(at)2ndquadrant(dot)com> To: PostgreSQL Hackers

    <pgsql-hackers(at)postgresql(dot)org> Subject: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS Date: 2018-03-28 02:23:46 Message-ID: CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com Lists: pgsql-hackers Hi all Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing checkpoint what should've been a fatal error. TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk". ...
  53. man 2 fsync fsync(2) System Calls Manual fsync(2) ERRORS The

    fsync() system call will fail if: ... [EIO] An error occurred during synchronization. This error may relate to data written to some other file descriptor on the same file. ...
  54. Postgres mailing list From: Craig Ringer <craig(at)2ndquadrant(dot)com> To: PostgreSQL Hackers

    <pgsql-hackers(at)postgresql(dot)org> Subject: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS Date: 2018-03-28 02:23:46 Message-ID: CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com Lists: pgsql-hackers Hi all Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing checkpoint what should've been a fatal error. TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk". ...
  55. fsync pseudocode file = File.open("/data/base") file.write("some data") err = file.fsync

    # Mark the data for a retry if !err.nil? && err.type == EIO file.mark_for_retry end
  56. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥
  57. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥
  58. fsync pseudocode file = File.open("/data/base") file.write("some data") err = file.fsync

    # Mark the data for a retry if !err.nil? && err.type == EIO file.mark_for_retry end
  59. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥
  60. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥
  61. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥
  62. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥
  63. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥
  64. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥 fsync
  65. Postgres mailing list From: Craig Ringer <craig(at)2ndquadrant(dot)com> To: PostgreSQL Hackers

    <pgsql-hackers(at)postgresql(dot)org> Subject: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS Date: 2018-03-28 02:23:46 Message-ID: CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com Lists: pgsql-hackers Hi all Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing checkpoint what should've been a fatal error. TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk". ...
  66. Table id name price 1 sunglasses 27.99 2 jorts 10.99

    BEGIN; INSERT ("sunglasses", 27.99); INSERT ("jorts", 10.99); COMMIT; SQL WAL BEGIN TX 1 INS 1 (1, "sunglasses", 27.99) INS 1 (2, "jorts", 10.99) COMMIT TX 1
  67. 4kB 4kB 4kB 4kB 4kB 4kB 4kB 4kB RAM SSD

    fsync 8kB 8kB 8kB 8kB 💥fsync💥 fsync
  68. Fix for bad drives # Disable write-back caching # #

    Might fix flush on a bad drive, # but ruin performance hdparm -W 0 /dev/sda
  69. - Storage is unforgiving - Higher layers can't fi x

    lower ones - Slower is easier Lessons
  70. - Storage is unforgiving - Higher layers can't fi x

    lower ones - Slower is easier Lessons
  71. - Storage is unforgiving - Higher layers can't fi x

    lower ones - Slower is easier Lessons
  72. Image credits • Twemoji Floppy Disk Emoji - CC-BY -

    https://github.com/twitter/twemoji/blob/ d94f4cf793e6d5ca592aa00f58a88f6a4229ad43/assets/svg/1f4be.svg • Hard Disk Guts - CC-BY - https://www. fl ickr.com/photos/mattandkim/97533589/ • Yellow Slippery Road Signage - CC0 - https://www.pexels.com/photo/sign-slippery-wet- caution-4341/ • Black and Green Circuit board - CC0 - https://www.pexels.com/photo/black-and-green- circuit-board-2644597/ • Corsair ForceGT 180GB - CC-BY - https://www. fl ickr.com/photos/ruocaled/8173124575/
  73. Image credits • High Performance NVMe SSD on Gray Surface

    - CC0 - https://www.pexels.com/photo/ high-performance-nvme-ssd-on-gray-surface-28666524/ • Server Guts - CC-BY - https://www. fl ickr.com/photos/chrisdag/2142582850 • Ceramic capacitors mounted on a PCB - CC-BY - https://commons.wikimedia.org/w/ index.php?curid=113868467 • NOIRLab HQ Server Racks - CC-BY - https://commons.wikimedia.org/wiki/ File:NOIRLab_HQ_Server_Racks_(6V6A0402-CC).jpg