Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Rewriting binary logs because your database broke

Rewriting binary logs because your database broke

Chris Sinjakli

June 18, 2020
Tweet

More Decks by Chris Sinjakli

Other Decks in Programming

Transcript

  1. Hi

  2. @ChrisSinjo

  3. @ChrisSinjo

  4. @ChrisSinjo

  5. Rewriting binary logs because your database broke @ChrisSinjo

  6. Site Reliability Engineer

  7. Site Reliability Engineer (a Software Engineer who cares more about

    systems…ish)
  8. None
  9. POST /cash/monies HTTP/1.1 { amount: 100 }

  10. High per-request

  11. Uptime is

  12. One part of a production incident

  13. None
  14. https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th- october/

  15. Postgres Postgres Replication Application

  16. Postgres Postgres Replication Application

  17. Postgres Postgres Application

  18. Postgres Postgres Application

  19. Many parts to that investigation

  20. Too many for one talk

  21. Focus: Debugging parts of the system you don’t normally touch

  22. Focus: Reproduce what happened with Postgres’ binary logs during that

    incident
  23. What is a binary log?

  24. [2020-06-14 23:02:37Z] GET / [2020-06-14 23:02:49Z] GET /favicon.ico [2020-06-14 23:03:53Z]

    GET /posts/binary-logs—talk What we normally mean by logs
  25. Databases have a different kind of log

  26. INSERT INTO talks VALUES ('binary-logs'); INSERT INTO talks VALUES ('niche-git-features');

    INSERT INTO talks VALUES ('fp-karaoke'); Some extremely boring SQL
  27. None
  28. None
  29. Warning: simplifying lie ahead

  30. INSERT INTO talks VALUES ('binary-logs'); INSERT INTO talks VALUES ('niche-git-features');

    INSERT INTO talks VALUES ('fp-karaoke'); ‑ Wrote 'binary-logs' into table 'talks' Wrote 'niche-git-features' into table 'talks' Wrote 'fp-karaoke' into table 'talks' A different kind of logs (if they were textual)
  31. But why bother doing that?

  32. Crash safety

  33. Index Table id talk_name 1 binary-logs 2 niche-git-features id 1

    2
  34. Index Table id talk_name 1 binary-logs 2 niche-git-features 3 fp-karaoke

    id 1 2
  35. Index Table id talk_name 1 binary-logs 2 niche-git-features 3 fp-karaoke

    id 1 2 .
  36. Index Table id talk_name 1 binary-logs 2 niche-git-features 3 fp-karaoke

    id 1 2 ???
  37. INSERT INTO talks VALUES ('binary-logs'); INSERT INTO talks VALUES ('niche-git-features');

    INSERT INTO talks VALUES ('fp-karaoke'); ‑ Wrote 'binary-logs' into table 'talks' Wrote 'niche-git-features' into table 'talks' Wrote 'fp-karaoke' into table 'talks' We can replay this operation
  38. Index Table id talk_name 1 binary-logs 2 niche-git-features 3 fp-karaoke

    id 1 2 ???
  39. Index Table id talk_name 1 binary-logs 2 niche-git-features 3 fp-karaoke

    id 1 2 3
  40. Also: replication

  41. Postgres Postgres Replication Application

  42. Postgres Postgres Replication Application (binary logs)

  43. So why the interest in rewriting the binary logs?

  44. 2020-06-14 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

    2020-06-14 17:23:02 GMT LOG: invalid record length at 0/3000180 A suspicious error message
  45. Issue restoring binary log ‑ Cause of failure to promote

    replica?
  46. Goal: do this repeatably in a non- production cluster

  47. 2020-06-14 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

    2020-06-14 17:23:02 GMT LOG: invalid record length at 0/3000180 Goal: produce this error message
  48. We can cheat: Postgres is open source

  49. But!

  50. These techniques also work on closed source software

  51. We just call that reverse engineering

  52. $ git checkout REL9_4_26 # we were running 9.4 $

    git grep -n "invalid record length" src/backend/access/transam/xlogreader.c:295: [...] src/backend/access/transam/xlogreader.c:604: [...] src/backend/access/transam/xlogreader.c:678: [...] Let's find the error
  53. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } Let's find the error
  54. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } Let's find the error
  55. src/include/access/xlog.h:58: #define SizeOfXLogRecord MAXALIGN(sizeof(XLogRecord)) Let's find the error

  56. Wouldn't it be convenient if we could make total_len ==

    0?
  57. src/backend/access/transam/xlogreader.c:272-273: record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);

    total_len = record->xl_tot_len; Let's find the error
  58. src/include/access/xlog.h:41: typedef struct XLogRecord { uint32 xl_tot_len; /* total len

    of entire record */ TransactionId xl_xid; /* xact id */ uint32 xl_len; /* total len of rmgr data */ uint8 xl_info; /* flag bits, see below */ RmgrId xl_rmid; /* resource manager for this record */ /* 2 bytes of padding here, initialize to zero */ XLogRecPtr xl_prev; /* ptr to previous record in log */ pg_crc32 xl_crc; /* CRC for this record */ /* If MAXALIGN==8, there are 4 wasted bytes here */ /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */ } XLogRecord; Let's find the error
  59. src/include/access/xlog.h:41-56: typedef struct XLogRecord { uint32 xl_tot_len; /* total len

    of entire record */ TransactionId xl_xid; /* xact id */ uint32 xl_len; /* total len of rmgr data */ uint8 xl_info; /* flag bits, see below */ RmgrId xl_rmid; /* resource manager for this record */ /* 2 bytes of padding here, initialize to zero */ XLogRecPtr xl_prev; /* ptr to previous record in log */ pg_crc32 xl_crc; /* CRC for this record */ /* If MAXALIGN==8, there are 4 wasted bytes here */ /* ACTUAL LOG DATA FOLLOWS AT END OF STRUCT */ } XLogRecord; Let's find the error
  60. None
  61. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } What was that check doing?
  62. src/backend/access/transam/xlogreader.c:291-300: { /* XXX: more validation should be done here

    */ if (total_len < SizeOfXLogRecord) { report_invalid_record(state, "invalid record length at %X/%X", (uint32) (RecPtr >> 32), (uint32) RecPtr); goto err; } gotheader = false; } What was that check doing? Size the record says it is Smallest possible size it can be
  63. INSERT INTO talks VALUES ('binary-logs'); INSERT INTO talks VALUES ('niche-git-features');

    INSERT INTO talks VALUES ('fp-karaoke'); ‑ Wrote 'binary-logs' into table 'talks' Wrote 'niche-git-features' into table 'talks' Wrote 'fp-karaoke' into table 'talks' A different kind of logs (if they were textual)
  64. Let's see what they look like in practice

  65. INSERT INTO talks VALUES ('binary-logs'); INSERT INTO talks VALUES ('niche-git-features');

    INSERT INTO talks VALUES ('fp-karaoke'); Some extremely boring SQL
  66. Grab the binary log file, and...

  67. A barely comprehensible wall of data

  68. A barely comprehensible wall of data Hex ASCII

  69. A little help: ASCII codes Decimal Hexadecimal Character 62 3E

    > 63 3F ? 64 40 @ 65 41 A 66 42 B
  70. A little help: ASCII codes Decimal Hexadecimal Character 62 3E

    > 63 3F ? 64 40 @ 65 41 A 66 42 B
  71. A barely comprehensible wall of data Hex ASCII

  72. A barely comprehensible wall of data We can see our

    talks!!
  73. How can we find xl_tot_len?

  74. INSERT INTO repro VALUES ('A'); INSERT INTO repro VALUES ('AB');

    INSERT INTO repro VALUES ('ABC'); INSERT INTO repro VALUES ('ABCD'); INSERT INTO repro VALUES ('ABCDE'); ... Some even more boring SQL
  75. Look for a field increasing by 1

  76. Guesswork incoming!

  77. Guesswork incoming! The data we inserted

  78. A little help: ASCII codes Decimal Hexadecimal Character 62 3E

    > 63 3F ? 64 40 @ 65 41 A 66 42 B
  79. Notice anything? The data we inserted

  80. Notice anything? The data we inserted Familiar characters

  81. Notice anything? Decimal Hexadecimal Character 63 3F ? 64 40

    @ 65 41 A The data we inserted Familiar characters
  82. Notice anything? The data we inserted Familiar characters

  83. Notice anything? The data we inserted Familiar characters

  84. Notice anything? The data we inserted Familiar characters Familiar characters

    (hex) The data we inserted (hex)
  85. Wouldn't it be convenient if we could make total_len ==

    0?
  86. We could import the Postgres structs and do this properly...

  87. ...or we could write a regex

  88. Let's write a regex

  89. None
  90. Let's break this one

  91. binlog_file_name = ARGV[0] puts binlog_file_name binlog_contents = IO.read(binlog_file_name, encoding: "BINARY")

    hex = binlog_contents.unpack("H*").first replaced = hex.gsub(/3f(000000.+41424300)/, "00\\1") bindata = [replaced].pack("H*") File.write(binlog_file_name + ".broken", bindata) break_binlog.rb
  92. binlog_file_name = ARGV[0] puts binlog_file_name binlog_contents = IO.read(binlog_file_name, encoding: "BINARY")

    hex = binlog_contents.unpack("H*").first replaced = hex.gsub(/3f(000000.+41424300)/, "00\\1") # Replaces 'ABC' size bindata = [replaced].pack("H*") File.write(binlog_file_name + ".broken", bindata) break_binlog.rb
  93. Let's break this one

  94. Broken!!

  95. And if we give it to a Postgres replica?

  96. 2020-06-17 19:24:11 GMT LOG: restored log file "000000020000000000000003" from archive

    2020-06-17 19:24:11 GMT LOG: invalid record length at 0/3000148 We reproduced the error!
  97. Success

  98. Success, with a caveat...

  99. Postgres Postgres Application

  100. https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th- october/

  101. Lessons

  102. Hopefully you learned some cool stuff about databases

  103. Debugging unfamiliar layers of a system

  104. No part of software is magic

  105. Thank you ✌❤ @ChrisSinjo @GoCardlessEng

  106. Image credits • Hard Disk Guts - CC-BY - https://www.flickr.com/photos/mattandkim/97533589/

    • Corsair ForceGT 180GB - CC-BY - https://www.flickr.com/photos/ruocaled/8173124575/ • Rope - CC-BY - https://www.flickr.com/photos/49140926@N07/6798304070/
  107. Questions? ✌❤ @ChrisSinjo @GoCardlessEng