Un-split brain MySQL

168ccec72eee0530b818d44f3fedaacf?s=47 Shlomi Noach
February 02, 2019

Un-split brain MySQL

Is there a way to amend a MySQL split brain scenario? When two servers are diverged from each other, is it possible to identify and undo the conflicting changes?

We introduce gh-mysql-rewind, which combines multiple technologies to achieve auto-resolution of data divergence. This presentation explains how gh-mysql-rewind works, and how it is being tested in production to validate its operation.

168ccec72eee0530b818d44f3fedaacf?s=128

Shlomi Noach

February 02, 2019
Tweet

Transcript

  1. MySQL Un-split brain
 aka Move Back in Time via gh-mysql-rewind

    Shlomi Noach GitHub FOSDEM 2019
  2. About me @github/database-infrastructure Author of orchestrator, gh-ost, freno, ccql and

    others. Blog at http://openark.org 
 github.com/shlomi-noach
 @ShlomiNoach
  3. GitHub
 Built for developers Largest open source hosting 100M+ repositories,

    31M+ developers, 
 2.1M+ organizations Supplier of octocat T-Shirts and stickers
  4. Incentive Expected MySQL split brain, with unexpected requirement.

  5. ! ! ! ! ! ! ! ! ! !

    ! ! ! ! ! ! ! ! ! ! ! ! ! ! Failover & split brain
  6. Incentive The time it took to restore data on the

    demoted topology
  7. Incentive Could we just roll back the changes and move

    back in time?
  8. gh-mysql-rewind Rewind a dirty server back in time and connect

    it as a healthy replica in a replication topology.
  9. Iteratively restore hosts into replication topology ! ! ! !

    ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
  10. gh-mysql-rewind How?

  11. GTID The good parts

  12. GTID Each server keeps track of: • gtid_executed: all transactions

    ever executed:
 00020192-1111-1111-1111-111111111111:1-130541 • gtid_purged: of which, some may have been purged:
 00020192-1111-1111-1111-111111111111:1-120008
  13. ! ! ! Binary logs ! ! ! Binary logs

    Split brain: GTID contents ! ! ! ! ! ! Binary logs
  14. 00020192-1111-1111-1111-111111111111:1-5042 Split brain: gtid_executed gtid_executed on demoted master:

  15. 00020192-1111-1111-1111-111111111111:1-5000,
 00020193-2222-2222-2222-222222222222:1-200 Split brain: gtid_executed gtid_executed on promoted master:

  16. 00020192-1111-1111-1111-111111111111:1-5042 - 00020192-1111-1111-1111-111111111111:5001-5042 00020192-1111-1111-1111-111111111111:1-5000,
 00020193-2222-2222-2222-222222222222:1-200 Identifying bad GTID transactions gtid_executed

    (demoted) - gtid_executed (promoted)
  17. Row based replication With binlog_row_image=FULL: Each INSERT, DELETE, UPDATE is

    expressed in the binary log as the complete row image before/after change.
  18. Row based replication BINLOG ' L9MtXBMBAAAANgAAAJAWAAAAAG4AAAAAAAEABG1ldGEACWhlYXJ0YmVhdAACAxEBBgCk9AE9 L9MtXB8BAAAAPAAAAMwWAAAAAG4AAAAAAAEAAgAC///8AQAAAFwt0y4BQYr8AQAAAFwt0y8BWKiT IZN+ '/*!*/; ###

    UPDATE `meta`.`heartbeat` ### WHERE ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2=1546507054.082314 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */ ### SET ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2=1546507055.088232 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */ # at 5836 #190103 11:17:35 server id 1 end_log_pos 5867 CRC32 0x2cf60376 Xid = 114 COMMIT/*!*/;
  19. MariaDB, flashback Developed by Alibaba Contributed to MySQL and MariaDB

    Implemented in MariaDB’s mysqlbinlog: • mysqlbinlog --flashback
  20. flashback example, pseudo code insert(1, 'a') insert(2, 'b') insert(3, 'c')

    update(2, 'b')->(2, 'second') update(3, 'c')->(3, 'third') insert(4, 'd') delete(1, 'a') insert(1, 'a') delete(4, 'd') update(3, 'third')->(3, 'c') update(2, 'second')->(2, 'b') delete(3, 'c') delete(2, 'b') delete(1, 'a')
  21. rewind GTID provides information on “what diverged”. flashback provides the

    mechanics to undo changes. Now, what exactly do we need to rewind?
  22. Time axis ! !

  23. Rewind to split point ! !

  24. Rewind beyond split point ! !

  25. ! ! ! Binary logs Finding bad transactions in binlogs

    mysql-bin.0000620 mysql-bin.0000621 mysql-bin.0000622 mysql-bin.0000623
  26. Finding bad transactions in binlogs # at 4 #190103 11:17:14

    server id 1 end_log_pos 123 CRC32 0x60730f6b Start: binlog v 4, server v 5.7.17-log created 190103 11:17:14 at startup ROLLBACK/*!*/; BINLOG ' GtMtXA8BAAAAdwAAAHsAAAAAAAQANS43LjE3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAa0y1cEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA AWsPc2A= '/*!*/; # at 123 #190103 11:17:14 server id 1 end_log_pos 194 CRC32 0x48f18c88 Previous-GTIDs # 00020192-1111-1111-1111-111111111111:1-59023 # at 194
  27. gh-mysql-rewind: preparation • Identifies the bad GTID transactions. • Identifies

    which binary logs contain those transactions. • Generates flashback per-binlog
  28. gh-mysql-rewind: inject GTID MariaDB/flashback do not speak MySQL-GTID. We use

    some good old awk hacking to inject dummy GTID entries. mysqlbinlog —flashback … | 
 awk 'BEGIN {"uuidgen -r" |& getline u} /^BEGIN/ {c += 1 ; print "SET @@SESSION.GTID_NEXT= \x27" u ":" c "\x27/*!*/;"} {print}' | sed -e s/', @@session.check_constraint_checks=1//g’
  29. gh-mysql-rewind: apply cat all this into the broken MySQL server.

    We have now reverted the server into some consistent point in history, at or (more likely) before the split. !
  30. gh-mysql-rewind: back in time Have we just made things even

    worse? Where are we? gtid_executed is a mess! What happens if we reconnect the server into the topology? ! ! ! Binary logs
  31. 00020192-1111-1111-1111-111111111111:1-5042 - 00020192-1111-1111-1111-111111111111:1-4805 00020192-1111-1111-1111-111111111111:4806-5042 Computing target GTID RESET MASTER
 SET

    GLOBAL gtid_purged := <computation>
  32. Join topology as healthy replica CHANGE MASTER TO
 MASTER_HOST=<healthy host>,


    MASTER_AUTO_POSITION=1
 
 ⚡ ! ! ! ! ! !
  33. Limitations Cannot rewind DDL JSON, POINT not supported Full binlog

    rollback means more recovery time Local run on each server One server at a time • With careful planning, the operation could be applied on a single server and propagate to its replicas.
  34. Testing Is this reliable?

  35. Testing Continuous testing in production!

  36. Testing Ideally: checksum entire data, contaminate, rewind, checksum again. But…

    • Data set is too big, tests would take days • How can we predict where (in history) gh-mysql-rewind will drop us at?
  37. ! ! ! ! ! ! STOP SLAVE CHANGE MASTER

    TO MASTER_AUTO_POSITION=1 Rotates relay logs FLUSH BINARY LOGS Opens a new binary log We predict that gh-mysql-rewind will drop us back into current position. Testing, a dedicated replica
  38. START SLAVE IO_THREAD Sleep 30sec STOP SLAVE We aggregate 30sec

    of production traffic in a relay log. Testing: grab production data ! Relay log
  39. We parse the relay log and list affected tables. We

    use all “small” tables, And a single “large” table. We checksum all these tables. Testing: analyze production data ! Relay log
  40. ! START SLAVE SQL_THREAD; SELECT MASTER_POS_WAIT(<relaylog end coordinates>) We apply

    the entire content of the relay log Testing: apply production data Relay log Binary log
  41. ! For each table, we issue: DELETE FROM <t> ORDER

    BY id DESC LIMIT 10 DELETE FROM <t> LIMIT 10 And to make things worse: START SLAVE Sleep 30sec STOP SLAVE Testing: contaminate Relay log Binary log
  42. ! Replication is likely to be broken. Data is corrupted.

    A mess.
  43. Testing: gh-mysql-rewind Connect to a healthy server. gh-mysql-rewind will rewind

    the dirty writes, and past those changes. Our choice of 30sec was intentional: it will rewind back at our FLUSH point. ! ! ! ! ! !
  44. Testing: checksum We checksum all the affected tables again. We

    expect 100% match. ! ! ! ! ! !
  45. Testing: iterate! The replica is back in the topology. Wait

    for it to catch up, run again, and again, and again. Complementary test exists to confirm our “point of arrival” calculation is correct, however, replication success makes it redundant. ! ! ! ! ! !
  46. Expectations To never use gh-mysql-rewind. To rewind & join replication

    within minutes as opposed to hours. Smaller binary logs can help.
  47. Status Shell script. To be released shortly to the community.

    At some stage integrate into orchestrator. But for that, we will need remote access.
  48. In development at MySQL Query & manipulate GTID over the

    client protocol:
  49. Questions? github.com/shlomi-noach @ShlomiNoach Thank you!