Un-split brain MySQL

MySQL Un-split brain  aka Move Back in Time via gh-mysql-rewind
Shlomi Noach GitHub FOSDEM 2019

About me @github/database-infrastructure Author of orchestrator, gh-ost, freno, ccql and
others. Blog at http://openark.org   github.com/shlomi-noach  @ShlomiNoach

GitHub  Built for developers Largest open source hosting 100M+ repositories,
31M+ developers,   2.1M+ organizations Supplier of octocat T-Shirts and stickers

Incentive Expected MySQL split brain, with unexpected requirement.

! ! ! ! ! ! ! ! ! !
! ! ! ! ! ! ! ! ! ! ! ! ! ! Failover & split brain

Incentive The time it took to restore data on the
demoted topology

Incentive Could we just roll back the changes and move
back in time?

gh-mysql-rewind Rewind a dirty server back in time and connect
it as a healthy replica in a replication topology.

Iteratively restore hosts into replication topology ! ! ! !
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !

gh-mysql-rewind How?

GTID The good parts

GTID Each server keeps track of: • gtid_executed: all transactions
ever executed:  00020192-1111-1111-1111-111111111111:1-130541 • gtid_purged: of which, some may have been purged:  00020192-1111-1111-1111-111111111111:1-120008

! ! ! Binary logs ! ! ! Binary logs
Split brain: GTID contents ! ! ! ! ! ! Binary logs

00020192-1111-1111-1111-111111111111:1-5042 Split brain: gtid_executed gtid_executed on demoted master:

00020192-1111-1111-1111-111111111111:1-5000,  00020193-2222-2222-2222-222222222222:1-200 Split brain: gtid_executed gtid_executed on promoted master:

00020192-1111-1111-1111-111111111111:1-5042 - 00020192-1111-1111-1111-111111111111:5001-5042 00020192-1111-1111-1111-111111111111:1-5000,  00020193-2222-2222-2222-222222222222:1-200 Identifying bad GTID transactions gtid_executed
(demoted) - gtid_executed (promoted)

Row based replication With binlog_row_image=FULL: Each INSERT, DELETE, UPDATE is
expressed in the binary log as the complete row image before/after change.

Row based replication BINLOG ' L9MtXBMBAAAANgAAAJAWAAAAAG4AAAAAAAEABG1ldGEACWhlYXJ0YmVhdAACAxEBBgCk9AE9 L9MtXB8BAAAAPAAAAMwWAAAAAG4AAAAAAAEAAgAC///8AQAAAFwt0y4BQYr8AQAAAFwt0y8BWKiT IZN+ '/*!*/; ###
UPDATE `meta`.`heartbeat` ### WHERE ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2=1546507054.082314 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */ ### SET ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2=1546507055.088232 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */ # at 5836 #190103 11:17:35 server id 1 end_log_pos 5867 CRC32 0x2cf60376 Xid = 114 COMMIT/*!*/;

MariaDB, ﬂashback Developed by Alibaba Contributed to MySQL and MariaDB
Implemented in MariaDB’s mysqlbinlog: • mysqlbinlog --flashback

ﬂashback example, pseudo code insert(1, 'a') insert(2, 'b') insert(3, 'c')
update(2, 'b')->(2, 'second') update(3, 'c')->(3, 'third') insert(4, 'd') delete(1, 'a') insert(1, 'a') delete(4, 'd') update(3, 'third')->(3, 'c') update(2, 'second')->(2, 'b') delete(3, 'c') delete(2, 'b') delete(1, 'a')

rewind GTID provides information on “what diverged”. flashback provides the
mechanics to undo changes. Now, what exactly do we need to rewind?

Time axis ! !

Rewind to split point ! !

Rewind beyond split point ! !

! ! ! Binary logs Finding bad transactions in binlogs
mysql-bin.0000620 mysql-bin.0000621 mysql-bin.0000622 mysql-bin.0000623

Finding bad transactions in binlogs # at 4 #190103 11:17:14
server id 1 end_log_pos 123 CRC32 0x60730f6b Start: binlog v 4, server v 5.7.17-log created 190103 11:17:14 at startup ROLLBACK/*!*/; BINLOG ' GtMtXA8BAAAAdwAAAHsAAAAAAAQANS43LjE3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAa0y1cEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA AWsPc2A= '/*!*/; # at 123 #190103 11:17:14 server id 1 end_log_pos 194 CRC32 0x48f18c88 Previous-GTIDs # 00020192-1111-1111-1111-111111111111:1-59023 # at 194

gh-mysql-rewind: preparation • Identifies the bad GTID transactions. • Identifies
which binary logs contain those transactions. • Generates flashback per-binlog

gh-mysql-rewind: inject GTID MariaDB/flashback do not speak MySQL-GTID. We use
some good old awk hacking to inject dummy GTID entries. mysqlbinlog —flashback … |   awk 'BEGIN {"uuidgen -r" |& getline u} /^BEGIN/ {c += 1 ; print "SET @@SESSION.GTID_NEXT= \x27" u ":" c "\x27/*!*/;"} {print}' | sed -e s/', @@session.check_constraint_checks=1//g’

gh-mysql-rewind: apply cat all this into the broken MySQL server.
We have now reverted the server into some consistent point in history, at or (more likely) before the split. !

gh-mysql-rewind: back in time Have we just made things even
worse? Where are we? gtid_executed is a mess! What happens if we reconnect the server into the topology? ! ! ! Binary logs

00020192-1111-1111-1111-111111111111:1-5042 - 00020192-1111-1111-1111-111111111111:1-4805 00020192-1111-1111-1111-111111111111:4806-5042 Computing target GTID RESET MASTER  SET
GLOBAL gtid_purged := <computation>

Join topology as healthy replica CHANGE MASTER TO  MASTER_HOST=<healthy host>, 
MASTER_AUTO_POSITION=1    ⚡ ! ! ! ! ! !

Limitations Cannot rewind DDL JSON, POINT not supported Full binlog
rollback means more recovery time Local run on each server One server at a time • With careful planning, the operation could be applied on a single server and propagate to its replicas.

Testing Is this reliable?

Testing Continuous testing in production!

Testing Ideally: checksum entire data, contaminate, rewind, checksum again. But…
• Data set is too big, tests would take days • How can we predict where (in history) gh-mysql-rewind will drop us at?

! ! ! ! ! ! STOP SLAVE CHANGE MASTER
TO MASTER_AUTO_POSITION=1 Rotates relay logs FLUSH BINARY LOGS Opens a new binary log We predict that gh-mysql-rewind will drop us back into current position. Testing, a dedicated replica

START SLAVE IO_THREAD Sleep 30sec STOP SLAVE We aggregate 30sec
of production traffic in a relay log. Testing: grab production data ! Relay log

We parse the relay log and list affected tables. We
use all “small” tables, And a single “large” table. We checksum all these tables. Testing: analyze production data ! Relay log

! START SLAVE SQL_THREAD; SELECT MASTER_POS_WAIT(<relaylog end coordinates>) We apply
the entire content of the relay log Testing: apply production data Relay log Binary log

! For each table, we issue: DELETE FROM <t> ORDER
BY id DESC LIMIT 10 DELETE FROM <t> LIMIT 10 And to make things worse: START SLAVE Sleep 30sec STOP SLAVE Testing: contaminate Relay log Binary log

! Replication is likely to be broken. Data is corrupted.
A mess.

Testing: gh-mysql-rewind Connect to a healthy server. gh-mysql-rewind will rewind
the dirty writes, and past those changes. Our choice of 30sec was intentional: it will rewind back at our FLUSH point. ! ! ! ! ! !

Testing: checksum We checksum all the affected tables again. We
expect 100% match. ! ! ! ! ! !

Testing: iterate! The replica is back in the topology. Wait
for it to catch up, run again, and again, and again. Complementary test exists to confirm our “point of arrival” calculation is correct, however, replication success makes it redundant. ! ! ! ! ! !

Expectations To never use gh-mysql-rewind. To rewind & join replication
within minutes as opposed to hours. Smaller binary logs can help.

Status Shell script. To be released shortly to the community.
At some stage integrate into orchestrator. But for that, we will need remote access.

In development at MySQL Query & manipulate GTID over the
client protocol:

Questions? github.com/shlomi-noach @ShlomiNoach Thank you!

Un-split brain MySQL

Un-split brain MySQL

More Decks by Shlomi Noach

Other Decks in Technology

Featured

Transcript