Slide 1

Slide 1 text

MySQL Un-split brain
 aka Move Back in Time via gh-mysql-rewind Shlomi Noach GitHub FOSDEM 2019

Slide 2

Slide 2 text

About me @github/database-infrastructure Author of orchestrator, gh-ost, freno, ccql and others. Blog at http://openark.org 
 github.com/shlomi-noach
 @ShlomiNoach

Slide 3

Slide 3 text

GitHub
 Built for developers Largest open source hosting 100M+ repositories, 31M+ developers, 
 2.1M+ organizations Supplier of octocat T-Shirts and stickers

Slide 4

Slide 4 text

Incentive Expected MySQL split brain, with unexpected requirement.

Slide 5

Slide 5 text

! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! Failover & split brain

Slide 6

Slide 6 text

Incentive The time it took to restore data on the demoted topology

Slide 7

Slide 7 text

Incentive Could we just roll back the changes and move back in time?

Slide 8

Slide 8 text

gh-mysql-rewind Rewind a dirty server back in time and connect it as a healthy replica in a replication topology.

Slide 9

Slide 9 text

Iteratively restore hosts into replication topology ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !

Slide 10

Slide 10 text

gh-mysql-rewind How?

Slide 11

Slide 11 text

GTID The good parts

Slide 12

Slide 12 text

GTID Each server keeps track of: • gtid_executed: all transactions ever executed:
 00020192-1111-1111-1111-111111111111:1-130541 • gtid_purged: of which, some may have been purged:
 00020192-1111-1111-1111-111111111111:1-120008

Slide 13

Slide 13 text

! ! ! Binary logs ! ! ! Binary logs Split brain: GTID contents ! ! ! ! ! ! Binary logs

Slide 14

Slide 14 text

00020192-1111-1111-1111-111111111111:1-5042 Split brain: gtid_executed gtid_executed on demoted master:

Slide 15

Slide 15 text

00020192-1111-1111-1111-111111111111:1-5000,
 00020193-2222-2222-2222-222222222222:1-200 Split brain: gtid_executed gtid_executed on promoted master:

Slide 16

Slide 16 text

00020192-1111-1111-1111-111111111111:1-5042 - 00020192-1111-1111-1111-111111111111:5001-5042 00020192-1111-1111-1111-111111111111:1-5000,
 00020193-2222-2222-2222-222222222222:1-200 Identifying bad GTID transactions gtid_executed (demoted) - gtid_executed (promoted)

Slide 17

Slide 17 text

Row based replication With binlog_row_image=FULL: Each INSERT, DELETE, UPDATE is expressed in the binary log as the complete row image before/after change.

Slide 18

Slide 18 text

Row based replication BINLOG ' L9MtXBMBAAAANgAAAJAWAAAAAG4AAAAAAAEABG1ldGEACWhlYXJ0YmVhdAACAxEBBgCk9AE9 L9MtXB8BAAAAPAAAAMwWAAAAAG4AAAAAAAEAAgAC///8AQAAAFwt0y4BQYr8AQAAAFwt0y8BWKiT IZN+ '/*!*/; ### UPDATE `meta`.`heartbeat` ### WHERE ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2=1546507054.082314 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */ ### SET ### @1=1 /* INT meta=0 nullable=0 is_null=0 */ ### @2=1546507055.088232 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */ # at 5836 #190103 11:17:35 server id 1 end_log_pos 5867 CRC32 0x2cf60376 Xid = 114 COMMIT/*!*/;

Slide 19

Slide 19 text

MariaDB, flashback Developed by Alibaba Contributed to MySQL and MariaDB Implemented in MariaDB’s mysqlbinlog: • mysqlbinlog --flashback

Slide 20

Slide 20 text

flashback example, pseudo code insert(1, 'a') insert(2, 'b') insert(3, 'c') update(2, 'b')->(2, 'second') update(3, 'c')->(3, 'third') insert(4, 'd') delete(1, 'a') insert(1, 'a') delete(4, 'd') update(3, 'third')->(3, 'c') update(2, 'second')->(2, 'b') delete(3, 'c') delete(2, 'b') delete(1, 'a')

Slide 21

Slide 21 text

rewind GTID provides information on “what diverged”. flashback provides the mechanics to undo changes. Now, what exactly do we need to rewind?

Slide 22

Slide 22 text

Time axis ! !

Slide 23

Slide 23 text

Rewind to split point ! !

Slide 24

Slide 24 text

Rewind beyond split point ! !

Slide 25

Slide 25 text

! ! ! Binary logs Finding bad transactions in binlogs mysql-bin.0000620 mysql-bin.0000621 mysql-bin.0000622 mysql-bin.0000623

Slide 26

Slide 26 text

Finding bad transactions in binlogs # at 4 #190103 11:17:14 server id 1 end_log_pos 123 CRC32 0x60730f6b Start: binlog v 4, server v 5.7.17-log created 190103 11:17:14 at startup ROLLBACK/*!*/; BINLOG ' GtMtXA8BAAAAdwAAAHsAAAAAAAQANS43LjE3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAa0y1cEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA AWsPc2A= '/*!*/; # at 123 #190103 11:17:14 server id 1 end_log_pos 194 CRC32 0x48f18c88 Previous-GTIDs # 00020192-1111-1111-1111-111111111111:1-59023 # at 194

Slide 27

Slide 27 text

gh-mysql-rewind: preparation • Identifies the bad GTID transactions. • Identifies which binary logs contain those transactions. • Generates flashback per-binlog

Slide 28

Slide 28 text

gh-mysql-rewind: inject GTID MariaDB/flashback do not speak MySQL-GTID. We use some good old awk hacking to inject dummy GTID entries. mysqlbinlog —flashback … | 
 awk 'BEGIN {"uuidgen -r" |& getline u} /^BEGIN/ {c += 1 ; print "SET @@SESSION.GTID_NEXT= \x27" u ":" c "\x27/*!*/;"} {print}' | sed -e s/', @@session.check_constraint_checks=1//g’

Slide 29

Slide 29 text

gh-mysql-rewind: apply cat all this into the broken MySQL server. We have now reverted the server into some consistent point in history, at or (more likely) before the split. !

Slide 30

Slide 30 text

gh-mysql-rewind: back in time Have we just made things even worse? Where are we? gtid_executed is a mess! What happens if we reconnect the server into the topology? ! ! ! Binary logs

Slide 31

Slide 31 text

00020192-1111-1111-1111-111111111111:1-5042 - 00020192-1111-1111-1111-111111111111:1-4805 00020192-1111-1111-1111-111111111111:4806-5042 Computing target GTID RESET MASTER
 SET GLOBAL gtid_purged :=

Slide 32

Slide 32 text

Join topology as healthy replica CHANGE MASTER TO
 MASTER_HOST=,
 MASTER_AUTO_POSITION=1
 
 ⚡ ! ! ! ! ! !

Slide 33

Slide 33 text

Limitations Cannot rewind DDL JSON, POINT not supported Full binlog rollback means more recovery time Local run on each server One server at a time • With careful planning, the operation could be applied on a single server and propagate to its replicas.

Slide 34

Slide 34 text

Testing Is this reliable?

Slide 35

Slide 35 text

Testing Continuous testing in production!

Slide 36

Slide 36 text

Testing Ideally: checksum entire data, contaminate, rewind, checksum again. But… • Data set is too big, tests would take days • How can we predict where (in history) gh-mysql-rewind will drop us at?

Slide 37

Slide 37 text

! ! ! ! ! ! STOP SLAVE CHANGE MASTER TO MASTER_AUTO_POSITION=1 Rotates relay logs FLUSH BINARY LOGS Opens a new binary log We predict that gh-mysql-rewind will drop us back into current position. Testing, a dedicated replica

Slide 38

Slide 38 text

START SLAVE IO_THREAD Sleep 30sec STOP SLAVE We aggregate 30sec of production traffic in a relay log. Testing: grab production data ! Relay log

Slide 39

Slide 39 text

We parse the relay log and list affected tables. We use all “small” tables, And a single “large” table. We checksum all these tables. Testing: analyze production data ! Relay log

Slide 40

Slide 40 text

! START SLAVE SQL_THREAD; SELECT MASTER_POS_WAIT() We apply the entire content of the relay log Testing: apply production data Relay log Binary log

Slide 41

Slide 41 text

! For each table, we issue: DELETE FROM ORDER BY id DESC LIMIT 10 DELETE FROM LIMIT 10 And to make things worse: START SLAVE Sleep 30sec STOP SLAVE Testing: contaminate Relay log Binary log

Slide 42

Slide 42 text

! Replication is likely to be broken. Data is corrupted. A mess.

Slide 43

Slide 43 text

Testing: gh-mysql-rewind Connect to a healthy server. gh-mysql-rewind will rewind the dirty writes, and past those changes. Our choice of 30sec was intentional: it will rewind back at our FLUSH point. ! ! ! ! ! !

Slide 44

Slide 44 text

Testing: checksum We checksum all the affected tables again. We expect 100% match. ! ! ! ! ! !

Slide 45

Slide 45 text

Testing: iterate! The replica is back in the topology. Wait for it to catch up, run again, and again, and again. Complementary test exists to confirm our “point of arrival” calculation is correct, however, replication success makes it redundant. ! ! ! ! ! !

Slide 46

Slide 46 text

Expectations To never use gh-mysql-rewind. To rewind & join replication within minutes as opposed to hours. Smaller binary logs can help.

Slide 47

Slide 47 text

Status Shell script. To be released shortly to the community. At some stage integrate into orchestrator. But for that, we will need remote access.

Slide 48

Slide 48 text

In development at MySQL Query & manipulate GTID over the client protocol:

Slide 49

Slide 49 text

Questions? github.com/shlomi-noach @ShlomiNoach Thank you!