Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Un-split brain MySQL

Shlomi Noach
February 02, 2019

Un-split brain MySQL

Is there a way to amend a MySQL split brain scenario? When two servers are diverged from each other, is it possible to identify and undo the conflicting changes?

We introduce gh-mysql-rewind, which combines multiple technologies to achieve auto-resolution of data divergence. This presentation explains how gh-mysql-rewind works, and how it is being tested in production to validate its operation.

Shlomi Noach

February 02, 2019
Tweet

More Decks by Shlomi Noach

Other Decks in Technology

Transcript

  1. MySQL Un-split brain

    aka Move Back in Time
    via gh-mysql-rewind
    Shlomi Noach
    GitHub
    FOSDEM 2019

    View full-size slide

  2. About me
    @github/database-infrastructure
    Author of orchestrator, gh-ost, freno, ccql
    and others.
    Blog at http://openark.org

    github.com/shlomi-noach

    @ShlomiNoach

    View full-size slide

  3. GitHub

    Built for developers
    Largest open source hosting
    100M+ repositories, 31M+ developers, 

    2.1M+ organizations
    Supplier of octocat T-Shirts and stickers

    View full-size slide

  4. Incentive
    Expected MySQL split brain, with unexpected requirement.

    View full-size slide

  5. ! !
    !
    !
    !
    !
    ! !
    !
    !
    !
    !
    ! !
    !
    !
    !
    !
    ! !
    !
    !
    !
    !
    Failover & split brain

    View full-size slide

  6. Incentive
    The time it took to restore data on the demoted topology

    View full-size slide

  7. Incentive
    Could we just roll back the changes and move back in time?

    View full-size slide

  8. gh-mysql-rewind
    Rewind a dirty server back in time and connect it as a healthy
    replica in a replication topology.

    View full-size slide

  9. Iteratively restore hosts into replication topology
    ! !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !
    !

    View full-size slide

  10. gh-mysql-rewind
    How?

    View full-size slide

  11. GTID
    The good parts

    View full-size slide

  12. GTID
    Each server keeps track of:
    • gtid_executed: all transactions ever executed:

    00020192-1111-1111-1111-111111111111:1-130541
    • gtid_purged: of which, some may have been purged:

    00020192-1111-1111-1111-111111111111:1-120008

    View full-size slide

  13. !
    !
    !
    Binary logs
    ! !
    !
    Binary logs
    Split brain: GTID contents
    ! !
    !
    !
    !
    !
    Binary logs

    View full-size slide

  14. 00020192-1111-1111-1111-111111111111:1-5042
    Split brain: gtid_executed
    gtid_executed on demoted master:

    View full-size slide

  15. 00020192-1111-1111-1111-111111111111:1-5000,

    00020193-2222-2222-2222-222222222222:1-200
    Split brain: gtid_executed
    gtid_executed on promoted master:

    View full-size slide

  16. 00020192-1111-1111-1111-111111111111:1-5042
    -
    00020192-1111-1111-1111-111111111111:5001-5042
    00020192-1111-1111-1111-111111111111:1-5000,

    00020193-2222-2222-2222-222222222222:1-200
    Identifying bad GTID transactions
    gtid_executed (demoted) - gtid_executed (promoted)

    View full-size slide

  17. Row based replication
    With binlog_row_image=FULL:
    Each INSERT, DELETE, UPDATE is expressed in the binary log
    as the complete row image before/after change.

    View full-size slide

  18. Row based replication
    BINLOG '
    L9MtXBMBAAAANgAAAJAWAAAAAG4AAAAAAAEABG1ldGEACWhlYXJ0YmVhdAACAxEBBgCk9AE9
    L9MtXB8BAAAAPAAAAMwWAAAAAG4AAAAAAAEAAgAC///8AQAAAFwt0y4BQYr8AQAAAFwt0y8BWKiT
    IZN+
    '/*!*/;
    ### UPDATE `meta`.`heartbeat`
    ### WHERE
    ### @1=1 /* INT meta=0 nullable=0 is_null=0 */
    ### @2=1546507054.082314 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */
    ### SET
    ### @1=1 /* INT meta=0 nullable=0 is_null=0 */
    ### @2=1546507055.088232 /* TIMESTAMP(6) meta=6 nullable=0 is_null=0 */
    # at 5836
    #190103 11:17:35 server id 1 end_log_pos 5867 CRC32 0x2cf60376 Xid = 114
    COMMIT/*!*/;

    View full-size slide

  19. MariaDB, flashback
    Developed by Alibaba
    Contributed to MySQL and MariaDB
    Implemented in MariaDB’s mysqlbinlog:
    • mysqlbinlog --flashback

    View full-size slide

  20. flashback example, pseudo code
    insert(1, 'a')
    insert(2, 'b')
    insert(3, 'c')
    update(2, 'b')->(2, 'second')
    update(3, 'c')->(3, 'third')
    insert(4, 'd')
    delete(1, 'a')
    insert(1, 'a')
    delete(4, 'd')
    update(3, 'third')->(3, 'c')
    update(2, 'second')->(2, 'b')
    delete(3, 'c')
    delete(2, 'b')
    delete(1, 'a')

    View full-size slide

  21. rewind
    GTID provides information on “what diverged”.
    flashback provides the mechanics to undo changes.
    Now, what exactly do we need to rewind?

    View full-size slide

  22. Rewind to split point
    !
    !

    View full-size slide

  23. Rewind beyond split point
    !
    !

    View full-size slide

  24. ! !
    !
    Binary logs
    Finding bad transactions in binlogs
    mysql-bin.0000620
    mysql-bin.0000621
    mysql-bin.0000622
    mysql-bin.0000623

    View full-size slide

  25. Finding bad transactions in binlogs
    # at 4
    #190103 11:17:14 server id 1 end_log_pos 123 CRC32 0x60730f6b Start: binlog v 4, server v 5.7.17-log created 190103 11:17:14 at
    startup
    ROLLBACK/*!*/;
    BINLOG '
    GtMtXA8BAAAAdwAAAHsAAAAAAAQANS43LjE3LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAa0y1cEzgNAAgAEgAEBAQEEgAAXwAEGggAAAAICAgCAAAACgoKKioAEjQA
    AWsPc2A=
    '/*!*/;
    # at 123
    #190103 11:17:14 server id 1 end_log_pos 194 CRC32 0x48f18c88 Previous-GTIDs
    # 00020192-1111-1111-1111-111111111111:1-59023
    # at 194

    View full-size slide

  26. gh-mysql-rewind: preparation
    • Identifies the bad GTID transactions.
    • Identifies which binary logs contain those transactions.
    • Generates flashback per-binlog

    View full-size slide

  27. gh-mysql-rewind: inject GTID
    MariaDB/flashback do not speak MySQL-GTID.
    We use some good old awk hacking to inject dummy GTID
    entries.
    mysqlbinlog —flashback … | 

    awk 'BEGIN {"uuidgen -r" |& getline u} /^BEGIN/
    {c += 1 ; print "SET @@SESSION.GTID_NEXT=
    \x27" u ":" c "\x27/*!*/;"} {print}' | sed -e s/',
    @@session.check_constraint_checks=1//g’

    View full-size slide

  28. gh-mysql-rewind: apply
    cat all this into the broken MySQL server.
    We have now reverted the server into some consistent point in
    history, at or (more likely) before the split.
    !

    View full-size slide

  29. gh-mysql-rewind: back in time
    Have we just made things even worse?
    Where are we?
    gtid_executed is a mess!
    What happens if we reconnect the
    server into the topology?
    ! !
    !
    Binary logs

    View full-size slide

  30. 00020192-1111-1111-1111-111111111111:1-5042
    -
    00020192-1111-1111-1111-111111111111:1-4805
    00020192-1111-1111-1111-111111111111:4806-5042
    Computing target GTID
    RESET MASTER

    SET GLOBAL gtid_purged :=

    View full-size slide

  31. Join topology as healthy replica
    CHANGE MASTER TO

    MASTER_HOST=,

    MASTER_AUTO_POSITION=1



    ! !
    !
    !
    !
    !

    View full-size slide

  32. Limitations
    Cannot rewind DDL
    JSON, POINT not supported
    Full binlog rollback means more recovery time
    Local run on each server
    One server at a time
    • With careful planning, the operation could be applied on a
    single server and propagate to its replicas.

    View full-size slide

  33. Testing
    Is this reliable?

    View full-size slide

  34. Testing
    Continuous testing in production!

    View full-size slide

  35. Testing
    Ideally: checksum entire data, contaminate, rewind, checksum
    again. But…
    • Data set is too big, tests would take days
    • How can we predict where (in history) gh-mysql-rewind will
    drop us at?

    View full-size slide

  36. ! !
    !
    !
    !
    !
    STOP SLAVE
    CHANGE MASTER TO MASTER_AUTO_POSITION=1
    Rotates relay logs
    FLUSH BINARY LOGS
    Opens a new binary log
    We predict that gh-mysql-rewind will drop us back into
    current position.
    Testing, a dedicated replica

    View full-size slide

  37. START SLAVE IO_THREAD
    Sleep 30sec
    STOP SLAVE
    We aggregate 30sec of production traffic in a relay log.
    Testing: grab production data
    !
    Relay log

    View full-size slide

  38. We parse the relay log and list affected tables.
    We use all “small” tables,
    And a single “large” table.
    We checksum all these tables.
    Testing: analyze production data
    !
    Relay log

    View full-size slide

  39. !
    START SLAVE SQL_THREAD;
    SELECT MASTER_POS_WAIT(coordinates>)
    We apply the entire content of the relay log
    Testing: apply production data
    Relay log
    Binary log

    View full-size slide

  40. !
    For each table, we issue:
    DELETE FROM ORDER BY id DESC LIMIT 10
    DELETE FROM LIMIT 10
    And to make things worse:
    START SLAVE
    Sleep 30sec
    STOP SLAVE
    Testing: contaminate
    Relay log
    Binary log

    View full-size slide

  41. ! Replication is likely to be broken. Data is corrupted.
    A mess.

    View full-size slide

  42. Testing: gh-mysql-rewind
    Connect to a healthy server.
    gh-mysql-rewind will rewind the dirty writes, and past those
    changes.
    Our choice of 30sec was intentional: it will rewind back at our
    FLUSH point.
    ! !
    !
    !
    !
    !

    View full-size slide

  43. Testing: checksum
    We checksum all the affected tables again. We expect 100%
    match.
    ! !
    !
    !
    !
    !

    View full-size slide

  44. Testing: iterate!
    The replica is back in the topology.
    Wait for it to catch up, run again, and again, and again.
    Complementary test exists to confirm our “point of arrival”
    calculation is correct, however, replication success makes it
    redundant.
    ! !
    !
    !
    !
    !

    View full-size slide

  45. Expectations
    To never use gh-mysql-rewind.
    To rewind & join replication within minutes as opposed to hours.
    Smaller binary logs can help.

    View full-size slide

  46. Status
    Shell script.
    To be released shortly to the community.
    At some stage integrate into orchestrator.
    But for that, we will need remote access.

    View full-size slide

  47. In development at MySQL
    Query & manipulate GTID over the client protocol:

    View full-size slide

  48. Questions?
    github.com/shlomi-noach
    @ShlomiNoach
    Thank you!

    View full-size slide