Upgrade to Pro — share decks privately, control downloads, hide ads and more …

[ZendCon 2013] MySQL HA, Recovery and Load Balancing

[ZendCon 2013] MySQL HA, Recovery and Load Balancing

Using MySQL Server and Percona XtraDB Cluster we will take a look at several possible replication configurations for MySQL high-availability, disaster recovery and load-balancing. After comparing them for administration ease, scalability of reads and writes, robustness of the system for errors and time for data recovery as well as looking at how they impact your code-layer we will discuss the ins-and-outs of each setup, looking at the pros and cons, and what trade-offs you will need to make depending on your needs.

Davey Shafik

October 09, 2013
Tweet

More Decks by Davey Shafik

Other Decks in Programming

Transcript

  1. •Community Engineer at Engine Yard •Author of Zend PHP 5

    Certification Study Guide, Sitepoints PHP Anthology: 101 Essential Tips, Tricks & Hacks & PHP Master: Write Cutting Edge Code •A contributor to Zend Framework 1 & 2, phpdoc, and PHP internals •@dshafik Davey Shafik Thursday, October 10, 13
  2. High Availability (HA) • Not: servers that don’t die •

    Ensures service availability, not server availability 4 Thursday, October 10, 13
  3. Disaster Recovery (DR) • Process for recovering from different failure

    scenarios • Automated or not (or in-between) • Planning • Something to recover from • Something to recover to 6 Thursday, October 10, 13
  4. Load Balancing • Balancing the load between multiple machines •

    Smart (based on system resources) • Dumb (round-robin) • ELB is easy, but slow • HAProxy is super faster (and still pretty easy) 8 Thursday, October 10, 13
  5. Why? • Curiosity — MySQL admin is fun • Goal:

    Make system predictable — then it can be automated (some) • Impacts of design decisions • Percona Cluster: oooh shiny! 10 Thursday, October 10, 13
  6. What • HA + Performance • Minimal code impact (drop-in

    system) • Simple (less moving parts) 12 Thursday, October 10, 13
  7. Semi-Sync Replication • Improved Data Integrity • Master blocks until

    1 slave has written to relay log, flushed to disk & acknowledged • Slower... a little. • Failover to asynchronous replication 14 Thursday, October 10, 13
  8. Disaster Recovery • Primary slave is pre-determined failover master •

    Will have the most up-to-date content (not necessarily ALL the content) • Slaves must re-connect to new master • CHANGE MASTER to correct (new) position • Switch app to new master 19 Thursday, October 10, 13
  9. Performance • Benchmarker: 1 XL AWS Instance • Master: 1

    AWS Medium Instance • Slaves: 3 AWS Large Instances (one semi-sync) • HAProxy: 1 AWS Large Instance 21 Thursday, October 10, 13
  10. Benchmarks • Sysbench OLTP Benchmark • 1.5 Million Rows •

    64 Connection • InnoDB Storage Engine • 50,000 Queries • Optionally: Read-only 22 Thursday, October 10, 13
  11. Performance • Writing 1.5 Million Records, with indexes: • 1m

    13s seconds (replicated) • Sysbench OLTP Benchmark: • Read/Write (with semi-sync): 2850/qps • Read/Write (without semi-sync): 2900/qps • Read only: 10850/qps 23 Thursday, October 10, 13
  12. Global Transaction IDs (GTIDs) • Unique identifier across all server

    (binlogs) • combines server UUID:Transaction ID • Transaction based • No need for log files or positions with CHANGE MASTER • Consistent group-wide • Easier failover 25 Thursday, October 10, 13
  13. GTID Master & Slave Setup gtid-mode = ON log-bin enforce-gtid-consistency

    = ON log-slave-updates = ON Thursday, October 10, 13
  14. GTID Slave Setup • STOP SLAVE; • CHANGE MASTER TO

    MASTER_HOST='<host> ', MASTER_USER='<user>', MASTER_PASSWORD='<password>', MASTER_AUTO_POSITION=1; • START SLAVE; 28 Thursday, October 10, 13
  15. Disaster Recovery: mysqlfailover • Performs replication health monitoring • handles

    automatic failover • automatic !== immediate (interval in seconds >= 5s) • Hook scripts for failed check, before/after failover, after complete recovery, 30 Thursday, October 10, 13
  16. Hooks • Failed check: • Alert somebody • Set app

    to maintenance • Before failover: • Alert somebody • Log Something • Take backups 31 Thursday, October 10, 13
  17. Hooks • After Failover: • Alert Somebody • After complete

    recovery: • Modify Application Config • Switchover elastic IPs (amazon) • Verify Application • Alert Somebody! 32 Thursday, October 10, 13
  18. Performance • Benchmarker: 1 XL AWS Instance • Master: 1

    AWS Medium Instance • Slaves: 3 AWS Large Instances • HAProxy: 1 AWS Large Instance 34 Thursday, October 10, 13
  19. Performance • Writing 1.5 Million Records, with indexes: • 60

    seconds (replicated) • Sysbench OLTP Benchmark: • Read/Write: 2950/qps • Read only: 9500/qps 35 Thursday, October 10, 13
  20. Multi-Master Replication • Everything replicates to everything • Uses XtraDB

    table engine (innodb compatible) — not NDB • Automatically transfer state to newly introduced nodes • Still has a “primary” server • Based on Galera for multi-master writes 37 Thursday, October 10, 13
  21. Primary Server Setup [mysqld] wsrep_cluster_address=gcomm:// datadir=/var/lib/mysql user=mysql binlog_format=ROW wsrep_provider=/usr/lib/libgalera_smm.so wsrep_slave_threads=2

    wsrep_cluster_name=benchmarking wsrep_sst_method=xtrabackup wsrep_node_name=<name> innodb_locks_unsafe_for_binlog=1 innodb_autoinc_lock_mode=2 Thursday, October 10, 13
  22. Other Server Setup [mysqld] wsrep_cluster_address=gcomm://<ip> datadir=/var/lib/mysql user=mysql binlog_format=ROW wsrep_provider=/usr/lib/libgalera_smm.so wsrep_slave_threads=2

    wsrep_cluster_name=benchmarking wsrep_sst_method=xtrabackup wsrep_node_name=<name> innodb_locks_unsafe_for_binlog=1 innodb_autoinc_lock_mode=2 Thursday, October 10, 13
  23. Disaster Recovery • Choose a new Primary server • Update

    config • Restart all servers 42 Thursday, October 10, 13
  24. Performance • Unable to test! • More than 4 concurrent

    “users” resulted in auto increment primary key collisions • Unable to complete the benchmark at all with >= 16 threads • ALERT: failed to execute mysql_stmt_execute(): Err1062 Duplicate entry '500519' for key 'PRIMARY' FATAL: database error, exiting... 44 Thursday, October 10, 13
  25. Other Issues • Replication was very brittle, restarting the daemon

    would sometimes break it • Restarting the server almost always break it • Adding new nodes very often failed • Completely unable to get the system up and running with the latest version for this talk. 46 Thursday, October 10, 13
  26. Read/Write Splitting { "myapp": { "master": { "master_0": { "host":

    "localhost", "socket": "\/tmp\/mysql.sock" } }, "slave": { "slave_0": { "host": "192.168.2.27", "port": "3306" } } } } Thursday, October 10, 13
  27. Read/Write Splitting SELECT * FROM foo; -- Use Slave /*

    MYSQLND_MS_SLAVE_SWITCH */SELECT * FROM foo; -- Use Slave SELECT * FROM foo; INSERT INTO bar VALUES ('baz'); -- Use Slave! /* MYSQLND_MS_MASTER_SWITCH */SELECT * FROM foo; -- Use Master /* MYSQLND_MS_LAST_USED_SWITCH */SELECT * FROM foo; -- Use whatever -- was last used Thursday, October 10, 13
  28. Other Resources • Semi-Sync Replication: • http://ey.io/18PllQP • Replication with

    GTIDs: • http://ey.io/15nTo20 • MySQL Utilities (include mysqlfailover): • http://ey.io/GGP2Kh • Percona XtraDB Cluster: • http://ey.io/1gqyyUn 51 http://ey.io/zc13-mysql Thursday, October 10, 13