[ZendCon 2013] MySQL HA, Recovery and Load Balancing

MySQL HA, DR and Load Balancing Thursday, October 10, 13

•Community Engineer at Engine Yard •Author of Zend PHP 5
Certification Study Guide, Sitepoints PHP Anthology: 101 Essential Tips, Tricks & Hacks & PHP Master: Write Cutting Edge Code •A contributor to Zend Framework 1 & 2, phpdoc, and PHP internals •@dshafik Davey Shafik Thursday, October 10, 13

High Availability Thursday, October 10, 13

High Availability (HA) • Not: servers that don’t die •
Ensures service availability, not server availability 4 Thursday, October 10, 13

Disaster Recovery Thursday, October 10, 13

Disaster Recovery (DR) • Process for recovering from different failure
scenarios • Automated or not (or in-between) • Planning • Something to recover from • Something to recover to 6 Thursday, October 10, 13

Load Balancing Thursday, October 10, 13

Load Balancing • Balancing the load between multiple machines •
Smart (based on system resources) • Dumb (round-robin) • ELB is easy, but slow • HAProxy is super faster (and still pretty easy) 8 Thursday, October 10, 13

Why? Thursday, October 10, 13

Why? • Curiosity — MySQL admin is fun • Goal:
Make system predictable — then it can be automated (some) • Impacts of design decisions • Percona Cluster: oooh shiny! 10 Thursday, October 10, 13

What Thursday, October 10, 13

What • HA + Performance • Minimal code impact (drop-in
system) • Simple (less moving parts) 12 Thursday, October 10, 13

MySQL 5.5 Thursday, October 10, 13

Semi-Sync Replication • Improved Data Integrity • Master blocks until
1 slave has written to relay log, ﬂushed to disk & acknowledged • Slower... a little. • Failover to asynchronous replication 14 Thursday, October 10, 13

Thursday, October 10, 13

Semi-Sync Master Setup INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so'; rpl_semi_sync_master_enabled =
{0|1}; rpl_semi_sync_master_timeout = N; Thursday, October 10, 13

Semi-Sync Slave Setup INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so'; rpl_semi_sync_slave_enabled =
{0|1}; Thursday, October 10, 13

Disaster Recovery • Primary slave is pre-determined failover master •
Will have the most up-to-date content (not necessarily ALL the content) • Slaves must re-connect to new master • CHANGE MASTER to correct (new) position • Switch app to new master 19 Thursday, October 10, 13

Performance Thursday, October 10, 13

Performance • Benchmarker: 1 XL AWS Instance • Master: 1
AWS Medium Instance • Slaves: 3 AWS Large Instances (one semi-sync) • HAProxy: 1 AWS Large Instance 21 Thursday, October 10, 13

Benchmarks • Sysbench OLTP Benchmark • 1.5 Million Rows •
64 Connection • InnoDB Storage Engine • 50,000 Queries • Optionally: Read-only 22 Thursday, October 10, 13

Performance • Writing 1.5 Million Records, with indexes: • 1m
13s seconds (replicated) • Sysbench OLTP Benchmark: • Read/Write (with semi-sync): 2850/qps • Read/Write (without semi-sync): 2900/qps • Read only: 10850/qps 23 Thursday, October 10, 13

MySQL 5.6 Thursday, October 10, 13

Global Transaction IDs (GTIDs) • Unique identiﬁer across all server
(binlogs) • combines server UUID:Transaction ID • Transaction based • No need for log ﬁles or positions with CHANGE MASTER • Consistent group-wide • Easier failover 25 Thursday, October 10, 13

GTID Master & Slave Setup gtid-mode = ON log-bin enforce-gtid-consistency
= ON log-slave-updates = ON Thursday, October 10, 13

GTID Slave Setup • STOP SLAVE; • CHANGE MASTER TO
MASTER_HOST='<host> ', MASTER_USER='<user>', MASTER_PASSWORD='<password>', MASTER_AUTO_POSITION=1; • START SLAVE; 28 Thursday, October 10, 13

Disaster Recovery: mysqlfailover • Performs replication health monitoring • handles
automatic failover • automatic !== immediate (interval in seconds >= 5s) • Hook scripts for failed check, before/after failover, after complete recovery, 30 Thursday, October 10, 13

Hooks • Failed check: • Alert somebody • Set app
to maintenance • Before failover: • Alert somebody • Log Something • Take backups 31 Thursday, October 10, 13

Hooks • After Failover: • Alert Somebody • After complete
recovery: • Modify Application Conﬁg • Switchover elastic IPs (amazon) • Verify Application • Alert Somebody! 32 Thursday, October 10, 13

Performance • Benchmarker: 1 XL AWS Instance • Master: 1
AWS Medium Instance • Slaves: 3 AWS Large Instances • HAProxy: 1 AWS Large Instance 34 Thursday, October 10, 13

Performance • Writing 1.5 Million Records, with indexes: • 60
seconds (replicated) • Sysbench OLTP Benchmark: • Read/Write: 2950/qps • Read only: 9500/qps 35 Thursday, October 10, 13

Percona XtraDB Cluster Thursday, October 10, 13

Multi-Master Replication • Everything replicates to everything • Uses XtraDB
table engine (innodb compatible) — not NDB • Automatically transfer state to newly introduced nodes • Still has a “primary” server • Based on Galera for multi-master writes 37 Thursday, October 10, 13

Primary Server Setup [mysqld] wsrep_cluster_address=gcomm:// datadir=/var/lib/mysql user=mysql binlog_format=ROW wsrep_provider=/usr/lib/libgalera_smm.so wsrep_slave_threads=2
wsrep_cluster_name=benchmarking wsrep_sst_method=xtrabackup wsrep_node_name=<name> innodb_locks_unsafe_for_binlog=1 innodb_autoinc_lock_mode=2 Thursday, October 10, 13

Other Server Setup [mysqld] wsrep_cluster_address=gcomm://<ip> datadir=/var/lib/mysql user=mysql binlog_format=ROW wsrep_provider=/usr/lib/libgalera_smm.so wsrep_slave_threads=2
wsrep_cluster_name=benchmarking wsrep_sst_method=xtrabackup wsrep_node_name=<name> innodb_locks_unsafe_for_binlog=1 innodb_autoinc_lock_mode=2 Thursday, October 10, 13

Disaster Recovery • Choose a new Primary server • Update
conﬁg • Restart all servers 42 Thursday, October 10, 13

Performance • Unable to test! • More than 4 concurrent
“users” resulted in auto increment primary key collisions • Unable to complete the benchmark at all with >= 16 threads • ALERT: failed to execute mysql_stmt_execute(): Err1062 Duplicate entry '500519' for key 'PRIMARY' FATAL: database error, exiting... 44 Thursday, October 10, 13

Other Issues Thursday, October 10, 13

Other Issues • Replication was very brittle, restarting the daemon
would sometimes break it • Restarting the server almost always break it • Adding new nodes very often failed • Completely unable to get the system up and running with the latest version for this talk. 46 Thursday, October 10, 13

Read/Write Splitting mysqlnd_ms Thursday, October 10, 13

Read/Write Splitting $ pecl install mysqlnd_ms mysqlnd_ms.enable=1 mysqlnd_ms.config_file= /path/to/mysqlnd_ms.json Thursday,
October 10, 13

Read/Write Splitting { "myapp": { "master": { "master_0": { "host":
"localhost", "socket": "\/tmp\/mysql.sock" } }, "slave": { "slave_0": { "host": "192.168.2.27", "port": "3306" } } } } Thursday, October 10, 13

Read/Write Splitting SELECT * FROM foo; -- Use Slave /*
MYSQLND_MS_SLAVE_SWITCH */SELECT * FROM foo; -- Use Slave SELECT * FROM foo; INSERT INTO bar VALUES ('baz'); -- Use Slave! /* MYSQLND_MS_MASTER_SWITCH */SELECT * FROM foo; -- Use Master /* MYSQLND_MS_LAST_USED_SWITCH */SELECT * FROM foo; -- Use whatever -- was last used Thursday, October 10, 13

Other Resources • Semi-Sync Replication: • http://ey.io/18PllQP • Replication with
GTIDs: • http://ey.io/15nTo20 • MySQL Utilities (include mysqlfailover): • http://ey.io/GGP2Kh • Percona XtraDB Cluster: • http://ey.io/1gqyyUn 51 http://ey.io/zc13-mysql Thursday, October 10, 13

Feedback & Questions: Feedback: https://joind.in/9093 Twitter: @dshaﬁk Email: [email protected] http://ey.io/zc13-mysql

[ZendCon 2013] MySQL HA, Recovery and Load Bala...

[ZendCon 2013] MySQL HA, Recovery and Load Balancing

More Decks by Davey Shafik

Other Decks in Programming

Featured

Transcript