Making PagerDuty More Reliable Using XtraDB Cluster

Percona Live Session: https://www.percona.com/live/mysql-conference-2015/sessions/making-pagerduty-more-reliable-using-xtradb-cluster

Databases are tough to deploy in a highly available (HA) configuration. Most companies have multiple database servers, but when disaster strikes, they find their systems don't handle the failure well. Failing over to a cold slave can still leave production unavailable. The failover process can be error-prone, leading to split-brain setups. There has to be a better way.

In this talk, we will discuss how PagerDuty switched our production MySQL system (running in EC2) from a DRBD-based solution to Percona XtraDB Cluster, which is based on Codership's synchronous replication library, Galera. With that change, we now have a MySQL setup where setting up a new MySQL server is completely automated, including grabbing a copy of the production dataset. We can withstand losing one of our servers with no downtime. Most importantly, each MySQL server is effectively equivalent, making our lives that much easier.

This talk will cover why we chose XtraDB Cluster, and what benchmarks we ran to test its suitability for handling our production traffic (including a couple of important configuration changes). We will explore Galera's synchronous replication scheme, and the application tradeoffs you need to understand before switching to this system. We will discuss the rollout process, how we monitor the production cluster, and how we test its ability to handle failure in production.

Doug Barth

April 15, 2015

Transcript

  1. 5.

    4/15/15 MAKING PAGERDUTY MORE RELIABLE USING PXC

    PagerDuty stack

    Then: Monolithic Rails App; Cloud hosted; MySQL Community (later Percona Server)
    Now: Rails & Scala; Cloud hosted; Percona XtraDB Cluster, Cassandra, Zookeeper
  2. 6.

    MySQL at PagerDuty

    Data size: ~600 GB
    Queries / s: 6,000 - 7,500
    Txns / s: 200 - 300
  3. 8.

    Ye Olden Setup

    [Diagram: Primary and Secondary, each on EBS, linked by DRBD; availability zones us-west-2a and us-west-2b]
  4. 11.

    1. FLUSH TABLES WITH READ LOCK
    2. Stop MySQL on primary
    3. Unmount DRBD volume
    4. Set primary to secondary role
    5. Flip EIP over to secondary
    6. Confirm secondary is now primary
    7. Mount DRBD volume on new primary
    8. Start MySQL on new primary
    9. Wait for clients to reconnect
    10. Wait for buffer pool to warm up
  5. 14.

    Impact of binlogs on DRBD volume (sysbench)

    [Chart: txn/s (0 to 600) vs. # of concurrent clients (1 to 128). Series: DRBD (data), DRBD (data+binlogs), DRBD (data+binlogs sync)]
  6. 22.

    [Diagram: pxc01, pxc02, pxc03; app01, app02, app03, each with HAProxy; availability zones us-west-2a, us-west-2b, us-west-2c]

    http://www.mysqlperformanceblog.com/2012/06/20/percona-xtradb-cluster-reference-architecture-with-haproxy/
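The three-node layout on this slide ties in with the abstract's claim that the cluster can lose one server without downtime: Galera keeps a partition writable only while it holds a majority of the cluster. A minimal sketch of that arithmetic, assuming every node carries the default weight of 1. The function name `has_quorum` is illustrative and is not part of any Galera API.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if the reachable nodes form a strict majority.

    Assumption: all nodes have equal weight (no weighted quorum).
    """
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return 2 * reachable_nodes > cluster_size


if __name__ == "__main__":
    # Walk through losing 0, 1, and 2 nodes of a three-node cluster.
    for lost in range(3):
        alive = 3 - lost
        print(f"3-node cluster, {lost} lost: quorum={has_quorum(alive, 3)}")
```

With three nodes, losing one leaves two of three, which is still a majority. A two-node cluster that loses one node abruptly does not keep quorum, which is one reason three is the usual minimum.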
  7. 23.

    SST configuration

    wsrep_node_name = pxc01
    wsrep_sst_donor = pxc03,pxc02
    wsrep_sst_method = xtrabackup-v2
  8. 25.

    Buffer pool restoration enabled

    innodb_blocking_buffer_pool_restore = ON
    innodb_buffer_pool_restore_at_startup = 300
  9. 26.

    [Diagram: pxc01, pxc02, pxc03; app01, app02, app03, each with HAProxy; availability zones us-west-2a, us-west-2b, us-west-1c]
  10. 28.

    Step 1: Schema compatibility

    InnoDB only
    Primary keys on every table
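One way to audit Step 1 is to list every table that is not InnoDB or that lacks a primary key. The sketch below assumes the table metadata has already been fetched, for example with the `information_schema` query included for reference. The `TableInfo` type and the `find_incompatible` helper are hypothetical names for illustration, not something shown in the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Reference query for collecting the metadata; it is not executed here.
AUDIT_SQL = """
SELECT t.table_schema, t.table_name, t.engine,
       c.constraint_name IS NOT NULL AS has_primary_key
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema NOT IN
      ('mysql', 'information_schema', 'performance_schema')
"""


@dataclass(frozen=True)
class TableInfo:
    schema: str
    name: str
    engine: str
    has_primary_key: bool


def find_incompatible(tables: List[TableInfo]) -> List[Tuple[str, str]]:
    """Return (qualified table name, reason) for each schema problem found."""
    problems = []
    for t in tables:
        qualified = f"{t.schema}.{t.name}"
        if t.engine.lower() != "innodb":
            problems.append((qualified, f"engine is {t.engine}, not InnoDB"))
        if not t.has_primary_key:
            problems.append((qualified, "no primary key"))
    return problems
```

A table can appear twice in the result if it fails both checks, which keeps each reason separately actionable.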
  11. 29.

    Step 2: Gain Experience

    [Diagram: Primary, pxc01, pxc02, pxc03]
  12. 31.

    Step 3: Monitoring

    wsrep_flow_control_paused
    wsrep_flow_control_sent
    wsrep_flow_control_received
  13. 32.

    Step 3: Monitoring

    wsrep_flow_control_paused
    wsrep_flow_control_sent
    wsrep_flow_control_received

    Stateful counters in 5.5
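The slide does not spell out what "stateful counters" means. One reading, and it is an assumption here, is that these counters can reset between polls, so a monitoring agent cannot simply subtract consecutive samples. A minimal sketch of reset-tolerant delta tracking follows; the class name is hypothetical and no database connection is involved.

```python
from typing import Optional


class ResetTolerantCounter:
    """Track increases of a counter that may drop back toward zero."""

    def __init__(self) -> None:
        self._previous: Optional[int] = None
        self.total = 0  # Sum of all increases observed so far.

    def observe(self, sample: int) -> int:
        """Record a sample and return the increase since the previous one."""
        if self._previous is None or sample < self._previous:
            # First sample, or the value went down: assume the counter
            # restarted from zero and count the whole sample as new.
            delta = sample
        else:
            delta = sample - self._previous
        self._previous = sample
        self.total += delta
        return delta
```

This only detects a reset when the value drops. If a counter resets on every read, each raw sample is already the delta and should be summed directly.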
  14. 33.

    Step 4: Query cache not supported

    Measure usage then disable query cache
  15. 34.

    Step 5: Row-based replication

    [Diagram: Primary, Replica, PXC; links labeled SBR and RBR]
  16. 35.

    Step 6: Locking

    Locking per node
    Switch to Zookeeper
  17. 36.

    Step 7: Eliminate large transactions

    Break up transactions into several smaller ones
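Breaking one large transaction into several smaller ones usually means processing rows in fixed-size batches and committing after each batch. The sketch below illustrates the pattern with the standard-library `sqlite3` module as a stand-in for MySQL so that it runs anywhere. The `events` table, the batch size, and the helper names are assumptions for illustration, not details from the talk.

```python
import sqlite3
from typing import Iterator, List, Sequence, Tuple


def chunks(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of ids, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def delete_in_batches(conn: sqlite3.Connection, ids: Sequence[int],
                      batch_size: int) -> int:
    """Delete rows by id, one transaction per batch. Returns commit count."""
    commits = 0
    for batch in chunks(ids, batch_size):
        placeholders = ",".join("?" for _ in batch)
        conn.execute(f"DELETE FROM events WHERE id IN ({placeholders})", batch)
        conn.commit()  # Each batch becomes its own small transaction.
        commits += 1
    return commits


def demo() -> Tuple[int, int]:
    """Delete 1,000 rows in batches of 100; return (commits, rows left)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO events (id) VALUES (?)",
                     [(i,) for i in range(1, 1001)])
    conn.commit()
    commits = delete_in_batches(conn, list(range(1, 1001)), 100)
    remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.close()
    return commits, remaining
```

The trade-off is that the operation is no longer atomic as a whole: a failure partway through leaves earlier batches committed, so the job needs to be safe to resume.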
  18. 38.

    Step 8: Benchmarks

    [Chart: sysbench (writes only). txns/s (0 to 3000) vs. # of concurrent clients (1 to 512). Series: Percona Server, PXC]
  19. 39.

    Step 8: Benchmarks

    innodb_flush_log_at_trx_commit = 0
    innodb_log_file_size = 1G
  20. 40.

    Step 8: Benchmarks

    [Chart: sysbench (writes only, tuned IO settings). txns/s (0 to 3000) vs. # of concurrent clients (2 to 128). Series: Percona Server (small table), Percona Server (large table), PXC (small), PXC (large)]
  21. 41.

    Step 8: Benchmarks

    [Chart: sysbench (75% reads, 25% writes). txns/s (0 to 600) vs. # of concurrent clients (2 to 128). Series: Percona Server (small table), Percona Server (large table), PXC (small), PXC (large)]
  22. 43.

    Rollout

    [Diagram: four DRBD pairs, PXC, DR, Backup, Delayed]
  24. 48.

    Benefit: Rolling changes

    [Diagram: pxc01, pxc02, pxc03; app01, app02, app03, each with HAProxy; availability zones us-west-2a, us-west-2b, us-west-2c]
  27. 52.

    Benefit: Moving replicas

    http://www.percona.com/blog/2013/06/21/changing-an-async-slave-of-a-pxc-cluster-to-a-new-master/

    pxc01: Xid = 2341, mysql-bin.001234 83452
    pxc02: Xid = 2341, mysql-bin.003004 98234
    backup01: Xid = 2341, mysql-relay.002311 5002
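The figures on this slide suggest how a replica can be repointed. The same transaction appears with the same Xid (2341) on both cluster nodes, at different binlog files and offsets, so the Xid last applied by the replica can be looked up in the new master's binlog to find where to resume. The sketch below works under that assumption. The helper names are hypothetical, the index would in practice be built by parsing `mysqlbinlog` output, and whether the recorded offset is the exact position to resume from should be checked against the linked post.

```python
from typing import Dict, Tuple

BinlogCoords = Tuple[str, int]  # (binlog file name, position)


def coords_for_xid(xid_index: Dict[int, BinlogCoords], xid: int) -> BinlogCoords:
    """Look up the binlog coordinates recorded for a transaction's Xid."""
    try:
        return xid_index[xid]
    except KeyError:
        raise LookupError(
            f"Xid {xid} not found in the new master's binlog index"
        ) from None


def change_master_sql(host: str, coords: BinlogCoords) -> str:
    """Build the CHANGE MASTER TO statement for the given coordinates."""
    log_file, log_pos = coords
    return (
        f"CHANGE MASTER TO MASTER_HOST='{host}', "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={log_pos};"
    )


if __name__ == "__main__":
    # Values copied from the slide: pxc02 holds Xid 2341 at this file/offset.
    pxc02_index = {2341: ("mysql-bin.003004", 98234)}
    print(change_master_sql("pxc02", coords_for_xid(pxc02_index, 2341)))
```

The replica would be stopped before running the generated statement and restarted afterwards. Credentials and other connection options are left out of the sketch.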