Making PagerDuty More Reliable Using XtraDB Cluster

Percona Live Session:

Databases are tough to deploy in a highly available (HA) configuration. Most companies run multiple database servers, but when disaster strikes, they find their systems don't handle the failure well. Failing over to a cold slave can still leave production unavailable. The failover process itself can be error-prone, leading to split-brain scenarios. There has to be a better way.

In this talk, we will discuss how we at PagerDuty switched our production MySQL system (running in EC2) from a DRBD-based solution to Percona XtraDB Cluster, which is built on Galera, Codership's synchronous replication library. With that change, we now have a MySQL setup where provisioning a new MySQL server is completely automated, including grabbing a copy of the production dataset. We can withstand losing one of our servers with no downtime. Most importantly, each MySQL server is effectively equivalent, making our lives that much easier.

This talk will cover why we chose XtraDB Cluster and what benchmarks we ran to test its suitability for handling our production traffic (including a couple of important configuration changes). We will explore Galera's synchronous replication scheme and the application tradeoffs you need to understand before switching to this system. We will discuss the rollout process, how we monitor the production cluster, and how we test its ability to handle failure in production.


Doug Barth

April 15, 2015