PostgreSQL and Pacemaker from the Ground Up

PostgreSQL and Pacemaker from the Ground Up, by Brian Cosgrove (Braintree)

What is HA? Why do we want it?

High Availability
High availability refers to a system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to "100% operational" or "never failing." A widely held but difficult-to-achieve standard of availability for a system or product is known as "five 9s" (99.999 percent) availability [1].
[1] http://searchdatacenter.techtarget.com/definition/high-availability

$50,000,000,000 / year, or roughly $95,000 / minute
http://venturebeat.com/2015/09/17/paypals-braintree-is-now-likely-bigger-than-square-and-stripe-combined/
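The per-minute figure follows from dividing the annual figure by the number of minutes in a year. The sketch below reproduces that arithmetic and also works out how much downtime "five 9s" leaves per year. It assumes a 365-day year and an amount spread evenly over time, and the function names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def per_minute(annual_amount: float) -> float:
    """Spread an annual amount evenly over every minute of the year."""
    return annual_amount / MINUTES_PER_YEAR


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes per year a system may be down at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # The slide's annual figure, expressed per minute.
    print(f"${per_minute(50_000_000_000):,.0f} per minute")
    # Downtime budget at 99.999 percent availability.
    print(f"{allowed_downtime_minutes(0.99999):.2f} minutes of downtime per year")
```

At five 9s the yearly downtime budget is a little over five minutes, which is why detection and failover both have to be fast.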

Designing HA: Detecting failure
Reliable and quick detection of failure allows us to minimize user-facing impact.
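One common way to balance "reliable" against "quick" is to declare failure only after several consecutive failed health checks. The sketch below illustrates that idea in isolation. It is not how Pacemaker or Corosync detect failure, and the `FailureDetector` class, its `threshold` parameter, and the `check` callable are all hypothetical.

```python
from typing import Callable


class FailureDetector:
    """Declare a node failed after `threshold` consecutive failed checks."""

    def __init__(self, check: Callable[[], bool], threshold: int = 3) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.check = check
        self.threshold = threshold
        self.consecutive_failures = 0

    def probe(self) -> bool:
        """Run one check; return True once the node is considered failed."""
        try:
            healthy = self.check()
        except Exception:
            healthy = False  # a check that raises counts as a failed check
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A lower threshold detects failure sooner but reacts to transient blips; a higher one is slower but less likely to trigger a needless failover.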

Designing HA: Eliminate Single Points of Failure
Principle: Add redundancy to the system so that failure of a component does not mean failure of the entire system. Use PostgreSQL with synchronous replication to provide a hot standby that can be promoted if the primary fails.

Designing HA: Reliable failover
Pacemaker can automate the promotion of standbys.

“The automated failover of our main production database could be described as the root cause of both of these downtime events… we've made changes to our Pacemaker configuration to ensure failover of the 'active' database role will only occur when initiated by a member of our operations team.”
- A post-mortem

Mitigating some of the risks involved in automated failover
1. Have one candidate for failover per database cluster
2. Failover is a one-way operation: no flapping
3. Let humans take over if Pacemaker is confused
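The three rules above can be read as a small state machine. The sketch below is an illustration only, not Braintree's Pacemaker configuration. The `FailoverPolicy` class, the `State` names, and the `confident` flag (standing in for "Pacemaker is not confused") are all hypothetical.

```python
from enum import Enum


class State(Enum):
    PRIMARY_ACTIVE = "primary_active"
    FAILED_OVER = "failed_over"
    NEEDS_HUMAN = "needs_human"


class FailoverPolicy:
    def __init__(self, primary: str, candidate: str) -> None:
        self.primary = primary
        self.candidate = candidate  # rule 1: exactly one failover candidate
        self.active = primary
        self.state = State.PRIMARY_ACTIVE

    def on_failure(self, node: str, confident: bool = True) -> State:
        """React to a reported failure of `node`."""
        if self.state is State.NEEDS_HUMAN:
            return self.state  # automation stays out of the way
        if not confident:
            self.state = State.NEEDS_HUMAN  # rule 3: hand over to humans
        elif self.state is State.PRIMARY_ACTIVE and node == self.primary:
            self.active = self.candidate  # the single automatic promotion
            self.state = State.FAILED_OVER
        elif self.state is State.FAILED_OVER and node == self.candidate:
            self.state = State.NEEDS_HUMAN  # rule 2: never move back automatically
        return self.state

    def on_recovery(self, node: str) -> State:
        """A recovered node never reclaims the active role on its own (rule 2)."""
        return self.state
```

With hosts named `db1` and `db2`, a failure of `db1` promotes `db2` once; after that, nothing moves the active role again without an operator.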

The nuts and bolts: Pacemaker and Corosync at Braintree

Our topology
• Each Postgres cluster gets its own Pacemaker cluster
• Protect against split-brains by introducing a third server which only provides a vote in leader elections
• Achieve even more isolation by running each Pacemaker cluster on its own “heartbeat” VLAN
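The reason a vote-only third server helps is plain majority arithmetic: with two voters, a network split leaves each side with one vote and no majority, while with three voters at most one side of any split can hold two. The sketch below shows only that arithmetic. It does not reproduce Corosync's quorum implementation, and `has_quorum` is a hypothetical helper.

```python
def has_quorum(votes_seen: int, total_voters: int) -> bool:
    """A partition may act only if it sees a strict majority of all voters."""
    return votes_seen > total_voters // 2


if __name__ == "__main__":
    # Two database nodes only: a split leaves each side with 1 of 2 votes.
    print("2 voters, 1 seen:", has_quorum(1, 2))
    # Two database nodes plus a vote-only third server.
    print("3 voters, 2 seen:", has_quorum(2, 3))
    print("3 voters, 1 seen:", has_quorum(1, 3))
```

Because two disjoint partitions cannot both hold a strict majority, at most one side is ever allowed to run the primary.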

Putting it together: Resources
Resources are controlled by init-like OCF scripts
Resources for our installation fall into the following categories:
• VIPs (virtual IP addresses) - “IPaddr2” resource
• pgsql resources - these are the Postgres clusters themselves
• STONITH - we use a custom resource that operates via SNMP on our APC PDUs
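"Init-like" means each resource agent is invoked with an action such as start, stop, or monitor and reports the result through its exit code. The sketch below mimics that contract with a toy in-memory resource. Real OCF agents are typically shell scripts that manage an actual service; `ToyResource` and `dispatch` are hypothetical, and only the three exit-code values are taken from the OCF convention.

```python
# Exit codes from the OCF resource agent convention.
OCF_SUCCESS = 0
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7


class ToyResource:
    """Stand-in for a managed service; state lives in memory only."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> int:
        self.running = True
        return OCF_SUCCESS

    def stop(self) -> int:
        self.running = False
        return OCF_SUCCESS  # stopping an already stopped resource still succeeds

    def monitor(self) -> int:
        return OCF_SUCCESS if self.running else OCF_NOT_RUNNING


def dispatch(resource: ToyResource, action: str) -> int:
    """Map an action name to its handler, as an init-like script would."""
    handlers = {
        "start": resource.start,
        "stop": resource.stop,
        "monitor": resource.monitor,
    }
    handler = handlers.get(action)
    return handler() if handler else OCF_ERR_UNIMPLEMENTED
```

The cluster manager relies on `monitor` to learn a resource's state, so a cleanly stopped resource reports "not running" rather than a generic error.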

http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_configuring_stonith.html

Thanks!
Brian Cosgrove, Software Engineer
twitter.com/cosgroveb
brian.cosgrove@getbraintree.com
www.braintreepayments.com
