Reliable Crash Detection and Failover with Orchestrator

How people build software
Reliable Crash Detection and Failover with Orchestrator
Shlomi Noach, PerconaLive 2016

Agenda
• Orchestrator
• Topologies, crash scenarios
• Crash detection methods
• Promotion complexity
• Limbo states, split brain
• Flapping & acknowledgement
• Visibility & control
• Configuration vs. state based analysis & recovery
• State of the orchestra

Orchestrator
• MySQL replication topology manager
• github.com/outbrain/orchestrator
• Free & open source

Simple replication
What could possibly go wrong?

Crash detection

Observe/monitor
How do you observe your database availability?

Monitor master only
Common: ping, check :3306, issue SELECT 1

Monitor master only
And if the response is bad?
- Is this a false positive?
- Try again
- And again?
- How many times until you’re sure? How much time have you lost?
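The cost of "try again, and again" can be put into numbers. The following is a minimal sketch, not taken from the talk: the function name and all three parameters are assumptions chosen for illustration, and it models only a monitor that declares the master dead after a fixed number of consecutive failed checks.

```python
def worst_case_detection_seconds(check_interval, check_timeout, retries):
    """Upper bound on time lost before `retries` consecutive failures are seen.

    Assumes each failed check blocks for the full timeout, and that the
    monitor sleeps `check_interval` seconds between consecutive checks.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    return retries * check_timeout + (retries - 1) * check_interval


# Three failed checks, 3s timeout each, 5s pause between checks.
print(worst_case_detection_seconds(check_interval=5, check_timeout=3, retries=3))
```

Every extra retry lowers the chance of a false positive but adds a full timeout plus a full interval to the outage.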

Orchestrator’s observation
Continuously probes your MySQL servers:
- Figures out who replicates from whom
- Builds the topology tree
- Understands replication rules
- At time of crash, knows what the setup should have been

Observe entire topology
The holistic approach, used by Orchestrator. MySQL monitoring calls for a MySQL-specific solution:
- Monitor master and replicas (issue queries)
- Check replica status
- Make an analysis based on results from all servers involved

Multi layered/multi DC replication
How do you check an intermediate master’s (IM) availability?

Multi layered/multi DC replication
The holistic approach, used by Orchestrator: monitoring the IM and its replicas gives the bigger picture. You may not actually care about the IM’s availability as long as its replicas are happy.

Orchestrator’s analysis: dead intermediate master
The IM is unreachable; its replicas are reachable, and all agree that their master is unreachable.

Orchestrator’s analysis: dead master
The master is unreachable; its replicas are reachable, and all agree that their master is unreachable.

Orchestrator’s analysis: dead master & some replicas
The master is unreachable; some of its replicas are reachable, and all of those agree that their master is unreachable. The other replicas are unreachable.

Orchestrator’s analysis (pending): locked master
The master is reachable, but does not execute writes:
- All replicas agree that the master is reachable
- No replica is making progress
This can be handled as a failed master case.
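The four analyses above can be sketched as one decision function. This is an illustration of the reasoning on the slides, not Orchestrator’s code: the `Replica` and `Server` types, their fields, and the returned labels are all assumptions. Cases the slides do not cover return "inconclusive".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    reachable: bool                # can the monitor query this replica?
    sees_master: bool = False      # does the replica report a live connection to its master?
    making_progress: bool = False  # is the replica applying new events?


@dataclass
class Server:
    reachable: bool
    is_intermediate: bool = False
    replicas: List[Replica] = field(default_factory=list)


def analyze(server: Server) -> str:
    reachable = [r for r in server.replicas if r.reachable]
    all_reachable = len(reachable) == len(server.replicas)

    if not server.reachable:
        # Without replicas to corroborate, or with a replica that still sees
        # the master, the failure may be on the monitor's side.
        if not reachable or any(r.sees_master for r in reachable):
            return "inconclusive"
        if server.is_intermediate:
            return "dead intermediate master" if all_reachable else "inconclusive"
        return "dead master" if all_reachable else "dead master & some replicas"

    # Master answers, yet nothing flows through replication.
    if (reachable and all_reachable
            and all(r.sees_master for r in reachable)
            and not any(r.making_progress for r in reachable)):
        return "locked master"
    return "no problem"


print(analyze(Server(False, replicas=[Replica(True), Replica(False)])))
```

Note that an idle master with no writes would also satisfy the "locked master" branch in this sketch; a real check needs an additional signal that writes are being attempted.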

Recovery & promotion constraints
• You’ve made the decision to promote a new master
• Which one?
• Are all options valid?
• Is the current state what you think the current state is?

Promotion constraints
Diagram labels: most up to date; less up to date; delayed 24 hours
You wish to promote the most up to date replica; otherwise you give up on any replica that is more advanced.

Promotion constraints
Diagram labels: log_slave_updates; log_slave_updates; no binary logs
You must not promote a replica that has no binary logs, or that runs without log_slave_updates.

Promotion constraints
Diagram labels: DC1; DC1; DC2; DC1
You prefer to promote a replica from the same DC as the failed master.

Promotion constraints
Diagram labels: SBR; SBR; SBR; RBR
You must not promote a Row Based Replication server on top of Statement Based Replication servers.

Promotion constraints
Diagram labels: 5.6; 5.6; 5.6; 5.7
Promoting the 5.7 server means losing the 5.6 servers (replication is not forward compatible). So perhaps it is worth losing the 5.7 server?

Promotion constraints
Diagram labels: 5.6; 5.6; 5.7; 5.7
But if most of your servers are 5.7, and a 5.7 server turns out to be the most up to date, it is better to promote the 5.7 and drop the 5.6. Orchestrator handles this logic and prioritizes promotion candidates by overall count and state of replicas.

Promotion constraints, real life!
Diagram labels: most up to date, DC2; less up to date, DC1; no binary logs, DC1; DC1
Orchestrator can promote one non-ideal replica, have the rest of the replicas converge, and then refactor again, promoting an ideal server.
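The constraints above can be combined into a single ranking. The sketch below is one possible reading, not Orchestrator’s actual algorithm: the `Replica` type, the integer `position` standing in for replication coordinates, and the tie-breaking order are all assumptions. A sibling counts as "lost" when it is more advanced than the candidate or cannot replicate from it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Replica:
    name: str
    position: int             # higher means more up to date (hypothetical scalar)
    version: Tuple[int, int]  # e.g. (5, 6)
    binlog_format: str        # "STATEMENT" or "ROW"
    dc: str
    has_binlogs: bool = True
    log_slave_updates: bool = True


def promotable(r: Replica) -> bool:
    # Hard constraint: needs binary logs and log_slave_updates.
    return r.has_binlogs and r.log_slave_updates


def can_replicate_from(replica: Replica, master: Replica) -> bool:
    # An older version cannot replicate from a newer master.
    if replica.version < master.version:
        return False
    # No row based master on top of statement based replicas.
    if master.binlog_format == "ROW" and replica.binlog_format == "STATEMENT":
        return False
    return True


def lost_servers(candidate: Replica, replicas: List[Replica]) -> List[Replica]:
    return [r for r in replicas
            if r is not candidate
            and (r.position > candidate.position or not can_replicate_from(r, candidate))]


def pick_candidate(replicas: List[Replica], failed_dc: str) -> Optional[Replica]:
    valid = [r for r in replicas if promotable(r)]
    if not valid:
        return None
    # Fewest lost servers first, then same DC as the failed master, then most up to date.
    return min(valid, key=lambda r: (len(lost_servers(r, replicas)), r.dc != failed_dc, -r.position))


replicas = [
    Replica("a", 100, (5, 6), "STATEMENT", "DC1"),
    Replica("b", 100, (5, 6), "STATEMENT", "DC1"),
    Replica("c", 90, (5, 6), "STATEMENT", "DC1"),
    Replica("d", 110, (5, 7), "STATEMENT", "DC1"),
]
print(pick_candidate(replicas, "DC1").name)
```

With three 5.6 servers and one 5.7, the sketch gives up the 5.7 server; with two of each and a 5.7 most up to date, the same ranking promotes the 5.7, matching the two version slides.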

Ways to avoid promotion constraints mess
Make sure the first replication tier is consistent; have variety on the 2nd tier.
Diagram labels: 5.6; 5.7; 5.7; 5.6; 5.6; 5.6

Ways to avoid promotion constraints mess
Diagram labels: 5.6; 5.6, semi-sync; 5.7; 5.7
Use semi-sync on designated servers. They will be the most up to date upon failure.

Ways to avoid promotion constraints mess
Diagram labels: 5.6; 5.6; 5.7; 5.7
Solve the problem by aligning relay logs on the replicas upon master failure.
• That’s what MHA does
• Work in progress: Orchestrator to support this! Will require passwordless SSH

Limbos
The master failed; one replica was lost along with it. Recovery went well. What happens when the master comes back alive?

Limbos
What will the promoted master say? What will the lost replica say? What will the lost master say?
Speech bubbles: “OHAI! Give me traffic!”; “VIP is mine!”; “Also, good for traffic!”

Solving limbos
• Orchestrator forcibly breaks replication on the lost replica
• RESET SLAVE ALL or forced detach master on the promoted replica
• read_only=1 on the old master, if possible
• iptables on the old master
Diagram labels: Master_host: //old.master.com; Can’t find coordinates!; Read only!
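The bullets above can be written out as a per-role list of MySQL statements. This is a hedged sketch: the function name and role keys are hypothetical, and the slides do not say which statements Orchestrator issues on the lost replica. The `CHANGE MASTER TO` line is an assumption modeled on the `Master_host: //old.master.com` value shown in the diagram. The iptables step is left out because it is not SQL.

```python
def limbo_fencing_plan(old_master_host):
    """Return the SQL statements to run on each server role after a failover."""
    return {
        # Point the lost replica at an unresolvable host so it cannot
        # silently resume replicating from the old master.
        "lost_replica": [
            "STOP SLAVE",
            f"CHANGE MASTER TO MASTER_HOST='//{old_master_host}'",
        ],
        # The promoted replica forgets its former master entirely.
        "promoted_replica": [
            "STOP SLAVE",
            "RESET SLAVE ALL",
        ],
        # Only possible if the old master is reachable again.
        "old_master": [
            "SET GLOBAL read_only = 1",
        ],
    }


for role, statements in limbo_fencing_plan("old.master.com").items():
    print(role, statements)
```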

DC split brain
Speech bubbles between DC1 and DC2: “You’re dead!”; “I can’t hear you!”; “You’re dead!”; “They’re dead!”; “They’re dead!”

Flapping & rolling failovers
• The master is diagnosed as being dead
• A new master is promoted
• Turns out some app client is killing it
• Rolling failover
• What happens to a dead master that comes back alive?

Orchestrator sets a minimal interval between two automated failovers
• The first one is automated; an immediate one following gets blocked
• A human acknowledging the first failover implicitly resets the block. Good to go for the next automated failover.
• And a human can always command a failover.

Orchestrator marks a failed master as downtimed
• Even if said server is back in the game (human intervention), this particular server will not be failed over for the duration of the downtime.
• A human can terminate the downtime
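The minimal interval, the acknowledgement and the downtime described above can be modeled with a small amount of state. The sketch below is illustrative only: the `FailoverGate` class, its method names and the use of plain numbers for time are assumptions, not Orchestrator’s API. A human-commanded failover simply bypasses the gate.

```python
class FailoverGate:
    """Decides whether an automated failover may proceed."""

    def __init__(self, min_interval_seconds):
        self.min_interval_seconds = min_interval_seconds
        self._last_failover = {}   # cluster -> time of last unacknowledged failover
        self._downtime_until = {}  # host -> time at which its downtime ends

    def may_auto_failover(self, cluster, host, now):
        # A downtimed server is never failed over automatically.
        if now < self._downtime_until.get(host, float("-inf")):
            return False
        last = self._last_failover.get(cluster)
        return last is None or now - last >= self.min_interval_seconds

    def record_failover(self, cluster, failed_host, now, downtime_seconds):
        self._last_failover[cluster] = now
        self._downtime_until[failed_host] = now + downtime_seconds

    def acknowledge(self, cluster):
        # A human ack resets the block for the next automated failover.
        self._last_failover.pop(cluster, None)

    def end_downtime(self, host):
        self._downtime_until.pop(host, None)


gate = FailoverGate(min_interval_seconds=3600)
gate.record_failover("my-cluster", "old.master.com", now=0, downtime_seconds=7200)
print(gate.may_auto_failover("my-cluster", "new.master.com", now=60))
gate.acknowledge("my-cluster")
print(gate.may_auto_failover("my-cluster", "new.master.com", now=61))
```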

Recap: how orchestrator performs master failover
• Detection: everyone agrees the master is dead
• Is this incident muted?
• Has this cluster just recently recovered from another failure without ack?

Recap: how orchestrator performs master failover
• Pick the most up to date replica which will also make for the fewest lost servers (the two are not necessarily the same)

Recap: how orchestrator performs master failover
• Refactor topology
• Oh wait, actually, now that everything’s connected, is there a better server to promote?
• Go for it, refactor again
• Mark old master as downtimed
• Detach promoted master from old master

Recap: how orchestrator performs master failover
• Invoke external hooks
• Orchestrator neither uses nor implies a specific service discovery technique
• Your own app/scripts change VIP/CNAME/Zk entries/Proxy/whatever

Visibility & control
• Flapping and rolling failovers are avoided by having memory of past/recent events
• Orchestrator audits:
  • Detection
  • Recoveries
  • Refactoring operations (alas without context)
  • Owners, reasons, internal operations…
• To audit table; to orchestrator log; to syslog
• Audit log available via API

Visibility & control
• Control via:
  • Web interface
  • Web API
  • Command line interface
  • Hubot
.orc sup
> No incidents which require a failover to report.
.orc recover failed.server.com
.orc ack failed-cluster
.orc relocate this.replica below that.one
.orc graceful-takeover my-cluster

Configuration vs. state based recoveries
In configuration based recoveries:
• You designate specific roles to specific servers, i.e. this server will have to be promoted, or these are the relevant servers and these are not
• You must then match your operations to those dictated rules
• Any change you make (provision, deprovision, relocate, …) must be reflected in configuration
• Implies chef/puppet deploy; reload of services

Configuration vs. state based recoveries
In state based recoveries:
• You trust the tooling to make the best of a situation
• Basically do whatever a human would do
• You still want to have roles for your servers
• chef/puppet may still be involved
• But those can be added/removed dynamically, and the tooling adapts to change of state

Orchestrator’s detection reliability
• There is no n-nines number
• Orchestrator has proven to be very accurate in production environments
• Depending on both orchestrator & MySQL configuration, detection may take ~5-10 seconds

Orchestrator HA
Diagram labels: MySQL proxy layer; HTTP proxy layer; backend DB; leader; Orchestrator services
Orchestrator is highly available
• Supports multiple services competing for leadership
• Requires a highly available backend database. Supports master-master setup, and guarantees it to be collision free

Recent developments
• Binary log indexing: makes for Pseudo-GTID matching within 1s-2s. Reduced recovery time
• Planned master takeover, forced master takeover
• Smarter promotion rules
• Fuzzy names (it’s the simple stuff that makes life happier)
• SSL (Square contributions)
• Better master-master support
• Replication structure analysis
• MIT license! (thanks @Outbrain)

What’s on the roadmap?
Ongoing, intended
• Relay log alignment
• Semi-sync (currently via contributions)
Likely
• Failure detection consensus / leadership handover
Maybe
• orchestrator-agent xtrabackup
Always
• Reliability, performance, simplification

What’s on the roadmap?
GitHub commitment to Orchestrator
• We use it, we will make it better
• Currently merging changes upstream
• GitHub will become upstream
• Better documentation, tutorials, sample public AMI
• World domination
Open and grateful for contributions! Please discuss via Issues beforehand.

Orchestrator/related talks
• Choosing a MySQL HA solution today
  Michael Patrick (Percona), Tuesday 19, 5:15pm
• Orchestrator at Square
  John Cesario, Grier Johnson, Brian Ip (Square), Thursday 21, 3:00pm

GitHub talks
• Tutorial: MySQL GTID Implementation, Maintenance, and Best Practices
  Gillian Gunson (GitHub), Brian Cain (Dropbox), Mark Filipi (SurveyMonkey), Monday 18, 9:30am
• Growing MySQL at GitHub
  Tom Krouper, Jonah Berquist, Wednesday 20, 1:00pm
• Rookie DBA Mistakes: How I Screwed Up So You Don't Have To
  Gillian Gunson, Thursday 21, 12:50pm
• Co-speaking: Dirty Little Secrets
  Jonah Berquist, Shlomi Noach, Thursday 21, 3:00pm

Thank you! Questions?
github.com/shlomi-noach
