Managing and Visualizing your Replication Topologies with Orchestrator
Introducing Orchestrator: a MySQL replication topology management service that greatly simplifies DBAs' tasks and enhances visibility into your topologies.
• Open source
• Designed to be as generic as possible
• Company-specific rules or processes externalized via configuration
https://github.com/outbrain/orchestrator
• With many servers per topology, spanning multiple data centers, and with periodic server failures and movements:
  - Do you know what your topologies look like?
  - Does management know?
• With the complexity of moving slaves around the topology; the rules allowing/disallowing server X to replicate from Y; the implications of cross-DC traffic on slave latency:
  - Who in your company can refactor your topologies other than yourself?
• In the event of server failure (master or intermediate master breakage):
  - Do you have clear visibility into what failed?
  - What kind of solutions do you use?
  - Who can execute a failover / override a failover / understand what's going on?
• GTID (Oracle + MariaDB)
• Binlog servers
• Knows the rules for whether X can replicate from Y
• Will refactor your topology for you: safely redesign your topology
• Fine-grained control, or "just do it for me, I'm too tired to think"
• Can refactor via a slick web UI
• Or via a nerdy command line interface
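The "rules for whether X can replicate from Y" boil down to checks on server identity, version ordering and binlog configuration. A minimal Python sketch of the kind of checks involved (the function and field names are illustrative, not orchestrator's actual API):

```python
# Hypothetical sketch of the sort of rules a topology manager applies
# when deciding whether replica x may be placed under master y.
# Field names are invented for illustration; the real checks live in
# orchestrator's instance logic.
def can_replicate_from(x, y):
    """x: candidate replica, y: candidate master. Returns (ok, reason)."""
    if x["server_id"] == y["server_id"]:
        return False, "same server_id"
    if x["major_version"] < y["major_version"]:
        return False, "replica would run an older version than its master"
    if not y["log_bin"] or not y["log_slave_updates"]:
        return False, "intermediate master must log its replica's updates"
    return True, "ok"

x = {"server_id": 2, "major_version": (5, 6), "log_bin": True, "log_slave_updates": True}
y = {"server_id": 1, "major_version": (5, 6), "log_bin": True, "log_slave_updates": False}
print(can_replicate_from(x, y))  # (False, 'intermediate master must log its replica's updates')
```

The same checks back both the web UI (which greys out illegal drag-and-drop moves) and the command line interface.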
• Uses a holistic approach to detect failures
http://code.openark.org/blog/mysql/what-makes-a-mysql-server-failurerecovery-case
• If replication breaks, orchestrator knows what the expected topology looked like
• And can recommend "the next best option", based on state, not on configuration
• And, if you like, can execute an automated/manual failover that heals your topology and leaves no slave behind (other than those utterly incapable of restoring)
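The holistic idea can be sketched as: declare a master dead only when both orchestrator and all of the master's replicas agree it is unreachable, which distinguishes a real crash from a network glitch between orchestrator and the master. A toy illustration (not orchestrator's actual code):

```python
# Toy sketch of holistic failure detection: corroborate our own view
# of the master with the view of every one of its replicas.
def master_is_failed(orchestrator_can_reach_master, replicas):
    """replicas: list of dicts with an 'io_thread_running' flag,
    reflecting each replica's own view of its master connection."""
    if orchestrator_can_reach_master:
        return False  # master is fine from our point of view
    if not replicas:
        return False  # no corroborating witnesses; could be a network split
    # Every replica independently agrees the master is unreachable
    return all(not r["io_thread_running"] for r in replicas)

# Network glitch between orchestrator and master: replicas still connected
print(master_is_failed(False, [{"io_thread_running": True}]))   # False
# Genuine master crash: orchestrator and all replicas lost it
print(master_is_failed(False, [{"io_thread_running": False},
                               {"io_thread_running": False}]))  # True
```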
• When a slave connects to a master, it looks for the last GTID statement it already executed
• Available in Oracle MySQL 5.6, MariaDB 10.0
• Completely different implementations; may cause vendor lock-in
• 5.6 migration path is painful (alleviated in 5.7)
• 5.6 requires binary logs & log-slave-updates enabled on all slaves (alleviated in 5.7)
• 5.6: errant transactions, unexecuted sequences
• GTID will be a requirement for future Oracle features
• MariaDB GTID supports domains; easy to use
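Conceptually, GTID auto-positioning is set arithmetic over transaction identifiers: the replica reports what it has executed, the master streams the difference, and an "errant transaction" is one the replica has that the master does not. A toy model (simplified identifiers, not the real protocol; real GTID sets use compact interval notation):

```python
# Toy model (assumed, simplified): transaction identifiers are
# (server_uuid, sequence_no) pairs rather than real GTID sets.
def missing_transactions(master_executed, replica_executed):
    # What the master still needs to stream to this replica
    return sorted(master_executed - replica_executed)

def errant_transactions(master_executed, replica_executed):
    # Transactions the replica executed that never ran on the master;
    # under 5.6 these complicate failing over onto that replica
    return sorted(replica_executed - master_executed)

master  = {("uuid-A", 1), ("uuid-A", 2), ("uuid-A", 3)}
replica = {("uuid-A", 1), ("uuid-A", 2), ("uuid-B", 1)}
print(missing_transactions(master, replica))  # [('uuid-A', 3)]
print(errant_transactions(master, replica))   # [('uuid-B', 1)]
```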
Orchestrator provides many of the benefits of GTID without GTID. This includes:
• Slave repointing
• Failover schemes
• With fewer requirements
• Bulk operations
• Without upgrading your servers; without installing anything on them; in short: not touching your beloved existing setup
• No vendor lock-in; no migration paths
We inject an identifiable statement every X seconds. We call it Pseudo-GTID.
• Pseudo-GTID statements are searchable and identifiable in binary and relay logs
• They make for "markers" in the binary/relay logs
• Injection can be done via the MySQL event scheduler or externally
• Otherwise non-intrusive: no changes to topology/versions/methodologies
-- (event header reconstructed; the event name here is illustrative)
create event if not exists
  `meta`.`create_pseudo_gtid_event`
  on schedule every 5 second starts current_timestamp
  on completion preserve
  enable
  do
    begin
      -- generate a unique token and embed it in a harmless,
      -- searchable statement that ends up in the binary log
      set @pseudo_gtid_hint := uuid();
      set @_create_statement := concat('drop ', 'view if exists `meta`.`_pseudo_gtid_hint__', @pseudo_gtid_hint, '`');
      PREPARE st FROM @_create_statement;
      EXECUTE st;
      DEALLOCATE PREPARE st;
    end $$
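Once such markers are in the binary logs, finding a server's position reduces to locating the most recent marker in a binlog stream. A hypothetical sketch of that search over mysqlbinlog-style text (not orchestrator's implementation):

```python
# Hypothetical sketch: scan binlog entries (as text) for the last
# Pseudo-GTID marker. The `_pseudo_gtid_hint__<uuid>` view name
# follows the injection event shown above.
import re

MARKER = re.compile(r"_pseudo_gtid_hint__([0-9a-f-]+)")

def last_pseudo_gtid(binlog_lines):
    """Return the UUID of the most recent Pseudo-GTID marker, or None."""
    last = None
    for line in binlog_lines:
        m = MARKER.search(line)
        if m:
            last = m.group(1)
    return last

log = [
    "insert into t values (1)",
    "drop view if exists `meta`.`_pseudo_gtid_hint__aaaa-1111`",
    "update t set c = 2",
    "drop view if exists `meta`.`_pseudo_gtid_hint__bbbb-2222`",
    "delete from t where c = 2",
]
print(last_pseudo_gtid(log))  # bbbb-2222
```

Because the marker is unique, the same entry can be located in another server's binary or relay logs, which is what allows repointing a slave without GTID.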
Binlog Servers
• Serve the master's binary logs, under the same name and position
• Nested binlog servers allow for simplified refactoring and offer a simplified & faster master recovery mechanism
• See Binlog Servers @ Booking.com https://www.percona.com/live/europe-amsterdam-2015/sessions/binlog-servers-bookingcom
• Orchestrator supports:
  • Hybrid standard + binlog-server replication topologies
  • Pure binlog-server topologies
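The "same name and position" property is what simplifies failover: a replica can be repointed from one binlog server to a sibling with its coordinates unchanged, because those coordinates are valid on any binlog server of the same master. A toy model of that assumption:

```python
# Toy model (assumed, simplified): since binlog servers mirror the
# master's binlog files verbatim, repointing a replica between them
# needs no coordinate translation.
def repoint(replica, new_host):
    # Same file/position remain valid on the sibling binlog server
    return {"host": new_host, "file": replica["file"], "pos": replica["pos"]}

r = {"host": "binlog-srv-1", "file": "mysql-bin.0007", "pos": 1234}
print(repoint(r, "binlog-srv-2"))
# {'host': 'binlog-srv-2', 'file': 'mysql-bin.0007', 'pos': 1234}
```

Contrast this with repointing between ordinary intermediate masters, where the equivalent position must first be found in the new master's own binlogs (via GTID or Pseudo-GTID).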
Orchestrator architecture
• Web API
• Polls servers, checks for crashes, recovers, performs periodic operations
• Leader election
• Can run as a command line tool: issue a single command & exit
• Requires a (same, single) MySQL backend for any operation
• The backend database holds the state of the topologies
• Orchestrator itself is mostly stateless (pending operations excluded, optimistic mode)
• Agent-less for most operations; communicates directly with the MySQL instances
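Leader election through the shared backend can be sketched as a lease: each service periodically tries to grab or renew a single leader row, and only the one that succeeds acts. A hypothetical sketch (sqlite3 stands in for the MySQL backend so the example is self-contained):

```python
# Hypothetical sketch of leader election via a shared SQL backend.
# sqlite3 stands in for orchestrator's MySQL backend; the schema and
# lease length are invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table leader (anchor int primary key, node text, expires real)")
db.execute("insert into leader values (1, '', 0)")

LEASE = 10.0  # seconds a leadership lease lasts before others may take over

def try_become_leader(node, now):
    # Atomically grab the lease if we already hold it or it has expired
    cur = db.execute(
        "update leader set node = ?, expires = ? "
        "where anchor = 1 and (node = ? or expires < ?)",
        (node, now + LEASE, node, now))
    db.commit()
    return cur.rowcount == 1  # 1 row updated => we hold the lease

print(try_become_leader("orc-1", now=100.0))  # True: lease was free
print(try_become_leader("orc-2", now=101.0))  # False: orc-1 holds it
print(try_become_leader("orc-2", now=115.0))  # True: orc-1's lease expired
```

Because the state lives in the backend database rather than in the services themselves, any surviving service can win the next election and carry on.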
• All locks auto-expiring
• Supports authentication (basic-auth, reverse proxy)
• Operations-friendly, e.g.:
  • Server maintenance flag
  • Downtiming servers
  • Marking a server as "best candidate"
Orchestrator stack & development
• Go as the language of choice: a lot of concurrency; easy deployment; rapid development
• MySQL as the backend database (duh)
• go-martini web framework
• Page generation via dirty JavaScript/jQuery (sue me)
• Twitter Bootstrap
• Graphs via D3
• Development: GitHub, open source; accepting pull requests
https://github.com/outbrain/orchestrator/
We are a MySQL shop
• We have a lot of production servers on a lot of topologies (aka chains, aka clusters)
• As small as 1 server per topology; as large as hundreds of servers per topology
• Two major data centers, now populating our third
• Single master, plenty of slaves
• All chains are deployed with Pseudo-GTID and controlled by orchestrator
• Larger chains: hybrid, normal + binlog-server topologies (complex!)
• "Pure" binlog-server topologies are experimental, non-production
• Some topologies are sharded
• A little bit of active/passive master-master
• Multiple orchestrator services; one is elected as leader at any given time
• A lot of hosts with orchestrator as CLI
• The single elected service polls all our instances
• Each MySQL instance is polled every 30s
• Pseudo-GTID deployed on all chains
• Orchestrator configured to auto-recover the death of any intermediate master
• Orchestrator configured to auto-recover from some master failures
• Both of the above actually happen
• Some checks & dashboards rely on orchestrator data (API / DB)
• Some operations rely on orchestrator logic
• …avoiding these experiments as well
• Getting more people involved (on-call sysadmins)
• A lot of input is gained from people inexperienced with MySQL, leading to more visibility on orchestrator's side
• And of course periodic real crash scenarios
• …good, operations unsupported
• Master-master-master (#nodes > 2) replication
• Galera
• Unrecognized by orchestrator; it identifies each co-master as its own head of topology
• Multi-master aka multi-source (neither Oracle 5.7 nor MariaDB)
• Tungsten
Image, sources & other credits
• @isamlambert for making a couple of sparkles to ignite this
• The team @ Booking.com for ideas, input, time testing, time using
• Contributors!
Other Booking.com talks
• Combining Redis and MySQL to store HTTP cookie data
https://www.percona.com/live/europe-amsterdam-2015/sessions/combining-redis-and-mysql-store-http-cookie-data
• Encrypted MySQL Backups and instant recoverability on large scale
https://www.percona.com/live/europe-amsterdam-2015/sessions/encrypted-mysql-backups-and-instant-recoverability-large-scale
• Events storage and analysis with Riak at Booking.com
https://www.percona.com/live/europe-amsterdam-2015/sessions/events-storage-and-analysis-riak-bookingcom
• Riding the Binlog: an in Deep Dissection of the Replication Stream
https://www.percona.com/live/europe-amsterdam-2015/sessions/riding-binlog-deep-dissection-replication-stream
• Unicode and MySQL
https://www.percona.com/live/europe-amsterdam-2015/sessions/unicode-and-mysql
• Your Clone Army: Better scalability through more database servers
https://www.percona.com/live/europe-amsterdam-2015/sessions/your-clone-army-better-scalability-through-more-database-servers
• The CIS MySQL Security Benchmark (LT)
https://www.percona.com/live/europe-amsterdam-2015/sessions/cis-mysql-security-benchmark
• The Virtues of Boring Technology (Keynote)
https://www.percona.com/live/europe-amsterdam-2015/sessions/virtues-boring-technology