Slide 1

Slide 1 text

MySQL High Availability at GitHub
Shlomi Noach, GitHub, 2018

Slide 2

Slide 2 text

Agenda
MySQL @ GitHub
The HA story
orchestrator
Old vs. new design
Testing
Thoughts

Slide 3

Slide 3 text

About me
@github/database-infrastructure
Author of orchestrator, gh-ost, freno, ccql and others.
Blog at http://openark.org
github.com/shlomi-noach
@ShlomiNoach

Slide 4

Slide 4 text

GitHub
Built for Developers
Largest open source hosting
67M repositories, 24M users
Critical path in build flows
Best octocat T-Shirts and stickers

Slide 5

Slide 5 text

MySQL at GitHub
Stores all the metadata: users, repositories, commits, comments, issues, pull requests, …
Serves web, API and auth traffic
MySQL 5.7, semi-sync replication, RBR, cross DC
~15 TB of MySQL tables
~150 production servers, ~15 clusters
Availability is critical

Slide 6

Slide 6 text

MySQL High Availability
We wish to have: automation, reliable detection, DC tolerant failovers, DC isolation tolerance, reasonable failover time, reliable failover, lossless where possible.

Slide 7

Slide 7 text

MySQL High Availability
Write HA/read HA

Slide 8

Slide 8 text

MySQL High Availability
Detection
Recovery
Master discovery

Slide 9

Slide 9 text

orchestrator

Slide 10

Slide 10 text

orchestrator, meta
Adopted, maintained & supported by GitHub: github.com/github/orchestrator
Previously at Outbrain and Booking.com
Orchestrator is free and open source, released under the Apache 2.0 license: github.com/github/orchestrator/releases
Gaining wider adoption

Slide 11

Slide 11 text

orchestrator
Discovery: probe, read instances, build topology graph, attributes, queries
Refactoring: relocate replicas, manipulate, detach, reorganize
Recovery: analyze, detect crash scenarios, structure warnings, failovers, promotions, acknowledgements, flap control, downtime, hooks

Slide 12

Slide 12 text

orchestrator @ GitHub
orchestrator/raft deployed on 3 DCs
Automated failover for masters and intermediate masters
Chatops integration

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

How is orchestrator different?
Holistic approach to failure detection
State based, elaborate recovery decision making

Slide 15

Slide 15 text

Detection: naive approach
Probe the master
Probe fails? Try again n times, at interval i
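
The naive loop can be sketched as follows. This is an illustration only, not code from any particular tool: `probe` is a hypothetical callable that returns True when the master answers, and `retries` and `interval` stand for the slide's n and i. Detection time grows with roughly n × i, and the verdict rests on a single observer.

```python
import time
from typing import Callable


def naive_master_is_down(
    probe: Callable[[], bool],
    retries: int,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Declare the master down only after `retries` consecutive failed probes."""
    for attempt in range(retries):
        if probe():
            return False  # any successful probe clears the alarm
        if attempt < retries - 1:
            sleep(interval)  # wait before the next attempt
    return True
```

The `sleep` parameter is injectable only so the sketch can be exercised without real waiting.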

Slide 16

Slide 16 text

Detection: holistic approach
orchestrator:
Probe the master and its replicas
Expect agreement
Master unreachable and the replicas agree? The cluster is de facto down.
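
A minimal sketch of the agreement check, under stated assumptions: `ReplicaView` and its fields are hypothetical names for illustration, and the rule is simplified to "the monitor cannot reach the master and every reachable replica reports broken replication". It is not orchestrator's actual analysis code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaView:
    """What the monitor learned from one replica (hypothetical structure)."""
    host: str
    reachable: bool           # could the monitor reach this replica?
    replication_broken: bool  # does the replica report it cannot reach its master?


def master_is_dead(master_reachable: bool, replicas: List[ReplicaView]) -> bool:
    """Treat the master as dead only when the replicas corroborate the failure."""
    if master_reachable:
        return False
    witnesses = [r for r in replicas if r.reachable]
    if not witnesses:
        # Assumption of this sketch: with no witnesses the problem may be on
        # the monitor's side, so do not call it a master failure.
        return False
    return all(r.replication_broken for r in witnesses)
```

No retry loop is needed here: a single round of probes is enough, because the replicas act as independent observers of the master.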

Slide 17

Slide 17 text

Promotion: naive assumptions
Configuration indicates which servers are whitelisted or blacklisted.
Production operations must be reflected in configuration changes.
Promote the most up-to-date replica.
[Diagram labels: 5.6 SBR; 5.7 RBR; 5.7 RBR, must_not; DC1]

Slide 18

Slide 18 text

Promotion constraints: real life
orchestrator:
Recognizes environments are dynamic
Understands replication rules
Resolves version, DC, config, promotion rules
Acts based on state
[Diagram labels: 5.6 SBR; 5.7 RBR; 5.7 RBR, must_not; DC1]
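
The constraints in the diagram can be sketched as a small rule check. This is an illustration, not orchestrator's algorithm: the `Replica` type, the ranking table and `pick_candidate` are hypothetical, and the only rules modelled are that a replica should not run an older version than its master, should not use a "smaller" binlog format than its master, and that `must_not` servers are never promoted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed ordering for this sketch: STATEMENT < MIXED < ROW
BINLOG_FORMAT_RANK = {"STATEMENT": 0, "MIXED": 1, "ROW": 2}


@dataclass
class Replica:
    host: str
    version: Tuple[int, int]         # (major, minor), e.g. (5, 7)
    binlog_format: str               # "STATEMENT", "MIXED" or "ROW"
    promotion_rule: str = "neutral"  # e.g. "prefer", "neutral", "must_not"


def can_replicate_from(replica: Replica, master: Replica) -> bool:
    """True if `replica` could be placed under `master` without breaking a rule."""
    if replica.version < master.version:
        return False  # e.g. a 5.6 replica under a 5.7 master
    return (BINLOG_FORMAT_RANK[replica.binlog_format]
            >= BINLOG_FORMAT_RANK[master.binlog_format])


def pick_candidate(replicas: List[Replica]) -> Optional[Replica]:
    """Pick the allowed replica under which the most other replicas can live."""
    allowed = [r for r in replicas if r.promotion_rule != "must_not"]

    def followers(candidate: Replica) -> int:
        return sum(can_replicate_from(r, candidate)
                   for r in replicas if r is not candidate)

    return max(allowed, key=followers, default=None)
```

With the three servers from the diagram, the 5.6 SBR replica is the only one every other server can replicate from, so this sketch selects it even if a 5.7 server is further ahead.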

Slide 19

Slide 19 text

Promotion constraints: real life
orchestrator can promote one, non-ideal replica, have the rest of the replicas converge, and then refactor again, promoting an ideal server.
[Diagram labels: most up-to-date, DC2; less up-to-date, DC1; No binary logs, DC1; DC1]
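
The two-step promotion can be sketched as a planning function. The names here are hypothetical: `applied_position` is a single number standing in for how far a replica has applied changes, and `preferred_dc` stands for whatever makes a server "ideal" (in this sketch, simply its data center).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    host: str
    dc: str
    applied_position: int     # hypothetical scalar: higher means more up to date
    has_binlogs: bool = True  # a server without binary logs cannot feed replicas


def plan_promotion(replicas: List[Candidate], preferred_dc: str) -> List[str]:
    """Return the hosts to promote, in order: at most two steps."""
    promotable = [r for r in replicas if r.has_binlogs]
    if not promotable:
        return []
    # Step 1: promote the most up-to-date server so the others can converge on it.
    first = max(promotable, key=lambda r: r.applied_position)
    plan = [first.host]
    # Step 2: if that server is not ideal, hand over to an ideal one afterwards.
    if first.dc != preferred_dc:
        ideal = [r for r in promotable if r.dc == preferred_dc]
        if ideal:
            plan.append(max(ideal, key=lambda r: r.applied_position).host)
    return plan
```

For the diagram's servers with DC1 preferred, the plan is to promote the DC2 server first and then the DC1 server that has binary logs.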

Slide 20

Slide 20 text

Earlier master discovery @ GitHub
VIP + DNS based

Slide 21

Slide 21 text

Master discovery via VIP+DNS
[Diagram: app, DNS, orchestrator]

Slide 22

Slide 22 text

Earlier master discovery @ GitHub
Cooperative, long, not a good cross-DC story

Slide 23

Slide 23 text

A better story
GLB/HAProxy
anycast
Consul
orchestrator
semi-synchronous replication

Slide 24

Slide 24 text

orchestrator/Consul/GLB(HAProxy) @ GitHub
[Diagram: app, glb/proxy, Consul * n, orchestrator/raft]

Slide 25

Slide 25 text

A better story
More components, but fewer moving parts.
Better ownership
Decoupling

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

GLB
Highly available, scalable proxy array
Lossless reloads, implicit SSL, Consul integration
GLB director, load balancer array via HAProxy
https://githubengineering.com/introducing-glb/
https://githubengineering.com/glb-part-2-haproxy-zero-downtime-zero-delay-reloads-with-multibinder/


Slide 28

Slide 28 text

Consul
By HashiCorp: https://consul.io/
Mozilla Public License 2.0: https://github.com/hashicorp/consul

Slide 29

Slide 29 text

Consul
Service Discovery
Health checks, DNS, KV storage
Highly available

Slide 30

Slide 30 text

consul-template
Simple template engine
Listens to Consul updates

Slide 31

Slide 31 text

orchestrator/raft
A highly available orchestrator setup
Self healing
Cross DC
Mitigates DC partitioning

Slide 32

Slide 32 text

orchestrator/raft
2 nodes per DC + mediator
more to come
Added raft features:
• Step down
• Yield
• SQLite log store

Slide 33

Slide 33 text

orchestrator/Consul/GLB(HAProxy) @ GitHub
[Diagram: app, glb/proxy, Consul * n, orchestrator/raft]

Slide 34

Slide 34 text

orchestrator/Consul/GLB(HAProxy) @ GitHub
orchestrator owns recovery, updates Consul
consul-template runs on GLB servers, reconfigures & reloads GLB
GLB reroutes connections
Hard-kill old connections
Apps connect via anycast, route through local GLB
Independent Consul deployments per DC are managed by orchestrator/raft
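
The Consul-to-proxy step can be sketched as a pure function. Everything here is an assumption made for illustration: the KV store is simulated with a dict, the key layout `mysql/master/<cluster>/hostname` and `.../port` is assumed, and the backend name is invented. In the setup described on the slide this rendering is done by consul-template, not by Python code.

```python
from typing import Dict, Tuple


def render_backend(kv: Dict[str, str], cluster: str) -> str:
    """Render an HAProxy backend stanza that points at the current master."""
    prefix = f"mysql/master/{cluster}"  # assumed key layout
    host = kv[f"{prefix}/hostname"]
    port = kv[f"{prefix}/port"]
    return (
        f"backend mysql_{cluster}_rw\n"  # hypothetical backend name
        f"    mode tcp\n"
        f"    server master {host}:{port} check\n"
    )


def on_kv_change(kv: Dict[str, str], cluster: str,
                 current_config: str) -> Tuple[str, bool]:
    """Return the new config and whether the proxy needs a reload."""
    new_config = render_backend(kv, cluster)
    return new_config, new_config != current_config
```

The reload flag matters because a reload is what reroutes new connections to the promoted master; when the rendered config is unchanged, nothing needs to happen.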

Slide 35

Slide 35 text

semi-synchronous replication
Lossless, best effort
500ms timeout
Effectively picks our ideal candidates
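
"Lossless, best effort" can be illustrated with a toy model of the wait. In MySQL 5.7 the timeout is set through `rpl_semi_sync_master_timeout` (in milliseconds); the function below is not MySQL code, and `replica_ack` is a hypothetical stand-in for a replica's acknowledgement.

```python
import threading

SEMI_SYNC_TIMEOUT_SECONDS = 0.5  # the slide's 500ms


def commit(replica_ack: threading.Event,
           timeout: float = SEMI_SYNC_TIMEOUT_SECONDS) -> str:
    """Model how a commit completes from the client's point of view."""
    if replica_ack.wait(timeout):
        return "acknowledged"    # at least one replica received the change
    return "fell back to async"  # timeout reached: the commit proceeds anyway
```

A short timeout bounds how long writes can stall when no replica answers, at the cost of the lossless guarantee during that window.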

Slide 36

Slide 36 text

Results
Reliable detection
Recovery in:
10s - 13s (total outage time), normal case
15s - 20s, difficult case
25s, rare

Slide 37

Slide 37 text

Cons
App identity unknown
Distributed system, calls for a variety of scenarios
STONITH, work in progress

Slide 38

Slide 38 text

Testing
Testing cluster in production environment
Continuously kill/block/reject

Slide 39

Slide 39 text

Thoughts
STONITH
Retries
orchestrator + Consul + proxy as appliance
Kubernetes

Slide 40

Slide 40 text

Questions?
github.com/shlomi-noach
@ShlomiNoach
Thank you!