Slide 1

Slide 1 text

How people build software
MySQL Infrastructure Testing Automation @ GitHub
Tom Krouper, Shlomi Noach
GitHub
Percona Live Europe 2017

Slide 2

Slide 2 text

Agenda
• About
• MySQL @ GitHub
• Automation
• Backup/restores
• Failovers
• Schema migrations

Slide 3

Slide 3 text

About me
• Database Infrastructure Engineer
• aka Sr. DIE
• Working on MySQL since 2003 (MySQL 4.0 release era)
• Worked at Twitter, Booking, and Box previous to GitHub
• Enjoy medium walks on the beach, as long as there aren't too many shells. (This is not to degrade shell scripts, I like those.)
github.com/tomkrouper @CaptainEyesight

Slide 4

Slide 4 text

About me
• Infrastructure engineer at GitHub
• Member of the database-infrastructure team
• MySQL community member
• Author of orchestrator, gh-ost, common_schema, freno, ccql and other open source tools.
• Blog at openark.org
github.com/shlomi-noach @ShlomiNoach

Slide 5

Slide 5 text

GitHub
• The world’s largest Octocat T-shirt and stickers store
• And water bottles
• And hoodies
• We also do stuff related to things
• Word is new swag is coming up

Slide 6

Slide 6 text

GitHub
• 66M repositories
• 24M developers
• 117K businesses
• More than a million teams
• World’s largest open source hosting
• Alexa top 100
• Critical path in build flows

Slide 7

Slide 7 text

MySQL at GitHub
• GitHub stores repositories in git, and uses MySQL as the backend database for all related metadata:
  • Repository metadata, users, issues, pull requests, comments etc.
  • Website/API/Auth/more all use MySQL.
• We run a few (growing number of) clusters, totaling over 100 MySQL servers.
• The setup isn’t very large but very busy.

Slide 8

Slide 8 text

MySQL at GitHub
• Our MySQL servers must be available, responsive and in good state
• GitHub has 99.95% SLA
• Availability issues must be handled quickly, as automatically as possible.

Slide 9

Slide 9 text

github/database-infrastructure
• @ggunson, @jessbreckenridge, @jonahberquist, @shlomi-noach, @tomkrouper
• (We’re growing!)
• We’re concerned with:
  • Data availability
  • Data integrity

Slide 10

Slide 10 text

Testing

Slide 11

Slide 11 text

Backups/restores
that ^

Slide 12

Slide 12 text

Your data
It’s important

Slide 13

Slide 13 text

Restores
• Dedicated restore servers.
• One per cluster.
• Continuously restores, catches up with replication, restores, catches up with replication, restores, …
• Sending a “success” event at the end of each cycle.
• We monitor for number of “success” events in past 24-ish hours, per cluster.
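The per-cluster monitoring described above could look roughly like the sketch below. It is only an illustration: the event shape (a cluster name plus a timestamp), the function name `clusters_missing_restores`, and the threshold of one success per window are assumptions, not details from the slides.

```python
from datetime import datetime, timedelta, timezone


def clusters_missing_restores(events, clusters, now,
                              window=timedelta(hours=24), min_successes=1):
    """Return the clusters with too few restore "success" events in the window.

    events: iterable of (cluster_name, timestamp) pairs, one per successful
            restore cycle (hypothetical shape, assumed for this sketch).
    clusters: every cluster that is expected to report restores.
    """
    cutoff = now - window
    counts = {cluster: 0 for cluster in clusters}
    for cluster, timestamp in events:
        # Only count events for known clusters that fall inside the window.
        if cluster in counts and cutoff <= timestamp <= now:
            counts[cluster] += 1
    return sorted(c for c, n in counts.items() if n < min_successes)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [("cluster_a", now - timedelta(hours=3)),
              ("cluster_b", now - timedelta(hours=30))]
    print(clusters_missing_restores(sample, ["cluster_a", "cluster_b"], now))
```

Counting successes rather than watching for failures means a restore server that silently stops running is also caught, since it stops producing events.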

Slide 14

Slide 14 text

[Diagram: auto-restore replicas. Labels: master, production replicas, backup replica, auto-restore replica]

Slide 15

Slide 15 text

Restores
• New host provisioning uses same flow as restore.
• A human may kick a restore/reclone manually.
• Chatops:
  .mysql backup-restore -H restore.this.host -r

Slide 16

Slide 16 text

Restore failure
• A specific backup/restore may fail because computers.
• No reason for panic.
  • Previous backup/restores proven to be working
  • At most we lose time
• Two sequential failures, or failures across clusters are incidents to be investigated
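The escalation rule above can be sketched as a small predicate. This is one hedged reading of the slide: "failures across clusters" is interpreted here as the most recent restore having failed on two or more clusters, and the function name `is_incident` and its input shape are made up for illustration.

```python
def is_incident(results_by_cluster):
    """Decide whether restore failures warrant investigation.

    results_by_cluster: dict mapping cluster name to a list of restore
    outcomes, oldest first, where True means success (assumed shape).
    """
    # Clusters whose most recent restore failed.
    failing_now = [cluster for cluster, results in results_by_cluster.items()
                   if results and not results[-1]]
    # Assumed reading of "failures across clusters": two or more at once.
    if len(failing_now) >= 2:
        return True
    # Two sequential failures on the same cluster.
    for results in results_by_cluster.values():
        if len(results) >= 2 and not results[-1] and not results[-2]:
            return True
    return False


if __name__ == "__main__":
    # A single isolated failure is not an incident under this rule.
    print(is_incident({"cluster_a": [True, False], "cluster_b": [True, True]}))
```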

Slide 17

Slide 17 text

Restore: delayed replica
• One delayed replica per cluster
• Lagging at 4 hours
• Chatops: .mysql panic
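The value of a fixed replication delay is that a destructive change stays recoverable until the delayed replica applies it. A minimal sketch of that window follows, assuming the replica applies each change exactly four hours after it was committed on the master; the helper name `still_recoverable` is hypothetical, and what `.mysql panic` actually does is not described on the slide.

```python
from datetime import datetime, timedelta, timezone

# Delay taken from the slide; everything else in this sketch is assumed.
REPLICA_DELAY = timedelta(hours=4)


def still_recoverable(bad_change_at, now, delay=REPLICA_DELAY):
    """True if the delayed replica has not yet applied the bad change."""
    return now - bad_change_at < delay


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(still_recoverable(now - timedelta(hours=1), now))
    print(still_recoverable(now - timedelta(hours=5), now))
```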

Slide 18

Slide 18 text

Failovers
^ that, too

Slide 19

Slide 19 text

MySQL setup @ GitHub
• Plain-old single writer master-replicas asynchronous replication.
• Not yet semi-sync
• Cross DC, multiple data centers
• 5.7, RBR
• Servers with special roles: production replica, backup, auto-restore, migration-test, analytics, …
• 2-3 tiers of replication
• Occasional cluster split (functional sharding)
• Very dynamic, always changing

Slide 20

Slide 20 text

Points of failure
• Master failure, sev1
• Intermediate masters failure

Slide 21

Slide 21 text

orchestrator
• Topology discovery
• Refactoring
• Failovers for masters and intermediate masters
• Open source, Apache 2 license
• github.com/github/orchestrator

Slide 22

Slide 22 text

orchestrator failovers @ GitHub
• Automated master & intermediate master failovers for all clusters.
• On failover, runs GitHub-specific hooks
  • Grabbing VIP/DNS
  • Updating server role
  • Kicking services (e.g. pt-heartbeat)
  • Notifying chat
  • Running puppet

Slide 23

Slide 23 text

Testing cluster
• Dedicated testing cluster in production
• Does not take production traffic
• “load-test” traffic
• Resembles a production topology:
  • OS, MySQL Versions
  • Data centers
  • Server roles
  • DNS
  • Proxy
• Used for many of our deployment tests

Slide 24

Slide 24 text

Failover testing
• Multiple times per day:
  • Set up the cluster in desired topology layout
  • Inject failure (kill/block/reject)
  • Wait, expect recovery
  • Check topology:
    • Expect new master, correct DNS changes, replica capacity, …
  • Restore old master from backup
    • (an implicit backup/restore test)
  • “success/failure” event
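The test cycle above can be sketched as a small driver. This is a hedged outline, not the actual GitHub tooling: every callable passed in (`setup_topology`, `inject_failure`, `current_master`, `checks`, `restore_host`, `emit_event`) is a hypothetical stand-in for real infrastructure, and the 30 second default timeout is an assumed value.

```python
import time


def run_failover_test(setup_topology, inject_failure, current_master, checks,
                      restore_host, emit_event, timeout_s=30.0, poll_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Run one failover test cycle and emit a "success" or "failure" event.

    All callables are hypothetical hooks supplied by the caller:
      setup_topology() -> name of the master in the freshly arranged topology
      inject_failure(host) -> kills/blocks/rejects traffic to that host
      current_master() -> name of the master as currently seen, or None
      checks -> list of zero-argument callables returning True when healthy
      restore_host(host) -> rebuilds the failed master from backup
      emit_event(name) -> records the outcome
    """
    old_master = setup_topology()
    inject_failure(old_master)

    # Wait for a new master to appear, up to the timeout.
    deadline = clock() + timeout_s
    recovered = False
    while clock() < deadline:
        master = current_master()
        if master is not None and master != old_master:
            recovered = True
            break
        sleep(poll_s)

    # Topology checks (DNS, replica capacity, ...) only matter after recovery.
    ok = recovered and all(check() for check in checks)

    # Restoring the old master doubles as a backup/restore test.
    restore_host(old_master)
    emit_event("success" if ok else "failure")
    return ok
```

Passing the clock and sleep functions as parameters keeps the driver testable without waiting in real time.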

Slide 25

Slide 25 text

Failover in production
• We expect < 30s failover
• Intermediate master failover has low impact on subset of users, depending on cluster/DC/server
• Master failover implies outage
• Planned master switchover takes a few seconds

Slide 26

Slide 26 text

A moment of reflection

Slide 27

Slide 27 text

What builds trust in failovers?
A testing environment?

Slide 28

Slide 28 text

Chaos testing in production
• First steps into regular testing
• Manual
• Supported by our peers
• Learning, understanding impact

Slide 29

Slide 29 text

Tests that go wrong
• Many things can go wrong
• Corrupt replication
• Invalidated servers
• Unassigned DNS
• Cleanups

Slide 30

Slide 30 text

Schema migrations

Slide 31

Slide 31 text

Is your data correct?
The data you see is merely a ghost of your original data

Slide 32

Slide 32 text

gh-ost
• Young. 1yr old.
• In production at GitHub since born.
• Software
• Bugs
• Development
• Bugs

Slide 33

Slide 33 text

gh-ost testing
• gh-ost works perfectly well on our data
• Tested, re-tested, and tested again
• Full coverage of production tables

Slide 34

Slide 34 text

gh-ost testing servers
• Dedicated servers that run continuous tests

Slide 35

Slide 35 text

[Diagram: gh-ost testing replicas. Two clusters, each labeled: master, production replicas, testing replica]

Slide 36

Slide 36 text

gh-ost testing
• Trivial ENGINE=INNODB migration
• Stop replication
• Cut-over, cut-back
• Checksum both tables, compare
• Checksum failure: stop the world, alert
• Success/failure: event
• Drop ghost table
• Catch up
• Next table
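The "checksum both tables, compare" step can be illustrated with the sketch below. It is not gh-ost's own implementation: it assumes the rows of the original and ghost tables have already been fetched in the same deterministic order (for example, ordered by primary key), and the function names `table_checksum` and `tables_match` are invented for this example.

```python
import hashlib


def table_checksum(rows):
    """Hash an ordered sequence of rows into a single hex digest.

    rows: iterable of tuples, fetched in a deterministic order
          (assumed: ordered by primary key by the caller).
    """
    digest = hashlib.sha256()
    for row in rows:
        # repr() gives a stable text form for simple column values.
        digest.update(repr(tuple(row)).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def tables_match(original_rows, ghost_rows):
    """True when both tables hold the same rows in the same order."""
    return table_checksum(original_rows) == table_checksum(ghost_rows)


if __name__ == "__main__":
    original = [(1, "alice"), (2, "bob")]
    ghost = [(1, "alice"), (2, "bob")]
    if not tables_match(original, ghost):
        raise SystemExit("checksum mismatch: stop the world, alert")
    print("checksums match")
```

Because replication is stopped before the comparison, both tables are frozen at the same point, which is what makes a plain row-by-row checksum meaningful.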

Slide 37

Slide 37 text

gh-ost development cycle
• Work on branch
• .deploy gh-ost/mybranch to prod/mysql_role=ghost_testing
• Let continuous tests run
• Depending on nature of change, observe hours/days/more.
• Merge
• Tests run regardless of deployed branch

Slide 38

Slide 38 text

Conclusion
• Backup & restore
• Failovers
• Schema migrations

Slide 39

Slide 39 text

Thank you!
Questions?
github.com/tomkrouper @CaptainEyesight
github.com/shlomi-noach @ShlomiNoach