MySQL Infrastructure Testing Automation at GitHub

Shlomi Noach
October 05, 2017

The database team at GitHub is tasked with keeping the data available and with maintaining its integrity. Our infrastructure automates away much of our operation, but automation requires trust, and trust is gained by testing. This session highlights three examples of infrastructure testing automation that help us sleep better at night:

- Backups: scheduling backups; making backup data accessible to our engineers; auto-restores and backup validation. What metrics and alerts we have in place.
- Failovers: how we continuously test our failover mechanism, orchestrator. How we set up a failover scenario, what defines a successful failover, how we automate away the cleanup. What we do in production.
- Schema migrations: how we ensure that gh-ost, our schema migration tool, which keeps rewriting our (and your!) data, does the right thing. How we test new branches in production without putting production data at risk.


Transcript

  1. How people build software
    MySQL Infrastructure Testing Automation @ GitHub
    Tom Krouper, Shlomi Noach, GitHub
    Percona Live Europe 2017
  2. Agenda
    • About
    • MySQL @ GitHub
    • Automation
    • Backup/restores
    • Failovers
    • Schema migrations
  3. About me
    • Database Infrastructure Engineer
    • aka Sr. DIE
    • Working on MySQL since 2003 (MySQL 4.0 release era)
    • Worked at Twitter, Booking, and Box previous to GitHub
    • Enjoy medium walks on the beach, as long as there aren't too many shells. (This is not to degrade shell scripts, I like those.)
    github.com/tomkrouper @CaptainEyesight
  4. About me
    • Infrastructure engineer at GitHub
    • Member of the database-infrastructure team
    • MySQL community member
    • Author of orchestrator, gh-ost, common_schema, freno, ccql and other open source tools.
    • Blog at openark.org
    github.com/shlomi-noach @ShlomiNoach
  5. GitHub
    • The world’s largest Octocat T-shirt and stickers store
    • And water bottles
    • And hoodies
    • We also do stuff related to things
    • Word is new swag is coming up
  6. GitHub
    • 66M repositories
    • 24M developers
    • 117K businesses
    • More than a million teams
    • World’s largest open source hosting
    • Alexa top 100
    • Critical path in build flows
  7. MySQL at GitHub
    • GitHub stores repositories in git, and uses MySQL as the backend database for all related metadata:
      • Repository metadata, users, issues, pull requests, comments etc.
      • Website/API/Auth/more all use MySQL.
    • We run a few (growing number of) clusters, totaling over 100 MySQL servers.
    • The setup isn’t very large but very busy.
  8. MySQL at GitHub
    • Our MySQL servers must be available, responsive and in good state
    • GitHub has 99.95% SLA
    • Availability issues must be handled quickly, as automatically as possible.
  9. github/database-infrastructure
    • @ggunson, @jessbreckenridge, @jonahberquist, @shlomi-noach, @tomkrouper
    • (We’re growing!)
    • We’re concerned with:
      • Data availability
      • Data integrity
  10. Restores
    • Dedicated restore servers.
    • One per cluster.
    • Continuously restores, catches up with replication, restores, catches up with replication, restores, …
    • Sending a “success” event at the end of each cycle.
    • We monitor for number of “success” events in past 24-ish hours, per cluster.
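The per-cluster monitoring described on this slide could be sketched roughly as below. This is only an illustration, not GitHub's actual monitoring code: the function name, the `(cluster, timestamp)` event shape, the cluster names, and the threshold of one success per 24-hour window are all assumptions.

```python
from datetime import datetime, timedelta


def clusters_missing_restores(events, clusters, now,
                              window=timedelta(hours=24), min_successes=1):
    """Return the clusters with too few restore "success" events in the window.

    events: iterable of (cluster_name, timestamp) pairs, one per success event
            (a hypothetical event shape, chosen for this sketch).
    """
    cutoff = now - window
    counts = {cluster: 0 for cluster in clusters}
    for cluster, timestamp in events:
        # Only count events for known clusters that fall inside the window.
        if cluster in counts and cutoff <= timestamp <= now:
            counts[cluster] += 1
    return sorted(c for c, n in counts.items() if n < min_successes)


now = datetime(2017, 10, 5, 12, 0)
events = [
    ("cluster_a", now - timedelta(hours=3)),   # recent success
    ("cluster_b", now - timedelta(hours=30)),  # too old to count
]
stale = clusters_missing_restores(events, ["cluster_a", "cluster_b"], now)
```

Counting successes rather than alerting on individual failures means a restore server that silently stops cycling is noticed as well.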
  11. [Diagram: auto-restore replicas. Labels: master, production replicas, backup replica, auto-restore replica]
  12. Restores
    • New host provisioning uses same flow as restore.
    • A human may kick a restore/reclone manually.
    • Chatops: .mysql backup-restore -H restore.this.host -r
  13. Restore failure
    • A specific backup/restore may fail because computers.
    • No reason for panic.
    • Previous backup/restores proven to be working
    • At most we lose time
    • Two sequential failures, or failures across clusters are incidents to be investigated
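One way to read the escalation rule on this slide is sketched below. The slide does not define "failures across clusters" precisely; this sketch assumes it means the most recent restore failed on two or more clusters. The function name and the history format are hypothetical.

```python
def is_incident(history):
    """Decide whether restore outcomes warrant an investigation.

    history: dict mapping cluster name -> list of outcomes, oldest first,
             where True means the restore cycle succeeded (assumed format).
    """
    # Rule 1: two sequential failures on the same cluster.
    for outcomes in history.values():
        if len(outcomes) >= 2 and not outcomes[-1] and not outcomes[-2]:
            return True
    # Rule 2 (assumed interpretation): the latest restore failed
    # on more than one cluster at the same time.
    failing = [c for c, outcomes in history.items()
               if outcomes and not outcomes[-1]]
    return len(failing) >= 2


single_failure = is_incident({"a": [True, False], "b": [True, True]})
repeated_failure = is_incident({"a": [True, False, False], "b": [True]})
cross_cluster = is_incident({"a": [True, False], "b": [False]})
```

A single failure on one cluster stays below the threshold, matching the "no reason for panic" point above.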
  14. Restore: delayed replica
    • One delayed replica per cluster
    • Lagging at 4 hours
    • Chatops: .mysql panic
  15. MySQL setup @ GitHub
    • Plain-old single writer master-replicas asynchronous replication.
    • Not yet semi-sync
    • Cross DC, multiple data centers
    • 5.7, RBR
    • Servers with special roles: production replica, backup, auto-restore, migration-test, analytics, …
    • 2-3 tiers of replication
    • Occasional cluster split (functional sharding)
    • Very dynamic, always changing
  16. Points of failure
    • Master failure, sev1
    • Intermediate masters failure
  17. orchestrator
    • Topology discovery
    • Refactoring
    • Failovers for masters and intermediate masters
    • Open source, Apache 2 license
    • github.com/github/orchestrator
  18. orchestrator failovers @ GitHub
    • Automated master & intermediate master failovers for all clusters.
    • On failover, runs GitHub-specific hooks
      • Grabbing VIP/DNS
      • Updating server role
      • Kicking services (e.g. pt-heartbeat)
      • Notifying chat
      • Running puppet
  19. Testing cluster
    • Dedicated testing cluster in production
    • Does not take production traffic
    • “load-test” traffic
    • Resembles a production topology:
      • OS, MySQL Versions
      • Data centers
      • Server roles
      • DNS
      • Proxy
    • Used for many of our deployment tests
  20. Failover testing
    • Multiple times per day:
      • Set up the cluster in desired topology layout
      • Inject failure (kill/block/reject)
      • Wait, expect recovery
      • Check topology:
        • Expect new master, correct DNS changes, replica capacity, …
      • Restore old master from backup
        • (an implicit backup/restore test)
      • “success/failure” event
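The test cycle on this slide can be sketched as a small harness. This is a hedged outline only: the four callables (`setup`, `inject_failure`, `get_master`, `restore_old_master`) are hypothetical stand-ins for real infrastructure actions, the 30-second default is an assumed wait, and the only topology check shown is "a new master appeared" (DNS and replica capacity checks are omitted).

```python
import time


def run_failover_test(setup, inject_failure, get_master, restore_old_master,
                      timeout=30.0, poll=0.5,
                      clock=time.monotonic, sleep=time.sleep):
    """Run one failover test cycle and return a success/failure event."""
    old_master = setup()            # bring the cluster to the desired layout
    inject_failure(old_master)      # e.g. kill/block/reject the master
    start = clock()
    deadline = start + timeout
    new_master = None
    while clock() < deadline:       # wait, expecting recovery
        candidate = get_master()
        if candidate is not None and candidate != old_master:
            new_master = candidate
            break
        sleep(poll)
    elapsed = clock() - start
    restore_old_master(old_master)  # cleanup runs whether or not recovery happened
    return {
        "event": "success" if new_master else "failure",
        "new_master": new_master,
        "elapsed": elapsed,
    }


# A fake in-memory topology that promotes db2 on the third poll.
state = {"master": "db1", "polls": 0}


def fake_setup():
    return state["master"]


def fake_inject(host):
    state["master"] = None


def fake_get_master():
    state["polls"] += 1
    if state["polls"] >= 3:
        state["master"] = "db2"
    return state["master"]


restored = []
result = run_failover_test(fake_setup, fake_inject, fake_get_master,
                           restored.append, sleep=lambda seconds: None)
```

Passing the clock and sleep functions in as parameters keeps the harness testable without real waiting.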
  21. Failover in production
    • We expect < 30s failover
    • Intermediate master failover has low impact on subset of users, depending on cluster/DC/server
    • Master failover implies outage
    • Planned master switchover takes a few seconds
  22. Chaos testing in production
    • First steps into regular testing
    • Manual
    • Supported by our peers
    • Learning, understanding impact
  23. Tests that go wrong
    • Many things can go wrong
    • Corrupt replication
    • Invalidated servers
    • Unassigned DNS
    • Cleanups
  24. Is your data correct?
    The data you see is merely a ghost of your original data
  25. gh-ost
    • Young. 1yr old.
    • In production at GitHub since born.
    • Software
    • Bugs
    • Development
    • Bugs
  26. gh-ost testing
    • gh-ost works perfectly well on our data
    • Tested, re-tested, and tested again
    • Full coverage of production tables
  27. [Diagram: gh-ost testing replicas. Labels: master, production replicas, testing replica]
  28. gh-ost testing
    • Trivial ENGINE=INNODB migration
    • Stop replication
    • Cut-over, cut-back
    • Checksum both tables, compare
    • Checksum failure: stop the world, alert
    • Success/failure: event
    • Drop ghost table
    • Catch up
    • Next table
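The "checksum both tables, compare" step can be illustrated with a self-contained sketch. It uses an in-memory SQLite database as a stand-in for the MySQL testing replica and hashes rows in Python; the slides do not say how the real checksum is computed, so the hashing approach, the table names, and the helper functions here are assumptions. Because replication is stopped first, both tables are static while they are compared.

```python
import hashlib
import sqlite3


def table_checksum(conn, table):
    """Hash every row of a table, read in primary-key order."""
    digest = hashlib.sha256()
    # The table name is interpolated directly; in this sketch it only ever
    # comes from trusted code, never from user input.
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY id'):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def tables_match(conn, original, ghost):
    """True when the original and ghost tables hold identical rows."""
    return table_checksum(conn, original) == table_checksum(conn, ghost)


# Stand-in data: an original table and its migrated copy (names illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, login TEXT);
    CREATE TABLE _users_gho (id INTEGER PRIMARY KEY, login TEXT);
    INSERT INTO users VALUES (1, 'octocat'), (2, 'hubot');
    INSERT INTO _users_gho SELECT * FROM users;
""")

identical = tables_match(conn, "users", "_users_gho")
```

A trivial migration that only rebuilds the table should leave every row unchanged, so any mismatch points at the migration tool rather than at the schema change.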
  29. gh-ost development cycle
    • Work on branch
    • .deploy gh-ost/mybranch to prod/mysql_role=ghost_testing
    • Let continuous tests run
    • Depending on nature of change, observe hours/days/more.
    • Merge
    • Tests run regardless of deployed branch
  30. Conclusion
    • Backup & restore
    • Failovers
    • Schema migrations