
The MySQL Ecosystem at GitHub

Sam Lambert
November 04, 2014

A talk I gave at Percona Live London.

Transcript

  1. THE MYSQL ECOSYSTEM AT GITHUB

  2. SAM LAMBERT LEAD ENGINEER @ GITHUB github.com/samlambert samlambert.com twitter.com/isamlambert

  3. WHAT IS GITHUB?

  4. GITHUB > code hosting > collaboration > octocats

  5. GITHUB > 6+ million users > 15.7 million repositories > 100+ tb of git data > 239 githubbers > 100 engineers

  6. GITHUB > proudly powered by mysql

  7. github.com/mysql/mysql-server

  8. THE TEAM

  9. infrastructure > small team ~ 15 people > responsible for scaling, automation, pager rotation, git storage and site reliability > sub team: the database infrastructure team > shout out to @dbussink

  10. the github stack

  11. the stack > git (obviously) > ruby/rails for github.com > c spread around the stack > puppet for provisioning > bash and ruby for scripting > elasticsearch for .com search > haystack for exceptions > resque for queues

  12. ruby on rails > github/github > 203 contributors > 192,000 commits > large rails app > active record

  13. active record > object relational mapper > avoids writing sql directly > can write some terrible queries > single DB host approach

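To make the "terrible queries" point concrete: a hypothetical Active Record snippet (model names invented for illustration) where innocent-looking Ruby hides an N+1 query pattern, next to the single-query alternative.

    # Looks harmless, but runs one COUNT query per repository (N+1):
    user.repositories.each do |repo|
      puts repo.issues.count
    end

    # Same information from a single grouped query:
    Issue.where(repository_id: user.repositories.select(:id))
         .group(:repository_id)
         .count
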
  14. environment > fast changing codebase > hundreds of deployments a day > tooling is extremely important

  15. SELECT DATE_SUB(NOW(), INTERVAL 18 MONTH);

  16. > majority of queries served from one host > replicas used for backups/failover > old hardware/datacenter going solo

  17. > unscalable > contention problems > traffic bursts caused query response times to go up read me

  18. time for change

  19. > needed to move data centers > chance to update hardware > new start = a chance to tune > time to functionally shard you had me at hardware

  20. > a large volume of writes came from a single events table > constantly growing > no joins sharding?

  21. > replicate table do > move reads onto new cluster > then finally cut writes over > stop replication replicate

  22. > multiple clusters sharded functionally > separate concerns > scale writes and reads now there were two

  23. > events out of the way time for the big show > the main cluster was next main cluster

  24. > new hardware > ssds > loads of ram > 10gb networking bare metal

  25. > single master > lots of read replicas > delayed replicas > logical backup hosts > full backup hosts build the topology

  26. > regression testing is essential > replay queries from live cluster > long benchmarks: 4 hours + > one change at a time TESTING

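A rough sketch of what replaying captured queries against a test host could look like, assuming a plain file of SQL statements and the mysql2 gem; GitHub's actual replay tooling is not shown in the slides.

    require "mysql2"

    # Assumption: one SQL statement per line, captured from the live cluster.
    client = Mysql2::Client.new(host: "test-db", username: "bench", database: "github_bench")

    File.foreach("captured_queries.sql") do |sql|
      begin
        started = Time.now
        client.query(sql.chomp)
        puts format("%.4fs  %s", Time.now - started, sql[0, 80])
      rescue Mysql2::Error => e
        warn "failed: #{e.message}"
      end
    end
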
  27. > maintenance window > 13 minutes go live

  28. results

  29. time to use that hardware

  30. start master replica replica replica apps

  31. master

  32. replica

  33. new design master replica replica replica apps haproxy

  34. app changes how do you transition a monolithic app to use multiple database hosts?

  35. connections > split out the current connection > write > read only

  36. GET > we made the decision to have all get requests use a replica

  37. POST > all posts and gets after a post for a user use the master > after 3 seconds the user moves to a replica

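A minimal sketch of the routing rule from slides 36-37, assuming a Rack-style request object; connect_to_master and connect_to_replica are hypothetical helpers, not GitHub's code.

    STICKY_MASTER_WINDOW = 3 # seconds, per slide 37

    def pick_connection(request, session)
      last_write = session[:last_write_at]

      if request.get? && (last_write.nil? || Time.now - last_write > STICKY_MASTER_WINDOW)
        connect_to_replica            # hypothetical helper
      else
        session[:last_write_at] = Time.now unless request.get?
        connect_to_master             # hypothetical helper
      end
    end
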
  38. refactoring > we wanted to take the smallest steps possible each time > we verified our changes at each step in the process

  39. write alerts > how do we know we aren’t going to break anything? > we set up a connection we called “write alert” > write alert allowed writes but notified us

  40. haystack > haystack is our exception tracking tool > backed by elasticsearch > awesome

  41. write alerts

  42. write alerts

  43. write alerts > this allowed us to test moving to a read only connection without impacting users > we fixed any issues that came up > when we stopped getting alerts we knew we were ready to go read only

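One way the "write alert" connection could be sketched: let the write go through but report it, so offending call sites can be fixed before switching to a genuinely read-only connection. Haystack.notify stands in for whatever reporting call the app actually uses; the class is illustrative.

    require "delegate"

    class WriteAlertConnection < SimpleDelegator
      WRITE_SQL = /\A\s*(insert|update|delete|replace|alter|create|drop)\b/i

      def execute(sql, *args)
        if sql =~ WRITE_SQL
          # Report (do not raise) so users are unaffected while call sites are hunted down.
          Haystack.notify("write on read-only path", sql: sql, backtrace: caller.first(5))
        end
        __getobj__.execute(sql, *args)
      end
    end
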
  44. None
  45. > we staff ship features and changes to help us gain confidence staff shipping

  46. haproxy > needed a way of distributing queries among replicas > plenty of prior art

  47. haproxy > we created haproxy pairs for ha and failover

  48. gitauth > we started with a subset of our app > a proxy that checks you have permissions to push and pull to a repo > read intensive

  49. % > slow ramp up > 1% > 5%

  50. heartbeat > permissions are replication sensitive > pt-heartbeat > gitauth checks > 1 second of delay = move back to the master

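pt-heartbeat keeps a timestamp fresh on the master, so a replica's delay is roughly "now minus that timestamp". A sketch of a gitauth-style check; the percona.heartbeat table location and UTC timestamps are assumptions based on pt-heartbeat convention, and the 1 second threshold comes from the slide.

    require "time"

    MAX_REPLICATION_DELAY = 1.0 # seconds, per slide 50

    # Assumes pt-heartbeat writes UTC timestamps into percona.heartbeat.
    def replication_delay(replica)
      row = replica.query("SELECT ts FROM percona.heartbeat LIMIT 1").first
      Time.now.utc - Time.parse(row["ts"])
    end

    # Fall back to the master when the replica is too far behind.
    connection = replication_delay(replica) <= MAX_REPLICATION_DELAY ? replica : master
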
  51. build confidence > rest of the app had to follow > keep upping the %

  52. None
  53. None
  54. failover

  55. PSUs > parts go > more parts to keep github up

  56. clients > pause the request > reconnect through the proxy
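
A sketch of the client behaviour described here: on a dropped connection, pause, reconnect through the proxy, and retry once. The reconnect helper is hypothetical.

    require "mysql2"

    def with_failover_retry(retries = 1, pause = 2)
      yield
    rescue Mysql2::Error
      raise if retries.zero?
      sleep pause                # give the proxy time to point at the new master
      reconnect_through_proxy!   # hypothetical helper: re-open the connection via haproxy
      retries -= 1
      retry
    end

    with_failover_retry { connection.query("SELECT 1") }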

  57. None
  58. performance degradation

  59. keeping an eye > graphing at github is awesome > shout out to @jssjr github.com/jssjr

  60. increase in latency > we noticed an upward trend in latency

  61. None
  62. None
  63. multi process > hasn’t always worked well in the past > connections tended to stick to a process

  64. kernel > upgrades were required for better balance

  65. slow and steady > deploy app to use upgraded secondary haproxy > roll through the cluster

  66. the down sides

  67. hurry up > replication delay is painful > be careful where you can tolerate delay

  68. cause > large updates, inserts, deletes > dependent destroy > transitions

  69. effect > delay is painful > be careful where you can tolerate delay

  70. remedy > get after a post gets a master

  71. haystack > we modified the app > when a statement modifies too many rows we send it to haystack > insight

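A sketch of the "too many rows modified" report, assuming the mysql2 client's affected_rows and a Haystack.notify stand-in; the threshold is invented.

    MAX_AFFECTED_ROWS = 10_000 # illustrative threshold

    def execute_with_row_check(client, sql)
      result = client.query(sql)
      if client.affected_rows > MAX_AFFECTED_ROWS
        # Report rather than raise: the goal is insight, not broken writes.
        Haystack.notify("statement modified #{client.affected_rows} rows", sql: sql)
      end
      result
    end
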
  72. None
  73. throttler > developers need to modify data > must be replication safe > query haproxy > check replicas

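A sketch of a replication-safe throttler: do the modification in small batches and wait for every replica to catch up between batches, reusing the replication_delay check from the slide 50 sketch. The batch size, pause, and the UPDATE itself are illustrative.

    BATCH_SIZE = 1_000
    MAX_LAG    = 1.0 # seconds

    def throttled_update(master, replicas, ids)
      ids.each_slice(BATCH_SIZE) do |batch|
        master.query("UPDATE users SET email = LOWER(email) WHERE id IN (#{batch.join(',')})")

        # Pause until all replicas have caught up before the next batch.
        sleep 0.5 until replicas.all? { |r| replication_delay(r) <= MAX_LAG }
      end
    end
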
  74. contributions > email change > active users caused delay > support request > use the throttler

  75. None
  76. keeping things fast

  77. tooling > tooling is essential > never underestimate the power of being able to write tools

  78. log it > we built a slow query logger into the app

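In a Rails app, one way to build an in-app slow query logger is to subscribe to Active Record's sql.active_record notifications; a minimal sketch, with the 500 ms threshold and log destination as assumptions.

    SLOW_QUERY_THRESHOLD_MS = 500

    ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, start, finish, _id, payload|
      duration_ms = (finish - start) * 1000.0
      if duration_ms > SLOW_QUERY_THRESHOLD_MS
        Rails.logger.warn("slow query (#{duration_ms.round(1)} ms): #{payload[:sql]}")
      end
    end
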
  79. None
  80. None
  81. haystack pager > developer on call > a spike in needles pages someone

  82. toolbar > staff mode > see all queries on a page > with times > github.com/peek/peek

  83. None
  84. None
  85. None
  86. tooling > verification and improvement

  87. slow transactions

  88. migrations > query pile up > site stalls > bad user experience

  89. observe > we noticed two issues: - table stats - metadata locking

  90. table stats > innodb_stats_on_metadata > innodb_stats_auto_update > github.com/samlambert/pt-online-schema-change-analyze

  91. metadata > queries piled up behind a metadata lock

  92. pt-osc > table copy and swap

  93. None
  94. None
  95. None
  96. None
  97. None
  98. None
  99. None
  100. prevention > smaller transactions > detection
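
One common way to keep transactions small in a Rails app is to batch the work instead of issuing a single statement that touches millions of rows; a sketch (model and column invented) using find_in_batches.

    # One short UPDATE per batch instead of one huge, long-running transaction.
    Repository.where(archived: nil).find_in_batches(batch_size: 1_000) do |batch|
      Repository.where(id: batch.map(&:id)).update_all(archived: false)
    end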

  101. None
  102. chatops

  103. meet hubot > node.js > open source > github.com/github/hubot > hundreds of plugins

  104. None
  105. show and tell > it all happens in chat > amazing for learning > share the terminal

  106. anything > drop tables > see who's in the office > deploy apps

  107. culture > chat is central to our culture

  108. remote > 52% of github is remote > how do you give everyone context?

  109. automation > safe > intuitive > accessible > people will use it

  110. explain > explain queries via hubot

  111. None
  112. explain > learn together > work as a team > no need for a meeting/email

  113. profile > profile queries

  114. None
  115. github.com/samlambert/hubot-mysql-chatops

  116. shell > you do not have to write coffeescript! > 34279 lines of ruby and shell > wrapped by hubot

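The pattern on slides 110-116 is a chat command wrapping a small script; a hypothetical Ruby script that hubot could shell out to in order to EXPLAIN a query on a replica.

    #!/usr/bin/env ruby
    # Hypothetical: hubot passes the SQL as arguments; we EXPLAIN it on a replica.
    require "mysql2"

    sql = ARGV.join(" ")
    abort "only SELECT statements, please" unless sql =~ /\A\s*select/i

    client = Mysql2::Client.new(host: "replica-1", username: "readonly", database: "github")
    client.query("EXPLAIN #{sql}").each do |row|
      puts row.map { |k, v| "#{k}=#{v}" }.join("  ")
    end
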
  117. truncate > safe > visible > repeatable

  118. None
  119. None
  120. None
  121. backup > no excuse > available to anyone > uses an app called safehold

  122. safehold > fires backup jobs into a queue > workers work on different types of jobs

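Since the stack already uses Resque for queues (slide 11), a safehold-style backup job might be sketched like this; the class name, queue name, and paths are made up.

    require "resque"

    class LogicalBackupJob
      @queue = :backups

      def self.perform(database, table)
        # A worker watching the :backups queue runs the actual dump.
        system("mysqldump", "--single-transaction", database, table,
               out: "/backups/#{database}.#{table}.sql")
      end
    end

    # Fire a backup job into the queue:
    Resque.enqueue(LogicalBackupJob, "github", "repositories")
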
  123. restore > restore any logical backup > backups go to intermediate hosts

  124. None
  125. clone > clone tables onto test servers > great for testing indexes > developers use this a lot

  126. proxy control > weight servers > take them from the pool

  127. deploy /deploy

  128. graph me /graph me -1h @mysql.rwps

  129. None
  130. status > /status yellow <message> > letting you all know

  131. mitigate > attacks happen > why get sad? > use the chatops

  132. questions?

  133. SAM LAMBERT LEAD ENGINEER @ GITHUB github.com/samlambert samlambert.com twitter.com/isamlambert