The MySQL Ecosystem at GitHub

Sam Lambert
November 04, 2014

A talk I gave at Percona Live London.

Transcript

  1. THE MYSQL ECOSYSTEM
    AT GITHUB

  2. SAM LAMBERT
    LEAD ENGINEER @ GITHUB
    github.com/samlambert
    samlambert.com
    twitter.com/isamlambert

  3. WHAT IS
    GITHUB?

  4. GITHUB
    > code hosting
    > collaboration
    > octocats

  5. GITHUB
    > 6+ million users
    > 15.7 million repositories
    > 100+ tb of git data
    > 239 githubbers
    > 100 engineers

  6. GITHUB
    > proudly powered by mysql

  7. github.com/mysql/mysql-server

  8. infrastructure
    > small team ~ 15 people
    > responsible for scaling, automation,
    pager rotation, git storage and site
    reliability
    > sub team: the database infrastructure
    team
    > shout out to @dbussink

  9. the github
    stack

  10. the stack
    > git (obviously)
    > ruby/rails for github.com
    > c spread around the stack
    > puppet for provisioning
    > bash and ruby for scripting
    > elasticsearch for .com search
    > haystack for exceptions
    > resque for queues

  11. ruby on rails
    > github/github
    > 203 contributors
    > 192,000 commits
    > large rails app
    > active record

  12. active record
    > object relational mapper
    > avoids writing sql directly
    > can write some terrible queries
    > single DB host approach

  13. environment
    > fast changing codebase
    > hundreds of deployments a day
    > tooling is extremely important

  14. SELECT DATE_SUB(NOW(), INTERVAL 18 MONTH);

  15. going solo
    > majority of queries served from one host
    > replicas used for backups/failover
    > old hardware/datacenter

  16. read me
    > unscalable
    > contention problems
    > traffic bursts caused query response times to go up

  17. time for
    change

  18. you had me at hardware
    > needed to move data centers
    > chance to update hardware
    > new start = a chance to tune
    > time to functionally shard

  19. sharding?
    > a large volume of writes came from a single events table
    > constantly growing
    > no joins

  20. replicate
    > replicate-do-table
    > move reads onto new cluster
    > then finally cut writes over
    > stop replication

  21. now there were two
    > multiple clusters sharded functionally
    > separate concerns
    > scale writes and reads

  22. main cluster
    > events out of the way, time for the big show
    > the main cluster was next

  23. bare metal
    > new hardware
    > ssds
    > loads of ram
    > 10gb networking

  24. build the topology
    > single master
    > lots of read replicas
    > delayed replicas
    > logical backup hosts
    > full backup hosts

  25. TESTING
    > regression testing is essential
    > replay queries from live cluster
    > long benchmarks: 4 hours +
    > one change at a time

  26. go live
    > maintenance window
    > 13 minutes

  27. time to use that
    hardware

  28. start
    master
    replica replica replica
    apps

  29. new design
    master
    replica replica replica
    apps
    haproxy

  30. app changes
    how do you transition a
    monolithic app to use multiple
    database hosts?

  31. connections
    > split out the current connection
    > write
    > read only

  32. GET
    > we made the decision to have all
    get requests use a replica

  33. POST
    > all posts and gets after a post
    for a user use the master
    > after 3 seconds the user moves
    to a replica
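
The GET and POST rules on the last two slides can be sketched as a small router. This is a minimal Python sketch, not the Rails code from the talk: `ConnectionRouter`, `choose`, and the injected `clock` are hypothetical names, and only the GET/POST split and the 3-second window come from the slides.

```python
import time

STICKY_SECONDS = 3.0  # from the slide: after 3 seconds the user moves to a replica


class ConnectionRouter:
    """Chooses 'master' or 'replica' per request (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_write = {}  # user id -> time of that user's last POST

    def choose(self, user_id, method):
        now = self._clock()
        if method.upper() == "POST":
            # Writes go to the master and start the sticky window.
            self._last_write[user_id] = now
            return "master"
        last = self._last_write.get(user_id)
        if last is not None and now - last < STICKY_SECONDS:
            # A GET shortly after a POST reads from the master so the
            # user sees their own write despite replication delay.
            return "master"
        return "replica"
```

Other HTTP methods are left out here because the slides only mention GET and POST.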

  34. refactoring
    > we wanted to take the smallest
    steps possible each time
    > we verified our changes at each
    step in the process

  35. write alerts
    > how do we know we aren’t going
    to break anything?
    > we set up a connection we called
    “write alert”
    > write alert allowed writes but
    notified us
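
One way to picture a "write alert" connection is a wrapper that still runs every statement but reports the ones that look like writes. This is a minimal Python sketch under assumptions: `WriteAlertConnection`, the `execute` and `notify` callables, and the verb list are hypothetical, and the talk does not show how the real connection detected writes.

```python
class WriteAlertConnection:
    """Wraps a connection meant to become read only; writes still run but are reported."""

    # Assumed set of write statements; the real check is not shown in the talk.
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, execute, notify):
        self._execute = execute  # callable that actually runs the SQL
        self._notify = notify    # callable that records the alert

    def execute(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper() if sql.strip() else ""
        if verb in self.WRITE_VERBS:
            # Allow the write, but tell someone that it happened.
            self._notify(f"write on read-only path: {sql}")
        return self._execute(sql)
```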

  36. haystack
    > haystack is our exception
    tracking tool
    > backed by elasticsearch
    > awesome

  37. write alerts

  38. write alerts

  39. write alerts
    > this allowed us to test moving to
    a read only connection without
    impacting users
    > we fixed any issues that came up
    > when we stopped getting alerts
    we knew we were ready to go read
    only

  40. staff shipping
    > we staff ship features and changes to help us gain confidence

  41. haproxy
    > needed a way of distributing
    queries among replicas
    > plenty of prior art

  42. haproxy
    > we created haproxy pairs for ha
    and failover

  43. gitauth
    > we started with a subset of our
    app
    > a proxy that checks you have
    permissions to push and pull to a
    repo
    > read intensive

  44. %
    > slow ramp up
    > 1%
    > 5%
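
A percentage ramp like this is often done by hashing a stable key into 100 buckets. The talk does not say how the ramp was implemented, so this Python sketch is only an illustration: `use_replica` and the choice of key are hypothetical.

```python
import zlib


def use_replica(key, percent):
    """Return True for roughly `percent` percent of keys, stably per key."""
    bucket = zlib.crc32(key.encode("utf-8")) % 100  # bucket in 0..99
    return bucket < percent
```

Because the bucket depends only on the key, every key that was routed to a replica at 1% stays there at 5%, so raising the percentage only ever adds traffic.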

  45. heartbeat
    > permissions are replication
    sensitive
    > pt-heartbeat
    > gitauth checks
    > 1 second of delay = move back to
    the master
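
pt-heartbeat keeps a timestamp row updated on the master; reading that row on a replica shows how far behind the replica is. The check described above can be sketched in Python as follows. The function names are hypothetical and how gitauth actually reads the heartbeat is not shown in the talk; only the 1-second threshold comes from the slide.

```python
MAX_DELAY_SECONDS = 1.0  # from the slide: 1 second of delay = move back to the master


def replica_delay(heartbeat_ts, now):
    """Seconds between the newest heartbeat seen on the replica and now."""
    return max(0.0, now - heartbeat_ts)


def choose_host(heartbeat_ts, now):
    """Fall back to the master once the replica is a second or more behind."""
    if replica_delay(heartbeat_ts, now) >= MAX_DELAY_SECONDS:
        return "master"
    return "replica"
```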

  46. build confidence
    > rest of the app had to follow
    > keep upping the %

  47. PSUs
    > parts go
    > more parts to keep github up

  48. clients
    > pause the request
    > reconnect through the proxy

  49. performance
    degradation

  50. keeping an eye
    > graphing at github is awesome
    > shout out to @jssjr github.com/jssjr

  51. increase in latency
    > we noticed an upward trend in
    latency

  52. multi process
    > hasn’t always worked well in
    the past
    > connections tended to stick to a
    process

  53. kernel
    > upgrades were required for
    better balance

  54. slow and steady
    > deploy app to use upgraded
    secondary haproxy
    > roll through the cluster

  55. the
    down sides

  56. hurry up
    > replication delay is painful
    > be careful where you can
    tolerate delay

  57. cause
    > large updates, inserts, deletes
    > dependent destroy
    > transitions

  58. effect
    > delay is painful
    > be careful where you can
    tolerate delay

  59. remedy
    > get after a post gets a master

  60. haystack
    > we modified the app
    > when a statement modifies too
    many rows we send it to haystack
    > insight
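
The row-count check could look something like this Python sketch. `report_large_write`, the `report` callable, and the threshold of 1000 rows are all hypothetical; the talk does not give a number or show the actual hook.

```python
ROW_THRESHOLD = 1000  # hypothetical limit; the talk does not give a number


def report_large_write(sql, rows_affected, report, threshold=ROW_THRESHOLD):
    """Call `report` when a statement touched more rows than the threshold."""
    if rows_affected > threshold:
        report({"sql": sql, "rows": rows_affected, "threshold": threshold})
        return True
    return False
```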

  61. throttler
    > developers need to modify data
    > must be replication safe
    > query haproxy
    > check replicas
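
A throttler of this kind usually applies a large change in small batches and waits whenever the replicas fall behind. The following Python sketch illustrates that idea under assumptions: `throttled_batches` and its parameters are hypothetical, and `get_delay` stands in for "query haproxy, check replicas" without showing how either is done.

```python
import time


def throttled_batches(ids, batch_size, max_delay, get_delay, apply_batch,
                      sleep=time.sleep, pause=0.5):
    """Apply `apply_batch` to `ids` in chunks, waiting while replicas are delayed."""
    done = 0
    for start in range(0, len(ids), batch_size):
        # Hold off until the reported replica delay is within the allowed limit.
        while get_delay() > max_delay:
            sleep(pause)
        batch = ids[start:start + batch_size]
        apply_batch(batch)
        done += len(batch)
    return done
```

Keeping each batch small is what makes the wait effective: a single huge statement would replicate as one unit and could not be paused part way through.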

  62. contributions
    > email change
    > active users caused delay
    > support request
    > use the throttler

  63. keeping things
    fast

  64. tooling
    > tooling is essential
    > never underestimate the power
    of being able to write tools

  65. log it
    > we built a slow query logger into
    the app
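
An in-app slow query logger can be as small as a timer around each query. This Python sketch shows the shape of the idea only: `timed_query`, the `run` and `log` callables, and the 1-second threshold are hypothetical and not taken from the talk.

```python
import time


def timed_query(sql, run, log, threshold=1.0, clock=time.monotonic):
    """Run `sql` via `run` and log it if it took at least `threshold` seconds."""
    start = clock()
    try:
        return run(sql)
    finally:
        # Log even if the query raised, since slow failures matter too.
        elapsed = clock() - start
        if elapsed >= threshold:
            log(f"slow query ({elapsed:.3f}s): {sql}")
```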

  66. haystack pager
    > developer on call
    > a spike in needles pages someone

  67. toolbar
    > staff mode
    > see all queries on a page
    > with times
    > github.com/peek/peek

  68. tooling
    > verification and improvement

  69. slow
    transactions

  70. migrations
    > query pile up
    > site stalls
    > bad user experience

  71. observe
    > we noticed two issues:
    - table stats
    - metadata locking

  72. table stats
    > innodb_stats_on_metadata
    > innodb_stats_auto_update
    > github.com/samlambert/pt-online-schema-change-analyze

  73. metadata
    > queries piled up behind a
    metadata lock

  74. pt-osc
    > table copy and swap

  75. prevention
    > smaller transactions
    > detection

  76. meet hubot
    > node.js
    > open source
    > github.com/github/hubot
    > hundreds of plugins

  77. show and tell
    > it all happens in chat
    > amazing for learning
    > share the terminal

  78. anything
    > drop tables
    > see who's in the office
    > deploy apps

  79. culture
    > chat is central to our culture

  80. remote
    > 52% of github is remote
    > how do you give everyone
    context?

  81. automation
    > safe
    > intuitive
    > accessible
    > people will use it

  82. explain
    > explain queries via hubot

  83. explain
    > learn together
    > work as a team
    > no need for a meeting/email

  84. profile
    > profile queries

  85. github.com/samlambert/hubot-mysql-chatops

  86. shell
    > you do not have to write coffeescript!
    > 34279 lines of ruby and shell
    > wrapped by hubot

  87. truncate
    > safe
    > visible
    > repeatable

  88. backup
    > no excuse
    > available to anyone
    > uses an app called safehold

  89. safehold
    > fires backup jobs into a queue
    > workers work on different
    types of jobs

  90. restore
    > restore any logical backup
    > backups go to intermediate hosts

  91. clone
    > clone tables onto test servers
    > great for testing indexes
    > developers use this a lot

  92. proxy control
    > weight servers
    > take them from the pool
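
HAProxy accepts runtime commands such as `set weight` and `disable server` on its admin socket, which is one plausible way to weight servers and take them out of the pool. The talk does not say how the chat commands talk to HAProxy, so this Python sketch only builds the command strings; `set_weight`, `take_from_pool`, and the backend and server names in the usage note are hypothetical.

```python
def set_weight(backend, server, weight):
    """Build the HAProxy runtime command that sets a server's weight (0..256)."""
    if not 0 <= weight <= 256:
        raise ValueError("weight must be between 0 and 256")
    return f"set weight {backend}/{server} {weight}\n"


def take_from_pool(backend, server):
    """Build the HAProxy runtime command that stops sending traffic to a server."""
    return f"disable server {backend}/{server}\n"
```

For example, `set_weight("mysql_ro", "replica1", 10)` produces the line a tool would write to the admin socket to send less traffic to that replica.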

  93. deploy
    /deploy

  94. graph me
    /graph me -1h @mysql.rwps

  95. status
    > /status yellow
    > letting you all know

  96. mitigate
    > attacks happen
    > why get sad?
    > use the chatops

  97. SAM LAMBERT
    LEAD ENGINEER @ GITHUB
    github.com/samlambert
    samlambert.com
    twitter.com/isamlambert