SAM LAMBERT
LEAD ENGINEER @ GITHUB
github.com/samlambert
samlambert.com
twitter.com/isamlambert
Slide 3
Slide 3 text
WHAT IS
GITHUB?
Slide 4
Slide 4 text
GITHUB
> code hosting
> collaboration
> octocats
Slide 5
Slide 5 text
GITHUB
> 6+ million users
> 15.7 million repositories
> 100+ TB of git data
> 239 githubbers
> 100 engineers
Slide 6
Slide 6 text
GITHUB
> proudly powered by mysql
Slide 7
Slide 7 text
github.com/mysql/mysql-server
Slide 8
Slide 8 text
THE
TEAM
Slide 9
Slide 9 text
infrastructure
> small team ~ 15 people
> responsible for scaling, automation,
pager rotation, git storage and site
reliability
> sub team: the database infrastructure
team
> shout out to @dbussink
Slide 10
Slide 10 text
the github
stack
Slide 11
Slide 11 text
the stack
> git (obviously)
> ruby/rails for github.com
> c spread around the stack
> puppet for provisioning
> bash and ruby for scripting
> elasticsearch for .com search
> haystack for exceptions
> resque for queues
Slide 12
Slide 12 text
ruby on rails
> github/github
> 203 contributors
> 192,000 commits
> large rails app
> active record
Slide 13
Slide 13 text
active record
> object relational mapper
> avoids writing sql directly
> can write some terrible queries
> single DB host approach
Slide 14
Slide 14 text
environment
> fast changing codebase
> hundreds of deployments a day
> tooling is extremely important
Slide 15
Slide 15 text
SELECT DATE_SUB(NOW(), INTERVAL 18 MONTH);
Slide 16
Slide 16 text
going solo
> majority of queries served from
one host
> replicas used for backups/
failover
> old hardware/datacenter
Slide 17
Slide 17 text
read me
> unscalable
> contention problems
> traffic bursts caused query
response times to go up
Slide 18
Slide 18 text
time for
change
Slide 19
Slide 19 text
you had me at
hardware
> needed to move data centers
> chance to update hardware
> new start = a chance to tune
> time to functionally shard
Slide 20
Slide 20 text
sharding?
> a large volume of writes came
from a single events table
> constantly growing
> no joins
Slide 21
Slide 21 text
replicate
> replicate the table (replicate-do-table)
> move reads onto new cluster
> then finally cut writes over
> stop replication
Slide 22
Slide 22 text
now there were two
> multiple clusters sharded
functionally
> separate concerns
> scale writes and reads
Slide 23
Slide 23 text
main cluster
> events out of the way, time for
the big show
> the main cluster was next
Slide 24
Slide 24 text
bare metal
> new hardware
> ssds
> loads of ram
> 10Gb networking
Slide 25
Slide 25 text
build the topology
> single master
> lots of read replicas
> delayed replicas
> logical backup hosts
> full backup hosts
Slide 26
Slide 26 text
TESTING
> regression testing is essential
> replay queries from live cluster
> long benchmarks: 4 hours +
> one change at a time
Slide 27
Slide 27 text
go live
> maintenance window
> 13 minutes
Slide 28
Slide 28 text
results
Slide 29
Slide 29 text
time to use that
hardware
Slide 30
Slide 30 text
start
master
replica replica replica
apps
Slide 31
Slide 31 text
master
Slide 32
Slide 32 text
replica
Slide 33
Slide 33 text
new design
master
replica replica replica
apps
haproxy
Slide 34
Slide 34 text
app changes
how do you transition a
monolithic app to use multiple
database hosts?
Slide 35
Slide 35 text
connections
> split out the current connection
> write
> read only
Slide 36
Slide 36 text
GET
> we made the decision to have all
get requests use a replica
Slide 37
Slide 37 text
POST
> all posts and gets after a post
for a user use the master
> after 3 seconds the user moves
to a replica
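The GET/POST routing rule on the last two slides can be sketched as below. This is a minimal illustration, not the code from github/github: the `ConnectionRouter` class, its method names, and the choice to treat every non-GET request as a write are assumptions. Only the 3 second window comes from the slide.

```python
import time

STICKY_SECONDS = 3.0  # from the slide: the user moves back to a replica after 3 seconds


class ConnectionRouter:
    """Hypothetical helper that picks 'master' or 'replica' for each request."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_write = {}  # user id -> time of that user's most recent write

    def choose(self, user_id, method):
        now = self._clock()
        if method.upper() != "GET":
            # Assumption: anything that is not a GET counts as a write.
            self._last_write[user_id] = now
            return "master"
        last = self._last_write.get(user_id)
        if last is not None and now - last < STICKY_SECONDS:
            # A GET shortly after a write still reads from the master.
            return "master"
        return "replica"
```

Passing the clock in as a parameter keeps the 3 second rule easy to exercise in tests without sleeping.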
Slide 38
Slide 38 text
refactoring
> we wanted to take the smallest
steps possible each time
> we verified our changes at each
step in the process
Slide 39
Slide 39 text
write alerts
> how do we know we aren’t going
to break anything?
> we set up a connection we called
“write alert”
> write alert allowed writes but
notified us
Slide 40
Slide 40 text
haystack
> haystack is our exception
tracking tool
> backed by elasticsearch
> awesome
Slide 41
Slide 41 text
write alerts
Slide 42
Slide 42 text
write alerts
Slide 43
Slide 43 text
write alerts
> this allowed us to test moving to
a read only connection without
impacting users
> we fixed any issues that came up
> when we stopped getting alerts
we knew we were ready to go read
only
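A rough sketch of the write alert idea: a connection that still executes writes but reports each one, so code paths that would break on a read only connection show up before the switch. The class name, the regular expression, and the `notify` callback are illustrative assumptions. In the talk the reports went to haystack.

```python
import re

# Assumption: a statement counts as a write if it starts with one of these verbs.
WRITE_RE = re.compile(r"^\s*(INSERT|UPDATE|DELETE|REPLACE)\b", re.IGNORECASE)


class WriteAlertConnection:
    """Hypothetical wrapper: writes are allowed, but each one is reported."""

    def __init__(self, execute, notify):
        self._execute = execute  # callable that runs the SQL on the real connection
        self._notify = notify    # callable that reports to an exception tracker

    def execute(self, sql):
        if WRITE_RE.match(sql):
            # Report the write, then let it through so users are not affected.
            self._notify(f"write on read-only path: {sql.strip()}")
        return self._execute(sql)
```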
Slide 44
Slide 44 text
No content
Slide 45
Slide 45 text
staff shipping
> we staff ship features and
changes to help us gain confidence
Slide 46
Slide 46 text
haproxy
> needed a way of distributing
queries among replicas
> plenty of prior art
Slide 47
Slide 47 text
haproxy
> we created haproxy pairs for ha
and failover
Slide 48
Slide 48 text
gitauth
> we started with a subset of our
app
> a proxy that checks you have
permissions to push and pull to a
repo
> read intensive
Slide 49
Slide 49 text
%
> slow ramp up
> 1%
> 5%
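One way to implement a slow percentage ramp is to hash a stable key into 100 buckets and send the lowest buckets to the replicas. The talk does not say how the percentage was applied, so the function name `uses_replica` and the per-key bucketing are assumptions.

```python
import zlib


def uses_replica(key, percent):
    """Return True for roughly `percent`% of keys, with a stable answer per key."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(str(key).encode("utf-8")) % 100
    return bucket < percent
```

Because the bucket for a key never changes, every key included at 1% stays included at 5%, so raising the percentage only adds traffic.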
Slide 50
Slide 50 text
heartbeat
> permissions are replication
sensitive
> pt-heartbeat
> gitauth checks
> 1 second of delay = move back to
the master
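pt-heartbeat keeps a timestamp row up to date on the master, and comparing the replicated copy of that row with the current time gives the replication delay. The check described above might look like the sketch below. The function names are hypothetical, and how the timestamp is read from the replica is left out.

```python
import time

MAX_DELAY_SECONDS = 1.0  # from the slide: 1 second of delay moves reads back to the master


def replication_delay(heartbeat_ts, now=None):
    """Seconds between the last replicated heartbeat timestamp and now."""
    now = time.time() if now is None else now
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, now - heartbeat_ts)


def pick_host(heartbeat_ts, now=None):
    """Use the replica only while its delay is under the threshold."""
    if replication_delay(heartbeat_ts, now) >= MAX_DELAY_SECONDS:
        return "master"
    return "replica"
```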
Slide 51
Slide 51 text
build confidence
> rest of the app had to follow
> keep upping the %
Slide 52
Slide 52 text
No content
Slide 53
Slide 53 text
No content
Slide 54
Slide 54 text
failover
Slide 55
Slide 55 text
PSUs
> parts go
> more parts to keep github up
Slide 56
Slide 56 text
clients
> pause the request
> reconnect through the proxy
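The client side of failover, pausing and then reconnecting through the proxy, can be sketched as a small retry helper. The name `with_reconnect`, the attempt count, and the pause length are assumptions. The sketch also assumes the work in `run` is safe to repeat, which is not true of every write.

```python
import time


def with_reconnect(connect, run, attempts=3, pause=0.5, sleep=time.sleep):
    """Run `run(conn)`; if the connection drops, pause and reconnect."""
    last_error = None
    for attempt in range(attempts):
        try:
            conn = connect()  # assumed to go through the proxy, which picks a healthy host
            return run(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(pause)  # hold the request briefly instead of failing it
    raise last_error
```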
Slide 57
Slide 57 text
No content
Slide 58
Slide 58 text
performance
degradation
Slide 59
Slide 59 text
keeping an eye
> graphing at github is awesome
> shout out to @jssjr github.com/jssjr
Slide 60
Slide 60 text
increase in latency
> we noticed an upward trend in
latency
Slide 61
Slide 61 text
No content
Slide 62
Slide 62 text
No content
Slide 63
Slide 63 text
multi process
> hasn’t always worked well in
the past
> connections tended to stick to a
process
Slide 64
Slide 64 text
kernel
> upgrades were required for
better balance
Slide 65
Slide 65 text
slow and steady
> deploy app to use upgraded
secondary haproxy
> roll through the cluster
Slide 66
Slide 66 text
the
down sides
Slide 67
Slide 67 text
hurry up
> replication delay is painful
> be careful where you can
tolerate delay
Slide 68
Slide 68 text
cause
> large updates, inserts, deletes
> dependent destroy
> transitions
Slide 69
Slide 69 text
effect
> delay is painful
> be careful where you can
tolerate delay
Slide 70
Slide 70 text
remedy
> get after a post gets a master
Slide 71
Slide 71 text
haystack
> we modified the app
> when a statement modifies too
many rows we send it to haystack
> insight
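A minimal sketch of the row count check, assuming the app can see how many rows each statement affected. The threshold of 1000 and the `report` callback are hypothetical, and the talk does not give the real limit.

```python
ROW_LIMIT = 1000  # hypothetical threshold; the talk does not state a number


def check_rows_modified(sql, rows_affected, report, limit=ROW_LIMIT):
    """Report statements that changed more than `limit` rows; return True if reported."""
    if rows_affected > limit:
        report({"sql": sql, "rows_affected": rows_affected, "limit": limit})
        return True
    return False
```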
Slide 72
Slide 72 text
No content
Slide 73
Slide 73 text
throttler
> developers need to modify data
> must be replication safe
> query haproxy
> check replicas
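The throttler idea can be sketched as a loop that applies a large change in small batches and waits whenever the replicas fall behind. Everything here is an assumption for illustration: the function name, the batch size, and the `max_replica_delay` callback, which stands in for asking haproxy for the replica list and checking each replica's delay.

```python
import time


def throttled_batches(ids, apply_batch, max_replica_delay, batch_size=100,
                      max_delay=1.0, pause=0.5, sleep=time.sleep):
    """Apply `apply_batch` to ids in chunks, waiting while replicas lag."""
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        # Before each batch, wait until the most delayed replica is within bounds.
        while max_replica_delay() > max_delay:
            sleep(pause)
        apply_batch(ids[start:start + batch_size])
```

Small batches keep each statement cheap to replicate, and the delay check stops a long job from pushing the replicas further behind.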
Slide 74
Slide 74 text
contributions
> email change
> active users caused delay
> support request
> use the throttler
Slide 75
Slide 75 text
No content
Slide 76
Slide 76 text
keeping things
fast
Slide 77
Slide 77 text
tooling
> tooling is essential
> never underestimate the power
of being able to write tools
Slide 78
Slide 78 text
log it
> we built a slow query logger into
the app
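An in-app slow query logger can be as small as a timer around each query. This sketch uses a context manager. The name `log_if_slow`, the 1 second threshold, and the shape of the log record are assumptions, not details from the talk.

```python
import time
from contextlib import contextmanager

SLOW_SECONDS = 1.0  # hypothetical threshold; not stated in the talk


@contextmanager
def log_if_slow(sql, log, threshold=SLOW_SECONDS, clock=time.monotonic):
    """Time the wrapped block and log the query if it took too long."""
    started = clock()
    try:
        yield
    finally:
        # Runs even if the query raised, so slow failures are logged too.
        elapsed = clock() - started
        if elapsed >= threshold:
            log({"sql": sql, "seconds": elapsed})
```

A caller would wrap the actual database call, for example `with log_if_slow(sql, logger): cursor.execute(sql)`, where `logger` and `cursor` are whatever the app already uses.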
Slide 79
Slide 79 text
No content
Slide 80
Slide 80 text
No content
Slide 81
Slide 81 text
haystack pager
> developer on call
> a spike in needles pages someone
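The talk does not say how a spike is detected. One simple possibility, shown only as an assumption, compares the latest count of needles with the average of the preceding intervals. The `factor` and `min_count` values are made up for the example.

```python
def is_spike(counts, factor=3.0, min_count=10):
    """True if the latest count is far above the average of the earlier ones."""
    if len(counts) < 2:
        return False  # not enough history to compare against
    *history, latest = counts
    baseline = sum(history) / len(history)
    # min_count avoids paging on tiny absolute numbers such as 0 -> 3.
    return latest >= min_count and latest > factor * baseline
```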
Slide 82
Slide 82 text
toolbar
> staff mode
> see all queries on a page
> with times
> github.com/peek/peek
Slide 83
Slide 83 text
No content
Slide 84
Slide 84 text
No content
Slide 85
Slide 85 text
No content
Slide 86
Slide 86 text
tooling
> verification and improvement
Slide 87
Slide 87 text
slow
transactions
Slide 88
Slide 88 text
migrations
> query pile up
> site stalls
> bad user experience
Slide 89
Slide 89 text
observe
> we noticed two issues:
- table stats
- metadata locking