The MySQL Ecosystem at GitHub

Slide 1

Slide 1 text

THE MYSQL ECOSYSTEM AT GITHUB

Slide 2

Slide 2 text

SAM LAMBERT LEAD ENGINEER @ GITHUB github.com/samlambert samlambert.com twitter.com/isamlambert

Slide 3

Slide 3 text

WHAT IS GITHUB?

Slide 4

Slide 4 text

GITHUB > code hosting > collaboration > octocats

Slide 5

Slide 5 text

GITHUB > 6+ million users > 15.7 million repositories > 100+ tb of git data > 239 githubbers > 100 engineers

Slide 6

Slide 6 text

GITHUB > proudly powered by mysql

Slide 7

Slide 7 text

github.com/mysql/mysql-server

Slide 8

Slide 8 text

THE TEAM

Slide 9

Slide 9 text

infrastructure > small team ~ 15 people > responsible for scaling, automation, pager rotation, git storage and site reliability > sub team: the database infrastructure team > shout out to @dbussink

Slide 10

Slide 10 text

the github stack

Slide 11

Slide 11 text

the stack > git (obviously) > ruby/rails for github.com > c spread around the stack > puppet for provisioning > bash and ruby for scripting > elasticsearch for .com search > haystack for exceptions > resque for queues

Slide 12

Slide 12 text

ruby on rails > github/github > 203 contributors > 192,000 commits > large rails app > active record

Slide 13

Slide 13 text

active record > object relational mapper > avoids writing sql directly > can write some terrible queries > single DB host approach

Slide 14

Slide 14 text

environment > fast changing codebase > hundreds of deployments a day > tooling is extremely important

Slide 15

Slide 15 text

SELECT DATE_SUB(NOW(), INTERVAL 18 MONTH);

Slide 16

Slide 16 text

going solo > majority of queries served from one host > replicas used for backups/failover > old hardware/datacenter

Slide 17

Slide 17 text

read me > unscalable > contention problems > traffic bursts caused query response times to go up

Slide 18

Slide 18 text

time for change

Slide 19

Slide 19 text

you had me at hardware > needed to move data centers > chance to update hardware > new start = a chance to tune > time to functionally shard

Slide 20

Slide 20 text

sharding? > a large volume of writes came from a single events table > constantly growing > no joins

Slide 21

Slide 21 text

replicate > replicate table do > move reads onto new cluster > then finally cut writes over > stop replication

Slide 22

Slide 22 text

now there were two > multiple clusters sharded functionally > separate concerns > scale writes and reads

Slide 23

Slide 23 text

main cluster > events out of the way > time for the big show > the main cluster was next

Slide 24

Slide 24 text

bare metal > new hardware > ssds > loads of ram > 10gb networking

Slide 25

Slide 25 text

build the topology > single master > lots of read replicas > delayed replicas > logical backup hosts > full backup hosts

Slide 26

Slide 26 text

TESTING > regression testing is essential > replay queries from live cluster > long benchmarks: 4 hours + > one change at a time

Slide 27

Slide 27 text

go live > maintenance window > 13 minutes

Slide 28

Slide 28 text

results

Slide 29

Slide 29 text

time to use that hardware

Slide 30

Slide 30 text

start master replica replica replica apps

Slide 31

Slide 31 text

master

Slide 32

Slide 32 text

replica

Slide 33

Slide 33 text

new design master replica replica replica apps haproxy

Slide 34

Slide 34 text

app changes > how do you transition a monolithic app to use multiple database hosts?

Slide 35

Slide 35 text

connections > split out the current connection > write > read only

Slide 36

Slide 36 text

GET > we made the decision to have all get requests use a replica

Slide 37

Slide 37 text

POST > all posts and gets after a post for a user use the master > after 3 seconds the user moves to a replica
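The GET and POST rules on the two slides above can be sketched as a small routing helper. This is an illustrative sketch only, not GitHub's code: the `ConnectionRouter` class, its in-memory dictionary, and the choice to treat every non-GET method as a write are assumptions. Only the 3 second window comes from the slide.

```python
import time

STICKY_SECONDS = 3.0  # from the slide: after 3 seconds the user moves to a replica


class ConnectionRouter:
    """Hypothetical helper that picks 'master' or 'replica' for each request."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_write = {}  # user id -> time of that user's most recent write

    def route(self, user_id, method):
        now = self._clock()
        # Assumption: any non-GET method is treated like a POST (a write).
        if method != "GET":
            self._last_write[user_id] = now
            return "master"
        last = self._last_write.get(user_id)
        # A GET shortly after the same user's write stays on the master,
        # so the user does not read stale data from a lagging replica.
        if last is not None and now - last < STICKY_SECONDS:
            return "master"
        return "replica"
```

A real deployment would need to keep the last-write time somewhere every app process can see it, such as the session or a cookie. The slides do not say which approach was used.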

Slide 38

Slide 38 text

refactoring > we wanted to take the smallest steps possible each time > we verified our changes at each step in the process

Slide 39

Slide 39 text

write alerts > how do we know we aren’t going to break anything? > we set up a connection we called “write alert” > write alert allowed writes but notified us
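One way to picture a "write alert" connection is as a thin wrapper that lets every statement through but reports the ones that write. The sketch below is a guess at the shape of the idea, not the actual implementation: the `WriteAlertConnection` name, the `notify` callback (standing in for a report to haystack), and the simple keyword-prefix check are all assumptions. It uses Python's `sqlite3` module only so that it runs without a MySQL server.

```python
import sqlite3

# Assumption: a statement counts as a write if it starts with one of these verbs.
WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP")


class WriteAlertConnection:
    """Hypothetical wrapper: allows writes, but notifies someone about each one."""

    def __init__(self, conn, notify):
        self._conn = conn      # the underlying DB-API connection
        self._notify = notify  # callback standing in for the exception tracker

    def execute(self, sql, params=()):
        if sql.lstrip().upper().startswith(WRITE_VERBS):
            # The write is reported, not blocked, so users are not affected.
            self._notify(sql)
        return self._conn.execute(sql, params)


if __name__ == "__main__":
    conn = WriteAlertConnection(sqlite3.connect(":memory:"), print)
    conn.execute("CREATE TABLE t (id INTEGER)")         # reported
    conn.execute("INSERT INTO t VALUES (?)", (1,))      # reported
    print(conn.execute("SELECT id FROM t").fetchall())  # not reported
```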

Slide 40

Slide 40 text

haystack > haystack is our exception tracking tool > backed by elasticsearch > awesome

Slide 41

Slide 41 text

write alerts

Slide 42

Slide 42 text

write alerts

Slide 43

Slide 43 text

write alerts > this allowed us to test moving to a read only connection without impacting users > we fixed any issues that came up > when we stopped getting alerts we knew we were ready to go read only

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

staff shipping > we staff ship features and changes to help us gain confidence

Slide 46

Slide 46 text

haproxy > needed a way of distributing queries among replicas > plenty of prior art

Slide 47

Slide 47 text

haproxy > we created haproxy pairs for ha and failover

Slide 48

Slide 48 text

gitauth > we started with a subset of our app > a proxy that checks you have permissions to push and pull to a repo > read intensive

Slide 49

Slide 49 text

% > slow ramp up > 1% > 5%
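A gradual ramp like this is often done by hashing a stable key into 100 buckets and sending the lowest buckets down the new path. The slides do not say how the percentage was applied, so everything here is an assumption: the `in_rollout` name, the use of CRC32, and the idea of keying on a request or user identifier.

```python
import zlib


def in_rollout(key, percent):
    """Return True if this key falls inside the current rollout percentage.

    Hypothetical helper: hashes the key into one of 100 buckets. The same key
    always lands in the same bucket, so raising the percentage from 1 to 5
    only adds keys and never removes any.
    """
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return bucket < percent
```

Because the bucket for a key is fixed, anything that used replicas at 1% keeps using them at 5%, which makes problems easier to reproduce while the percentage rises.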

Slide 50

Slide 50 text

heartbeat > permissions are replication sensitive > pt-heartbeat > gitauth checks > 1 second of delay = move back to the master
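The heartbeat rule can be sketched as a comparison between the current time and the newest heartbeat timestamp that has reached the replica. pt-heartbeat maintains such a timestamp row on the master, which then flows through replication. The function names and the way the timestamp is passed in are assumptions made for illustration. Only the 1 second limit comes from the slide.

```python
MAX_LAG_SECONDS = 1.0  # from the slide: 1 second of delay = move back to the master


def replication_lag(heartbeat_ts, now):
    """Seconds between now and the newest heartbeat visible on the replica."""
    return max(0.0, now - heartbeat_ts)


def pick_host(heartbeat_ts, now):
    """Hypothetical check: permission reads fall back to the master when the
    replica is a second or more behind."""
    if replication_lag(heartbeat_ts, now) >= MAX_LAG_SECONDS:
        return "master"
    return "replica"
```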

Slide 51

Slide 51 text

build confidence > rest of the app had to follow > keep upping the %

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

failover

Slide 55

Slide 55 text

PSUs > parts go > more parts to keep github up

Slide 56

Slide 56 text

clients > pause the request > reconnect through the proxy

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

performance degradation

Slide 59

Slide 59 text

keeping an eye > graphing at github is awesome > shout out to @jssjr github.com/jssjr

Slide 60

Slide 60 text

increase in latency > we noticed an upward trend in latency

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

multi process > hasn’t always worked well in the past > connections tended to stick to a process

Slide 64

Slide 64 text

kernel > upgrades were required for better balance

Slide 65

Slide 65 text

slow and steady > deploy app to use upgraded secondary haproxy > roll through the cluster

Slide 66

Slide 66 text

the down sides

Slide 67

Slide 67 text

hurry up > replication delay is painful > be careful where you can tolerate delay

Slide 68

Slide 68 text

cause > large updates, inserts, deletes > dependent destroy > transitions

Slide 69

Slide 69 text

effect > delay is painful > be careful where you can tolerate delay

Slide 70

Slide 70 text

remedy > a get after a post goes to the master

Slide 71

Slide 71 text

haystack > we modified the app > when a statement modifies too many rows we send it to haystack > insight
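A check like this can sit right where the app executes statements: run the statement, look at how many rows it touched, and report it if the count is over a limit. The sketch below is a minimal illustration under stated assumptions: the threshold value, the function name, and the `report` callback (standing in for a call to haystack) are hypothetical, and `sqlite3` is used only so the example runs without MySQL.

```python
import sqlite3

ROW_THRESHOLD = 1000  # hypothetical limit; the slides do not give the real one


def execute_with_row_check(conn, sql, params=(), report=print,
                           threshold=ROW_THRESHOLD):
    """Run a statement and report it if it modified more rows than allowed."""
    cur = conn.execute(sql, params)
    # rowcount is the number of rows changed by INSERT/UPDATE/DELETE
    # and -1 for statements such as SELECT, which never trip the check.
    if cur.rowcount > threshold:
        report({"sql": sql, "rows": cur.rowcount})
    return cur
```

Reporting instead of rejecting keeps the site working while still showing which statements are likely to cause replication delay.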

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

throttler > developers need to modify data > must be replication safe > query haproxy > check replicas
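The throttler idea can be sketched as a loop that applies a large change in small batches and waits whenever the replicas fall behind. This is not the real tool. The function name, the `max_replica_lag` callable (standing in for "ask haproxy which replicas are in the pool, then check each one"), and the default limits are assumptions.

```python
import time


def throttled_batches(ids, batch_size, max_replica_lag, apply_batch,
                      max_lag=1.0, pause=0.5, sleep=time.sleep):
    """Apply a change in batches, pausing while any replica is too far behind.

    max_replica_lag: hypothetical callable returning the worst replica lag in seconds.
    apply_batch:     callable that performs the write for one slice of ids.
    """
    for start in range(0, len(ids), batch_size):
        # Wait for replication to catch up before adding more write load.
        while max_replica_lag() > max_lag:
            sleep(pause)
        apply_batch(ids[start:start + batch_size])
```

A caller would pass something like `apply_batch=lambda chunk: delete_rows(chunk)`, where `delete_rows` is whatever small write the job needs (a hypothetical name here).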

Slide 74

Slide 74 text

contributions > email change > active users caused delay > support request > use the throttler

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

keeping things fast

Slide 77

Slide 77 text

tooling > tooling is essential > never underestimate the power of being able to write tools

Slide 78

Slide 78 text

log it > we built a slow query logger into the app
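An in-app slow query logger can be as small as a timer around each query. The sketch below shows one possible shape. The `timed_query` name, the 0.5 second threshold, and the `sink` parameter are assumptions for illustration, not details from the talk.

```python
import logging
import time
from contextlib import contextmanager

log = logging.getLogger("slow_query")
SLOW_SECONDS = 0.5  # hypothetical threshold


@contextmanager
def timed_query(sql, threshold=SLOW_SECONDS, clock=time.perf_counter, sink=None):
    """Time the wrapped block and log the SQL if it ran longer than threshold."""
    start = clock()
    try:
        yield
    finally:
        elapsed = clock() - start
        if elapsed >= threshold:
            emit = sink if sink is not None else log.warning
            emit("slow query (%.3fs): %s" % (elapsed, sql))
```

Usage would look like `with timed_query(sql): cursor.execute(sql)`. Logging from inside the app means the entry can carry app context, such as the request or the calling code, which a server-side slow log cannot see.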

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

haystack pager > developer on call > a spike in needles pages someone

Slide 82

Slide 82 text

toolbar > staff mode > see all queries on a page > with times > github.com/peek/peek

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

tooling > verification and improvement

Slide 87

Slide 87 text

slow transactions

Slide 88

Slide 88 text

migrations > query pile up > site stalls > bad user experience

Slide 89

Slide 89 text

observe > we noticed two issues: - table stats - metadata locking

Slide 90

Slide 90 text

table stats > innodb_stats_on_metadata > innodb_stats_auto_update > github.com/samlambert/pt-online-schema-change-analyze

Slide 91

Slide 91 text

metadata > queries piled up behind a metadata lock

Slide 92

Slide 92 text

pt-osc > table copy and swap
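The copy-and-swap idea behind pt-online-schema-change can be sketched in a few steps: create an empty table with the new schema, copy rows across in small chunks, then rename the tables so the new one takes the old one's place. The sketch below is heavily simplified and makes several assumptions. It uses `sqlite3` so it runs anywhere, it expects a positive integer `id` primary key, the `copy_and_swap` name and the temporary table names are hypothetical, and it leaves out the triggers the real tool uses to keep the copy in sync with writes that arrive during the copy.

```python
import sqlite3


def copy_and_swap(conn, table, new_ddl, columns, chunk=1000):
    """Rebuild `table` with a new schema by copying rows in chunks, then swapping.

    new_ddl: CREATE TABLE statement with a {name} placeholder for the table name.
    columns: columns that exist in both the old and the new schema.
    """
    new, old = "_%s_new" % table, "_%s_old" % table
    conn.execute(new_ddl.format(name=new))
    cols = ", ".join(columns)
    last_id = 0
    while True:
        # Copy the next chunk of rows, ordered by primary key.
        conn.execute(
            "INSERT INTO %s (%s) SELECT %s FROM %s WHERE id > ? ORDER BY id LIMIT ?"
            % (new, cols, cols, table),
            (last_id, chunk),
        )
        copied_to = conn.execute(
            "SELECT COALESCE(MAX(id), 0) FROM %s" % new
        ).fetchall()[0][0]
        if copied_to == last_id:
            break  # nothing new was copied, so the copy is complete
        last_id = copied_to
    # Swap: the new table takes over the original name.
    conn.execute("ALTER TABLE %s RENAME TO %s" % (table, old))
    conn.execute("ALTER TABLE %s RENAME TO %s" % (new, table))
    conn.execute("DROP TABLE %s" % old)
    conn.commit()
```

Copying in small chunks keeps each transaction short, which fits the "smaller transactions" advice later in the deck.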

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

No content

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

No content

Slide 100

Slide 100 text

prevention > smaller transactions > detection

Slide 101

Slide 101 text

No content

Slide 102

Slide 102 text

chatops

Slide 103

Slide 103 text

meet hubot > node.js > open source > github.com/github/hubot > hundreds of plugins

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

show and tell > it all happens in chat > amazing for learning > share the terminal

Slide 106

Slide 106 text

anything > drop tables > see who's in the office > deploy apps

Slide 107

Slide 107 text

culture > chat is central to our culture

Slide 108

Slide 108 text

remote > 52% of github is remote > how do you give everyone context?

Slide 109

Slide 109 text

automation > safe > intuitive > accessible > people will use it

Slide 110

Slide 110 text

explain > explain queries via hubot

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

explain > learn together > work as a team > no need for a meeting/email

Slide 113

Slide 113 text

profile > profile queries

Slide 114

Slide 114 text

No content

Slide 115

Slide 115 text

github.com/samlambert/hubot-mysql-chatops

Slide 116

Slide 116 text

shell > you do not have to write coffeescript! > 34279 lines of ruby and shell > wrapped by hubot

Slide 117

Slide 117 text

truncate > safe > visible > repeatable

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

No content

Slide 120

Slide 120 text

No content

Slide 121

Slide 121 text

backup > no excuse > available to anyone > uses an app called safehold

Slide 122

Slide 122 text

safehold > fires backup jobs into a queue > workers work on different types of jobs

Slide 123

Slide 123 text

restore > restore any logical backup > backups go to intermediate hosts

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

clone > clone tables onto test servers > great for testing indexes > developers use this a lot

Slide 126

Slide 126 text

proxy control > weight servers > take them from the pool

Slide 127

Slide 127 text

deploy > /deploy

Slide 128

Slide 128 text

graph me > /graph me -1h @mysql.rwps

Slide 129

Slide 129 text

No content

Slide 130

Slide 130 text

status > /status yellow > letting you all know

Slide 131

Slide 131 text

mitigate > attacks happen > why get sad? > use the chatops

Slide 132

Slide 132 text

questions?

Slide 133

Slide 133 text

SAM LAMBERT LEAD ENGINEER @ GITHUB github.com/samlambert samlambert.com twitter.com/isamlambert