grit
shelling out to git is expensive
grit reimplements portions of git in ruby
native packfile and git object support
2x-100x speedup on low-level operations
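To make "native git object support" concrete, here is a minimal sketch (in Python for illustration; grit itself is Ruby, and this is not its code) of reading a loose git object in-process rather than spawning a `git` subprocess. It relies only on the documented loose-object layout: a `<type> <size>` header, a NUL byte, then the body, all zlib-compressed. The function names are hypothetical, and packfiles are not covered.

```python
import hashlib
import zlib


def encode_loose_object(obj_type: str, body: bytes) -> tuple[str, bytes]:
    """Return (sha1 hex id, zlib-compressed bytes) in git's loose-object layout."""
    raw = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def decode_loose_object(compressed: bytes) -> tuple[str, bytes]:
    """Parse a loose object in-process, with no subprocess involved."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


# Round trip: encode a blob the way git would store it, then read it back.
object_id, stored = encode_loose_object("blob", b"hello\n")
obj_type, body = decode_loose_object(stored)
```

The saving comes from skipping process startup on every lookup; the decoding work itself is a single zlib call.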
Slide 21
Slide 21 text
grit
slowly reimplement git for speed
allows for incremental improvements
Slide 22
Slide 22 text
grit LED TO GITHUB
OCTOBER 19, 2007
Slide 23
Slide 23 text
TODAY
ADDING 2TB A MONTH
22 FILESERVER PAIRS
23TB OF REPO DATA
Slide 24
Slide 24 text
THE FOUR STAGES of
GITHUB GROWTH
Slide 25
Slide 25 text
GITHUB:
FOUR STAGES OF GROWTH
LOCAL NETWORKED NET-SHARD GITRPC
Slide 26
Slide 26 text
GITHUB:
FOUR STAGES OF GROWTH
LOCAL NETWORKED NET-SHARD GITRPC
2008 2009 2010 2012
Slide 27
Slide 27 text
GITHUB:
FOUR STAGES OF GROWTH
LOCAL NETWORKED NET-SHARD GITRPC
Slide 28
Slide 28 text
GITHUB:
FOUR STAGES OF GROWTH
JAN 2008 DEC 2008
42,000 USERS
Slide 29
Slide 29 text
GITHUB:
FOUR STAGES OF GROWTH
JAN 2008 DEC 2008
80,000 REPOSITORIES
Slide 30
Slide 30 text
LOCAL
MULTI-VM
SHARED GFS MOUNT
Slide 31
Slide 31 text
LOCAL
MULTI-VM
WEB FRONTENDS
BACKGROUND WORKERS
Slide 32
Slide 32 text
LOCAL
MULTI-VM
SIMPLE ARCHITECTURE
HORIZONTALLY SCALABLE-ish
Slide 33
Slide 33 text
LOCAL
SHARED GFS MOUNT
SHARED MOUNT ON EACH VM
SIMILAR PRODUCTION + DEVELOPMENT ACCESS
ALLOWED LOCAL ACCESS VIA GRIT
Slide 34
Slide 34 text
SIMPLE APPROACH, COMMON GIT
INTERFACE, QUICK TO BUILD AND SHIP
LOCAL
Slide 35
Slide 35 text
GITHUB:
FOUR STAGES OF GROWTH
LOCAL NETWORKED NET-SHARD GITRPC
Slide 36
Slide 36 text
GITHUB:
FOUR STAGES OF GROWTH
2008 2009 2010
166,000 USERS
Slide 37
Slide 37 text
GITHUB:
FOUR STAGES OF GROWTH
2008 2009 2010
484,000 REPOSITORIES
Slide 38
Slide 38 text
the problem:
GFS is slow
performance degraded as repos added
Slide 39
Slide 39 text
the problem:
we’re i/o-bound
read/write to disk needs to be fast
Slide 40
Slide 40 text
THE PLAN
NETWORKED
HARDWARE
MOVE DATACENTERS
Slide 41
Slide 41 text
NETWORKED
HARDWARE
bare metal servers
16 machines
6x RAM
machine roles
solid datacenter
got dat cloud
Slide 42
Slide 42 text
NETWORKED
LAUNCH:
SERVER PAIRS
FRONTENDS FILESERVERS AUX DB
Slide 43
Slide 43 text
NETWORKED
GRIT IS LOCAL
NEEDS TO BE NETWORKED
Slide 44
Slide 44 text
NETWORKED
smoke service is run on each fs;
facilitates disk access
chimney routes the smoke,
stores routing table in redis
stub local grit calls, retain API
usage, but send over network
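The arrangement above (a per-fileserver service, a router with a routing table, and stubbed calls that keep the local API) can be sketched roughly as follows. This is an illustrative Python sketch, not the actual smoke or chimney code: every class and method name is hypothetical, a plain dict stands in for the Redis routing table, and the "network call" just returns a description of the request it would send.

```python
class FileserverClient:
    """Hypothetical stand-in for the service running on one fileserver."""

    def __init__(self, host):
        self.host = host

    def call(self, repo_path, method, *args):
        # A real client would serialize this request and send it to `host`.
        return {"host": self.host, "repo": repo_path, "method": method, "args": args}


class Router:
    """Looks up which fileserver holds a repository."""

    def __init__(self, routing_table):
        self.routing_table = routing_table  # dict standing in for Redis
        self.clients = {}

    def client_for(self, repo_path):
        host = self.routing_table[repo_path]
        if host not in self.clients:
            self.clients[host] = FileserverClient(host)
        return self.clients[host]


class RemoteRepo:
    """Keeps a local-looking API but forwards every call through the router."""

    def __init__(self, router, path):
        self._router = router
        self._path = path

    def __getattr__(self, method):
        # Only reached for names not set in __init__, i.e. the stubbed API calls.
        def forward(*args):
            return self._router.client_for(self._path).call(self._path, method, *args)
        return forward


router = Router({"example/repo.git": "fs5"})
repo = RemoteRepo(router, "example/repo.git")
result = repo.commits("master", 10)  # same call shape as before, now routed to fs5
```

Callers keep writing `repo.commits(...)`; only the object behind `repo` changes, which is what lets the rest of the application stay untouched.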
Slide 45
Slide 45 text
NETWORKED
server pairs offer failover via DRBD
real servers, real big RAM allocations
Slide 46
Slide 46 text
NETWORKED
LATENCY
networked routing adds 2-10ms per request
optimize for the roundtrip
smoke contains smarter server-side logic
Slide 47
Slide 47 text
NETWORKED
LATENCY
smoke has custom git extension commands
git-distinct-commits
returns commits only contained on a given branch
calls to git-show-ref and git-rev-list
run all calls server-side in one roundtrip
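As a rough illustration of what "commits only contained on a given branch" means, the sketch below computes it on a toy in-memory commit graph: everything reachable from the branch tip, minus everything reachable from any other ref. This is an assumption about the semantics, not the actual git-distinct-commits implementation; on a real repository the same idea is roughly `git rev-list <branch> --not <other refs>`. The graph, ref names, and function names are made up.

```python
def reachable(parents, tip):
    """All commits reachable from `tip` by following parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return seen


def distinct_commits(parents, refs, branch):
    """Commits reachable from `branch` but from no other ref."""
    others = set()
    for name, tip in refs.items():
        if name != branch:
            others |= reachable(parents, tip)
    return reachable(parents, refs[branch]) - others


# Toy history: a <- b <- c (master), and b <- d <- e (topic).
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["d"]}
refs = {"master": "c", "topic": "e"}
only_on_topic = distinct_commits(parents, refs, "topic")
```

Done client-side, this needs one request to list refs plus one per traversal; run on the fileserver, the whole computation costs a single roundtrip.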
Slide 48
Slide 48 text
NETWORKED
HORIZONTALLY-SCALABLE, LATENCY-CONSIDERATE,
API-COMPATIBLE WITH GRIT
Slide 49
Slide 49 text
GITHUB:
FOUR STAGES OF GROWTH
LOCAL NETWORKED NET-SHARD GITRPC
Slide 50
Slide 50 text
GITHUB:
FOUR STAGES OF GROWTH
2008 2009 2010 2011
510,000 USERS
Slide 51
Slide 51 text
GITHUB:
FOUR STAGES OF GROWTH
2008 2009 2010 2011
1.3MM REPOSITORIES
Slide 52
Slide 52 text
the problem:
data duplication
each fork is a full project history
Slide 53
Slide 53 text
data duplication
i create a repo
you fork my repo
fs5:/data/repositories/6/nw/6b/de/92/1/1.git
fs7:/data/repositories/4/na/3b/dr/72/2/2.git
Slide 54
Slide 54 text
data duplication
my repo: 1,000 commits, 10MB
your fork: 1,001 commits, 10MB
= 20MB total disk
Slide 55
Slide 55 text
data duplication
GOAL:
my repo: 1,000 commits, 10MB
your fork: 1 commit, 1KB
= 10MB total disk
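The goal above amounts to storing each object once and letting forks reference it. A minimal sketch of that idea, assuming a content-addressed pool shared by a repository and its forks (the `ObjectPool` class and the commit payloads are hypothetical, and this slide does not say how the goal was actually reached):

```python
import hashlib


class ObjectPool:
    """Content-addressed store shared by a repository and all of its forks."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self.objects.setdefault(key, data)  # identical content is stored once
        return key

    def disk_usage(self) -> int:
        return sum(len(data) for data in self.objects.values())


pool = ObjectPool()
original = [pool.put(b"commit %d" % i) for i in range(1000)]
fork = list(original)  # a fork starts as references, not copies
usage_before = pool.disk_usage()
fork.append(pool.put(b"fork-only commit"))  # the fork's single new commit
usage_after = pool.disk_usage()
```

The fork lists 1,001 commits, but disk usage grows only by the size of the one new object.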
Slide 56
Slide 56 text
data duplication
75 MB repo x 3.5k forks = ~250 GB
x 2 fs pairs + offsite backups
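The arithmetic behind these figures, as a quick check. It assumes decimal units (1 GB = 1000 MB) and reads "x 2" as the doubling from keeping each repository on both machines of a fileserver pair; offsite backups are not counted.

```python
repo_mb = 75
forks = 3_500

total_gb = repo_mb * forks / 1000  # 262.5 GB, the slide's "~250 GB"
replicated_gb = total_gb * 2       # 525.0 GB once mirrored across a fileserver pair
```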