How to Build a GitHub
Zach Holman
August 05, 2012
Programming
Learn about the growth patterns and the architecture behind github.com.
Transcript
HOW TO BUILD A GITHUB
6.5MM REPOSITORIES · 1.9MM USERS · SINCE 2008 · LARGEST GIT HOST · SVN HOST
GOING TO SHOW YOU OUR CARDS
THERE IS NO MAGIC BULLET
THE FOUR STAGES OF GROWTH · AUTOMATE EVERYTHING · THE HAPPINESS
NO FORKING HOLMAN @ LOST YO QUIT READING THIS SHIT
HOW DID WE GIT HERE?
1809: PERL INVENTED
1814: COMPUTERS INVENTED
1814-2004: ANARCHY AND CHAOS AND ZOMG EVERYONE’S DYING
2005: VERSION CONTROL INVENTED (git)
2007: GLOBAL PEACE AND HAPPINESS ACHIEVED (github)
...or something like that
TOM PRESTON-WERNER · GRIT · OCTOBER 9, 2007
GRIT: git via ruby · github’s interface to git · object-oriented, read/write · open source
grit:
    repo = Grit::Repo.new('/tmp/repository')
    repo.commits
grit: shelling out to git is expensive · grit reimplements portions of git in ruby · native packfile and git object support · 2x-100x speedup on low-level operations
grit: slowly reimplement grit for speed · allows for incremental improvements
grit LED TO GITHUB · OCTOBER 19, 2007
TODAY: 23TB OF REPO DATA · ADDING 2TB A MONTH · 22 FILESERVER PAIRS
THE FOUR STAGES of GITHUB GROWTH
GITHUB: FOUR STAGES OF GROWTH · LOCAL · NETWORKED · NET-SHARD · GITRPC (2008 · 2009 · 2010 · 2012)
GITHUB: FOUR STAGES OF GROWTH · JAN 2008 to DEC 2008 · 42,000 USERS · 80,000 REPOSITORIES
LOCAL: MULTI-VM · SHARED GFS MOUNT
LOCAL · MULTI-VM: WEB FRONTENDS · BACKGROUND WORKERS
LOCAL · MULTI-VM: SIMPLE ARCHITECTURE · HORIZONTALLY SCALABLE-ish
LOCAL · SHARED GFS MOUNT: SHARED MOUNT ON EACH VM · SIMILAR PRODUCTION + DEVELOPMENT ACCESS · ALLOWED LOCAL ACCESS VIA GRIT
LOCAL: SIMPLE APPROACH, COMMON GIT INTERFACE, QUICK TO BUILD AND SHIP
GITHUB: FOUR STAGES OF GROWTH · LOCAL · NETWORKED · NET-SHARD · GITRPC
GITHUB: FOUR STAGES OF GROWTH · 2008 · 2009 · 2010 · 166,000 USERS · 484,000 REPOSITORIES
the problem: GFS is slow · performance degraded as repos added
the problem: we’re i/o-bound · read/write to disk needs to be fast
NETWORKED · THE PLAN: HARDWARE · MOVE DATACENTERS
NETWORKED · HARDWARE: bare metal servers · 16 machines · 6x RAM · machine roles · solid datacenter · got dat cloud
NETWORKED · LAUNCH: FRONTENDS · FILESERVERS · AUX · DB · SERVER PAIRS
NETWORKED: GRIT IS LOCAL, NEEDS TO BE NETWORKED
NETWORKED · smoke: service is run on each fs; facilitates disk access · chimney: routes the smoke, stores routing table in redis · stub local grit calls: retain API usage, but send over network
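The smoke / chimney / stub split on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not GitHub's code: `RoutingTable`, `FakeFileserver`, `NetworkedRepo` and the repository name `example/repo` are hypothetical, and a plain Hash stands in for the Redis routing table so the sketch runs without any services.

```ruby
# Hypothetical stand-in for chimney: maps a repository to the fileserver that
# holds it. The slides say the routing table is stored in Redis; a Hash is
# used here purely for illustration.
class RoutingTable
  def initialize
    @routes = {}
  end

  def assign(repo_name, fileserver_name)
    @routes[repo_name] = fileserver_name
  end

  def lookup(repo_name)
    @routes.fetch(repo_name) { raise KeyError, "no route for #{repo_name}" }
  end
end

# Hypothetical stand-in for a smoke service running on one fileserver.
class FakeFileserver
  attr_reader :name, :calls

  def initialize(name)
    @name = name
    @calls = []
  end

  # Pretend to run a git operation next to the disk and return a result.
  def call(repo_name, operation)
    @calls << [repo_name, operation]
    "#{operation} on #{repo_name} served by #{name}"
  end
end

# Stub with a Grit-like surface: callers still ask for `commits`, but the
# call is routed to a fileserver instead of touching local disk.
class NetworkedRepo
  def initialize(repo_name, routing_table, fileservers)
    @repo_name = repo_name
    @routing_table = routing_table
    @fileservers = fileservers
  end

  def commits
    server = @fileservers.fetch(@routing_table.lookup(@repo_name))
    server.call(@repo_name, :commits)
  end
end

fileservers = {
  "fs5" => FakeFileserver.new("fs5"),
  "fs7" => FakeFileserver.new("fs7")
}
routes = RoutingTable.new
routes.assign("example/repo", "fs5")

repo = NetworkedRepo.new("example/repo", routes, fileservers)
result = repo.commits
puts result
```

The point of the stub is that calling code keeps the same shape as the local Grit call while the work happens on whichever fileserver the routing table names.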
NETWORKED: server pairs offer failover via DRBD · real servers, real big RAM allocations
NETWORKED · LATENCY: networked routing adds 2-10ms per request · optimize for the roundtrip · smoke contains smarter server-side logic
NETWORKED · LATENCY: smoke has custom git extension commands · git-distinct-commits returns commits only contained on a given branch · calls to git-show-refs and git-rev-list · run all calls server-side in one roundtrip
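To make the "one roundtrip" point concrete, here is a small sketch that counts roundtrips for a chatty client versus a batched one. Everything in it is assumed for illustration: `CountingServer` is hypothetical, the operation names are just labels, and the 6ms figure is an arbitrary midpoint of the 2-10ms range quoted on the slides.

```ruby
# Assumed per-request network overhead in milliseconds. The slides give a
# range of 2-10ms; the midpoint is an arbitrary choice for illustration.
ROUNDTRIP_MS = 6

# Hypothetical server that counts how many network roundtrips it receives.
class CountingServer
  attr_reader :roundtrips

  def initialize
    @roundtrips = 0
  end

  # One roundtrip per operation: the chatty approach.
  def run(operation)
    @roundtrips += 1
    "result of #{operation}"
  end

  # One roundtrip for a whole batch: the operations run server-side together.
  def run_batch(operations)
    @roundtrips += 1
    operations.map { |operation| "result of #{operation}" }
  end
end

operations = %w[show-refs rev-list rev-list]

chatty = CountingServer.new
operations.each { |operation| chatty.run(operation) }

batched = CountingServer.new
batched.run_batch(operations)

chatty_overhead_ms = chatty.roundtrips * ROUNDTRIP_MS
batched_overhead_ms = batched.roundtrips * ROUNDTRIP_MS
puts "chatty: #{chatty_overhead_ms}ms, batched: #{batched_overhead_ms}ms"
```

Under these assumed numbers the network overhead scales with the number of calls in the chatty case and stays constant in the batched case, which is the motivation for pushing logic to the server side.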
NETWORKED: HORIZONTALLY-SCALABLE, LATENCY-CONSIDERATE, API-COMPATIBLE WITH GRIT
GITHUB: FOUR STAGES OF GROWTH · LOCAL · NETWORKED · NET-SHARD · GITRPC
GITHUB: FOUR STAGES OF GROWTH · 2008 · 2009 · 2010 · 2011 · 510,000 USERS · 1.3MM REPOSITORIES
the problem: data duplication · each fork is a full project history
data duplication: i create a repo: fs5:/data/repositories/6/nw/6b/de/92/1/1.git · you fork my repo: fs7:/data/repositories/4/na/3b/dr/72/2/2.git
data duplication: 1,000 commits (10MB) + 1,001 commits (10MB) = 20MB total disk
data duplication · GOAL: 1,000 commits (10MB) + 1 commit (1KB) = 10MB total disk
data duplication: 75 MB repo x 3.5k forks = ~250 GB · x 2 fs pairs + offsite backups
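The disk figures on these slides can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 1000 MB, 1 MB = 1000 KB), which the slides do not specify; with that assumption the fork total lands a little above the "~250 GB" quoted, in the same ballpark.

```ruby
# Figures taken from the slides.
repo_size_mb = 75
fork_count = 3_500

# Every fork stored as a full copy of the project history.
full_copies_mb = repo_size_mb * fork_count
full_copies_gb = full_copies_mb / 1000.0 # assumes 1 GB = 1000 MB

# Two-repo example from the slides, in KB (assumes 1 MB = 1000 KB).
original_kb = 10 * 1000
duplicated_total_kb = original_kb + original_kb # the fork is a full copy
shared_total_kb = original_kb + 1               # the fork adds only its 1KB commit

puts "forks as full copies: #{full_copies_gb} GB"
puts "duplicated: #{duplicated_total_kb} KB, shared: #{shared_total_kb} KB"
```

The total then doubles again across the two machines of a fileserver pair, before offsite backups are counted.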
NET-SHARD: shard by repository network (“forks”)
NET-SHARD: network.git CONTAINS ALL REFS · 1.git, 2.git, 3.git, 4.git: CONTAINS DELTA
NET-SHARD · network.git · GIT ALTERNATES: store git object data externally to repository · we fetch refs into your fork, transparently
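Git's alternates mechanism is a plain text file, `objects/info/alternates`, listing other object directories to read from. The sketch below shows one way a fork such as `1.git` could be pointed at a shared `network.git`; the directory layout and the helper name `write_alternates` are assumptions for illustration, and the slides do not show how GitHub actually sets this up.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: point a fork's object store at the shared network
# repository by writing git's objects/info/alternates file.
def write_alternates(fork_git_dir, network_git_dir)
  info_dir = File.join(fork_git_dir, "objects", "info")
  FileUtils.mkdir_p(info_dir)
  alternates_path = File.join(info_dir, "alternates")
  # One object directory per line; git also searches these for objects.
  File.write(alternates_path, File.join(network_git_dir, "objects") + "\n")
  alternates_path
end

alternates_content = nil
Dir.mktmpdir do |root|
  network_dir = File.join(root, "network.git")
  fork_dir = File.join(root, "1.git")
  FileUtils.mkdir_p(File.join(network_dir, "objects"))

  path = write_alternates(fork_dir, network_dir)
  alternates_content = File.read(path)
end

puts alternates_content
```

With this file in place, a fork only needs to store objects that the shared repository does not already have.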
NET-SHARD · network.git · PRIVACY: potential leaking of refs cross-network · net-shard enabled on all-public and all-private repository networks only
NET-SHARD · network.git · DISK: halves disk usage · increase disk and kernel cache hits
NET-SHARD · network.git · MIGRATION: gradually transitioned repos to network.git · effectively feature-flagged by repo
NET-SHARD: SAVE DISK, IMPROVE PERFORMANCE
GITHUB: FOUR STAGES OF GROWTH · LOCAL · NETWORKED · NET-SHARD · GITRPC
GITHUB: FOUR STAGES OF GROWTH · 2008 · 2009 · 2010 · 2011 · 2012 · 1.2MM USERS · AUGUST: 1.9MM USERS
GITHUB: FOUR STAGES OF GROWTH · 2008 · 2009 · 2010 · 2011 · 2012 · 3.4MM REPOSITORIES · AUGUST: 6.5MM REPOSITORIES
the problem: GRIT · git via ruby
the problem: local, ruby-based grit ended up in a high-traffic distributed system
the problem: inelegant code spread out everywhere
GITRPC: network-oriented library for git access
GITRPC: libgit2 · open source · fastest git implementation (C) · github-sponsored project · bindings for all major languages · used in our mac, windows clients
GITRPC: rugged (RUBY) · libgit2 (C) · gitrpc (RUBY)
GITRPC · LATENCY: like smoke, gitrpc aims to reduce latency by reducing roundtrips
GITRPC · CACHING: operations cached on library level · yank out tons of app-level cache logic
GITRPC · MIGRATION: the move to gitrpc started this summer and will take months · gradually replace smoke and grit; avoids a risky deploy
GITRPC: FAST AND STABLE NETWORKED GIT ACCESS
GITHUB: FOUR STAGES OF GROWTH · LOCAL · NETWORKED · NET-SHARD · GITRPC
identify WHAT’S BROKEN
small CHANGES, FAST DEVELOPMENT
real CODE BEATS IMAGINARY CODE
automate EVERYTHING
mr. manager: LOL · DEVELOPERS · SOFTWARE DEVELOPMENT
mr. manager: DEADLINES · MEETINGS · PRIORITIES · ESTIMATES
EVERYONE is A MANAGER
AUTOMATE AWAY PAIN: DEPLOYMENT · RECOVERY · DEVELOPMENT
automate DEVELOPMENT
DEVELOPMENT: RUN THIS IN EACH PROJECT: > ./do-work ...AND YOU’RE DONE! loljk
DEVELOPMENT: YOU CAN AUTOMATE THE PAIN OF DEVELOPMENT
DEVELOPMENT: the SETUP
the SETUP: ONE-LINER INSTALLS ALL GITHUB DEVELOPMENT DEPENDENCIES
the SETUP: 30 min · CLEAN MACHINE TO FULL DEVELOPMENT ENVIRONMENT
the SETUP: NEW EMPLOYEES SHIP THEIR FIRST WEEK
the SETUP: PUPPET HANDLES ALL DEPENDENCIES
automate DEPLOYMENT
DEPLOYMENT: REAL BROGRAMMERS DEPLOY WITH NO FEAR. SO FUCK THAT
DEPLOYMENT: DEPLOYS SHOULD BE CAUTIOUS, COMMONPLACE, AND AUTOMATED
DEPLOYMENT: GITHUB DEPLOYS 20-40 TIMES A DAY
DEPLOYMENT: PUSH BRANCH (HUBOT RUNS TESTS IN ABOUT 200 SECONDS; USUALLY OPEN A PULL REQUEST) · DEPLOY BRANCH (EVERYWHERE · MACHINE CLASS · SPECIFIC SERVERS)
DEPLOYMENT · DEPLOY LOCKING: CAN’T DEPLOY IF A BRANCH IS DEPLOYED · AUTODEPLOYS: PUSHED TO MASTER WITH GREEN TESTS? DEPLOY.
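The two rules on this slide can be sketched as a tiny lock plus a predicate. The class `DeployLock`, the method `autodeploy?`, and the choice to make autodeploys respect the lock are all assumptions for illustration; the slides state the rules but not how they are implemented.

```ruby
# Hypothetical deploy lock: while one branch is deployed, no other branch
# can be deployed until the lock is released.
class DeployLock
  attr_reader :holder

  def initialize
    @holder = nil
  end

  # Returns true if the branch now holds the lock, false if another does.
  def acquire(branch)
    return false if @holder && @holder != branch

    @holder = branch
    true
  end

  # Only the branch that holds the lock can release it.
  def release(branch)
    @holder = nil if @holder == branch
  end
end

# Autodeploy rule from the slide: pushed to master with green tests? Deploy.
# Checking that the lock is free is an added assumption of this sketch.
def autodeploy?(branch:, tests_green:, lock:)
  branch == "master" && tests_green && lock.holder.nil?
end

lock = DeployLock.new
first = lock.acquire("feature-a")   # feature-a is deployed and takes the lock
second = lock.acquire("feature-b")  # refused while feature-a is deployed
blocked = autodeploy?(branch: "master", tests_green: true, lock: lock)

lock.release("feature-a")
allowed = autodeploy?(branch: "master", tests_green: true, lock: lock)
red = autodeploy?(branch: "master", tests_green: false, lock: lock)

puts [first, second, blocked, allowed, red].inspect
```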
DEPLOYMENT · STAFF-ONLY FEATURE FLAGS: LIMITS EXPOSURE · REAL-WORLD · AVOIDS MERGES
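A staff-only feature flag can be as small as the sketch below. The `User` struct, the `FeatureFlags` class and the feature name `:new_dashboard` are hypothetical; the idea is only that unfinished code ships to production but is exposed to staff accounts alone.

```ruby
# Hypothetical user record with a staff marker.
User = Struct.new(:login, :staff)

# Hypothetical flag registry: a feature is either off for everyone or
# switched on for staff accounts only.
class FeatureFlags
  def initialize
    @staff_only = {}
  end

  def enable_for_staff(feature)
    @staff_only[feature] = true
  end

  def enabled?(feature, user)
    @staff_only.fetch(feature, false) && user.staff
  end
end

flags = FeatureFlags.new
flags.enable_for_staff(:new_dashboard)

staff_member = User.new("staffer", true)
visitor = User.new("visitor", false)

staff_sees_it = flags.enabled?(:new_dashboard, staff_member)
visitor_sees_it = flags.enabled?(:new_dashboard, visitor)
unknown_feature = flags.enabled?(:not_registered, staff_member)

puts [staff_sees_it, visitor_sees_it, unknown_feature].inspect
```

Because the code path is already merged and running in production, the branch does not drift from master while the feature is being finished.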
automate RECOVERY
RECOVERY: SOMETHING WILL ALWAYS BREAK
RECOVERY: HUBOT IS A SYSADMIN
RECOVERY: HUBOT LOAD: SERVER LOAD · HUBOT QUERIES: RUNNING DB QUERIES · HUBOT CONNS: ALL OPEN CONNECTIONS
RECOVERY: HUBOT RESTORE <REPO>: RESTORE A REPO FROM BACKUPS · HUBOT PUSH-LOG <REPO>: SEE RECENT PUSH LOGS TO A REPO · HUBOT GH-EACH <HOST> <COMMAND>: RUN COMMAND ON SPECIFIC HOSTS
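As a rough illustration of how chat commands like these might be recognized, the sketch below parses the six commands named on the slides. It is not Hubot's code: the `COMMANDS` table, `parse_command`, and the example arguments are assumptions, and nothing is actually executed.

```ruby
# Command names and summaries from the slides; argument counts are inferred
# from the <REPO>, <HOST> and <COMMAND> placeholders shown there.
COMMANDS = {
  "load"     => { args: 0, summary: "server load" },
  "queries"  => { args: 0, summary: "running db queries" },
  "conns"    => { args: 0, summary: "all open connections" },
  "restore"  => { args: 1, summary: "restore a repo from backups" },
  "push-log" => { args: 1, summary: "see recent push logs to a repo" },
  "gh-each"  => { args: 2, summary: "run command on specific hosts" }
}.freeze

# Returns a Hash describing the command, or nil if the message is not a
# recognized, well-formed command.
def parse_command(message)
  bot, name, *args = message.split
  return nil unless bot == "hubot"

  spec = COMMANDS[name]
  return nil if spec.nil?

  # gh-each takes a host followed by a command that may contain spaces.
  args = [args.first, args.drop(1).join(" ")] if name == "gh-each" && args.size >= 2
  return nil unless args.size == spec[:args]

  { name: name, args: args, summary: spec[:summary] }
end

restore = parse_command("hubot restore example/repo")
each_host = parse_command("hubot gh-each fs5 uptime -p")
missing_arg = parse_command("hubot restore")
not_for_bot = parse_command("please restore example/repo")

puts restore.inspect
puts each_host.inspect
```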
RECOVERY: HIGH-LEVEL OVERVIEW IN MINUTES · SPEND MORE TIME FIXING AND LESS TIME INVESTIGATING
the happiness
5 YEARS · 108 EMPLOYEES · ZERO EMPLOYEES HAVE QUIT
HIRE: 1-2 MONTHS · RAMP-UP: 1-3 MONTHS · LEAVE: 2 WEEKS
LOSING AN EMPLOYEE CAN SET YOU BACK HALF A YEAR
remove ANY REASON TO LEAVE
TDD ✓ PAIR PROGRAMMING ✓ BDD ✓ TEST-FIRST ✓ DESIGN-FIRST ✓ EMACS x (just kidding) NONE OF THESE ✓
WE CARE ABOUT THE WORK YOU DO, NOT ABOUT HOW
YOU DO IT
LOCATION · HOURS · DIRECTION
LOCATION: ⅔ OF GITHUB EMPLOYEES WORK REMOTELY
LOCATION: FAMILY RELOCATION, TRAVEL FREEDOM
HOURS: CHOOSE YOUR SCHEDULE · CHOOSE YOUR VACATIONS · FRESH, CREATIVE EMPLOYEES
DIRECTION: YOU HACK ON THINGS THAT INTEREST YOU · REDUCES BURNOUT
BE flexible TOWARDS WORK/LIFE: LOCATION · HOURS · DIRECTION
github
basically, MOVE FAST = SMALL CHANGES
basically, BE STABLE = DEPLOY CONSTANTLY
basically, HAPPY COMPANY = HAPPY EMPLOYEES
thanks
NO FORKING HOLMAN @ LOST YO QUIT READING THIS SHIT
ZACHHOLMAN.COM/TALKS