How to Build a GitHub

Learn about the growth patterns and the architecture behind github.com.

Zach Holman

August 05, 2012

Transcript


  1. github
    HOW to BUILD a GITHUB

  2. github

  3. 6.5MM REPOSITORIES
    LARGEST GIT HOST
    1.9MM USERS
    SINCE 2008

  4. 6.5MM REPOSITORIES
    LARGEST GIT HOST
    1.9MM USERS
    SINCE 2008
    SVN HOST

  6. going to SHOW YOU OUR CARDS

  7. there is no MAGIC BULLET

  8. FOUR STAGES OF GROWTH
    happiness
    the
    automate EVERYTHING

  9. @HOLMAN
    NO FORKING
    LOST
    YO QUIT READING THIS SHIT

  10. how DID WE GIT HERE

  11. 1809:
    PERL INVENTED

  12. 1814:
    COMPUTERS INVENTED

  13. 1814-2004:
    ANARCHY AND CHAOS AND ZOMG EVERYONE’S DYING

  14. 2005:
    git
    VERSION CONTROL INVENTED

  15. 2007:
    github
    GLOBAL PEACE AND HAPPINESS ACHIEVED

  16. ...or something like that

  17. GRIT
    git via ruby
    TOM PRESTON-WERNER
    OCTOBER 9, 2007

  18. GRIT
    git via ruby
    github’s interface to git
    object-oriented, read/write
    open source

  19. grit
    repo = Grit::Repo.new('/tmp/repository')
    repo.commits

  20. grit
    shelling out to git is expensive
    grit reimplements portions of git in ruby
    native packfile and git object support
    2x-100x speedup on low-level operations

  21. grit
    slowly reimplement grit for speed
    allows for incremental improvements
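
To make "native git object support" concrete, here is a minimal sketch of reading and writing a loose git object directly from Ruby, without shelling out to `git`. It is not Grit's code or API: the `LooseObjects` module and its method names are hypothetical, and packfiles are ignored entirely. It relies only on git's documented loose-object format (a zlib-compressed `"<type> <size>\0<content>"` payload stored under the SHA-1 of that payload).

```ruby
require "digest"
require "zlib"
require "fileutils"
require "tmpdir"

# Hypothetical helper, for illustration only (not part of Grit).
module LooseObjects
  # A git object id is the SHA-1 of "<type> <size>\0<content>".
  def self.object_id_for(type, content)
    Digest::SHA1.hexdigest("#{type} #{content.bytesize}\0#{content}")
  end

  # Writes a loose object under objects_dir and returns its id.
  def self.write(objects_dir, type, content)
    sha = object_id_for(type, content)
    path = File.join(objects_dir, sha[0, 2], sha[2..-1])
    FileUtils.mkdir_p(File.dirname(path))
    payload = "#{type} #{content.bytesize}\0#{content}"
    File.binwrite(path, Zlib::Deflate.deflate(payload))
    sha
  end

  # Reads a loose object back and returns [type, content].
  def self.read(objects_dir, sha)
    path = File.join(objects_dir, sha[0, 2], sha[2..-1])
    raw = Zlib::Inflate.inflate(File.binread(path))
    header, body = raw.split("\0", 2)
    type, size = header.split(" ")
    raise "corrupt object #{sha}" unless body.bytesize == size.to_i
    [type, body]
  end
end

# Usage: round-trip a blob through a temporary object directory.
Dir.mktmpdir do |objects_dir|
  sha = LooseObjects.write(objects_dir, "blob", "hello world\n")
  type, content = LooseObjects.read(objects_dir, sha)
  puts "#{sha} #{type} #{content.bytesize} bytes"
end
```

Each call here is a file read plus a zlib inflate inside the Ruby process, with no child process to spawn, which is the kind of saving the slide attributes to Grit's native code paths.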

  22. grit
    LED TO GITHUB
    OCTOBER 19, 2007

  23. TODAY
    ADDING 2TB A MONTH
    22 FILESERVER PAIRS
    23TB OF REPO DATA

  24. THE FOUR STAGES of GITHUB GROWTH

  25. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  26. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC
    2008 2009 2010 2012

  27. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  28. GITHUB: FOUR STAGES OF GROWTH
    JAN 2008 DEC 2008
    42,000 USERS

  29. GITHUB: FOUR STAGES OF GROWTH
    JAN 2008 DEC 2008
    80,000 REPOSITORIES

  30. LOCAL
    MULTI-VM
    SHARED GFS MOUNT

  31. LOCAL
    MULTI-VM
    WEB FRONTENDS
    BACKGROUND WORKERS

  32. LOCAL
    MULTI-VM
    SIMPLE ARCHITECTURE
    HORIZONTALLY SCALABLE-ish

  33. LOCAL
    SHARED GFS MOUNT
    SHARED MOUNT ON EACH VM
    SIMILAR PRODUCTION + DEVELOPMENT ACCESS
    ALLOWED LOCAL ACCESS VIA GRIT

  34. LOCAL
    SIMPLE APPROACH, COMMON GIT INTERFACE, QUICK TO BUILD AND SHIP

  35. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  36. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010
    166,000 USERS

  37. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010
    484,000 REPOSITORIES

  38. the problem: GFS is slow
    performance degraded as repos added

  39. the problem: we’re i/o-bound
    read/write to disk needs to be fast

  40. THE PLAN
    NETWORKED
    HARDWARE
    MOVE DATACENTERS

  41. NETWORKED
    HARDWARE
    bare metal servers
    16 machines
    6x RAM
    machine roles
    solid datacenter
    got dat cloud

  42. NETWORKED
    LAUNCH:
    FRONTENDS FILESERVERS AUX DB
    SERVER PAIRS

  43. NETWORKED
    GRIT IS LOCAL
    NEEDS TO BE NETWORKED

  44. NETWORKED
    smoke service is run on each fs; facilitates disk access
    chimney routes the smoke, stores routing table in redis
    stub local grit calls, retain API usage, but send over network
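
The routing idea on this slide can be sketched in a few lines of Ruby. This is an in-process toy, not GitHub's smoke or chimney code: every class name below is hypothetical, a plain Hash stands in for the Redis routing table, and a local object stands in for the smoke service on a fileserver. The point it illustrates is that callers keep a Grit-like `repo.commits` call while the work is dispatched to whichever fileserver holds the repository.

```ruby
# Stand-in for the smoke service running on one fileserver (hypothetical).
class FakeSmokeServer
  def initialize(repos)
    @repos = repos # repo path => array of commit ids
  end

  # Handles one request that would normally arrive over the network.
  def call(repo_path, method, *_args)
    case method
    when :commits then @repos.fetch(repo_path)
    else raise NoMethodError, "unsupported remote call: #{method}"
    end
  end
end

# Stand-in for chimney: maps a repository to the fileserver that stores it.
class Chimney
  def initialize(routes, servers)
    @routes = routes   # repo path => fileserver name (Redis in the talk)
    @servers = servers # fileserver name => connection object
  end

  def server_for(repo_path)
    @servers.fetch(@routes.fetch(repo_path))
  end
end

# Keeps the local-looking API, but every call is routed to a fileserver.
class RoutedRepo
  def initialize(chimney, repo_path)
    @chimney = chimney
    @repo_path = repo_path
  end

  def commits
    @chimney.server_for(@repo_path).call(@repo_path, :commits)
  end
end

# Usage with made-up repository paths and commit ids.
servers = {
  "fs1" => FakeSmokeServer.new("a/1.git" => %w[c1 c2]),
  "fs2" => FakeSmokeServer.new("b/2.git" => %w[c9])
}
chimney = Chimney.new({ "a/1.git" => "fs1", "b/2.git" => "fs2" }, servers)
puts RoutedRepo.new(chimney, "b/2.git").commits.inspect
```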

  45. NETWORKED
    server pairs offer failover via DRBD
    real servers, real big RAM allocations

  46. NETWORKED
    LATENCY
    networked routing adds 2-10ms per request
    optimize for the roundtrip
    smoke contains smarter server-side logic

  47. NETWORKED
    LATENCY
    smoke has custom git extension commands
    git-distinct-commits
    returns commits only contained on a given branch
    calls to git-show-ref and git-rev-list
    run all calls server-side in one roundtrip
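
As a rough illustration of what "commits only contained on a given branch" means, the sketch below computes it on a toy in-memory commit graph. It is not the git-distinct-commits implementation: the function names and the graph are made up for this example. With stock git, a similar answer can be obtained with `git rev-list <branch> --not <other refs>`; the slide's point is that the ref lookup and the revision walk both run on the fileserver, so the client pays for one roundtrip instead of several.

```ruby
require "set"

# parents maps a commit id to the ids of its parent commits.
# Returns every commit reachable from tip, including tip itself.
def reachable(parents, tip)
  seen = Set.new
  stack = [tip]
  until stack.empty?
    sha = stack.pop
    next unless seen.add?(sha) # add? returns nil if already visited
    stack.concat(parents.fetch(sha, []))
  end
  seen
end

# Commits reachable from the given branch but from no other ref.
def distinct_commits(parents, refs, branch)
  other_tips = refs.reject { |name, _tip| name == branch }.values
  shared = other_tips.reduce(Set.new) { |acc, tip| acc | reachable(parents, tip) }
  reachable(parents, refs.fetch(branch)) - shared
end

# Toy history: a <- b <- c (master), and b <- d <- e (feature).
parents = { "a" => [], "b" => ["a"], "c" => ["b"], "d" => ["b"], "e" => ["d"] }
refs = { "master" => "c", "feature" => "e" }
puts distinct_commits(parents, refs, "feature").to_a.sort.inspect
```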

  48. NETWORKED
    HORIZONTALLY-SCALABLE, LATENCY-CONSIDERATE, API-COMPATIBLE WITH GRIT

  49. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  50. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011
    510,000 USERS

  51. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011
    1.3MM REPOSITORIES

  52. the problem: data duplication
    each fork is a full project history

  53. data duplication
    i create a repo
    fs5:/data/repositories/6/nw/6b/de/92/1/1.git
    you fork my repo
    fs7:/data/repositories/4/na/3b/dr/72/2/2.git

  54. data duplication
    1,000 commits, 10MB
    1,001 commits, 10MB
    20MB total disk

  55. data duplication
    GOAL:
    1,000 commits, 10MB
    1 commit, 1KB
    10MB total disk

  56. data duplication
    75 MB repo x 3.5k forks
    ~250 GB
    x 2 fs pairs + offsite backups
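
The arithmetic behind these slides can be checked with a short calculation. The function names are made up for this sketch, and two of its inputs are assumptions rather than figures from the talk: it reuses the 1KB-per-fork delta from the GOAL slide for all 3.5k forks, and it reads "x 2 fs pairs" as one extra full copy.

```ruby
MB_PER_GB = 1024.0

# Disk used when every fork stores a full copy of the history.
def duplicated_mb(repo_mb, forks)
  repo_mb * forks
end

# Disk used when the history is stored once and each fork adds a small delta.
def shared_mb(repo_mb, forks, delta_mb_per_fork)
  repo_mb + forks * delta_mb_per_fork
end

full = duplicated_mb(75, 3_500)
puts "full copies: #{full} MB (about #{(full / MB_PER_GB).round} GB)"
puts "doubled for the second copy: about #{(2 * full / MB_PER_GB).round} GB"
puts "stored once, 1KB delta per fork: about #{shared_mb(75, 3_500, 1 / 1024.0).round(1)} MB"
```

75 MB times 3,500 forks is 262,500 MB, which is where the slide's "~250 GB" comes from.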

  57. NET-SHARD
    shard by repository network
    (“forks”)

  58. NET-SHARD
    network.git
    1.git
    2.git
    3.git
    4.git
    CONTAINS DELTA
    }CONTAINS ALL REFS

  59. NET-SHARD
    network.git
    GIT ALTERNATES
    store git object data externally to repository
    we fetch refs into your fork, transparently
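
Git's alternates mechanism is a plain text file, `objects/info/alternates`, that lists other object directories to consult when an object is not found locally. The sketch below shows that lookup order for loose objects only. It is an illustration, not GitHub's net-shard code: the helper names are hypothetical, packfiles are ignored, and nested alternates are not followed.

```ruby
require "fileutils"
require "tmpdir"

# Path of a loose object inside an object directory.
def object_path(objects_dir, sha)
  File.join(objects_dir, sha[0, 2], sha[2..-1])
end

# Object directories to search: the repository's own first, then alternates.
# Relative entries are resolved against the objects directory, as git does.
def search_dirs(objects_dir)
  alternates = File.join(objects_dir, "info", "alternates")
  return [objects_dir] unless File.exist?(alternates)
  extra = File.readlines(alternates, chomp: true).reject(&:empty?)
  [objects_dir] + extra.map { |dir| File.expand_path(dir, objects_dir) }
end

# Returns the path of the first matching loose object, or nil.
def find_object(objects_dir, sha)
  search_dirs(objects_dir)
    .map { |dir| object_path(dir, sha) }
    .find { |path| File.file?(path) }
end

# Usage: a fork with no objects of its own resolves one from network.git.
Dir.mktmpdir do |root|
  network = File.join(root, "network.git", "objects")
  fork_objects = File.join(root, "1.git", "objects")
  sha = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
  FileUtils.mkdir_p(File.dirname(object_path(network, sha)))
  File.write(object_path(network, sha), "placeholder")
  FileUtils.mkdir_p(File.join(fork_objects, "info"))
  File.write(File.join(fork_objects, "info", "alternates"), "#{network}\n")
  puts find_object(fork_objects, sha)
end
```

Because the shared history lives once in the network repository, each fork only needs to store what it adds on top, which is the "delta" in the earlier diagram.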

  60. NET-SHARD
    network.git
    PRIVACY
    potential leaking of refs cross-network
    net-shard enabled on all-public and all-private repository networks only

  61. NET-SHARD
    network.git
    DISK
    halves disk usage
    increase disk and kernel cache hits

  62. NET-SHARD
    network.git
    MIGRATION
    gradually transitioned repos to network.git
    effectively feature-flagged by repo

  63. NET-SHARD
    SAVE DISK, IMPROVE PERFORMANCE

  64. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  65. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012
    1.2MM USERS

  66. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012 AUGUST
    1.9MM USERS

  67. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012
    3.4MM REPOSITORIES

  68. GITHUB: FOUR STAGES OF GROWTH
    2008 2009 2010 2011 2012 AUGUST
    6.5MM REPOSITORIES

  69. the problem:
    GRIT
    git via ruby

  70. the problem:
    local, ruby-based grit ended up
    in a high-traffic distributed system

  71. the problem:
    inelegant code spread out everywhere

  72. GITRPC
    GitRPC
    network-oriented library for git access

  73. GITRPC
    open source
    fastest git implementation (C)
    github-sponsored project
    bindings for all major languages
    used in our mac, windows clients

  74. GITRPC
    gitrpc (RUBY)
    rugged (RUBY)
    libgit2 (C)

  75. GITRPC
    LATENCY
    like smoke, gitrpc aims to
    reduce latency by reducing roundtrips

  76. GITRPC
    CACHING
    operations cached on library level
    yank out tons of app-level cache logic
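
One way to picture "operations cached on library level" is a thin wrapper that memoizes results inside the git access client, so application code never touches a cache directly. This is a hypothetical sketch, not GitRPC's design: the `CachingClient` class, its backend interface, and the in-memory Hash cache are all assumptions made for illustration.

```ruby
# Wraps any backend that responds to call(repo, operation, *args).
class CachingClient
  attr_reader :backend_calls

  def initialize(backend)
    @backend = backend
    @cache = {}
    @backend_calls = 0
  end

  # Returns the cached result when the same operation was already answered.
  def call(repo, operation, *args)
    key = [repo, operation, args]
    return @cache[key] if @cache.key?(key)
    @backend_calls += 1
    @cache[key] = @backend.call(repo, operation, *args)
  end
end

# Usage with a made-up backend that pretends to read a commit by id.
backend = ->(repo, operation, *args) { "#{repo}:#{operation}:#{args.join(',')}" }
client = CachingClient.new(backend)
2.times { client.call("1.git", :read_commit, "abc123") }
puts client.backend_calls
```

Caching like this is simplest for lookups keyed by object id, since git objects are immutable. Anything that depends on refs (a branch's current tip, for instance) would need invalidation, which this sketch leaves out.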

  77. GITRPC
    MIGRATION
    the move to gitrpc started this summer and will take months
    gradually replace smoke and grit; avoids a risky deploy

  78. GITRPC
    FAST AND STABLE NETWORKED GIT ACCESS

  79. GITHUB: FOUR STAGES OF GROWTH
    LOCAL NETWORKED NET-SHARD GITRPC

  80. identify WHAT’S BROKEN

  81. small CHANGES, FAST DEVELOPMENT

  82. real CODE BEATS IMAGINARY CODE

  83. automate EVERYTHING

  84. m . manage
    LOL DEVELOPERS
    SOFTWARE
    DEVELOPMENT

  85. m . manage
    DEADLINES
    MEETINGS
    PRIORITIES
    ESTIMATES

  86. m . manage
    DEADLINES
    MEETINGS
    PRIORITIES
    ESTIMATES

  87. EVERYONE is A MANAGER

  88. AUTOMATE AWAY PAIN
    DEVELOPMENT
    DEPLOYMENT
    RECOVERY

  89. automate DEVELOPMENT

  90. DEVELOPMENT
    RUN THIS IN EACH PROJECT:
    > ./do-work
    ...AND YOU’RE DONE!
    loljk

  91. DEVELOPMENT
    YOU CAN AUTOMATE THE PAIN OF DEVELOPMENT

  92. DEVELOPMENT
    the SETUP

  93. DEVELOPMENT
    the SETUP
    ONE-LINER INSTALLS ALL GITHUB DEVELOPMENT DEPENDENCIES

  94. DEVELOPMENT
    the SETUP
    30 min
    CLEAN MACHINE TO FULL DEVELOPMENT ENVIRONMENT

  95. DEVELOPMENT
    the SETUP
    NEW EMPLOYEES SHIP THEIR FIRST WEEK

  96. DEVELOPMENT
    the SETUP
    PUPPET HANDLES ALL DEPENDENCIES

  97. automate DEPLOYMENT

  98. DEPLOYMENT
    REAL BROGRAMMERS
    DEPLOY WITH
    NO FEAR
    SO FUCK THAT

  99. DEPLOYMENT
    DEPLOYS SHOULD BE CAUTIOUS, COMMONPLACE, AND AUTOMATED

  100. DEPLOYMENT
    GITHUB DEPLOYS 20-40 TIMES A DAY

  101. DEPLOYMENT
    PUSH BRANCH
    USUALLY OPEN A PULL REQUEST
    HUBOT RUNS TESTS
    IN ABOUT 200 SECONDS
    DEPLOY BRANCH
    EVERYWHERE · MACHINE CLASS · SPECIFIC SERVERS

  102. DEPLOYMENT
    DEPLOY LOCKING
    CAN’T DEPLOY IF A BRANCH IS DEPLOYED
    AUTODEPLOYS
    PUSHED TO MASTER WITH GREEN TESTS? DEPLOY.
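
The two rules on this slide can be modeled as a small state machine. The sketch below is an in-memory toy with hypothetical names, not GitHub's deploy tooling. It also makes two assumptions the slide does not spell out: deploying master releases the lock, and an autodeploy is skipped (not queued) while another branch holds the lock.

```ruby
class DeployLocked < StandardError; end

class Deployer
  attr_reader :locked_by, :deployed

  def initialize
    @locked_by = nil      # branch currently holding the deploy lock, if any
    @deployed = "master"  # what is live right now
  end

  # Deploys a branch; refuses while a different branch is deployed.
  def deploy(branch)
    if @locked_by && @locked_by != branch
      raise DeployLocked, "#{@locked_by} is deployed"
    end
    @locked_by = branch == "master" ? nil : branch
    @deployed = branch
  end

  # Releases the lock, for example after the branch is merged.
  def unlock
    @locked_by = nil
  end

  # Called on a push to master; deploys only with green tests and no lock.
  def on_master_push(tests_green)
    return false unless tests_green && @locked_by.nil?
    deploy("master")
    true
  end
end

# Usage with a made-up branch name.
deployer = Deployer.new
deployer.deploy("faster-search")
puts deployer.on_master_push(true) # lock is held, so no autodeploy
deployer.unlock
puts deployer.on_master_push(true)
```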

  103. DEPLOYMENT
    STAFF-ONLY FEATURE FLAGS
    LIMITS EXPOSURE · REAL-WORLD · AVOIDS MERGES
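
A staff-only feature flag can be as small as a lookup that also checks who is asking. The sketch below uses hypothetical names (`User`, `FeatureFlags`, the flag name) and an in-memory store. It only illustrates the idea of shipping code to production while exposing it to employees first, not how GitHub implements its flags.

```ruby
require "set"

User = Struct.new(:login, :staff)

class FeatureFlags
  def initialize
    @staff_only = Set.new
  end

  # Turns a feature on for staff accounts only.
  def enable_for_staff(name)
    @staff_only << name
  end

  # True only when the flag is on and the user is staff.
  def enabled?(name, user)
    @staff_only.include?(name) && user.staff == true
  end
end

# Usage with made-up accounts and a made-up flag name.
flags = FeatureFlags.new
flags.enable_for_staff(:new_repo_page)
puts flags.enabled?(:new_repo_page, User.new("employee", true))
puts flags.enabled?(:new_repo_page, User.new("visitor", false))
```

The code behind the flag is already merged and running against real data, which matches the slide's three claims: exposure is limited to staff, the feature is exercised in real-world conditions, and no long-lived branch has to be merged later.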

  104. automate RECOVERY

  105. RECOVERY
    SOMETHING WILL ALWAYS BREAK

  106. RECOVERY
    HUBOT
    IS A SYSADMIN

  107. RECOVERY
    HUBOT LOAD       SERVER LOAD
    HUBOT QUERIES    RUNNING DB QUERIES
    HUBOT CONNS      ALL OPEN CONNECTIONS

  108. RECOVERY
    HUBOT RESTORE    RESTORE A REPO FROM BACKUPS
    HUBOT PUSH-LOG   SEE RECENT PUSH LOGS TO A REPO
    HUBOT GH-EACH    RUN COMMAND ON SPECIFIC HOSTS

  109. RECOVERY
    HIGH-LEVEL OVERVIEW IN MINUTES
    SPEND MORE TIME FIXING AND LESS TIME INVESTIGATING

  110. happiness
    the

  111. 5 YEARS
    108 EMPLOYEES
    ZERO EMPLOYEES HAVE QUIT

  112. HIRE: 1-2 MONTHS
    RAMP-UP: 1-3 MONTHS
    LEAVE: 2 WEEKS

  113. LOSING AN EMPLOYEE CAN
    SET YOU BACK HALF A YEAR

  114. remove ANY REASON TO LEAVE

  115. NONE OF THESE
    TDD
    PAIR PROGRAMMING
    BDD
    TEST-FIRST
    DESIGN-FIRST
    EMACS (just kidding)

  116. WE CARE ABOUT THE WORK YOU DO,
    NOT ABOUT HOW YOU DO IT

  117. LOCATION
    HOURS
    DIRECTION

  118. LOCATION
    HOURS
    DIRECTION
    GITHUB EMPLOYEES WORK REMOTELY

  119. LOCATION
    HOURS
    DIRECTION
    FAMILY RELOCATION, TRAVEL FREEDOM

  120. LOCATION
    HOURS
    DIRECTION
    CHOOSE YOUR SCHEDULE
    CHOOSE YOUR VACATIONS
    FRESH, CREATIVE EMPLOYEES

  121. LOCATION
    HOURS
    DIRECTION
    YOU HACK ON THINGS THAT INTEREST YOU
    REDUCES BURNOUT

  122. LOCATION
    HOURS
    DIRECTION
    BE flexible TOWARDS WORK/LIFE

  123. github

  124. basically,
    MOVE FAST =
    SMALL CHANGES

  125. basically,
    BE STABLE =
    DEPLOY CONSTANTLY

  126. basically,
    HAPPY COMPANY =
    HAPPY EMPLOYEES

  127. thanks

  128. @HOLMAN
    NO FORKING
    LOST
    YO QUIT READING THIS SHIT
    ZACHHOLMAN.COM/TALKS