The Code Archive - HOPE XI

Recording: https://vimeo.com/177318837

Filippo Valsorda, Salman Aljammaz

Archiving web pages is hard. Crawling, images, assets... JavaScript! But archiving code is not. It comes as content-addressed objects neatly packaged in repositories and tagged with refs. It compresses well. Changes can be detected in real time with the GitHub Firehose API.

Nevertheless, we need to do it today while the host is healthy, and not wait for it to start bundling adware or slowly fade away. Otherwise, in ten years we'll find ourselves running unreproducible binaries on JavaScript emulators, or unable to build the software that could recover all our pictures because that one dependency is missing.

This is a talk about building The Code Archive, a Wayback Machine for git. Every time a repository changes on GitHub, Code Archive systems fetch it and archive all the files, commits, tags, and branches as they were at that time. Then you can clone a repository as it was at any point in time, even if the original has been rebased, has disappeared, or GitHub is down.

There's a lot of fun to be had when (ab)using the git protocol to clone and pull millions of repositories to the same database. The speakers will show what git looks like on the wire and how fetches are optimized. Also, all the Go code powering the Archive is available... on GitHub.

Filippo Valsorda

July 22, 2016

Transcript

  1. The Code Archive
    Clone ALL the code.

  2. @FiloSottile
    Filippo Valsorda
    @saljam_
    Salman Aljammaz

  3. GitHub alone has
    34,000,000 repositories
    14,000,000 users

  4. Go
    import (
    "github.com/miekg/dns"
    "golang.org/x/exp/io/i2c"
    )

  5. Repositories get deleted.

  6. Branches get rebased.

  7. Histories get rewritten.

  8. Services go down.

  9. Services disappear.

  10. We <3 GitHub!!

  11. We want a Wayback Machine for code

  12. And we built it!
    There are
    390,123 snapshots
    of 91,396 repositories
    totalling 1.025 terabytes
    As of 2016-07-22

  13. The prototype

    GitHub only

    Active repositories (fetched on push)

    Popular repositories (at least 10 stars)

    Reasonable repository size

  14. Architecture
    [Diagram with these components: GH API, Drinker, Queue, Fetcher, Pack blob storage, Frontend, git pull]

  15. The Drinker
    Drink the GitHub Firehose!

  16. The Drinker

    Monitor firehose for push, create, open source events

    Queue repositories

    Filter by number of stars

  17. The Drinker

    Monitor firehose for push, create, open source events

    Queue repositories

    Filter by number of stars

    GitHub API rate limit: 5K / hour

    Drink from https://www.githubarchive.org/

    Cache number of stars, update via events
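
The filtering described on this slide can be sketched in Go roughly as follows. This is a minimal illustration, not the Archive's actual code: the `Event` struct, the `Drinker` type, and the `minStars` threshold (taken from the "at least 10" on the prototype slide) are assumptions. The event type names are GitHub's public ones; `WatchEvent` is what GitHub emits when a user stars a repository.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down view of one firehose event.
type Event struct {
	Type string // e.g. "PushEvent", "CreateEvent", "PublicEvent", "WatchEvent"
	Repo string // e.g. "FiloSottile/gvt"
}

// minStars is an assumed popularity threshold.
const minStars = 10

// Drinker keeps a local star count per repository so that most events
// can be filtered without spending API rate limit.
type Drinker struct {
	stars map[string]int
}

// Handle updates the star cache and reports whether the repository
// should be queued for fetching.
func (d *Drinker) Handle(ev Event) bool {
	switch ev.Type {
	case "WatchEvent": // a user starred the repository
		d.stars[ev.Repo]++
		return false
	case "PushEvent", "CreateEvent", "PublicEvent":
		return d.stars[ev.Repo] >= minStars
	}
	return false
}

func main() {
	d := &Drinker{stars: map[string]int{"FiloSottile/gvt": 9}}
	events := []Event{
		{Type: "PushEvent", Repo: "FiloSottile/gvt"},
		{Type: "WatchEvent", Repo: "FiloSottile/gvt"},
		{Type: "PushEvent", Repo: "FiloSottile/gvt"},
	}
	for _, ev := range events {
		fmt.Println(ev.Type, "queued:", d.Handle(ev))
	}
}
```

Keeping the count in memory and adjusting it from star events is what lets the filter run on every event without an API call per repository.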

  18. Cache size: 7 million

  19. https://github.com/google/go-github/pull/317

  20. The Fetcher
    Just fetch the repos!

  21. The Fetcher
    Just fetch the repos!
    But fetch to what?

  22. Server → client:
    00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
    003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
    003e236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
    0000
    Client → server:
    003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
    0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
    0009done
    0000
    Server → client: [packfile]
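
Each line in the exchange above is a git pkt-line: four hex digits giving the total length (the payload plus the four-byte prefix itself), followed by the payload, with `0000` as a flush packet. A minimal Go sketch of the framing, assuming every payload ends in a newline; the helper name `pktLine` is made up for this example:

```go
package main

import "fmt"

// pktLine frames a payload as a git pkt-line: four hex digits giving the
// total length (payload plus the 4-byte prefix), followed by the payload.
func pktLine(payload string) string {
	return fmt.Sprintf("%04x%s", len(payload)+4, payload)
}

// flushPkt is the special zero-length packet that ends a section.
const flushPkt = "0000"

func main() {
	fmt.Print(pktLine("want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c\n"))
	fmt.Print(pktLine("have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7\n"))
	fmt.Print(pktLine("done\n"))
	fmt.Println(flushPkt)
}
```

A `want` or `have` line is 5 + 40 + 1 = 46 bytes of payload, so the prefix is 50 = `0032`; `done` plus its newline is 5 bytes, giving `0009`.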

  23. HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
    master → 867387cf1373910c60c78cab81085cb846fadfdb
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    2016-05-03 09:55:11 Z

  24. HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
    master → 867387cf1373910c60c78cab81085cb846fadfdb
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    2016-05-03 09:55:11 Z
    HEAD → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
    master → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    2016-05-03 10:55:33 Z
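
One way to model these timestamped ref sets is a list of snapshots per repository, with a lookup that returns the refs as they stood at a given moment. The Go sketch below is an assumption about the data model, not the Archive's schema: the `Snapshot` type, the `At` function, and the `clock` helper are hypothetical, and it assumes the `5b53898f...` refs belong to the 10:55:33 snapshot.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Snapshot is a hypothetical record: the refs of one repository as
// observed at a single fetch.
type Snapshot struct {
	Time time.Time
	Refs map[string]string // ref name -> commit hash
}

// clock builds a UTC time on 2016-05-03, the date used on the slides.
func clock(h, m, s int) time.Time {
	return time.Date(2016, time.May, 3, h, m, s, 0, time.UTC)
}

// At returns the most recent snapshot taken at or before t.
// snaps must be sorted by Time, oldest first.
func At(snaps []Snapshot, t time.Time) (Snapshot, bool) {
	// i is the index of the first snapshot strictly after t.
	i := sort.Search(len(snaps), func(i int) bool { return snaps[i].Time.After(t) })
	if i == 0 {
		return Snapshot{}, false // nothing archived yet at time t
	}
	return snaps[i-1], true
}

func main() {
	snaps := []Snapshot{
		{Time: clock(9, 55, 11), Refs: map[string]string{"master": "867387cf1373910c60c78cab81085cb846fadfdb"}},
		{Time: clock(10, 55, 33), Refs: map[string]string{"master": "5b53898f17dda3d2af6bc599b45b0d7b76f900f0"}},
	}
	if s, ok := At(snaps, clock(10, 0, 0)); ok {
		fmt.Println(s.Time.Format(time.RFC3339), s.Refs["master"])
	}
}
```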

  25. Server → client:
    00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
    003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
    003e236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
    0000
    Client → server:
    003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
    0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
    0009done
    0000
    Server → client: [packfile]

  26. Cold cool storage

    Upload is cheap, retrieval and download are expensive

    Keep the refs (branch/tag name to commit) in a live db

    Store the packfiles and never look at them

    The server will do the diffs
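
The idea of never opening the stored packfiles can be illustrated by deriving the `want` and `have` lists purely from the refs database. The Go sketch below is a simplification under stated assumptions: the function name `negotiate` and both map shapes are hypothetical, and it offers every previously recorded ref tip as a `have`, which may not be how the Archive selects them.

```go
package main

import (
	"fmt"
	"sort"
)

// negotiate decides what to request using only the refs database; no
// packfile is opened. advertised maps ref names to the hashes sent by
// the server, known holds hashes recorded in earlier snapshots.
func negotiate(advertised map[string]string, known map[string]bool) (wants, haves []string) {
	seen := map[string]bool{}
	for _, hash := range advertised {
		// Ask only for tips we have never stored, once each.
		if !known[hash] && !seen[hash] {
			seen[hash] = true
			wants = append(wants, hash)
		}
	}
	for hash := range known {
		haves = append(haves, hash)
	}
	sort.Strings(wants) // sorted for deterministic output
	sort.Strings(haves)
	return wants, haves
}

func main() {
	advertised := map[string]string{
		"HEAD":              "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"refs/heads/master": "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"refs/tags/v1.2.3":  "236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c",
	}
	known := map[string]bool{
		"867387cf1373910c60c78cab81085cb846fadfdb": true,
		"236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c": true,
	}
	wants, haves := negotiate(advertised, known)
	for _, h := range wants {
		fmt.Println("want", h)
	}
	for _, h := range haves {
		fmt.Println("have", h)
	}
}
```

With the lists built this way, the server works out which objects are missing and sends a packfile containing only those, so the archive side never has to read its own cold storage during a fetch.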

  27. Forks

    Forks sync from parent

    Parent gets PR from forks

    Send “haves” for the entire network

    Build a packfile dependency tree
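
Because each fetched pack only contains objects the server believed were missing, a pack is usable only together with the packs that held the objects offered as "haves". A minimal Go sketch of walking such a dependency structure follows; the `Pack` record and the `closure` function are hypothetical names, and the slides do not say how the Archive actually stores these links.

```go
package main

import (
	"fmt"
	"sort"
)

// Pack is a hypothetical record for one stored packfile.
type Pack struct {
	ID   string
	Deps []string // IDs of packs whose objects were offered as "haves"
}

// closure returns the IDs of every pack needed to use the pack root,
// including root itself, in sorted order.
func closure(packs map[string]Pack, root string) []string {
	seen := map[string]bool{}
	var visit func(id string)
	visit = func(id string) {
		if seen[id] {
			return // already collected; also guards against cycles
		}
		seen[id] = true
		for _, dep := range packs[id].Deps {
			visit(dep)
		}
	}
	visit(root)

	needed := make([]string, 0, len(seen))
	for id := range seen {
		needed = append(needed, id)
	}
	sort.Strings(needed)
	return needed
}

func main() {
	packs := map[string]Pack{
		"parent-1": {ID: "parent-1"},
		"parent-2": {ID: "parent-2", Deps: []string{"parent-1"}},
		"fork-1":   {ID: "fork-1", Deps: []string{"parent-2"}},
	}
	fmt.Println(closure(packs, "fork-1"))
}
```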


  28. Run at off-peak hours

    Use the raw git protocol

    Set user agents

    Only fetch diffs; packs have to be used together

  29. Gigantic repositories

  30. Gigantic repositories
    9GB repo. WTF.

  31. Disappearing repositories
    401 Unauthorized

  32. Disappearing repositories
    DMCA
    $ git clone git://github.com/rtmpdump/rtmpdump-2.5.git
    Cloning into 'rtmpdump-2.5'...
    fatal: remote error:
    Repository unavailable due to DMCA takedown.
    See the takedown notice for more details:
    https://github.com/github/dmca/blob/master/2016-07-22-rtmpdump.md.

  33. Last minute crashes
    “Have you tried turning it off and on again?”

  34. The Backpanel

  35. The Backpanel
    Our admin UI!
    E.g. blacklist of excessively large repos,
    whitelist of exceptions to that,
    manually deleted repos
    … but building UIs sucks (for us, anyway)

  36. The Backpanel

  37. The Backpanel
    Lazy...
    But it works!

  38. The Backpanel
    https://trello.com/b/04pbw4Gv/blacklist

  39. The Frontend

  40. The Frontend
    git clone interface

    Web interface

  41. The Frontend
    git clone interface

    Web interface

    Retrieval is expensive

    Outbound bandwidth even more

  42. Local cache and alternates
    .
    ├── HEAD
    ├── config
    ├── objects/
    │   └── info/
    │       └── alternates
    ├── packed-refs
    └── refs/
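
git reads `objects/info/alternates` as a list of additional object directories, one path per line, so a small repository skeleton like the one above can borrow its objects from a shared cache. A minimal Go sketch of writing that file; the helper names and the cache path `/var/cache/codearchive/objects` are made up for this example:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeAlternates points the repository at repoDir to extra object
// directories by writing objects/info/alternates, one path per line.
func writeAlternates(repoDir string, objectDirs ...string) error {
	infoDir := filepath.Join(repoDir, "objects", "info")
	if err := os.MkdirAll(infoDir, 0o755); err != nil {
		return err
	}
	var content string
	for _, dir := range objectDirs {
		content += dir + "\n"
	}
	return os.WriteFile(filepath.Join(infoDir, "alternates"), []byte(content), 0o644)
}

// demo writes an alternates file into a temporary repository directory
// and returns what was written.
func demo() (string, error) {
	repo, err := os.MkdirTemp("", "snapshot-*.git")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(repo)

	// Hypothetical location of the shared local object cache.
	if err := writeAlternates(repo, "/var/cache/codearchive/objects"); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(repo, "objects", "info", "alternates"))
	return string(data), err
}

func main() {
	content, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(content)
}
```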

  43. Clone one or all snapshots
    Exactly like it looked at a given time:
    $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
    All the snapshots at once:
    $ git clone https://codearchive.org/all/github.com/FiloSottile/gvt
    $ git branch
    2016-07-01Z11:44:11/master
    2016-07-02Z22:44:00/master
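
The two URL shapes above could be told apart on the server with a small path parser. The Go sketch below is illustrative only: the `Request` type and `parsePath` function are hypothetical, and it ignores the extra path segments and query string a real git HTTP client appends (such as `/info/refs?service=git-upload-pack`).

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Request is a hypothetical parsed form of a clone URL path.
type Request struct {
	All  bool      // true for /all/...: serve every snapshot
	Date time.Time // snapshot date for /YYYY-MM-DD/...
	Repo string    // e.g. "github.com/FiloSottile/gvt"
}

// parsePath splits "/<selector>/<repo>" where selector is either "all"
// or a date in YYYY-MM-DD form.
func parsePath(path string) (Request, error) {
	selector, repo, ok := strings.Cut(strings.TrimPrefix(path, "/"), "/")
	if !ok || repo == "" {
		return Request{}, fmt.Errorf("malformed path %q", path)
	}
	if selector == "all" {
		return Request{All: true, Repo: repo}, nil
	}
	date, err := time.Parse("2006-01-02", selector)
	if err != nil {
		return Request{}, fmt.Errorf("bad snapshot selector %q: %w", selector, err)
	}
	return Request{Date: date, Repo: repo}, nil
}

func main() {
	for _, p := range []string{
		"/2016-07-01/github.com/FiloSottile/gvt",
		"/all/github.com/FiloSottile/gvt",
	} {
		req, err := parsePath(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("all=%v date=%s repo=%s\n", req.All, req.Date.Format("2006-01-02"), req.Repo)
	}
}
```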

  44. One more step
    $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
    Welcome to the Code Archive!
    Since download bandwidth is expensive, please click here to verify that
    you are human:
    https://codearchive.org/captcha/72f878a9670ab664
    The download will start automatically...

  45. Web UI

    Work in progress

    Wayback Machine-style slider at the top

    We suck at UIs. PRs welcome!

  46. Things to come

  47. Beyond git and GitHub

  48. Hiding things :(

    Login with GitHub and hide your repositories

    Automated DMCA processing

  49. Long term storage
    B2 object storage.
    Sponsored by

  50. Thank you!
    https://codearchive.org
    Filippo Valsorda - @FiloSottile
    Salman Aljammaz - @saljam_
