The Code Archive - HOPE XI

Recording: https://vimeo.com/177318837

Filippo Valsorda, Salman Aljammaz

Archiving web pages is hard. Crawling, images, assets... JavaScript! But archiving code is not. It comes as content-addressed objects neatly packaged in repositories and tagged with refs. It compresses well. Changes can be detected in real time with the GitHub Firehose API. Nevertheless, we need to do it today while the host is healthy, and not wait for it to start bundling adware or slowly fade away. Otherwise, in ten years we'll find ourselves running unreproducible binaries on JavaScript emulators, or unable to build the software that could recover all our pictures because that one dependency is missing. This is a talk about building The Code Archive, a Wayback Machine for git. Every time a repository changes on GitHub, Code Archive systems fetch it and archive all the files, commits, tags, and branches as they were at that time. Then you can clone a repository as it was at any point in time, even if the original has been rebased, has disappeared, or GitHub is down. There's a lot of fun to be had when (ab)using the git protocol to clone and pull millions of repositories to the same database. Speakers will show what git looks like on the wire and how fetches are optimized. Also, all the Go code powering the Archive is available... on GitHub.

July 22, 2016

Transcript

  1. And we built it!
    There are 390,123 snapshots of 91,396 repositories totalling 1.025 terabytes
    As of 2016-07-22
  2. The prototype
    • GitHub only
    • Active repositories (fetched on push)
    • Popular repositories (at least 10 ★)
    • Reasonable repository size
  3. The Drinker
    • Monitor firehose for push, create, open source events
    • Queue repositories
    • Filter by number of stars
  4. The Drinker
    • Monitor firehose for push, create, open source events
    • Queue repositories
    • Filter by number of stars
    • GitHub API rate limit: 5K / hour
    • Drink from https://www.githubarchive.org/
    • Cache number of stars, update via events
  5. Server → client
    00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
    003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
    003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
    0000

    Client → server
    003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
    0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
    0009done
    0000

    Server → client: [packfile]
  6. HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
    master → 867387cf1373910c60c78cab81085cb846fadfdb
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c

    HEAD → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
    master → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c

    2016-05-03 10:55:33 Z    2016-05-03 09:55:11 Z
  7. Server → client
    00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
    003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
    003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
    0000

    Client → server
    003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
    0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
    0009done
    0000

    Server → client: [packfile]
  8. Cold cool storage
    • Upload is cheap, retrieval and download is expensive
    • Keep the refs (branch/tag name to commit) in a live db
    • Store the packfiles and never look at them
    • The server will do the diffs
  9. Forks
    • Forks sync from parent
    • Parent gets PR from forks
    • Send “haves” for the entire network
    • Build a packfile dependency tree
  10. • Run at off-peak hours
    • Use the raw git protocol
    • Set user agents
    • Only fetch diffs, packs have to be used together
  11. Disappearing repositories
    DMCA
    $ git clone git://github.com/rtmpdump/rtmpdump-2.5.git
    Cloning into 'rtmpdump-2.5'...
    fatal: remote error: Repository unavailable due to DMCA takedown. See the takedown notice for more details: https://github.com/github/dmca/blob/master/2016-07-22-rtmpdump.md.
  12. The Backpanel
    Our admin UI! E.g. blacklist of excessively large repos, whitelist of exceptions to that, manually deleted repos
    … but building UIs sucks (for us, anyway)
  13. The Frontend
    • git clone interface
    • Web interface
    • Retrieval is expensive
    • Outbound bandwidth even more
  14. Local cache and alternates
    .
    ├── HEAD
    ├── config
    ├── objects/
    │   └── info/
    │       └── alternates
    ├── packed-refs
    └── refs/
  15. Clone one or all snapshots
    Exactly like it looked at a given time:
    $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
    All the snapshots at once:
    $ git clone https://codearchive.org/all/github.com/FiloSottile/gvt
    $ git branch
    2016-07-01Z11:44:11/master
    2016-07-02Z22:44:00/master
  16. One more step
    $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
    Welcome to the Code Archive!
    Since download bandwidth is expensive, please click here to verify that you are human:
    https://codearchive.org/captcha/72f878a9670ab664
    The download will start automatically...
  17. Web UI
    • Work in progress
    • Wayback machine style slider at the top
    • We suck at UIs. PRs welcomed!
  18. Hiding things :(
    • Login with GitHub and hide your repositories
    • Automated DMCA processing
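
The wire dumps on slides 5 and 7 use git's pkt-line framing: each line starts with a four-digit hex length that counts the payload plus the four length bytes themselves, and `0000` is a flush packet. The Go sketch below shows how such lines could be encoded and decoded. The function names are illustrative and are not taken from the Code Archive source.

```go
package main

import (
	"fmt"
	"strconv"
)

// encodePktLine frames a payload as a git pkt-line: a 4-digit lowercase hex
// length (which includes the 4 length bytes) followed by the payload.
func encodePktLine(payload string) string {
	return fmt.Sprintf("%04x%s", len(payload)+4, payload)
}

// decodePktLines splits a stream into payloads and stops at the first
// flush packet ("0000"). It returns an error if no flush packet is found.
func decodePktLines(stream string) ([]string, error) {
	var payloads []string
	for len(stream) >= 4 {
		n, err := strconv.ParseUint(stream[:4], 16, 16)
		if err != nil {
			return nil, fmt.Errorf("bad length prefix %q: %v", stream[:4], err)
		}
		if n == 0 { // flush packet ends the section
			return payloads, nil
		}
		if n < 4 || int(n) > len(stream) {
			return nil, fmt.Errorf("invalid pkt-line length %d", n)
		}
		payloads = append(payloads, stream[4:n])
		stream = stream[n:]
	}
	return payloads, fmt.Errorf("stream ended without a flush packet")
}

func main() {
	// Rebuild two of the client lines from slide 5.
	want := encodePktLine("want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c\n")
	done := encodePktLine("done\n")
	fmt.Printf("%q\n%q\n", want, done)

	// Decode a small stream terminated by a flush packet.
	payloads, err := decodePktLines(want + done + "0000")
	fmt.Println(len(payloads), err)
}
```

The `0032` and `0009` prefixes on the slide fall out of the arithmetic: a `want` line is 4 + 5 + 40 + 1 = 50 bytes (hex 32), and `done` plus a newline is 4 + 5 = 9 bytes.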
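
Slide 6 shows two ref listings of the same repository taken an hour apart, and slide 8 says the refs are kept in a live database. One way to picture how that supports incremental fetches is the sketch below, which compares a stored listing with a freshly advertised one and derives which hashes to request ("wants") and which to declare as already held ("haves"). This is a guess at the idea rather than the Archive's actual logic: the `Snapshot` type, the function names, and the assumption that the `5b53…` listing is the older one are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot maps ref names to the commit hash each ref pointed to at fetch time.
type Snapshot map[string]string

// changedRefs returns the sorted names of refs that were added, removed, or
// moved between two snapshots.
func changedRefs(prev, cur Snapshot) []string {
	var changed []string
	for name, hash := range cur {
		if prev[name] != hash {
			changed = append(changed, name)
		}
	}
	for name := range prev {
		if _, ok := cur[name]; !ok {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

// wantsAndHaves derives a fetch negotiation from two ref listings: "wants" are
// advertised hashes that are not stored yet, "haves" are hashes already held.
// Both slices are de-duplicated and sorted.
func wantsAndHaves(stored, advertised Snapshot) (wants, haves []string) {
	known := make(map[string]bool)
	for _, hash := range stored {
		if !known[hash] {
			known[hash] = true
			haves = append(haves, hash)
		}
	}
	wanted := make(map[string]bool)
	for _, hash := range advertised {
		if !known[hash] && !wanted[hash] {
			wanted[hash] = true
			wants = append(wants, hash)
		}
	}
	sort.Strings(wants)
	sort.Strings(haves)
	return wants, haves
}

func main() {
	// Hashes from slide 6; which listing is older is an assumption.
	stored := Snapshot{
		"HEAD":             "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"refs/heads/master": "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"refs/tags/v1.2.3":  "236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c",
	}
	advertised := Snapshot{
		"HEAD":             "867387cf1373910c60c78cab81085cb846fadfdb",
		"refs/heads/master": "867387cf1373910c60c78cab81085cb846fadfdb",
		"refs/tags/v1.2.3":  "236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c",
	}
	fmt.Println("changed:", changedRefs(stored, advertised))
	wants, haves := wantsAndHaves(stored, advertised)
	fmt.Println("wants:", wants)
	fmt.Println("haves:", haves)
}
```

Because the tag did not move, only the new `master` tip ends up in the wants, and the server can build a packfile containing just the objects reachable from it that are not reachable from the haves.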
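
The directory tree on slide 14 is the layout of a bare git repository whose `objects/info/alternates` file points at another object store. Git reads that file as a list of extra object directories, one path per line, so a snapshot repository can hold only refs while the objects live elsewhere. The sketch below generates such a skeleton. The shared path, the `HEAD` target, the config contents, and the function names are assumptions made for illustration, and how the Archive actually populates its local cache is not stated on the slide.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// skeletonFiles returns the files of a bare repository skeleton, keyed by
// slash-separated relative path. refs maps full ref names (for example
// "refs/heads/master") to commit hashes and is written as packed-refs.
func skeletonFiles(sharedObjects string, refs map[string]string) map[string]string {
	names := make([]string, 0, len(refs))
	for name := range refs {
		names = append(names, name)
	}
	sort.Strings(names)
	var packed strings.Builder
	for _, name := range names {
		packed.WriteString(refs[name] + " " + name + "\n")
	}
	return map[string]string{
		"HEAD":                    "ref: refs/heads/master\n", // assumed default branch
		"config":                  "[core]\n\trepositoryformatversion = 0\n\tbare = true\n",
		"objects/info/alternates": sharedObjects + "\n", // one object directory per line
		"packed-refs":             packed.String(),
	}
}

// writeSkeleton materializes the skeleton under dir, creating refs/ as well.
func writeSkeleton(dir, sharedObjects string, refs map[string]string) error {
	if err := os.MkdirAll(filepath.Join(dir, "refs"), 0o755); err != nil {
		return err
	}
	for name, content := range skeletonFiles(sharedObjects, refs) {
		path := filepath.Join(dir, filepath.FromSlash(name))
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return err
		}
		if err := os.WriteFile(path, []byte(content), 0o644); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "snapshot-*.git")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	refs := map[string]string{
		"refs/heads/master": "867387cf1373910c60c78cab81085cb846fadfdb",
		"refs/tags/v1.2.3":  "236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c",
	}
	// "/srv/archive/objects" is a hypothetical shared object store.
	if err := writeSkeleton(dir, "/srv/archive/objects", refs); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("skeleton written to", dir)
}
```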