The Code Archive - HOPE XI

Recording: https://vimeo.com/177318837

Filippo Valsorda, Salman Aljammaz

Archiving web pages is hard. Crawling, images, assets... JavaScript! But archiving code is not. It comes as content-addressed objects neatly packaged in repositories and tagged with refs. It compresses well. Changes can be detected in real time with the GitHub Firehose API. Nevertheless, we need to do it today while the host is healthy, and not wait for it to start bundling adware or slowly fade away. Otherwise, in ten years we'll find ourselves running unreproducible binaries on JavaScript emulators, or unable to build the software that could recover all our pictures because that one dependency is missing.

This is a talk about building The Code Archive, a Wayback Machine for git. Every time a repository changes on GitHub, Code Archive systems fetch it and archive all the files, commits, tags, and branches as they were at that time. Then you can clone a repository as it was at any point in time, even if the original has been rebased, has disappeared, or GitHub is down.

There's a lot of fun to be had when (ab)using the git protocol to clone and pull millions of repositories to the same database. The speakers will show what git looks like on the wire and how fetches are optimized. Also, all the Go code powering the Archive is available... on GitHub.

Filippo Valsorda

July 22, 2016

Transcript

  1. The Code Archive Clone ALL the code.

  2. @FiloSottile Filippo Valsorda @saljam_ Salman Aljammaz

  3. None
  4. None
  5. None
  6. GitHub alone has 34,000,000 repositories and 14,000,000 users

  7. Go import ( "github.com/miekg/dns" "golang.org/x/exp/io/i2c" )

  8. Repositories get deleted.

  9. Branches get rebased.

  10. Histories get rewritten.

  11. Services go down.

  12. None
  13. None
  14. None
  15. None
  16. None
  17. Services disappear.

  18. We <3 GitHub!!

  19. None
  20. We want a Wayback Machine for code

  21. And we built it! There are 390,123 snapshots of 91,396 repositories totalling 1.025 terabytes (as of 2016-07-22)
  22. The prototype
      • GitHub only
      • Active repositories (fetched on push)
      • Popular repositories (at least 10 ★)
      • Reasonable repository size
  23. Architecture (diagram labels): Drinker, Fetcher, Pack blob storage, GH API, Queue, git pull, Frontend
  24. The Drinker Drink the GitHub Firehose!

  25. The Drinker
      • Monitor firehose for push, create, open source events
      • Queue repositories
      • Filter by number of stars
  26. The Drinker
      • Monitor firehose for push, create, open source events
      • Queue repositories
      • Filter by number of stars
      • GitHub API rate limit: 5K / hour
      • Drink from https://www.githubarchive.org/
      • Cache number of stars, update via events
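The star filter on this slide could be sketched roughly as follows in Go. This is only an illustration, not the Archive's actual code: the `Event` and `StarCache` types are hypothetical, and the mapping of "push, create, open source events" to GitHub's `PushEvent`, `CreateEvent` and `PublicEvent` types (with `WatchEvent` signalling a new star) is an assumption. The sketch also does not account for users removing stars.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down view of one firehose event.
type Event struct {
	Type string // e.g. "PushEvent", "WatchEvent"
	Repo string // "owner/name"
}

// minStars is the popularity threshold from the prototype slide.
const minStars = 10

// StarCache keeps an approximate star count per repository so that the
// rate-limited API does not have to be queried for every event.
type StarCache map[string]int

// Handle updates the cache from the event and reports whether the
// repository should be queued for fetching.
func (c StarCache) Handle(ev Event) bool {
	switch ev.Type {
	case "WatchEvent": // assumed: emitted when a user stars a repository
		c[ev.Repo]++
		return false
	case "PushEvent", "CreateEvent", "PublicEvent":
		return c[ev.Repo] >= minStars
	}
	return false
}

func main() {
	cache := StarCache{"example/repo": 9}
	events := []Event{
		{Type: "PushEvent", Repo: "example/repo"},  // below the threshold
		{Type: "WatchEvent", Repo: "example/repo"}, // one more star
		{Type: "PushEvent", Repo: "example/repo"},  // now at the threshold
	}
	for _, ev := range events {
		fmt.Println(ev.Type, ev.Repo, "queue:", cache.Handle(ev))
	}
}
```

Keeping the counts in memory and nudging them from the event stream is one way to stay under the 5K requests per hour limit mentioned above.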
  27. None
  28. None
  29. None
  30. None
  31. None
  32. Cache size: 7 million

  33. https://github.com/google/go-github/pull/317

  34. The Fetcher Just fetch the repos!

  35. The Fetcher Just fetch the repos! But fetch to what?

  36. None
  37. Server → client:
        00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
        003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
        003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
        0000
      Client → server:
        003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
        0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
        0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
        0009done
        0000
      Server → client: [packfile]
  38. 2016-05-03 09:55:11 Z
        HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
        master → 867387cf1373910c60c78cab81085cb846fadfdb
        v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
  39. 2016-05-03 09:55:11 Z
        HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
        master → 867387cf1373910c60c78cab81085cb846fadfdb
        v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
      2016-05-03 10:55:33 Z
        HEAD → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
        master → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
        v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
  40. None
  41. None
  42. None
  43. Server → client:
        00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
        003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
        003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
        0000
      Client → server:
        003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
        0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
        0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
        0009done
        0000
      Server → client: [packfile]
  44. Cold cool storage
      • Upload is cheap, retrieval and download is expensive
      • Keep the refs (branch/tag name to commit) in a live db
      • Store the packfiles and never look at them
      • The server will do the diffs
  45. None
  46. Forks
      • Forks sync from parent
      • Parent gets PR from forks
      • Send “haves” for the entire network
      • Build a packfile dependency tree
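Sending "haves" for the entire network can be pictured as taking the union of every commit already archived for the parent and its forks. The Go sketch below shows that idea under stated assumptions: the slides do not say how the network is stored, so the nested map and the `networkHaves` name are purely illustrative, and the example hashes are placeholders.

```go
package main

import (
	"fmt"
	"sort"
)

// networkHaves returns the deduplicated, sorted set of commit hashes the
// archive already holds for any repository in a fork network. The input
// maps repository name -> ref name -> commit hash (a hypothetical layout).
func networkHaves(network map[string]map[string]string) []string {
	seen := make(map[string]bool)
	for _, refs := range network {
		for _, hash := range refs {
			seen[hash] = true
		}
	}
	haves := make([]string, 0, len(seen))
	for hash := range seen {
		haves = append(haves, hash)
	}
	sort.Strings(haves) // stable output regardless of map iteration order
	return haves
}

func main() {
	network := map[string]map[string]string{
		"parent/project": {"master": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"},
		"forker/project": {
			"master":  "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
			"feature": "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
		},
	}
	for _, h := range networkHaves(network) {
		fmt.Println("have", h)
	}
}
```

Because forks share most of their history, advertising the whole network's tips should let the server send only the objects that are new to the archive.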
  47. • Run at off-peak hours
      • Use the raw git protocol
      • Set user agents
      • Only fetch diffs, packs have to be used together
  48. None
  49. None
  50. None
  51. None
  52. Gigantic repositories

  53. Gigantic repositories 9GB repo. WTF.

  54. Disappearing repositories 401 Unauthorized

  55. Disappearing repositories: DMCA
      $ git clone git://github.com/rtmpdump/rtmpdump-2.5.git
      Cloning into 'rtmpdump-2.5'...
      fatal: remote error: Repository unavailable due to DMCA takedown.
      See the takedown notice for more details:
      https://github.com/github/dmca/blob/master/2016-07-22-rtmpdump.md.
  56. Last minute crashes “Have you tried turning it off and on again?”
  57. The Backpanel

  58. The Backpanel Our admin UI! E.g. blacklist of excessively large repos, whitelist of exceptions to that, manually deleted repos … but building UIs sucks (for us, anyway)
  59. The Backpanel

  60. The Backpanel Lazy... But it works!

  61. The Backpanel https://trello.com/b/04pbw4Gv/blacklist

  62. The Frontend

  63. The Frontend
      • git clone interface
      • Web interface

  64. The Frontend
      • git clone interface
      • Web interface
      • Retrieval is expensive
      • Outbound bandwidth even more
  65. Local cache and alternates
      .
      ├── HEAD
      ├── config
      ├── objects/
      │   └── info/
      │       └── alternates
      ├── packed-refs
      └── refs/
  66. Clone one or all snapshots
      Exactly like it looked at a given time:
        $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
      All the snapshots at once:
        $ git clone https://codearchive.org/all/github.com/FiloSottile/gvt
        $ git branch
        2016-07-01Z11:44:11/master
        2016-07-02Z22:44:00/master
  67. One more step
      $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
      Welcome to the Code Archive!
      Since download bandwidth is expensive, please click here to verify that you are human:
      https://codearchive.org/captcha/72f878a9670ab664
      The download will start automatically...
  68. Web UI
      • Work in progress
      • Wayback machine style slider at the top
      • We suck at UIs. PRs welcomed!
  69. Things to come

  70. Beyond git and GitHub

  71. Hiding things :(
      • Login with GitHub and hide your repositories
      • Automated DMCA processing
  72. Long term storage B2 object storage. Sponsored by

  73. Thank you! https://codearchive.org Filippo Valsorda - @FiloSottile Salman Aljammaz - @saljam_