
The Code Archive - HOPE XI


Recording: https://vimeo.com/177318837

Filippo Valsorda, Salman Aljammaz

Archiving web pages is hard. Crawling, images, assets... JavaScript! But archiving code is not. It comes as content-addressed objects neatly packaged in repositories and tagged with refs. It compresses well. Changes can be detected in real time with the GitHub Firehose API. Nevertheless, we need to do it today while the host is healthy, and not wait for it to start bundling adware or slowly fade away. Otherwise, in ten years we'll find ourselves running unreproducible binaries on JavaScript emulators, or unable to build the software that could recover all our pictures because that one dependency is missing.

This is a talk about building The Code Archive, a Wayback Machine for git. Every time a repository changes on GitHub, Code Archive systems fetch it and archive all the files, commits, tags, and branches as they were at that time. Then you can clone a repository as it was at any point in time, even if the original has been rebased, has disappeared, or GitHub is down.

There's a lot of fun to be had when (ab)using the git protocol to clone and pull millions of repositories to the same database. Speakers will show what git looks like on the wire and how fetches are optimized. Also, all the Go code powering the Archive is available... on GitHub.

Filippo Valsorda

July 22, 2016


Transcript

  1. The Code Archive
    Clone ALL the code.


  2. @FiloSottile
    Filippo Valsorda
    @saljam_
    Salman Aljammaz



  6. GitHub alone has
    34,000,000 repositories
    14,000,000 users


  7. Go
    import (
        "github.com/miekg/dns"
        "golang.org/x/exp/io/i2c"
    )


  8. Repositories get deleted.


  9. Branches get rebased.


  10. Histories get rewritten.


  11. Services go down.



  17. Services disappear.


  18. We <3 GitHub!!



  20. We want a Wayback Machine for code


  21. And we built it!
    There are 390,123 snapshots of 91,396 repositories,
    totalling 1.025 terabytes (as of 2016-07-22).


  22. The prototype

    GitHub only

    Active repositories (fetched on push)

    Popular repositories (at least 10 stars)

    Reasonable repository size


  23. Architecture
    [Diagram; labelled components: GH API, Drinker, Queue, Fetcher, git pull, Pack blob storage, Frontend]


  24. The Drinker
    Drink the GitHub Firehose!


  25. The Drinker

    Monitor firehose for push, create, open source events

    Queue repositories

    Filter by number of stars


  26. The Drinker

    Monitor firehose for push, create, open source events

    Queue repositories

    Filter by number of stars

    GitHub API rate limit: 5K / hour

    Drink from https://www.githubarchive.org/

    Cache number of stars, update via events
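
The star cache described on this slide could be sketched roughly as below. This is a minimal illustration, not the Archive's actual code: the `Event` and `StarCache` types are hypothetical, the threshold of 10 is taken from the prototype slide, and the event type strings follow GitHub's public event names (`WatchEvent` is the event emitted when a repository is starred).

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down view of one firehose event.
type Event struct {
	Type string // e.g. "PushEvent", "CreateEvent", "PublicEvent", "WatchEvent"
	Repo string // e.g. "FiloSottile/gvt"
}

// StarCache keeps an approximate star count per repository so that the
// rate-limited API does not have to be queried for every event.
type StarCache struct {
	stars    map[string]int
	minStars int
}

// NewStarCache returns an empty cache with the given queueing threshold.
func NewStarCache(minStars int) *StarCache {
	return &StarCache{stars: make(map[string]int), minStars: minStars}
}

// Seed records a star count learned from another source (assumed to exist).
func (c *StarCache) Seed(repo string, n int) { c.stars[repo] = n }

// Handle updates the cache from the event and reports whether the
// repository should be queued for fetching.
func (c *StarCache) Handle(ev Event) bool {
	switch ev.Type {
	case "WatchEvent":
		// A new star: bump the cached count, nothing to fetch.
		c.stars[ev.Repo]++
		return false
	case "PushEvent", "CreateEvent", "PublicEvent":
		return c.stars[ev.Repo] >= c.minStars
	}
	return false
}

func main() {
	c := NewStarCache(10)
	c.Seed("example/popular", 9)
	fmt.Println(c.Handle(Event{Type: "PushEvent", Repo: "example/popular"}))
	c.Handle(Event{Type: "WatchEvent", Repo: "example/popular"})
	fmt.Println(c.Handle(Event{Type: "PushEvent", Repo: "example/popular"}))
}
```

The sketch ignores unstar events and cache eviction, both of which a cache of several million entries would probably need to handle.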



  32. Cache size: 7 million


  33. https://github.com/google/go-github/pull/317


  34. The Fetcher
    Just fetch the repos!


  35. The Fetcher
    Just fetch the repos!
    But fetch to what?



  37. Server → client:
    00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
    003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
    003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
    0000
    Client → server:
    003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
    0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
    0009done
    0000
    Server → client: [packfile]
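
Each line in the exchange above is a git "pkt-line": four hexadecimal digits giving the total length (including the four digits themselves), followed by the payload, with `0000` as a flush packet. The following is a small sketch of that framing only; the function names are made up for illustration and it does not attempt the full negotiation.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// flushPkt is the special zero-length packet that ends a section.
const flushPkt = "0000"

// pktLine frames a payload as a pkt-line: a 4-digit hex length that counts
// the length prefix itself, followed by the payload.
func pktLine(payload string) string {
	return fmt.Sprintf("%04x%s", len(payload)+4, payload)
}

// splitPktLines returns the payloads found in s up to the first flush packet.
func splitPktLines(s string) ([]string, error) {
	var out []string
	for len(s) >= 4 {
		n64, err := strconv.ParseUint(s[:4], 16, 16)
		if err != nil {
			return nil, err
		}
		n := int(n64)
		if n == 0 {
			return out, nil // flush packet
		}
		if n < 4 || n > len(s) {
			return nil, fmt.Errorf("bad pkt-line length %d", n)
		}
		out = append(out, s[4:n])
		s = s[n:]
	}
	return out, errors.New("missing flush packet")
}

func main() {
	fmt.Printf("%q\n", pktLine("done\n"))
	stream := pktLine("want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c\n") +
		pktLine("done\n") + flushPkt
	lines, err := splitPktLines(stream)
	fmt.Println(len(lines), err)
}
```

With this framing, `done` plus a newline is 5 bytes of payload and 9 bytes in total, which is where the `0009` prefix on the slide comes from.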


  38. 2016-05-03 09:55:11 Z
    HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
    master → 867387cf1373910c60c78cab81085cb846fadfdb
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c


  39. 2016-05-03 09:55:11 Z
    HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
    master → 867387cf1373910c60c78cab81085cb846fadfdb
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
    2016-05-03 10:55:33 Z
    HEAD → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
    master → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
    v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
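
One way to model these timestamped ref tables is a sorted list of snapshots with an "as of" lookup. The sketch below is an assumption about how such a store might look, not the Archive's schema; `Snapshot`, `History`, `AsOf` and `mustParse` are illustrative names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Snapshot is the full ref table of a repository as seen at one fetch.
type Snapshot struct {
	At   time.Time
	Refs map[string]string // ref name -> commit hash
}

// History holds snapshots sorted by time, oldest first.
type History []Snapshot

// AsOf returns the most recent snapshot taken at or before t.
func (h History) AsOf(t time.Time) (Snapshot, bool) {
	// i is the index of the first snapshot strictly after t.
	i := sort.Search(len(h), func(i int) bool { return h[i].At.After(t) })
	if i == 0 {
		return Snapshot{}, false
	}
	return h[i-1], true
}

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	h := History{
		{At: mustParse("2016-05-03T09:55:11Z"), Refs: map[string]string{
			"refs/heads/master": "867387cf1373910c60c78cab81085cb846fadfdb",
		}},
		{At: mustParse("2016-05-03T10:55:33Z"), Refs: map[string]string{
			"refs/heads/master": "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		}},
	}
	if s, ok := h.AsOf(mustParse("2016-05-03T10:00:00Z")); ok {
		fmt.Println(s.At.Format(time.RFC3339), s.Refs["refs/heads/master"])
	}
}
```

Because every snapshot stores the whole ref table, a rebased or deleted branch in a later snapshot does not affect what an earlier one resolves to.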




  44. Cold cool storage

    Upload is cheap, retrieval and download are expensive

    Keep the refs (branch/tag name to commit) in a live db

    Store the packfiles and never look at them

    The server will do the diffs
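
Since the stored packfiles are never read back, the request for an incremental fetch has to be derived from the live ref database alone. A simplified sketch of that idea follows; the `plan` function and its inputs are hypothetical, and sending every known hash as a "have" is a shortcut a real fetcher would likely trim.

```go
package main

import (
	"fmt"
	"sort"
)

// plan decides what to ask the server for using only the ref database.
// known is the set of commit hashes already archived in some stored
// packfile; advertised is the server's current ref table (name -> hash).
func plan(known map[string]bool, advertised map[string]string) (wants, haves []string) {
	seen := make(map[string]bool)
	for _, h := range advertised {
		// Only ask for tips we have never archived, each one once.
		if !known[h] && !seen[h] {
			seen[h] = true
			wants = append(wants, h)
		}
	}
	for h := range known {
		haves = append(haves, h)
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Strings(wants)
	sort.Strings(haves)
	return wants, haves
}

func main() {
	known := map[string]bool{
		"867387cf1373910c60c78cab81085cb846fadfdb": true,
		"236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c": true,
	}
	advertised := map[string]string{
		"HEAD":              "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"refs/heads/master": "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"refs/tags/v1.2.3":  "236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c",
	}
	wants, haves := plan(known, advertised)
	fmt.Println("wants:", wants)
	fmt.Println("haves:", haves)
}
```

The server then computes the packfile containing only what the "haves" do not already cover, so the archive never needs to open its own packs to build a diff.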



  46. Forks

    Forks sync from parent

    Parent gets PR from forks

    Send “haves” for the entire network

    Build a packfile dependency tree
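
A packfile fetched on top of "haves" only makes sense together with the packs that supplied those objects. One possible way to track that, sketched under the assumption that each pack records the IDs of the packs it was fetched on top of (the `deps` type and the pack IDs are invented for illustration):

```go
package main

import "fmt"

// deps maps a packfile ID to the IDs of the packfiles it was fetched on
// top of, i.e. the packs whose objects were offered as "haves".
type deps map[string][]string

// needed returns every packfile required to use the given one,
// dependencies first, each listed once.
func (d deps) needed(id string) []string {
	var order []string
	visited := make(map[string]bool)
	var visit func(string)
	visit = func(p string) {
		if visited[p] {
			return
		}
		visited[p] = true
		for _, dep := range d[p] {
			visit(dep)
		}
		order = append(order, p)
	}
	visit(id)
	return order
}

func main() {
	d := deps{
		"fork-2": {"fork-1", "parent-1"},
		"fork-1": {"parent-1"},
	}
	fmt.Println(d.needed("fork-2"))
}
```

Retrieving a fork's snapshot then means pulling this whole list out of storage, which is part of why retrieval is the expensive direction.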



  47. Run at off-peak hours

    Use the raw git protocol

    Set user agents

    Only fetch diffs; packs have to be used together



  52. Gigantic repositories


  53. Gigantic repositories
    9GB repo. WTF.


  54. Disappearing repositories
    401 Unauthorized


  55. Disappearing repositories
    DMCA
    $ git clone git://github.com/rtmpdump/rtmpdump-2.5.git
    Cloning into 'rtmpdump-2.5'...
    fatal: remote error:
    Repository unavailable due to DMCA takedown.
    See the takedown notice for more details:
    https://github.com/github/dmca/blob/master/2016-07-22-rtmpdump.md.


  56. Last minute crashes
    “Have you tried turning it off and on again?”


  57. The Backpanel


  58. The Backpanel
    Our admin UI!
    E.g. blacklist of excessively large repos,
    whitelist of exceptions to that,
    manually deleted repos
    … but building UIs sucks (for us, anyway)
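
The lists managed through the admin UI could feed a fetch decision along these lines. This is a guess at the precedence (manual deletions win, then the whitelist overrides both the blacklist and the size limit); the `Policy` type and the `MaxSize` field are hypothetical.

```go
package main

import "fmt"

// Policy bundles the lists described on the slide. All names are
// illustrative; MaxSize is an assumed size limit in bytes.
type Policy struct {
	MaxSize   int64
	Blacklist map[string]bool // excessively large repositories
	Whitelist map[string]bool // exceptions to the size rules
	Deleted   map[string]bool // manually deleted repositories
}

// ShouldFetch reports whether a repository of the given size is archived.
func (p Policy) ShouldFetch(repo string, size int64) bool {
	switch {
	case p.Deleted[repo]:
		return false
	case p.Whitelist[repo]:
		return true
	case p.Blacklist[repo]:
		return false
	}
	return size <= p.MaxSize
}

func main() {
	p := Policy{
		MaxSize:   1 << 30,
		Blacklist: map[string]bool{"example/huge": true},
		Whitelist: map[string]bool{"example/big-but-wanted": true},
	}
	fmt.Println(p.ShouldFetch("example/huge", 100))
	fmt.Println(p.ShouldFetch("example/big-but-wanted", 5<<30))
	fmt.Println(p.ShouldFetch("example/normal", 10<<20))
}
```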


  59. The Backpanel


  60. The Backpanel
    Lazy...
    But it works!


  61. The Backpanel
    https://trello.com/b/04pbw4Gv/blacklist


  62. The Frontend


  63. The Frontend
    git clone interface

    Web interface


  64. The Frontend
    git clone interface

    Web interface

    Retrieval is expensive

    Outbound bandwidth even more


  65. Local cache and alternates
    .
    ├── HEAD
    ├── config
    ├── objects/
    │   └── info/
    │       └── alternates
    ├── packed-refs
    └── refs/
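
Git's alternates mechanism lets a repository borrow objects from another object directory listed, one path per line, in `objects/info/alternates`. The sketch below writes part of the layout shown above so that objects come from a shared local cache. The cache path and function names are assumptions, and `config` and `packed-refs` are left out for brevity.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeSkeleton lays out a minimal repository directory whose objects are
// borrowed from cacheObjects through objects/info/alternates.
func writeSkeleton(dir, cacheObjects string) error {
	for _, d := range []string{
		filepath.Join(dir, "objects", "info"),
		filepath.Join(dir, "refs"),
	} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return err
		}
	}
	files := map[string]string{
		"HEAD": "ref: refs/heads/master\n",
		filepath.Join("objects", "info", "alternates"): cacheObjects + "\n",
	}
	for name, content := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o644); err != nil {
			return err
		}
	}
	return nil
}

// demo builds a skeleton in a temporary directory and returns the contents
// of its alternates file. The cache path is a made-up example.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "codearchive-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := writeSkeleton(dir, "/var/cache/archive/objects"); err != nil {
		return "", err
	}
	b, err := os.ReadFile(filepath.Join(dir, "objects", "info", "alternates"))
	return string(b), err
}

func main() {
	s, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(s)
}
```

With this layout each served snapshot only needs its own refs, while the objects themselves are stored once in the cache.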


  66. Clone one or all snapshots
    Exactly like it looked at a given time:
    $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
    All the snapshots at once:
    $ git clone https://codearchive.org/all/github.com/FiloSottile/gvt
    $ git branch
    2016-07-01Z11:44:11/master
    2016-07-02Z22:44:00/master
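
The two URL shapes and the per-snapshot branch names above suggest a small routing step in the frontend. The following is a sketch of how such paths might be parsed, with invented names (`Request`, `parsePath`, `branchName`); the real frontend may work differently.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Request describes what a clone URL path asks for.
type Request struct {
	All  bool      // true for /all/...
	Date time.Time // snapshot date when All is false
	Repo string    // e.g. "github.com/FiloSottile/gvt"
}

// parsePath splits "/<date or all>/<host>/<owner>/<name>" into a Request.
func parsePath(p string) (Request, error) {
	parts := strings.SplitN(strings.TrimPrefix(p, "/"), "/", 2)
	if len(parts) != 2 || parts[1] == "" {
		return Request{}, fmt.Errorf("missing repository in %q", p)
	}
	selector, repo := parts[0], parts[1]
	if selector == "all" {
		return Request{All: true, Repo: repo}, nil
	}
	d, err := time.Parse("2006-01-02", selector)
	if err != nil {
		return Request{}, fmt.Errorf("bad snapshot selector %q: %w", selector, err)
	}
	return Request{Date: d, Repo: repo}, nil
}

// branchName builds a per-snapshot branch name in the style shown on the
// slide, e.g. "2016-07-01Z11:44:11/master".
func branchName(at time.Time, ref string) string {
	at = at.UTC()
	return at.Format("2006-01-02") + "Z" + at.Format("15:04:05") + "/" + ref
}

func main() {
	r, err := parsePath("/2016-07-01/github.com/FiloSottile/gvt")
	fmt.Println(r.Repo, r.Date.Format("2006-01-02"), err)
	fmt.Println(branchName(time.Date(2016, 7, 1, 11, 44, 11, 0, time.UTC), "master"))
}
```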


  67. One more step
    $ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
    Welcome to the Code Archive!
    Since download bandwidth is expensive, please click here to verify that
    you are human:
    https://codearchive.org/captcha/72f878a9670ab664
    The download will start automatically...


  68. Web UI

    Work in progress

    Wayback Machine-style slider at the top

    We suck at UIs. PRs welcome!


  69. Things to come


  70. Beyond git and GitHub


  71. Hiding things :(

    Login with GitHub and hide your repositories

    Automated DMCA processing


  72. Long term storage
    B2 object storage.
    Sponsored by


  73. Thank you!
    https://codearchive.org
    Filippo Valsorda - @FiloSottile
    Salman Aljammaz - @saljam_
