Slide 1

Slide 1 text

The Code Archive
Clone ALL the code.

Slide 2

Slide 2 text

Filippo Valsorda - @FiloSottile
Salman Aljammaz - @saljam_

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

GitHub alone has
34,000,000 repositories
14,000,000 users

Slide 7

Slide 7 text

Go
import (
	"github.com/miekg/dns"
	"golang.org/x/exp/io/i2c"
)

Slide 8

Slide 8 text

Repositories get deleted.

Slide 9

Slide 9 text

Branches get rebased.

Slide 10

Slide 10 text

Histories get rewritten.

Slide 11

Slide 11 text

Services go down.

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Services disappear.

Slide 18

Slide 18 text

We <3 GitHub!!

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

We want a Wayback Machine for code

Slide 21

Slide 21 text

And we built it!
There are 390,123 snapshots of 91,396 repositories, totalling 1.025 terabytes.
As of 2016-07-22

Slide 22

Slide 22 text

The prototype
● GitHub only
● Active repositories (fetched on push)
● Popular repositories (at least 10 ★)
● Reasonable repository size

Slide 23

Slide 23 text

Architecture
[Diagram; labels: Drinker, Fetcher, Pack blob storage, GH API, Queue, git pull, Frontend]

Slide 24

Slide 24 text

The Drinker
Drink the GitHub Firehose!

Slide 25

Slide 25 text

The Drinker
● Monitor firehose for push, create, open source events
● Queue repositories
● Filter by number of stars

Slide 26

Slide 26 text

The Drinker
● Monitor firehose for push, create, open source events
● Queue repositories
● Filter by number of stars
● GitHub API rate limit: 5K / hour
● Drink from https://www.githubarchive.org/
● Cache number of stars, update via events
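The filtering described on this slide can be sketched roughly as follows. This is a minimal illustration, not the Drinker's actual code: the `Event` and `StarCache` types are hypothetical, the event type names are the ones GitHub uses in its public timeline (`WatchEvent` is emitted when a repository is starred, `PublicEvent` when one is open sourced), and a repository missing from the cache is simply treated as having zero stars instead of being looked up through the rate-limited API.

```go
package main

import "fmt"

// Event is a simplified, hypothetical stand-in for one GitHub timeline event.
type Event struct {
	Type string // e.g. "PushEvent", "CreateEvent", "PublicEvent", "WatchEvent"
	Repo string // "owner/name"
}

// minStars mirrors the "at least 10 stars" rule from the prototype slide.
const minStars = 10

// StarCache keeps an approximate star count per repository so that the
// rate-limited API does not have to be queried for every event.
type StarCache map[string]int

// Handle updates the cache from star events and reports whether the
// repository named in the event should be queued for fetching.
func (c StarCache) Handle(ev Event) bool {
	switch ev.Type {
	case "WatchEvent":
		// A new star: update the cached count, nothing to fetch.
		c[ev.Repo]++
		return false
	case "PushEvent", "CreateEvent", "PublicEvent":
		// Something changed: queue only sufficiently popular repositories.
		return c[ev.Repo] >= minStars
	}
	return false
}

func main() {
	cache := StarCache{"FiloSottile/gvt": 9}
	events := []Event{
		{Type: "PushEvent", Repo: "FiloSottile/gvt"},
		{Type: "WatchEvent", Repo: "FiloSottile/gvt"},
		{Type: "PushEvent", Repo: "FiloSottile/gvt"},
	}
	for _, ev := range events {
		fmt.Printf("%s %s queue=%v\n", ev.Type, ev.Repo, cache.Handle(ev))
	}
}
```

Updating the count from star events keeps the cache roughly current without spending any of the hourly API budget.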

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Cache size: 7 million

Slide 33

Slide 33 text

https://github.com/google/go-github/pull/317

Slide 34

Slide 34 text

The Fetcher
Just fetch the repos!

Slide 35

Slide 35 text

The Fetcher
Just fetch the repos!
But fetch to what?

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Server → client
00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
0000
Client → server
003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
0009done
0000
Server → client: [packfile]
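Each line in the exchange above uses git's pkt-line framing: four hex digits giving the total length (the four digits included), followed by the payload, with "0000" as a flush packet. A minimal sketch of that framing, with hypothetical helper names (`encodePktLine`, `decodePktLine`) and no claim to match the Fetcher's own code:

```go
package main

import (
	"fmt"
	"strconv"
)

// flushPkt is the special zero-length packet that ends a section.
const flushPkt = "0000"

// encodePktLine prefixes payload with its length as four hex digits.
// The length counts the four prefix bytes themselves.
func encodePktLine(payload string) string {
	return fmt.Sprintf("%04x%s", len(payload)+4, payload)
}

// decodePktLine reads one pkt-line from the front of data and returns its
// payload, the remaining input, and whether the packet was a flush.
func decodePktLine(data string) (payload, rest string, flush bool, err error) {
	if len(data) < 4 {
		return "", "", false, fmt.Errorf("short pkt-line header: %q", data)
	}
	n, err := strconv.ParseUint(data[:4], 16, 16)
	if err != nil {
		return "", "", false, fmt.Errorf("bad pkt-line length %q: %w", data[:4], err)
	}
	if n == 0 {
		return "", data[4:], true, nil
	}
	if n < 4 || int(n) > len(data) {
		return "", "", false, fmt.Errorf("pkt-line length %d out of range", n)
	}
	return data[4:n], data[n:], false, nil
}

func main() {
	msg := encodePktLine("want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c\n") +
		encodePktLine("done\n") + flushPkt
	for len(msg) > 0 {
		payload, rest, flush, err := decodePktLine(msg)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		if flush {
			fmt.Println("flush")
		} else {
			fmt.Printf("%q\n", payload)
		}
		msg = rest
	}
}
```

With the trailing newline counted, "done" encodes to the `0009done` seen on the slide, and a `want` line with a 40-character hash gets the `0032` prefix.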

Slide 38

Slide 38 text

2016-05-03 09:55:11 Z
HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
master → 867387cf1373910c60c78cab81085cb846fadfdb
v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c

Slide 39

Slide 39 text

2016-05-03 09:55:11 Z
HEAD → 867387cf1373910c60c78cab81085cb846fadfdb
master → 867387cf1373910c60c78cab81085cb846fadfdb
v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
2016-05-03 10:55:33 Z
HEAD → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
master → 5b53898f17dda3d2af6bc599b45b0d7b76f900f0
v1.2.3 → 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
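One way to read these two snapshots: the stored refs from the previous fetch are enough to decide what to ask the server for next. The sketch below is an assumption about how that could work, not the Fetcher's actual logic; the `Snapshot` type and `negotiate` function are hypothetical, and real git negotiation can take several rounds of "have" lines rather than sending them all at once.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot maps a ref name to the commit hash it pointed at when fetched.
type Snapshot map[string]string

// negotiate compares the previously stored snapshot with the refs the server
// advertises now. It returns the hashes to request ("want") and the hashes
// the archive already holds ("have"), both deduplicated and sorted.
func negotiate(prev, advertised Snapshot) (wants, haves []string) {
	known := map[string]bool{}
	for _, hash := range prev {
		known[hash] = true
	}
	requested := map[string]bool{}
	for _, hash := range advertised {
		if !known[hash] && !requested[hash] {
			requested[hash] = true
			wants = append(wants, hash)
		}
	}
	for hash := range known {
		haves = append(haves, hash)
	}
	sort.Strings(wants)
	sort.Strings(haves)
	return wants, haves
}

func main() {
	prev := Snapshot{
		"HEAD":   "867387cf1373910c60c78cab81085cb846fadfdb",
		"master": "867387cf1373910c60c78cab81085cb846fadfdb",
		"v1.2.3": "236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c",
	}
	now := Snapshot{
		"HEAD":   "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"master": "5b53898f17dda3d2af6bc599b45b0d7b76f900f0",
		"v1.2.3": "236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c",
	}
	wants, haves := negotiate(prev, now)
	fmt.Println("want:", wants)
	fmt.Println("have:", haves)
}
```

For the slide's example only the new `master` commit is wanted, since the tag has not moved.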

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

Server → client
00df867387cf1373910c60c78cab81085cb846fadfdb HEAD□[...]
003f867387cf1373910c60c78cab81085cb846fadfdb refs/heads/master
003f236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c refs/tags/v1.2.3
0000
Client → server
003cwant 867387cf1373910c60c78cab81085cb846fadfdb [...]
0032want 236f9b3eebd2f3e743b6ec117dd95e4b2857dd8c
0032have 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
0009done
0000
Server → client: [packfile]

Slide 44

Slide 44 text

Cold cool storage
● Upload is cheap; retrieval and download are expensive
● Keep the refs (branch/tag name to commit) in a live db
● Store the packfiles and never look at them
● The server will do the diffs

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Forks
● Forks sync from parent
● Parent gets PR from forks
● Send “haves” for the entire network
● Build a packfile dependency tree
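Because every stored packfile only contains what was missing at fetch time, restoring a snapshot means collecting the pack it came from plus everything that pack was built on top of. A minimal sketch of walking such a dependency tree, assuming each pack records a single parent; the `Pack` type, the pack IDs, and `restoreOrder` are hypothetical names for illustration only:

```go
package main

import "fmt"

// Pack describes one stored packfile. Parent is the ID of the pack it was
// fetched on top of, or "" for an initial full fetch.
type Pack struct {
	ID     string
	Parent string
}

// restoreOrder returns the IDs of all packs needed to use the given pack,
// ordered from the root of the dependency tree down to the pack itself.
func restoreOrder(packs map[string]Pack, id string) ([]string, error) {
	var chain []string
	seen := map[string]bool{}
	for id != "" {
		p, ok := packs[id]
		if !ok {
			return nil, fmt.Errorf("missing pack %q", id)
		}
		if seen[id] {
			return nil, fmt.Errorf("dependency cycle at pack %q", id)
		}
		seen[id] = true
		chain = append(chain, id)
		id = p.Parent
	}
	// The walk went leaf to root; reverse it so the root comes first.
	for i, j := 0, len(chain)-1; i < j; i, j = i+1, j-1 {
		chain[i], chain[j] = chain[j], chain[i]
	}
	return chain, nil
}

func main() {
	packs := map[string]Pack{
		"parent-full": {ID: "parent-full"},
		"parent-incr": {ID: "parent-incr", Parent: "parent-full"},
		"fork-incr":   {ID: "fork-incr", Parent: "parent-incr"},
	}
	order, err := restoreOrder(packs, "fork-incr")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order)
}
```

In this example a fork's incremental pack depends on packs fetched from its parent repository, which is what sending "haves" for the whole network makes possible.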

Slide 47

Slide 47 text

● Run at off-peak hours
● Use the raw git protocol
● Set user agents
● Only fetch diffs; packs have to be used together

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

Gigantic repositories

Slide 53

Slide 53 text

Gigantic repositories
9GB repo. WTF.

Slide 54

Slide 54 text

Disappearing repositories
401 Unauthorized

Slide 55

Slide 55 text

Disappearing repositories
DMCA
$ git clone git://github.com/rtmpdump/rtmpdump-2.5.git
Cloning into 'rtmpdump-2.5'...
fatal: remote error:
  Repository unavailable due to DMCA takedown.
  See the takedown notice for more details:
  https://github.com/github/dmca/blob/master/2016-07-22-rtmpdump.md.

Slide 56

Slide 56 text

Last minute crashes
“Have you tried turning it off and on again?”

Slide 57

Slide 57 text

The Backpanel

Slide 58

Slide 58 text

The Backpanel
Our admin UI!
E.g. blacklist of excessively large repos, whitelist of exceptions to that, manually deleted repos
… but building UIs sucks (for us, anyway)

Slide 59

Slide 59 text

The Backpanel

Slide 60

Slide 60 text

The Backpanel
Lazy... But it works!

Slide 61

Slide 61 text

The Backpanel
https://trello.com/b/04pbw4Gv/blacklist

Slide 62

Slide 62 text

The Frontend

Slide 63

Slide 63 text

The Frontend
● git clone interface
● Web interface

Slide 64

Slide 64 text

The Frontend
● git clone interface
● Web interface
● Retrieval is expensive
● Outbound bandwidth even more so

Slide 65

Slide 65 text

Local cache and alternates
.
├── HEAD
├── config
├── objects/
│   └── info/
│       └── alternates
├── packed-refs
└── refs/
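The layout above relies on git's alternates mechanism: `objects/info/alternates` lists other object directories, one path per line, that git consults when an object is not found locally. A minimal sketch of writing that file for a served repository follows; the function names and the cache path are hypothetical, and how the Frontend actually populates its cache is not stated on the slide.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// alternatesContent renders the alternates file: one object directory per
// line, each line terminated by a newline.
func alternatesContent(objectDirs []string) string {
	if len(objectDirs) == 0 {
		return ""
	}
	return strings.Join(objectDirs, "\n") + "\n"
}

// writeAlternates creates objects/info/alternates inside repoDir so that git
// also looks for objects in the given directories (e.g. a local pack cache).
func writeAlternates(repoDir string, objectDirs []string) error {
	infoDir := filepath.Join(repoDir, "objects", "info")
	if err := os.MkdirAll(infoDir, 0o755); err != nil {
		return err
	}
	path := filepath.Join(infoDir, "alternates")
	return os.WriteFile(path, []byte(alternatesContent(objectDirs)), 0o644)
}

func main() {
	repo, err := os.MkdirTemp("", "snapshot-*.git")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(repo)

	// Hypothetical cache location, for illustration only.
	cache := []string{"/var/cache/codearchive/objects"}
	if err := writeAlternates(repo, cache); err != nil {
		fmt.Println("error:", err)
		return
	}
	data, err := os.ReadFile(filepath.Join(repo, "objects", "info", "alternates"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(string(data))
}
```

The served repository then needs only its own refs and config, while the object data stays in the shared cache.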

Slide 66

Slide 66 text

Clone one or all snapshots
Exactly like it looked at a given time:
$ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
All the snapshots at once:
$ git clone https://codearchive.org/all/github.com/FiloSottile/gvt
$ git branch
2016-07-01Z11:44:11/master
2016-07-02Z22:44:00/master
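The URL scheme above puts a snapshot selector (a date or "all") in front of the original repository path. A minimal sketch of parsing it, assuming that layout; the `Request` type and `parseClonePath` are hypothetical, and the sketch ignores the extra path suffixes a real git HTTP client appends (such as `/info/refs`).

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Request describes what a clone URL asks for.
type Request struct {
	All  bool      // true for /all/...: every snapshot at once
	Date time.Time // snapshot date, set only when All is false
	Repo string    // original repository path, e.g. "github.com/FiloSottile/gvt"
}

// parseClonePath splits "/<date-or-all>/<repo path>" into a Request.
func parseClonePath(p string) (Request, error) {
	p = strings.TrimPrefix(p, "/")
	selector, repo, ok := strings.Cut(p, "/")
	if !ok || repo == "" {
		return Request{}, fmt.Errorf("expected /<snapshot>/<repo>, got %q", p)
	}
	if selector == "all" {
		return Request{All: true, Repo: repo}, nil
	}
	date, err := time.Parse("2006-01-02", selector)
	if err != nil {
		return Request{}, fmt.Errorf("bad snapshot date %q: %w", selector, err)
	}
	return Request{Date: date, Repo: repo}, nil
}

func main() {
	for _, p := range []string{
		"/2016-07-01/github.com/FiloSottile/gvt",
		"/all/github.com/FiloSottile/gvt",
	} {
		req, err := parseClonePath(p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("all=%v date=%s repo=%s\n", req.All, req.Date.Format("2006-01-02"), req.Repo)
	}
}
```

For the "all" case, each snapshot's refs would then be exposed under a timestamp prefix, as in the `git branch` output on the slide.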

Slide 67

Slide 67 text

One more step
$ git clone https://codearchive.org/2016-07-01/github.com/FiloSottile/gvt
Welcome to the Code Archive!
Since download bandwidth is expensive, please click here to verify that you are human:
https://codearchive.org/captcha/72f878a9670ab664
The download will start automatically...

Slide 68

Slide 68 text

Web UI
● Work in progress
● Wayback Machine-style slider at the top
● We suck at UIs. PRs welcome!

Slide 69

Slide 69 text

Things to come

Slide 70

Slide 70 text

Beyond git and GitHub

Slide 71

Slide 71 text

Hiding things :(
● Login with GitHub and hide your repositories
● Automated DMCA processing

Slide 72

Slide 72 text

Long term storage
B2 object storage.
Sponsored by

Slide 73

Slide 73 text

Thank you!
https://codearchive.org
Filippo Valsorda - @FiloSottile
Salman Aljammaz - @saljam_