Slide 1

Slide 1 text

Internals of Git Internals Minqi Pan

Slide 2

Slide 2 text

I’m Minqi Pan github.com/pmq20 twitter @psvr

Slide 3

Slide 3 text

Synopsis • Internals of downloading code • Internals of uploading code

Slide 4

Slide 4 text

Question:
 Which commands to use?

Slide 5

Slide 5 text

140 commands add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree

Slide 6

Slide 6 text

– Chapter 10. Git Internals “Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it.”

Slide 7

Slide 7 text

a content-addressable filesystem
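"Content-addressable" means the name of an object is derived from its content. A minimal sketch of how a blob name is computed, assuming the usual "<type> <size>\0" header that Git prepends before hashing; the function name git_object_id is made up for this illustration:

```python
import hashlib


def git_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 object name Git would assign to this content."""
    # Git hashes a short header ("<type> <size>\0") followed by the raw bytes.
    header = f"{obj_type} {len(data)}\0".encode("ascii")
    return hashlib.sha1(header + data).hexdigest()


# Same result as: echo 'test content' | git hash-object --stdin
print(git_object_id(b"test content\n"))
```

The same content always yields the same name, which is why the store can be looked up by name alone.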

Slide 8

Slide 8 text

simple
 straightforward
 a whole-object store

Slide 9

Slide 9 text

“Let there be packs!”

Slide 10

Slide 10 text

Pack-files • USAGE I: streaming • USAGE II: on-disk storage • w/ nice balance of density vs ease of use • “Git really doesn't follow files”

Slide 11

Slide 11 text

Git Objects Types

Slide 12

Slide 12 text

Random Access • objects are compressed one at a time • pessimization due to the double usage: lower compression factor for the pack as a whole • an object name can be translated to its location in the pack

Slide 13

Slide 13 text

Spotlights of today add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree

Slide 14

Slide 14 text

Remote Procedure Calls • git-fetch-pack --ssh,http--> git-upload-pack • git-send-pack --ssh,http--> git-receive-pack

Slide 15

Slide 15 text

Local UNIX-y Calls • git-upload-pack calls git-pack-objects • git-receive-pack calls git-unpack-objects • git-receive-pack also calls git-index-pack

Slide 16

Slide 16 text

Advantage of being UNIX-y: global variables can be used anywhere, so there is no need to pass state around

Slide 17

Slide 17 text

git-upload-pack • first called with --stateless-rpc --advertise-refs • then called with --stateless-rpc • internally calls git-pack-objects

Slide 18

Slide 18 text

git-upload-pack --advertise-refs
 001e# service=git-upload-pack
 000000d10ffaa5a6fdf84714deddaecc62edd5540e0f5877 HEAD multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag multi_ack_detailed no-done symref=HEAD:refs/heads/master agent=git/2.2.2
 0057a576d8e0d128b503bb3d9980387fea0976ab6921 refs/heads/bug_fix/a
 0053affee7f8adf11c7d895189ec5f0c3aeb95dba01a refs/heads/bug_fix/b
 0053cbe2e212b415f587f086af6a74670919ccb7b089 refs/heads/bug_fix/c
 004568df60003126a1f6478eead611af1c9000dfe44e refs/heads/feature/ci
 0052c063094d4f94e182a23aa9a479bf67bf6e2de6b4 refs/heads/feature/d
 004cfd02d43bc2ca02d381275453383e2c2629e33da5 refs/heads/feature/e
 0049fdb1e470fd4d9be344dc77645bf777eec07d7940 refs/heads/feature/f
 …
 0000
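The advertisement is framed as pkt-lines: each packet starts with 4 hex digits giving its total length (the 4 digits included), and 0000 is a flush packet. A rough sketch of a reader and writer for that framing; the helper names pkt_line and parse_pkt_lines are invented here, and the special short packets of newer protocol versions are not handled:

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one payload as a pkt-line (length prefix counts its own 4 bytes)."""
    return f"{len(payload) + 4:04x}".encode("ascii") + payload


def parse_pkt_lines(data: bytes) -> list:
    """Split a pkt-line stream into payloads; None stands for a flush packet."""
    packets = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4].decode("ascii"), 16)
        if length == 0:
            # "0000" is a flush packet and carries no payload.
            packets.append(None)
            pos += 4
        elif length < 4:
            # Assumption: other short lengths are treated as errors in this sketch.
            raise ValueError(f"unsupported pkt-line length {length}")
        else:
            packets.append(data[pos + 4:pos + length])
            pos += length
    return packets
```

For instance, "# service=git-upload-pack" plus a newline is 26 bytes, so its prefix is 0x1e == 30, which matches the first line of the output above.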

Slide 19

Slide 19 text

git-pack-objects • /usr/bin/git • pack-objects • --revs • --stdout • --progress • --delta-base-offset

Slide 20

Slide 20 text

pack-*.idx a simple index to do "object name -> location in pack-files" translations

Slide 21

Slide 21 text

pack-*.idx Data Structure

Slide 22

Slide 22 text

pack-*.idx Data Structure

Slide 23

Slide 23 text

pack-*.idx Data Structure A 256-entry fan-out table, i.e. 256 * 4 bytes

Slide 24

Slide 24 text

pack-*.idx Data Structure 0x0309 == number of objects whose first byte of object name is less than or equal to 1 0x060D == number of objects whose first byte of object name is less than or equal to 2 …

Slide 25

Slide 25 text

pack-*.idx Data Structure Question: Why 256 entries in total?

Slide 26

Slide 26 text

pack-*.idx Data Structure Answer: the table is indexed by the first byte of the 20-byte SHA-1, and a byte has 256 possible values

Slide 27

Slide 27 text

pack-*.idx Data Structure Question: Which bytes hold the total number of objects?

Slide 28

Slide 28 text

pack-*.idx Data Structure Answer: bytes 8 + 255 * 4 to 8 + 256 * 4, i.e. 1028 to 1032 (the last fan-out entry)

Slide 29

Slide 29 text

pack-*.idx Data Structure Read it: 0x02f1f2 == 193010
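Reading the fan-out table takes only a few lines. This sketch assumes a version 2 idx file (8-byte header: the magic bytes \377tOc and a 4-byte version) with big-endian 4-byte entries; read_fanout is a name chosen for this example:

```python
import struct


def read_fanout(idx: bytes) -> tuple:
    """Return the 256 cumulative counts stored after the 8-byte idx v2 header."""
    magic, version = struct.unpack(">4sI", idx[:8])
    if magic != b"\xfftOc" or version != 2:
        raise ValueError("not a version 2 pack index")
    # Entry i counts the objects whose first name byte is <= i.
    return struct.unpack(">256I", idx[8:8 + 256 * 4])


def total_objects(idx: bytes) -> int:
    """The last fan-out entry (bytes 1028 to 1032) is the total object count."""
    return read_fanout(idx)[255]
```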

Slide 30

Slide 30 text

pack-*.idx Data Structure sorted 20-byte SHA-1 object names

Slide 31

Slide 31 text

pack-*.idx Data Structure Question: Why Sorted?

Slide 32

Slide 32 text

pack-*.idx Data Structure Answer: Binary search after fan-out

Slide 33

Slide 33 text

pack-*.idx Data Structure Question: Ends where?

Slide 34

Slide 34 text

pack-*.idx Data Structure Answer: 1032 + 20 * 0x02f1f2
 ==
 3861232

Slide 35

Slide 35 text

pack-*.idx Data Structure A table of 4-byte CRC32 values of the packed object data.

Slide 36

Slide 36 text

pack-*.idx Data Structure Question: Ends where?

Slide 37

Slide 37 text

pack-*.idx Data Structure Answer: 3861232 + 4 * 0x02f1f2 == 4633272

Slide 38

Slide 38 text

pack-*.idx Data Structure 4-byte offset values: MSB + 31 bits. A set Most Significant Bit marks the usage of the 8-byte table

Slide 39

Slide 39 text

pack-*.idx Data Structure Question: Any limitations?

Slide 40

Slide 40 text

pack-*.idx Data Structure Answer: With 31-bit offsets alone, pack files can be at most 2147483648 bytes (2 GiB) in size

Slide 41

Slide 41 text

pack-*.idx Data Structure Question: Ends where?

Slide 42

Slide 42 text

pack-*.idx Data Structure Answer: 4633272 + 4 * 0x02f1f2 == 5405312
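The three "Ends where?" answers follow from the object count alone. A small sketch of that arithmetic, with idx_v2_layout as an illustrative name:

```python
def idx_v2_layout(num_objects: int) -> dict:
    """Byte positions at which each fixed-width idx v2 table ends."""
    fanout_end = 8 + 256 * 4                    # header + fan-out table
    names_end = fanout_end + 20 * num_objects   # sorted 20-byte SHA-1 names
    crc_end = names_end + 4 * num_objects       # 4-byte CRC32 values
    offsets_end = crc_end + 4 * num_objects     # 4-byte offset values
    return {
        "fanout_end": fanout_end,
        "names_end": names_end,
        "crc_end": crc_end,
        "offsets_end": offsets_end,
    }


layout = idx_v2_layout(0x02f1f2)
```

With 0x02f1f2 == 193010 objects this reproduces 1032, 3861232, 4633272 and 5405312.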

Slide 43

Slide 43 text

pack-*.idx Data Structure A table of 8-byte offset entries. (empty for pack files less than 2 GiB) Question: Any limitations?

Slide 44

Slide 44 text

pack-*.idx Data Structure A table of 8-byte offset entries. (empty for pack files less than 2 GiB) Answer: No. What file on earth weighs 16800000 TB? Packs larger than 2 GiB are well supported
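How the two offset tables cooperate can be sketched as follows. The function name resolve_offset is hypothetical, and large_offsets stands for the raw bytes of the 8-byte table:

```python
import struct


def resolve_offset(entry: int, large_offsets: bytes) -> int:
    """Turn one 4-byte offset entry into the real position in the pack."""
    if entry & 0x80000000:
        # MSB set: the low 31 bits are an index into the 8-byte table.
        index = entry & 0x7FFFFFFF
        return struct.unpack_from(">Q", large_offsets, index * 8)[0]
    # MSB clear: the entry itself is the offset (below 2 GiB).
    return entry
```

Small packs never touch the second table, which keeps the common case compact.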

Slide 45

Slide 45 text

pack-*.idx Data Structure (trailer) A copy of the 20-byte SHA-1 checksum at the end of corresponding packfile. 20-byte SHA-1-checksum of all of the above.
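A sketch of checking the trailer, assuming idx and pack hold the complete file contents as bytes; check_idx_trailer is a made-up helper, not a Git function:

```python
import hashlib


def check_idx_trailer(idx: bytes, pack: bytes) -> bool:
    """Verify both 20-byte checksums stored at the end of an idx file."""
    pack_checksum_copy = idx[-40:-20]   # copy of the checksum ending the pack
    idx_checksum = idx[-20:]            # SHA-1 of everything before it
    return (
        pack_checksum_copy == pack[-20:]
        and hashlib.sha1(idx[:-20]).digest() == idx_checksum
    )
```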

Slide 46

Slide 46 text

pack-*.pack Data Structure header

Slide 47

Slide 47 text

pack-*.pack Data Structure

Slide 48

Slide 48 text

pack-*.pack Data Structure

Slide 49

Slide 49 text

pack-*.pack Data Structure

Slide 50

Slide 50 text

pack-*.pack Data Structure n * object entries
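The header in front of the n object entries is 12 bytes: the signature PACK, a 4-byte version and the 4-byte entry count, all big-endian. A sketch of reading it, with read_pack_header as an illustrative name:

```python
import struct


def read_pack_header(pack: bytes) -> tuple:
    """Return (version, number_of_objects) from the 12-byte pack header."""
    signature, version, count = struct.unpack(">4sII", pack[:12])
    if signature != b"PACK":
        raise ValueError("not a pack file")
    return version, count
```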

Slide 51

Slide 51 text

Example • First byte of the name is 0x9f • IDX[8 + (0x9f - 1) * 4] == 0x0403 == 1027 • IDX[8 + 0x9f * 4] == 0x0405 == 1029 • Object No. 1027 ~ 1029 Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9

Slide 52

Slide 52 text

Example • Binary search 1027 ~ 1029 • Found at 8 + 4 * 256 + 1027 * 20 == 21572 • Skip the rest total_num*(20+4) == 1628*24 Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9

Slide 53

Slide 53 text

Example • IDX[8 + 4 * 256 + 1628*24 + 4 * 1027] Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] == PACK[280621]
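The whole lookup (fan-out, binary search, offset table) fits in one function. This is a sketch under assumptions: it handles idx version 2 only, returns the raw 4-byte entry without consulting the 8-byte table, and build_idx_tables is a hypothetical helper that only exists to create a tiny index to try it on (CRC values are zero placeholders, no trailer):

```python
import struct

HEADER = 8           # magic + version
FANOUT = 256 * 4     # 256 entries of 4 bytes


def build_idx_tables(entries: list) -> bytes:
    """Build header, fan-out, name, CRC and offset tables from (name, offset) pairs."""
    entries = sorted(entries)
    fanout = [0] * 256
    for name, _ in entries:
        fanout[name[0]] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]          # make the counts cumulative
    out = b"\xfftOc" + struct.pack(">I", 2) + struct.pack(">256I", *fanout)
    out += b"".join(name for name, _ in entries)
    out += b"\0\0\0\0" * len(entries)       # CRC32 placeholders
    out += b"".join(struct.pack(">I", offset) for _, offset in entries)
    return out


def find_offset(idx: bytes, name: bytes):
    """Return the 4-byte offset entry for a 20-byte object name, or None."""
    fanout = struct.unpack_from(">256I", idx, HEADER)
    first = name[0]
    lo = fanout[first - 1] if first > 0 else 0   # objects with a smaller first byte
    hi = fanout[first]                           # objects up to and including it
    total = fanout[255]
    names_start = HEADER + FANOUT
    while lo < hi:                               # binary search within [lo, hi)
        mid = (lo + hi) // 2
        start = names_start + 20 * mid
        candidate = idx[start:start + 20]
        if candidate < name:
            lo = mid + 1
        elif candidate > name:
            hi = mid
        else:
            # Skip the name and CRC tables: total * (20 + 4) bytes.
            offsets_start = names_start + total * (20 + 4)
            return struct.unpack_from(">I", idx, offsets_start + 4 * mid)[0]
    return None
```

The fan-out table narrows the search to objects sharing the first byte, so the binary search only runs over a small slice of the name table.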

Slide 54

Slide 54 text

Example Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 E3 11100011 1_______ => MSB 1 continue _110____ => type == 6 == OFS_DELTA ____0011 => length == 3 3-bit type, (n-1)*7+4-bit length

Slide 55

Slide 55 text

Example Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] 01 00000001 0_______ => MSB 0 break _0000001 => length += (1 << 4) final length == 19
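The two bytes E3 01 can be decoded with a short loop. A sketch, with read_object_header as a name invented for this example; it returns the type, the size and the position of the next unread byte:

```python
def read_object_header(pack: bytes, pos: int) -> tuple:
    """Decode the variable-length type/size header of one pack entry."""
    byte = pack[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # 3-bit type
    size = byte & 0x0F              # low 4 bits of the length
    shift = 4
    while byte & 0x80:              # MSB set: another length byte follows
        byte = pack[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos
```

For E3 01 this gives type 6 (OFS_DELTA) and length 3 + (1 << 4) == 19.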

Slide 56

Slide 56 text

Example Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] AA 10101010 1_______ MSB 1 continue _0101010 base offset == 42

Slide 57

Slide 57 text

Example Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] 44 01000100 0_______ MSB 0 break _1000100 offset == ((42+1)<<7)+68 == 5572
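The base offset bytes AA 44 follow a slightly different encoding, with the "+ 1" applied before each shift. A sketch (read_ofs_delta_offset is an illustrative name):

```python
def read_ofs_delta_offset(pack: bytes, pos: int) -> tuple:
    """Decode the negative base offset that follows an OFS_DELTA header."""
    byte = pack[pos]
    pos += 1
    offset = byte & 0x7F
    while byte & 0x80:              # MSB set: continue with the next byte
        byte = pack[pos]
        pos += 1
        offset = ((offset + 1) << 7) | (byte & 0x7F)
    return offset, pos
```

The result is the distance to walk backwards from the delta: 0x0004482D - 5572 == 275049 is where the base object starts.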

Slide 58

Slide 58 text

Example Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 offset == 5572 push 0x0004482D into stack deal with (0x0004482D - 5572) push (0x0004482D - 5572) into stack … root base

Slide 59

Slide 59 text

Example
 SHA1                                      type  size   size-pack  offset-pack  depth  base
 9fcf811e00fa469688943a9152c16d4ee90fb9a9  blob  19     32         280621       4      6110c89446f2281e5db9b798a0fa020fad6e63e1
 6110c89446f2281e5db9b798a0fa020fad6e63e1  blob  52     45         275049       3      3bbeff3fc22b75c1a26f4ab9b64449b33002aea5
 3bbeff3fc22b75c1a26f4ab9b64449b33002aea5  blob  2935   1263       273786       2      a399208309046656ecc01f7653c5d5b8905fc16e
 a399208309046656ecc01f7653c5d5b8905fc16e  blob  4686   1540       272246       1      e4e56117de8b3bd0bd899701da4712caee27c7d6
 e4e56117de8b3bd0bd899701da4712caee27c7d6  blob  12635  3279       115703       0      -
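Walking such a chain down to the root base can be sketched with an explicit stack. The mapping base_of (pack offset of a delta to pack offset of its base) is an assumption of this example, filled in from the offsets of this example chain:

```python
def delta_chain(start: int, base_of: dict) -> list:
    """Push offsets onto a stack until an object without a base is reached."""
    stack = []
    offset = start
    while offset is not None:
        stack.append(offset)
        offset = base_of.get(offset)   # None once the root base is reached
    return stack


# Delta offset -> base offset, for the five objects of the example.
base_of = {280621: 275049, 275049: 273786, 273786: 272246, 272246: 115703}
chain = delta_chain(280621, base_of)
```

Popping the stack then yields the root base first, so each delta can be applied on top of the object reconstructed before it.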

Slide 60

Slide 60 text

– Linus Torvalds “I played around with different delta algorithms, and with making the delta window bigger, but having too big of a sliding window makes it very expensive to generate the pack: you need to compare every object with a _ton_ of other objects.”

Slide 61

Slide 61 text

– Linus Torvalds “ANY order will give you a working pack, ... [but it is] the thing that gives packs good locality. It keeps the objects close to the head (whether they are old or new, but they are _reachable_ from the head) at the head of the pack. So packs actually have absolutely _wonderful_ IO patterns.”

Slide 62

Slide 62 text

Packing Heuristics • First sort by delta order • Then sort by recency order

Slide 63

Slide 63 text

Delta Order Heuristics • first sort by type. Objects of different types never delta with each other. • we do not delta different object types.

Slide 64

Slide 64 text

Delta Order Heuristics • then sort by filename/dirname. • we prefer to delta the objects with the same full path, but allow files with the same name from different directories.

Slide 65

Slide 65 text

Delta Order Heuristics • then if we are doing "thin" pack, the objects we are _not_ going to pack but we know about are sorted earlier than other objects. • we always prefer to delta against objects we are not going to send, if there are some. • for "thin" packs only. used when the other side is known to have such objects.

Slide 66

Slide 66 text

Delta Order Heuristics • and finally sort by size, larger to smaller. • we prefer to delta against larger objects, so that we have lots of removals. • large->small matters because of compression behaviour.

Slide 67

Slide 67 text

sort 1: (type, basename, size)
 sort 2: recency
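The first sort can be approximated with an ordinary sort key. This is only a sketch of the (type, basename, size) ordering: the dictionary fields are invented for the example, the "thin" pack preference is left out, and Git itself works on a hash of the name rather than the literal basename:

```python
import posixpath


def delta_order(objects: list) -> list:
    """Order delta candidates by type, then basename, then size (larger first)."""
    return sorted(
        objects,
        key=lambda o: (o["type"], posixpath.basename(o["path"]), -o["size"]),
    )


candidates = [
    {"name": "a", "type": "blob", "path": "src/main.c", "size": 100},
    {"name": "b", "type": "blob", "path": "lib/main.c", "size": 300},
    {"name": "c", "type": "tree", "path": "src", "size": 50},
    {"name": "d", "type": "blob", "path": "README", "size": 10},
]
ordered = [o["name"] for o in delta_order(candidates)]
```

The two main.c blobs end up adjacent even though they live in different directories, with the larger one first.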

Slide 68

Slide 68 text

Reconciling Two Sorts • Linus' law: files grow, larger objects tend to be "more recent" • we only write out the base object first if the delta against it was more recent • Thus the front of the pack always contains data that is relevant to a "recent" object

Slide 69

Slide 69 text

As a Result • delta order and recency order match each other quite well • xdelta, removing data is cheaper (in size) than adding data • xdelta, larger->small is actually a big space saver too

Slide 70

Slide 70 text

git-pack-objects w/ threads
 --threads=<n>
 Specifies the number of threads to spawn when searching for best delta matches. This requires that pack-objects be compiled with pthreads otherwise this option is ignored with a warning. This is meant to reduce packing time on multiprocessor machines. The required amount of memory for the delta search window is however multiplied by the number of threads. Specifying 0 will cause Git to auto-detect the number of CPU's and set the number of threads accordingly.

Slide 71

Slide 71 text

git-receive-pack • First called with --advertise-refs • (ntohl(hdr.hdr_entries) < unpack_limit) =>
 git-unpack-objects • (ntohl(hdr.hdr_entries) >= unpack_limit) =>
 git-index-pack • Finally update refs with locks
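The choice between the two helpers depends only on the entry count in the incoming pack header. A sketch of that decision; the function names are made up, and the limit is passed in by the caller rather than read from any configuration:

```python
def entries_in_pack(header: bytes) -> int:
    """Read the big-endian entry count (the ntohl(hdr.hdr_entries) step)."""
    if header[:4] != b"PACK":
        raise ValueError("not a pack header")
    return int.from_bytes(header[8:12], "big")


def choose_unpacker(header: bytes, unpack_limit: int) -> str:
    """Small pushes are exploded into loose objects, large ones stay packed."""
    if entries_in_pack(header) < unpack_limit:
        return "git-unpack-objects"
    return "git-index-pack"
```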

Slide 72

Slide 72 text

Be safe

Slide 73

Slide 73 text

Be safe

Slide 74

Slide 74 text

Update refs with lock
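The locking idea can be sketched as: create "<ref>.lock" exclusively, write the new value there, then rename it over the ref. This is an illustration of the pattern only, not Git's code; update_ref and its arguments are hypothetical, and refs are treated as plain files:

```python
import os


def update_ref(ref_path: str, old_value, new_value: str) -> None:
    """Replace a ref file atomically, guarded by an exclusive lock file."""
    lock_path = ref_path + ".lock"
    # O_EXCL makes creation fail if another writer already holds the lock.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    try:
        with os.fdopen(fd, "w") as lock_file:
            current = None
            if os.path.exists(ref_path):
                with open(ref_path) as ref_file:
                    current = ref_file.read().strip()
            if current != old_value:
                raise ValueError("ref changed since it was advertised")
            lock_file.write(new_value + "\n")
        # The rename publishes the new value and releases the lock in one step.
        os.replace(lock_path, ref_path)
    except BaseException:
        if os.path.exists(lock_path):
            os.unlink(lock_path)
        raise
```

A second writer that arrives while the lock file exists gets a FileExistsError instead of silently overwriting the ref.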

Slide 75

Slide 75 text

Thank you https://github.com/pmq20/