Internals of Git Internals

Minqi Pan
January 07, 2016

Transcript

  1. 140 commands
    add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree
  2. “Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it.”
    – Chapter 10. Git Internals
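
To make "content-addressable" concrete, here is a minimal Python sketch of how a blob's object name can be derived from its content. It assumes the SHA-1 object format (a `blob <size>\0` header followed by the raw bytes); the function name `git_blob_id` and the in-memory `store` dictionary are hypothetical and only for illustration.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object name for a blob holding `data`."""
    # Git hashes "<type> <size>\0" followed by the content itself.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# A toy content-addressable store: the key is derived from the value.
store = {}


def put(data: bytes) -> str:
    key = git_blob_id(data)
    store[key] = data
    return key


if __name__ == "__main__":
    print(put(b"test content\n"))
```

Storing the same bytes twice yields the same key, which is why identical file contents occupy a single object no matter how many paths refer to them.
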
  3. Pack-files
    • USAGE I: streaming
    • USAGE II: on-disk storage
    • w/ nice balance of density vs ease of use
    • “Git really doesn't follow files”
  4. Random Access
    • compressed just one object at a time
    • pessimizations due to double usage: low compression factor as a whole
    • able to translate object name to location in pack
  5. Spotlights of today
    add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree
  6. Local UNIX-y Calls
    • git-upload-pack calls git-pack-objects
    • git-receive-pack calls git-unpack-objects
    • git-receive-pack also calls git-index-pack
  7. git-upload-pack
    • first called with --stateless-rpc --advertise-refs
    • then called with --stateless-rpc
    • internally calls git-pack-objects
  8. git-upload-pack --advertise-refs
    001e# service=git-upload-pack
    0000
    00d10ffaa5a6fdf84714deddaecc62edd5540e0f5877 HEAD multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag multi_ack_detailed no-done symref=HEAD:refs/heads/master agent=git/2.2.2
    0057a576d8e0d128b503bb3d9980387fea0976ab6921 refs/heads/bug_fix/a
    0053affee7f8adf11c7d895189ec5f0c3aeb95dba01a refs/heads/bug_fix/b
    0053cbe2e212b415f587f086af6a74670919ccb7b089 refs/heads/bug_fix/c
    004568df60003126a1f6478eead611af1c9000dfe44e refs/heads/feature/ci
    0052c063094d4f94e182a23aa9a479bf67bf6e2de6b4 refs/heads/feature/d
    004cfd02d43bc2ca02d381275453383e2c2629e33da5 refs/heads/feature/e
    0049fdb1e470fd4d9be344dc77645bf777eec07d7940 refs/heads/feature/f
    …
    0000
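
The advertisement above is framed as pkt-lines: each line starts with four hex digits giving the total length (including those four digits), and `0000` is a flush packet. The following Python sketch shows one way to read and write that framing. The helper names `pkt_line` and `read_pkt_lines` are hypothetical, and the `hello` payload is made up for the demonstration.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload: 4 hex digits of total length, then the payload."""
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_lines(stream: bytes):
    """Yield each payload in order; yield None for a flush packet (0000)."""
    pos = 0
    while pos < len(stream):
        length = int(stream[pos:pos + 4].decode("ascii"), 16)
        if length == 0:
            yield None          # flush packet carries no payload
            pos += 4
            continue
        if length < 4:
            raise ValueError("invalid pkt-line length %d" % length)
        yield stream[pos + 4:pos + length]
        pos += length


if __name__ == "__main__":
    demo = pkt_line(b"# service=git-upload-pack\n") + b"0000"
    print(demo)                       # starts with b'001e'
    print(list(read_pkt_lines(demo)))
```

The first line of the advertisement checks out: the payload `# service=git-upload-pack\n` is 26 bytes, and 26 + 4 = 30 = 0x1e.
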
  9. pack-*.idx Data Structure
    0x0309 == number of objects whose first byte of object name is less than or equal to 1
    0x060D == number of objects whose first byte of object name is less than or equal to 2
    …
  10. pack-*.idx Data Structure
    4-byte offset values: MSB + 31-bit
    Most Significant Bit marks the usage of 8-byte table
  11. pack-*.idx Data Structure
    A table of 8-byte offset entries. (empty for pack files less than 2 GiB)
    Question: Any limitations?
  12. pack-*.idx Data Structure
    A table of 8-byte offset entries. (empty for pack files less than 2 GiB)
    Answer: No. What file on earth weighs 16,800,000 TB? Packs larger than 2 GiB are well supported.
  13. pack-*.idx Data Structure (trailer)
    A copy of the 20-byte SHA-1 checksum at the end of the corresponding packfile.
    20-byte SHA-1 checksum of all of the above.
  14. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    • First byte of the name is 0x9f
    • IDX[8 + (0x9f - 1) * 4] == 0x0403 == 1027
    • IDX[8 + 0x9f * 4] == 0x0405 == 1029
    • Object No. 1027 ~ 1029
  15. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    • Binary search 1027 ~ 1029
    • Found at 8 + 4 * 256 + 1027 * 20 == 21572
    • Skip the rest: total_num*(20+4) == 1628*24
  16. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    • IDX[8 + 4 * 256 + 1628*24 + 4 * 1027]
    • PACK[0x0004482D] == PACK[280621]
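
The lookup walked through in the last three slides can be sketched in Python as follows. This is a sketch under assumptions, not Git's own code: it assumes a version 2 index (8-byte header starting with `\377tOc`, a 256-entry fan-out table, then the sorted 20-byte names, the 4-byte CRC table, the 4-byte offset table and the optional 8-byte offset table), which matches the `8 + 4 * 256` arithmetic on the slides. The function name `find_offset` is hypothetical.

```python
import struct
from typing import Optional

IDX_MAGIC = b"\xfftOc"
NAMES_AT = 8 + 4 * 256          # header + fan-out table


def find_offset(idx: bytes, name: bytes) -> Optional[int]:
    """Return the pack offset of the 20-byte object `name`, or None."""
    version = struct.unpack(">I", idx[4:8])[0]
    if idx[:4] != IDX_MAGIC or version != 2:
        raise ValueError("only version 2 index files are handled here")

    fanout = struct.unpack(">256I", idx[8:NAMES_AT])
    total = fanout[255]                      # number of objects in the pack
    first = name[0]
    lo = fanout[first - 1] if first else 0   # objects with a smaller first byte
    hi = fanout[first]                       # ... smaller or equal first byte

    # Binary search inside the slice of names sharing the first byte.
    while lo < hi:
        mid = (lo + hi) // 2
        at = NAMES_AT + mid * 20
        candidate = idx[at:at + 20]
        if candidate < name:
            lo = mid + 1
        elif candidate > name:
            hi = mid
        else:
            # Skip all names (20 bytes each) and CRCs (4 bytes each).
            off_at = NAMES_AT + total * 24 + 4 * mid
            offset = struct.unpack(">I", idx[off_at:off_at + 4])[0]
            if offset & 0x80000000:          # MSB set: index into 8-byte table
                big_at = NAMES_AT + total * 28 + 8 * (offset & 0x7FFFFFFF)
                offset = struct.unpack(">Q", idx[big_at:big_at + 8])[0]
            return offset
    return None
```

With the numbers from the slides, `lo` and `hi` would start at 1027 and 1029, and the 4-byte offset would be read at `8 + 4 * 256 + 1628 * 24 + 4 * 1027`.
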
  17. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    E3 11100011
    1_______ => MSB 1, continue
    _110____ => type == 6 == OFS_DELTA
    ____0011 => length (low 4 bits) == 3
    3-bit type, (n-1)*7+4-bit length
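
A small Python sketch of decoding that entry header: the first byte carries a continuation bit, a 3-bit type and the low 4 bits of the length, and every following byte adds 7 more length bits. The function name `parse_object_header` is hypothetical. The deck shows only the first byte (0xE3); the second byte 0x01 used below is an assumption, chosen because it yields a length of 19, the size listed for this object in the deck's later table.

```python
OFS_DELTA = 6


def parse_object_header(buf: bytes, pos: int = 0):
    """Return (type, length, position after the header)."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07     # bits 6..4
    length = byte & 0x0F              # bits 3..0 are the lowest length bits
    shift = 4
    while byte & 0x80:                # MSB set: another length byte follows
        byte = buf[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, length, pos


if __name__ == "__main__":
    # 0xE3 is from the slide; 0x01 is an assumed continuation byte.
    print(parse_object_header(bytes([0xE3, 0x01])))
```
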
  18. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    offset == 5572
    push 0x0004482D into stack
    deal with (0x0004482D - 5572)
    push (0x0004482D - 5572) into stack
    …
    root base
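
For an OFS_DELTA entry, the header is followed by a variable-length negative offset that points back to the base object. The sketch below decodes it in Python, using the encoding documented for the pack format (7 bits per byte, most significant group first, adding 1 before each shift). The function name is hypothetical, and the bytes 0xAA 0x44 are constructed here to encode 5572; they are not taken from the deck.

```python
def parse_ofs_delta_offset(buf: bytes, pos: int = 0):
    """Return (distance back to the base object, position after it)."""
    byte = buf[pos]
    pos += 1
    offset = byte & 0x7F
    while byte & 0x80:                      # MSB set: more bytes follow
        byte = buf[pos]
        pos += 1
        offset = ((offset + 1) << 7) | (byte & 0x7F)
    return offset, pos


if __name__ == "__main__":
    entry_at = 0x0004482D                   # 280621, from the slides
    distance, _ = parse_ofs_delta_offset(bytes([0xAA, 0x44]))
    print(distance, entry_at - distance)    # base object position in the pack
```

Repeating this step at each base until a non-delta entry is reached gives the stack of positions that the slide describes; the deltas are then applied from the root base upward.
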
  19. Example
    SHA1 type size size-pack offset-pack depth base
    9fcf811e00fa469688943a9152c16d4ee90fb9a9 blob 19 32 280621 4 6110c89446f2281e5db9b798a0fa020fad6e63e1
    6110c89446f2281e5db9b798a0fa020fad6e63e1 blob 52 45 275049 3 3bbeff3fc22b75c1a26f4ab9b64449b33002aea5
    3bbeff3fc22b75c1a26f4ab9b64449b33002aea5 blob 2935 1263 273786 2 a399208309046656ecc01f7653c5d5b8905fc16e
    a399208309046656ecc01f7653c5d5b8905fc16e blob 4686 1540 272246 1 e4e56117de8b3bd0bd899701da4712caee27c7d6
    e4e56117de8b3bd0bd899701da4712caee27c7d6 blob 12635 3279 115703 0 -
  20. “I played around with different delta algorithms, and with making the delta window bigger, but having too big of a sliding window makes it very expensive to generate the pack: you need to compare every object with a _ton_ of other objects.”
    – Linus Torvalds
  21. “ANY order will give you a working pack, ... [but it is] the thing that gives packs good locality. It keeps the objects close to the head (whether they are old or new, but they are _reachable_ from the head) at the head of the pack. So packs actually have absolutely _wonderful_ IO patterns.”
    – Linus Torvalds
  22. Delta Order Heuristics
    • first sort by type. Objects of different types never delta with each other.
    • we do not delta different object types.
  23. Delta Order Heuristics
    • then sort by filename/dirname.
    • we prefer to delta the objects with the same full path, but allow files with the same name from different directories.
  24. Delta Order Heuristics
    • then if we are doing "thin" pack, the objects we are _not_ going to pack but we know about are sorted earlier than other objects.
    • we always prefer to delta against objects we are not going to send, if there are some.
    • for "thin" packs only. used when the other side is known to have such objects.
  25. Delta Order Heuristics
    • and finally sort by size, larger to smaller.
    • we prefer to delta against larger objects, so that we have lots of removals.
    • large->small matters because of compression behaviour.
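
The four sorting rules from these slides can be expressed as one composite sort key. The Python sketch below is a simplification of the slides, not Git's actual comparator: the `Obj` record, its field names and the sample paths are hypothetical, and the path comparison uses plain basename and dirname strings.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Obj:
    type: str                     # "blob", "tree", "commit", "tag"
    path: str                     # path the object was found at
    size: int
    preferred_base: bool = False  # known to the receiver, not sent ("thin" pack)


def delta_sort_key(o: Obj):
    return (
        o.type,                        # 1. group by type
        posixpath.basename(o.path),    # 2. same file name together ...
        posixpath.dirname(o.path),     #    ... and same directory within that
        not o.preferred_base,          # 3. objects we will not send come first
        -o.size,                       # 4. larger before smaller
    )


if __name__ == "__main__":
    objs = [
        Obj("blob", "src/a.c", 10),
        Obj("blob", "lib/a.c", 50),
        Obj("tree", "src", 5),
        Obj("blob", "src/a.c", 30),
        Obj("blob", "src/a.c", 20, preferred_base=True),
        Obj("blob", "src/b.c", 99),
    ]
    for o in sorted(objs, key=delta_sort_key):
        print(o)
```

After sorting, each object's delta candidates are its neighbours in the list, which is why the sliding window mentioned in the Torvalds quote can stay small.
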
  26. Reconciling Two Sorts
    • Linus' law: files grow, larger objects tend to be "more recent"
    • we only write out the base object first if the delta against it was more recent
    • Thus the front of the pack always contains data that is relevant to a "recent" object
  27. As a Result
    • delta order and recency order match each other quite well
    • xdelta, removing data is cheaper (in size) than adding data
    • xdelta, larger->small is actually a big space saver too
  28. git-pack-objects w/ threads
    --threads=<n>
    Specifies the number of threads to spawn when searching for best delta matches. This requires that pack-objects be compiled with pthreads otherwise this option is ignored with a warning. This is meant to reduce packing time on multiprocessor machines. The required amount of memory for the delta search window is however multiplied by the number of threads. Specifying 0 will cause Git to auto-detect the number of CPU's and set the number of threads accordingly.
  29. git-receive-pack
    • First called with --advertise-refs
    • (ntohl(hdr.hdr_entries) < unpack_limit) => git-unpack-objects
    • (ntohl(hdr.hdr_entries) >= unpack_limit) => git-index-pack
    • Finally update refs with locks
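
The decision on this slide depends on the object count in the 12-byte pack header: a `PACK` signature, a version, and the number of entries, all in network byte order. A Python sketch of that check follows. The function name `choose_unpacker` is hypothetical, and the default limit of 100 is an assumption based on the documented default of the `receive.unpackLimit` setting.

```python
import struct


def choose_unpacker(pack_header: bytes, unpack_limit: int = 100) -> str:
    """Pick the helper that would handle an incoming pack."""
    # ">4sII" reads big-endian fields, the equivalent of ntohl() in C.
    signature, version, entries = struct.unpack(">4sII", pack_header[:12])
    if signature != b"PACK":
        raise ValueError("not a pack stream")
    if entries < unpack_limit:
        return "git-unpack-objects"   # few objects: explode into loose objects
    return "git-index-pack"           # many objects: keep the pack, build an .idx


if __name__ == "__main__":
    small = struct.pack(">4sII", b"PACK", 2, 7)
    large = struct.pack(">4sII", b"PACK", 2, 1628)
    print(choose_unpacker(small), choose_unpacker(large))
```
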