
Internals of Git Internals

Minqi Pan
January 07, 2016

Transcript

  1. Internals of Git Internals Minqi Pan

  2. I’m Minqi Pan github.com/pmq20 twitter @psvr

  3. Synopsis • Internals of downloading code • Internals of uploading code
  4. Question: Which commands to use?

  5. 140 commands: add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree
  6. “Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it.” – Chapter 10. Git Internals
  7. a content-addressable filesystem
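To make "content-addressable" concrete, here is a minimal sketch of how an object name follows from the content alone. It assumes the usual loose-object convention of hashing a `<type> <size>` header, a NUL byte, and then the raw content with SHA-1; the function name `git_object_name` is made up for this illustration.

```python
import hashlib


def git_object_name(obj_type: str, content: bytes) -> str:
    """Return the 40-hex-digit name for an object of the given type."""
    # The header is "<type> <size in bytes>" followed by a single NUL byte.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# The same bytes always map to the same name, wherever they are stored.
print(git_object_name("blob", b""))
print(git_object_name("blob", b"test content\n"))
```

Because the name is derived from the bytes, two identical files share one object, and any change to the content yields a different name.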

  8. simple • straightforward • a whole-object store

  9. “Let there be packs!”

  10. Pack-files • USAGE I: streaming • USAGE II: on-disk storage • with a nice balance of density vs. ease of use • “Git really doesn't follow files”
  11. Git Object Types

  12. Random Access • each object is compressed on its own, one at a time • pessimization due to the double usage: a lower compression factor for the pack as a whole • an object name can be translated to a location in the pack
  13. Spotlights of today: add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree
  14. Remote Procedure Calls • git-fetch-pack --(ssh, http)--> git-upload-pack • git-send-pack --(ssh, http)--> git-receive-pack
  15. Local UNIX-y Calls • git-upload-pack calls git-pack-objects • git-receive-pack calls git-unpack-objects • git-receive-pack also calls git-index-pack
  16. Advantage of being UNIX-y • global variables can be used anywhere • no need to pass state around
  17. git-upload-pack • first called with --stateless-rpc --advertise-refs • then called with --stateless-rpc • internally calls git-pack-objects
  18. git-upload-pack --advertise-refs
    001e# service=git-upload-pack
    000000d10ffaa5a6fdf84714deddaecc62edd5540e0f5877 HEAD multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag multi_ack_detailed no-done symref=HEAD:refs/heads/master agent=git/2.2.2
    0057a576d8e0d128b503bb3d9980387fea0976ab6921 refs/heads/bug_fix/a
    0053affee7f8adf11c7d895189ec5f0c3aeb95dba01a refs/heads/bug_fix/b
    0053cbe2e212b415f587f086af6a74670919ccb7b089 refs/heads/bug_fix/c
    004568df60003126a1f6478eead611af1c9000dfe44e refs/heads/feature/ci
    0052c063094d4f94e182a23aa9a479bf67bf6e2de6b4 refs/heads/feature/d
    004cfd02d43bc2ca02d381275453383e2c2629e33da5 refs/heads/feature/e
    0049fdb1e470fd4d9be344dc77645bf777eec07d7940 refs/heads/feature/f
    …
    0000
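The advertisement above is framed as pkt-lines: each packet starts with four hex digits giving its total length (the four digits included), and `0000` is a flush packet. The sketch below splits such a stream into payloads. The function names are made up for illustration, and it only handles the plain length-prefix framing seen on the slide.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame one payload: 4 hex digits of total length, then the payload."""
    return f"{len(payload) + 4:04x}".encode("ascii") + payload


def parse_pkt_lines(data: bytes):
    """Split a pkt-line stream into payloads; None stands for a flush (0000)."""
    packets = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:                 # flush-pkt carries no payload
            packets.append(None)
            pos += 4
            continue
        if length < 4:
            raise ValueError(f"invalid pkt-line length {length} at byte {pos}")
        packets.append(data[pos + 4:pos + length])
        pos += length
    return packets


# The first packet of the slide: 0x001e == 30 == 4 + len("# service=...\n").
stream = b"001e# service=git-upload-pack\n0000"
for packet in parse_pkt_lines(stream):
    print(packet)
```

This also explains the run `000000d1` on the slide: it is a flush packet `0000` immediately followed by a packet of length `00d1`.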
  19. git-pack-objects • /usr/bin/git • pack-objects • --revs • --stdout • --progress • --delta-base-offset
  20. pack-*.idx: a simple index to do “object name -> location in pack-files” translations
  21. pack-*.idx Data Structure

  22. pack-*.idx Data Structure

  23. pack-*.idx Data Structure: a 256-entry fan-out table, i.e. 256 * 4 bytes
  24. pack-*.idx Data Structure • 0x0309 == number of objects whose first byte of object name is less than or equal to 1 • 0x060D == number of objects whose first byte of object name is less than or equal to 2 • …
  25. pack-*.idx Data Structure Question: Why 256 entries in total?

  26. pack-*.idx Data Structure Answer: the table is indexed by the first of the 20 bytes of the SHA-1 name, and a byte has 256 possible values
  27. pack-*.idx Data Structure Question: Which bytes hold the total number of objects?
  28. pack-*.idx Data Structure Answer: the last fan-out entry, bytes 8 + 255 * 4 ~ 8 + 256 * 4, i.e. 1028 ~ 1032
  29. pack-*.idx Data Structure Read it: 0x02f1f2 == 193010

  30. pack-*.idx Data Structure sorted 20-byte SHA-1 object names

  31. pack-*.idx Data Structure Question: Why Sorted?

  32. pack-*.idx Data Structure Answer: Binary search after fan-out
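A minimal sketch of "binary search after fan-out", assuming the layout from the slides: an 8-byte header, 256 big-endian 4-byte fan-out counters, then the sorted 20-byte names. `build_demo_idx` is a hypothetical helper that fabricates just those three parts so the example runs without a real index file.

```python
import struct

HEADER_SIZE = 8            # 4-byte magic + 4-byte version
FANOUT_SIZE = 256 * 4      # 256 big-endian 4-byte counters
NAME_SIZE = 20             # one SHA-1 object name


def fanout(idx: bytes, first_byte: int) -> int:
    """Number of objects whose first name byte is <= first_byte."""
    return struct.unpack_from(">I", idx, HEADER_SIZE + 4 * first_byte)[0]


def find_position(idx: bytes, name: bytes):
    """Return the object's position in the sorted name table, or None."""
    first = name[0]
    lo = fanout(idx, first - 1) if first > 0 else 0
    hi = fanout(idx, first)
    names_start = HEADER_SIZE + FANOUT_SIZE
    while lo < hi:                       # plain binary search in [lo, hi)
        mid = (lo + hi) // 2
        start = names_start + NAME_SIZE * mid
        candidate = idx[start:start + NAME_SIZE]
        if candidate == name:
            return mid
        if candidate < name:
            lo = mid + 1
        else:
            hi = mid
    return None


def build_demo_idx(names):
    """Hypothetical helper: header, fan-out and name table only."""
    names = sorted(names)
    table, running = [], 0
    for value in range(256):
        running += sum(1 for n in names if n[0] == value)
        table.append(running)
    header = b"\xfftOc" + struct.pack(">I", 2)
    return header + struct.pack(">256I", *table) + b"".join(names)


demo_names = [bytes([0x9F]) + bytes(19), bytes([0x9F]) + b"\x01" * 19, bytes([0x10]) * 20]
demo_idx = build_demo_idx(demo_names)
print(find_position(demo_idx, bytes([0x9F]) + b"\x01" * 19))
```

The fan-out lookup narrows the search to the names sharing a first byte, so the binary search only runs over that slice instead of the whole table.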

  33. pack-*.idx Data Structure Question: Ends where?

  34. pack-*.idx Data Structure Answer: 1032 + 20 * 0x02f1f2 == 3861232
  35. pack-*.idx Data Structure: a table of 4-byte CRC32 values of the packed object data.
  36. pack-*.idx Data Structure Question: Ends where?

  37. pack-*.idx Data Structure Answer: 3861232 + 4 * 0x02f1f2 == 4633272
  38. pack-*.idx Data Structure: 4-byte offset values, each an MSB plus a 31-bit value • a set Most Significant Bit marks the use of the 8-byte table
  39. pack-*.idx Data Structure Question: Any limitations?

  40. pack-*.idx Data Structure Answer: with 31-bit offsets alone, pack files can be at most 2147483648 bytes (2 GiB) in size
  41. pack-*.idx Data Structure Question: Ends where?

  42. pack-*.idx Data Structure Answer: 4633272 + 4 * 0x02f1f2 == 5405312
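The three "Ends where?" answers follow one pattern, so they can be checked in a few lines. This sketch assumes only what the slides state: an 8-byte header, a 256 * 4-byte fan-out table, then 20 bytes per name, 4 bytes per CRC32 and 4 bytes per offset. The function and key names are invented for the example.

```python
def idx_layout(num_objects: int) -> dict:
    """Byte offsets at which each table of the index starts."""
    names_start = 8 + 256 * 4                       # header + fan-out table
    crc_start = names_start + 20 * num_objects      # after the sorted names
    offsets_start = crc_start + 4 * num_objects     # after the CRC32 table
    large_offsets_start = offsets_start + 4 * num_objects
    return {
        "names": names_start,
        "crc32": crc_start,
        "offsets": offsets_start,
        "large_offsets": large_offsets_start,
    }


# 0x02f1f2 == 193010 objects, the count read from the last fan-out entry.
for table, start in idx_layout(0x02F1F2).items():
    print(f"{table:>14}: {start}")
```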
  43. pack-*.idx Data Structure: a table of 8-byte offset entries (empty for pack files less than 2 GiB). Question: Any limitations?
  44. pack-*.idx Data Structure: a table of 8-byte offset entries (empty for pack files less than 2 GiB). Answer: No. What file on earth weighs 16800000 TB? Packs larger than 2 GiB are well supported.
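A sketch of how the two offset tables combine, under the rule stated on the slides: a 4-byte entry with its most significant bit clear is the pack offset itself, while a set bit means the remaining 31 bits select an entry in the 8-byte table. The function takes the two tables as separate byte strings, which is a simplification made for this example.

```python
import struct


def resolve_offset(offsets32: bytes, offsets64: bytes, position: int) -> int:
    """Pack offset of the object at `position` in the sorted name table."""
    (value,) = struct.unpack_from(">I", offsets32, 4 * position)
    if value & 0x80000000:
        # MSB set: the low 31 bits are an index into the 8-byte table.
        large_index = value & 0x7FFFFFFF
        (value,) = struct.unpack_from(">Q", offsets64, 8 * large_index)
    return value


# Entry 0 fits in 31 bits; entry 1 points at slot 0 of the 8-byte table.
small_table = struct.pack(">II", 280621, 0x80000000)
large_table = struct.pack(">Q", 2**31 + 5)
print(resolve_offset(small_table, large_table, 0))
print(resolve_offset(small_table, large_table, 1))
```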
  45. pack-*.idx Data Structure (trailer) • a copy of the 20-byte SHA-1 checksum at the end of the corresponding packfile • a 20-byte SHA-1 checksum of all of the above
  46. pack-*.pack Data Structure header

  47. pack-*.pack Data Structure

  48. pack-*.pack Data Structure

  49. pack-*.pack Data Structure

  50. pack-*.pack Data Structure n * object entries

  51. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • First byte of the name is 0x9f • IDX[8 + (0x9f - 1) * 4] == 0x0403 == 1027 • IDX[8 + 0x9f * 4] == 0x0405 == 1029 • Object No. 1027 ~ 1029
  52. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • Binary search 1027 ~ 1029 • Found at 8 + 4 * 256 + 1027 * 20 == 21572 • To reach the offset table, skip the name and CRC32 tables: total_num * (20 + 4) == 1628 * 24
  53. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • IDX[8 + 4 * 256 + 1628 * 24 + 4 * 1027] == 0x0004482D • PACK[0x0004482D] == PACK[280621]
  54. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • E3 == 11100011 • 1_______ => MSB 1, continue • _110____ => type == 6 == OFS_DELTA • ____0011 => length == 3 • 3-bit type, (n-1)*7+4-bit length
  55. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] • 01 == 00000001 • 0_______ => MSB 0, break • _0000001 => length += (1 << 4) • final length == 19
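The two bytes `E3 01` walked through above can be decoded mechanically. This is a sketch of the "3-bit type, (n-1)*7+4-bit length" scheme as the slides describe it: the first byte carries a continuation bit, three type bits and the low four size bits, and each following byte adds seven more size bits. The function name is made up for this illustration.

```python
def decode_entry_header(buf: bytes, pos: int = 0):
    """Return (type, size, position of the first byte after the header)."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07       # _TTT____
    size = byte & 0x0F                  # ____SSSS, the low 4 bits of the size
    shift = 4
    while byte & 0x80:                  # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift  # 7 more bits, least significant first
        shift += 7
    return obj_type, size, pos


OFS_DELTA = 6
obj_type, size, next_pos = decode_entry_header(bytes([0xE3, 0x01]))
print(obj_type == OFS_DELTA, size, next_pos)
```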
  56. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] • AA == 10101010 • 1_______ => MSB 1, continue • _0101010 => base offset == 42
  57. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] • 44 == 01000100 • 0_______ => MSB 0, break • _1000100 => offset == ((42+1)<<7)+68 == 5572
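The base-offset bytes `AA 44` use a slightly different variable-length encoding from the size field: the most significant group comes first, and 1 is added before each shift. A minimal sketch of that decoding, reproducing the arithmetic on the slides (the function name is an assumption):

```python
def decode_base_distance(buf: bytes, pos: int = 0):
    """Return (distance back to the base object, position after the field)."""
    byte = buf[pos]
    pos += 1
    value = byte & 0x7F
    while byte & 0x80:                  # MSB set: another byte follows
        byte = buf[pos]
        pos += 1
        value = ((value + 1) << 7) | (byte & 0x7F)
    return value, pos


distance, _ = decode_base_distance(bytes([0xAA, 0x44]))
print(distance)                 # ((42 + 1) << 7) + 68
print(0x0004482D - distance)    # absolute pack offset of the base object
```

The distance is subtracted from the delta's own offset, which is why the next step of the example looks at `0x0004482D - 5572`.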
  58. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • offset == 5572 • push 0x0004482D onto the stack • deal with (0x0004482D - 5572) • push (0x0004482D - 5572) onto the stack • … • root base
  59. Example
    SHA1 type size size-pack offset-pack depth base
    9fcf811e00fa469688943a9152c16d4ee90fb9a9 blob 19 32 280621 4 6110c89446f2281e5db9b798a0fa020fad6e63e1
    6110c89446f2281e5db9b798a0fa020fad6e63e1 blob 52 45 275049 3 3bbeff3fc22b75c1a26f4ab9b64449b33002aea5
    3bbeff3fc22b75c1a26f4ab9b64449b33002aea5 blob 2935 1263 273786 2 a399208309046656ecc01f7653c5d5b8905fc16e
    a399208309046656ecc01f7653c5d5b8905fc16e blob 4686 1540 272246 1 e4e56117de8b3bd0bd899701da4712caee27c7d6
    e4e56117de8b3bd0bd899701da4712caee27c7d6 blob 12635 3279 115703 0 -
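The stack-based walk to the root base can be sketched with the pack offsets from the table above. The dictionary stands in for "read the entry at this offset and decode its base offset"; that substitution, and the names used, are assumptions made to keep the example self-contained.

```python
# pack offset -> pack offset of its delta base (None for a non-delta object),
# taken from the offset-pack and base columns of the table.
BASE_OF = {
    280621: 275049,
    275049: 273786,
    273786: 272246,
    272246: 115703,
    115703: None,
}


def delta_chain(offset, base_of):
    """Push offsets onto a stack until a non-delta (root) object is reached."""
    stack = []
    while offset is not None:
        stack.append(offset)
        offset = base_of[offset]
    return stack


chain = delta_chain(280621, BASE_OF)
print(chain)
print("depth:", len(chain) - 1)

# Reconstruction pops the stack: start from the root base and apply each
# delta in turn until the requested object is rebuilt.
for offset in reversed(chain):
    print("process entry at", offset)
```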
  60. “I played around with different delta algorithms, and with making the delta window bigger, but having too big of a sliding window makes it very expensive to generate the pack: you need to compare every object with a _ton_ of other objects.” – Linus Torvalds
  61. “ANY order will give you a working pack, ... [but it is] the thing that gives packs good locality. It keeps the objects close to the head (whether they are old or new, but they are _reachable_ from the head) at the head of the pack. So packs actually have absolutely _wonderful_ IO patterns.” – Linus Torvalds
  62. Packing Heuristics • First sort by delta order • Then sort by recency order
  63. Delta Order Heuristics • first sort by type. Objects of different types never delta against each other. • we do not delta different object types.
  64. Delta Order Heuristics • then sort by filename/dirname. • we prefer to delta the objects with the same full path, but allow files with the same name from different directories.
  65. Delta Order Heuristics • then, if we are doing a "thin" pack, the objects we are _not_ going to pack but we know about are sorted earlier than other objects. • we always prefer to delta against objects we are not going to send, if there are some. • for "thin" packs only; used when the other side is known to have such objects.
  66. Delta Order Heuristics • and finally sort by size, larger to smaller. • we prefer to delta against larger objects, so that we have lots of removals. • large->small matters because of compression behaviour.
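The four delta-order rules above can be read as one composite sort key. The sketch below is a simplification for illustration only: it sorts on the literal basename and directory, whereas the real implementation may use something cheaper such as a name hash, and the `Candidate` class and its field names are invented here.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Candidate:
    obj_type: str
    path: str
    size: int
    preferred_base: bool = False    # thin pack: known to the receiver, not sent


def delta_order_key(c: Candidate):
    return (
        c.obj_type,                     # 1. never mix object types
        posixpath.basename(c.path),     # 2. same file name first ...
        posixpath.dirname(c.path),      #    ... then same directory
        not c.preferred_base,           # 3. thin pack: unsent bases sort earlier
        -c.size,                        # 4. larger to smaller
    )


candidates = [
    Candidate("blob", "src/main.c", 100),
    Candidate("blob", "src/main.c", 300),
    Candidate("blob", "lib/main.c", 200),
    Candidate("blob", "src/main.c", 50, preferred_base=True),
    Candidate("commit", "", 400),
    Candidate("blob", "README", 10),
]
ordered = sorted(candidates, key=delta_order_key)
for c in ordered:
    print(c.obj_type, c.path, c.size, c.preferred_base)
```

With this ordering, neighbours in the list are the most promising delta partners, which is what a sliding delta window relies on.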
  67. Sort 1: (type, basename, size) • Sort 2: recency

  68. Reconciling Two Sorts • Linus' law: files grow, larger objects tend to be "more recent" • we only write out the base object first if the delta against it was more recent • Thus the front of the pack always contains data that is relevant to a "recent" object
  69. As a Result • delta order and recency order match each other quite well • with xdelta, removing data is cheaper (in size) than adding data • with xdelta, larger->smaller is actually a big space saver too
  70. git-pack-objects w/ threads • --threads=<n>: Specifies the number of threads to spawn when searching for best delta matches. This requires that pack-objects be compiled with pthreads otherwise this option is ignored with a warning. This is meant to reduce packing time on multiprocessor machines. The required amount of memory for the delta search window is however multiplied by the number of threads. Specifying 0 will cause Git to auto-detect the number of CPU's and set the number of threads accordingly.
  71. git-receive-pack • First called with --advertise-refs • (ntohl(hdr.hdr_entries) < unpack_limit) => git-unpack-objects • (ntohl(hdr.hdr_entries) >= unpack_limit) => git-index-pack • Finally update refs with locks
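The `ntohl(hdr.hdr_entries)` test can be mimicked in a few lines. This sketch assumes the usual 12-byte pack header (a `PACK` signature, a version and the entry count, all big-endian) and uses 100 as a placeholder limit; the real limit comes from configuration, and the function names are invented for the example.

```python
import struct


def pack_entry_count(pack: bytes) -> int:
    """Read the number of object entries from a pack stream's header."""
    signature, _version, entries = struct.unpack_from(">4sII", pack, 0)
    if signature != b"PACK":
        raise ValueError("not a pack stream")
    return entries          # ">I" already converts from network byte order


def choose_unpacker(pack: bytes, unpack_limit: int = 100) -> str:
    """Small pushes are exploded into loose objects; big ones stay packed."""
    if pack_entry_count(pack) < unpack_limit:
        return "unpack-objects"
    return "index-pack"


small_push = b"PACK" + struct.pack(">II", 2, 99)
large_push = b"PACK" + struct.pack(">II", 2, 100)
print(choose_unpacker(small_push))
print(choose_unpacker(large_push))
```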
  72. Be safe

  73. Be safe

  74. Update refs with lock

  75. Thank you https://github.com/pmq20/