
Internals of Git Internals

Minqi Pan
January 07, 2016

Transcript

  1. Internals of Git Internals Minqi Pan

  2. I’m Minqi Pan github.com/pmq20 twitter @psvr

  3. Synopsis • Internals of downloading code • Internals of uploading code
  4. Question: Which commands to use?

  5. 140 commands: add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree
  6. “Git is fundamentally a content-addressable filesystem with a VCS user interface written on top of it.” – Chapter 10. Git Internals
  7. a content-addressable filesystem

  8. simple • straightforward • a whole-object store
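
A minimal sketch of what "content-addressable" means for a blob, assuming the standard loose-object convention in which the object name is the SHA-1 of a short header followed by the content. The function name `git_blob_id` is made up for this illustration.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the object name Git would give to a blob holding `data`."""
    # The hashed bytes are "blob <size>\0" followed by the raw content.
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # The same content always maps to the same name, wherever it is stored.
    print(git_blob_id(b""))
    print(git_blob_id(b"test content\n"))
```

The name depends only on the type, the size and the bytes, which is why the store can be as simple as "one whole object per name".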

  9. “Let there be packs!”

  10. Pack-files • USAGE I: streaming • USAGE II: on-disk storage • w/ nice balance of density vs ease of use • “Git really doesn't follow files”
  11. Git Object Types

  12. Random Access • each object is compressed on its own, one at a time • pessimization due to the double usage: low compression factor for the pack as a whole • able to translate an object name to its location in the pack
  13. Spotlights of today: add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config, count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push, quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace, request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell, shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag, whatchanged, worktree, write-tree
  14. Remote Procedure Calls • git-fetch-pack --ssh,http--> git-upload-pack • git-send-pack --ssh,http--> git-receive-pack
  15. Local UNIX-y Calls • git-upload-pack calls git-pack-objects • git-receive-pack calls git-unpack-objects • git-receive-pack also calls git-index-pack
  16. Advantage of being UNIX-y • Global variables can be used anywhere • No need to pass state around
  17. git-upload-pack • first called with --stateless-rpc --advertise-refs • then called with --stateless-rpc • internally calls git-pack-objects
  18. git-upload-pack --advertise-refs
    001e# service=git-upload-pack
    0000
    00d10ffaa5a6fdf84714deddaecc62edd5540e0f5877 HEAD multi_ack thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag multi_ack_detailed no-done symref=HEAD:refs/heads/master agent=git/2.2.2
    0057a576d8e0d128b503bb3d9980387fea0976ab6921 refs/heads/bug_fix/a
    0053affee7f8adf11c7d895189ec5f0c3aeb95dba01a refs/heads/bug_fix/b
    0053cbe2e212b415f587f086af6a74670919ccb7b089 refs/heads/bug_fix/c
    004568df60003126a1f6478eead611af1c9000dfe44e refs/heads/feature/ci
    0052c063094d4f94e182a23aa9a479bf67bf6e2de6b4 refs/heads/feature/d
    004cfd02d43bc2ca02d381275453383e2c2629e33da5 refs/heads/feature/e
    0049fdb1e470fd4d9be344dc77645bf777eec07d7940 refs/heads/feature/f
    …
    0000
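
The advertisement above is a stream of pkt-lines: each packet starts with a 4-digit hexadecimal length that counts the 4 length digits themselves, and `0000` is a flush packet. The following is a rough sketch of reading such a stream; `pkt_line` and `parse_pkt_lines` are names invented for this illustration, and only the flush special case is handled.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as one pkt-line (length prefix includes itself)."""
    return b"%04x" % (len(payload) + 4) + payload


def parse_pkt_lines(data: bytes):
    """Yield each payload in order; a flush packet is yielded as None."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4].decode("ascii"), 16)
        if length == 0:
            # "0000" carries no payload and occupies only its 4 digits.
            yield None
            pos += 4
            continue
        if length < 4:
            raise ValueError("unsupported special packet: %04x" % length)
        yield data[pos + 4:pos + length]
        pos += length


if __name__ == "__main__":
    stream = pkt_line(b"# service=git-upload-pack\n") + b"0000"
    for payload in parse_pkt_lines(stream):
        print(payload)
```

Framing `# service=git-upload-pack\n` this way reproduces the `001e` prefix seen on the slide, since 26 payload bytes plus 4 length digits is 30, or 0x1e.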
  19. git-pack-objects • /usr/bin/git • pack-objects • --revs • --stdout • --progress • --delta-base-offset
  20. pack-*.idx: a simple index to do "object name -> location in pack-files" translations
  21. pack-*.idx Data Structure

  22. pack-*.idx Data Structure

  23. pack-*.idx Data Structure A 256-entry fan-out table, i.e. 256 * 4-byte entries
  24. pack-*.idx Data Structure • 0x0309 == number of objects whose first byte of object name is less than or equal to 1 • 0x060D == number of objects whose first byte of object name is less than or equal to 2 • …
  25. pack-*.idx Data Structure Question: Why 256 entries in total?

  26. pack-*.idx Data Structure Answer: a byte can take 256 values, and the table is indexed by the first byte of the 20-byte SHA-1
  27. pack-*.idx Data Structure Question: Which bytes hold the total number of objects?
  28. pack-*.idx Data Structure Answer: bytes 8 + 255 * 4 ~ 8 + 256 * 4, i.e. 1028 ~ 1032 (the last fan-out entry)
  29. pack-*.idx Data Structure Read it: 0x02f1f2 == 193010

  30. pack-*.idx Data Structure sorted 20-byte SHA-1 object names

  31. pack-*.idx Data Structure Question: Why Sorted?

  32. pack-*.idx Data Structure Answer: Binary search after fan-out
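
The fan-out plus binary search idea can be sketched on an in-memory toy index. This is an illustration only: it builds just the fan-out table and the sorted name table, leaves out the 8-byte file header (so positions here are relative to each table, while the slides add 8), and the function names are invented.

```python
import hashlib
import struct


def build_tables(names):
    """Return (fanout_bytes, names_bytes) for a list of 20-byte names."""
    names = sorted(names)
    counts = [0] * 256
    for name in names:
        counts[name[0]] += 1
    fanout, running = [], 0
    for count in counts:
        running += count          # entry i = objects with first byte <= i
        fanout.append(running)
    return struct.pack(">256I", *fanout), b"".join(names)


def find_position(fanout_bytes, names_bytes, name):
    """Return the sorted position of `name`, or None if it is absent."""
    first = name[0]
    hi = struct.unpack_from(">I", fanout_bytes, first * 4)[0]
    lo = struct.unpack_from(">I", fanout_bytes, (first - 1) * 4)[0] if first else 0
    while lo < hi:                # binary search inside the fan-out bucket
        mid = (lo + hi) // 2
        candidate = names_bytes[mid * 20:(mid + 1) * 20]
        if candidate == name:
            return mid
        if candidate < name:
            lo = mid + 1
        else:
            hi = mid
    return None


if __name__ == "__main__":
    sample = [hashlib.sha1(str(i).encode()).digest() for i in range(1000)]
    fanout_bytes, names_bytes = build_tables(sample)
    print(find_position(fanout_bytes, names_bytes, sample[42]))
```

The fan-out lookup narrows the search to objects sharing the first byte, so the binary search only has to cover that bucket.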

  33. pack-*.idx Data Structure Question: Ends where?

  34. pack-*.idx Data Structure Answer: 1032 + 20 * 0x02f1f2 == 3861232
  35. pack-*.idx Data Structure A table of 4-byte CRC32 values of the packed object data.
  36. pack-*.idx Data Structure Question: Ends where?

  37. pack-*.idx Data Structure Answer: 3861232 + 4 * 0x02f1f2 == 4633272
  38. pack-*.idx Data Structure 4-byte offset values: MSB + 31-bit offset. The Most Significant Bit marks the usage of the 8-byte table
  39. pack-*.idx Data Structure Question: Any limitations?

  40. pack-*.idx Data Structure Answer: a 31-bit offset can only address pack files up to 2147483648 bytes (2 GiB) in size
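
One way to picture the MSB rule: a 4-byte entry is either the offset itself or, when its top bit is set, an index into the table of 8-byte offsets. The sketch below assumes big-endian (network order) entries; `resolve_offset` is a hypothetical helper name.

```python
import struct


def resolve_offset(entry: int, large_table: bytes = b"") -> int:
    """Turn one 4-byte offset-table entry into an offset in the pack."""
    if entry & 0x80000000:
        # MSB set: the low 31 bits index the 8-byte offset table.
        index = entry & 0x7FFFFFFF
        return struct.unpack_from(">Q", large_table, index * 8)[0]
    # MSB clear: the low 31 bits are the pack offset itself.
    return entry


if __name__ == "__main__":
    print(resolve_offset(0x0004482D))
    big = struct.pack(">Q", 3 * 2**31)
    print(resolve_offset(0x80000000, big))
```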
  41. pack-*.idx Data Structure Question: Ends where?

  42. pack-*.idx Data Structure Answer: 4633272 + 4 * 0x02f1f2 == 5405312
  43. pack-*.idx Data Structure A table of 8-byte offset entries. (empty for pack files less than 2 GiB) Question: Any limitations?
  44. pack-*.idx Data Structure A table of 8-byte offset entries. (empty for pack files less than 2 GiB) Answer: No. What file on earth weighs 16800000 TB? This comfortably supports packs larger than 2 GiB
  45. pack-*.idx Data Structure (trailer) • A copy of the 20-byte SHA-1 checksum at the end of the corresponding packfile. • A 20-byte SHA-1 checksum of all of the above.
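
The "Ends where?" arithmetic from the slides can be collected into one small calculation. This is a sketch under the layout described above (8-byte header, 256 * 4-byte fan-out, then names, CRC32 values, 4-byte offsets, optional 8-byte offsets, and a 40-byte trailer); the function and key names are invented here.

```python
def idx_layout(num_objects: int, num_large_offsets: int = 0) -> dict:
    """Return the byte position where each section of the index ends."""
    fanout_end = 8 + 256 * 4                      # header + fan-out table
    names_end = fanout_end + 20 * num_objects     # sorted SHA-1 names
    crc_end = names_end + 4 * num_objects         # CRC32 table
    offsets_end = crc_end + 4 * num_objects       # 4-byte offset table
    large_end = offsets_end + 8 * num_large_offsets
    return {
        "fanout_end": fanout_end,
        "names_end": names_end,
        "crc_end": crc_end,
        "offsets_end": offsets_end,
        "large_offsets_end": large_end,
        "file_size": large_end + 20 + 20,         # two SHA-1 checksums
    }


if __name__ == "__main__":
    for section, end in idx_layout(0x02F1F2).items():
        print(section, end)
```

For the 0x02f1f2 objects of the slides this gives the same 3861232, 4633272 and 5405312 boundaries.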
  46. pack-*.pack Data Structure header

  47. pack-*.pack Data Structure

  48. pack-*.pack Data Structure

  49. pack-*.pack Data Structure

  50. pack-*.pack Data Structure n * object entries

  51. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • First byte of the name is 0x9f • IDX[8 + (0x9f - 1) * 4] == 0x0403 == 1027 • IDX[8 + 0x9f * 4] == 0x0405 == 1029 • Object No. 1027 ~ 1029
  52. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • Binary search 1027 ~ 1029 • Found at 8 + 4 * 256 + 1027 * 20 == 21572 • Skip the rest: total_num*(20+4) == 1628*24
  53. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • IDX[8 + 4 * 256 + 1628*24 + 4 * 1027] • PACK[0x0004482D] == PACK[280621]
  54. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • E3 == 11100011 • 1_______ => MSB 1, continue • _110____ => type == 6 == OFS_DELTA • ____0011 => length == 3 • 3-bit type, (n-1)*7+4-bit length
  55. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] • 01 == 00000001 • 0_______ => MSB 0, break • _0000001 => length += (1 << 4) • final length == 19
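
The two header bytes E3 01 can be decoded with a few lines of code. This is a sketch of the variable-length encoding the slides walk through (3-bit type and 4 size bits in the first byte, 7 more size bits in each continuation byte); `parse_entry_header` is a name chosen for this example.

```python
def parse_entry_header(buf: bytes, pos: int = 0):
    """Return (type, size, next_pos) for the entry header at `pos`."""
    byte = buf[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # bits 6..4 of the first byte
    size = byte & 0x0F              # low 4 bits start the size
    shift = 4
    while byte & 0x80:              # MSB set: another size byte follows
        byte = buf[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


if __name__ == "__main__":
    obj_type, size, next_pos = parse_entry_header(bytes([0xE3, 0x01]))
    print(obj_type, size, next_pos)
```

For E3 01 this yields type 6 (OFS_DELTA) and size 3 + (1 << 4) == 19, matching the slides.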
  56. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] • AA == 10101010 • 1_______ => MSB 1, continue • _0101010 => base offset == 42
  57. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • PACK[0x0004482D] • 44 == 01000100 • 0_______ => MSB 0, break • _1000100 => offset == ((42+1)<<7)+68 == 5572
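
The base-offset bytes AA 44 follow a slightly different scheme from the size bytes: the value is built most significant group first, and 1 is added before each shift. A small sketch, with `parse_ofs_delta_offset` as an invented name:

```python
def parse_ofs_delta_offset(buf: bytes, pos: int = 0):
    """Return (distance_back_to_base, next_pos) for an OFS_DELTA entry."""
    byte = buf[pos]
    pos += 1
    offset = byte & 0x7F
    while byte & 0x80:              # MSB set: another byte follows
        byte = buf[pos]
        pos += 1
        offset = ((offset + 1) << 7) | (byte & 0x7F)
    return offset, pos


if __name__ == "__main__":
    distance, _ = parse_ofs_delta_offset(bytes([0xAA, 0x44]))
    print(distance)                  # how far back the base object sits
    print(0x0004482D - distance)     # absolute offset of the base
```

With the slide's bytes the distance is ((42 + 1) << 7) + 68 == 5572, so the base object starts at 280621 - 5572 == 275049.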
  58. Example: Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9 • offset == 5572 • push 0x0004482D onto the stack • deal with (0x0004482D - 5572) • push (0x0004482D - 5572) onto the stack • … • root base
  59. Example
    SHA1                                      type  size   size-pack  offset-pack  depth  base
    9fcf811e00fa469688943a9152c16d4ee90fb9a9  blob  19     32         280621       4      6110c89446f2281e5db9b798a0fa020fad6e63e1
    6110c89446f2281e5db9b798a0fa020fad6e63e1  blob  52     45         275049       3      3bbeff3fc22b75c1a26f4ab9b64449b33002aea5
    3bbeff3fc22b75c1a26f4ab9b64449b33002aea5  blob  2935   1263       273786       2      a399208309046656ecc01f7653c5d5b8905fc16e
    a399208309046656ecc01f7653c5d5b8905fc16e  blob  4686   1540       272246       1      e4e56117de8b3bd0bd899701da4712caee27c7d6
    e4e56117de8b3bd0bd899701da4712caee27c7d6  blob  12635  3279       115703       0      -
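
The stack-based walk down to the root base can be sketched using the offsets from this table. The distances below are simply each object's pack offset minus its base's pack offset; treating every link as an offset-style delta is an assumption made for the illustration, since the slides only show the encoding of the first one.

```python
# Distance from each delta's pack offset back to its base (derived from
# the offset-pack column; the root base at 115703 has no entry).
BASE_DISTANCE = {
    280621: 5572,
    275049: 1263,
    273786: 1540,
    272246: 156543,
}


def delta_chain(offset: int, base_distance: dict) -> list:
    """Return the stack of pack offsets from `offset` down to the root base."""
    stack = [offset]
    while stack[-1] in base_distance:
        stack.append(stack[-1] - base_distance[stack[-1]])
    return stack


if __name__ == "__main__":
    chain = delta_chain(0x0004482D, BASE_DISTANCE)
    print(chain)
    print("depth:", len(chain) - 1)
```

Reconstructing the object would then pop the stack from the root base upward, applying one delta per level.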
  60. “I played around with different delta algorithms, and with making the delta window bigger, but having too big of a sliding window makes it very expensive to generate the pack: you need to compare every object with a _ton_ of other objects.” – Linus Torvalds
  61. “ANY order will give you a working pack, ... [but it is] the thing that gives packs good locality. It keeps the objects close to the head (whether they are old or new, but they are _reachable_ from the head) at the head of the pack. So packs actually have absolutely _wonderful_ IO patterns.” – Linus Torvalds
  62. Packing Heuristics • First sort by delta order • Then sort by recency order
  63. Delta Order Heuristics • first sort by type. Objects of different types never delta with each other. • we do not delta different object types.
  64. Delta Order Heuristics • then sort by filename/dirname. • we prefer to delta the objects with the same full path, but allow files with the same name from different directories.
  65. Delta Order Heuristics • then if we are doing a "thin" pack, the objects we are _not_ going to pack but we know about are sorted earlier than other objects. • we always prefer to delta against objects we are not going to send, if there are some. • for "thin" packs only. used when the other side is known to have such objects.
  66. Delta Order Heuristics • and finally sort by size, larger to smaller. • we prefer to delta against larger objects, so that we have lots of removals. • large->small matters because of compression behaviour.
  67. sort 1: (type, basename, size) • sort 2: recency
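
The delta-order heuristics above amount to a composite sort key. The sketch below is only a loose illustration of that ordering, not Git's actual implementation: the `Candidate` record, its fields and the use of the plain basename as the name key are all assumptions made for the example.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    type: str                     # "blob", "tree", "commit", ...
    path: str                     # path the object was found under
    size: int                     # uncompressed size in bytes
    preferred_base: bool = False  # known to the other side ("thin" packs)


def delta_sort_key(obj: Candidate):
    """Order by type, then basename, then thin-pack bases, then size desc."""
    basename = obj.path.rsplit("/", 1)[-1]
    # False sorts before True, so preferred bases come first in a group.
    return (obj.type, basename, not obj.preferred_base, -obj.size)


if __name__ == "__main__":
    objects = [
        Candidate("blob", "src/a.c", 10),
        Candidate("tree", "src", 5),
        Candidate("blob", "lib/a.c", 300),
        Candidate("blob", "b.c", 50),
        Candidate("blob", "old/a.c", 20, preferred_base=True),
    ]
    for obj in sorted(objects, key=delta_sort_key):
        print(obj.type, obj.path, obj.size)
```

Sorting this way places likely delta partners (same type, same file name) next to each other, so a sliding window over the list finds them without comparing every pair.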

  68. Reconciling Two Sorts • Linus' law: files grow, larger objects tend to be "more recent" • we only write out the base object first if the delta against it was more recent • Thus the front of the pack always contains data that is relevant to a "recent" object
  69. As a Result • delta order and recency order match each other quite well • xdelta: removing data is cheaper (in size) than adding data • xdelta: larger->smaller is actually a big space saver too
  70. git-pack-objects w/ threads • --threads=<n> Specifies the number of threads to spawn when searching for best delta matches. This requires that pack-objects be compiled with pthreads otherwise this option is ignored with a warning. This is meant to reduce packing time on multiprocessor machines. The required amount of memory for the delta search window is however multiplied by the number of threads. Specifying 0 will cause Git to auto-detect the number of CPU's and set the number of threads accordingly.
  71. git-receive-pack • First called with --advertise-refs • (ntohl(hdr.hdr_entries) < unpack_limit) => git-unpack-objects • (ntohl(hdr.hdr_entries) >= unpack_limit) => git-index-pack • Finally update refs with locks
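
The unpack_limit decision can be sketched by reading the 12-byte pack header (the `PACK` signature, a version and an entry count, all in network byte order) and comparing the count with a threshold. The function names are made up, and the default of 100 is an assumption for the example rather than something stated on the slides.

```python
import struct


def read_pack_header(data: bytes):
    """Return (version, entry_count) from the first 12 bytes of a pack."""
    signature, version, entries = struct.unpack_from(">4sII", data, 0)
    if signature != b"PACK":
        raise ValueError("not a pack stream")
    return version, entries


def choose_unpacker(data: bytes, unpack_limit: int = 100) -> str:
    """Mirror the slide: small packs are exploded, large ones are indexed."""
    _, entries = read_pack_header(data)
    if entries < unpack_limit:
        return "git-unpack-objects"   # store as loose objects
    return "git-index-pack"           # keep the pack, build its .idx


if __name__ == "__main__":
    small = struct.pack(">4sII", b"PACK", 2, 3)
    large = struct.pack(">4sII", b"PACK", 2, 5000)
    print(choose_unpacker(small), choose_unpacker(large))
```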
  72. Be safe

  73. Be safe

  74. Update refs with lock

  75. Thank you https://github.com/pmq20/