
Internals of Git Internals

Minqi Pan
January 07, 2016



Transcript

  1. Internals of Git Internals
    Minqi Pan


  2. I’m Minqi Pan
    github.com/pmq20
    twitter
    @psvr


  3. Synopsis
    • Internals of downloading code
    • Internals of uploading code


  4. Question:

    Which commands to use?


  5. 140 commands
    add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file,
    check-attr, check-ignore, check-mailmap, checkout, checkout-index, check-ref-format,
    cherry, cherry-pick, citool, clean, clone, column, commit, commit-tree, config,
    count-objects, credential, credential-cache, credential-store, cvsexportcommit, cvsimport,
    cvsserver, daemon, describe, diff, diff-files, diff-index, diff-tree, difftool, fast-export,
    fast-import, fetch, fetch-pack, filter-branch, fmt-merge-msg, for-each-ref, format-patch,
    fsck, gc, get-tar-commit-id, grep, gui, hash-object, help, http-backend, http-fetch,
    http-push, imap-send, index-pack, init, instaweb, interpret-trailers, log, ls-files, ls-remote,
    ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index, merge-one-file,
    mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4, pack-objects,
    pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push,
    quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace,
    request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell,
    shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status,
    stripspace, submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects,
    update-index, update-ref, update-server-info, upload-archive, upload-pack, var, verify-commit,
    verify-pack, verify-tag, whatchanged, worktree, write-tree


  6. – Chapter 10. Git Internals
    “Git is fundamentally a content-addressable
    filesystem with a VCS user interface written
    on top of it.”


  7. a content-addressable
    filesystem


  8. simple

    straightforward

    a whole-object store


  9. “Let there be packs!”


  10. Pack-files
    • USAGE I: streaming
    • USAGE II: on-disk storage
    • with a nice balance of density vs. ease of use
    • “Git really doesn't follow files”


  11. Git Object Types


  12. Random Access
    • each object is compressed individually, one at a time
    • pessimization due to the double usage: a lower
    compression ratio for the pack as a whole
    • able to translate object name to location in pack


  13. Spotlights of today
    add, am, annotate, apply, archimport, archive, bisect, blame, branch, bundle, cat-file, check-attr,
    check-ignore, check-mailmap, checkout, checkout-index, check-ref-format, cherry, cherry-pick,
    citool, clean, clone, column, commit, commit-tree, config, count-objects, credential,
    credential-cache, credential-store, cvsexportcommit, cvsimport, cvsserver, daemon, describe, diff, diff-files,
    diff-index, diff-tree, difftool, fast-export, fast-import, fetch, fetch-pack, filter-branch,
    fmt-merge-msg, for-each-ref, format-patch, fsck, gc, get-tar-commit-id, grep, gui, hash-object, help,
    http-backend, http-fetch, http-push, imap-send, index-pack, init, instaweb, interpret-trailers,
    log, ls-files, ls-remote, ls-tree, mailinfo, mailsplit, merge, merge-base, merge-file, merge-index,
    merge-one-file, mergetool, merge-tree, mktag, mktree, mv, name-rev, notes, p4,
    pack-objects, pack-redundant, pack-refs, parse-remote, patch-id, prune, prune-packed, pull, push,
    quiltimport, read-tree, rebase, receive-pack, reflog, relink, remote, repack, replace,
    request-pull, rerere, reset, revert, rev-list, rev-parse, rm, send-email, send-pack, shell,
    shortlog, show, show-branch, show-index, show-ref, sh-i18n, sh-setup, stash, status, stripspace,
    submodule, svn, symbolic-ref, tag, unpack-file, unpack-objects, update-index, update-ref,
    update-server-info, upload-archive, upload-pack, var, verify-commit, verify-pack, verify-tag,
    whatchanged, worktree, write-tree


  14. Remote Procedure Calls
    • git-fetch-pack —ssh,http—> git-upload-pack
    • git-send-pack —ssh,http—> git-receive-pack


  15. Local UNIX-y Calls
    • git-upload-pack calls git-pack-objects
    • git-receive-pack calls git-unpack-objects
    • git-receive-pack also calls git-index-pack


  16. Advantage of being UNIX-y
    Global variables can be used anywhere
    No need to pass state around


  17. git-upload-pack
    • first called with --stateless-rpc --advertise-refs
    • then called with --stateless-rpc
    • internally calls git-pack-objects


  18. git-upload-pack --advertise-refs
    001e# service=git-upload-pack
    000000d10ffaa5a6fdf84714deddaecc62edd5540e0f5877 HEAD multi_ack
    thin-pack side-band side-band-64k ofs-delta shallow no-progress include-tag
    multi_ack_detailed no-done symref=HEAD:refs/heads/master agent=git/2.2.2
    0057a576d8e0d128b503bb3d9980387fea0976ab6921 refs/heads/bug_fix/a
    0053affee7f8adf11c7d895189ec5f0c3aeb95dba01a refs/heads/bug_fix/b
    0053cbe2e212b415f587f086af6a74670919ccb7b089 refs/heads/bug_fix/c
    004568df60003126a1f6478eead611af1c9000dfe44e refs/heads/feature/ci
    0052c063094d4f94e182a23aa9a479bf67bf6e2de6b4 refs/heads/feature/d
    004cfd02d43bc2ca02d381275453383e2c2629e33da5 refs/heads/feature/e
    0049fdb1e470fd4d9be344dc77645bf777eec07d7940 refs/heads/feature/f

    0000
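
The advertisement above is framed as pkt-lines: each packet starts with four hex digits giving its total length (including those four digits), and `0000` is a flush packet. A minimal sketch of that framing follows. The helper names `pkt_line` and `parse_pkt_lines` are made up for this illustration, and the sample stream is a shortened, hand-built stand-in for real server output.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload: 4 hex digits of length (which count themselves) + payload."""
    return b"%04x" % (len(payload) + 4) + payload


def parse_pkt_lines(data: bytes):
    """Split a pkt-line stream into payloads; None marks a flush packet (0000)."""
    packets = []
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:           # flush packet
            packets.append(None)
            pos += 4
            continue
        if length < 4:
            raise ValueError("invalid pkt-line length")
        packets.append(data[pos + 4:pos + length])
        pos += length
    return packets


# Hand-built sample modelled on the slide (capability list shortened).
stream = (
    pkt_line(b"# service=git-upload-pack\n")
    + b"0000"
    + pkt_line(b"0ffaa5a6fdf84714deddaecc62edd5540e0f5877 HEAD\0multi_ack thin-pack\n")
    + b"0000"
)
packets = parse_pkt_lines(stream)
```

The first framed line comes out with the same `001e` prefix that the slide shows in front of `# service=git-upload-pack`.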


  19. git-pack-objects
    • /usr/bin/git
    • pack-objects
    • --revs
    • --stdout
    • --progress
    • --delta-base-offset


  20. pack-*.idx
    a simple index to do
    "object name -> location in pack-files"
    translations


  21. pack-*.idx Data Structure


  22. pack-*.idx Data Structure


  23. pack-*.idx Data Structure
    A 256-entry fan-out table
    metaphorically
    256 * 4-byte


  24. pack-*.idx Data Structure
    0x0309 == number of
    objects whose first
    byte of object name is
    less than or equal to 1
    0x060D == number of
    objects whose first
    byte of object name is
    less than or equal to 2


  25. pack-*.idx Data Structure
    Question:
    Why 256 entries in total?


  26. pack-*.idx Data Structure
    Answer:
    a byte can take 256 values; the table is indexed by
    the first of the 20 bytes of the SHA-1


  27. pack-*.idx Data Structure
    Question:
    Which byte has the
    number of total objects?


  28. pack-*.idx Data Structure
    Answer:
    the last fan-out entry, at bytes
    8 + 255 * 4 ~ 8 + 256 * 4
    1028 ~ 1032


  29. pack-*.idx Data Structure
    Read it:
    0x02f1f2 == 193010


  30. pack-*.idx Data Structure
    sorted 20-byte
    SHA-1 object names


  31. pack-*.idx Data Structure
    Question:
    Why Sorted?


  32. pack-*.idx Data Structure
    Answer:
    Binary search after
    fan-out
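
The fan-out table narrows the search to one bucket, and the sorted name table is then binary-searched inside that bucket. A rough sketch of the lookup against a synthetic in-memory index follows. `build_idx_prefix` and `find_object` are hypothetical helpers written for this example, and only the header, fan-out table and name table of a version-2 idx are modelled (no CRC or offset tables).

```python
import hashlib
import struct

NAMES_START = 8 + 256 * 4  # 8-byte header + 256 four-byte fan-out entries


def build_idx_prefix(names):
    """Build header, fan-out table and sorted name table (other tables omitted)."""
    names = sorted(names)
    fanout = []
    count = 0
    for first_byte in range(256):
        while count < len(names) and names[count][0] <= first_byte:
            count += 1
        fanout.append(count)  # cumulative: names whose first byte <= first_byte
    header = b"\xfftOc" + struct.pack(">I", 2)
    return header + struct.pack(">256I", *fanout) + b"".join(names)


def find_object(idx, name):
    """Return the position of `name` in the sorted table, or None if absent."""
    first = name[0]
    lo = struct.unpack_from(">I", idx, 8 + (first - 1) * 4)[0] if first else 0
    hi = struct.unpack_from(">I", idx, 8 + first * 4)[0]
    while lo < hi:  # binary search inside the fan-out bucket only
        mid = (lo + hi) // 2
        start = NAMES_START + mid * 20
        candidate = idx[start:start + 20]
        if candidate == name:
            return mid
        if candidate < name:
            lo = mid + 1
        else:
            hi = mid
    return None


# Synthetic object names: SHA-1 digests of a few short strings.
names = sorted(hashlib.sha1(str(n).encode()).digest() for n in range(50))
idx = build_idx_prefix(names)
total = struct.unpack_from(">I", idx, 8 + 255 * 4)[0]  # last fan-out entry
```

Here `total` is read from bytes 1028 to 1031, the last fan-out entry, which holds the number of objects in the pack.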


  33. pack-*.idx Data Structure
    Question:
    Ends where?


  34. pack-*.idx Data Structure
    Answer:
    1032 + 20 * 0x02f1f2

    ==

    3861232


  35. pack-*.idx Data Structure
    A table of 4-byte
    CRC32 values of the
    packed object data.


  36. pack-*.idx Data Structure
    Question:
    Ends where?


  37. pack-*.idx Data Structure
    Answer:
    3861232 + 4 * 0x02f1f2
    ==
    4633272


  38. pack-*.idx Data Structure
    4-byte offset
    values
    MSB + 31-bit
    Most Significant Bit
    marks the usage of
    8-byte table


  39. pack-*.idx Data Structure
    Question:
    Any limitations?


  40. pack-*.idx Data Structure
    Answer:
    With 31-bit offsets alone, pack
    files can be at most 2147483648
    bytes (2 GiB) in size


  41. pack-*.idx Data Structure
    Question:
    Ends where?


  42. pack-*.idx Data Structure
    Answer:
    4633272 + 4 * 0x02f1f2
    ==
    5405312
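
The "Ends where?" answers so far all follow from the object count alone. A small sketch that recomputes them for the 0x02f1f2 objects of the example follows; the function name `idx_v2_layout` and the dictionary keys are invented for this illustration.

```python
def idx_v2_layout(num_objects: int) -> dict:
    """End offsets of the fixed-size sections of a version-2 idx file."""
    fanout_end = 8 + 256 * 4                     # header + fan-out table
    names_end = fanout_end + 20 * num_objects    # sorted SHA-1 names
    crc_end = names_end + 4 * num_objects        # CRC32 table
    offsets_end = crc_end + 4 * num_objects      # 4-byte offset table
    return {
        "fanout_end": fanout_end,
        "names_end": names_end,
        "crc_end": crc_end,
        "offsets_end": offsets_end,
    }


layout = idx_v2_layout(0x02F1F2)
```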


  43. pack-*.idx Data Structure
    A table of 8-byte offset entries.
    (empty for pack files less than 2 GiB)
    Question:
    Any limitations?


  44. pack-*.idx Data Structure
    A table of 8-byte offset entries.
    (empty for pack files less than 2 GiB)
    Answer:
    No. What file on earth weighs 16800000 TB?
    Comfortably supports packs larger than 2 GiB
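
How the two offset tables cooperate can be sketched as follows: a 4-byte entry with its most significant bit clear is the pack offset itself, while an entry with the bit set carries an index into the 8-byte table in its low 31 bits. `resolve_offset` is a hypothetical helper, and the two tables are passed as separate byte strings purely to keep the example short.

```python
import struct


def resolve_offset(offset32_table: bytes, offset64_table: bytes, position: int) -> int:
    """Return the pack offset of the object at `position` in the sorted name table."""
    (raw,) = struct.unpack_from(">I", offset32_table, position * 4)
    if raw & 0x80000000 == 0:
        return raw                      # plain 31-bit offset
    index64 = raw & 0x7FFFFFFF          # MSB set: index into the 8-byte table
    (offset,) = struct.unpack_from(">Q", offset64_table, index64 * 8)
    return offset


# Two made-up entries: one small offset, one redirected to the 8-byte table.
small_table = struct.pack(">II", 280621, 0x80000000)
large_table = struct.pack(">Q", 5 * 2**30)
```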


  45. pack-*.idx Data Structure
    (trailer)
    A copy of the 20-byte SHA-1 checksum at the end of the
    corresponding packfile.
    20-byte SHA-1-checksum of all of the above.
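
Checking the second trailer field amounts to hashing everything before the last 20 bytes. A minimal sketch, with a made-up stand-in for the index body rather than a real idx file:

```python
import hashlib


def idx_checksum_ok(idx: bytes) -> bool:
    """True if the final 20 bytes are the SHA-1 of everything that precedes them."""
    body, trailer = idx[:-20], idx[-20:]
    return hashlib.sha1(body).digest() == trailer


# Stand-in data: arbitrary body bytes followed by a correct trailer checksum.
fake_body = b"\xfftOc" + b"\x00" * 64
good_idx = fake_body + hashlib.sha1(fake_body).digest()
bad_idx = fake_body + b"\x00" * 20
```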


  46. pack-*.pack Data Structure
    header


  47. pack-*.pack Data Structure


  48. pack-*.pack Data Structure


  49. pack-*.pack Data Structure


  50. pack-*.pack Data Structure
    n * object entries


  51. Example
    • First byte of the name is 0x9f
    • IDX[8 + (0x9f - 1) * 4] == 0x0403 == 1027
    • IDX[8 + 0x9f * 4] == 0x0405 == 1029
    • Object No. 1027 ~ 1029
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9


  52. Example
    • Binary search 1027 ~ 1029
    • Found at 8 + 4 * 256 + 1027 * 20 == 21572
    • Skip the rest total_num*(20+4) == 1628*24
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9


  53. Example
    • IDX[8 + 4 * 256 + 1628*24 + 4 * 1027]
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    • PACK[0x0004482D] == PACK[280621]


  54. Example
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    E3 11100011
    1_______ => MSB 1 continue
    _110____ => type == 6 == OFS_DELTA
    ____0011 => length == 3
    3-bit type, (n-1)*7+4-bit length


  55. Example
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    • PACK[0x0004482D]
    01 00000001
    0_______ => MSB 0 break
    _0000001 => length += (1 << 4)
    final length == 19
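
The two header bytes E3 01 can be decoded mechanically: the first byte carries a continuation bit, a 3-bit type and the low 4 bits of the length, and each further byte adds 7 more length bits. A sketch of that decoding follows; the function name is made up for the example.

```python
def decode_entry_header(data: bytes, pos: int = 0):
    """Decode a pack entry header; returns (type, size, position after header)."""
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # bits 6..4: object type
    size = byte & 0x0F              # bits 3..0: low 4 bits of the length
    shift = 4
    while byte & 0x80:              # MSB set: another length byte follows
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return obj_type, size, pos


OFS_DELTA = 6
header = decode_entry_header(bytes([0xE3, 0x01]))
```

For the bytes in the example this yields type 6 (OFS_DELTA) and length 19, consuming two bytes.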


  56. Example
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    • PACK[0x0004482D]
    AA 10101010
    1_______ MSB 1 continue
    _0101010 base offset == 42


  57. Example
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    • PACK[0x0004482D]
    44 01000100
    0_______ MSB 0 break
    _1000100 offset == ((42+1)<<7)+68
    == 5572
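
The base offset of an OFS_DELTA entry uses a slightly different variable-length encoding: each continuation adds one before shifting, which is where the `(42+1)<<7` comes from. A sketch, again with a made-up function name:

```python
def decode_ofs_delta_offset(data: bytes, pos: int = 0):
    """Decode the negative base offset of an OFS_DELTA entry.

    Returns (offset, position after the encoded offset).
    """
    byte = data[pos]
    pos += 1
    offset = byte & 0x7F
    while byte & 0x80:              # MSB set: another byte follows
        byte = data[pos]
        pos += 1
        offset = ((offset + 1) << 7) | (byte & 0x7F)
    return offset, pos


relative, _ = decode_ofs_delta_offset(bytes([0xAA, 0x44]))
base_position = 0x0004482D - relative   # the base sits this far back in the pack
```

With the bytes AA 44 this gives 5572, so the base object starts at 280621 - 5572 = 275049 in the pack.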


  58. Example
    Read 9fcf811e00fa469688943a9152c16d4ee90fb9a9
    offset == 5572
    push 0x0004482D into stack
    deal with (0x0004482D - 5572)
    push (0x0004482D - 5572) into stack

    root base


  59. Example
    SHA1                                      type  size   size-pack  offset-pack  depth  base
    9fcf811e00fa469688943a9152c16d4ee90fb9a9  blob  19     32         280621       4      6110c89446f2281e5db9b798a0fa020fad6e63e1
    6110c89446f2281e5db9b798a0fa020fad6e63e1  blob  52     45         275049       3      3bbeff3fc22b75c1a26f4ab9b64449b33002aea5
    3bbeff3fc22b75c1a26f4ab9b64449b33002aea5  blob  2935   1263       273786       2      a399208309046656ecc01f7653c5d5b8905fc16e
    a399208309046656ecc01f7653c5d5b8905fc16e  blob  4686   1540       272246       1      e4e56117de8b3bd0bd899701da4712caee27c7d6
    e4e56117de8b3bd0bd899701da4712caee27c7d6  blob  12635  3279       115703       0      -
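
The delta chain in this example can be walked with a stack, pushing each entry's pack offset until the root base is reached. The sketch below only follows the offsets listed above and does not apply any delta data; the `base_of` mapping and the function name are illustrative.

```python
# Pack offset of each entry mapped to the pack offset of its base
# (None for the root base); values taken from the example chain.
base_of = {
    280621: 275049,
    275049: 273786,
    273786: 272246,
    272246: 115703,
    115703: None,
}


def delta_chain(entries: dict, offset: int) -> list:
    """Push offsets until the root base is reached; the root ends up on top."""
    stack = []
    current = offset
    while current is not None:
        stack.append(current)
        current = entries[current]
    return stack


chain = delta_chain(base_of, 0x0004482D)
depth = len(chain) - 1   # number of deltas between the object and its root base
```

Popping the stack then visits the root base first, which is the order in which the deltas would have to be applied.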


  60. – Linus Torvalds
    “I played around with different delta
    algorithms, and with making the delta window
    bigger, but having too big of a sliding window
    makes it very expensive to generate the
    pack: you need to compare every object with
    a _ton_ of other objects.”


  61. – Linus Torvalds
    “ANY order will give you a working pack, ...
    [but it is] the thing that gives packs good
    locality. It keeps the objects close to the head
    (whether they are old or new, but they are
    _reachable_ from the head) at the head of
    the pack. So packs actually have absolutely
    _wonderful_ IO patterns.”


  62. Packing Heuristics
    • First sort by delta order
    • Then sort by recency order


  63. Delta Order Heuristics
    • first sort by type.
    • objects of different types are never deltified
    against each other.


  64. Delta Order Heuristics
    • then sort by filename/dirname.
    • we prefer to delta the objects with the same full
    path, but allow files with the same name from
    different directories.


  65. Delta Order Heuristics
    • then if we are doing "thin" pack, the objects we
    are _not_ going to pack but we know about are
    sorted earlier than other objects.
    • we always prefer to delta against objects we are
    not going to send, if there are some.
    • for "thin" packs only. used when the other side is
    known to have such objects.


  66. Delta Order Heuristics
    • and finally sort by size, larger to smaller.
    • we prefer to delta against larger objects, so that
    we have lots of removals.
    • large->small matters because of compression
    behaviour.
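
The four delta-order rules can be read as one composite sort key. The sketch below is only an illustration of the ordering described on these slides, not the comparison git itself performs; the `Candidate` class, its fields and the sample objects are all invented.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str
    obj_type: str
    path: str
    size: int
    preferred_base: bool = False   # known to the other side, not sent ("thin" pack)


def delta_sort_key(c: Candidate):
    """Type, then basename, then dirname, then thin-pack bases first, then larger first."""
    return (
        c.obj_type,
        posixpath.basename(c.path),
        posixpath.dirname(c.path),
        not c.preferred_base,      # False sorts before True
        -c.size,                   # larger objects come earlier
    )


candidates = [
    Candidate("a", "blob", "src/main.c", 100),
    Candidate("b", "blob", "src/main.c", 300),
    Candidate("c", "blob", "lib/main.c", 200),
    Candidate("d", "tree", "src", 50),
    Candidate("e", "blob", "src/main.c", 10, preferred_base=True),
]
ordered = [c.label for c in sorted(candidates, key=delta_sort_key)]
```

Objects that end up adjacent in this order are the ones that would be tried against each other inside the delta window.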


  67. sort 1: (type, basename, size)
    sort 2: recency


  68. Reconciling Two Sorts
    • Linus' law: files grow, larger objects tend to be
    "more recent"
    • we only write out the base object first if the delta
    against it was more recent
    • Thus the front of the pack always contains data
    that is relevant to a "recent" object


  69. As a Result
    • delta order and recency order match each other
    quite well
    • with xdelta, removing data is cheaper (in size) than
    adding data
    • with xdelta, going large->small is therefore a big space
    saver too


  70. git-pack-objects w/ threads
    --threads=<n>
    Specifies the number of threads to spawn when searching for
    best delta matches. This requires that pack-objects be
    compiled with pthreads otherwise this option is ignored with
    a warning. This is meant to reduce packing time on
    multiprocessor machines. The required amount of memory
    for the delta search window is however multiplied by the
    number of threads. Specifying 0 will cause Git to auto-detect
    the number of CPU's and set the number of threads
    accordingly.


  71. git-receive-pack
    • First called with --advertise-refs
    • (ntohl(hdr.hdr_entries) < unpack_limit) => git-unpack-objects
    • (ntohl(hdr.hdr_entries) >= unpack_limit) => git-index-pack
    • Finally update refs with locks
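
The choice between the two unpackers depends only on the entry count in the 12-byte pack header (signature, version, number of entries, all big-endian). A sketch of that decision follows. `choose_unpacker` is a hypothetical helper, and the default limit of 100 is an assumption based on git's documented default for `receive.unpackLimit`.

```python
import struct


def choose_unpacker(pack_header: bytes, unpack_limit: int = 100) -> str:
    """Pick the program that would handle an incoming pack, from its header alone."""
    # ">4sII" reads big-endian fields, the same byte order ntohl() converts from.
    signature, version, entries = struct.unpack(">4sII", pack_header[:12])
    if signature != b"PACK":
        raise ValueError("not a pack stream")
    if entries < unpack_limit:
        return "git-unpack-objects"   # few objects: explode into loose objects
    return "git-index-pack"           # many objects: keep the pack, build an idx


small_push = struct.pack(">4sII", b"PACK", 2, 99)
large_push = struct.pack(">4sII", b"PACK", 2, 100)
```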


  72. Be safe


  73. Be safe


  74. Update refs with lock


  75. Thank you
    https://github.com/pmq20/
