Graph operations in Git, and how to make them faster

Graph operations in the Git version control system: how performance was improved for large repositories, and how it can be improved further.

Git uses various clever methods for making operations on very large repositories faster, from bitmap indices for 'git fetch', to generation numbers (also known as topological levels) in the commit-graph file for commit graph traversal operations like 'git log --graph'. There are also other ideas that could be used to make those operations even faster.

Jakub Narębski

December 03, 2019
Transcript

  1. Graph operations in Git version control system: how the performance was improved (for large repositories), how can it be further improved

    dr Jakub Narębski, Nicolaus Copernicus University in Toruń (UMK), Poland. Presented on December 3, 2019 (v1.2).
  2. Table of contents

    1. Introduction: Motivation; Graphs in Git
    2. Operations on graphs
    3. Methods for improving performance: Bitmap index; Generation number; Algorithm for finding common ancestors; Algorithm for topological sorting
    4. Future work: Corrected commit creation date; Other graph labels
  4. Motivation: scaling Git up in the presence of the increasing size of repositories

    Git repositories are growing with respect to the number of commits. Examples:
    - Linux kernel: 740 000 commits (2018)
    - MS Windows: 1 700 000 commits (2018)
    - Android (AOSP): 874 000 commits (2019)
    - Chromium: 772 000 commits (2019)
    - ...
    - Git: 55 000 commits (2019)

    This causes a noticeable slowdown of Git operations, which now take seconds: gitk and git log --graph, git push --force-with-lease, git status --ahead-behind, ...

    The serialized commit-graph, available since Git 2.18 (Derrick Stolee, Microsoft), provides space for storing auxiliary labels / reachability indices, such as the generation number.
  5. Motivation: scaling Git up in the presence of the increasing size of repositories

    Git repositories are growing with respect to the number of commits. Examples (updated figures):
    - Linux kernel: 826 000 commits (2019)
    - MS Windows: 3 100 000 commits (2019)
    - Android (AOSP): 874 000 commits (2019)
    - Chromium: 772 000 commits (2019)
    - ...
    - Git: 55 000 commits (2019)

    This causes a noticeable slowdown of Git operations, which now take seconds: gitk and git log --graph, git push --force-with-lease, git status --ahead-behind, ...

    The serialized commit-graph, available since Git 2.18 (Derrick Stolee, Microsoft), provides space for storing auxiliary labels / reachability indices, such as the generation number.
  7. Object graph: Git repository as a content-addressed object database

    Data in Git repositories is stored as a Directed Acyclic Graph (DAG), that is, edges are directed and there are no cycles.

    Nodes (vertices) in this graph are objects of the following types:
    - commit: represents a revision; commits store the project history
    - tree: snapshot of project files at a given point in time; also represents subdirectories (in a hierarchical way)
    - blob: stores file contents at a given version
    - tag: represents an annotated or signed version of a project

    Edges between nodes represent relationships:
    - commit → commit: the "based on" relationship; the second commit is a parent of the first (each revision has zero or more parent commits)
    - commit → tree: project repository contents at the given revision
    - tree → tree and tree → blob: filesystem hierarchy
    - tag → object (usually a commit): symbolic name of the object

    See also: xkcd.com/1597/
  8. Visualization of the objects graph in the Git repository

    [Figures: object graph and contents deduplication; hierarchical file structure]
  10. Object graph in a Git repository (object model of a repository)

    [Figure] Source: Derrick Stolee, Advanced Git for Beginners, https://stolee.dev/docs/git.pdf
  11. Object graph and external references to it: branches, tags, the index, ...

    [Figure] Example from https://github.com/sensorflo/git-draw/wiki
  12. Commit graph: representation of project history in Git

    - (Almost) every commit (revision) has a reference to its parent, that is, the revision it was based on
    - Some revisions (commits), being the result of a merge operation, have two parents (rarely more, in so-called octopus merges)
    - At least one (initial) revision has no parents
    - Each commit object includes a reference to the tree representing the snapshot of files in the repository
    - Branches and tags are external references into the commit graph
    - HEAD is a symbolic reference to the current branch (a detached HEAD points directly to a commit)

    [Figure: example commit graph with commits c7cd3, f30ab, 34ac2, 98ca9, 23b88, d77af, the master branch, HEAD, and the tag v0.9]
  13. Object addressing: SHA-1 / SHA-256 of object representation as object identifiers

    Each object in the repository's object database is referenced using the SHA-1 hash of the representation of the object contents (a switch to SHA-256, aka NewHash, is in progress).

    Example: object representation of a commit:

        $ git cat-file -p HEAD^
        tree 49f152498cfde082f3223e20338ca64517991cd7
        parent bdd1cc20929e9f7631dd29ff70426eea53f69443
        author Junio C Hamano <gitster@pobox.com> 1452640851 -0800
        committer Junio C Hamano <gitster@pobox.com> 1452640851 -0800

        First batch for post 2.7 cycle

        Signed-off-by: Junio C Hamano <gitster@pobox.com>

    Source: http://shafiul.github.io/gitbook
  14. Object addressing: SHA-1 / SHA-256 of object representation as object identifiers

    Example: object representation of a tree:

        $ git cat-file -p 49f152498cfde082f3223e20338ca64517991cd7
        100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f    .gitattributes
        100644 blob 1c2f8321386f89ef8c03d11159c97a0f194c4423    .gitignore
        100644 blob e5b4126bec557db55924b7b60ed70349626ea2c4    .mailmap
        100644 blob c3bf9c6d4d1c6049dd38a5a861dceb4c8e1b7e99    .travis.yml
        100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42    COPYING
        040000 tree 370ffbfd689a5d307a5d99be32006bef1b0d33ed    Documentation
        100755 blob 5873f163e51684ff4fc53119c59fe2c4fc24fabe    GIT-VERSION-GEN
        100644 blob ffb071e9f03a79a052beaa4372fa790ecbabbb7b    INSTALL
        ...

    Source: http://shafiul.github.io/gitbook
  15. Object addressing: SHA-1 / SHA-256 of object representation as object identifiers

    Example: object representation of a tag:

        $ git cat-file -p v2.7.0
        object 754884255bb580df159e58defa81cdd30b5c430c
        type commit
        tag v2.7.0
        tagger Junio C Hamano <gitster@pobox.com> 1451945292 -0800

        Git 2.7
        -----BEGIN PGP SIGNATURE-----
        Version: GnuPG v1
        ...
        -----END PGP SIGNATURE-----

    Source: http://shafiul.github.io/gitbook
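As a rough illustration of content addressing, the sketch below computes SHA-1 object names the way Git does for loose objects: the hash covers a header of the form "<type> <size>" followed by a NUL byte, and then the raw payload. The function name `git_object_id` is made up for this example.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 object name Git assigns to `payload` of the given type."""
    # Header is "<type> <size in bytes>" followed by a single NUL byte.
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


# Identical contents always get the same name, which is what makes
# deduplication in the object database automatic.
empty_blob = git_object_id("blob", b"")
hello_blob = git_object_id("blob", b"hello world\n")
empty_tree = git_object_id("tree", b"")
```

The blob names should match what `git hash-object` reports for files with the same contents.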
  17. Git commands working directly on the object graph

    - object exchange (object transfer): git fetch, git push, git clone. The server and the client negotiate so that only the necessary (new) objects are sent, that is, those missing from the other side.
    - garbage collection: git repack, git gc. These remove unreachable objects (results of git commit --amend, git rebase, multiple git add <file>, etc.)
  18. Git commands working directly on the commit graph (1/2)

    Breakdown into categories based on the type of the result:
    - commands returning a boolean (true/false) value
      - git merge-base --is-ancestor A B: is A an ancestor of B, i.e. is A reachable from B
    - commands returning a subset (of a larger set)
      - git branch --contains (or git tag ...): branches/tags from which the given commit is reachable
      - git branch --merged (or git tag ...): branches/tags reachable from the given commit
      - autofollowing tags during git fetch (see the documentation of remote.<name>.tagOpt)
    - commands finding a node or nodes in the commit graph
      - git merge-base --all A B: finding the lowest (closest) common ancestors
  19. Git commands working directly on the commit graph (1/2)

    Breakdown into categories based on the type of the result (negated variants):
    - commands returning a boolean (true/false) value
      - git merge-base --is-ancestor A B: is A an ancestor of B, i.e. is A reachable from B
    - commands returning a subset (of a larger set)
      - git branch --no-contains (or git tag ...): branches/tags from which the given commit is unreachable
      - git branch --no-merged (or git tag ...): branches/tags unreachable from the given commit
      - autofollowing tags during git fetch (see the documentation of remote.<name>.tagOpt)
    - commands finding a node or nodes in the commit graph
      - git merge-base --all A B: finding the lowest (closest) common ancestors
  20. Git commands working directly on the commit graph (2/2)

    Breakdown based on the type of the result, continued:
    - commands returning a path / subgraph
      - git log A..B ≡ git log B --not A: commits reachable from B and unreachable from A
      - git log A...B (symmetric difference): A...B ≡ A B --not $(git merge-base --all A B), that is, commits reachable exclusively from either A or B
      - git log --ancestry-path A..B: commits directly on the path leading from B to A (inclusive)
    - topological sorting options (and equivalents)
      - git log --topo-order / --graph, gitk, etc.; additionally, those try to keep related revisions together
    - iterative bisection of the graph (to find a regression)
      - git bisect

    [Figure: diagrams of the A..B and A...B ranges for branches A and B diverging from a common commit O]
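The two range notations can be read as plain set operations on "everything reachable from a tip". The sketch below is only a toy model under that reading: the commit graph is a hypothetical dict mapping each commit name to a tuple of its parents, and the helper names are invented for this example. Git itself computes these ranges with an incremental walk rather than by materializing full ancestor sets.

```python
def ancestors(parents, start):
    """All commits reachable from `start` (including `start` itself)."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def two_dot(parents, a, b):
    """A..B: reachable from B and not reachable from A."""
    return ancestors(parents, b) - ancestors(parents, a)


def three_dot(parents, a, b):
    """A...B: reachable from exactly one of A and B (symmetric difference)."""
    return ancestors(parents, a) ^ ancestors(parents, b)


# Two branches diverging from a common commit O (made-up names).
PARENTS = {
    "O": (),
    "A1": ("O",), "A": ("A1",),
    "B1": ("O",), "B": ("B1",),
}

only_b = two_dot(PARENTS, "A", "B")       # {"B", "B1"}
either = three_dot(PARENTS, "A", "B")     # {"A", "A1", "B", "B1"}
```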
  21. Definition of reachability in a directed acyclic graph (DAG)

    Definition of reachability (graph theory):
    Let us assume that we have a directed acyclic graph G = (V, E), where V is a finite set of vertices (nodes), and E ⊂ V² is a finite set of directed edges.
    For all (u, v) ∈ V² we say that v is reachable from u, which we denote as r(u, v) or as u ⇝ v, if and only if u = v, or there exists w such that (u, w) ∈ E ∧ r(w, v).

    Properties of this relation:
    - ∀v ∈ V: r(v, v) (reflexivity)
    - r(u, w) ∧ r(w, v) ⟹ r(u, v) (transitivity)
    - r(u, v) ∧ r(v, u) ⟹ u = v (antisymmetry)

    The reachability relation imposes a partial order on the nodes of the graph.
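The recursive definition of r(u, v) translates almost word for word into code. The following is a minimal sketch on a made-up four-node DAG (the `EDGES` dict and node names are assumptions for illustration); memoization keeps the recursion from revisiting the same pair repeatedly.

```python
from functools import lru_cache

# Hypothetical DAG: each node maps to the targets of its outgoing edges.
EDGES = {
    "d": ("b", "c"),
    "b": ("a",),
    "c": ("a",),
    "a": (),
}


@lru_cache(maxsize=None)
def r(u, v):
    """v is reachable from u: u == v, or some edge (u, w) exists with r(w, v)."""
    return u == v or any(r(w, v) for w in EDGES[u])


nodes = list(EDGES)
reflexive = all(r(v, v) for v in nodes)
transitive = all(
    r(u, v) for u in nodes for w in nodes for v in nodes if r(u, w) and r(w, v)
)
antisymmetric = all(
    u == v for u in nodes for v in nodes if r(u, v) and r(v, u)
)
```

On this graph all three properties hold, and for example "a" is reachable from "d" while "c" is not reachable from "b".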
  22. Lowest (closest) common ancestors, or merge base

    Lowest common ancestor(s) are used when merging (via git merge) two branches A and B.
    A lowest (closest) common ancestor, like P and Q in the figure:
    - is reachable both from A and from B
    - is not reachable from any other revision reachable from both A and B

    [Figure] Source: https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/
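The two conditions above can be checked directly, if inefficiently. The sketch below is a naive set-based illustration, not the algorithm Git uses: it assumes a hypothetical dict of commit → parents, and the criss-cross history in `PARENTS` is invented so that two merge bases, P and Q, exist.

```python
def ancestors(parents, start):
    """All commits reachable from `start` (including `start` itself)."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that no other common ancestor can reach."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c in ancestors(parents, other) for other in common if other != c)
    }


# Criss-cross history (made-up names): both A and B merge P and Q.
PARENTS = {
    "R": (),
    "P": ("R",), "Q": ("R",),
    "A": ("P", "Q"), "B": ("Q", "P"),
}

bases = merge_bases(PARENTS, "A", "B")   # {"P", "Q"}; R is reachable from both
```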
  23. Topological sorting of directed acyclic graph (DAG)

    Definition of topological sorting:
    A topological sorting of a directed acyclic graph is any linear (total) ordering ≺ of its nodes (vertices) for which (u, v) ∈ E ⟹ u ≺ v.
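One standard way to produce such an ordering is Kahn's algorithm, sketched here for a hypothetical commit graph stored as a dict of commit → parents. With edges pointing from a commit to its parents, the condition (u, v) ∈ E ⟹ u ≺ v means every commit is emitted before its parents. This is a textbook sketch, not Git's own implementation.

```python
from collections import deque


def topo_order(parents):
    """Order commits so that every commit precedes all of its parents."""
    # In-degree here counts children: how many commits point at this one.
    indegree = {commit: 0 for commit in parents}
    for commit in parents:
        for parent in parents[commit]:
            indegree[parent] += 1

    queue = deque(c for c in parents if indegree[c] == 0)  # branch tips
    order = []
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            indegree[parent] -= 1
            if indegree[parent] == 0:  # all children already emitted
                queue.append(parent)
    return order


# Made-up history: e merges d and c, which both descend from a.
EXAMPLE = {"e": ("d", "c"), "d": ("b",), "c": ("a",), "b": ("a",), "a": ()}
order = topo_order(EXAMPLE)
```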
  26. Bitmap indices: solving the problem of object exchange

    Here "bitmap" means simply a bit vector (a vector of 0/1 values).

    [Figure] Image: Scott Chacon, Pro Git, https://git-scm.com/book

                  1a410e cac0ca fdf4fc 3c439c 0155eb d8329f 1f7a7a fa49b0 93baae
        1a410e      1      1      1      1      1      1      1      1      1
        cac0ca      0      1      1      0      1      1      1      1      1
        fdf4fc      0      0      1      0      0      1      0      0      1

    - the bit location corresponds to the position of the object in the packfile
    - bit 1 in the bitmap for a given revision means that the object at that position is reachable from it
    - bit 0 in the bitmap means the object is not reachable
    - objects to be transferred: want AND NOT have
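The "want AND NOT have" rule is a single bitwise expression once each reachability bitmap is held as an integer. The sketch below models a tiny packfile with nine objects; the short object names and the three bit rows are illustrative values, and `to_mask` / `objects_to_send` are names invented for this example.

```python
# Object names in packfile order (illustrative short IDs).
OBJECTS = ["1a410e", "cac0ca", "fdf4fc", "3c439c", "0155eb",
           "d8329f", "1f7a7a", "fa49b0", "93baae"]


def to_mask(bits):
    """Turn a string of 0/1 (leftmost character = position 0) into an int."""
    mask = 0
    for position, bit in enumerate(bits):
        if bit == "1":
            mask |= 1 << position
    return mask


# Reachability bitmaps for three commits.
BITMAPS = {
    "1a410e": to_mask("111111111"),
    "cac0ca": to_mask("011011111"),
    "fdf4fc": to_mask("001001001"),
}


def objects_to_send(want, have):
    """Objects reachable from `want` but not from `have`."""
    missing = BITMAPS[want] & ~BITMAPS[have]
    return [name for position, name in enumerate(OBJECTS)
            if (missing >> position) & 1]


to_send = objects_to_send("1a410e", "cac0ca")
```

Here a client that already has everything reachable from cac0ca and wants 1a410e needs only the two objects at positions 0 and 3.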
  27. Bitmap indices: finding objects that we have (and don't need to fetch)

    Source: http://githubengineering.com/counting-objects/

        $ GIT_TRACE_PACKET=1 git pull
        [...]
        ... fetch-pack> want a595...
        ... fetch-pack> want a4c7...
        ... fetch-pack> want d1c7...
        [...]
        ... fetch-pack> 0000
        ... fetch-pack> have cc3f...
        ... fetch-pack> have 5bd5...
        [...]
        ... fetch-pack> 0000
  28. Bitmap index for the packfile: selected details of the bitmap index implementation in Git

    - We cannot create a reachability bitmap (bit vector) for every revision, because it would take too much space (it would mean storing the transitive closure of the object graph)
      - a heuristic algorithm selects which commits get a bitmap
      - the newest versions get a bitmap, which is needed for cloning
      - the deeper in the commit graph (the earlier in history), the less frequently reachability bitmaps are added (down to one every 3000 revisions)
    - Space taken by bitmaps is minimized by using RLE compression
      - EWAH Bitmaps by Daniel Lemire (implementations in Java, C#, C++)
      - patent free (which is unfortunately not the case for every bitmap compression algorithm)
      - JGit support added by Shawn Pearce and Colby Ranger (Google)
      - ported to C as libewok by Vicent Martí (GitHub)
      - good enough compression level with fast decompression
      - some operations on bitmaps can be performed without decompression
    - Trick: store the result of XOR with the bitmap of some earlier commit
  29. EWAH (Enhanced Word-Aligned Hybrid) format

    - word-aligned format
      - different variants: 32-bit, 64-bit
      - speed at the cost of compression ratio
    - clean words: a run of 0s or of 1s
      - RLE (run-length) encoding: how many repeating 0s or 1s
    - dirty words: 0s mixed with 1s
      - literal encoding
    - EWAH: marker words give the length and type of a sequence
    - bit operations on compressed bitmaps without decompressing them (symmetric operations only)

    Daniel Lemire, Owen Kaser, Kamel Aouiche, "Sorting improves word-aligned bitmap indexes". http://arxiv.org/abs/0901.3751
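The clean/dirty distinction can be shown with a deliberately simplified codec. This is not the actual EWAH layout (there are no marker words, and the 8-bit word size is an assumption chosen to keep the example readable); it only illustrates that runs of all-zero or all-one words collapse into a count while mixed words are stored literally.

```python
WORD_BITS = 8                       # assumed word size; EWAH uses 32 or 64
ALL_ONES = (1 << WORD_BITS) - 1


def compress(words):
    """Collapse runs of clean words (all 0s or all 1s); keep dirty words as-is."""
    chunks = []
    for word in words:
        if word in (0, ALL_ONES):
            bit = 1 if word else 0
            if chunks and chunks[-1][0] == "clean" and chunks[-1][1] == bit:
                chunks[-1] = ("clean", bit, chunks[-1][2] + 1)  # extend the run
            else:
                chunks.append(("clean", bit, 1))
        else:
            chunks.append(("dirty", word))
    return chunks


def decompress(chunks):
    """Inverse of compress()."""
    words = []
    for chunk in chunks:
        if chunk[0] == "clean":
            _, bit, count = chunk
            words.extend([ALL_ONES if bit else 0] * count)
        else:
            words.append(chunk[1])
    return words


bitmap_words = [0, 0, 0, 0x5A, 0xFF, 0xFF, 0]
packed = compress(bitmap_words)
```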
  30. Bitmap index: References

    - Vicent Martí (GitHub). Counting Objects. GitHub Engineering, 22 Sep 2015. http://githubengineering.com/counting-objects/ https://github.blog/2015-09-22-counting-objects/
    - Shawn Pearce (Google). Scaling Up JGit. EclipseCon 2013, Boston.
    - Daniel Lemire. All About Bitmap Indexes... And Sorting Them. Slides presented at BDA'08 and DOLAP'08, 12 Feb 2009. http://lemire.me/talks/uqamtalk.pdf
    - Vicent Martí. GIT bitmap v1 format. Documentation/technical/bitmap-format.txt
  31. Extracting information about edges in the commit graph

    Following the edges requires:
    - finding the object (packed or loose)
    - object decompression (zlib)
    - possibly resolving deltas
    - parsing the commit object

    Commit object representation:

        tree 49f152498cfde082f3223e20338ca64517991cd7
        parent bdd1cc20929e9f7631dd29ff70426eea53f69443
        author A U Thor <thor@example.com> 1452640851 -0800
        committer C O Mitter <ter@example.us> 1452640851 -0800

        First batch for post 2.7 cycle

        Signed-off-by: A U Thor <thor@example.com>

    [Figure: loose format (one file per object) and packed format (multiple objects). Source: Junio Hamano, Git Chronicles, talk at GitTogether 2008]
  32. Two ways of storing objects in Git: an outline

    [Figure] Source: Junio Hamano, Git Chronicles, talk at GitTogether 2008
  33. Commit-graph file format (storing serialized commit graph)

    Following the edges requires:
    - finding the object (packed or loose)
    - object decompression (zlib)
    - possibly resolving deltas
    - parsing the commit object

    This can take a long time, especially in large repositories, where it needs to be done thousands of times.

    Git 2.18 and later supports the commit-graph file, which stores this DAG information in a compact form:
    - fanout table: binary search seeds
    - list of commits, sorted by commit ID
    - commit data, with parents given as positions in that list
    - list of 3rd and later parents: octopus edges
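To show how a fanout table "seeds" the binary search, here is a small in-memory sketch. It assumes the usual 256-entry layout in which entry i counts the commits whose first ID byte is at most i, and it uses shortened lowercase hex IDs and invented function names; the real file stores binary IDs and fixed-width records.

```python
from bisect import bisect_left


def build_fanout(sorted_ids):
    """fanout[i] = number of IDs whose first byte is <= i."""
    fanout = [0] * 256
    for oid in sorted_ids:
        fanout[int(oid[:2], 16)] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]
    return fanout


def lookup(sorted_ids, fanout, oid):
    """Position of `oid` in the sorted list, or None if it is absent."""
    first_byte = int(oid[:2], 16)
    lo = fanout[first_byte - 1] if first_byte else 0
    hi = fanout[first_byte]
    # Binary search only within the slice sharing the same first byte.
    pos = bisect_left(sorted_ids, oid, lo, hi)
    return pos if pos < hi and sorted_ids[pos] == oid else None


# Shortened, illustrative commit IDs, sorted as in the commit-graph file.
IDS = sorted(["b50d82", "032b4c", "e144d1", "62d18e"])
FANOUT = build_fanout(IDS)

# Commit data can then refer to parents by list position instead of by ID.
PARENT_POSITIONS = {lookup(IDS, FANOUT, "62d18e"): [lookup(IDS, FANOUT, "032b4c")]}
```

Storing parents as positions means that following an edge is an array index rather than another object lookup.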
  34. Making Git faster with commit-graph file (serialized commit graph)

    Linux kernel (around 750 000 revisions / commits), times in seconds:

        command                        before    after   change
        git merge-base master topic      0.52     0.06     -88%
        git branch --contains           76.20     0.04     -99%
        git tag --contains               5.30     0.03     -99%
        git tag --merged                 6.30     1.50     -76%
        git log --graph -10              5.90     0.74     -87%

    master: 032b4cc884490c4bc7c4ef8c91e6d, topic: 62d18ecfa64137349fac9c5817784fb, where the topic branch is 30 986 revisions before the master branch, and the master branch can reach 722 849 revisions.

    Git code repository (around 50 000 revisions), times in seconds:

        command                        before    after   change
        git merge-base master topic      0.10     0.04     -60%
        git branch --contains            0.76     0.03     -96%
        git tag --contains               0.70     0.03     -96%
        git tag --merged                 0.74     0.12     -84%
        git log --graph -10              0.44     0.05     -89%

    master: b50d82b00a8fc9d24e41ae7dc30185, topic: e144d126d74f5d2702870ca9423743, where master is 2032 revisions behind the topic branch, and can reach 49 361 commits.
  35. Making Git faster with commit-graph file (serialized commit graph)

    MS Windows, with GVFS (around 1 700 000 commits), times in seconds:

        command                        before    after   change
        git status --ahead-behind       14.30     4.70     -67%
        git merge-base master topic     11.40     1.80     -84%
        git branch --contains            9.40     1.60     -83%
        git log --graph -10             24.30     5.30     -78%

    Here master includes 2 214 796 reachable commits, and the local version of master is 81 776 revisions behind origin/master, which affects the speed of git status; such a value is typical for development there.
  36. Serialized commit graph: References

    - Derrick Stolee (Microsoft). Supercharging the Git Commit Graph series, Azure DevOps Blog, 2018. https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph/ https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-ii-file-format/
    - Johannes Schindelin (Microsoft), Derrick Stolee (Microsoft). Making Git for Windows (starting from 15:00), Git Merge 2018.
    - Git commit graph format. Documentation/technical/commit-graph-format.txt
  37. The creation date of the commit object as a heuristic

        tree 49f152498cfde082f3223e20338ca64517991cd7
        parent bdd1cc20929e9f7631dd29ff70426eea53f69443
        author A U Thor <thor@example.com> 1452640851 -0800
        committer C O Mitter <ter@example.us> 1452640851 -0800

        First batch for post 2.7 cycle

        Signed-off-by: A U Thor <thor@example.com>

    Revision dates:
    - authordate is the creation date of the changes (authorship)
    - committerdate is the date those changes were added to the repository (the commit object was created)

    Commit object data includes the date and time of its creation:
    - revisions based on it must have been created later with regard to a global time
    - unfortunately, because of the lack of clock synchronization, we cannot rely entirely on this data
    - it can, however, be used as a heuristic: as a stop condition, and for the order of traversal
    - Git stops searching after finding 5 (SLOP) revisions that are older than a boundary
  39. Generation number / topological level

    Definition of level in a graph / of generation number:
    The level l_v of a vertex v in a graph G = (V, E) is defined as its depth, that is, the length of the longest path from v to a root:
    - if v has no parents (no outgoing edges), i.e. if v is a root, then l_v = 0
    - otherwise l_v = max{ l_u : (v, u) ∈ E } + 1

    Properties (where u ⇝ v means that v is reachable from u):
    - if u ⇝ v and u ≠ v, then l_u > l_v
    - if u ⇝ v, then l_u ≥ l_v (weaker condition)

    [Figure: example DAG with topological levels]
  40. Generation number / topological level

    Definition of level in a graph / of generation number (in graph-theory terms):
    The level l_v of a vertex v in a graph G = (V, E) is defined as its depth, that is, the length of the longest path from v to a leaf:
    - if v has no outgoing edges, i.e. if v is a leaf, then l_v = 0
    - otherwise l_v = max{ l_u : (v, u) ∈ E } + 1

    Properties (where u ⇝ v means that v is reachable from u):
    - if u ⇝ v and u ≠ v, then l_u > l_v
    - if u ⇝ v, then l_u ≥ l_v (weaker condition)

    [Figure: example DAG with topological levels]
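The recursive definition can be evaluated bottom-up without recursion, which matters for histories that are hundreds of thousands of commits deep. The sketch below assumes a hypothetical dict of commit → parents and assigns level 0 to commits without parents, as in the definition on this slide; the function name is made up.

```python
def topological_levels(parents):
    """Level 0 for commits without parents, else 1 + max level of the parents."""
    levels = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in levels:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in levels]
            if pending:
                stack.extend(pending)   # compute the parents first
            else:
                levels[commit] = 1 + max(
                    (levels[p] for p in parents[commit]), default=-1
                )
                stack.pop()
    return levels


# Made-up history: e merges d and c; the longest path e -> d -> b -> a has 3 edges.
EXAMPLE = {"e": ("d", "c"), "d": ("b",), "c": ("a",), "b": ("a",), "a": ()}
LEVELS = topological_levels(EXAMPLE)
```

For this graph the levels come out as a: 0, b: 1, c: 1, d: 2, e: 3, so every commit has a strictly larger level than anything it can reach.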
  41. Generation numbers (levels) in the commit graph

    [Figure: example graph of revisions]
    Source: Derrick Stolee, Supercharging the Git Commit Graph III: Generations and Graph Algorithms, July 2018. https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
  42. Generation numbers (levels) in the commit graph

    [Figure: generation numbers for this graph]
    Source: Derrick Stolee, Supercharging the Git Commit Graph III: Generations and Graph Algorithms, July 2018. https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/
  43. Historical approaches to adding generation number: 2011 (or earlier)

    Idea 1: adding a new header to the commit object

        tree 49f152498cfde082f3223e20338ca64517991cd7
        parent bdd1cc20929e9f7631dd29ff70426eea53f69443
        author A U Thor <thor@example.com> 1452640851 -0800
        committer C O Mitter <ter@example.us> 1452640851 -0800
        generation 32145

        First batch for post 2.7 cycle

        Signed-off-by: A U Thor <thor@example.com>

    - problems with backward compatibility
    - questions about ensuring correctness
      - repositories with and without the header
      - unknown headers possibly being copied by cherry-pick, revert, etc.

    Idea 2: using git notes as a cache
    - the git notes mechanism allows adding notes to any object: blob, commit, ...
    - notes are split into namespaces, e.g. the default refs/notes/commits
    - the textconv mechanism can be configured to use them as a cache
    - questions about ensuring correctness
    - performance: the notes mechanism is not intended for a very large number of notes, O(commits)
  44. Storing generation numbers in the commit-graph file

    Commit Data chunk (CDAT):
    - it includes a column (field) intended for the commit creation date (committerdate)
    - if the column used a signed 32-bit integer, it would run into the Y2038 problem
    - therefore a 64-bit wide column is used (two 4-byte words):
      - the 30 most significant bits of the first 4 bytes are used for the generation number
      - the 32 bits of the second 4 bytes and the 2 least significant bits of the first 4-byte word are used for committerdate, which together gives 34 bits to store the date and time as a Unix timestamp

    Definition of generation number in the commit-graph file:
    The generation number gen(A) of a revision A of a project is defined in the following way:
    - if A has no parents, i.e. it is a root commit, then gen(A) = 1
    - otherwise its generation number is one more than the maximum generation number among all its parents: gen(A) = max{ gen(P) : P ∈ parents(A) } + 1
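The bit layout described above can be sketched with a pair of helpers. The names `pack_generation_and_date` and `unpack_generation_and_date` are invented for this example, and the two 4-byte words are written in network (big-endian) byte order, which is the byte order used for multi-byte numbers in the commit-graph format.

```python
import struct


def pack_generation_and_date(generation, timestamp):
    """Pack a 30-bit generation number and a 34-bit timestamp into 8 bytes."""
    if not 0 <= generation < (1 << 30):
        raise ValueError("generation number does not fit in 30 bits")
    if not 0 <= timestamp < (1 << 34):
        raise ValueError("timestamp does not fit in 34 bits")
    # First word: generation in the upper 30 bits, timestamp bits 33..32 below.
    first = (generation << 2) | (timestamp >> 32)
    # Second word: the low 32 bits of the timestamp.
    second = timestamp & 0xFFFFFFFF
    return struct.pack(">II", first, second)


def unpack_generation_and_date(data):
    """Inverse of pack_generation_and_date()."""
    first, second = struct.unpack(">II", data)
    return first >> 2, ((first & 0x3) << 32) | second


packed = pack_generation_and_date(32145, 1452640851)
generation, timestamp = unpack_generation_and_date(packed)
```

With 34 bits the timestamp column can represent dates well past 2038, which is the point of borrowing two bits from the first word.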
  46. Corner cases of generation number in Git

    Corner cases:
    - for historical reasons, commits that were present in the commit-graph file before the generation number was added to Git have gen(C) = 0; that is why gen(root commit) = 1
    - newly created commits, not present in the commit-graph, are treated as having gen(C) greater than the maximal representable generation number, 0x3FFFFFFF
    - for two such commits A and B we have gen(A) = gen(B), but nothing is known about their reachability relation

    Properties of gen(C), where A ⇝ B means that B is reachable from A:
    - if A ⇝ B and A ≠ B, then gen(A) > gen(B), except for the corner cases
    - if A ⇝ B, then gen(A) ≥ gen(B) (weaker), including the corner cases
    - equivalently: if gen(A) < gen(B), then A ̸⇝ B, including the corner cases

    Conclusion: gen(C) can be used as a cutoff, even if new revisions are not present in the commit-graph file.
  47. Definition and properties of generation number

    Definition of generation number / level of node:
    The generation number of a revision (commit graph node) is defined in the following way:
    - if a commit has no parents, then its generation number is 1
    - otherwise, its generation number is 1 greater than the maximum generation number of its parents

    Properties of generation numbers / topological levels:
    - if a commit B ≠ A is reachable from A (and both are present in the commit-graph file), then the generation number of B is smaller than the generation number of A: A ⇝ B ⟹ gen(A) > gen(B)
    - therefore, if the generation number of B is greater than or equal to the generation number of A, then B ≠ A is not reachable from A (if both are present in the commit-graph file)
  48. Definition and properties of topological level

    Definition of generation number / level of node:
    The level of a revision (commit graph node) is defined in the following way:
    - if node x has no outgoing edges, then gen(x) = 1
    - otherwise gen(x) = max{ gen(v) : (x, v) ∈ E } + 1, where the maximum is taken over all v with an edge from x to v

    Properties of generation numbers / topological levels:
    - if a node B ≠ A is reachable from A, then the generation number of B is smaller than the generation number of A: A ⇝ B ⟹ gen(A) > gen(B)
    - therefore, if the generation number of B is greater than or equal to the generation number of A, then B ≠ A is not reachable from A
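The cutoff property can be turned into a pruned graph walk: when asking whether B is reachable from A, any commit whose generation number is not greater than gen(B) cannot lead to B and need not be explored. The sketch below is an illustration under simplifying assumptions, not Git's code: every commit is assumed to have a regular generation number (no corner cases), the graph is a hypothetical dict of commit → parents, and the function also reports how many commits were walked.

```python
def generation_numbers(parents):
    """gen = 1 for root commits, else 1 + max over parents.

    Assumes the dict lists every commit after all of its parents.
    """
    gen = {}
    for commit in parents:
        gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=0)
    return gen


def can_reach(parents, gen, a, b):
    """Return (is b reachable from a, number of commits walked)."""
    if a == b:
        return True, 0
    if gen[a] <= gen[b]:
        return False, 0              # cutoff: no walk needed at all
    stack, seen, walked = [a], {a}, 0
    while stack:
        commit = stack.pop()
        walked += 1
        for parent in parents[commit]:
            if parent == b:
                return True, walked
            # A parent with gen <= gen(b) cannot reach b, so skip it.
            if gen[parent] > gen[b] and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False, walked


# Made-up history: a chain c0 <- c1 <- ... <- c9, plus a side commit t on top of c2.
PARENTS = {"c0": ()}
for i in range(1, 10):
    PARENTS[f"c{i}"] = (f"c{i - 1}",)
PARENTS["t"] = ("c2",)
GEN = generation_numbers(PARENTS)

result = can_reach(PARENTS, GEN, "c9", "t")
```

In this example the walk from c9 stops once it reaches commits at the generation of t, instead of continuing all the way down to the root c0.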
  49. Gains from using generation numbers

    Using generation numbers improves the performance of the following commands:
    - git branch / git tag --contains <commit>, which finds all branches / tags from which <commit> is reachable
    - git branch / git tag --merged <commit>, which finds all branches / tags reachable from <commit>
    - git push with the push.followTags config variable set to true
    - git push --force and git push --force-with-lease
    - checking if a given tag points to any transferred commit

    Generation numbers can also be used to speed up (not always true for real-life repositories):
    - computing lowest / closest common ancestors, i.e. merge bases, with git merge-base (or indirectly by git merge and git log <A>...<B>)
    - topological sorting (outputting a revision before its parents) in git log --graph (or directly with --topo-order)
  50. Gains from using generation numbers

    Using generation numbers improves the performance of the following commands:
      git branch / git tag --no-contains <commit>, which finds all branches / tags from which <commit> is unreachable
      git branch / git tag --no-merged <commit>, which finds all branches / tags unreachable from <commit>
      git push with the push.followTags config variable set to true
      git push --force and git push --force-with-lease
      checking if a given tag points to any transferred commit
    Generation numbers can also be used to speed up the following (though this is not always the case for real-life repositories):
      computing lowest / closest common ancestors (merge bases) with git merge-base (or indirectly by git merge and git log <A>...<B>)
      topological sorting (outputting a revision before its parents) in git log --graph (or directly with --topo-order)
  53. Using generation numbers for reachability queries

    Is commit T reachable from commit A?
    [Figure: commit graph laid out by generation number (levels 1 to 9), with commits A, R, T and B marked]
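    A reachability query of this kind can prune the walk with generation numbers. The sketch below is an illustration under assumptions, not Git's code: the graph is a dictionary from commit to parents, `gen` holds precomputed generation numbers, and the names `can_reach`, `PARENTS` and `GEN` are hypothetical.

```python
from typing import Dict, List


def can_reach(src: str, dst: str,
              parents: Dict[str, List[str]], gen: Dict[str, int]) -> bool:
    """Is dst reachable from src? Commits with gen <= gen(dst) are never expanded."""
    if src == dst:
        return True
    if gen[dst] >= gen[src]:
        return False                 # negative-cut: answered without any walk
    seen = {src}
    stack = [src]
    while stack:
        node = stack.pop()
        for p in parents[node]:
            if p == dst:
                return True
            # A commit at or below the level of dst cannot have dst as an ancestor.
            if p not in seen and gen[p] > gen[dst]:
                seen.add(p)
                stack.append(p)
    return False


# Toy history (hypothetical): r <- x1 <- x2 <- a, and a side commit t on top of r.
PARENTS = {"r": [], "t": ["r"], "x1": ["r"], "x2": ["x1"], "a": ["x2"]}
GEN = {"r": 1, "t": 2, "x1": 2, "x2": 3, "a": 4}

print(can_reach("a", "t", PARENTS, GEN))   # the walk stops at x2: x1 is on the level of t
print(can_reach("a", "r", PARENTS, GEN))
```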
  54. Using generation numbers to compute git merge-base --all

    Lowest common ancestors of A and B (reachable from both A and from B) are P and Q.
    [Figure: commit graph laid out by generation number (levels 1 to 9), with commits P, A, R, B and Q marked]
  55. Topological sorting: Kahn's algorithm

    Assumption: we walk the edges according to their direction.
      1. Compute the in-degree of each node; it is the number of its incoming edges.
      2. Walk the graph, selecting a node with an in-degree of zero, and decreasing the in-degree of its parents.
    Q can be a queue, a priority queue, a stack, etc., which gives different topological orders.

      Q ← queue with nodes that have the in-degree of 0
      while Q is not empty do
        remove node n from the beginning of queue Q (of "independent" nodes)
        add n to the end of list L of topologically sorted nodes
        for each node m where there exists an edge e from n to m do
          remove edge e from the graph (which decreases the in-degree of m)
          if there is no incoming edge leading to m then
            add node m to Q
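    The pseudocode above translates almost line by line into Python. This is a minimal sketch, assuming the graph is a dictionary from node to the list of nodes its outgoing edges point at (for commits: the parents); the names `kahn_toposort` and `PARENTS` are illustrative only. A FIFO queue is used here, which is just one of the possible choices for Q.

```python
from collections import deque
from typing import Dict, List


def kahn_toposort(parents: Dict[str, List[str]]) -> List[str]:
    # Step 1: compute the in-degree of every node (needs a walk over the whole graph).
    indegree = {node: 0 for node in parents}
    for node in parents:
        for p in parents[node]:
            indegree[p] += 1

    # Step 2: repeatedly output a node that has no remaining incoming edges.
    queue = deque(node for node in parents if indegree[node] == 0)
    order: List[str] = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in parents[n]:
            indegree[m] -= 1          # "remove" the edge n -> m
            if indegree[m] == 0:
                queue.append(m)
    return order


# Toy history (hypothetical): m merges b and c; b sits on a; a and c sit on root.
PARENTS = {"m": ["b", "c"], "b": ["a"], "c": ["root"], "a": ["root"], "root": []}
ORDER = kahn_toposort(PARENTS)
print(ORDER)
```

    Replacing the deque with a stack or a priority queue changes which valid topological order comes out, but not the structure of the algorithm.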
  56. Advantages and disadvantages of using Kahn's algorithm in Git

    Advantages of Kahn's algorithm:
      one can select the order which keeps commits on a branch together (which is needed for git log --graph)
      easy inclusion of graph traversal limits like git log --topo-order A...B
      it is possible to terminate the second step early, for example after showing the first full page of results; at least in theory
    Disadvantages / limitations (for the unmodified one):
      the whole graph needs to be traversed in the first step to find all independent nodes, with the in-degree of zero:
        1. limit_list()
        2. sort_in_topological_order()
        3. get_revision_1()
      git log --graph etc. may need only the first page of results (output goes to the pager)
  57. Using generation numbers during topological sorting

    INDEGREE(generation number cutoff)
      while priority queue IN-Q is not empty and maximum gen(x) is not lower than the cutoff do
        remove commit C with the highest gen(C) from IN-Q
        add 1 to the in-degree of each of C's parents
        if the in-degree of C is 0, add it to TOPO-Q
    priority queue IN-Q (INDEGREE_QUEUE): ordered with respect to maximum generation number (level)

    ⟹ TOPO
      TOPO-Q ← priority queue, in-degree = 0
      cutoff ← generation number of the first commit in TOPO-Q
      while priority queue TOPO-Q is not empty do
        remove commit C from the start of TOPO-Q
        add C at the end of sorted list L (output it)
        for each parent P of commit C do
          if gen(P) is lower than the cutoff then
            set the cutoff to gen(P)
            walk INDEGREE(cutoff)
          decrement the in-degree of commit P
          if the in-degree of P is equal to 0 then
            insert P into priority queue TOPO-Q
    priority queue TOPO-Q: ordered with respect to the selected output sorting order
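    The two-queue scheme above can be sketched as a Python generator that computes in-degrees lazily, only down to the generation number it currently needs. This is an illustration under assumptions, not the code from revision.c: the graph is a dictionary from commit to parents, `gen` holds precomputed generation numbers, a plain stack stands in for TOPO-Q, and the names `topo_order` and `walked` are made up here.

```python
import heapq
from itertools import islice
from typing import Dict, Iterator, List, Optional


def topo_order(tips: List[str], parents: Dict[str, List[str]], gen: Dict[str, int],
               walked: Optional[List[str]] = None) -> Iterator[str]:
    indegree = {t: 0 for t in tips}              # known only for discovered commits
    in_q = [(-gen[t], t) for t in tips]          # IN-Q: max-heap on generation number
    heapq.heapify(in_q)

    def walk_indegree(cutoff: int) -> None:
        # INDEGREE(cutoff): process queued commits whose generation is >= cutoff.
        while in_q and -in_q[0][0] >= cutoff:
            _, c = heapq.heappop(in_q)
            if walked is not None:
                walked.append(c)                 # bookkeeping for the demo only
            for p in parents[c]:
                if p not in indegree:
                    indegree[p] = 0
                    heapq.heappush(in_q, (-gen[p], p))
                indegree[p] += 1

    cutoff = min(gen[t] for t in tips)
    walk_indegree(cutoff)
    topo_q = [t for t in tips if indegree[t] == 0]   # TOPO-Q, here simply a stack
    while topo_q:
        c = topo_q.pop()
        yield c
        for p in parents[c]:
            if gen[p] < cutoff:                  # in-degree of p may be incomplete
                cutoff = gen[p]
                walk_indegree(cutoff)
            indegree[p] -= 1
            if indegree[p] == 0:
                topo_q.append(p)


# Linear history of 1000 commits (hypothetical): c999 -> c998 -> ... -> c0.
CHAIN = {f"c{i}": ([f"c{i - 1}"] if i else []) for i in range(1000)}
CHAIN_GEN = {f"c{i}": i + 1 for i in range(1000)}
WALKED: List[str] = []
FIRST = list(islice(topo_order(["c999"], CHAIN, CHAIN_GEN, WALKED), 3))
print(FIRST, len(WALKED))
```

    In this run only three commits have had their parents counted by the time the first three results are out, which is the point of the cutoff: the first page of output does not require walking the whole history.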
  59. Using generation numbers during topological sorting with limits

    EXPLORE(generation number cutoff)
      while priority queue EXPLORE-Q is not empty and maximum gen(x) is not lower than the cutoff do
        take into account limits of the A..B type
        add interesting parents to EXPLORE-Q

    ⇓ INDEGREE(generation number cutoff)
      while priority queue IN-Q is not empty and maximum gen(x) is not lower than the cutoff do
        remove commit C with the highest gen(C) from IN-Q
        EXPLORE(gen(C))
        add 1 to the in-degree of each of C's parents
        if the in-degree of C is 0, add it to TOPO-Q
    priority queues EXPLORE-Q and IN-Q: ordered with respect to maximum generation number (level)

    ⟹ TOPO
      TOPO-Q ← priority queue, in-degree = 0
      cutoff ← generation number of the first commit in TOPO-Q
      while priority queue TOPO-Q is not empty do
        remove commit C from the start of TOPO-Q
        add C at the end of sorted list L (output it)
        for each parent P of commit C do
          if gen(P) is lower than the cutoff then
            set the cutoff to gen(P)
            walk INDEGREE(cutoff)
          decrement the in-degree of commit P
          if the in-degree of P is equal to 0 then
            insert P into priority queue TOPO-Q
    priority queue TOPO-Q: ordered with respect to the selected output sorting order
  60. Improving topological sorting performance with generation numbers

    Linux kernel (2018)

    Test: git rev-list --topo-order -100 HEAD

      setup                                      time [s]   change
      without commit-graph                       6.80
      with commit-graph (old algorithm)          0.77       -88.7%
      with commit-graph and generation number    0.02       -99.7%

    Test: git rev-list --topo-order -100 HEAD -- tools

      setup                                      time [s]   change
      without commit-graph                       9.63
      with commit-graph (old algorithm)          6.06       -37.1%
      with commit-graph and generation number    0.06       -99.4%

    Taken from the commit message of "revision.c: generation-based topo-order algorithm".
  61. Generation number: References

    Derrick Stolee (Microsoft), Supercharging the Git Commit Graph III: Generations and Graph Algorithms, Azure DevOps Blog, 9 July 2018
      https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations
    Developer Homepage of Derrick Stolee
      https://stolee.dev/
    John Briggs (Microsoft), Technical contributions towards scaling for Windows, Git Merge 2019 (recording on YouTube)
    Documentation/technical/commit-graph.txt, Git Commit Graph Design Notes
  63. Trouble with using topological level as generation number

    The generation number (topological level) serves as a reachability index:
      gen(A) ≤ gen(B) ⟹ A ⇝̸ B (for A ≠ B)
      elimination of unreachable revisions (negative-cut filter)
    Before it, the committer date was used as a heuristic for this purpose:
      it can be incorrect due to, for example, clock desynchronization
      used as a cutoff threshold with slop (5 revisions in a row)
    In order to speed up calculations, in most of the new algorithms that make use of the generation number, commit objects are sorted according to this value (in a priority queue), optionally using the committer date (commit creation date) to resolve ties.
    It turned out that in some cases we can get worse performance when using generation numbers, as compared to the committer date heuristic:
      the algorithm using the generation number always returns the correct result
      the generation number used as a sort key selects the longest paths first (?)
  64. Examples of decreased performance (using number of visited commits)

    Test: git merge-base A B

      repository     A              B              date       generation
      Android base   53c1972bc8f    92f18ac3e39    81 999     109 025
      Linux          69973b830859   c470abd4fde4   44 984     47 457
      Linux          c8d2bc9bc39e   69973b830859   167 468    635 579
      TypeScript     35ea2bea76     123edced90     3464       3439

    https://github.com/derrickstolee/gen-test

    Partial solution:
      sort by committer date only when there is no generation number cutoff provided
      accept the possibility of performance regression for some rare history topologies (more of a problem for git merge-base than for git log --topo-order A..B)
    Alternative solution: using a generation number other than the topological level
  65. Alternative sorting orders / generation numbers

    V0: (minimal) generation number / topological level
      gen(C) is 1 greater than the maximum of gen(P) of the parents
      gen(C) for a commit with no parents is 1
      computable locally and incrementally, immutable
    V1: (epoch, commit creation date) pair
      epoch is not smaller than the maximum of the epochs of the parents, increased by 1 if a parent has a later date than the current commit
      computable locally, immutable, compatible with V0
    V2: maximal generation number / reverse topological order (almost)
      gen(C) for a commit without children is set to the number of commits in the graph
      otherwise gen(X) is 1 less than the minimum among its children
      not computable incrementally, compatible with V0
    V3: corrected commit creation date
      gen(C) is the maximum of the committer date of C and the corrected dates of its parents (+1)
      computable locally and incrementally, immutable, incompatible with V0
    Best performance: V2, V3. If incremental computation is more important: V3.
    Version number field in the commit-graph format.
    https://github.com/derrickstolee/gen-test
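    The V3 definition is easy to check by hand on a history with clock skew. The following is a minimal sketch, assuming commits are stored as a dictionary from commit to parents and committer dates are plain integers; `corrected_commit_dates`, `PARENTS` and `DATE` are names invented for this example.

```python
from typing import Dict, List


def corrected_commit_dates(parents: Dict[str, List[str]],
                           date: Dict[str, int]) -> Dict[str, int]:
    """corrected(C) = max(date(C), max(corrected(P) for parents P) + 1)."""
    corrected: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            node = stack[-1]
            pending = [p for p in parents[node] if p not in corrected]
            if pending:
                stack.extend(pending)          # parents are needed first
            else:
                stack.pop()
                candidates = [date[node]] + [corrected[p] + 1 for p in parents[node]]
                corrected[node] = max(candidates)
    return corrected


# Linear toy history (hypothetical): root <- a <- b <- c, where b was committed
# on a machine with a skewed clock, so its date is earlier than its parent's.
PARENTS = {"root": [], "a": ["root"], "b": ["a"], "c": ["b"]}
DATE = {"root": 100, "a": 200, "b": 150, "c": 300}
CORRECTED = corrected_commit_dates(PARENTS, DATE)
print(CORRECTED)
```

    Only the skewed commit gets a value different from its own date, and the corrected values strictly increase from parent to child, which is what makes them usable as a reachability index.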
  66. Incremental update of the commit-graph file

    Rewriting the commit-graph file to add information about a new commit is time-consuming:
      done during garbage collection
    A low-cost automatic update would be preferred:
      for example, updating during git fetch
    Solution: a chain of commit-graph files
      the lowest layer is self-sufficient (closed)
      higher layers can reference commits in lower layers
    Set limits (conditions):
      each higher layer must be at least X times smaller than the layer below it
      maximum layer size (except for the lowest one, the base)
      merging layers if needed to fulfill the above conditions
    Good amortized time is assured, taking into account the time to merge layers.
    [Figure: three-layer commit-graph]
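    The layer conditions can be illustrated with a small simulation. This is only a sketch of one possible merge policy, with made-up parameter names and default values (`size_multiple` playing the role of X, `max_commits` the maximum size of a non-base layer); it tracks commit counts per layer and does not model the actual file format.

```python
from typing import List


def add_layer(layers: List[int], new_commits: int,
              size_multiple: int = 2, max_commits: int = 64000) -> List[int]:
    """layers: commit counts from the base layer (index 0) up to the newest layer.

    Puts a new layer on top, then merges the top layer into the one below it
    for as long as one of the (assumed) conditions is violated.
    """
    result = layers + [new_commits]
    while len(result) > 1 and (
        result[-2] <= size_multiple * result[-1]   # layer below is not big enough
        or result[-1] > max_commits                # non-base layer grew too large
    ):
        top = result.pop()
        result[-1] += top                          # merge the two topmost layers
    return result


print(add_layer([100000], 10))            # small update: a cheap new layer
print(add_layer([100000, 10], 8))         # top two layers get merged
print(add_layer([100000, 50000], 30000))  # merging cascades down to the base
```

    Small updates stay cheap, while the occasional cascade of merges is what the amortized-time remark on the slide refers to.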
  69. The chain of commit-graph files

    commit-graph chain file format (CDAT chunk)
    [Figure: three-layer commit-graph]
    https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/
  70. Problems with changing the definition of generation number

    Incremental update of commit-graph files requires that the new gen(C) be locally updateable, which, in addition to the performance requirements, means the corrected commit creation date.
    The commit-graph format includes a version number, but when creating the incremental update code it turned out that Git stops operation (hard fail) if the commit-graph version is newer than supported, instead of not using the commit-graph.
    Solution: a variant of the corrected commit date
      the column stores the corrected commit date offset
      its value is chosen to be at least 1 more than the maximal offset of the parents of the commit
      but also in such a way that date plus offset is strictly monotonic (strictly increasing, >)
      gives incremental updates, immutability and backward compatibility
      however, it is not implemented yet...
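    A small worked computation shows how both conditions on the offset can hold at once. This is a sketch under assumptions rather than a description of any implemented format: it picks the smallest offset that satisfies both conditions, gives a parentless commit the offset 1, and uses invented names (`corrected_date_offsets`, `PARENTS`, `DATE`).

```python
from typing import Dict, List


def corrected_date_offsets(parents: Dict[str, List[str]],
                           date: Dict[str, int]) -> Dict[str, int]:
    """Smallest offsets with offset(C) > offset(P) and date(C)+offset(C) > date(P)+offset(P)."""
    offset: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            node = stack[-1]
            pending = [p for p in parents[node] if p not in offset]
            if pending:
                stack.extend(pending)
            else:
                stack.pop()
                # Condition 1: at least 1 more than the maximal offset of the parents.
                min_offset = 1 + max((offset[p] for p in parents[node]), default=0)
                # Condition 2: date + offset strictly greater than for every parent.
                min_value = 1 + max((date[p] + offset[p] for p in parents[node]),
                                    default=date[node])
                offset[node] = max(min_offset, min_value - date[node])
    return offset


# Same toy history with clock skew as before (hypothetical): root <- a <- b <- c.
PARENTS = {"root": [], "a": ["root"], "b": ["a"], "c": ["b"]}
DATE = {"root": 100, "a": 200, "b": 150, "c": 300}
OFFSET = corrected_date_offsets(PARENTS, DATE)
print(OFFSET)
```

    The offsets alone increase along every edge, so a reader that treats the column as a generation number still gets a valid negative-cut filter, while a newer reader can use date plus offset.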
  74. New generation number: References

    Incremental update of commit-graph files:
      Derrick Stolee (Microsoft), Updates to the Git Commit Graph Feature, Azure DevOps Blog, 11 Nov 2019
        https://devblogs.microsoft.com/devops/updates-to-the-git-commit-graph-feature/
      Christian Couder, Jakub Narębski, Markus Jansen, Gabriel Alcaras, et al., Git Rev News: Edition 52 (June 28th, 2019), Reviews section: "[PATCH 00/17] [RFC] Commit-graph: Write incremental files"
        https://git.github.io/rev_news/2019/06/28/edition-52/
    The need for a new generation number and its choice:
      Christian Couder, Jakub Narębski, Markus Jansen, Gabriel Alcaras, et al., Git Rev News: Edition 45 (November 21st, 2018), Support and Reviews sections: "commit-graph is cool" and "[RFC] Generation Number v2"
        https://git.github.io/rev_news/2018/11/21/edition-45/
  75. Other graphs and other uses of reachability queries

    Other types of graph data: social networks, WWW/Internet, XML documents, biological/chemical networks, RDF ontologies.
    Categories of graphs:
      large: |V| on the order of 100 000
      sparse: |E|/|V| on the order of 2
    Reachability relations: graph of strongly connected components ⟹ reachability in a DAG
    Practical use:
      social networks: influence flow
      citations: impact of an article
      internet: link structure analysis
      security: finding possible connections between suspects
      biological data: is a given protein related, directly or indirectly, to a given gene expression?
      chemical reactions: can you get a given compound starting from a given substance?
    Reachability queries in general:
      a classical graph theory problem
      a primitive operation used in other algorithms (like pattern matching)
  76. Graph of Strongly Connected Components (SCC)

    [Figure: graph of strongly connected components]
  77. Division of algorithms for solving the reachability problem

    Extreme approaches:
      Computing the transitive closure of the graph upfront
        build time: O(|V|·|E|)
        constant time queries: O(1)
        quadratic memory use: O(|V|²)
      Online graph search: BFS, bidirectional BFS, DFS
        no build time: O(1)
        query answering time: O(|V|+|E|)
        no additional memory needed: O(1)
    Algorithm types:
      Label-Only
        answers queries using labels only
        non-linear or unbounded index size
      Label+G (label + graph)
        requires [augmented] graph search if the labeling could not answer the reachability query by itself
        linear and bounded index build time and index size
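    The two extremes can be put side by side in a few lines. This is an illustrative sketch on a toy graph, with hypothetical names (`transitive_closure`, `bfs_reach`, `PARENTS`); the closure is built recursively for brevity, which would not be suitable for deep real-world histories.

```python
from collections import deque
from typing import Dict, List, Set


def transitive_closure(parents: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Precompute, for every node, the full set of nodes reachable from it."""
    closure: Dict[str, Set[str]] = {}

    def visit(node: str) -> Set[str]:
        if node not in closure:
            reach: Set[str] = set()
            for p in parents[node]:
                reach.add(p)
                reach |= visit(p)
            closure[node] = reach
        return closure[node]

    for node in parents:
        visit(node)
    return closure


def bfs_reach(src: str, dst: str, parents: Dict[str, List[str]]) -> bool:
    """Online search: no index at all, every query walks the graph."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for p in parents[node]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return False


PARENTS = {"root": [], "a": ["root"], "b": ["a"], "c": ["root"], "m": ["b", "c"]}
CLOSURE = transitive_closure(PARENTS)
print("c" in CLOSURE["m"], bfs_reach("m", "c", PARENTS))   # set lookup vs graph walk
```

    Label+G methods sit between these two: a small per-node label answers many queries outright, and the walk is used only for the rest.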
  79. Types of labels in augmented online search algorithms

    Negative-cut filter (eliminating unreachable nodes):
      if u ⇝ v and u ≠ v, then the labels for u and v fulfill the condition e(u) ⪯ e(v)
      therefore, if this condition is not met, then u cannot reach v: e(u) ⋠ e(v) ⟹ u ⇝̸ v
      the reverse is not always true: false positives
      examples: (minimal) generation number, also known as topological level
    Positive-cut filter (finding reachable nodes):
      if for the labels of u and v we have e′(u) ⪯ e′(v), then u can reach v: e′(u) ⪯ e′(v) ⟹ u ⇝ v
      node v can be reachable from u even if the condition for labels is not met: false negatives
      examples: min-post interval labeling for the spanning tree (see next slides)
    Reachability algorithms like FELINE or BFL often use many different labels.
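    How the two kinds of filters cooperate inside a Label+G search can be sketched as follows. This is an illustration only: generation numbers serve as the negative-cut label and spanning-tree intervals as the positive-cut label, the label values for the toy graph are written out by hand, and the names `contains`, `reachable`, `PARENTS`, `GEN` and `INTERVAL` are assumptions of this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def contains(outer: Interval, inner: Interval) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def reachable(u: str, v: str, parents: Dict[str, List[str]],
              gen: Dict[str, int], interval: Dict[str, Interval]) -> bool:
    seen = {u}
    stack = [u]
    while stack:
        node = stack.pop()
        # Positive-cut: v lies in the spanning-tree subtree of node (covers node == v).
        if contains(interval[node], interval[v]):
            return True
        for p in parents[node]:
            if p in seen:
                continue
            # Negative-cut: a different node at or below the level of v cannot reach v.
            if p != v and gen[p] <= gen[v]:
                continue
            seen.add(p)
            stack.append(p)
    return False


# Toy graph (hypothetical): m merges b and c; b sits on a; a and c sit on root.
PARENTS = {"root": [], "a": ["root"], "b": ["a"], "c": ["root"], "m": ["b", "c"]}
GEN = {"root": 1, "a": 2, "b": 3, "c": 2, "m": 4}
# Intervals for the spanning tree m -> b -> a -> root, m -> c (edge c -> root is not in the tree).
INTERVAL = {"root": (1, 1), "a": (1, 2), "b": (1, 3), "c": (4, 4), "m": (1, 5)}

print(reachable("m", "a", PARENTS, GEN, INTERVAL))     # answered by the positive-cut alone
print(reachable("b", "c", PARENTS, GEN, INTERVAL))     # negative-cut prunes the walk
print(reachable("c", "root", PARENTS, GEN, INTERVAL))  # needs a real edge traversal
```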
  80. Positive-cut: spanning tree

    Spanning tree / spanning forest: given a directed acyclic graph, a spanning forest (tree) is a subgraph of it for which the following is true:
      it includes all nodes of the original (full) graph
      there is at most one incoming edge per node
    Properties:
      if there exists a path from u to v in the spanning tree, then u ⇝ v in the full graph
      but the path from u to v could require going through edges outside the spanning tree
    [Figure: example graph with its spanning tree; a ⇝ h in the graph and in the tree; b ⇝ h in the graph, but not in the tree]
  81. Positive-cut: min-post interval labels in the spanning tree

    Min-post interval labels: for each vertex (node) u in the graph we define the min-post interval L_u = [s_u, e_u] in the following way:
      e_u is defined as e_u = post(u), the post-order value (assigned on the way back in the traversal)
      s_u = e_u for leaf nodes (no outgoing edges), otherwise s_u = min{ s_x : x ∈ children(u) }
    Properties:
      a path from u to v in the spanning tree exists if and only if L_v ⊆ L_u
      if L_v ⊆ L_u, then u ⇝ v (in the full graph)
    The same condition is true for the similar post-max intervals.
    [Figure: example graph with interval labels; [3,3] ⊆ [1,5], so a ⇝ h; [3,3] ⊈ [7,9], but b ⇝ h]
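    The labels defined above can be computed with a single depth-first search. The sketch below is a minimal illustration under assumptions: the spanning forest is whatever the DFS happens to discover first (children are taken among tree edges only), the graph is a dictionary from node to its outgoing neighbours, and `min_post_labels`, `tree_reaches`, `PARENTS` and `LABELS` are names made up for this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def min_post_labels(tips: List[str], parents: Dict[str, List[str]]) -> Dict[str, Interval]:
    labels: Dict[str, Interval] = {}
    visited = set()
    counter = 0                                   # post-order counter
    for tip in tips:
        if tip in visited:
            continue
        visited.add(tip)
        # Stack entries: (node, iterator over out-edges, s-values of tree children).
        stack = [(tip, iter(parents[tip]), [])]
        while stack:
            node, edges, child_starts = stack[-1]
            descended = False
            for p in edges:
                if p not in visited:              # first discovery makes a tree edge
                    visited.add(p)
                    stack.append((p, iter(parents[p]), []))
                    descended = True
                    break
            if not descended:                     # all out-edges handled: leave the node
                stack.pop()
                counter += 1
                start = min(child_starts, default=counter)
                labels[node] = (start, counter)   # L_u = [s_u, e_u], e_u = post(u)
                if stack:
                    stack[-1][2].append(start)
    return labels


def tree_reaches(u: str, v: str, labels: Dict[str, Interval]) -> bool:
    """Positive-cut test: True means u certainly reaches v; False decides nothing."""
    (su, eu), (sv, ev) = labels[u], labels[v]
    return su <= sv and ev <= eu


PARENTS = {"m": ["b", "c"], "b": ["a"], "a": ["root"], "c": ["root"], "root": []}
LABELS = min_post_labels(["m"], PARENTS)
print(LABELS)
print(tree_reaches("m", "root", LABELS), tree_reaches("c", "root", LABELS))
```

    In this toy graph the edge from c to root is not part of the spanning tree, so the second test is a false negative of the filter: c does reach root, but only a graph walk can show it.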
  82. Commit graph with the spanning forest

    Nodes in the graph are labeled with the post(v) value: the post-visit order in depth-first search (DFS).
    [Figure: commit graph laid out by topological level (1 to 9), with nodes labeled by post(v) from 1 to 23 and the spanning forest marked]
  83. Commit graph with the spanning forest and min-post intervals

    This is the same graph as on the previous slide, just drawn differently (post-visit order vs level).
    [Figure: commit graph with axes "topological level l_v" (1 to 9) and "post-visit order post(v)" (1 to 23); one min-post interval is highlighted, L_u = [1, 18]]
  84. Linux kernel repository commit graph

    [Figure: commit graph of the Linux kernel repository]
  85. Interval labeling for reachability queries: References (1/2)

    Hilmi Yildirim, Vineet Chaoji, Mohammed J. Zaki, GRAIL: scalable reachability index for large graphs, Proceedings of the VLDB Endowment 3(1):276-284 (2010)
      https://www.researchgate.net/publication/220538786_GRAIL_Scalable_reachability_index_for_l
    Florian Merz, Peter Sanders, PReaCH: A Fast Lightweight Reachability Index using Pruning and Contraction Hierarchies (2014), section 3.3 "Pruning Based on DFS Numbering"
      https://arxiv.org/abs/1404.4465
    Renê R. Veloso, Loïc Cerf, Wagner Meira Jr, Mohammed J. Zaki, Reachability Queries in Very Large Graphs: A Fast Refined Online Search Approach, Proc. 17th International Conference on Extending Database Technology (EDBT), March 24-28, Athens, Greece (2014), section 3.4.1 "Positive-Cut Filter" in 3.4 "Optimizations"
      http://openproceedings.org/EDBT/2014/paper_166.pdf
  86. Interval labeling for reachability queries: References (2/2)

    Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, Gerhard Weikum, FERRARI: Flexible and Efficient Reachability Range Assignment for Graph Indexing, Proceedings of the 29th IEEE International Conference on Data Engineering (ICDE'13), Brisbane, Australia, IEEE (2013)
      http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.365.2894&rep=rep1&type=pdf
      https://github.com/steps/Ferrari
    Stephan Seufert, Avishek Anand, Srikanta J. Bedathur, Gerhard Weikum, High-Performance Reachability Query Processing under Index Size Restrictions (2012)
      https://arxiv.org/abs/1211.3375
    Jakub Narębski, Reachability labels for version control graphs, Google Colaboratory Jupyter Notebook
  87. Problems to solve

    How to add additional labels to use in graph algorithms in Git?
      there can be only one order in the priority queue (which would be the new generation number)
    Would adding a positive-cut filter improve the performance of graph operations, and if so, which ones?
      returning a true/false reachability query result
      selecting and returning a subset
      finding a node (commit) in a graph
      finding a path / subgraph
      topological sorting
    Which labels can be computed incrementally?
      the graph of revisions (commit graph) has specific properties and a specific way of growing (dynamics)
    Online search algorithms:
      Tree+SSPI (2005)
      GRIPP (2007)
      GRAIL (2010) (Graph Reachability indexing via rAndomized Interval Labeling)
      FERRARI (2013) (Flexible and Efficient Reachability Range Assignment for gRaph Indexing)
      FELINE (2014) (Fast rEfined onLINE search)
      IP (2014) (Independent Permutations labeling) and BFL (2016) (Bloom Filter Labeling)
  89. Incremental update of min-post interval labels

    Starting point: the commit graph at some point in the past, with 3 branch tips.
    [Figure: commit graph by level (1 to 9) with post(v) labels 1 to 17; intervals and adjustments of the three tips: [1, 10] + 0, [11, 14] + 0, [15, 17] + 0]
  90. Incremental update of min-post interval labels

    Beginning of an incremental update of labels, starting from one of the new branch tips.
    [Figure: the same commit graph with post(v) labels 1 to 17; intervals and adjustments: [1, 10] + 0, [11, 14] + 0, [15, 17] + 0]
  91. Incremental update of min-post interval labels

    Adjusting the values of labels: adding a constant value for subtrees (starting from the old branch tips).
    [Figure: commit graph with post(v) labels now running up to 19; intervals and adjustments: [1, 10] + 0, [11, 14] + 5, [15, 17] − 4]
  92. Incremental update of min-post interval labels

    Continuation of the incremental update of labels, walking from the second of the new branch tips.
    [Figure: commit graph with post(v) labels up to 19; intervals and adjustments: [1, 10] + 0, [11, 14] + 5, [15, 17] − 4]
  93. Incremental update of min-post interval labels

    Final step: updated labels, obtained by walking only the new commits, which gives O(changes) update time.
    [Figure: commit graph with post(v) labels up to 23; intervals and adjustments: [1, 10] + 0, [11, 14] + 5, [15, 17] − 4]
  94. Incremental update of min-post interval labels

    This results in a different spanning forest and different labels than those computed from scratch.
    [Figure: the commit graph with post(v) labels 1 to 23 as computed from scratch]
  95. Advantages and disadvantages of incremental update of interval labels

    [Figures: the starting graph (adjustments [1, 10] + 0, [11, 14] + 0, [15, 17] + 0), the incrementally updated graph (adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4), and the graph labeled from scratch]
    Advantages:
      computing the update by walking O(changes) commits
      updating post(v) labels is not more costly than updating graph positions (lexicographical order in the updated graph)
    Disadvantages:
      possibly suboptimal reachability labeling
      the spanning forest and interval labels depend on when the commit-graph file was updated
    Question: is the obtained result of an incremental update good enough (for improving performance)?
  96. Incremental commit-graph format and interval labels

    [Figure: commit graph split into a layer = 0 commit-graph file and a layer = 1 commit-graph file, with post(v) labels as stored and the adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4]
    Each layer in the commit-graph chain includes corrections (adjustments) to the post(v) labels for the previous layer in the chain (original values shown).
  97. Incremental commit-graph format and interval labels

    [Figures: top, the commit graph split into a layer = 0 commit-graph file and a layer = 1 commit-graph file, with labels as stored and the adjustments [1, 10] + 0, [11, 14] + 5, [15, 17] − 4; bottom, the same graph with the resulting labels]
    Possible solution: for each layer in the commit-graph chain, store (in relevant chunks):
      min-post interval labels
      the list of tips (heads) in the graph, possibly also the list of their intervals
      for each layer in the chain, except for the base layer, adjustments for the previous layer (only needed for tips)
    The top figure shows the data as stored in the commit-graph file chain; the bottom figure shows the effective post(v) labels, as visible from the top layer.
  98. Example of large graph: CiteseerX citation network

      6 540 401     nodes
      15 011 260    edges
      2.295         average degree
      567 149       roots
      5 740 710     leaves
      4.07 × 10⁻⁴   R-ratio (connected nodes probability)
      59            max. path length, that is max. level