Git: the NoSQL Database

Check out the video for this talk: https://vimeo.com/56645405

We all know that Git is amazing for storing code. It is fast, reliable, flexible, and it keeps our project history nuzzled safely in its object database while we sleep soundly at night.

But what about storing more than code? Why not data? Much flexibility is gained by ditching traditional databases, but at what cost?

Brandon Keepers

April 21, 2012

Transcript

  1. 10.
  2. 13.
  3. 16.

    the naïve way

    $ git init mydb && cd mydb
    Initialized empty Git repository in mydb/.git/
  4. 20.

    the naïve way

    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
        > 1.json
    $ git add 1.json
    $ git commit -m 'adding 1.json'
    [master (root-commit) f0e15a1] adding 1.json
     1 file changed, 1 insertion(+)
     create mode 100644 1.json
  5. 24.

    .git/objects/

    commit c67d5118    tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    tree 0662dca7      blob: be1b57ea index.html, tree: 2d21ba18 css
    tree 2d21ba18      blob: be1b57ea app.css, blob: 049fd918 reset.css
    blob 7041879e      # Git The stupid content tracker.
    blob be1b57ea      <html> <head> <title>Git</title> …
    reference master   c67d5118
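
The IDs in the diagram are content addresses. As a minimal sketch, assuming Git's default SHA-1 object format, a blob's ID can be reproduced in plain Ruby by hashing a short header followed by the file content. The helper name `git_blob_id` is made up for this example, and the abbreviated IDs on the slide are presumably illustrative, so they need not match what this prints.

```ruby
require 'digest/sha1'

# Hypothetical helper: Git names a blob by hashing
#   "blob <size in bytes>\0<content>"
def git_blob_id(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0#{content}")
end

readme = "# Git\nThe stupid content tracker.\n"
puts git_blob_id(readme)
```

Identical content always yields the same ID, which is why unchanged files cost nothing extra when a new commit is written.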
  6. 25.

    .git/objects/

    tree 0662dca7      blob: be1b57ea index.html, tree: 2d21ba18 css
    blob 7041879e      # Git The stupid content tracker.
    blob be1b57ea      <html> <head> <title>Git</title> …
  7. 26.

    .git/objects/

    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    tree 0662dca7      blob: be1b57ea index.html, tree: 2d21ba18 css
    blob 7041879e      # Git The stupid content tracker.
    blob be1b57ea      <html> <head> <title>Git</title> …
  8. 27.

    .git/objects/

    commit c67d5118    tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    reference master   c67d5118
  9. 28.

    .git/objects/

    commit c67d5118    tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    reference master   c67d5118
  10. 29.

    .git/objects/

    commit c67d5118    tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    tree 0662dca7      blob: be1b57ea index.html, tree: 2d21ba18 css
    tree 2d21ba18      blob: be1b57ea app.css, blob: 049fd918 reset.css
    blob 7041879e      # Git The stupid content tracker.
    blob be1b57ea      <html> <head> <title>Git</title> …
    reference master   c67d5118
  11. 30.

    commit c67d5118    tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    commit c816ef7e    tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0      blob: bc912988 README.md, tree: 0662dca7 public
    tree 0662dca7      blob: be1b57ea index.html, tree: 2d21ba18 css
    tree 2d21ba18      blob: be1b57ea app.css, blob: 049fd918 reset.css
    blob be1b57ea      <html> <head> <title>Git</title> …
    blob 7041879e      # Git The stupid content tracker.
    blob bc912988      # Git The dumb content tracker.
    reference master   c816ef7e
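
The second commit in the diagram adds only a new README blob, a new root tree, and the commit itself, while the `public` tree is reused. The toy store below sketches that sharing. It is not Git's real object encoding; `ToyStore` and its methods are hypothetical names, and the only thing it borrows from Git is the idea that an object's ID is a hash of its content.

```ruby
require 'digest/sha1'

# Toy content-addressed store (NOT Git's on-disk format).
class ToyStore
  attr_reader :objects

  def initialize
    @objects = {}
  end

  # Store file content and return its ID.
  def write_blob(content)
    put("blob\0#{content}")
  end

  # Store a directory given { name => String (file) or Hash (subdirectory) }.
  def write_tree(entries)
    lines = entries.sort_by { |name, _| name }.map do |name, value|
      if value.is_a?(Hash)
        "tree #{write_tree(value)} #{name}"
      else
        "blob #{write_blob(value)} #{name}"
      end
    end
    put("tree\0#{lines.join("\n")}")
  end

  private

  # Identical data hashes to the same key, so it is stored only once.
  def put(data)
    id = Digest::SHA1.hexdigest(data)
    @objects[id] = data
    id
  end
end

store = ToyStore.new
public_dir = { 'index.html' => '<html></html>', 'css' => { 'app.css' => 'body {}' } }

v1 = store.write_tree('README.md' => "# Git\nThe stupid content tracker.\n", 'public' => public_dir)
count_after_v1 = store.objects.size
v2 = store.write_tree('README.md' => "# Git\nThe dumb content tracker.\n", 'public' => public_dir)
new_objects = store.objects.size - count_after_v1

puts "root tree changed: #{v1 != v2}"
puts "objects added by the second version: #{new_objects}"
```

Only the changed blob and the root tree are new in the second version; everything under `public` hashes to IDs the store already holds.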
  12. 31.

    tree 1002d7b0      blob: bc912988 README.md, tree: 0662dca7 public
    blob 7041879e      # Git The stupid content tracker.
    blob bc912988      # Git The dumb content tracker.
  13. 32.

    commit c816ef7e    tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0      blob: bc912988 README.md, tree: 0662dca7 public
    blob 7041879e      # Git The stupid content tracker.
    blob bc912988      # Git The dumb content tracker.
  14. 33.

    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    tree 1002d7b0      blob: bc912988 README.md, tree: 0662dca7 public
    tree 0662dca7      blob: be1b57ea index.html, tree: 2d21ba18 css
    blob be1b57ea      <html> <head> <title>Git</title> …
    blob 7041879e      # Git The stupid content tracker.
  15. 34.

    commit c816ef7e    tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0      blob: bc912988 README.md, tree: 0662dca7 public
    reference master   c816ef7e
  16. 35.

    commit c67d5118    tree: be1b57ea, parent: nil, author: Brandon
    tree be1b57ea      blob: 7041879e README.md
    commit c816ef7e    tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0      blob: bc912988 README.md, tree: 0662dca7 public
    reference master   c816ef7e
  17. 36.

    commit c67d5118    tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea      blob: 7041879e README.md, tree: 0662dca7 public
    commit c816ef7e    tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0      blob: bc912988 README.md, tree: 0662dca7 public
    tree 0662dca7      blob: be1b57ea index.html, tree: 2d21ba18 css
    tree 2d21ba18      blob: be1b57ea app.css, blob: 049fd918 reset.css
    blob be1b57ea      <html> <head> <title>Git</title> …
    blob 7041879e      # Git The stupid content tracker.
    blob bc912988      # Git The dumb content tracker.
    reference master   c816ef7e
  18. 40.
  19. 42.

    writing

    # Get a new index that we can modify
    index = repo.index

    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
  20. 43.

    writing

    # Get a new index that we can modify
    index = repo.index

    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Make our changes
    index.add('1.json', '{"name": "Brandon"}')
    index.commit('Add user 1', :parents => [head.commit], :head => 'master')
  21. 47.

    Toystore + adapter-git

    class Issue
      include Toy::Store
      adapter :git, Grit::Repo.new(GIT_ROOT)

      attribute :description, String
      attribute :state, String, :default => 'open'
    end
  22. 48.

    Toystore + adapter-git

    Issue.create(:description => 'Store in Git')
    issue = Issue.get(id)
    issue.update_attributes(:state => 'in_progress')
    issue.destroy
  23. 49.
  24. 51.
  25. 57.

    class User
      # …
      # attribute :first_name, String
      # attribute :last_name, String
      attribute :name, String

      def name
        super || "#{self[:first_name]} #{self[:last_name]}"
      end
    end
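
The slide appears to show a schema change handled at read time: old records that still carry `first_name` and `last_name` are read through a `name` accessor that falls back to them, so nothing stored has to be rewritten. The sketch below reproduces that idea in plain Ruby without Toystore; the hash-backed `User` class and its `[]` accessor are stand-ins assumed for illustration, not the library's API.

```ruby
# Plain-Ruby stand-in for the model on the slide (Toystore is not used here).
class User
  def initialize(attributes = {})
    @attributes = attributes
  end

  # Read a raw stored attribute by name.
  def [](key)
    @attributes[key.to_s]
  end

  # Prefer the new "name" attribute; fall back to the old pair of fields
  # for records written before the schema changed.
  def name
    self[:name] || "#{self[:first_name]} #{self[:last_name]}"
  end
end

old_record = User.new('first_name' => 'Brandon', 'last_name' => 'Keepers')
new_record = User.new('name' => 'Brandon Keepers')

puts old_record.name
puts new_record.name
```

Both records answer `name` the same way, so old and new document shapes can coexist in the repository.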
  26. 59.

    long-lived transactions

    $ git checkout -b transaction
    …
    $ git checkout master
    $ git merge transaction
  27. 61.

    add replica

    $ git remote add replica1 git@replica1.local:app.git
    $ cat .git/hooks/post-commit
    #!/bin/sh
    git push replica1
  28. 69.

    concurrency

    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Nobody changed anything, right?
    index.commit('...', :parents => [head.commit], :head => 'master')
  29. 70.

    concurrency

    Lockfile.new('refs/heads/master.lock').lock do
      index = repo.index
      head = repo.get_head('master')
      index.current_tree = head.commit.tree
      index.commit('...', :parents => [head.commit], :head => 'master')
    end
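
The locking idea on the slide can be sketched with the Ruby standard library alone. This is an assumption-laden stand-in, not the `Lockfile` gem used in the talk: the hypothetical `with_lock_file` helper creates the lock file exclusively and fails immediately if another writer holds it, whereas a real lock library would typically wait and retry.

```ruby
require 'tmpdir'

# Hypothetical helper: run the block while holding an exclusive lock file.
# File::EXCL makes the open fail with Errno::EEXIST if the file already
# exists, so only one writer can create the lock at a time.
def with_lock_file(path)
  lock = File.open(path, File::WRONLY | File::CREAT | File::EXCL)
  begin
    yield
  ensure
    lock.close
    File.delete(path)
  end
end

second_writer_blocked = false
lock_released = false

Dir.mktmpdir do |dir|
  lock_path = File.join(dir, 'master.lock')

  with_lock_file(lock_path) do
    # Read head, build the tree, and commit here.
    begin
      # A second writer arriving now is rejected instead of clobbering us.
      with_lock_file(lock_path) { raise 'should not get here' }
    rescue Errno::EEXIST
      second_writer_blocked = true
    end
  end

  lock_released = !File.exist?(lock_path)
end

puts "second writer blocked: #{second_writer_blocked}"
puts "lock released afterwards: #{lock_released}"
```

Holding the lock across the read of `head` and the commit closes the window in which another process could move `master`.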
  30. 71.

    merge conflicts

    $ git merge branch
    Auto-merging 0/0/411460f7c92d2124a67ea0f4cb5f85
    CONFLICT (content): Merge conflict in 0/0/411460f7c92d2124a67ea0f4cb5f85
    Automatic merge failed; fix conflicts and then commit the result.
  31. 74.

    paths matter

    $ ruby commits_per_second.rb --keys 1000
    14.083 Commits/Second
    $ ls | head -n 2
    00411460f7c92d2124a67ea0f4cb5f85
    006f52e9102a8d3be2fe5614f42ba989
    $ ls | wc -l
    1000
  32. 75.

    Nest files in directories

    $ ruby commits_per_second.rb --keys 1000 --type nested
    67.117 Commits/Second
    $ tree
    ├── 0
    │   ├── 0
    │   │   ├── 411460f7c92d2124a67ea0f4cb5f85
    │   │   ├── 6f52e9102a8d3be2fe5614f42ba989
    │   │   ├── ac8ed3b4327bdd4ebbebcb2ba10a00
    │   │   └── ec53c4682d36f5c4359f4ae7bd7ba1
    │   ├── 1
    │   │   ├── 161aaa0b6d1345dd8fe4e481144d84
    │   │   ├── 386bd6d8e091c2ab4c7c7de644d37b
    │   │   ├── 3a006f03dbc5392effeb8f18fda755
    │   │   ├── 3d407166ec4fa56eb1e1f8cbe183b9
    │   │   ├── 882513d5fa7c329e940dda99b12147
    │   │   ├── 9d385eb67632a7e958e23f24bd07d7
    │   │   └── f78be6f7cad02658508fe4616098a9
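
The nested layout on the slide turns the first two characters of each flat file name into directory levels. A minimal sketch of that mapping follows; the function name `nested_path` and the `depth` parameter are assumptions for illustration, with the depth of 2 read off the listing.

```ruby
# Hypothetical helper: split a flat file name into nested directories,
# one leading character per directory level.
def nested_path(name, depth = 2)
  name.chars.first(depth).push(name[depth..-1]).join('/')
end

puts nested_path('00411460f7c92d2124a67ea0f4cb5f85')
```

A Git tree object lists every entry in its directory, so a change to one file rewrites the tree that contains it. With nesting, each rewritten tree holds far fewer entries than a single flat directory of 1000 files, which is a plausible reason for the higher commit rate measured on the slide.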
  33. 77.

    git at Facebook

    From: Joshua Redstone <joshua.redstone <at> fb.com>
    Subject: Git performance results on a large repository
    Date: 2012-02-03 14:20:06 GMT

    Hi Git folks,

    We (Facebook) have been investigating source control systems to meet our growing needs. We already use git fairly widely, but have noticed it getting slower as we grow, and we want to make sure we have a good story going forward. We're debating how to proceed and would like to solicit people's thoughts.

    To better understand git scalability, I've built up a large, synthetic repository and measured a few git operations on it. I summarize the results here.

    The test repo has 4 million commits, linear history and about 1.3 million files. The size of the .git directory is about 15GB, and has been repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100 --window=250'. This repack took about 2 days on a beefy machine (I.e., lots of ram and flash). The size of the index file is 191 MB. I can share the script that generated it if people are interested - It basically picks 2-5 files, modifies a line or two and adds a few lines at the end consisting of random dictionary words, occasionally creates a new file, commits all the modifications and repeats.

    I timed a few common operations with both a warm OS file cache and a cold cache. i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did the operation in question a few times (first timing is the cold timing, the next few are the warm timings). The following results are on a server with average hard drive (I.e., not flash) and > 10GB of ram.

    http://thread.gmane.org/gmane.comp.version-control.git/189776
  34. 78.

    git at Facebook

    4 million commits
    1.3 million files
    15 GB

    http://thread.gmane.org/gmane.comp.version-control.git/189776
  35. 79.

    git at Facebook

    4 million commits
    1.3 million files
    15 GB

    git add: 7 seconds
    git status: 39 minutes
    git commit: 41 minutes

    http://thread.gmane.org/gmane.comp.version-control.git/189776
  36. 81.
  37. 89.
  38. 92.

    credits & references

    Talk by Rick Olson
    http://git-nosql-rubyconf.heroku.com

    Peepcode: Git Internals
    https://peepcode.com/products/git-internals-pdf