Git: the NoSQL Database

Check out the video for this talk: https://vimeo.com/56645405

We all know that Git is amazing for storing code. It is fast, reliable, flexible, and it keeps our project history nuzzled safely in its object database while we sleep soundly at night.

But what about storing more than code? Why not data? Much flexibility is gained by ditching traditional databases, but at what cost?

Brandon Keepers

April 21, 2012

Transcript

  1. the naïve way

    $ git init mydb && cd mydb
    Initialized empty Git repository in mydb/.git/
  2. the naïve way

    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
        > 1.json
    $ git add 1.json
    $ git commit -m 'adding 1.json'
    [master (root-commit) f0e15a1] adding 1.json
     1 file changed, 1 insertion(+)
     create mode 100644 1.json
  3. [Diagram] .git/objects/

    commit c67d5118: tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea: blob: 7041879e README.md, tree: 0662dca7 public
    tree 0662dca7: blob: be1b57ea index.html, tree: 2d21ba18 css
    blob 7041879e: # Git The stupid content tracker.
    blob be1b57ea: <html> <head> <title>Git</title> …
    tree 2d21ba18: blob: be1b57ea app.css, blob: 049fd918 reset.css
    reference master: c67d5118
    …
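
The IDs in the diagram above are content hashes: an object's name is derived from what it contains. As a minimal sketch (not from the talk), the Ruby below computes the ID Git gives a blob in a SHA-1 repository, which is the SHA-1 of the header `blob <size>\0` followed by the content. The helper name `git_blob_id` is made up for illustration, and the 8-character IDs on the slide appear to be illustrative, so they will not match what this prints.

```ruby
require 'digest/sha1'

# Hypothetical helper: the ID Git assigns to a blob with this content,
# assuming a repository that uses SHA-1 object names.
def git_blob_id(content)
  bytes = content.b                       # hash the raw bytes, not characters
  header = "blob #{bytes.bytesize}\0".b   # object type, byte size, NUL
  Digest::SHA1.hexdigest(header + bytes)
end

puts git_blob_id("")                # → e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
puts git_blob_id("test content\n")  # → d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Because the ID depends only on the content, storing the same document twice yields one object, which is what lets trees and commits point at shared blobs.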
  4. [Diagram, detail] .git/objects/

    tree 0662dca7: blob: be1b57ea index.html, tree: 2d21ba18 css
    blob 7041879e: # Git The stupid content tracker.
    blob be1b57ea: <html> <head> <title>Git</title> …
  5. [Diagram, detail] .git/objects/

    tree be1b57ea: blob: 7041879e README.md, tree: 0662dca7 public
    tree 0662dca7: blob: be1b57ea index.html, tree: 2d21ba18 css
    blob 7041879e: # Git The stupid content tracker.
    blob be1b57ea: <html> <head> <title>Git</title> …
  6. [Diagram, detail] .git/objects/

    commit c67d5118: tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea: blob: 7041879e README.md, tree: 0662dca7 public
    reference master: c67d5118
  7. [Diagram, detail] .git/objects/

    commit c67d5118: tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea
    reference master: c67d5118
  8. [Diagram] .git/objects/

    commit c67d5118: tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea: blob: 7041879e README.md, tree: 0662dca7 public
    tree 0662dca7: blob: be1b57ea index.html, tree: 2d21ba18 css
    blob 7041879e: # Git The stupid content tracker.
    blob be1b57ea: <html> <head> <title>Git</title> …
    tree 2d21ba18: blob: be1b57ea app.css, blob: 049fd918 reset.css
    reference master: c67d5118
    …
  9. [Diagram]

    commit c67d5118: tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea: blob: 7041879e README.md, tree: 0662dca7 public
    commit c816ef7e: tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0: blob: bc912988 README.md, tree: 0662dca7 public
    reference master: c816ef7e
    tree 0662dca7: blob: be1b57ea index.html, tree: 2d21ba18 css
    blob be1b57ea: <html> <head> <title>Git</title> …
    blob 7041879e: # Git The stupid content tracker.
    tree 2d21ba18
    blob bc912988: # Git The dumb content tracker.
  10. [Diagram, detail]

    tree 1002d7b0: blob: bc912988 README.md, tree: 0662dca7 public
    blob 7041879e: # Git The stupid content tracker.
    blob bc912988: # Git The dumb content tracker.
  11. [Diagram, detail]

    commit c816ef7e: tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0: blob: bc912988 README.md, tree: 0662dca7 public
    blob 7041879e: # Git The stupid content tracker.
    blob bc912988: # Git The dumb content tracker.
  12. [Diagram, detail]

    tree be1b57ea: blob: 7041879e README.md, tree: 0662dca7 public
    tree 1002d7b0: blob: bc912988 README.md, tree: 0662dca7 public
    tree 0662dca7: blob: be1b57ea index.html, tree: 2d21ba18 css
    blob be1b57ea: <html> <head> <title>Git</title> …
    blob 7041879e: # Git The stupid content tracker.
  13. [Diagram, detail]

    commit c816ef7e: tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0: blob: bc912988 README.md, tree: 0662dca7 public
    reference master: c816ef7e
  14. [Diagram, detail]

    commit c67d5118: tree: be1b57ea, parent: nil, author: Brandon
    tree be1b57ea: blob: 7041879e README.md
    commit c816ef7e: tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0: blob: bc912988 README.md, tree: 0662dca7 public
    reference master: c816ef7e
  15. [Diagram]

    commit c67d5118: tree: be1b57ea, parent: nil, author: Brandon, message: Initial commit
    tree be1b57ea: blob: 7041879e README.md, tree: 0662dca7 public
    commit c816ef7e: tree: 1002d7b0, parent: c67d5118, author: Brandon, message: Initial commit
    tree 1002d7b0: blob: bc912988 README.md, tree: 0662dca7 public
    reference master: c816ef7e
    tree 0662dca7: blob: be1b57ea index.html, tree: 2d21ba18 css
    blob be1b57ea: <html> <head> <title>Git</title> …
    blob 7041879e: # Git The stupid content tracker.
    tree 2d21ba18
    blob bc912988: # Git The dumb content tracker.
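
The two-commit diagram above shows why keeping history is cheap: changing README.md adds a new blob, a new root tree, and a new commit, while the `public` tree (0662dca7) is reused untouched. Below is a toy model of that behaviour. It assumes nothing about Git's real on-disk format: `ToyObjectStore` and its methods are hypothetical, IDs are truncated SHA-1s of a simplified text encoding, and the second commit message is invented.

```ruby
require 'digest/sha1'

# Toy content-addressed store (hypothetical, for illustration only).
# Real Git uses a binary tree format and compresses objects with zlib.
class ToyObjectStore
  attr_reader :objects

  def initialize
    @objects = {}
  end

  # Store a body under an ID derived from its type and content.
  def put(type, body)
    id = Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0#{body}")[0, 8]
    @objects[id] ||= body
    id
  end

  def blob(content)
    put('blob', content)
  end

  # entries maps a file or directory name to the ID of a blob or tree.
  def tree(entries)
    put('tree', entries.sort.map { |name, id| "#{id} #{name}" }.join("\n"))
  end

  def commit(tree_id, parent, message)
    put('commit', "tree #{tree_id}\nparent #{parent || 'nil'}\nmessage #{message}")
  end
end

store  = ToyObjectStore.new
css    = store.tree('app.css' => store.blob('body {}'), 'reset.css' => store.blob('* {}'))
public = store.tree('index.html' => store.blob('<html>'), 'css' => css)
root1  = store.tree('README.md' => store.blob('The stupid content tracker.'), 'public' => public)
first  = store.commit(root1, nil, 'Initial commit')
before = store.objects.size

# Change only README.md: the public subtree is passed along by ID.
root2  = store.tree('README.md' => store.blob('The dumb content tracker.'), 'public' => public)
second = store.commit(root2, first, 'Reword README')

puts store.objects.size - before  # → 3 (new blob, new root tree, new commit)
```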
  16. writing

    # Get a new index that we can modify
    index = repo.index

    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
  17. writing

    # Get a new index that we can modify
    index = repo.index

    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Make our changes
    index.add('1.json', '{"name": "Brandon"}')
    index.commit('Add user 1', :parents => [head.commit], :head => 'master')
  18. Toystore + adapter-git

    class Issue
      include Toy::Store
      adapter :git, Grit::Repo.new(GIT_ROOT)

      attribute :description, String
      attribute :state, String, :default => 'open'
    end
  19. Toystore + adapter-git

    Issue.create(:description => 'Store in Git')
    issue = Issue.get(id)
    issue.update_attributes(:state => 'in_progress')
    issue.destroy
  20.

    class User
      # …
      # attribute :first_name, String
      # attribute :last_name, String
      attribute :name, String

      def name
        super || "#{self[:first_name]} #{self[:last_name]}"
      end
    end
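
The slide above handles a schema change at read time instead of rewriting stored documents: new records carry `name`, while old ones still carry `first_name` and `last_name`. Here is a plain-Ruby sketch of the same idea without Toystore. `UserRecord` is a hypothetical stand-in for the model, and the old-format document is invented for illustration.

```ruby
require 'json'

# Hypothetical stand-in for the Toystore model on the slide.
class UserRecord
  def initialize(json)
    @attrs = JSON.parse(json)
  end

  # Prefer the new "name" key; fall back to the legacy pair of keys.
  def name
    @attrs['name'] || "#{@attrs['first_name']} #{@attrs['last_name']}"
  end
end

old_doc = '{"first_name":"Brandon","last_name":"Keepers"}'
new_doc = '{"name":"Brandon Keepers","company":"GitHub"}'

puts UserRecord.new(old_doc).name  # → Brandon Keepers
puts UserRecord.new(new_doc).name  # → Brandon Keepers
```

Old documents stay readable without a migration pass, at the cost of keeping the fallback in the reader for as long as such documents exist.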
  21. long-lived transactions

    $ git checkout -b transaction
    …
    $ git checkout master
    $ git merge transaction
  22. add replica

    $ git remote add replica1 [email protected]:app.git
    $ cat .git/hooks/post-commit
    #!/bin/sh
    git push replica1
  23. concurrency

    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Nobody changed anything, right?
    index.commit('...', :parents => [head.commit], :head => 'master')
  24. concurrency

    Lockfile.new('refs/heads/master.lock').lock do
      index = repo.index
      head = repo.get_head('master')
      index.current_tree = head.commit.tree
      index.commit('...', :parents => [head.commit], :head => 'master')
    end
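
The lock on the slide above serializes writers so that the read-modify-commit sequence cannot interleave with another writer's. Below is a rough stdlib-only sketch of one way to build such a lock, assuming exclusive file creation is atomic on the filesystem in use. `with_lockfile` is a hypothetical helper, not the Lockfile gem's API, and it fails immediately when the lock is taken instead of waiting.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: run the block while holding an exclusive lock file.
# File::EXCL makes the open fail with Errno::EEXIST if the file already
# exists, so only one writer can create it.
def with_lockfile(path)
  # Open outside the begin/ensure so a failed attempt never deletes
  # a lock that belongs to someone else.
  lock = File.open(path, File::WRONLY | File::CREAT | File::EXCL)
  begin
    yield
  ensure
    lock.close
    File.delete(path)
  end
end

dir = Dir.mktmpdir
lock_path = File.join(dir, 'master.lock')

with_lockfile(lock_path) do
  begin
    with_lockfile(lock_path) { puts 'second writer got in' }
  rescue Errno::EEXIST
    puts 'second writer rejected'  # printed: the lock is already held
  end
end

puts File.exist?(lock_path)  # → false (released after the block)
FileUtils.remove_entry(dir)
```

A caller that gets `Errno::EEXIST` would typically retry after a short pause, re-reading the head commit before trying again.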
  25. merge conflicts

    $ git merge branch
    Auto-merging 0/0/411460f7c92d2124a67ea0f4cb5f85
    CONFLICT (content): Merge conflict in 0/0/411460f7c92d2124a67ea0f4cb5f85
    Automatic merge failed; fix conflicts and then commit the result.
  26. paths matter

    $ ruby commits_per_second.rb --keys 1000
    14.083 Commits/Second
    $ ls | head -n 2
    00411460f7c92d2124a67ea0f4cb5f85
    006f52e9102a8d3be2fe5614f42ba989
    $ ls | wc -l
    1000
  27. Nest files in directories

    $ ruby commits_per_second.rb --keys 1000 --type nested
    67.117 Commits/Second
    $ tree
    ├── 0
    │   ├── 0
    │   │   ├── 411460f7c92d2124a67ea0f4cb5f85
    │   │   ├── 6f52e9102a8d3be2fe5614f42ba989
    │   │   ├── ac8ed3b4327bdd4ebbebcb2ba10a00
    │   │   └── ec53c4682d36f5c4359f4ae7bd7ba1
    │   ├── 1
    │   │   ├── 161aaa0b6d1345dd8fe4e481144d84
    │   │   ├── 386bd6d8e091c2ab4c7c7de644d37b
    │   │   ├── 3a006f03dbc5392effeb8f18fda755
    │   │   ├── 3d407166ec4fa56eb1e1f8cbe183b9
    │   │   ├── 882513d5fa7c329e940dda99b12147
    │   │   ├── 9d385eb67632a7e958e23f24bd07d7
    │   │   └── f78be6f7cad02658508fe4616098a9
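
The nested layout above uses the first two characters of each key as directory names. A plausible reason it commits faster is that Git rewrites every tree object on the path to a changed file, and one 1000-entry tree is larger than three small ones. The benchmark script is not shown, so treat that explanation as an assumption. A sketch of the path mapping, where `nested_path` is a hypothetical helper:

```ruby
# Hypothetical helper: map a flat key to the nested layout on the slide,
# using the first two characters as single-character directories.
def nested_path(key)
  raise ArgumentError, 'key must be at least 3 characters' if key.length < 3
  File.join(key[0], key[1], key[2..-1])
end

puts nested_path('00411460f7c92d2124a67ea0f4cb5f85')
# → 0/0/411460f7c92d2124a67ea0f4cb5f85
```

With hexadecimal keys this gives at most 16 entries at each of the top two levels, so the leaf directories stay small as the key count grows into the thousands.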
  28. git at Facebook

    From: Joshua Redstone <joshua.redstone <at> fb.com>
    Subject: Git performance results on a large repository
    Date: 2012-02-03 14:20:06 GMT

    Hi Git folks,

    We (Facebook) have been investigating source control systems to meet our growing needs. We already use git fairly widely, but have noticed it getting slower as we grow, and we want to make sure we have a good story going forward. We're debating how to proceed and would like to solicit people's thoughts.

    To better understand git scalability, I've built up a large, synthetic repository and measured a few git operations on it. I summarize the results here.

    The test repo has 4 million commits, linear history and about 1.3 million files. The size of the .git directory is about 15GB, and has been repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100 --window=250'. This repack took about 2 days on a beefy machine (I.e., lots of ram and flash). The size of the index file is 191 MB. I can share the script that generated it if people are interested - It basically picks 2-5 files, modifies a line or two and adds a few lines at the end consisting of random dictionary words, occasionally creates a new file, commits all the modifications and repeats.

    I timed a few common operations with both a warm OS file cache and a cold cache. i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did the operation in question a few times (first timing is the cold timing, the next few are the warm timings). The following results are on a server with average hard drive (I.e., not flash) and > 10GB of ram.

    http://thread.gmane.org/gmane.comp.version-control.git/189776
  29. git at Facebook

    4 million commits
    1.3 million files
    15 GB

    http://thread.gmane.org/gmane.comp.version-control.git/189776
  30. git at Facebook

    4 million commits
    1.3 million files
    15 GB

    git add: 7 seconds
    git status: 39 minutes
    git commit: 41 minutes

    http://thread.gmane.org/gmane.comp.version-control.git/189776
  31. credits & references

    Talk by Rick Olson
    http://git-nosql-rubyconf.heroku.com

    Peepcode: Git Internals
    https://peepcode.com/products/git-internals-pdf