Git: the NoSQL Database

Check out the video for this talk: https://vimeo.com/56645405

We all know that Git is amazing for storing code. It is fast, reliable, flexible, and it keeps our project history nuzzled safely in its object database while we sleep soundly at night.

But what about storing more than code? Why not data? Much flexibility is gained by ditching traditional databases, but at what cost?

Brandon Keepers

April 21, 2012

Transcript

  1. Git: the NoSQL database, by Brandon Keepers

  2. 2 million years ago our ancestors started a revolution

  3. http://commons.wikimedia.org/wiki/File:Olduvai_stone_chopping_tool_at_British_Museum.jpg

  4. http://www.flickr.com/photos/birminghammag/6282945952

  5. @bkeepers github.com/bkeepers Hi, I am Brandon

  6. git is amazing at storing code… how well does it store data?

  7. github.com/bkeepers/gaskit

  8. disclaimer: NoSQL is marketing bollocks

  9. NoSQL: non-relational and often schema-less.

  10. Relational: PostgreSQL, MySQL
    NoSQL:
      key/value: Riak, Redis, memcached
      Columnar: HBase, (Cassandra)
      Document: MongoDB, CouchDB
      Graph: Neo4j

  11. 1. git as a data store 2. features 3. anti-features

  12. using git as a data store

  13. $ man git

  14. if git is really a database then how do we store data in it?

  15. the naïve way

  16. the naïve way
    $ git init mydb && cd mydb
    Initialized empty Git repository in mydb/.git/

  17. the naïve way

  18. the naïve way
    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
      > 1.json

  19. the naïve way
    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
      > 1.json
    $ git add 1.json

  20. the naïve way
    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
      > 1.json
    $ git add 1.json
    $ git commit -m 'adding 1.json'
    [master (root-commit) f0e15a1] adding 1.json
     1 file changed, 1 insertion(+)
     create mode 100644 1.json

  21. the naïve way
    $ git show master:1.json
    {"name":"Brandon Keepers","company":"GitHub"}

  22. tada! a database, if you call the filesystem a database

  23. git’s data model

  24. [diagram: objects in .git/objects/ after the first commit]
    commit c67d5118: tree be1b57ea, parent nil, author Brandon, message "Initial commit"
    tree be1b57ea: blob 7041879e README.md, tree 0662dca7 public
    tree 0662dca7: blob be1b57ea index.html, tree 2d21ba18 css
    tree 2d21ba18: blob be1b57ea app.css, blob 049fd918 reset.css
    blob 7041879e: "# Git The stupid content tracker."
    blob be1b57ea: <html> <head> <title>Git</title> …
    reference master: c67d5118

  25. [diagram detail: tree 0662dca7 and blobs]

  26. [diagram detail: trees be1b57ea and 0662dca7, and blobs]

  27. [diagram detail: commit c67d5118 and its tree be1b57ea]

  28. [diagram detail: commit c67d5118 and reference master]

  29. [diagram: same as slide 24]

  30. [diagram: objects in .git/objects/ after a second commit that changes README.md]
    commit c816ef7e: tree 1002d7b0, parent c67d5118, author Brandon, message "Initial commit"
    tree 1002d7b0: blob bc912988 README.md, tree 0662dca7 public
    blob bc912988: "# Git The dumb content tracker."
    reference master: c816ef7e
    (the objects from slide 24 remain; tree 0662dca7 is shared by both root trees)

  31. [diagram detail: tree 1002d7b0 and README.md blobs 7041879e and bc912988]

  32. [diagram detail: commit c816ef7e, tree 1002d7b0, and README.md blobs 7041879e and bc912988]

  33. [diagram detail: trees be1b57ea and 1002d7b0, both pointing to tree 0662dca7]

  34. [diagram detail: commit c816ef7e, tree 1002d7b0, and reference master]

  35. [diagram detail: commits c67d5118 and c816ef7e; the second names the first as its parent]

  36. [diagram: same as slide 30]

  37. talking to git, programmatically.

  38. a few libraries
    Grit: https://github.com/mojombo/grit
    libgit2 (Ruby, .NET, PHP, Python, etc): https://github.com/libgit2/libgit2

  39. grit
    require 'grit'
    repo = Grit::Repo.new('.')

  40. writing

  41. writing
    # Get a new index that we can modify
    index = repo.index

  42. writing
    # Get a new index that we can modify
    index = repo.index
    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

  43. writing
    # Get a new index that we can modify
    index = repo.index
    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
    # Make our changes
    index.add('1.json', '{"name": "Brandon"}')
    index.commit('Add user 1', :parents => [head.commit], :head => 'master')

  44. reading
    head = repo.get_head('master')
    blob = head.commit.tree / '1.json'
    blob.data

  45. that seems like too much work. where’s my ORM?

  46. a few libraries
    Toystore + adapter-git: https://github.com/bkeepers/adapter-git
    GitModel: https://github.com/pauldowman/gitmodel

  47. Toystore + adapter-git
    class Issue
      include Toy::Store
      adapter :git, Grit::Repo.new(GIT_ROOT)
      attribute :description, String
      attribute :state, String, :default => 'open'
    end

  48. Toystore + adapter-git
    Issue.create(:description => 'Store in Git')
    issue = Issue.get(id)
    issue.update_attributes(:state => 'in_progress')
    issue.destroy

  49. features

  50. versioning

  51. diffs

  52. hooks: update cache, alternate formats, or full-text indexes

  53. non-relational: question everything about relational data design

  54. no BDUF: optimize storage based on usage patterns

  55. schema-less: easily change data as the application evolves

  56. class User
      # …
      attribute :first_name, String
      attribute :last_name, String
    end

  57. class User
      # …
      # attribute :first_name, String
      # attribute :last_name, String
      attribute :name, String
      def name
        super || "#{self[:first_name]} #{self[:last_name]}"
      end
    end

  58. transactions: a commit can contain many changes

  59. long-lived transactions
    $ git checkout -b transaction
    …
    $ git checkout master
    $ git merge transaction

  60. replication: every clone contains a full copy

  61. add replica
    $ git remote add replica1 git@replica1.local:app.git
    $ cat .git/hooks/post-commit
    #!/bin/sh
    git push replica1

  62. anti-features

  63. all the features that make a great DB: yeah, git doesn’t have those.

  64. querying: you can just find it yourself

  65. concurrency: why would you want… oooh

  66. concurrency

  67. concurrency
    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

  68. concurrency
    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
    # Nobody changed anything, right?

  69. concurrency
    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree
    # Nobody changed anything, right?
    index.commit('...', :parents => [head.commit], :head => 'master')

  70. concurrency
    Lockfile.new('refs/heads/master.lock').lock do
      index = repo.index
      head = repo.get_head('master')
      index.current_tree = head.commit.tree
      index.commit('...', :parents => [head.commit], :head => 'master')
    end

  71. merge conflicts
    $ git merge branch
    Auto-merging 0/0/411460f7c92d2124a67ea0f4cb5f85
    CONFLICT (content): Merge conflict in 0/0/411460f7c92d2124a67ea0f4cb5f85
    Automatic merge failed; fix conflicts and then commit the result.

  72. git is not web scale

  73. hard write limit
    $ ruby commits_per_second.rb
    97.6538648174529 Commits/Second

  74. paths matter
    $ ruby commits_per_second.rb --keys 1000
    14.083 Commits/Second
    $ ls | head -n 2
    00411460f7c92d2124a67ea0f4cb5f85
    006f52e9102a8d3be2fe5614f42ba989
    $ ls | wc -l
    1000

  75. Nest files in directories
    $ ruby commits_per_second.rb --keys 1000 --type nested
    67.117 Commits/Second
    $ tree
    ├── 0
    │   ├── 0
    │   │   ├── 411460f7c92d2124a67ea0f4cb5f85
    │   │   ├── 6f52e9102a8d3be2fe5614f42ba989
    │   │   ├── ac8ed3b4327bdd4ebbebcb2ba10a00
    │   │   └── ec53c4682d36f5c4359f4ae7bd7ba1
    │   ├── 1
    │   │   ├── 161aaa0b6d1345dd8fe4e481144d84
    │   │   ├── 386bd6d8e091c2ab4c7c7de644d37b
    │   │   ├── 3a006f03dbc5392effeb8f18fda755
    │   │   ├── 3d407166ec4fa56eb1e1f8cbe183b9
    │   │   ├── 882513d5fa7c329e940dda99b12147
    │   │   ├── 9d385eb67632a7e958e23f24bd07d7
    │   │   └── f78be6f7cad02658508fe4616098a9

  76. large repositories with long and storied histories.

  77. git at Facebook
    From: Joshua Redstone <joshua.redstone <at> fb.com>
    Subject: Git performance results on a large repository
    Date: 2012-02-03 14:20:06 GMT

    Hi Git folks,

    We (Facebook) have been investigating source control systems to meet our growing needs. We already use git fairly widely, but have noticed it getting slower as we grow, and we want to make sure we have a good story going forward. We're debating how to proceed and would like to solicit people's thoughts.

    To better understand git scalability, I've built up a large, synthetic repository and measured a few git operations on it. I summarize the results here.

    The test repo has 4 million commits, linear history and about 1.3 million files. The size of the .git directory is about 15GB, and has been repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100 --window=250'. This repack took about 2 days on a beefy machine (I.e., lots of ram and flash). The size of the index file is 191 MB. I can share the script that generated it if people are interested - It basically picks 2-5 files, modifies a line or two and adds a few lines at the end consisting of random dictionary words, occasionally creates a new file, commits all the modifications and repeats.

    I timed a few common operations with both a warm OS file cache and a cold cache. i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did the operation in question a few times (first timing is the cold timing, the next few are the warm timings). The following results are on a server with average hard drive (I.e., not flash) and > 10GB of ram.

    http://thread.gmane.org/gmane.comp.version-control.git/189776

  78. git at Facebook [same email as slide 77, highlighting:]
    4 million commits
    1.3 million files
    15 GB

  79. git at Facebook [same email as slide 77, highlighting:]
    4 million commits
    1.3 million files
    15 GB
    git add: 7 seconds
    git status: 39 minutes
    git commit: 41 minutes

  80. if git doesn’t scale, then how does GitHub scale?

  81. [no text on slide]

  82. smoke: grit, in the cloud.

  83. [diagram: github.com, router, file servers, rpc]

  84. write limit per repo, but we have many repos on many discs.

  85. Invocation

  86. use cases where git would make a good database.

  87. content-heavy: CMS, translations, wikis

  88. partitionable: GitHub, project management

  89. offline

  90. some examples:
    madrox: github.com/technoweenie/madrox
    gollum: github.com/github/gollum
    gaskit: github.com/bkeepers/gaskit

  91. abuse your tools and imagine how to make them better

  92. credits & references
    Talk by Rick Olson: http://git-nosql-rubyconf.heroku.com
    Peepcode: Git Internals: https://peepcode.com/products/git-internals-pdf

  93. credits & references

  94. credits & references

  95. credits & references

  96. questions? @bkeepers github.com/bkeepers speakerdeck.com/bkeepers/git-the-no-sql-database
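
The data-model diagrams (slides 24 to 36) label every blob, tree, and commit with an abbreviated hash. As a rough sketch of where such an id comes from for a blob: with git's default SHA-1 object format, the id is the SHA-1 of a short header (the word `blob`, the size in bytes, a NUL byte) followed by the content. The helper name `git_blob_id` is made up for illustration and is not part of Grit or git; the ids printed on the slides are illustrative and will not match real output.

```ruby
require 'digest'

# Hypothetical helper (not part of Grit or git): computes the id git
# would assign to a blob holding the given content, assuming SHA-1.
def git_blob_id(content)
  data = content.b                        # treat the content as raw bytes
  header = "blob #{data.bytesize}\0".b    # object type, size, NUL separator
  Digest::SHA1.hexdigest(header + data)
end

# The record from "the naïve way" slides; echo appends a newline.
record = '{"name":"Brandon Keepers","company":"GitHub"}' + "\n"

puts git_blob_id(record)        # full 40-character id
puts git_blob_id(record)[0, 8]  # abbreviated, like the ids in the diagrams
```

Because the id depends only on the content, two files with identical bytes share one blob, which is why an unchanged directory such as `public` can be pointed to by the root trees of both commits in the diagram.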
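
Slides 74 and 75 report throughput dropping to about 14 commits per second with 1000 keys in one flat directory, then recovering to about 67 once the keys are nested. A minimal sketch of that key-to-path mapping, inferred from the `tree` listing on slide 75 (one directory per leading character, two levels deep). The name `nested_path` is hypothetical, and the actual `commits_per_second.rb` script may do this differently.

```ruby
# Hypothetical helper: maps a flat key to a nested path, matching the
# layout in the `tree` output on slide 75.
def nested_path(key, depth = 2)
  raise ArgumentError, 'key too short' if key.length <= depth
  dirs = key[0, depth].chars         # one directory per leading character
  File.join(*dirs, key[depth..-1])   # the remainder becomes the file name
end

puts nested_path('00411460f7c92d2124a67ea0f4cb5f85')
# → 0/0/411460f7c92d2124a67ea0f4cb5f85
```

A likely reason nesting helps: git trees are immutable, so changing one file writes a new tree object for every directory on its path. With a flat layout that means rewriting a single tree with 1000 entries on each commit, while the nested layout rewrites a few much smaller trees.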