Git: the NoSQL Database

Git: the NoSQL database, by Brandon Keepers

2 million years ago our ancestors started a revolution

http://commons.wikimedia.org/wiki/File:Olduvai_stone_chopping_tool_at_British_Museum.jpg

http://www.flickr.com/photos/birminghammag/6282945952

Hi, I am Brandon (@bkeepers, github.com/bkeepers)

git is amazing at storing code… how well does it store data?

github.com/bkeepers/gaskit

disclaimer: NoSQL is marketing bollocks

NoSQL: non-relational and often schema-less.

Relational: PostgreSQL, MySQL
NoSQL
    key/value: Riak, Redis, memcached
    Columnar: HBase, (Cassandra)
    Document: MongoDB, CouchDB
    Graph: Neo4j

1. git as a data store
2. features
3. anti-features

using git as a data store

$ man git

if git is really a database then how do we store data in it?

the naïve way

    $ git init mydb && cd mydb
    Initialized empty Git repository in mydb/.git/

    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
        > 1.json

    $ git add 1.json
    $ git commit -m 'adding 1.json'
    [master (root-commit) f0e15a1] adding 1.json
     1 file changed, 1 insertion(+)
     create mode 100644 1.json

    $ git show master:1.json
    {"name":"Brandon Keepers","company":"GitHub"}

tada! a database (if you call the filesystem a database)

git’s data model

.git/objects/ after the first commit

    commit c67d5118
        tree: be1b57ea
        parent: nil
        author: Brandon
        message: Initial commit

    tree be1b57ea
        blob: 7041879e README.md
        tree: 0662dca7 public

    tree 0662dca7
        blob: be1b57ea index.html
        tree: 2d21ba18 css

    tree 2d21ba18
        blob: be1b57ea app.css
        blob: 049fd918 reset.css

    blob 7041879e
        # Git
        The stupid content tracker.

    blob be1b57ea
        <html> <head> <title>Git</title> …

    reference master
        c67d5118

.git/objects/ after a second commit that changes README.md

    commit c816ef7e
        tree: 1002d7b0
        parent: c67d5118
        author: Brandon
        message: Initial commit

    tree 1002d7b0
        blob: bc912988 README.md
        tree: 0662dca7 public

    blob bc912988
        # Git
        The dumb content tracker.

    reference master
        c816ef7e

    still stored, unchanged: commit c67d5118, tree be1b57ea, tree 0662dca7 (public, now referenced by both root trees), tree 2d21ba18, blob 7041879e, blob be1b57ea
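The object graph can be imitated in a few lines of Ruby. This is a toy sketch, not Git's implementation: the `TinyObjectStore` class and its methods are hypothetical names, and the tree and commit encodings are simplified text, so only the blob ids match what real git would compute (SHA-1 over `blob <bytesize>\0<content>`). It shows why a second commit that edits one file adds only three objects: a new blob, a new root tree, and a new commit.

```ruby
require 'digest/sha1'

# Toy content-addressed store (hypothetical). Blob ids match real git;
# tree and commit bodies use a simplified text encoding, so their ids do not.
class TinyObjectStore
  attr_reader :objects, :refs

  def initialize
    @objects = {} # id => [type, body]
    @refs = {}    # branch name => commit id
  end

  # Same id scheme git uses: SHA-1 over "<type> <bytesize>\0<body>".
  def put(type, body)
    id = Digest::SHA1.hexdigest("#{type} #{body.bytesize}\0#{body}")
    @objects[id] = [type, body]
    id
  end

  def blob(content)
    put('blob', content)
  end

  # entries: { name => [type, id] }
  def tree(entries)
    lines = entries.sort.map { |name, (type, id)| "#{type} #{id} #{name}" }
    put('tree', lines.join("\n"))
  end

  # Stores the commit and moves the branch reference to it.
  def commit(tree_id, parent, message, branch: 'master')
    id = put('commit', "tree #{tree_id}\nparent #{parent || 'nil'}\n\n#{message}")
    @refs[branch] = id
    id
  end
end

store = TinyObjectStore.new

# First commit: README.md plus a public/ directory.
index_html = store.blob("<html>placeholder</html>\n")
public_dir = store.tree('index.html' => ['blob', index_html])
readme1 = store.blob("# Git\nThe stupid content tracker.\n")
root1 = store.tree('README.md' => ['blob', readme1], 'public' => ['tree', public_dir])
first = store.commit(root1, nil, 'Initial commit')
count_after_first = store.objects.size

# Second commit: only README.md changes, so public/ is reused by id.
readme2 = store.blob("# Git\nThe dumb content tracker.\n")
root2 = store.tree('README.md' => ['blob', readme2], 'public' => ['tree', public_dir])
second = store.commit(root2, first, 'Reword README')

new_objects = store.objects.size - count_after_first
puts "objects added by second commit: #{new_objects}" # 3
```

Because ids are derived from content, the unchanged `public` tree hashes to the same id in both commits and is stored once.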

talking to git, programmatically.

a few libraries
    Grit: https://github.com/mojombo/grit
    libgit2 (Ruby, .NET, PHP, Python, etc): https://github.com/libgit2/libgit2

grit

    require 'grit'
    repo = Grit::Repo.new('.')

writing

    # Get a new index that we can modify
    index = repo.index

    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Make our changes
    index.add('1.json', '{"name": "Brandon"}')
    index.commit('Add user 1', :parents => [head.commit], :head => 'master')

reading

    head = repo.get_head('master')
    blob = head.commit.tree / '1.json'
    blob.data

that seems like too much work. where’s my ORM?

a few libraries
    Toystore + adapter-git: https://github.com/bkeepers/adapter-git
    GitModel: https://github.com/pauldowman/gitmodel

Toystore + adapter-git

    class Issue
      include Toy::Store
      adapter :git, Grit::Repo.new(GIT_ROOT)

      attribute :description, String
      attribute :state, String, :default => 'open'
    end

    Issue.create(:description => 'Store in Git')

    issue = Issue.get(id)
    issue.update_attributes(:state => 'in_progress')
    issue.destroy

features

versioning

diffs

hooks: update caches, alternate formats, or full-text indexes

non-relational: question everything about relational data design

no BDUF (big design up front): optimize storage based on usage patterns

schema-less: easily change data as the application evolves

    class User
      # …
      attribute :first_name, String
      attribute :last_name, String
    end

    class User
      # …
      # attribute :first_name, String
      # attribute :last_name, String
      attribute :name, String

      # Fall back to the old fields for records written before the change
      def name
        super || "#{self[:first_name]} #{self[:last_name]}"
      end
    end
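The same lazy-migration idea can be tried without Toystore. The sketch below is a plain-Ruby stand-in, assuming documents are JSON hashes; the `LegacyAwareUser` class is a hypothetical name and not part of any library mentioned in the talk. Old documents keep their original shape on disk and are reinterpreted at read time.

```ruby
require 'json'

# Hypothetical stand-in for a model: wraps a parsed JSON document.
class LegacyAwareUser
  def initialize(attributes)
    @attributes = attributes
  end

  # Prefer the new field; fall back to the legacy pair for old documents.
  def name
    @attributes['name'] ||
      [@attributes['first_name'], @attributes['last_name']].compact.join(' ')
  end
end

old_doc = JSON.parse('{"first_name":"Brandon","last_name":"Keepers"}')
new_doc = JSON.parse('{"name":"Brandon Keepers"}')

puts LegacyAwareUser.new(old_doc).name # Brandon Keepers
puts LegacyAwareUser.new(new_doc).name # Brandon Keepers
```

No stored document had to be rewritten; both shapes remain readable for as long as the fallback stays in the model.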

transactions: a commit can contain many changes

long-lived transactions

    $ git checkout -b transaction
    …
    $ git checkout master
    $ git merge transaction

replication: every clone contains a full copy

add replica

    $ git remote add replica1 git@replica1.local:app.git
    $ cat .git/hooks/post-commit
    #!/bin/sh
    git push replica1

anti-features

all the features that make a great DB? yeah, git doesn’t have those.

querying: you can just find it yourself
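"Finding it yourself" amounts to a full scan. The sketch below assumes the documents have already been read out of the repository into a hash of path to JSON string; the sample records other than `1.json`, and the `where` helper, are made up for illustration.

```ruby
require 'json'

# Hypothetical data: path => JSON document, as if read from the tree at HEAD.
DOCUMENTS = {
  '1.json' => '{"name":"Brandon Keepers","company":"GitHub"}',
  '2.json' => '{"name":"Ada","company":"Example Co"}',
  '3.json' => '{"name":"Grace","company":"GitHub"}'
}.freeze

# No indexes: every query parses and inspects every document.
def where(documents, conditions)
  documents.each_with_object({}) do |(path, raw), matches|
    doc = JSON.parse(raw)
    matches[path] = doc if conditions.all? { |key, value| doc[key] == value }
  end
end

github_people = where(DOCUMENTS, { 'company' => 'GitHub' })
puts github_people.keys.inspect # ["1.json", "3.json"]
```

The cost grows linearly with the number of documents, which is why a hook-maintained cache or external index becomes attractive for anything beyond small data sets.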

concurrency: why would you want…oooh

concurrency

    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Nobody changed anything, right?
    index.commit('...', :parents => [head.commit], :head => 'master')

concurrency

    # Hold the branch lock for the whole read-modify-write cycle
    Lockfile.new('refs/heads/master.lock').lock do
      index = repo.index
      head = repo.get_head('master')
      index.current_tree = head.commit.tree
      index.commit('...', :parents => [head.commit], :head => 'master')
    end
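The race can be reproduced without a repository. The sketch below is an in-memory toy, and it swaps in a different technique from the lock file on the slide: an optimistic compare-and-swap, where a writer's update is rejected if the branch no longer points at the commit it started from. `RefStore`, `StaleHead`, and the id `deadbeef` are hypothetical names for illustration.

```ruby
# Toy ref store: an update succeeds only if the ref still points at the
# commit the writer read earlier (compare-and-swap), guarded by a mutex.
class RefStore
  class StaleHead < StandardError; end

  def initialize(head)
    @head = head
    @lock = Mutex.new
  end

  def head
    @lock.synchronize { @head }
  end

  def update(expected_parent, new_commit)
    @lock.synchronize do
      unless @head == expected_parent
        raise StaleHead, "expected #{expected_parent}, found #{@head}"
      end
      @head = new_commit
    end
  end
end

refs = RefStore.new('c67d5118')

# Two writers read the same head before either commits.
seen_by_a = refs.head
seen_by_b = refs.head

refs.update(seen_by_a, 'c816ef7e') # writer A commits first

begin
  refs.update(seen_by_b, 'deadbeef') # writer B is now based on a stale head
  outcome = :committed
rescue RefStore::StaleHead
  outcome = :rejected
end

puts outcome # rejected
```

A rejected writer would re-read the head, rebuild its tree on top of it, and retry; the lock-file approach avoids the retry by making writers wait instead.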

merge conflicts

    $ git merge branch
    Auto-merging 0/0/411460f7c92d2124a67ea0f4cb5f85
    CONFLICT (content): Merge conflict in 0/0/411460f7c92d2124a67ea0f4cb5f85
    Automatic merge failed; fix conflicts and then commit the result.

git is not web scale

hard write limit

    $ ruby commits_per_second.rb
    97.6538648174529 Commits/Second

paths matter

    $ ruby commits_per_second.rb --keys 1000
    14.083 Commits/Second
    $ ls | head -n 2
    00411460f7c92d2124a67ea0f4cb5f85
    006f52e9102a8d3be2fe5614f42ba989
    $ ls | wc -l
    1000

Nest files in directories

    $ ruby commits_per_second.rb --keys 1000 --type nested
    67.117 Commits/Second
    $ tree
    ├── 0
    │   ├── 0
    │   │   ├── 411460f7c92d2124a67ea0f4cb5f85
    │   │   ├── 6f52e9102a8d3be2fe5614f42ba989
    │   │   ├── ac8ed3b4327bdd4ebbebcb2ba10a00
    │   │   └── ec53c4682d36f5c4359f4ae7bd7ba1
    │   ├── 1
    │   │   ├── 161aaa0b6d1345dd8fe4e481144d84
    │   │   ├── 386bd6d8e091c2ab4c7c7de644d37b
    │   │   ├── 3a006f03dbc5392effeb8f18fda755
    │   │   ├── 3d407166ec4fa56eb1e1f8cbe183b9
    │   │   ├── 882513d5fa7c329e940dda99b12147
    │   │   ├── 9d385eb67632a7e958e23f24bd07d7
    │   │   └── f78be6f7cad02658508fe4616098a9
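The nested layout in the listing can be derived from a flat key by peeling off the first two characters as directories. A minimal sketch follows; `nested_path` and `path_for` are hypothetical helpers, and hashing record ids with MD5 is an assumption (the benchmark script itself is not shown, only its 32-character hex file names).

```ruby
require 'digest/md5'

# Split a hex key into two single-character directories plus the remainder,
# matching the 0/0/411460... layout in the tree listing.
def nested_path(hex_key)
  [hex_key[0], hex_key[1], hex_key[2..-1]].join('/')
end

# Assumption: keys are MD5 hex digests of a record id.
def path_for(record_id)
  nested_path(Digest::MD5.hexdigest(record_id.to_s))
end

puts nested_path('00411460f7c92d2124a67ea0f4cb5f85')
# 0/0/411460f7c92d2124a67ea0f4cb5f85
```

With 16 possible characters per level, each tree object holds far fewer entries than one flat directory of 1000 files, so each commit rewrites smaller trees.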

large repositories with long and storied histories.

git at Facebook

    From: Joshua Redstone <joshua.redstone <at> fb.com>
    Subject: Git performance results on a large repository
    Date: 2012-02-03 14:20:06 GMT

    Hi Git folks,

    We (Facebook) have been investigating source control systems to meet our growing needs. We already use git fairly widely, but have noticed it getting slower as we grow, and we want to make sure we have a good story going forward. We're debating how to proceed and would like to solicit people's thoughts.

    To better understand git scalability, I've built up a large, synthetic repository and measured a few git operations on it. I summarize the results here.

    The test repo has 4 million commits, linear history and about 1.3 million files. The size of the .git directory is about 15GB, and has been repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100 --window=250'. This repack took about 2 days on a beefy machine (I.e., lots of ram and flash). The size of the index file is 191 MB. I can share the script that generated it if people are interested - It basically picks 2-5 files, modifies a line or two and adds a few lines at the end consisting of random dictionary words, occasionally creates a new file, commits all the modifications and repeats.

    I timed a few common operations with both a warm OS file cache and a cold cache. i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did the operation in question a few times (first timing is the cold timing, the next few are the warm timings). The following results are on a server with average hard drive (I.e., not flash) and > 10GB of ram.

http://thread.gmane.org/gmane.comp.version-control.git/189776

4 million commits, 1.3 million files, 15 GB

    git add: 7 seconds
    git status: 39 minutes
    git commit: 41 minutes

if git doesn’t scale, then how does GitHub scale?

smoke: grit, in the cloud.

[diagram: github.com, router, file servers, rpc]

write limit per repo: but we have many repos on many disks.
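One way to picture "many repos on many disks" is a router that maps each repository to a file server, so the per-repository write limit is multiplied by the number of repositories written in parallel. The sketch below is purely illustrative and is not a description of GitHub's actual router: the server names, `FILE_SERVERS`, and `file_server_for` are all hypothetical, and hashing the path with CRC32 is an assumed placement rule.

```ruby
require 'zlib'

# Hypothetical pool of file servers.
FILE_SERVERS = %w[fs1 fs2 fs3 fs4].freeze

# Deterministically pick a file server for a repository path.
def file_server_for(repo_path, servers = FILE_SERVERS)
  servers[Zlib.crc32(repo_path) % servers.size]
end

%w[bkeepers/gaskit github/gollum technoweenie/madrox].each do |repo|
  puts "#{repo} -> #{file_server_for(repo)}"
end
```

A fixed modulo reshuffles most repositories whenever a server is added, so a real deployment would more likely keep an explicit repository-to-server lookup table.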

Invocation

use cases where git would make a good database.

content heavy: CMS, translations, wikis

partitionable: GitHub, project management

offline

some examples:
    madrox: github.com/technoweenie/madrox
    gollum: github.com/github/gollum
    gaskit: github.com/bkeepers/gaskit

abuse your tools and imagine how to make them better

credits & references
    Talk by Rick Olson: http://git-nosql-rubyconf.heroku.com
    Peepcode: Git Internals: https://peepcode.com/products/git-internals-pdf


questions? @bkeepers github.com/bkeepers speakerdeck.com/bkeepers/git-the-no-sql-database
