Slide 1

Slide 1 text

Git: the NoSQL database, by Brandon Keepers

Slide 2

Slide 2 text

2 million years ago our ancestors started a revolution

Slide 3

Slide 3 text

http://commons.wikimedia.org/wiki/File:Olduvai_stone_chopping_tool_at_British_Museum.jpg

Slide 4

Slide 4 text

http://www.flickr.com/photos/birminghammag/6282945952

Slide 5

Slide 5 text

Hi, I am Brandon
@bkeepers
github.com/bkeepers

Slide 6

Slide 6 text

git is amazing at storing code… how well does it store data?

Slide 7

Slide 7 text

github.com/bkeepers/gaskit

Slide 8

Slide 8 text

disclaimer: NoSQL is marketing bollocks

Slide 9

Slide 9 text

NoSQL: non-relational and often schema-less.

Slide 10

Slide 10 text

Relational: PostgreSQL, MySQL

NoSQL
  key/value: Riak, Redis, memcached
  Columnar: HBase, (Cassandra)
  Document: MongoDB, CouchDB
  Graph: Neo4J

Slide 11

Slide 11 text

1. git as a data store
2. features
3. anti-features

Slide 12

Slide 12 text

using git as a data store

Slide 13

Slide 13 text

$ man git

Slide 14

Slide 14 text

if git is really a database then how do we store data in it?

Slide 15

Slide 15 text

the naïve way

Slide 16

Slide 16 text

the naïve way

    $ git init mydb && cd mydb
    Initialized empty Git repository in mydb/.git/

Slide 17

Slide 17 text

the naïve way

Slide 18

Slide 18 text

the naïve way

    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
      > 1.json

Slide 19

Slide 19 text

the naïve way

    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
      > 1.json
    $ git add 1.json

Slide 20

Slide 20 text

the naïve way

    $ echo '{"name":"Brandon Keepers","company":"GitHub"}' \
      > 1.json
    $ git add 1.json
    $ git commit -m 'adding 1.json'
    [master (root-commit) f0e15a1] adding 1.json
     1 file changed, 1 insertion(+)
     create mode 100644 1.json

Slide 21

Slide 21 text

the naïve way

    $ git show master:1.json
    {"name":"Brandon Keepers","company":"GitHub"}

Slide 22

Slide 22 text

tada! a database, if you call the filesystem a database

Slide 23

Slide 23 text

git’s data model

Slide 24

Slide 24 text

.git/objects/

  commit c67d5118    tree: be1b57ea | parent: nil | author: Brandon | message: Initial commit
  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public
  tree 0662dca7      blob: be1b57ea index.html | tree: 2d21ba18 css
  tree 2d21ba18      blob: be1b57ea app.css | blob: 049fd918 reset.css
  blob 7041879e      # Git The stupid content tracker.
  blob be1b57ea      Git …
  …

  reference master   c67d5118
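Where do those object IDs come from? For a blob, git hashes a short header followed by the file's bytes. The following is a minimal sketch using only Ruby's standard library; the helper name `git_blob_id` is made up for illustration and is not part of Grit or git.

```ruby
require 'digest/sha1'

# Compute the ID git assigns to a blob: the SHA-1 of the header
# "blob <size in bytes>\0" followed by the raw content.
# (git_blob_id is a hypothetical helper name, not a Grit or git API.)
def git_blob_id(content)
  bytes = content.b
  Digest::SHA1.hexdigest("blob #{bytes.bytesize}\0".b + bytes)
end

puts git_blob_id('')
# → e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (git's ID for the empty blob)
```

Identical content always hashes to the same ID, which is why two files with the same bytes can point at a single blob.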

Slide 25

Slide 25 text

.git/objects/

  tree 0662dca7      blob: be1b57ea index.html | tree: 2d21ba18 css
  blob 7041879e      # Git The stupid content tracker.
  blob be1b57ea      Git …

Slide 26

Slide 26 text

.git/objects/

  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public
  tree 0662dca7      blob: be1b57ea index.html | tree: 2d21ba18 css
  blob 7041879e      # Git The stupid content tracker.
  blob be1b57ea      Git …

Slide 27

Slide 27 text

.git/objects/

  commit c67d5118    tree: be1b57ea | parent: nil | author: Brandon | message: Initial commit
  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public

  reference master   c67d5118

Slide 28

Slide 28 text

.git/objects/

  commit c67d5118    tree: be1b57ea | parent: nil | author: Brandon | message: Initial commit

  reference master   c67d5118

Slide 29

Slide 29 text

.git/objects/

  commit c67d5118    tree: be1b57ea | parent: nil | author: Brandon | message: Initial commit
  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public
  tree 0662dca7      blob: be1b57ea index.html | tree: 2d21ba18 css
  tree 2d21ba18      blob: be1b57ea app.css | blob: 049fd918 reset.css
  blob 7041879e      # Git The stupid content tracker.
  blob be1b57ea      Git …
  …

  reference master   c67d5118

Slide 30

Slide 30 text

  commit c67d5118    tree: be1b57ea | parent: nil | author: Brandon | message: Initial commit
  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public
  commit c816ef7e    tree: 1002d7b0 | parent: c67d5118 | author: Brandon | message: Initial commit
  tree 1002d7b0      blob: bc912988 README.md | tree: 0662dca7 public
  tree 0662dca7      blob: be1b57ea index.html | tree: 2d21ba18 css
  tree 2d21ba18      …
  blob be1b57ea      Git …
  blob 7041879e      # Git The stupid content tracker.
  blob bc912988      # Git The dumb content tracker.

  reference master   c816ef7e
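The second commit only adds the objects that changed: a new README blob, a new root tree, and the commit itself. The public tree is reused untouched. Below is a toy, in-memory sketch of that idea. The `ToyStore` class and its serialization are assumptions made for illustration; git's real tree and commit formats are different.

```ruby
require 'digest/sha1'

# Toy content-addressed object store, loosely modeled on .git/objects/.
# The serialization is a simplification, not git's real on-disk format.
class ToyStore
  attr_reader :objects

  def initialize
    @objects = {}
  end

  # Store content under the SHA-1 of "<type> <content>" and return the ID.
  # Writing identical content twice stores it only once.
  def write(type, content)
    id = Digest::SHA1.hexdigest("#{type} #{content}")
    @objects[id] ||= [type, content]
    id
  end

  def write_blob(data)
    write('blob', data)
  end

  # entries maps a file or directory name to the ID of a blob or tree.
  def write_tree(entries)
    write('tree', entries.sort.map { |name, id| "#{id} #{name}" }.join("\n"))
  end

  def write_commit(tree, parent, message)
    write('commit', "tree #{tree}\nparent #{parent || 'nil'}\n\n#{message}")
  end
end

store = ToyStore.new

# First commit: README.md plus a public/ directory.
public_tree = store.write_tree({ 'index.html' => store.write_blob('Git ...') })
readme_v1 = store.write_blob("# Git\nThe stupid content tracker.")
root_v1 = store.write_tree({ 'README.md' => readme_v1, 'public' => public_tree })
commit_v1 = store.write_commit(root_v1, nil, 'Initial commit')
count_v1 = store.objects.size

# Second commit: only README.md changes, so public/ is reused as-is.
readme_v2 = store.write_blob("# Git\nThe dumb content tracker.")
root_v2 = store.write_tree({ 'README.md' => readme_v2, 'public' => public_tree })
commit_v2 = store.write_commit(root_v2, commit_v1, 'Update README')

puts store.objects.size - count_v1
# → 3
```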

Slide 31

Slide 31 text

  tree 1002d7b0      blob: bc912988 README.md | tree: 0662dca7 public
  blob 7041879e      # Git The stupid content tracker.
  blob bc912988      # Git The dumb content tracker.

Slide 32

Slide 32 text

  commit c816ef7e    tree: 1002d7b0 | parent: c67d5118 | author: Brandon | message: Initial commit
  tree 1002d7b0      blob: bc912988 README.md | tree: 0662dca7 public
  blob 7041879e      # Git The stupid content tracker.
  blob bc912988      # Git The dumb content tracker.

Slide 33

Slide 33 text

  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public
  tree 1002d7b0      blob: bc912988 README.md | tree: 0662dca7 public
  tree 0662dca7      blob: be1b57ea index.html | tree: 2d21ba18 css

Slide 34

Slide 34 text

  commit c816ef7e    tree: 1002d7b0 | parent: c67d5118 | author: Brandon | message: Initial commit
  tree 1002d7b0      blob: bc912988 README.md | tree: 0662dca7 public

  reference master   c816ef7e

Slide 35

Slide 35 text

  commit c67d5118    tree: be1b57ea | parent: nil | author: Brandon | message: Initial commit
  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public
  commit c816ef7e    tree: 1002d7b0 | parent: c67d5118 | author: Brandon | message: Initial commit
  tree 1002d7b0      blob: bc912988 README.md | tree: 0662dca7 public

  reference master   c816ef7e

Slide 36

Slide 36 text

  commit c67d5118    tree: be1b57ea | parent: nil | author: Brandon | message: Initial commit
  tree be1b57ea      blob: 7041879e README.md | tree: 0662dca7 public
  commit c816ef7e    tree: 1002d7b0 | parent: c67d5118 | author: Brandon | message: Initial commit
  tree 1002d7b0      blob: bc912988 README.md | tree: 0662dca7 public
  tree 0662dca7      blob: be1b57ea index.html | tree: 2d21ba18 css

  reference master   c816ef7e

Slide 37

Slide 37 text

talking to git, programmatically.

Slide 38

Slide 38 text

a few libraries

Grit
https://github.com/mojombo/grit

libgit2 (Ruby, .NET, PHP, Python, etc)
https://github.com/libgit2/libgit2

Slide 39

Slide 39 text

grit

    require 'grit'
    repo = Grit::Repo.new('.')

Slide 40

Slide 40 text

writing

Slide 41

Slide 41 text

writing

    # Get a new index that we can modify
    index = repo.index

Slide 42

Slide 42 text

writing

    # Get a new index that we can modify
    index = repo.index

    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

Slide 43

Slide 43 text

writing

    # Get a new index that we can modify
    index = repo.index

    # Get the current tree
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Make our changes
    index.add('1.json', '{"name": "Brandon"}')
    index.commit('Add user 1',
                 :parents => [head.commit], :head => 'master')

Slide 44

Slide 44 text

reading

    head = repo.get_head('master')
    blob = head.commit.tree / '1.json'
    blob.data

Slide 45

Slide 45 text

that seems like too much work. where’s my ORM?

Slide 46

Slide 46 text

a few libraries

Toystore + adapter-git
https://github.com/bkeepers/adapter-git

GitModel
https://github.com/pauldowman/gitmodel

Slide 47

Slide 47 text

Toystore + adapter-git

    class Issue
      include Toy::Store
      adapter :git, Grit::Repo.new(GIT_ROOT)

      attribute :description, String
      attribute :state, String, :default => 'open'
    end

Slide 48

Slide 48 text

Toystore + adapter-git

    Issue.create(:description => 'Store in Git')
    issue = Issue.get(id)
    issue.update_attributes(:state => 'in_progress')
    issue.destroy

Slide 49

Slide 49 text

features

Slide 50

Slide 50 text

versioning

Slide 51

Slide 51 text

diffs

Slide 52

Slide 52 text

hooks: update cache, alternate formats, or full-text indexes

Slide 53

Slide 53 text

non-relational: question everything about relational data design

Slide 54

Slide 54 text

no BDUF (big design up front): optimize storage based on usage patterns

Slide 55

Slide 55 text

schema-less: easily change data as the application evolves

Slide 56

Slide 56 text

    class User
      # …
      attribute :first_name, String
      attribute :last_name, String
    end

Slide 57

Slide 57 text

    class User
      # …
      # attribute :first_name, String
      # attribute :last_name, String
      attribute :name, String

      def name
        super || "#{self[:first_name]} #{self[:last_name]}"
      end
    end
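The same fallback can be tried without Toystore. This is a plain-Ruby sketch in which a Hash stands in for the stored document. The `UserDocument` class is hypothetical and is not part of Toystore or adapter-git.

```ruby
# Plain-Ruby stand-in for a schema-less document: attributes live in a Hash,
# so old records (first_name/last_name) and new ones (name) can coexist.
# UserDocument is a hypothetical class, not a Toystore API.
class UserDocument
  def initialize(attributes)
    @attributes = attributes
  end

  # Prefer the new field; fall back to the legacy pair for old records.
  def name
    @attributes['name'] ||
      [@attributes['first_name'], @attributes['last_name']].compact.join(' ')
  end
end

old_record = UserDocument.new({ 'first_name' => 'Brandon', 'last_name' => 'Keepers' })
new_record = UserDocument.new({ 'name' => 'Brandon Keepers' })

puts old_record.name == new_record.name
# → true
```

Old records never have to be rewritten in bulk; they can be upgraded lazily the next time they are saved.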

Slide 58

Slide 58 text

transactions: a commit can contain many changes

Slide 59

Slide 59 text

long-lived transactions

    $ git checkout -b transaction
    …
    $ git checkout master
    $ git merge transaction

Slide 60

Slide 60 text

replication: every clone contains a full copy

Slide 61

Slide 61 text

add replica

    $ git remote add replica1 [email protected]:app.git
    $ cat .git/hooks/post-commit
    #!/bin/sh
    git push replica1

Slide 62

Slide 62 text

anti-features

Slide 63

Slide 63 text

all the features that make a great DB: yeah, git doesn’t have those.

Slide 64

Slide 64 text

querying: you can just find it yourself

Slide 65

Slide 65 text

concurrency: why would you want… oooh

Slide 66

Slide 66 text

concurrency

Slide 67

Slide 67 text

concurrency

    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

Slide 68

Slide 68 text

concurrency

    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Nobody changed anything, right?

Slide 69

Slide 69 text

concurrency

    index = repo.index
    head = repo.get_head('master')
    index.current_tree = head.commit.tree

    # Nobody changed anything, right?

    index.commit('...', :parents => [head.commit], :head => 'master')

Slide 70

Slide 70 text

concurrency

    Lockfile.new('refs/heads/master.lock').lock do
      index = repo.index
      head = repo.get_head('master')
      index.current_tree = head.commit.tree
      index.commit('...', :parents => [head.commit], :head => 'master')
    end
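What a lock file buys you can be sketched with the standard library alone. The `with_lockfile` helper below is a hypothetical stand-in for the `Lockfile` class on the slide, not its actual implementation. It relies on exclusive file creation, so a second writer fails instead of silently committing on top of a stale head.

```ruby
require 'tmpdir'

# Run the block while holding an exclusive lock file at `path`.
# File::EXCL makes creation fail with Errno::EEXIST if the file already
# exists, so only one writer at a time gets past this point.
# (with_lockfile is a hypothetical helper, not the Lockfile gem's API.)
def with_lockfile(path)
  file = File.open(path, File::WRONLY | File::CREAT | File::EXCL)
  begin
    yield
  ensure
    file.close
    File.delete(path)
  end
end

Dir.mktmpdir do |dir|
  lock = File.join(dir, 'master.lock')

  with_lockfile(lock) do
    begin
      # A second writer arrives while the first still holds the lock.
      with_lockfile(lock) { puts 'should not happen' }
    rescue Errno::EEXIST
      puts 'second writer rejected'
    end
  end

  puts File.exist?(lock)
end
# prints "second writer rejected", then "false"
```

A real writer would retry or give up after the rejection. This sketch only shows the mutual exclusion and that the lock is cleaned up afterwards.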

Slide 71

Slide 71 text

merge conflicts

    $ git merge branch
    Auto-merging 0/0/411460f7c92d2124a67ea0f4cb5f85
    CONFLICT (content): Merge conflict in 0/0/411460f7c92d2124a67ea0f4cb5f85
    Automatic merge failed; fix conflicts and then commit the result.

Slide 72

Slide 72 text

git is not web scale

Slide 73

Slide 73 text

hard write limit

    $ ruby commits_per_second.rb
    97.6538648174529 Commits/Second

Slide 74

Slide 74 text

paths matter

    $ ruby commits_per_second.rb --keys 1000
    14.083 Commits/Second
    $ ls | head -n 2
    00411460f7c92d2124a67ea0f4cb5f85
    006f52e9102a8d3be2fe5614f42ba989
    $ ls | wc -l
    1000

Slide 75

Slide 75 text

Nest files in directories

    $ ruby commits_per_second.rb --keys 1000 --type nested
    67.117 Commits/Second
    $ tree
    ├── 0
    │   ├── 0
    │   │   ├── 411460f7c92d2124a67ea0f4cb5f85
    │   │   ├── 6f52e9102a8d3be2fe5614f42ba989
    │   │   ├── ac8ed3b4327bdd4ebbebcb2ba10a00
    │   │   └── ec53c4682d36f5c4359f4ae7bd7ba1
    │   ├── 1
    │   │   ├── 161aaa0b6d1345dd8fe4e481144d84
    │   │   ├── 386bd6d8e091c2ab4c7c7de644d37b
    │   │   ├── 3a006f03dbc5392effeb8f18fda755
    │   │   ├── 3d407166ec4fa56eb1e1f8cbe183b9
    │   │   ├── 882513d5fa7c329e940dda99b12147
    │   │   ├── 9d385eb67632a7e958e23f24bd07d7
    │   │   └── f78be6f7cad02658508fe4616098a9
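The mapping from a flat key to a nested path might look like the sketch below. The `nested_path` helper is hypothetical, and the actual benchmark script may do this differently. Presumably nesting helps because each commit then rewrites a few small tree objects rather than one tree with 1000 entries.

```ruby
# Split a flat key into nested directories, one leading character per level.
# (nested_path is a hypothetical helper; the benchmark script may differ.)
def nested_path(key, depth = 2)
  raise ArgumentError, 'key too short' if key.length <= depth
  (key[0, depth].chars + [key[depth..-1]]).join('/')
end

puts nested_path('00411460f7c92d2124a67ea0f4cb5f85')
# → 0/0/411460f7c92d2124a67ea0f4cb5f85
```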

Slide 76

Slide 76 text

large repositories with long and storied histories.

Slide 77

Slide 77 text

git at Facebook

From: Joshua Redstone (fb.com)
Subject: Git performance results on a large repository
Date: 2012-02-03 14:20:06 GMT

Hi Git folks,

We (Facebook) have been investigating source control systems to meet our growing needs. We already use git fairly widely, but have noticed it getting slower as we grow, and we want to make sure we have a good story going forward. We're debating how to proceed and would like to solicit people's thoughts.

To better understand git scalability, I've built up a large, synthetic repository and measured a few git operations on it. I summarize the results here.

The test repo has 4 million commits, linear history and about 1.3 million files. The size of the .git directory is about 15GB, and has been repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100 --window=250'. This repack took about 2 days on a beefy machine (I.e., lots of ram and flash). The size of the index file is 191 MB. I can share the script that generated it if people are interested - It basically picks 2-5 files, modifies a line or two and adds a few lines at the end consisting of random dictionary words, occasionally creates a new file, commits all the modifications and repeats.

I timed a few common operations with both a warm OS file cache and a cold cache. i.e., I did a 'echo 3 | tee /proc/sys/vm/drop_caches' and then did the operation in question a few times (first timing is the cold timing, the next few are the warm timings). The following results are on a server with average hard drive (I.e., not flash) and > 10GB of ram.

http://thread.gmane.org/gmane.comp.version-control.git/189776

Slide 78

Slide 78 text

git at Facebook

4 million commits
1.3 million files
15 GB

http://thread.gmane.org/gmane.comp.version-control.git/189776

Slide 79

Slide 79 text

git at Facebook

4 million commits
1.3 million files
15 GB

git add: 7 seconds
git status: 39 minutes
git commit: 41 minutes

http://thread.gmane.org/gmane.comp.version-control.git/189776

Slide 80

Slide 80 text

if git doesn’t scale, then how does GitHub scale?

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

smoke: grit, in the cloud.

Slide 83

Slide 83 text

github.com router file servers rpc

Slide 84

Slide 84 text

the write limit is per repo, but we have many repos on many disks.

Slide 85

Slide 85 text

Invocation

Slide 86

Slide 86 text

use cases where git would make a good database.

Slide 87

Slide 87 text

content heavy: CMS, translations, wikis

Slide 88

Slide 88 text

partitionable: GitHub, project management

Slide 89

Slide 89 text

offline

Slide 90

Slide 90 text

some examples:

madrox
github.com/technoweenie/madrox

gollum
github.com/github/gollum

gaskit
github.com/bkeepers/gaskit

Slide 91

Slide 91 text

abuse your tools and imagine how to make them better

Slide 92

Slide 92 text

credits & references

Talk by Rick Olson
http://git-nosql-rubyconf.heroku.com

Peepcode: Git Internals
https://peepcode.com/products/git-internals-pdf

Slide 93

Slide 93 text

credits & references

Slide 94

Slide 94 text

credits & references

Slide 95

Slide 95 text

credits & references

Slide 96

Slide 96 text

questions?

@bkeepers
github.com/bkeepers
speakerdeck.com/bkeepers/git-the-no-sql-database