Slide 1

Slide 1 text

Not only SQL Mårten Gustafson http://marten.gustafson.pp.se/ Qbranch CODE tech-meet @ 2010-04-14 Thursday, April 15, 2010

Slide 2

Slide 2 text

What?
“NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases” - Wikipedia

Slide 3

Slide 3 text

What?
“NoSQL is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases” - Wikipedia
Not a single technique
Not a single type of data
Not a single type of use case

Slide 4

Slide 4 text

Why?
• Non-relational
• Schema-less
• “Easily” scalable
• REST/JSON API = web friendly

Slide 5

Slide 5 text

What’s out there?
Name            Storage type    License                   Implemented in
Amazon Dynamo   Key/Value       n/a                       ?
Cassandra       Column family   ASL 2.0                   Java
CouchDB         Document        ASL 2.0                   Erlang
Dynomite        Key/Value       BSD/MIT-style             Erlang
HBase           Column family   ASL 2.0                   Java
MongoDB         Document        AGPL v3.0                 C++
Neo4J           Graph           AGPL v3.0 / Commercial    Java
Riak            Key/Value       ASL 2.0                   Erlang
Redis           Key/Value       BSD/MIT-style             C
Scalaris        Key/Value       ASL 2.0                   Erlang
Tokyo Cabinet   Key/Value       LGPL                      C
Voldemort       Key/Value       ASL 2.0                   Java

Slide 6

Slide 6 text

Distribution
• Master / Slave
• Master / Slave(s)
• Masterless (Master / Master)

Slide 7

Slide 7 text

Distribution
Columns: Masterless, Master/Slave, Hot standby
Amazon Dynamo   X
Cassandra       X
CouchDB         X
Dynomite        X
HBase           ?
MongoDB         X X
Neo4J*
Riak            X
Redis           X
Scalaris        X
Tokyo Cabinet
Voldemort       X
* Neo4J HA coming “soon”

Slide 8

Slide 8 text

Common factor
“...of the web...”
Of the who?!

Slide 9

Slide 9 text

Of the web
“...Django may be built for the Web, but CouchDB is built of the Web. I’ve never seen software that so completely embraces the philosophies behind HTTP. CouchDB makes Django look old-school in the same way that Django makes ASP look outdated” - http://jacobian.org/writing/of-the-web/

Slide 10

Slide 10 text

Of the web
“...CouchDB may succeed, and it may fail; who knows. I’m sure of one thing, though: this is what the software of the future looks like” - http://jacobian.org/writing/of-the-web/

Slide 11

Slide 11 text

So freakin’ what?!
All your webish skillz and tools apply...

Slide 12

Slide 12 text

So freakin’ what?!
All your webish skillz and tools apply...
• proxies
• load balancers
• caches
• HTTP client libs (etag, if-modified-since, etc)
• language-, platform- and OS-neutral
• MIME / Content-Type
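To make the "etag" point concrete, here is a minimal in-memory sketch of a conditional GET. It is not any particular database's API: `handleGet` and the shape of the `resource` object are hypothetical names made up for illustration. The only thing taken from HTTP itself is that a client sends back a previously seen ETag in the `If-None-Match` header and gets a 304 when nothing changed.

```javascript
// Hypothetical server-side handler: decides between 200 and 304.
// `resource` is an assumed shape: { etag, contentType, body }.
function handleGet(resource, requestHeaders) {
  // The client echoes the ETag it already holds in If-None-Match.
  if (requestHeaders["if-none-match"] === resource.etag) {
    // Nothing changed: no body is sent, the cached copy stays valid.
    return { status: 304, headers: { etag: resource.etag }, body: null };
  }
  // First request, or the cached copy is stale: send the full body.
  return {
    status: 200,
    headers: { etag: resource.etag, "content-type": resource.contentType },
    body: resource.body,
  };
}

// Usage: the second request reuses the ETag from the first response.
const doc = { etag: '"rev-1"', contentType: "application/json", body: "{}" };
const firstResponse = handleGet(doc, {});
const secondResponse = handleGet(doc, {
  "if-none-match": firstResponse.headers.etag,
});
console.log(firstResponse.status, secondResponse.status);
```

Any cache, proxy or HTTP client library that already understands this exchange can sit in front of an HTTP-speaking data store without knowing anything about the store.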

Slide 13

Slide 13 text

These guys can just suck it
HTTP/REST is integration that works (YMMV)

Slide 14

Slide 14 text

Buckle up Dorothy, 'cause Kansas is going bye-bye

Slide 15

Slide 15 text

I got keys but no locks

Slide 16

Slide 16 text

Riak
Decentralized key-value store
A flexible map/reduce engine
HTTP/JSON API
A database ideally suited for Web applications

Slide 17

Slide 17 text

The Ring

Slide 18

Slide 18 text

The Ring
ring size = 12
Partitions: 1 2 3 4 5 6 7 8 9 10 11 12

Slide 19

Slide 19 text

The Ring
One Ring size to rule them all,
One Ring size to find them,
One Ring size to bring them all
and in the cluster bind them...

Slide 20

Slide 20 text

Consistent Hashing
Store/Save (PUT)

Slide 22

Slide 22 text

Consistent Hashing
Read (GET)
“I want [key]” is answered by: where is [key] on the ring?
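The PUT and GET slides both reduce to the same question: which slice of the ring does a key hash into? A rough sketch of that lookup follows, using the 12-slice ring from the earlier slide. The hash function (32-bit FNV-1a) is a stand-in picked to keep the example dependency-free, and `partitionFor`, `put` and `get` are hypothetical helper names, not Riak's API.

```javascript
// Sketch only: a fixed ring of 12 partitions, as drawn on the slide.
const RING_SIZE = 12;
const HASH_SPACE = 2 ** 32; // assumed 32-bit hash space for this example

// FNV-1a, used here purely as a small stand-in hash function.
function fnv1a(text) {
  let hash = 0x811c9dc5; // 32-bit FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, stay unsigned
  }
  return hash;
}

// The hash space is cut into RING_SIZE equal ranges, numbered 1..12.
function partitionFor(key) {
  return Math.floor((fnv1a(key) * RING_SIZE) / HASH_SPACE) + 1;
}

// Store and read both start by locating the key on the ring.
const slices = new Map(); // partition number -> Map of key -> value

function put(key, value) {
  const partition = partitionFor(key);
  if (!slices.has(partition)) slices.set(partition, new Map());
  slices.get(partition).set(key, value);
  return partition;
}

function get(key) {
  const slice = slices.get(partitionFor(key));
  return slice ? slice.get(key) : undefined;
}
```

Because the number of slices is fixed up front, adding or removing a server only changes who owns which slice; the key-to-slice mapping itself stays the same.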

Slide 24

Slide 24 text

Cluster
Instance A, Instance B, Instance C
ring size = 12
instances = 3
ring size / instances ≈ slices per instance (12 / 3 = 4)

Slide 26

Slide 26 text

Cluster - Read (GET)
Instance A, Instance B, Instance C

Slide 27

Slide 27 text

Cluster - Read (GET)
Instance A, Instance B, Instance C
“I can haz [key]?”
“Hm, [key] lives in a slice of the ring owned by instance C.”

Slide 28

Slide 28 text

Cluster - Read (GET)
Instance A, Instance B, Instance C
“I can haz [key]?”
“Hey C! I need [key]”
“Okidoki, now where’s he... ah yeah, in my fourth slice”

Slide 29

Slide 29 text

Cluster - Read (GET)
Instance A, Instance B, Instance C
“I can haz [key]?”
“Here ya go”
“Cheers!”
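The read walkthrough above can be sketched as a tiny routing function. The round-robin assignment of slices to instances is an assumption made for this example (the slides only say each instance ends up with roughly ring size / instances slices), and `ownerOf`, `slicesOf` and `route` are hypothetical names.

```javascript
// Sketch only: 12 slices shared by three instances.
const RING_SIZE = 12;
const INSTANCES = ["A", "B", "C"];

// Assumed round-robin claim: A gets 1,4,7,10; B gets 2,5,8,11; C gets 3,6,9,12.
function ownerOf(partition) {
  return INSTANCES[(partition - 1) % INSTANCES.length];
}

// All slices claimed by one instance, in ring order.
function slicesOf(instance) {
  const owned = [];
  for (let p = 1; p <= RING_SIZE; p++) {
    if (ownerOf(p) === instance) owned.push(p);
  }
  return owned;
}

// Any instance may receive the request. It serves the key itself when it
// owns the slice, and otherwise asks the owner ("Hey C! I need ...").
function route(receivingInstance, partition) {
  const owner = ownerOf(partition);
  return { servedBy: owner, forwarded: owner !== receivingInstance };
}

// A request for a key in slice 12 arrives at instance A.
console.log(route("A", 12)); // forwarded to C, which finds it in its fourth slice
```

The client never needs to know the layout: it can talk to any instance and the cluster works out where the key lives.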

Slide 30

Slide 30 text

Riak “stuff”

Slide 43

Slide 43 text

Riak “stuff”
Bucket: Container/keyspace. Determines number of replicas for its contents
Consistent Hashing: Key hashing technique used to distribute keys on the ring
Gossiping: Shares state, bucket and ring knowledge in the cluster
Hinted Handoff: Covering for a failed “neighbor” node while it is gone
Links: Allows retrieval of “weakly” linked objects
Merkle Tree: Data structure for efficient summary about keys. Gossiped.
Node: One server. Runs vnodes which claim partitions.
Partition: One slice (part) of the ring.
Read Repair: Auto correction of out-of-date objects
Replica: Number of copies of the same object in the cluster
Ring: The complete “space”, divided into partitions which are claimed by vnodes
Vector Clock: Version control technique used for objects.
Vnode: Runs in a node and claims one partition in the ring
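Of the terms above, the vector clock is probably the least self-explanatory, so here is a rough sketch of the idea. It is a textbook-style illustration, not Riak's implementation: clocks are plain objects mapping an actor name to a counter, and `increment`, `descends` and `compare` are hypothetical helper names.

```javascript
// A clock is a plain object: actor name -> number of updates by that actor.
function increment(clock, actor) {
  return { ...clock, [actor]: (clock[actor] || 0) + 1 };
}

// True when clock `a` has seen every update that clock `b` has seen.
function descends(a, b) {
  return Object.keys(b).every((actor) => (a[actor] || 0) >= b[actor]);
}

// Decide how two versions of the same object relate to each other.
function compare(a, b) {
  const aCoversB = descends(a, b);
  const bCoversA = descends(b, a);
  if (aCoversB && bCoversA) return "equal";
  if (aCoversB) return "a is newer";
  if (bCoversA) return "b is newer";
  return "conflict"; // concurrent updates: neither version supersedes the other
}

// Two clients start from the same version and update it independently.
const original = increment({}, "client-x");
const editByX = increment(original, "client-x");
const editByY = increment(original, "client-y");
console.log(compare(editByX, original)); // one version clearly follows the other
console.log(compare(editByX, editByY)); // neither follows the other
```

This is what lets read repair tell an out-of-date replica (safe to overwrite) apart from a genuinely conflicting one (needs resolution).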

Slide 44

Slide 44 text

Riak - Takeaways
• No single point of failure
• Choose your levels for:
  • availability
  • consistency
  • partition tolerance
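One common way Dynamo-style stores expose "choose your levels" is through three numbers: N copies of each object, R replies needed for a read, and W replies needed for a write. The slide does not spell these out, so treat the sketch below as an assumption about what the knobs look like; `describeQuorum` and its result fields are hypothetical names. The arithmetic itself is standard: a read is guaranteed to overlap the latest successful write when R + W > N.

```javascript
// Sketch only: what a given N/R/W choice trades off.
function describeQuorum(n, r, w) {
  if (r < 1 || w < 1 || r > n || w > n) {
    throw new RangeError("r and w must be between 1 and n");
  }
  return {
    // At least one replica answering a read also took the latest write.
    readsOverlapLatestWrite: r + w > n,
    // Replies a request can do without and still succeed.
    missingRepliesToleratedOnRead: n - r,
    missingRepliesToleratedOnWrite: n - w,
  };
}

// Leaning toward consistency versus leaning toward availability.
console.log(describeQuorum(3, 2, 2));
console.log(describeQuorum(3, 1, 1));
```

Lower R and W keep the store answering when replicas are unreachable; higher values make stale reads less likely.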

Slide 45

Slide 45 text

But wait, there’s more...
• Binary data + Content-Type = whatever
  • MP3’s, Images, Text, ...
• Map/Reduce
  • Local data, parallel
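"Local data, parallel" can be pictured as follows: the map step runs next to each slice of data, and only the small intermediate results travel to the reduce step. The sketch below fakes three partitions with arrays; the data, `mapPhase` and `reducePhase` are made up for illustration and do not reflect Riak's actual job format.

```javascript
// Each inner array stands for the objects stored in one partition.
const partitions = [
  [{ type: "mp3" }, { type: "image" }],
  [{ type: "text" }, { type: "mp3" }],
  [{ type: "mp3" }],
];

// Map phase: runs where the data lives, one partition at a time.
function mapPhase(objects) {
  return objects.map((obj) => [obj.type, 1]);
}

// Reduce phase: combines the partial results from every partition.
function reducePhase(pairs) {
  const totals = {};
  for (const [key, count] of pairs) {
    totals[key] = (totals[key] || 0) + count;
  }
  return totals;
}

// The per-partition map calls are independent, so they could run in parallel.
const mapped = partitions.map(mapPhase);
const totals = reducePhase(mapped.flat());
console.log(totals);
```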

Slide 46

Slide 46 text

This slide intentionally left blank

Slide 47

Slide 47 text

Document Store
Relax

Slide 48

Slide 48 text

CouchDB
Document-oriented database
Kick ass replication
HTTP/JSON API
Map/reduce view (index) definitions

Slide 49

Slide 49 text

World view
One document == JSON
One document == One record
Many Documents == One database
No schema

Slide 50

Slide 50 text

A document
{
  "_id": "b098445d587b1f347e48e1a79301de02",
  "_rev": "1-80bfd8302e0f08eec2396c8107cafc19",
  "platform": {
    "browser": "mozilla",
    "version": "1.9.1.8"
  },
  "timestamp": 1270131033337
}
_id: Key, either you choose it or CouchDB does it for you
_rev: Revision number

Slide 51

Slide 51 text

Views
Filter
Collate
Aggregate

Slide 52

Slide 52 text

Views
{
  "_id": "b098445d587b1f347e48e1a79301de02",
  "_rev": "1-80bfd8302e0f08eec2396c8107cafc19",
  "platform": {
    "browser": "mozilla",
    "version": "1.9.1.8"
  },
  "timestamp": 1270131033337
}
+
function(doc) {
  emit(doc.platform.browser, doc.platform.version);
}
=
{
  "total_rows": 58,
  "offset": 0,
  "rows": [
    {
      "id": "b098445d587b1f347e48e1a79301de02",
      "key": "mozilla",
      "value": "1.9.1.8"
    }
  ]
}
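To see how a map function turns documents into view rows, the following sketch runs one over a handful of in-memory documents. It is a simulation, not CouchDB: `buildView` is a hypothetical helper, the sample documents are invented, and `emit` is passed in as a parameter here whereas CouchDB provides it to the map function as a global. The map function reads both browser and version from `doc.platform`, matching the document layout shown on the slide.

```javascript
// Invented sample documents; only the field layout follows the slide.
const docs = [
  { _id: "doc-1", platform: { browser: "mozilla", version: "1.9.1.8" } },
  { _id: "doc-2", platform: { browser: "webkit", version: "532.9" } },
  { _id: "doc-3" }, // no platform field, so the map function skips it
];

// Hypothetical stand-in for the view engine: run the map function over
// every document, collect what it emits, and order the rows by key.
function buildView(documents, mapFn) {
  const rows = [];
  for (const doc of documents) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return { total_rows: rows.length, offset: 0, rows };
}

// Filter (skip documents without a platform) and collate (key by browser).
const byBrowser = buildView(docs, (doc, emit) => {
  if (doc.platform) emit(doc.platform.browser, doc.platform.version);
});
console.log(JSON.stringify(byBrowser, null, 2));
```

Guarding on `doc.platform` matters in a schema-less store: nothing forces every document to carry the field the view is keyed on.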

Slide 53

Slide 53 text

Views
Views are stored on disk as an accessible web resource, incrementally updated, and replicated with the database

Slide 54

Slide 54 text

Replication
Peer to peer
Online/Offline
Conflict detection and resolution
Any number of nodes
Local / Remote

Slide 55

Slide 55 text

Replication

Slide 60

Slide 60 text

CouchDB “stuff”

Slide 65

Slide 65 text

CouchDB “stuff”
Append only: Hence, won’t corrupt its data files
MVCC: Multi version concurrency control. Writers do not block readers. Readers do not block writers.
BDCRR: Bi-directional, conflict resolving, replication
Compaction: Append only will cause data files to grow. Compaction to the rescue, in the background - for your pleasure.
ACID: Awesome, Cool, Impressive, Dope
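Append-only storage, revision checks and compaction fit together in a few lines. The class below is a toy model, not CouchDB's file format: `AppendOnlyStore` is a hypothetical name, the log is an in-memory array, and revisions are plain integers (CouchDB's real revisions look like "1-80bf..."). It only shows the shape of the idea: updates never overwrite, a writer must name the revision it is updating, and compaction drops superseded entries.

```javascript
class AppendOnlyStore {
  constructor() {
    this.log = []; // every write is appended; nothing is edited in place
  }

  // Latest entry for an id, found by scanning back from the end of the log.
  get(id) {
    for (let i = this.log.length - 1; i >= 0; i--) {
      if (this.log[i].id === id) return this.log[i];
    }
    return undefined;
  }

  // The writer states which revision it is updating (0 means "new document").
  // A mismatch means someone else wrote first, so the write is rejected.
  put(id, body, expectedRev = 0) {
    const current = this.get(id);
    const currentRev = current ? current.rev : 0;
    if (expectedRev !== currentRev) return { ok: false, error: "conflict" };
    const entry = { id, rev: currentRev + 1, body };
    this.log.push(entry);
    return { ok: true, rev: entry.rev };
  }

  // Keep only the newest entry per id so the log stops growing forever.
  compact() {
    const latest = new Map();
    for (const entry of this.log) latest.set(entry.id, entry);
    this.log = [...latest.values()];
  }
}

const store = new AppendOnlyStore();
const created = store.put("a", { v: 1 });
const updated = store.put("a", { v: 2 }, created.rev);
const rejected = store.put("a", { v: 3 }, created.rev); // based on a stale revision
store.compact();
```

Readers always see a complete older entry while a writer appends a new one, which is the intuition behind "writers do not block readers".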

Slide 66

Slide 66 text

CouchDB - Takeaways
• Kick ass replication
• Views are fast
• Can host and serve complete webapps

Slide 67

Slide 67 text

Outro
• Test one or more NoSQL thingys
• Get familiar with Brewer’s CAP theorem
• Get familiar with the Dynamo paper

Slide 68

Slide 68 text

Over and out.
Mårten Gustafson
@martengustafson
http://marten.gustafson.pp.se/