
NoSQL and Cassandra



Ran Tavory

May 01, 2012


Transcript

  1. SQL is good
    • Rich language
    • Easy to use and integrate
    • Rich toolset
    • Many vendors
    • The promise: ACID
      ◦ Atomicity
      ◦ Consistency
      ◦ Isolation
      ◦ Durability
  2. BUT

  3. The Challenge: Modern web apps
    • Internet-scale data size
    • High read-write rates
    • Frequent schema changes
    • "Social" apps, not banks
      ◦ They don't need the same level of ACID
    SCALING
  4. CAP

  5. Taxonomy of NoSQL data stores
    • Document-oriented
      ◦ CouchDB, MongoDB, Lotus Notes, SimpleDB, Orient
    • Key-value
      ◦ Voldemort, Dynamo, Riak (sort of), Redis, Tokyo
    • Column
      ◦ Cassandra, HBase, BigTable
    • Graph databases
      ◦ Neo4j, FlockDB, DEX, AllegroGraph
    http://en.wikipedia.org/wiki/NoSQL
  6. Cassandra
    • Developed at Facebook
    • Follows the BigTable data model: column-oriented
    • Follows the Dynamo eventual consistency model
    • Open-sourced at Apache
    • Implemented in Java
  7. N/R/W
    • N - Number of replicas (nodes) for any data item
    • W - Number of nodes a write operation blocks on
    • R - Number of nodes a read operation blocks on
    CONSISTENCY DOWN TO EARTH
  8. N/R/W - Typical Values
    • W=1 => Block until the first node is written successfully
    • W=N => Block until all nodes are written successfully
    • W=0 => Async writes
    • R=1 => Block until the first node returns an answer
    • R=N => Block until all nodes return an answer
    • R=0 => Doesn't make sense
    • QUORUM:
      ◦ R = N/2+1
      ◦ W = N/2+1
      ◦ => Fully consistent, because R + W > N, so every read set overlaps the most recent write set
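The quorum arithmetic above can be checked with a few lines of Java. This is a minimal sketch, not Cassandra code: the class and method names are made up for illustration, and it only encodes the overlap rule R + W > N with integer division for N/2.

```java
// Hypothetical helper for illustration; not part of Cassandra or Hector.
final class QuorumSketch {

    /** Quorum size for n replicas: floor(n / 2) + 1. */
    static int quorum(int n) {
        return n / 2 + 1;
    }

    /** A read is guaranteed to touch at least one up-to-date replica when r + w > n. */
    static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("quorum for N=3: " + q);
        System.out.println("R=Q, W=Q overlaps: " + readsSeeLatestWrite(n, q, q));
        System.out.println("R=1, W=1 overlaps: " + readsSeeLatestWrite(n, 1, 1));
        System.out.println("R=1, W=N overlaps: " + readsSeeLatestWrite(n, 1, n));
    }
}
```

With N=3 the quorum is 2, so quorum reads and writes always share at least one replica, while R=1 with W=1 can miss the latest write.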
  9. Data Model - Vocabulary
    • Keyspace – like a namespace for unique keys.
    • Column Family – very much like a table… but not quite.
    • Key – a key that represents a row (of columns)
    • Column – representation of a value with:
      ◦ Column name
      ◦ Value
      ◦ Timestamp
    • Super Column – a column that holds a list of columns inside
  10. Data Model - Columns
    struct Column {
      1: required binary name,
      2: optional binary value,
      3: optional i64 timestamp,
      4: optional i32 ttl,
    }
    JSON-ish notation:
    {
      "name": "emailAddress",
      "value": "[email protected]",
      "timestamp": 123456789
    }
  11. Data Model - Column Family
    • Similar to SQL tables
    • Has many columns
    • Has many rows
  12. Data Model - Rows
    • Primary key for objects
    • All keys are arbitrary-length binaries
    Users:                                  CF
      ran:                                  ROW
        emailAddress: [email protected],      COLUMN
        webSite: http://bar.com             COLUMN
      f.rat:                                ROW
        emailAddress: [email protected]       COLUMN
    Stats:                                  CF
      ran:                                  ROW
        visits: 243                         COLUMN
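One way to picture the vocabulary above is as nested sorted maps: a column family maps row keys to rows, and each row maps column names to columns carrying a value and a timestamp. The sketch below is an in-memory analogy only. The class names are invented, strings stand in for the binary names and values, and "highest timestamp wins" is the only conflict rule modeled.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative in-memory analogy of one column family; not Cassandra's storage code.
final class DataModelSketch {

    /** A column: name, value and write timestamp. */
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Row key -> (column name -> column); TreeMap keeps columns sorted by name.
    private final Map<String, TreeMap<String, Column>> rows = new TreeMap<>();

    /** Stores the column unless the row already holds a newer version of it. */
    void insert(String rowKey, Column column) {
        TreeMap<String, Column> row = rows.computeIfAbsent(rowKey, k -> new TreeMap<>());
        Column existing = row.get(column.name);
        if (existing == null || column.timestamp >= existing.timestamp) {
            row.put(column.name, column);
        }
    }

    /** Returns the column value, or null when the row or column is missing. */
    String get(String rowKey, String columnName) {
        TreeMap<String, Column> row = rows.get(rowKey);
        if (row == null) {
            return null;
        }
        Column column = row.get(columnName);
        return column == null ? null : column.value;
    }

    public static void main(String[] args) {
        DataModelSketch users = new DataModelSketch();
        users.insert("ran", new Column("webSite", "http://bar.com", 1L));
        System.out.println(users.get("ran", "webSite"));
    }
}
```

Rows need not share the same columns, which is where the analogy with SQL tables stops.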
  13. Data Model - Songs example
    Songs:
      Meir Ariel:
        Shir Keev: 6:13,
        Tikva: 4:11,
        Erol: 6:17
        Suetz: 5:30
        Dr Hitchakmut: 3:30
      Mashina:
        Rakevet Layla: 3:02
        Optikai: 5:40
  14. Data Model - Super Columns
    Songs:
      Meir Ariel:
        Shirey Hag:
          Shir Keev: 6:13,
          Tikva: 4:11,
          Erol: 6:17
        Vegluy Eynaim:
          Suetz: 5:30
          Dr Hitchakmut: 3:30
      Mashina:
        ...
  15. The True API
    get(keyspace, key, column_path, consistency)
    get_slice(ks, key, column_parent, predicate, consistency)
    multiget(ks, keys, column_path, consistency)
    multiget_slice(ks, keys, column_parent, predicate, consistency)
    ...
  16. The API - CQL
    execute_cql_query
    cqlsh> SELECT key, state FROM users;
    cqlsh> INSERT INTO users (key, full_name, birth_date, state) VALUES ('bsanderson', 'Brandon Sanderson', 1975, 'UT');
  17. Consistency Model
    • N - per keyspace
    • R - per read request
    • W - per write request
  18. Java Code
    // Open a Thrift connection to a local node
    TTransport tr = new TSocket("localhost", 9160);
    TProtocol proto = new TBinaryProtocol(tr);
    Cassandra.Client client = new Cassandra.Client(proto);
    tr.open();

    String key_user_id = "1";
    long timestamp = System.currentTimeMillis();

    // Write column "name" in row "1" of Standard1, blocking on one replica (W=1)
    client.insert("Keyspace1",
                  key_user_id,
                  new ColumnPath("Standard1", null, "name".getBytes("UTF-8")),
                  "Chris Goffinet".getBytes("UTF-8"),
                  timestamp,
                  ConsistencyLevel.ONE);
  19. Java Client - Hector
    http://github.com/rantav/hector
    • The de facto Java client for Cassandra
    • Encapsulates Thrift
    • Adds JMX (monitoring)
    • Connection pooling
    • Failover
    • Open-sourced on GitHub, with a growing community of developers and users.
  20. Java Client - Hector - cont
    /**
     * Insert a new value keyed by key
     *
     * @param key   Key for the value
     * @param value the String value to insert
     */
    public void insert(final String key, final String value) {
      Mutator m = createMutator(keyspaceOperator);
      m.insert(key, CF_NAME, createColumn(COLUMN_NAME, value));
    }
  21. Java Client - Hector - cont
    /**
     * Get a string value.
     *
     * @return The string value; null if no value exists for the given key.
     */
    public String get(final String key) throws HectorException {
      ColumnQuery<String, String> q = createColumnQuery(keyspaceOperator, serializer, serializer);
      Result<HColumn<String, String>> r = q.setKey(key)
          .setName(COLUMN_NAME)
          .setColumnFamily(CF_NAME)
          .execute();
      HColumn<String, String> c = r.get();
      return c == null ? null : c.getValue();
    }
  22. Sorting
    Columns are sorted by name, according to the column family's comparator type
    • BytesType
    • UTF8Type
    • AsciiType
    • LongType
    • LexicalUUIDType
    • TimeUUIDType
    Rows are ordered by the partitioner
    • RandomPartitioner
    • OrderPreservingPartitioner
    • CollatingOrderPreservingPartitioner
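To make the row-ordering point concrete, here is a small sketch in the spirit of RandomPartitioner: it derives a token from the MD5 hash of the row key, so rows are ordered by token rather than by key. The class and method names are hypothetical, and the real partitioner involves more than this single hashing step.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: hashes a row key to a non-negative token.
final class PartitionerSketch {

    /** Token = absolute value of the key's MD5 digest read as a big integer. */
    static BigInteger md5Token(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to ship MD5, so this is unexpected.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Keys that sort next to each other usually get unrelated tokens.
        for (String key : new String[] {"f.rat", "ran", "ran2"}) {
            System.out.println(key + " -> " + md5Token(key));
        }
    }
}
```

Because tokens scatter keys evenly, load spreads across nodes, but range scans over row keys stop being meaningful. An order-preserving partitioner makes the opposite trade.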
  23. Thrift
    Cross-language protocol
    Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...
    struct UserProfile {
      1: i32 uid,
      2: string name,
      3: string blurb
    }
    service UserStorage {
      void store(1: UserProfile user),
      UserProfile retrieve(1: i32 uid)
    }
  24. Agenda
    • Background and history
    • Architectural Layers
    • Transport: Thrift
    • Write Path (and sstables, memtables)
    • Read Path
    • Compactions
    • Bloom Filters
    • Gossip
    • Deletions
    • More...
  25. From Dynamo:
    • Symmetric p2p architecture
    • Gossip-based discovery and error detection
    • Distributed key-value store
      ◦ Pluggable partitioning
      ◦ Pluggable topology discovery
    • Eventually consistent, and tunable per operation
  26. From BigTable
    • Sparse, column-oriented array
    • SSTable disk storage
      ◦ Append-only commit log
      ◦ Memtable (buffering and sorting)
      ◦ Immutable sstable files
      ◦ Compactions
      ◦ High write performance
  27. Architecture Layers
    • Cluster Management
      ◦ Messaging service
      ◦ Gossip
      ◦ Failure detection
      ◦ Cluster state
      ◦ Partitioner
      ◦ Replication
    • Single Host
      ◦ Commit log
      ◦ Memtable
      ◦ SSTable
      ◦ Indexes
      ◦ Compaction
  28. Memtables
    • In-memory representation of recently written data
    • When the memtable is full, it is sorted and then flushed to disk as an sstable
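The memtable-to-sstable handoff can be sketched as follows. This is a toy model under stated assumptions: an in-memory list stands in for the disk, the entry-count threshold `maxEntries` is invented, and a TreeMap keeps entries sorted at all times instead of sorting at flush time. It is not how Cassandra sizes or flushes memtables.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy write path: buffer writes in a sorted memtable, flush to an immutable "sstable".
final class MemtableSketch {
    private final int maxEntries; // hypothetical flush threshold
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<Map<String, String>> sstables = new ArrayList<>(); // stand-in for disk

    MemtableSketch(int maxEntries) {
        if (maxEntries <= 0) {
            throw new IllegalArgumentException("maxEntries must be positive");
        }
        this.maxEntries = maxEntries;
    }

    /** Writes go to memory only; no existing sstable is read or modified. */
    void write(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= maxEntries) {
            flush();
        }
    }

    /** Freezes the current memtable as a read-only sorted table and starts a new one. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableSortedMap(memtable));
        memtable = new TreeMap<>();
    }

    int sstableCount() {
        return sstables.size();
    }

    int memtableSize() {
        return memtable.size();
    }

    public static void main(String[] args) {
        MemtableSketch store = new MemtableSketch(2);
        store.write("ran", "243");
        store.write("f.rat", "17");
        System.out.println("sstables: " + store.sstableCount() + ", memtable: " + store.memtableSize());
    }
}
```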
  29. SSTables: Sorted String Tables
    • Immutable
    • On-disk
    • Sorted by a string key
    • In-memory index of elements
    • Binary search (in memory) to find element location
    • Bloom filter to reduce the number of unneeded binary searches.
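The in-memory index lookup described above is a plain binary search over sorted keys. The sketch below assumes a simplified index, one entry per key with a file offset. The class name and layout are invented for illustration and do not describe the actual on-disk format.

```java
import java.util.Arrays;

// Simplified in-memory index for one sstable: sorted keys plus matching file offsets.
final class SSTableIndexSketch {
    private final String[] sortedKeys;
    private final long[] offsets;

    /** Keys must already be sorted, as they are when a memtable is flushed. */
    SSTableIndexSketch(String[] sortedKeys, long[] offsets) {
        if (sortedKeys.length != offsets.length) {
            throw new IllegalArgumentException("keys and offsets must have the same length");
        }
        this.sortedKeys = sortedKeys.clone();
        this.offsets = offsets.clone();
    }

    /** Returns the file offset of the key, or -1 if this sstable does not contain it. */
    long offsetOf(String key) {
        int i = Arrays.binarySearch(sortedKeys, key);
        return i >= 0 ? offsets[i] : -1L;
    }

    public static void main(String[] args) {
        SSTableIndexSketch index = new SSTableIndexSketch(
                new String[] {"f.rat", "ran"}, new long[] {0L, 128L});
        System.out.println("ran is at offset " + index.offsetOf("ran"));
        System.out.println("nobody is at offset " + index.offsetOf("nobody"));
    }
}
```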
  30. Write Properties
    • No locks in the critical path
    • Always available for writes, even if there are failures.
    • No reads
    • No seeks
    • Fast
    • Atomic within a row
  31. Bloom Filters
    • Space-efficient probabilistic data structure
    • Tests whether an element is a member of a set
    • Allows false positives, but not false negatives
    • k hash functions
    • Union and intersection are implemented as bitwise OR, AND
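A minimal Bloom filter fits in a few dozen lines. The sketch below is an illustration, not Cassandra's implementation: the k bit positions are derived from `String.hashCode()` by double hashing, which is an assumption made to keep the example short, and the sizes are arbitrary.

```java
import java.util.BitSet;

// Minimal Bloom filter: k bit positions per item, no false negatives, some false positives.
final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.size = size;
        this.hashCount = hashCount;
        this.bits = new BitSet(size);
    }

    // Double hashing (an assumption for this sketch): position_i = h1 + i * h2, mod size.
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(item, i));
        }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    /** Union of two filters with identical parameters is a bitwise OR. */
    BloomFilterSketch union(BloomFilterSketch other) {
        if (other.size != size || other.hashCount != hashCount) {
            throw new IllegalArgumentException("filters must share size and hashCount");
        }
        BloomFilterSketch result = new BloomFilterSketch(size, hashCount);
        result.bits.or(bits);
        result.bits.or(other.bits);
        return result;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("ran");
        System.out.println("ran: " + filter.mightContain("ran"));
        System.out.println("f.rat: " + filter.mightContain("f.rat"));
    }
}
```

A `false` answer lets a read skip an sstable entirely. A `true` answer still needs the index lookup, because it may be a false positive.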
  32. Read Properties
    • Read multiple SSTables
    • Slower than writes (but still fast)
    • Seeks can be mitigated with more RAM
    • Uses probabilistic bloom filters to reduce lookups.
    • Extensive optional caching
      ◦ Key Cache
      ◦ Row Cache
      ◦ Excellent monitoring
  33. Gossip
    • p2p
    • Enables seamless addition of nodes.
    • Rebalancing of keys
    • Fast detection of nodes that go down.
    • Every node knows about all others; there is no master.
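The idea that every node ends up knowing about all others without a master can be illustrated with a generic gossip exchange. This is a sketch of the general technique, not Cassandra's gossiper: node names, the heartbeat counter and the merge rule (keep the highest counter per node) are all assumptions, and failure detection is left out.

```java
import java.util.HashMap;
import java.util.Map;

// Generic gossip sketch: nodes swap heartbeat tables and keep the highest counter per node.
final class GossipSketch {
    final String name;
    final Map<String, Long> heartbeats = new HashMap<>();

    GossipSketch(String name) {
        this.name = name;
        heartbeats.put(name, 0L);
    }

    /** Bumps this node's own heartbeat counter. */
    void beat() {
        heartbeats.merge(name, 1L, Long::sum);
    }

    /** One gossip round between two peers: both sides merge the other's view. */
    void gossipWith(GossipSketch peer) {
        mergeFrom(peer.heartbeats);
        peer.mergeFrom(heartbeats);
    }

    private void mergeFrom(Map<String, Long> remote) {
        for (Map.Entry<String, Long> entry : remote.entrySet()) {
            heartbeats.merge(entry.getKey(), entry.getValue(), Long::max);
        }
    }

    public static void main(String[] args) {
        GossipSketch a = new GossipSketch("a");
        GossipSketch b = new GossipSketch("b");
        GossipSketch c = new GossipSketch("c");
        a.beat();
        a.gossipWith(b); // b learns about a
        b.gossipWith(c); // c learns about a through b, without ever talking to a
        System.out.println("c knows: " + c.heartbeats.keySet());
    }
}
```

A node whose counter stops advancing in everyone's table is a candidate for being marked down, which is where a real failure detector would take over.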
  34. Deletions
    • A deletion marker (tombstone) is necessary to suppress data in older SSTables, until compaction
    • Read repair complicates things a little
    • Eventual consistency complicates things more
    • Solution: configurable delay before tombstone GC, after which tombstones are not repaired
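The role of the tombstone and of the GC delay can be sketched with a toy cell model. Everything here is illustrative: the names are invented, a `null` value marks a tombstone, timestamps are plain numbers, and the list of versions stands in for the copies of one column found in several sstables.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of deletes: a tombstone is a newer version that hides older values.
final class TombstoneSketch {

    static final class Cell {
        final String value; // null marks a tombstone
        final long timestamp;

        private Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        static Cell live(String value, long timestamp) {
            return new Cell(value, timestamp);
        }

        static Cell tombstone(long timestamp) {
            return new Cell(null, timestamp);
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    /** The version with the highest timestamp wins, whether live or tombstone. */
    static Cell newest(List<Cell> versions) {
        Cell best = null;
        for (Cell cell : versions) {
            if (best == null || cell.timestamp > best.timestamp) {
                best = cell;
            }
        }
        return best;
    }

    /** A read returns null when the newest version is a tombstone. */
    static String read(List<Cell> versions) {
        Cell cell = newest(versions);
        return (cell == null || cell.isTombstone()) ? null : cell.value;
    }

    /** Compaction keeps the newest version, dropping a tombstone only after the grace period. */
    static Cell compact(List<Cell> versions, long now, long gcGrace) {
        Cell cell = newest(versions);
        if (cell != null && cell.isTombstone() && now - cell.timestamp > gcGrace) {
            return null;
        }
        return cell;
    }

    public static void main(String[] args) {
        List<Cell> versions = Arrays.asList(Cell.live("[email protected]", 10L), Cell.tombstone(20L));
        System.out.println("read: " + read(versions));
        System.out.println("kept at t=25: " + (compact(versions, 25L, 10L) != null));
        System.out.println("kept at t=100: " + (compact(versions, 100L, 10L) != null));
    }
}
```

Dropping the tombstone too early would let an old value on a lagging replica reappear, which is why the delay is configurable rather than zero.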
  35. Extra: long list of subjects
    • SEDA (Staged Event-Driven Architecture)
    • Anti-entropy and Merkle trees
    • Hinted handoff
    • Read repair
  36. SEDA
    • Mutate
    • Stream
    • Gossip
    • Response
    • Anti Entropy
    • Load Balance
    • Migration
  37. References
    • http://horicky.blogspot.com/2009/11/nosql-patterns.html
    • http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
    • http://labs.google.com/papers/bigtable.html
    • http://bret.appspot.com/entry/how-friendfeed-uses-mysql
    • http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
    • http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
    • http://wiki.apache.org/cassandra/DataModel
    • http://incubator.apache.org/thrift/
    • http://www.eecs.harvard.edu/~mdw/papers/quals-seda.pdf