NoSQL and Cassandra

Ran Tavory

May 01, 2012

Transcript

  1. SQL is good
     • Rich language
     • Easy to use and integrate
     • Rich toolset
     • Many vendors
     • The promise: ACID
       ◦ Atomicity
       ◦ Consistency
       ◦ Isolation
       ◦ Durability
  2. BUT

  3. The Challenge: Modern web apps
     • Internet-scale data size
     • High read-write rates
     • Frequent schema changes
     • "Social" apps - not banks
       ◦ They don't need the same level of ACID
     SCALING
  4. CAP

  5. Taxonomy of NoSQL data stores
     • Document Oriented
       ◦ CouchDB, MongoDB, Lotus Notes, SimpleDB, Orient
     • Key-Value
       ◦ Voldemort, Dynamo, Riak (sort of), Redis, Tokyo
     • Column
       ◦ Cassandra, HBase, BigTable
     • Graph Databases
       ◦ Neo4j, FlockDB, DEX, AllegroGraph
     http://en.wikipedia.org/wiki/NoSQL
  6. Cassandra
     • Developed at Facebook
     • Follows the BigTable data model - column oriented
     • Follows the Dynamo eventual consistency model
     • Open-sourced at Apache
     • Implemented in Java
  7. N/R/W
     • N - Number of replicas (nodes) for any data item
     • W - Number of nodes a write operation blocks on
     • R - Number of nodes a read operation blocks on
     CONSISTENCY DOWN TO EARTH
  8. N/R/W - Typical Values
     • W=1 => Block until the first node is written successfully
     • W=N => Block until all nodes are written successfully
     • W=0 => Async writes
     • R=1 => Block until the first node returns an answer
     • R=N => Block until all nodes return an answer
     • R=0 => Doesn't make sense
     • QUORUM:
       ◦ R = N/2+1
       ◦ W = N/2+1
       ◦ => Fully consistent
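The QUORUM rule on this slide gives full consistency because R + W > N, so every set of replicas that answers a read must share at least one node with every set that acknowledged a write. A minimal Java sketch of that arithmetic follows. The class and method names are hypothetical and not part of any Cassandra API, and N/2 is assumed to be integer division.

```java
// Illustrative helper only; names are hypothetical, not a Cassandra API.
final class QuorumMath {
    private QuorumMath() {}

    /** Quorum size for n replicas: floor(n / 2) + 1. */
    static int quorum(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        return n / 2 + 1; // integer division
    }

    /** True when every read set must intersect every write set (R + W > N). */
    static boolean readsOverlapWrites(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("N=3 quorum size: " + q);
        System.out.println("QUORUM reads + QUORUM writes overlap: " + readsOverlapWrites(n, q, q));
        System.out.println("R=1, W=1 overlap: " + readsOverlapWrites(n, 1, 1));
    }
}
```

With N=3 the quorum is 2, so a QUORUM read and a QUORUM write always overlap, while R=1 with W=1 does not and can return stale data.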
  9. Data Model - Vocabulary
     • Keyspace – like a namespace for unique keys.
     • Column Family – very much like a table… but not quite.
     • Key – a key that represents a row (of columns)
     • Column – representation of a value with:
       ◦ Column name
       ◦ Value
       ◦ Timestamp
     • Super Column – a column that holds a list of columns inside
  10. Data Model - Columns
      struct Column {
        1: required binary name,
        2: optional binary value,
        3: optional i64 timestamp,
        4: optional i32 ttl,
      }
      JSON-ish notation:
      {
        "name": "emailAddress",
        "value": "[email protected]",
        "timestamp": 123456789
      }
  11. Data Model - Column Family
      • Similar to SQL tables
      • Has many columns
      • Has many rows
  12. Data Model - Rows
      • Primary key for objects
      • All keys are arbitrary length binaries
      Users:                                (CF)
        ran:                                (ROW)
          emailAddress: [email protected]  (COLUMN)
          webSite: http://bar.com           (COLUMN)
        f.rat:                              (ROW)
          emailAddress: [email protected]  (COLUMN)
      Stats:                                (CF)
        ran:                                (ROW)
          visits: 243                       (COLUMN)
  13. Data Model - Songs example
      Songs:
        Meir Ariel:
          Shir Keev: 6:13
          Tikva: 4:11
          Erol: 6:17
          Suetz: 5:30
          Dr Hitchakmut: 3:30
        Mashina:
          Rakevet Layla: 3:02
          Optikai: 5:40
  14. Data Model - Super Columns
      Songs:
        Meir Ariel:
          Shirey Hag:
            Shir Keev: 6:13
            Tikva: 4:11
            Erol: 6:17
          Vegluy Eynaim:
            Suetz: 5:30
            Dr Hitchakmut: 3:30
        Mashina:
          ...
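The data model slides describe a column family as rows keyed by a row key, each holding named columns that carry a value and a timestamp. One way to picture this is a map of sorted maps, as in the sketch below. This is an in-memory illustration with hypothetical names, not Cassandra's storage code. It uses strings where Cassandra uses binaries, and it lets the later call win on a timestamp tie, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a single column family; not Cassandra code.
final class ColumnFamilySketch {
    /** A column: name, value and the timestamp of the write. */
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // row key -> (column name -> column); the inner TreeMap keeps columns sorted by name.
    private final Map<String, TreeMap<String, Column>> rows = new TreeMap<>();

    /** Writes a column; an existing column with a newer timestamp is kept. */
    void insert(String rowKey, String name, String value, long timestamp) {
        TreeMap<String, Column> row = rows.computeIfAbsent(rowKey, k -> new TreeMap<>());
        Column existing = row.get(name);
        if (existing == null || existing.timestamp <= timestamp) {
            row.put(name, new Column(name, value, timestamp));
        }
    }

    /** Returns the column value, or null if the row or column does not exist. */
    String get(String rowKey, String name) {
        TreeMap<String, Column> row = rows.get(rowKey);
        Column c = (row == null) ? null : row.get(name);
        return (c == null) ? null : c.value;
    }

    /** Column names of a row in sorted order; rows are sparse and may differ. */
    List<String> columnNames(String rowKey) {
        TreeMap<String, Column> row = rows.get(rowKey);
        return (row == null) ? new ArrayList<>() : new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        ColumnFamilySketch songs = new ColumnFamilySketch();
        songs.insert("Meir Ariel", "Tikva", "4:11", 1L);
        songs.insert("Meir Ariel", "Shir Keev", "6:13", 1L);
        songs.insert("Mashina", "Optikai", "5:40", 1L);
        System.out.println(songs.columnNames("Meir Ariel"));
        System.out.println(songs.get("Mashina", "Optikai"));
    }
}
```

A super column would add one more level of nesting between the row and its columns.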
  15. The True API
      get(keyspace, key, column_path, consistency)
      get_slice(ks, key, column_parent, predicate, consistency)
      multiget(ks, keys, column_path, consistency)
      multiget_slice(ks, keys, column_parent, predicate, consistency)
      ...
  16. The API - CQL
      execute_cql_query
      cqlsh> SELECT key, state FROM users;
      cqlsh> INSERT INTO users (key, full_name, birth_date, state)
             VALUES ('bsanderson', 'Brandon Sanderson', 1975, 'UT');
  17. Consistency Model
      • N - per keyspace
      • R - per read request
      • W - per write request
  18. Java Code
      // Open a Thrift connection to a Cassandra node on localhost
      TTransport tr = new TSocket("localhost", 9160);
      TProtocol proto = new TBinaryProtocol(tr);
      Cassandra.Client client = new Cassandra.Client(proto);
      tr.open();
      String key_user_id = "1";
      long timestamp = System.currentTimeMillis();
      // Write column "name" of row "1" in column family Standard1,
      // blocking until one replica acknowledges (W=1)
      client.insert("Keyspace1",
                    key_user_id,
                    new ColumnPath("Standard1", null, "name".getBytes("UTF-8")),
                    "Chris Goffinet".getBytes("UTF-8"),
                    timestamp,
                    ConsistencyLevel.ONE);
  19. Java Client - Hector
      http://github.com/rantav/hector
      • The de-facto Java client for Cassandra
      • Encapsulates Thrift
      • Adds JMX (monitoring)
      • Connection pooling
      • Failover
      • Open-sourced on GitHub, with a growing community of developers and users.
  20. Java Client - Hector - cont
      /**
       * Insert a new value keyed by key
       *
       * @param key   Key for the value
       * @param value the String value to insert
       */
      public void insert(final String key, final String value) {
        Mutator m = createMutator(keyspaceOperator);
        m.insert(key, CF_NAME, createColumn(COLUMN_NAME, value));
      }
  21. Java Client - Hector - cont
      /**
       * Get a string value.
       *
       * @return The string value; null if no value exists for the given key.
       */
      public String get(final String key) throws HectorException {
        ColumnQuery<String, String> q =
            createColumnQuery(keyspaceOperator, serializer, serializer);
        Result<HColumn<String, String>> r = q.setKey(key).
            setName(COLUMN_NAME).
            setColumnFamily(CF_NAME).
            execute();
        HColumn<String, String> c = r.get();
        return c == null ? null : c.getValue();
      }
  22. Sorting
      Columns are sorted by their type
      • BytesType
      • UTF8Type
      • AsciiType
      • LongType
      • LexicalUUIDType
      • TimeUUIDType
      Rows are sorted by their Partitioner
      • RandomPartitioner
      • OrderPreservingPartitioner
      • CollatingOrderPreservingPartitioner
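Row order depends on the partitioner: RandomPartitioner places rows by an MD5 hash of the row key, so rows end up in hash order rather than key order, while the order-preserving partitioners keep key order. The Java sketch below illustrates the hash-based case. The class and method names are hypothetical, and the token derivation (absolute value of the MD5 digest read as a big integer) is a simplified assumption rather than a copy of Cassandra's implementation.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch of hash-based row placement; not Cassandra source code.
final class TokenSketch {
    private TokenSketch() {}

    /** Maps a row key to a non-negative token derived from its MD5 digest. */
    static BigInteger token(String rowKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns a copy of the keys ordered by token, i.e. in hash order. */
    static String[] sortByToken(String... keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy, Comparator.comparing(TokenSketch::token));
        return copy;
    }

    public static void main(String[] args) {
        // Prints each key with its token; the order is unrelated to alphabetical order.
        for (String key : sortByToken("f.rat", "ran", "Mashina", "Meir Ariel")) {
            System.out.println(token(key).toString(16) + "  " + key);
        }
    }
}
```

Hash order spreads rows evenly across nodes but makes range scans over row keys meaningless, which is the trade-off against the order-preserving partitioners.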
  23. Thrift
      Cross-language protocol
      Compiles to: C++, Java, PHP, Ruby, Erlang, Perl, ...
      struct UserProfile {
        1: i32 uid,
        2: string name,
        3: string blurb
      }
      service UserStorage {
        void store(1: UserProfile user),
        UserProfile retrieve(1: i32 uid)
      }
  24. Agenda
      • Background and history
      • Architectural Layers
      • Transport: Thrift
      • Write Path (and sstables, memtables)
      • Read Path
      • Compactions
      • Bloom Filters
      • Gossip
      • Deletions
      • More...
  25. From Dynamo:
      • Symmetric p2p architecture
      • Gossip-based discovery and error detection
      • Distributed key-value store
        ◦ Pluggable partitioning
        ◦ Pluggable topology discovery
      • Eventually consistent, tunable per operation
  26. From BigTable
      • Sparse, column-oriented array
      • SSTable disk storage
        ◦ Append-only commit log
        ◦ Memtable (buffering and sorting)
        ◦ Immutable sstable files
        ◦ Compactions
        ◦ High write performance
  27. Architecture Layers
      Cluster Management
      • Messaging service
      • Gossip
      • Failure detection
      • Cluster state
      • Partitioner
      • Replication
      Single Host
      • Commit log
      • Memtable
      • SSTable
      • Indexes
      • Compaction
  28. Memtables
      • In-memory representation of recently written data
      • When the table is full, it's sorted and then flushed to disk -> sstable
  29. SSTables
      Sorted String Tables
      • Immutable
      • On-disk
      • Sorted by a string key
      • In-memory index of elements
      • Binary search (in memory) to find element location
      • Bloom filter to reduce the number of unneeded binary searches.
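The memtable and SSTable slides describe the write path: writes land in an in-memory table, a full table is flushed as an immutable sorted file, and lookups binary-search the sorted keys. The toy Java sketch below follows that flow under stated simplifications. All names are hypothetical, "disk" is just an in-memory list, there is no commit log or Bloom filter, and the memtable is kept sorted at all times by a `TreeMap` instead of being sorted at flush time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Toy write path with hypothetical names; in-memory only, no commit log.
final class ToyStore {
    /** Immutable, key-sorted run of entries standing in for an on-disk SSTable. */
    static final class SSTable {
        private final List<String> keys;
        private final List<String> values;

        SSTable(TreeMap<String, String> sorted) {
            // keySet() and values() iterate in the same ascending key order.
            keys = Collections.unmodifiableList(new ArrayList<>(sorted.keySet()));
            values = Collections.unmodifiableList(new ArrayList<>(sorted.values()));
        }

        /** Binary search over the sorted keys. */
        String get(String key) {
            int i = Collections.binarySearch(keys, key);
            return i >= 0 ? values.get(i) : null;
        }
    }

    private final int memtableLimit;
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final List<SSTable> sstables = new ArrayList<>(); // oldest first

    ToyStore(int memtableLimit) {
        if (memtableLimit < 1) {
            throw new IllegalArgumentException("memtableLimit must be >= 1");
        }
        this.memtableLimit = memtableLimit;
    }

    /** Writes touch only the memtable; a full memtable is flushed to a new SSTable. */
    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            sstables.add(new SSTable(memtable));
            memtable = new TreeMap<>();
        }
    }

    /** Reads check the memtable first, then SSTables from newest to oldest. */
    String get(String key) {
        String v = memtable.get(key);
        for (int i = sstables.size() - 1; v == null && i >= 0; i--) {
            v = sstables.get(i).get(key);
        }
        return v;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore(2);
        store.put("ran", "v1");
        store.put("f.rat", "v1");   // second entry fills the memtable and triggers a flush
        store.put("ran", "v2");     // newer value lives in the fresh memtable
        System.out.println(store.sstableCount() + " sstable(s), ran = " + store.get("ran"));
    }
}
```

Because a read may have to consult several SSTables while a write only touches memory, this shape also shows why reads tend to be slower than writes until compaction merges the files.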
  30. Write Properties
      • No locks in the critical path
      • Always available for writes, even if there are failures.
      • No reads
      • No seeks
      • Fast
      • Atomic within a row
  31. Bloom Filters
      • Space-efficient probabilistic data structure
      • Tests whether an element is a member of a set
      • Allows false positives, but not false negatives
      • k hash functions
      • Union and intersection are implemented as bitwise OR, AND
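The properties on this slide can be seen in a few lines of Java. The sketch below is a minimal Bloom filter with hypothetical names. The bit-array size, the value of k, and the way the k indexes are derived from two base hashes are illustrative choices, not what Cassandra uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Minimal Bloom filter sketch; sizing and hashing scheme are illustrative choices.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int k;

    ToyBloomFilter(int size, int k) {
        if (size < 1 || k < 1) {
            throw new IllegalArgumentException("size and k must be >= 1");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    /** Derives the i-th bit index from two base hashes (double hashing). */
    private int index(String item, int i) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, size);
    }

    /** 32-bit FNV-1a hash, used here as the second base hash. */
    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    void add(String item) {
        for (int i = 0; i < k; i++) {
            bits.set(index(item, i));
        }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    /** Union of two filters with identical parameters is a bitwise OR. */
    ToyBloomFilter union(ToyBloomFilter other) {
        if (other.size != size || other.k != k) {
            throw new IllegalArgumentException("filters must share size and k");
        }
        ToyBloomFilter out = new ToyBloomFilter(size, k);
        out.bits.or(bits);
        out.bits.or(other.bits);
        return out;
    }

    public static void main(String[] args) {
        ToyBloomFilter filter = new ToyBloomFilter(1024, 3);
        filter.add("ran");
        System.out.println("ran: " + filter.mightContain("ran"));
        System.out.println("f.rat: " + filter.mightContain("f.rat"));
    }
}
```

An added key always reports `true`, which is the "no false negatives" guarantee. A key that was never added usually reports `false`, and the occasional `true` only costs one unneeded lookup.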
  32. Read Properties
      • Read multiple SSTables
      • Slower than writes (but still fast)
      • Seeks can be mitigated with more RAM
      • Uses probabilistic Bloom filters to reduce lookups.
      • Extensive optional caching
        ◦ Key Cache
        ◦ Row Cache
        ◦ Excellent monitoring
  33. Gossip
      • p2p
      • Enables seamless addition of nodes.
      • Rebalancing of keys
      • Fast detection of nodes that go down.
      • Every node knows about all others - no master.
  34. Deletions
      • Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction
      • Read repair complicates things a little
      • Eventual consistency complicates things more
      • Solution: configurable delay before tombstone GC, after which tombstones are not repaired
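Because SSTables are immutable, a delete cannot erase old data in place. It writes a tombstone with a newer timestamp, and reads that merge several SSTables let the newest entry win. The Java sketch below shows that resolution and the grace-period check before a tombstone may be dropped. The names are hypothetical, and a single clock in arbitrary units is assumed for both write timestamps and the grace period, which is a simplification.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of tombstone handling; not Cassandra source code.
final class TombstoneSketch {
    private TombstoneSketch() {}

    /** One version of a column as found in some SSTable. */
    static final class Cell {
        final String value;      // null for a tombstone
        final long timestamp;
        final boolean tombstone;

        private Cell(String value, long timestamp, boolean tombstone) {
            this.value = value;
            this.timestamp = timestamp;
            this.tombstone = tombstone;
        }

        static Cell live(String value, long timestamp) {
            return new Cell(value, timestamp, false);
        }

        static Cell deleted(long timestamp) {
            return new Cell(null, timestamp, true);
        }
    }

    /** The newest version wins; a winning tombstone suppresses all older values. */
    static String resolve(List<Cell> versions) {
        Cell newest = null;
        for (Cell c : versions) {
            if (newest == null || c.timestamp > newest.timestamp) {
                newest = c;
            }
        }
        return (newest == null || newest.tombstone) ? null : newest.value;
    }

    /** A tombstone may be dropped at compaction only after the grace period. */
    static boolean canPurge(Cell c, long now, long gcGrace) {
        return c.tombstone && now - c.timestamp >= gcGrace;
    }

    public static void main(String[] args) {
        List<Cell> versions = Arrays.asList(Cell.live("243", 1L), Cell.deleted(2L));
        System.out.println("after delete: " + resolve(versions));
    }
}
```

Dropping the tombstone too early would let the older value in another SSTable or on a lagging replica reappear, which is why the delay before tombstone GC is configurable.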
  35. Extra: long list of subjects
      • SEDA (Staged Event-Driven Architecture)
      • Anti-entropy and Merkle trees
      • Hinted handoff
      • Read repair
  36. SEDA
      • Mutate
      • Stream
      • Gossip
      • Response
      • Anti Entropy
      • Load Balance
      • Migration
  37. References
      • http://horicky.blogspot.com/2009/11/nosql-patterns.html
      • http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
      • http://labs.google.com/papers/bigtable.html
      • http://bret.appspot.com/entry/how-friendfeed-uses-mysql
      • http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
      • http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
      • http://wiki.apache.org/cassandra/DataModel
      • http://incubator.apache.org/thrift/
      • http://www.eecs.harvard.edu/~mdw/papers/quals-seda.pdf