
Apache Cassandra

Talk for Software Craftsmanship Belarus Meetup #6.

Uladzimir Mihura

August 09, 2011

Transcript

  1. What is Cassandra? • key-value store with some structure • fault-tolerant • scalable • eventually consistent • tunable - consistency level - replication
  2. Where did it come from? • created at Facebook - Dynamo: distribution architecture - BigTable: data model • open-sourced in 2008 • Apache incubator in early 2009 • graduation in February 2010
  3. Who uses it? • Facebook (of course) • Rackspace • Twitter • Digg • Reddit • IBM • others...
  4. What problems does it solve? • reliability at scale - no single point of failure (all nodes are identical) • simple scaling (linear) • high write throughput • large data sets
  5. What problems can’t it solve? • no flexible indices (more on this later) • not good for big binary data (>64 MB) unless you chunk it • row contents must fit in available memory
  6. Clustering: CAP • CAP Theorem - Consistency - Availability - Partition tolerance • choose two • Cassandra chooses A and P, but lets you tune the trade-off toward more C
  7. Clustering: Replication & Consistency • replication factor - how many nodes data is replicated on • consistency level - zero (async write) - any - one - quorum (rf/2 + 1) - all
  8. Clustering: Consistency Level • zero: waits for none (async write); write only • any: waits for 1st response (including hinted handoff); write only • one: waits for 1st response; read/write • quorum: waits for rf/2 + 1 responses; read/write • all: waits for all; read/write
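The quorum formula above uses integer division. A minimal Python sketch of how many responses each level waits for; the function names and the lowercase level strings are illustrative choices, not part of any Cassandra client API:

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at quorum: rf/2 + 1, with integer division."""
    return replication_factor // 2 + 1


def responses_required(level: str, replication_factor: int) -> int:
    """Responses awaited for each consistency level listed on the slide."""
    required = {
        "zero": 0,  # asynchronous write, nothing is awaited
        "any": 1,   # a stored hint (hinted handoff) counts as the response
        "one": 1,
        "quorum": quorum(replication_factor),
        "all": replication_factor,
    }
    return required[level]


for rf in (1, 3, 5):
    print(rf, quorum(rf))
# prints:
# 1 1
# 3 2
# 5 3
```

Since 2 * quorum(rf) is always greater than rf, any two quorums share at least one replica, which is why quorum writes combined with quorum reads give the "more C" end of the tuning range.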
  9. Clustering: Ring • every node gets a token - defines its place in the ring - and which keys it is responsible for (ranges)
  10. Clustering: Ring • every node gets a token - defines its place in the ring - and which keys it is responsible for (ranges)
  11. Clustering: Ring • new node - token assignment - ranges adjusted - bootstrap - only neighbor nodes affected
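The ring slides can be illustrated with a toy token ring in Python. This is a sketch under assumptions: the `Ring` class, the small integer tokens, and the 32-bit token space are invented for illustration (real Cassandra partitioners use a much larger token space), and replication is ignored.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumption: small token space, for illustration only


def token_for(key: str) -> int:
    """Hash a row key to a position on the ring (illustrative MD5-based hash)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class Ring:
    def __init__(self):
        self._tokens = []  # node tokens, kept sorted
        self._nodes = {}   # token -> node name

    def add_node(self, name: str, token: int) -> None:
        bisect.insort(self._tokens, token)
        self._nodes[token] = name

    def owner_of_token(self, token: int) -> str:
        # A node owns the range (previous node's token, its own token];
        # positions past the last token wrap around to the first node.
        index = bisect.bisect_left(self._tokens, token)
        if index == len(self._tokens):
            index = 0
        return self._nodes[self._tokens[index]]

    def owner(self, key: str) -> str:
        return self.owner_of_token(token_for(key))


ring = Ring()
for name, token in (("A", 100), ("B", 200), ("C", 300)):
    ring.add_node(name, token)

probes = (50, 120, 180, 250, 350)
before = {t: ring.owner_of_token(t) for t in probes}
ring.add_node("D", 150)  # new node takes a token between A and B
after = {t: ring.owner_of_token(t) for t in probes}

moved = [t for t in probes if before[t] != after[t]]
print(moved)  # [120]: only part of B's old range changed owner
```

Only positions in the range that D split away from its neighbor B change owner; ranges held by A and C are untouched.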
  12. Data Model • keyspace • column family • row (indexed) • key • columns • name (sorted) • value
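One way to picture this nesting is as maps inside maps. The Python sketch below is only a mental model, not how Cassandra stores data; the helper names `insert` and `columns_of` are hypothetical, and the sample row borrows the `users` columns from the CQL example later in the deck.

```python
from collections import defaultdict

# One keyspace: column family -> row key -> {column name: value}
keyspace = defaultdict(lambda: defaultdict(dict))


def insert(column_family: str, key: str, columns: dict) -> None:
    """Add or overwrite columns in a row; rows need not share the same columns."""
    keyspace[column_family][key].update(columns)


def columns_of(column_family: str, key: str) -> list:
    """Return a row's columns ordered by column name, as the slide notes."""
    row = keyspace[column_family].get(key, {})
    return sorted(row.items())


insert("users", "jsmith", {"state": "TX", "gender": "f"})
insert("users", "jsmith", {"birth_year": 1968})

print(columns_of("users", "jsmith"))
# [('birth_year', 1968), ('gender', 'f'), ('state', 'TX')]
```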
  13. Reading • get(): retrieve column by name • multiget(): by column name for a number of keys • get_slice(): by column name or a range of names - returning columns - returning supercolumns • multiget_slice(): a subset of columns for a set of keys • get_count(): number of columns or subcolumns • get_range_slice(): subset of columns for a range of keys
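To make the difference between these calls concrete, here is a toy in-memory stand-in in Python. The method names mirror the calls listed above, but the signatures are simplified assumptions and this is not the Thrift or any client API; supercolumns are left out, and `get_range_slice` assumes keys are compared in sorted order.

```python
import bisect


class ToyColumnFamily:
    def __init__(self):
        self._rows = {}  # row key -> {column name: value}

    def insert(self, key, columns):
        self._rows.setdefault(key, {}).update(columns)

    def get(self, key, name):
        """Retrieve one column by name."""
        return self._rows[key][name]

    def get_slice(self, key, start="", finish=None):
        """Columns whose names fall in [start, finish], in name order."""
        row = self._rows.get(key, {})
        names = sorted(row)
        lo = bisect.bisect_left(names, start)
        hi = len(names) if finish is None else bisect.bisect_right(names, finish)
        return [(name, row[name]) for name in names[lo:hi]]

    def multiget_slice(self, keys, start="", finish=None):
        """The same column slice for several row keys."""
        return {key: self.get_slice(key, start, finish) for key in keys}

    def get_count(self, key):
        """Number of columns in a row."""
        return len(self._rows.get(key, {}))

    def get_range_slice(self, start_key, end_key, start="", finish=None):
        """The same column slice for every row key in [start_key, end_key]."""
        return {key: self.get_slice(key, start, finish)
                for key in sorted(self._rows)
                if start_key <= key <= end_key}


cf = ToyColumnFamily()
cf.insert("alice", {"2011-08-01": "t1", "2011-08-05": "t2", "2011-08-09": "t3"})
cf.insert("bob", {"2011-08-03": "t4"})

print(cf.get("alice", "2011-08-05"))                      # t2
print(cf.get_slice("alice", "2011-08-02", "2011-08-09"))  # [('2011-08-05', 't2'), ('2011-08-09', 't3')]
print(cf.get_count("alice"))                              # 3
```

Because column names are sorted, a slice over a name range is a contiguous read, which is what makes time-ordered column names useful for timelines.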
  14. Clients • Python: - Pycassa: http://github.com/pycassa/pycassa - Telephus: http://github.com/driftx/Telephus (Twisted) • Java: - Hector: http://github.com/rantav/hector - Kundera: http://github.com/impetus-opensource/Kundera - Pelops: http://github.com/s7/scale7-pelops - Cassandrelle (Demoiselle Cassandra): http://demoiselle.sf.net/component/demoiselle-cassandra/ • .NET: - Aquiles: http://aquiles.codeplex.com/ • Ruby: - Cassandra: http://github.com/fauna/cassandra • PHP: - PHP Client Library: https://github.com/kallaspriit/Cassandra-PHP-Client-Library - phpcassa: http://github.com/thobbs/phpcassa
  15. CQL (from 0.8) • USE • SELECT • INSERT/UPDATE • DELETE • TRUNCATE/DROP • BATCH • CREATE KEYSPACE • CREATE COLUMNFAMILY • CREATE INDEX
  16. CQL: Example
      CREATE COLUMNFAMILY users (
        KEY varchar PRIMARY KEY,
        password varchar,
        gender varchar,
        session_token varchar,
        state varchar,
        birth_year bigint);
      INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');
      SELECT * FROM users WHERE KEY='jsmith';
      u'jsmith' | u'password',u'ch@ngem3a'
      DROP COLUMNFAMILY users;
  17. CQL: Example
      CREATE INDEX birth_year_key ON users (birth_year);
      CREATE INDEX state_key ON users (state);
      SELECT * FROM users
        WHERE gender='f' AND
        state='TX' AND
        birth_year='1968';
      u'user1' | u'birth_year',1968 | u'gender',u'f' | u'password',u'ch@ngem3' | u'state',u'TX'
      DROP COLUMNFAMILY users;
  18. Indexing • secondary indexes - hashed - equality predicates (WHERE column x = y) - specified on creation or later - best when many rows have similar columns • self-managed indexes
  19. Indexing: Self-managed: one-to-several [diagram: one row keyed by index name; columns labelled indexed value #1, indexed value #1, indexed value #2, indexed value #2, each paired with a related key]
  20. Indexing: Self-managed: one-to-many [diagram: one row per indexed value (#1, #2); each row's columns are related keys with empty ("-") values]
  21. Indexing: Self-managed: one-to-many [diagram: one row per indexed value (#1, #2); each row's columns are ordering values paired with related keys]
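A self-managed index is an ordinary column family that the application keeps in sync itself. The Python sketch below shows one reading of the one-to-many layout, with one index row per indexed value and related keys as column names; the dictionaries and the functions `set_state` and `users_in` are hypothetical, and real code would issue both writes through a client, ideally in one batch.

```python
users = {}           # data rows: user key -> {column name: value}
users_by_state = {}  # index rows: indexed value -> {related key: ""}


def set_state(user_key: str, state: str) -> None:
    """Write the data row and keep the index row in step with it."""
    old_state = users.get(user_key, {}).get("state")
    if old_state is not None:
        # Remove the stale index entry, otherwise lookups return old matches.
        users_by_state[old_state].pop(user_key, None)
    users.setdefault(user_key, {})["state"] = state
    users_by_state.setdefault(state, {})[user_key] = ""  # value left empty


def users_in(state: str) -> list:
    """Read one index row; its column names are the matching user keys."""
    return sorted(users_by_state.get(state, {}))


set_state("user1", "TX")
set_state("user2", "TX")
set_state("user1", "CA")  # user1 moves, so the TX index row must shrink

print(users_in("TX"))  # ['user2']
print(users_in("CA"))  # ['user1']
```

The design cost is visible in `set_state`: every update is two writes plus a read of the old value, which is the price of not relying on built-in secondary indexes.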
  22. Let’s practice: Twitter • Get a user record by username • Get the friends of a username • Get the followers of a username • Get a timeline for a user • Get a timeline of a specific user’s tweets • Get a tweet from a tweet ID • Create a tweet • Create a user • Add friends to a user • Remove friends from a user
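One possible answer to the exercise, sketched in Python with plain dictionaries standing in for column families. All names here (`USER`, `FRIENDS`, `FOLLOWERS`, `TWEET`, `TIMELINE`, `USERLINE` and the functions) are assumptions made for illustration, and a counter replaces real timestamps so the example is deterministic.

```python
import itertools

_clock = itertools.count(1)  # stand-in for time-ordered column names

USER = {}       # username -> {"password": ...}
FRIENDS = {}    # username -> {friend username: ""}
FOLLOWERS = {}  # username -> {follower username: ""}
TWEET = {}      # tweet id -> {"username": ..., "body": ...}
TIMELINE = {}   # username -> {timestamp: tweet id}, own tweets plus friends' tweets
USERLINE = {}   # username -> {timestamp: tweet id}, own tweets only


def create_user(username, password):
    USER[username] = {"password": password}


def add_friend(username, friend):
    # Two writes: one row answers "friends of", the other "followers of".
    FRIENDS.setdefault(username, {})[friend] = ""
    FOLLOWERS.setdefault(friend, {})[username] = ""


def remove_friend(username, friend):
    FRIENDS.get(username, {}).pop(friend, None)
    FOLLOWERS.get(friend, {}).pop(username, None)


def create_tweet(username, body):
    ts = next(_clock)
    tweet_id = f"tweet-{ts}"
    TWEET[tweet_id] = {"username": username, "body": body}
    USERLINE.setdefault(username, {})[ts] = tweet_id
    # Fan out on write: copy the tweet id into every follower's timeline row.
    for reader in [username, *FOLLOWERS.get(username, {})]:
        TIMELINE.setdefault(reader, {})[ts] = tweet_id
    return tweet_id


def _newest(line, limit):
    return [TWEET[line[ts]] for ts in sorted(line, reverse=True)[:limit]]


def get_timeline(username, limit=10):
    return _newest(TIMELINE.get(username, {}), limit)


def get_userline(username, limit=10):
    return _newest(USERLINE.get(username, {}), limit)


create_user("alice", "secret")
create_user("bob", "hunter2")
add_friend("bob", "alice")           # bob follows alice
first = create_tweet("alice", "hello")
create_tweet("bob", "hi alice")

print([t["body"] for t in get_timeline("bob")])   # ['hi alice', 'hello']
print([t["body"] for t in get_userline("bob")])   # ['hi alice']
```

Each query in the exercise maps to a single row read because the writes do the extra work up front, which suits the high write throughput mentioned earlier in the deck.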
  23. ?