Apache Cassandra

Vova Miguro

What is Cassandra?
• key-value store with some structure
• fault-tolerant
• scalable
• eventually consistent
• tunable
  - consistency level
  - replication

Where did it come from?
• created at Facebook
  - Dynamo: distribution architecture
  - BigTable: data model
• open-sourced in 2008
• Apache Incubator in early 2009
• graduated to a top-level Apache project in February 2010

Who uses it?
• Facebook (of course)
• Rackspace
• Twitter
• Digg
• Reddit
• IBM
• others...

What problems does it solve?
• reliability at scale
  - no single point of failure (all nodes are identical)
• simple scaling (linear)
• high write throughput
• large data sets

What problems can't it solve?
• no flexible indices (more on this later)
• not good for large binary data (>64 MB) unless you chunk it
• row contents must fit in available memory

Clustering: CAP
• CAP theorem
  - Consistency
  - Availability
  - Partition tolerance
• choose two
• Cassandra chooses A and P, but the consistency level can be tuned to trade some availability for more C

Clustering: Replication & Consistency
• replication factor (rf)
  - how many nodes data is replicated on
• consistency level
  - zero (async write)
  - any
  - one
  - quorum (rf/2 + 1)
  - all

Clustering: Consistency Level

level    waits for                                  applies to
zero     none (async write)                         write
any      1st response (including hinted handoff)    write
one      1st response                               read/write
quorum   rf/2 + 1                                   read/write
all      all                                        read/write
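The table above can be turned into a few lines of arithmetic. The sketch below is illustrative only: the function names `required_acks` and `read_overlaps_write` are made up for this example and are not part of any Cassandra client. It computes how many replica responses each level waits for, and checks the usual overlap rule (responses awaited on read plus responses awaited on write must exceed rf) that makes a read see the latest acknowledged write.

```python
def required_acks(level: str, rf: int) -> int:
    """Responses a coordinator waits for at a given level (rf = replication factor)."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    table = {
        "zero": 0,              # async write, nothing is awaited
        "any": 1,               # first response, a stored hint also counts
        "one": 1,               # first replica response
        "quorum": rf // 2 + 1,  # integer division, e.g. rf=3 -> 2
        "all": rf,
    }
    return table[level]


def read_overlaps_write(write_level: str, read_level: str, rf: int) -> bool:
    """True when every read set must share a replica with every write set.

    Only meaningful for one/quorum/all: "zero" and "any" do not guarantee
    that any replica actually holds the data yet.
    """
    for level in (write_level, read_level):
        if level not in ("one", "quorum", "all"):
            raise ValueError(f"{level!r} gives no replica guarantee")
    return required_acks(write_level, rf) + required_acks(read_level, rf) > rf


print(required_acks("quorum", 3))                    # 2
print(read_overlaps_write("quorum", "quorum", 3))    # True
print(read_overlaps_write("one", "one", 3))          # False
```

With rf = 3, quorum writes plus quorum reads overlap (2 + 2 > 3), while one/one does not (1 + 1 = 2), which is the "more C" end of the tuning range.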

Clustering: Ring
• every node gets a token
  - defines its place in the ring
  - and which keys it is responsible for (ranges)
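A minimal sketch of the token ring idea, under stated assumptions: keys are hashed with MD5 to a 128-bit integer, each node owns the range from the previous node's token (exclusive) up to its own token (inclusive), and the range wraps around at the end. The `Ring` class, the node names, and the evenly spaced tokens are hypothetical; this is not Cassandra's actual partitioner code.

```python
import bisect
import hashlib

TOKEN_SPACE = 2 ** 128  # size of the MD5 output space used in this sketch


def token_for(key: str) -> int:
    """Map a row key to a position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, node_tokens):
        # node_tokens: {node name: token}; keep entries sorted by token
        self._entries = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._entries]

    def owner_of_token(self, token: int) -> str:
        # First node whose token is >= the key's token; wrap to the first node.
        i = bisect.bisect_left(self._tokens, token)
        if i == len(self._tokens):
            i = 0
        return self._entries[i][1]

    def owner(self, key: str) -> str:
        return self.owner_of_token(token_for(key))


ring = Ring({
    "node-a": 0,
    "node-b": TOKEN_SPACE // 3,
    "node-c": 2 * TOKEN_SPACE // 3,
})
print(ring.owner("jsmith"))  # one of node-a, node-b, node-c
```

Keeping the tokens in a sorted list makes the lookup a binary search, so finding the responsible node stays cheap as the ring grows.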

Clustering: Ring
• new node
  - token assignment
  - ranges adjusted
  - bootstrap
  - only neighbor nodes affected
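To see why only a neighbor is affected, here is a small sketch with made-up integer tokens (1 to 100) instead of real hashes. The helper names `owner`, `add_node` and `moved_keys` are hypothetical. A node owns the range from the previous token (exclusive) to its own token (inclusive).

```python
import bisect


def owner(ring, key_token):
    """ring: sorted list of (token, node). Returns the node owning key_token."""
    positions = [token for token, _ in ring]
    i = bisect.bisect_left(positions, key_token) % len(ring)  # wrap around
    return ring[i][1]


def add_node(ring, node, token):
    """Return a new ring that includes the extra node."""
    return sorted(ring + [(token, node)])


def moved_keys(before, after, key_tokens):
    """Keys whose owner changed, mapped to (old owner, new owner)."""
    return {
        k: (owner(before, k), owner(after, k))
        for k in key_tokens
        if owner(before, k) != owner(after, k)
    }


ring = [(25, "A"), (50, "B"), (75, "C"), (100, "D")]
bigger = add_node(ring, "E", 60)          # E lands between B (50) and C (75)
moved = moved_keys(ring, bigger, range(1, 101))
print(sorted(moved))                      # tokens 51..60
print(set(moved.values()))                # {('C', 'E')}
```

Only the slice of C's old range that now falls at or below E's token changes hands; A, B and D keep exactly the keys they had.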

Clustering: Ring
• node dies or becomes isolated
• hinted handoff
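Hinted handoff can be pictured roughly as follows. This is a toy simulation under simplifying assumptions, not Cassandra's implementation: the `Node` class and the `write` and `replay_hints` functions are invented for illustration, hints are kept in memory on the coordinator, and timestamps and hint expiry are ignored.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # key -> value stored on this replica
        self.hints = []   # (target node name, key, value) kept for down replicas


def write(coordinator, replicas, key, value):
    """Write to every live replica; keep a hint for each unreachable one."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((replica.name, key, value))
    return acks


def replay_hints(coordinator, nodes_by_name):
    """Deliver stored hints to targets that are reachable again."""
    remaining = []
    for target, key, value in coordinator.hints:
        node = nodes_by_name[target]
        if node.up:
            node.data[key] = value
        else:
            remaining.append((target, key, value))
    coordinator.hints = remaining
```

A typical run: mark one replica as down, write through a live coordinator, bring the replica back, then call `replay_hints` so it catches up on the write it missed.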

Data Model
• keyspace
• column family
• row (indexed)
  • key
  • columns
    • name (sorted)
    • value
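One way to picture this hierarchy is as nested maps. The sketch below is only a mental model in plain Python dictionaries; the keyspace name `demo` is hypothetical, and the `users` row reuses the values from the CQL example in this deck.

```python
# keyspace -> column family -> row key -> {column name: column value}
keyspaces = {
    "demo": {
        "users": {
            "jsmith": {
                "state": "TX",
                "password": "ch@ngem3a",
                "gender": "f",
            },
        },
    },
}


def columns_in_order(keyspace, column_family, row_key):
    """Columns of one row, sorted by column name as the store would keep them."""
    row = keyspaces[keyspace][column_family][row_key]
    return sorted(row.items())


for name, value in columns_in_order("demo", "users", "jsmith"):
    print(name, "=", value)
```

Rows are looked up by key, while columns inside a row are kept sorted by name, which is what makes range reads over column names possible.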

Data Model: ColumnFamily Column families

Data Model: SuperColumnFamily Supercolumn families

Easier to start from the bottom up

Data Model: Column

Data Model: Row

Data Model: Column comparators
• TimeUUID
• LexicalUUID
• UTF8
• Long
• Bytes
• ...
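The comparator decides the order in which column names are stored, and the same names can sort very differently. The sketch below imitates three of the listed comparators with Python sort keys. It is an approximation for illustration: `make_time_uuid` is a hypothetical helper that builds a version 1 UUID from a chosen timestamp, and the byte-wise sort stands in for what a Bytes comparator would see.

```python
import uuid

names = ["10", "9", "100"]
utf8_order = sorted(names)            # compared as text
long_order = sorted(names, key=int)   # compared as numbers
print(utf8_order)   # ['10', '100', '9']
print(long_order)   # ['9', '10', '100']


def make_time_uuid(timestamp: int) -> uuid.UUID:
    """Build a version 1 UUID carrying the given 60-bit timestamp."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))


early = make_time_uuid(0xFFFFFFFF)     # smaller timestamp
late = make_time_uuid(0x100000000)     # larger timestamp

by_time = sorted([late, early], key=lambda u: u.time)    # TimeUUID-style order
by_bytes = sorted([late, early], key=lambda u: u.bytes)  # raw byte order
print(by_time == [early, late])    # True
print(by_bytes == [late, early])   # True
```

Because a version 1 UUID stores the low bits of the timestamp first, raw byte order and time order can disagree, which is why time-ordered columns need a time-aware comparator.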

Data Model: ColumnFamily

Writing
• simple: put(key, col, value)
• complex: put(key, [col, value, ..., col, value])
• batch: multi-key
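The three write shapes can be sketched on top of an in-memory dictionary. This is a toy model, not a client API: the `ColumnFamily` class and its method names are invented here. It assumes each column carries a timestamp and that the write with the highest timestamp wins.

```python
import time
from collections import defaultdict


class ColumnFamily:
    def __init__(self):
        # row key -> {column name: (value, timestamp)}
        self.rows = defaultdict(dict)

    def put(self, key, columns, timestamp=None):
        """Write one or many columns of a single row."""
        ts = time.time_ns() // 1000 if timestamp is None else timestamp
        row = self.rows[key]
        for name, value in columns.items():
            existing = row.get(name)
            # Last write wins: ignore writes older than what is stored.
            if existing is None or ts >= existing[1]:
                row[name] = (value, ts)

    def batch(self, mutations, timestamp=None):
        """Write columns for several row keys in one call."""
        ts = time.time_ns() // 1000 if timestamp is None else timestamp
        for key, columns in mutations.items():
            self.put(key, columns, ts)

    def value(self, key, name):
        return self.rows[key][name][0]


cf = ColumnFamily()
cf.put("jsmith", {"password": "ch@ngem3a"})                   # simple
cf.put("jsmith", {"state": "TX", "gender": "f"})              # complex
cf.batch({"user1": {"state": "TX"}, "user2": {"state": "CA"}})  # batch
```

Note that nothing is read before writing: a put only compares timestamps, which is part of why writes are cheap in this model.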

Writes Writing

Reading
• get(): retrieve column by name
• multiget(): by column name for a number of keys
• get_slice(): by column name or a range of names
  - returning columns
  - returning supercolumns
• multiget_slice(): a subset of columns for a set of keys
• get_count(): number of columns or subcolumns
• get_range_slice(): subset of columns for a range of keys
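A rough sketch of what some of these calls do, using a plain dictionary as the store. The function names mirror the list above, but the signatures are simplified for illustration and are not the real Thrift ones. An empty `start` or `finish` means "open-ended", and both ends of the range are inclusive.

```python
import bisect

# row key -> {column name: value}; sample data taken from the CQL examples
rows = {
    "jsmith": {"password": "ch@ngem3a"},
    "user1": {"birth_year": 1968, "gender": "f",
              "password": "ch@ngem3", "state": "TX"},
}


def get(key, column):
    """Retrieve a single column by name."""
    return rows[key][column]


def get_slice(key, start="", finish="", count=100):
    """Columns whose names fall in [start, finish], in sorted name order."""
    row = rows.get(key, {})
    names = sorted(row)
    lo = bisect.bisect_left(names, start) if start else 0
    hi = bisect.bisect_right(names, finish) if finish else len(names)
    return [(name, row[name]) for name in names[lo:hi][:count]]


def multiget_slice(keys, start="", finish="", count=100):
    """The same slice for a set of row keys."""
    return {key: get_slice(key, start, finish, count) for key in keys}


def get_count(key):
    """Number of columns in a row."""
    return len(rows.get(key, {}))


print(get_slice("user1", "gender", "password"))
# [('gender', 'f'), ('password', 'ch@ngem3')]
```

Slices work because column names are kept sorted within a row, so a range of names is a contiguous run.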

Reads Reading

Clients
• Python:
  - Pycassa: http://github.com/pycassa/pycassa
  - Telephus: http://github.com/driftx/Telephus (Twisted)
• Java:
  - Hector: http://github.com/rantav/hector
  - Kundera: http://github.com/impetus-opensource/Kundera
  - Pelops: http://github.com/s7/scale7-pelops
  - Cassandrelle (Demoiselle Cassandra): http://demoiselle.sf.net/component/demoiselle-cassandra/
• .NET:
  - Aquiles: http://aquiles.codeplex.com/
• Ruby:
  - Cassandra: http://github.com/fauna/cassandra
• PHP:
  - PHP Client Library: https://github.com/kallaspriit/Cassandra-PHP-Client-Library
  - phpcassa: http://github.com/thobbs/phpcassa

CQL (from 0.8)
• USE
• SELECT
• INSERT/UPDATE
• DELETE
• TRUNCATE/DROP
• BATCH
• CREATE KEYSPACE
• CREATE COLUMNFAMILY
• CREATE INDEX

CQL: Example

    CREATE COLUMNFAMILY users (
        KEY varchar PRIMARY KEY,
        password varchar,
        gender varchar,
        session_token varchar,
        state varchar,
        birth_year bigint);

    INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');

    SELECT * FROM users WHERE KEY='jsmith';
    u'jsmith' | u'password',u'ch@ngem3a'

    DROP COLUMNFAMILY users;

CQL: Example

    CREATE INDEX birth_year_key ON users (birth_year);
    CREATE INDEX state_key ON users (state);

    SELECT * FROM users
        WHERE gender='f' AND
        state='TX' AND
        birth_year='1968';
    u'user1' | u'birth_year',1968 | u'gender',u'f' | u'password',u'ch@ngem3' | u'state',u'TX'

    DROP COLUMNFAMILY users;

Indexing
• secondary indexes
  - hashed
  - equality predicates (where column x = y)
  - specified on creation or later
  - best when many rows share the same indexed value (low cardinality)
• self-managed indexes

Indexing: Self-managed: one-to-one

index name | indexed value #1 | indexed value #2
           | related key      | related key

Indexing: Self-managed: one-to-several

index name | indexed value #1 | indexed value #1 | indexed value #2 | indexed value #2
           | related key      | related key      | related key      | related key

Indexing: Self-managed: one-to-many

indexed value #1 | related key | related key
                 | -           | -
indexed value #2 | related key | related key
                 | -           | -

Indexing: Self-managed: one-to-many

indexed value #1 | ordering value | ordering value
                 | related key    | related key
indexed value #2 | ordering value | ordering value
                 | related key    | related key
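A self-managed index is just another column family that the application keeps in sync on every write. The sketch below follows the last layout (one row per indexed value, column names used for ordering, column values holding the related key). All names are hypothetical, and the composite `(birth_year, key)` ordering value is an assumption made so that column names stay unique.

```python
users = {}           # row key -> {column name: value}
users_by_state = {}  # indexed value -> {ordering value: related key}


def set_user(key, state, birth_year):
    """Write a user row and keep the index row in step with it."""
    old = users.get(key)
    if old is not None:
        # Remove the stale index entry before adding the new one.
        users_by_state[old["state"]].pop((old["birth_year"], key), None)
    users[key] = {"state": state, "birth_year": birth_year}
    users_by_state.setdefault(state, {})[(birth_year, key)] = key


def users_in_state(state):
    """Related keys for one indexed value, in ordering-value order."""
    index_row = users_by_state.get(state, {})
    return [index_row[name] for name in sorted(index_row)]
```

For example, after `set_user("user1", "TX", 1968)` and `set_user("user2", "TX", 1950)`, `users_in_state("TX")` lists `user2` before `user1`. The cost of this approach is that every update must also clean up the old index entry, which the store will not do for you.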

Let's practice: Twitter
• Get a user record by username
• Get the friends of a username
• Get the followers of a username
• Get a timeline for a user
• Get a timeline of a specific user's tweets
• Get a tweet from a tweet ID
• Create a tweet
• Create a user
• Add friends to a user
• Remove friends from a user
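One possible answer to the exercise, sketched with plain dictionaries standing in for column families. Everything here is an assumption made for illustration: the column family names, the counter used as a clock, and the `tweet-N` identifiers (a real design would more likely use time-based UUIDs). Each query in the list becomes a lookup of a single row.

```python
import itertools

users, tweets = {}, {}        # username -> columns, tweet id -> columns
friends, followers = {}, {}   # username -> {other username: ""}
timeline, userline = {}, {}   # username -> {timestamp: tweet id}
_clock = itertools.count(1)   # stand-in for a real timestamp source


def create_user(username, password):
    users[username] = {"password": password}


def add_friend(username, friend):
    # Maintain both directions so each lookup reads one row.
    friends.setdefault(username, {})[friend] = ""
    followers.setdefault(friend, {})[username] = ""


def remove_friend(username, friend):
    friends.get(username, {}).pop(friend, None)
    followers.get(friend, {}).pop(username, None)


def create_tweet(username, body):
    ts = next(_clock)
    tweet_id = f"tweet-{ts}"
    tweets[tweet_id] = {"username": username, "body": body}
    userline.setdefault(username, {})[ts] = tweet_id
    # Fan out on write: copy the tweet id into every follower's timeline.
    for reader in [username, *followers.get(username, {})]:
        timeline.setdefault(reader, {})[ts] = tweet_id
    return tweet_id


def read_line(line, username, limit=20):
    """Newest-first tweets from a timeline or userline row."""
    row = line.get(username, {})
    newest_first = sorted(row, reverse=True)[:limit]
    return [tweets[row[ts]] for ts in newest_first]
```

The design trades extra writes (one per follower) for reads that touch a single row, which fits a store with high write throughput and no joins.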

Facebook messaging
