Principles of Cassandra Data Modeling

©2014 DataStax Tyler Hobbs Principles of Cassandra Data Modeling 1
Cassandra Committer at DataStax [email protected] @tylhobbs Friday, May 16, 14

About this Talk This is about getting the basics right.
Friday, May 16, 14

About this Talk This is about getting the basics right.
Fancy tricks can come later, once you know how to evaluate them. Friday, May 16, 14

General Principles 1. Spread data evenly around the cluster Friday,
May 16, 14

Primary key determines placement* Partitioning jim carol johnny suzy age:
36 car: camaro gender: M age: 37 car: subaru gender: F age:12 gender: M age:10 gender: F Friday, May 16, 14

jim carol johnny suzy PK 5e02739678... a9a0198010... f4eb27cea7... 78b421309e... Murmur
Hash Murmur* hash operation yields a 64-bit number for keys of any size. Friday, May 16, 14

Node D Node C Node B Node A The “token
ring” 0xc000000000..0 0x0000000000..0 0x4000000000..0 0x8000000000..0 Friday, May 16, 14

jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e... Start End
A 0x0000000000..1 0x4000000000..0 B 0x4000000000..1 0x8000000000..0 C 0x8000000000..1 0xc000000000..0 D 0xc000000000..1 0x0000000000..0 Friday, May 16, 14

Node D Node C Node B Node A carol a9a0198010...
Replication Friday, May 16, 14

Node D Node C Node B Node A carol a9a0198010...
Friday, May 16, 14

C’’ A’’ D’ C’ A’ D A B’ C B
Virtual nodes Node D Node C Node B Node A Without vnodes With vnodes Friday, May 16, 14

General Principles 1. Spread data evenly around the cluster Pick
a good primary key. Friday, May 16, 14

General Principles 1. Spread data evenly around the cluster 2.
Minimize the number of partitions read Friday, May 16, 14 Why minimize? Network cost of fan-out, disk seeks.

Minimize the number of partitions read Sometimes these conflict. It’s a balance. Friday, May 16, 14

Non-Goals 1. Minimize the number of writes Friday, May 16,
14

Non-Goals 1. Minimize the number of writes Writes aren’t free,
but they’re cheap. Friday, May 16, 14

Non-Goals 1. Minimize the number of writes 2. Minimize data
duplication Friday, May 16, 14

Non-Goals 1. Minimize the number of writes 2. Minimize data
duplication There are no joins. (Nor do you really want them in a distributed database.) Friday, May 16, 14 Also, disk space is a cheap resource.

Minimize the number of partitions read Minimize partition reads by modeling to fit your queries. Friday, May 16, 14

Model Your Queries 1. Don’t model around relations Friday, May
16, 14

Model Your Queries 1. Don’t model around relations 2. Don’t
model around objects Friday, May 16, 14

Model Your Queries Steps to pick a data model: 1.
Determine what specific queries to support Friday, May 16, 14

Model Your Queries Consider grouping, ordering, filtering, uniqueness, etc. Steps
to pick a data model: 1. Determine what specific queries to support Friday, May 16, 14

Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Friday, May 16, 14

Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Use roughly one table per query pattern. Friday, May 16, 14

Example: User Lookup We have users and want to look
them up. Friday, May 16, 14

Example: User Lookup We have users and want to look
them up: • by username • by email • want full user details for both types of lookups Friday, May 16, 14

Example: User Lookup CREATE TABLE users_by_username ( username text PRIMARY
KEY, email text, age int ) We have users and want to look them up: • by username • by email Friday, May 16, 14

KEY, email text, age int ) CREATE TABLE users_by_email ( email text PRIMARY KEY, username text, age int ) We have users and want to look them up: • by username • by email Friday, May 16, 14

Minimize the number of partitions read Friday, May 16, 14

KEY, email text, age int ) CREATE TABLE users_by_email ( email text PRIMARY KEY, username text, age int ) Friday, May 16, 14

KEY, id uuid ) CREATE TABLE users_by_email ( email text PRIMARY KEY, id uuid ) What if we had done: CREATE TABLE users ( id uuid PRIMARY KEY username text, email text, age int, ) Friday, May 16, 14

Example: User Lookup What if we had done: CREATE TABLE
users ( id uuid PRIMARY KEY username text, email text, age int, ) We reduced data duplication! Friday, May 16, 14

users ( id uuid PRIMARY KEY username text, email text, age int, ) We reduced data duplication! Need to read twice as many partitions. Friday, May 16, 14 Same problem with secondary indexes Consider cost/frequency of updates vs reads. Worse for other patterns.

Example: User Groups Users are in groups. Friday, May 16,
14

Example: User Groups Users are in groups. We want to
get all users in a group. Friday, May 16, 14

Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition How do we fit a group in a partition? Friday, May 16, 14

Example: User Groups CREATE TABLE groups ( groupname text, username
text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14

Example: User Groups groupname is the partitioning key username is
the clustering key CREATE TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14

Example: User Groups One partition per group. Within a partition,
order rows by username. CREATE TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14

Example: User Groups One partition per group. Not spread perfectly,
but reads are great. CREATE TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14 How to break up big groups? Split by time, username preﬁx, or preﬁx of a hash of the username.

Example: User Groups We’re duplicating user data many times. CREATE
TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14

users ( id uuid PRIMARY KEY username text, email text, age int ) CREATE TABLE groups ( groupname text, user_id uuid, PRIMARY KEY (groupname, user_id) ) Friday, May 16, 14 What’s wrong with this?

users ( id uuid PRIMARY KEY username text, email text, age int ) CREATE TABLE groups ( groupname text, user_id uuid, PRIMARY KEY (groupname, user_id) ) Read wayyyy too many partitions. Friday, May 16, 14

Example: New Users Users are in groups. Get the N
newest users in a group. Friday, May 16, 14

Example: User Groups CREATE TABLE group_join_dates ( groupname text, joined
timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) Friday, May 16, 14

timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) Use timeuuid instead of timestamp to avoid collisions. Friday, May 16, 14

timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) WITH CLUSTERING ORDER BY (joined DESC) Reverse clustering order to optimize for N latest. Friday, May 16, 14

timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) WITH CLUSTERING ORDER BY (joined DESC) SELECT * FROM group_join_dates WHERE groupname=? LIMIT ? Friday, May 16, 14

timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) How to spread data more evenly? Friday, May 16, 14

Example: User Groups CREATE TABLE group_join_dates ( groupname text, join_day
text, joined timeuuid, username text, email text, age int, PRIMARY KEY ((groupname, join_day), joined) ) WITH CLUSTERING ORDER BY (joined DESC) Use a compound partition key. Friday, May 16, 14

Example: User Groups CREATE TABLE group_join_dates ( groupname text, join_day
text, joined timeuuid, username text, email text, age int, PRIMARY KEY ((groupname, join_day), joined) ) WITH CLUSTERING ORDER BY (joined DESC) Use a compound partition key. join_day looks like ‘2014-05-15’ Friday, May 16, 14 Can also use second, month, year, etc.

Minimize the number of partitions read We might a few read more partitions. It’s a balance. Friday, May 16, 14

Getting Fancy Use fancy features to help satisfy queries by
reading from single partition. Friday, May 16, 14

reading from single partition. • List, Set, and Map collections Friday, May 16, 14

reading from single partition. • List, Set, and Map collections CREATE TABLE songs ( artist text, album text, song text, tags set<text>, PRIMARY KEY ((artist, album, song)) ) Friday, May 16, 14 Alternative model would be song <-> tags, requires more reads.

Getting Fancy • Alternative to set of tags: CREATE TABLE
songs ( artist text, album text, song text, PRIMARY KEY ((artist, album, song)) ) CREATE TABLE tags ( artist text, album text, song text, tag text, PRIMARY KEY ((artist, album, song), tag) ) Friday, May 16, 14 Requires more partition reads

reading from single partition. • List, Set, and Map collections • User defined types Friday, May 16, 14 New in 2.1

reading from single partition. • List, Set, and Map collections • User defined types CREATE TYPE address ( street text, city text, zip int ) CREATE TABLE users ( username text PRIMARY KEY, email text, addresses map<text, address> ) Friday, May 16, 14 Support “residential”, “business”, and other addresses. Alternative is separate table for user addresses

reading from single partition. • List, Set, and Map collections • User defined types • Static Columns Friday, May 16, 14 New in 2.1

reading from single partition. • List, Set, and Map collections • User defined types • Static Columns CREATE TABLE post_and_comments ( post_id text, post_body text static, comment_time timeuuid, commenter text, comment_body text, PRIMARY KEY (post_id, comment_time) ) Friday, May 16, 14 Always get post body with comments

Getting Fancy • Static Columns post_id | post_body | comment_time
| comment ------------------------------------------------- 3 | Lorem ... | 1 | Hey, ... 3 | | 2 | What’s ... 3 | | 3 | Foobar 1 | How to ... | 2 | Barfoo 2 | Four score | 1 | I <3 C* 2 | | 2 | How does ... Friday, May 16, 14 Always get post body with comments

Questions? @tylhobbs [email protected] Friday, May 16, 14

Bonus: Transactions INSERT INTO users (username, email, age) VALUES (?,
?, ?) IF NOT EXISTS Friday, May 16, 14

Bonus: Transactions INSERT INTO users (username, email, age) VALUES (?,
?, ?) IF NOT EXISTS Use for the 1% of your app that requires them. Friday, May 16, 14

Principles of Cassandra Data Modeling

Principles of Cassandra Data Modeling

More Decks by Tyler Hobbs

Other Decks in Technology

Featured

Transcript