Pro Yearly is on sale from $80 to $50! »

Principles of Cassandra Data Modeling

Principles of Cassandra Data Modeling

This talk covers the fundamentals of designing your Cassandra data model. Follow these basic guidelines when designing your schema.

9887d7469053de692ac3ac1fa1f9892a?s=128

Tyler Hobbs

May 15, 2014
Tweet

Transcript

  1. ©2014 DataStax Tyler Hobbs Principles of Cassandra Data Modeling 1

    Cassandra Committer at DataStax tyler@datastax.com @tylhobbs Friday, May 16, 14
  2. About this Talk This is about getting the basics right.

    Friday, May 16, 14
  3. About this Talk This is about getting the basics right.

    Fancy tricks can come later, once you know how to evaluate them. Friday, May 16, 14
  4. General Principles 1. Spread data evenly around the cluster Friday,

    May 16, 14
  5. Primary key determines placement* Partitioning jim carol johnny suzy age:

    36 car: camaro gender: M age: 37 car: subaru gender: F age:12 gender: M age:10 gender: F Friday, May 16, 14
  6. jim carol johnny suzy PK 5e02739678... a9a0198010... f4eb27cea7... 78b421309e... Murmur

    Hash Murmur* hash operation yields a 64-bit number for keys of any size. Friday, May 16, 14
  7. Node D Node C Node B Node A The “token

    ring” 0xc000000000..0 0x0000000000..0 0x4000000000..0 0x8000000000..0 Friday, May 16, 14
  8. jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e... Start End

    A 0x0000000000..1 0x4000000000..0 B 0x4000000000..1 0x8000000000..0 C 0x8000000000..1 0xc000000000..0 D 0xc000000000..1 0x0000000000..0 Friday, May 16, 14
  9. jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e... Start End

    A 0x0000000000..1 0x4000000000..0 B 0x4000000000..1 0x8000000000..0 C 0x8000000000..1 0xc000000000..0 D 0xc000000000..1 0x0000000000..0 Friday, May 16, 14
  10. jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e... Start End

    A 0x0000000000..1 0x4000000000..0 B 0x4000000000..1 0x8000000000..0 C 0x8000000000..1 0xc000000000..0 D 0xc000000000..1 0x0000000000..0 Friday, May 16, 14
  11. jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e... Start End

    A 0x0000000000..1 0x4000000000..0 B 0x4000000000..1 0x8000000000..0 C 0x8000000000..1 0xc000000000..0 D 0xc000000000..1 0x0000000000..0 Friday, May 16, 14
  12. jim 5e02739678... carol a9a0198010... johnny f4eb27cea7... suzy 78b421309e... Start End

    A 0x0000000000..1 0x4000000000..0 B 0x4000000000..1 0x8000000000..0 C 0x8000000000..1 0xc000000000..0 D 0xc000000000..1 0x0000000000..0 Friday, May 16, 14
  13. Node D Node C Node B Node A carol a9a0198010...

    Replication Friday, May 16, 14
  14. Node D Node C Node B Node A carol a9a0198010...

    Friday, May 16, 14
  15. Node D Node C Node B Node A carol a9a0198010...

    Friday, May 16, 14
  16. C’’ A’’ D’ C’ A’ D A B’ C B

    Virtual nodes Node D Node C Node B Node A Without vnodes With vnodes Friday, May 16, 14
  17. General Principles 1. Spread data evenly around the cluster Pick

    a good primary key. Friday, May 16, 14
  18. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read Friday, May 16, 14 Why minimize? Network cost of fan-out, disk seeks.
  19. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read Sometimes these conflict. It’s a balance. Friday, May 16, 14
  20. Non-Goals 1. Minimize the number of writes Friday, May 16,

    14
  21. Non-Goals 1. Minimize the number of writes Writes aren’t free,

    but they’re cheap. Friday, May 16, 14
  22. Non-Goals 1. Minimize the number of writes 2. Minimize data

    duplication Friday, May 16, 14
  23. Non-Goals 1. Minimize the number of writes 2. Minimize data

    duplication There are no joins. (Nor do you really want them in a distributed database.) Friday, May 16, 14 Also, disk space is a cheap resource.
  24. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read Minimize partition reads by modeling to fit your queries. Friday, May 16, 14
  25. Model Your Queries 1. Don’t model around relations Friday, May

    16, 14
  26. Model Your Queries 1. Don’t model around relations 2. Don’t

    model around objects Friday, May 16, 14
  27. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support Friday, May 16, 14
  28. Model Your Queries Consider grouping, ordering, filtering, uniqueness, etc. Steps

    to pick a data model: 1. Determine what specific queries to support Friday, May 16, 14
  29. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Friday, May 16, 14
  30. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Use roughly one table per query pattern. Friday, May 16, 14
  31. Example: User Lookup We have users and want to look

    them up. Friday, May 16, 14
  32. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Friday, May 16, 14
  33. Example: User Lookup We have users and want to look

    them up: • by username • by email • want full user details for both types of lookups Friday, May 16, 14
  34. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Friday, May 16, 14
  35. Example: User Lookup CREATE TABLE users_by_username ( username text PRIMARY

    KEY, email text, age int ) We have users and want to look them up: • by username • by email Friday, May 16, 14
  36. Example: User Lookup CREATE TABLE users_by_username ( username text PRIMARY

    KEY, email text, age int ) CREATE TABLE users_by_email ( email text PRIMARY KEY, username text, age int ) We have users and want to look them up: • by username • by email Friday, May 16, 14
  37. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read Friday, May 16, 14
  38. Example: User Lookup CREATE TABLE users_by_username ( username text PRIMARY

    KEY, email text, age int ) CREATE TABLE users_by_email ( email text PRIMARY KEY, username text, age int ) Friday, May 16, 14
  39. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read Friday, May 16, 14
  40. Example: User Lookup CREATE TABLE users_by_username ( username text PRIMARY

    KEY, id uuid ) CREATE TABLE users_by_email ( email text PRIMARY KEY, id uuid ) What if we had done: CREATE TABLE users ( id uuid PRIMARY KEY username text, email text, age int, ) Friday, May 16, 14
  41. Example: User Lookup What if we had done: CREATE TABLE

    users ( id uuid PRIMARY KEY username text, email text, age int, ) We reduced data duplication! Friday, May 16, 14
  42. Example: User Lookup What if we had done: CREATE TABLE

    users ( id uuid PRIMARY KEY username text, email text, age int, ) We reduced data duplication! Need to read twice as many partitions. Friday, May 16, 14 Same problem with secondary indexes Consider cost/frequency of updates vs reads. Worse for other patterns.
  43. Example: User Groups Users are in groups. Friday, May 16,

    14
  44. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Friday, May 16, 14
  45. Example: User Groups Users are in groups. We want to

    get all users in a group. Friday, May 16, 14
  46. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition Friday, May 16, 14
  47. Model Your Queries Steps to pick a data model: 1.

    Determine what specific queries to support 2. Try to create a table where you can satisfy your query by reading (roughly) one partition How do we fit a group in a partition? Friday, May 16, 14
  48. Example: User Groups CREATE TABLE groups ( groupname text, username

    text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14
  49. Example: User Groups groupname is the partitioning key username is

    the clustering key CREATE TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14
  50. Example: User Groups One partition per group. Within a partition,

    order rows by username. CREATE TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14
  51. Example: User Groups One partition per group. Within a partition,

    order rows by username. groupname | username | email | age -------------------------------------- group3 | alice | alice@... | 42 group3 | bob | bob@... | 18 group3 | charles | chuck@... | 37 group1 | mike | mike@... | 14 group2 | dave | dave@... | 63 group2 | erin | erin@... | 21 Friday, May 16, 14
  52. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read Friday, May 16, 14
  53. Example: User Groups One partition per group. Not spread perfectly,

    but reads are great. CREATE TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14 How to break up big groups? Split by time, username prefix, or prefix of a hash of the username.
  54. Example: User Groups We’re duplicating user data many times. CREATE

    TABLE groups ( groupname text, username text, email text, age int, PRIMARY KEY (groupname, username) ) Friday, May 16, 14
  55. Example: User Lookup What if we had done: CREATE TABLE

    users ( id uuid PRIMARY KEY username text, email text, age int ) CREATE TABLE groups ( groupname text, user_id uuid, PRIMARY KEY (groupname, user_id) ) Friday, May 16, 14 What’s wrong with this?
  56. Example: User Lookup What if we had done: CREATE TABLE

    users ( id uuid PRIMARY KEY username text, email text, age int ) CREATE TABLE groups ( groupname text, user_id uuid, PRIMARY KEY (groupname, user_id) ) Read wayyyy too many partitions. Friday, May 16, 14
  57. Example: New Users Users are in groups. Get the N

    newest users in a group. Friday, May 16, 14
  58. Example: User Groups CREATE TABLE group_join_dates ( groupname text, joined

    timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) Friday, May 16, 14
  59. Example: User Groups CREATE TABLE group_join_dates ( groupname text, joined

    timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) Use timeuuid instead of timestamp to avoid collisions. Friday, May 16, 14
  60. Example: User Groups CREATE TABLE group_join_dates ( groupname text, joined

    timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) WITH CLUSTERING ORDER BY (joined DESC) Reverse clustering order to optimize for N latest. Friday, May 16, 14
  61. Example: User Groups CREATE TABLE group_join_dates ( groupname text, joined

    timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) WITH CLUSTERING ORDER BY (joined DESC) SELECT * FROM group_join_dates WHERE groupname=? LIMIT ? Friday, May 16, 14
  62. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read Friday, May 16, 14
  63. Example: User Groups CREATE TABLE group_join_dates ( groupname text, joined

    timeuuid, username text, email text, age int, PRIMARY KEY (groupname, joined) ) How to spread data more evenly? Friday, May 16, 14
  64. Example: User Groups CREATE TABLE group_join_dates ( groupname text, join_day

    text, joined timeuuid, username text, email text, age int, PRIMARY KEY ((groupname, join_day), joined) ) WITH CLUSTERING ORDER BY (joined DESC) Use a compound partition key. Friday, May 16, 14
  65. Example: User Groups CREATE TABLE group_join_dates ( groupname text, join_day

    text, joined timeuuid, username text, email text, age int, PRIMARY KEY ((groupname, join_day), joined) ) WITH CLUSTERING ORDER BY (joined DESC) Use a compound partition key. join_day looks like ‘2014-05-15’ Friday, May 16, 14 Can also use second, month, year, etc.
  66. General Principles 1. Spread data evenly around the cluster 2.

    Minimize the number of partitions read We might a few read more partitions. It’s a balance. Friday, May 16, 14
  67. Getting Fancy Use fancy features to help satisfy queries by

    reading from single partition. Friday, May 16, 14
  68. Getting Fancy Use fancy features to help satisfy queries by

    reading from single partition. • List, Set, and Map collections Friday, May 16, 14
  69. Getting Fancy Use fancy features to help satisfy queries by

    reading from single partition. • List, Set, and Map collections CREATE TABLE songs ( artist text, album text, song text, tags set<text>, PRIMARY KEY ((artist, album, song)) ) Friday, May 16, 14 Alternative model would be song <-> tags, requires more reads.
  70. Getting Fancy • Alternative to set of tags: CREATE TABLE

    songs ( artist text, album text, song text, PRIMARY KEY ((artist, album, song)) ) CREATE TABLE tags ( artist text, album text, song text, tag text, PRIMARY KEY ((artist, album, song), tag) ) Friday, May 16, 14 Requires more partition reads
  71. Getting Fancy Use fancy features to help satisfy queries by

    reading from single partition. • List, Set, and Map collections • User defined types Friday, May 16, 14 New in 2.1
  72. Getting Fancy Use fancy features to help satisfy queries by

    reading from single partition. • List, Set, and Map collections • User defined types CREATE TYPE address ( street text, city text, zip int ) CREATE TABLE users ( username text PRIMARY KEY, email text, addresses map<text, address> ) Friday, May 16, 14 Support “residential”, “business”, and other addresses. Alternative is separate table for user addresses
  73. Getting Fancy Use fancy features to help satisfy queries by

    reading from single partition. • List, Set, and Map collections • User defined types • Static Columns Friday, May 16, 14 New in 2.1
  74. Getting Fancy Use fancy features to help satisfy queries by

    reading from single partition. • List, Set, and Map collections • User defined types • Static Columns CREATE TABLE post_and_comments ( post_id text, post_body text static, comment_time timeuuid, commenter text, comment_body text, PRIMARY KEY (post_id, comment_time) ) Friday, May 16, 14 Always get post body with comments
  75. Getting Fancy • Static Columns post_id | post_body | comment_time

    | comment ------------------------------------------------- 3 | Lorem ... | 1 | Hey, ... 3 | Lorem ... | 2 | What’s ... 3 | Lorem ... | 3 | Foobar 1 | How to ... | 2 | Barfoo 2 | Four score | 1 | I <3 C* 2 | Four score | 2 | How does ... Friday, May 16, 14 Always get post body with comments
  76. Getting Fancy • Static Columns post_id | post_body | comment_time

    | comment ------------------------------------------------- 3 | Lorem ... | 1 | Hey, ... 3 | | 2 | What’s ... 3 | | 3 | Foobar 1 | How to ... | 2 | Barfoo 2 | Four score | 1 | I <3 C* 2 | | 2 | How does ... Friday, May 16, 14 Always get post body with comments
  77. Questions? @tylhobbs tyler@datastax.com Friday, May 16, 14

  78. Bonus: Transactions INSERT INTO users (username, email, age) VALUES (?,

    ?, ?) IF NOT EXISTS Friday, May 16, 14
  79. Bonus: Transactions INSERT INTO users (username, email, age) VALUES (?,

    ?, ?) IF NOT EXISTS Use for the 1% of your app that requires them. Friday, May 16, 14