Slide 1

Cassandra Read and Write Requests
Nicolas DOUILLET / Lead Dev @Batch / @mini_marcel

Slide 2

Basics: key/value
❖ Cassandra is a key/value store
❖ Write requests concern ONE partition key
❖ Read requests mainly concern ONE partition key
❖ You (almost) can’t filter on other columns
❖ No joins, no constraints…
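
A minimal sketch of what this implies in CQL (the user table is defined on a later slide; the uuid literal is a placeholder): reads are addressed by the partition key, and filtering on any other column is rejected unless you opt in explicitly.

SELECT firstname, lastname FROM user
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Filtering on a non-key column is refused unless explicitly allowed:
SELECT id FROM user WHERE lastname = 'Martin';                   -- rejected by Cassandra
SELECT id FROM user WHERE lastname = 'Martin' ALLOW FILTERING;   -- possible, but scans the table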

Slide 3

Basics: Schema
❖ A “database” is called a Keyspace
❖ A “table” is called a ColumnFamily
   (with CQL3 we usually use the term table)
❖ The primary key has two parts:
   ❖ the Partition Key (let’s consider only this one for now):
     the only way to access your data
   ❖ the Cluster Key (we’ll talk about it later)
❖ A ColumnFamily contains a list of Rows
❖ A Row contains a list of Columns
❖ In other terms: Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

Slide 4

Basics: Repartition
❖ The Ring is divided into VNodes
❖ Each VNode is a range of Tokens
❖ Each physical Node is assigned a list of VNodes
❖ The placement of a row in the Ring is determined by the hash of the PartitionKey
   (using the Partitioner)
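
A small sketch of that hashing, assuming the user table from the “Simple tables” slide: the CQL token() function exposes the token the Partitioner computes from the partition key, which is what decides the owning VNode.

SELECT id, token(id) FROM user LIMIT 5;
-- Tokens, not ids, decide which VNode (and therefore which nodes) store each row.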

Slide 5

Basics: Replication
❖ Replication Factor (RF)
❖ The RF is defined per Keyspace
❖ 3 replication strategies:
   Local, Simple, NetworkTopology
❖ With RF=3, the data is replicated on 3 nodes
❖ With 4 nodes and RF=3, each node owns 75% of the data

CREATE KEYSPACE myKeyspace WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': '3'};
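
For a multi-datacenter cluster you would typically use NetworkTopologyStrategy instead; a minimal sketch, assuming two data centers named 'eu' and 'us' (the names are placeholders and must match your snitch configuration):

CREATE KEYSPACE myKeyspace WITH replication =
    {'class': 'NetworkTopologyStrategy', 'eu': 3, 'us': 2};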

Slide 6

Basics: Consistency
❖ Cassandra is AP
   (according to the CAP theorem)
❖ Cassandra offers tunable Consistency
   (a tradeoff between consistency and latency)
❖ The Consistency Level is defined on each request
❖ Levels: Any (for writes only), One, Two, Three, All and
   Quorum = floor(RF/2) + 1
❖ RF=3 => 2 replicas involved in the query at Quorum
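
A quick worked example: with RF=3, Quorum = floor(3/2) + 1 = 2 replicas. In cqlsh the level can be set for the session with the CONSISTENCY command (drivers expose an equivalent per-request setting); a minimal sketch with a placeholder uuid:

CONSISTENCY QUORUM;
SELECT firstname FROM user
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;   -- waits for 2 of the 3 replicas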

Slide 7

Basics: Simple tables (with CQL3)

CREATE TABLE user (
    id uuid,
    firstname text,
    lastname text,
    birthdate timestamp,
    PRIMARY KEY (id)
);

Table parameters:

CREATE TABLE user (…)
    WITH bloom_filter_fp_chance = 0.1
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND compaction = {'min_threshold': '4', 'class': 'LeveledCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND read_repair_chance = 0.0;

Composite PartitionKey:

CREATE TABLE device (
    id uuid,
    platform text,
    lastvisit timestamp,
    PRIMARY KEY ((id, platform))
);
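
One consequence of the composite PartitionKey above, sketched here with a placeholder uuid: every column of the partition key must be provided, because the hash is computed over the whole (id, platform) pair.

SELECT lastvisit FROM device
WHERE id = 5132b130-ae79-11e5-a837-0800200c9a66
  AND platform = 'IOS';
-- Omitting platform would be rejected: the partition cannot be located from id alone.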

Slide 8

Basics: Wide Rows (with CQL3)
❖ Store a set of records on one PartitionKey
❖ The ClusterKey is part of the PrimaryKey
❖ CQL3 hides the wide row mechanism:
   you keep working on records.
❖ The ClusterKey accepts comparison operators
   (unlike the PartitionKey, which only allows equality)
❖ The ClusterKey accepts ordering
❖ Huge number of rows per partition (2 billion)
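
A minimal sketch of both points, using a hypothetical time-series table (user_event is not from the slides): the clustering column can be sliced with comparison operators and reordered, while the partition key is always an equality.

CREATE TABLE user_event (
    userid uuid,
    event_time timestamp,
    event text,
    PRIMARY KEY ((userid), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Slice and reorder on the ClusterKey, equality on the PartitionKey:
SELECT event_time, event FROM user_event
WHERE userid = 62c36092-82a1-3a00-93d1-46196ee77204
  AND event_time >= '2015-09-01'
  AND event_time < '2015-10-01'
ORDER BY event_time ASC;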

Slide 9

Basics: ClusterKey (with CQL3)

CREATE TABLE user_device (
    userid uuid,
    deviceid uuid,
    platform text,
    last_use timestamp,
    version text,
    PRIMARY KEY ((userid), deviceid, platform)
);

How values are stored (PartitionKey = userid, ClusterKey = deviceid:platform,
Composite Column = ClusterKey + column name; deviceid values shown simplified):

userid 1:
    1:ANDROID:             (empty)
    1:ANDROID:last_use     2015-09-16 18:56:00Z
    1:IOS:                 (empty)
    1:IOS:last_use         2015-08-02 15:20:00Z
    1:IOS:version          8.3
userid 2:
    3:ANDROID:             (empty)
    3:ANDROID:last_use     2015-09-10 22:03:00Z
    3:ANDROID:version      4.4
…
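
A short sketch of what this buys you (placeholder uuid): all the devices of one user live in a single partition, so one read request returns them all, already sorted by the ClusterKey.

SELECT deviceid, platform, last_use, version
FROM user_device
WHERE userid = 62c36092-82a1-3a00-93d1-46196ee77204;
-- One partition, one wide row on disk, records returned sorted by (deviceid, platform).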

Slide 10

Writes

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

[Diagram: a ring of 5 nodes, N1 to N5]

Slide 11

Writes: Coordinator

The client chooses a coordinator.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 12

Writes: Replicas

The coordinator determines the alive replicas involved,
by hashing the Key with the Partitioner.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 13

Writes: Perform Write

The coordinator sends a Write request to each replica.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 14

Writes: Wait for response

The coordinator waits until enough nodes have responded:
according to the CL, 2 in this example.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 15

Writes: Response

Finally, the coordinator gives the response back to the client.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 16

Writes: on a node

During the Write:
❖ The structure is updated in memory: the Memtable
❖ In the meantime, the write is appended to the Commit log
   (durability)
❖ A timestamp is associated with each updated column

Asynchronously, when the contents exceed a threshold:
❖ The Memtable is flushed to disk: an SSTable
❖ Data in the Commit log is purged
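
That per-column timestamp is visible from CQL, which helps when reasoning about which write wins; a minimal sketch with a placeholder uuid:

SELECT lastname, WRITETIME(lastname) FROM user
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;
-- WRITETIME returns the write timestamp of that column, in microseconds.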

Slide 17

Writes: SSTable (Sorted Strings Table)
❖ Data File:
   ❖ Key/Values sorted by keys
❖ Index File:
   ❖ (Key, offset) pairs
❖ Bloom Filter
❖ May contain only part of the data of a CF or of a key
❖ Concerns one ColumnFamily
❖ Multiple SSTables for one ColumnFamily
❖ Immutable

Slide 18

More on Writes
❖ Writes are really fast
❖ Cassandra doesn’t check whether the row already exists
❖ Inserts and updates are almost the same thing: upserts
❖ A delete creates a tombstone
❖ Cassandra supports Batches
❖ Batches can be atomic
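
A minimal sketch of those behaviours in CQL, reusing the user and device tables from the earlier slides (uuids are placeholders):

-- An UPDATE on a row that does not exist yet still writes it (upsert):
UPDATE user SET lastname = 'Durand'
WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- A DELETE does not remove data in place, it writes a tombstone:
DELETE FROM user WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- A logged batch is atomic: either all statements are eventually applied, or none.
BEGIN BATCH
  INSERT INTO user (id, firstname) VALUES (11111111-2222-3333-4444-555555555555, 'Bob');
  INSERT INTO device (id, platform, lastvisit)
  VALUES (11111111-2222-3333-4444-555555555555, 'IOS', '2015-09-16 18:56:00');
APPLY BATCH;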

Slide 19

Writes: concerns
❖ Repartition on nodes (uuid and timeuuid)
❖ Hot spots
❖ Choose the RF according to the consistency you need
❖ Prefer idempotent inserts/updates
❖ Use timestamps on queries (see the sketch below)
❖ Prefer concurrent writes to a batch
❖ Trust your cluster
   (a low consistency level is sufficient most of the time)
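
A minimal sketch of an explicit, idempotent write (placeholder uuid): setting the timestamp yourself means a retry of the exact same statement cannot overwrite a newer value.

INSERT INTO user (id, firstname)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Alice')
USING TIMESTAMP 1442419200000000;   -- microseconds; replaying this exact write is harmless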

Slide 20

Reads

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

[Diagram: a ring of 5 nodes, N1 to N5]

Slide 21

Reads: Coordinator

The client chooses a coordinator.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 22

Reads: Replicas

The coordinator determines the alive replicas involved, sorted by proximity.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 23

Reads: Replicas

The coordinator determines the replicas to repair: read repair.
According to the read_repair_chance set on the ColumnFamily.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 24

Reads: First pass

The coordinator sends a Read Request to the closest replica and Digest Requests to the others.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 25

Reads: Wait for responses

The coordinator waits until enough nodes have responded:
according to the CL, and the full data response must be among them.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 26

Reads: Second pass

If the digests don’t match, the coordinator sends a Read Request to all replicas.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 27

Reads: Wait for responses (again)

The coordinator waits until enough nodes have responded:
according to the CL, 2 in this example.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 28

Reads: Resolve

The coordinator resolves the data and sends Repair Requests to the outdated nodes.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 29

Reads: Response

Finally, the coordinator gives the response back to the client.

❖ 5 nodes
❖ RF=3
❖ CL = Quorum

Slide 30

Reads: on a node

First step: search in SSTables
❖ Use the bloom filter to filter out SSTables
❖ Use the index to find the row and columns
❖ Read & merge data

Second step: search in memory
❖ The Memtable contains the most recent data (the latest updates)

Note:
❖ A KeyCache is used to speed up the column index seek
❖ A RowCache holding frequently read partitions can be used (see the sketch below)
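
Both caches are configured per table; a minimal sketch using the caching syntax of the Cassandra version shown on the earlier table-options slide (newer versions take a map literal instead of a JSON string):

ALTER TABLE user
    WITH caching = '{"keys":"ALL", "rows_per_partition":"100"}';
-- keep the key cache for all keys and cache the first 100 rows of each partition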

Slide 31

How SSTables affect Latency

Slide 32

Compaction
❖ Reduce the number of SSTables by merging
❖ Mainly processed in the background (SSTables are immutable)
❖ 3 built-in strategies:
   ❖ Size-Tiered (STCS)
   ❖ Leveled (LCS)
   ❖ Date-Tiered (DTCS)

Slide 33

Compaction: Size-Tiered
❖ From the Google BigTable paper
❖ Merges similar-sized SSTables
   (when there are more than 4 of them, by default)
❖ Can require up to 100% extra free space during a compaction
❖ No guarantee on how many SSTables are read for one row
❖ Useful for:
   ❖ Write-heavy workloads
   ❖ Rows that are write-once
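
The strategy and its thresholds are set per table; a minimal sketch on the user table from earlier (STCS is also the default when nothing is specified):

ALTER TABLE user WITH compaction =
    {'class': 'SizeTieredCompactionStrategy', 'min_threshold': '4', 'max_threshold': '32'};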

Slide 34

Compaction: Leveled
❖ Based on LevelDB
   (from the Chromium team at Google)
❖ Fixed, small-sized SSTables (5MB by default)
❖ Non-overlapping within each level
❖ Reads from a single SSTable 90% of the time
❖ Twice as much I/O as Size-Tiered
❖ Useful for:
   ❖ Rows that are frequently updated
   ❖ High sensitivity to read latency
   ❖ Deletions or TTLed columns in wide rows
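
A minimal sketch of switching a table to LCS; sstable_size_in_mb is the per-SSTable target size (the default depends on the Cassandra version, so the value here is an explicit choice, not a claimed default):

ALTER TABLE user WITH compaction =
    {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};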

Slide 35

Compaction: Date-Tiered
❖ By Björn Hegerfors at Spotify
❖ Designed for time-series-like data
❖ The only strategy based on the age of the data
❖ Groups SSTables into windows based on how old the data is
❖ Each window is 4 times (by default) the size of the previous one
❖ If using TTL, an entire SSTable can be dropped at once
❖ Written data should be continuous in time (writing old data out of order defeats the strategy)
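
A minimal sketch on the hypothetical user_event table introduced earlier, combining DTCS with a table-level TTL so that whole SSTables can expire and be dropped together (the 7-day TTL is an arbitrary example):

ALTER TABLE user_event
    WITH compaction = {'class': 'DateTieredCompactionStrategy'}
    AND default_time_to_live = 604800;   -- 7 days, in seconds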

Slide 36

Read Concerns
❖ Model your tables for reads
   (but stay mindful of writes)
❖ Use denormalization (see the sketch below)
❖ Choose your primary keys carefully
❖ Keep the number of SSTables low
   (choose your compaction strategy)
❖ Use high consistency only if really needed
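
A minimal sketch of denormalization, using hypothetical tables (not from the slides): the same data is written twice, once per read pattern, so each query hits a single partition.

CREATE TABLE devices_by_user (
    userid uuid, deviceid uuid, platform text, last_use timestamp,
    PRIMARY KEY ((userid), deviceid)
);

CREATE TABLE users_by_device (
    deviceid uuid, userid uuid, platform text, last_use timestamp,
    PRIMARY KEY ((deviceid), userid)
);

-- Every write goes to both tables (ideally in the same logged batch),
-- and each read pattern then needs only one partition key lookup.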

Slide 37

Thanks!!!
Cassandra Read & Write Requests
Nicolas DOUILLET / Lead Dev @Batch / @mini_marcel