
Cassandra / Read & Writes

Understanding how a database works allows you to use it more efficiently, and this is even more true for NoSQL databases. Let's see how Cassandra handles read and write requests on a ring, and how compactions impact those requests.

Nicolas DOUILLET

September 16, 2015


Transcript

  1. Basics : key/value
     ❖ Cassandra is a key/value store
     ❖ Write requests concern ONE partition key
     ❖ Read requests mainly concern ONE partition key
     ❖ You (almost) can’t filter on other columns
     ❖ No joins, no constraints …
  2. Basics : Schema
     ❖ A “database” is called a Keyspace
     ❖ A “table” is called a ColumnFamily
       (with cql3 we usually use the term table)
     ❖ The primary key has two parts :
       ❖ the Partition Key (let’s consider only this one for now)
         The only way to access your data
       ❖ the Cluster Key (we’ll talk about it later)
     ❖ A ColumnFamily contains a list of Rows
     ❖ A Row contains a list of Columns
     ❖ In other terms : Map<String, Map<String, ?>>
  3. Basics : Repartition
     ❖ The Ring is divided into VNodes
     ❖ Each VNode is a range of Tokens
     ❖ Each physical Node obtains a list of VNodes
     ❖ The placement of a row in the Ring is determined by the hash of the PartitionKey
       (using the Partitioner)
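
     A quick way to see this placement, as a minimal sketch (assuming the default Murmur3Partitioner and the user table defined on slide 6) :

       SELECT id, token(id) FROM user LIMIT 3;
       -- the token decides which VNode, and therefore which nodes, own each row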
  4. Basics : Replication
     ❖ Replication Factor (RF)
     ❖ The RF is defined per Keyspace
     ❖ 3 replication strategies : Local, Simple, NetworkTopology
     ❖ RF=3 : the data is replicated on 3 nodes
     ❖ 4 nodes and RF=3 : each node owns 75% of the data

     CREATE KEYSPACE myKeyspace WITH replication =
       {'class': 'SimpleStrategy', 'replication_factor': '3'};
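
     For a multi-datacenter cluster, the NetworkTopology strategy takes an RF per datacenter; a minimal sketch (the datacenter names DC1 and DC2 are made-up examples) :

       CREATE KEYSPACE myKeyspace WITH replication =
         {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};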
  5. Basics : Consistency
     ❖ Cassandra is AP (according to the CAP theorem)
     ❖ Cassandra offers a tunable Consistency
       (a tradeoff between consistency and latency)
     ❖ The Consistency Level (CL) is defined on each request
     ❖ Levels : Any (for writes only), One, Two, Three, All and Quorum
     ❖ Quorum = RF/2 + 1 (integer division)
     ❖ RF=3 => 2 replicas involved in the query at Quorum
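
     In cqlsh the level is set for the session (drivers usually expose it per statement); a minimal sketch :

       CONSISTENCY QUORUM;
       -- with RF=3, every read or write below now waits for 2 replicas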
  6. Basics : Simple tables (with cql3)

     CREATE TABLE user (
       id uuid,
       firstname text,
       lastname text,
       birthdate timestamp,
       PRIMARY KEY (id)
     );

     -- Table parameters
     CREATE TABLE user (…)
       WITH bloom_filter_fp_chance = 0.1
       AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
       AND compaction = {'min_threshold': '4', 'class': 'LeveledCompactionStrategy', 'max_threshold': '32'}
       AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
       AND read_repair_chance = 0.0;

     CREATE TABLE device (
       id uuid,
       platform text,
       lastvisit timestamp,
       PRIMARY KEY ((id, platform))   -- composite PartitionKey
     );
  7. Basics : Wide Rows (with cql3)
     ❖ Store on one PartitionKey a set of records
     ❖ The ClusterKey is part of the PrimaryKey
     ❖ CQL3 hides the wide row mechanism : you keep working on records.
     ❖ The ClusterKey accepts comparison operators
       (unlike the PartitionKey, which only allows equality)
     ❖ The ClusterKey accepts ordering
     ❖ Huge number of rows per partition (up to 2 billion)
  8. Basics : ClusterKey (with cql3)

     CREATE TABLE user_device (
       userid uuid,
       deviceid uuid,
       platform text,
       last_use timestamp,
       version text,
       PRIMARY KEY ((userid), deviceid, platform)   -- PartitionKey : userid / ClusterKey : (deviceid, platform)
     );

     How values are stored : one partition per PartitionKey, one Composite Column (named ClusterKey:columnName) per value.

       userid = 1 :  1:ANDROID: = '' | 1:ANDROID:last_use = 2015-09-16 18:56:00Z |
                     1:IOS: = '' | 1:IOS:last_use = 2015-08-02 15:20:00Z | 1:IOS:version = 8.3
       userid = 2 :  3:ANDROID: = '' | 3:ANDROID:last_use = 2015-09-10 22:03:00Z | 3:ANDROID:version = 4.4
  9. Writes : Coordinator
     ❖ The client chooses a coordinator.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
     [Ring diagram : Client, Coordinator, nodes N1–N5]
  10. Writes : Replicas
     ❖ The coordinator determines the alive replicas involved, by hashing the Key with the Partitioner.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  11. Writes : Perform Write
     ❖ The coordinator sends a Write request to each replica.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  12. Writes : Wait for response
     ❖ The coordinator waits until enough nodes have responded, according to the CL : 2 in this example.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  13. Writes : Response
     ❖ Finally, the coordinator gives the response back to the client.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
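
     This whole path can be observed from cqlsh; a minimal sketch against the user table from slide 6 (the trace lists the coordinator and the replicas it contacted) :

       TRACING ON;
       CONSISTENCY QUORUM;
       INSERT INTO user (id, firstname, lastname) VALUES (uuid(), 'Ada', 'Lovelace');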
  14. Writes : on a node
     ❖ During the Write :
       ❖ The structure is updated in memory : the Memtable
       ❖ In the meantime, the write is appended to the Commit log (durability)
       ❖ A timestamp is associated to each updated column
     ❖ Asynchronously, when the contents exceed a threshold :
       ❖ The Memtable is flushed to disk : an SSTable
       ❖ Data in the Commit log is purged.
  15. Writes : SSTable (Sorted Strings Table)
     ❖ Data File : Key/Values sorted by keys
     ❖ Index File : (Key, offset) pairs
     ❖ Bloom Filter
     ❖ May contain only part of the data of a CF, or of a key
     ❖ Concerns one ColumnFamily ; multiple SSTables for one ColumnFamily
     ❖ Immutable
  16. More on Writes
     ❖ Writes are really fast
     ❖ Cassandra doesn’t check if the row already exists
     ❖ Inserts and updates are almost the same operation : upsert
     ❖ A delete creates a tombstone
     ❖ Cassandra supports Batches
     ❖ Batches can be atomic
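
     A minimal sketch of these three points against the user table from slide 6 (the uuid is a made-up example) :

       -- an UPDATE on a row that doesn't exist simply creates it : upsert
       UPDATE user SET firstname = 'Ada' WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

       -- a DELETE only writes a tombstone; the data really disappears at compaction time
       DELETE FROM user WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

       -- a logged batch is atomic (all or nothing); UNLOGGED batches are not
       BEGIN BATCH
         INSERT INTO user (id, firstname) VALUES (uuid(), 'Grace');
         INSERT INTO user (id, firstname) VALUES (uuid(), 'Alan');
       APPLY BATCH;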
  17. Writes : concerns
     ❖ Repartition on nodes (uuid and timeuuid)
     ❖ Watch out for hot spots
     ❖ Choose the RF according to the consistency you need
     ❖ Prefer idempotent inserts/updates
     ❖ Use timestamps on queries
     ❖ Prefer concurrent writes to a batch
     ❖ Trust your cluster
       (a low consistency level is sufficient most of the time)
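
     For instance, setting the write timestamp explicitly makes a retried insert idempotent; a sketch (the uuid and timestamp are made-up example values) :

       INSERT INTO user (id, firstname, lastname)
       VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada', 'Lovelace')
       USING TIMESTAMP 1442424960000000;   -- microseconds; the highest timestamp wins on conflict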
  18. Reads : Coordinator
     ❖ The client chooses a coordinator.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
     [Ring diagram : Client, Coordinator, nodes N1–N5]
  19. Reads : Replicas
     ❖ The coordinator determines the alive replicas involved, sorted by proximity.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  20. Reads : Replicas
     ❖ The coordinator determines the replicas to repair : read repair, according to the repair chance set on the ColumnFamily.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
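
     That chance is the read_repair_chance parameter seen on slide 6; a sketch of tuning it on the user table :

       ALTER TABLE user
         WITH read_repair_chance = 0.1            -- 10% of reads also check all replicas, in every DC
         AND dclocal_read_repair_chance = 0.0;    -- same idea, limited to the coordinator's datacenter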
  21. Reads : First pass
     ❖ The coordinator sends a Read Request to one replica and Digest Requests to the others.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  22. Reads : Wait for responses
     ❖ The coordinator waits until enough nodes have responded, according to the CL, and if the data is present.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  23. Reads : Second pass
     ❖ If the digests don’t match, the coordinator sends a Read Request to all replicas.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  24. Reads : Wait for responses (again)
     ❖ The coordinator waits until enough nodes have responded, according to the CL : 2 in this example.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  25. Reads : Resolve
     ❖ The coordinator resolves the data (the most recent timestamp wins) and sends Repair Requests to the outdated nodes.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  26. Reads : Response
     ❖ Finally, the coordinator gives the response back to the client.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  27. Reads : on a node
     ❖ First step : search in memory
       ❖ The Memtable contains the most recent data (updated lately)
     ❖ Second step : search in SSTables
       ❖ Use the bloom filter to filter out SSTables
       ❖ Use the index to find the row, and the columns
       ❖ Read & merge data
     ❖ Note :
       ❖ A KeyCache is used to speed up column index seeks
       ❖ A RowCache holding frequently asked partitions can be used
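
     Both caches are configured per table through the caching parameter seen on slide 6; a sketch mirroring that syntax (the exact form depends on the Cassandra version, and the row cache must also be enabled server-side) :

       ALTER TABLE user
         WITH caching = '{"keys":"ALL", "rows_per_partition":"100"}';
       -- keep all keys in the KeyCache, and the first 100 rows of each partition in the RowCache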
  28. Compaction
     ❖ Reduce the number of SSTables by merging them
     ❖ Mainly processed in the background (SSTables are immutable)
     ❖ 3 built-in strategies :
       ❖ Size-Tiered (STCS)
       ❖ Leveled (LCS)
       ❖ Date-Tiered (DTCS)
  29. Compaction : Size-Tiered
     ❖ From the Google BigTable paper
     ❖ Merge similar-sized SSTables (when > 4 of them, by default)
     ❖ Can need up to 100% of extra free space during compaction
     ❖ No guarantee on how many SSTables are read for one row
     ❖ Useful for :
       ❖ Write-heavy workloads
       ❖ Rows that are write-once
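
     Switching a table to it is a plain ALTER (shown on the user table, mirroring the thresholds from slide 6) :

       ALTER TABLE user WITH compaction =
         {'class': 'SizeTieredCompactionStrategy', 'min_threshold': '4', 'max_threshold': '32'};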
  30. Compaction : Leveled
     ❖ Based on LevelDB, from the Chromium team at Google
     ❖ Fixed, small-sized SSTables (5MB by default)
     ❖ Non-overlapping within each level
     ❖ Read from a single SSTable 90% of the time
     ❖ Twice as much I/O as Size-Tiered
     ❖ Useful for :
       ❖ Rows that are frequently updated
       ❖ High sensitivity to read latency
       ❖ Deletions or TTLed columns in wide rows
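
     A sketch of enabling it with an explicit SSTable target size (sstable_size_in_mb is the strategy's size option; the value is just an example) :

       ALTER TABLE user WITH compaction =
         {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};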
  31. Compaction : Date-Tiered
     ❖ By Björn Hegerfors at Spotify
     ❖ Designed for time-series-like data
     ❖ The only strategy based on the age of the data
     ❖ Groups SSTables into windows based on how old the data is
     ❖ Each window is 4 times (by default) the size of the previous one
     ❖ If using TTLs, an entire SSTable can be dropped at once
     ❖ Written data time should be continuous : don’t write old data
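
     A sketch on a hypothetical time-series table named events (base_time_seconds and max_sstable_age_days are DTCS options; a table-level TTL lets whole expired SSTables be dropped) :

       ALTER TABLE events
         WITH compaction = {'class': 'DateTieredCompactionStrategy',
                            'base_time_seconds': '3600',      -- size of the most recent window
                            'max_sstable_age_days': '365'}    -- stop compacting SSTables older than this
         AND default_time_to_live = 2592000;                  -- 30 days; fully expired SSTables can be dropped whole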
  32. Read Concerns
     ❖ Model your tables for reads (and stay concerned by writes)
     ❖ Use denormalization
     ❖ Choose your primary keys carefully
     ❖ Keep the number of SSTables low (choose your compaction strategy)
     ❖ Use high consistency only if really needed
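
     Denormalization usually means one table per query; a sketch with hypothetical table names, derived from user_device on slide 8 :

       -- the same data stored twice, each copy keyed for one access pattern
       CREATE TABLE devices_by_user (
         userid uuid,
         deviceid uuid,
         platform text,
         last_use timestamp,
         PRIMARY KEY ((userid), deviceid)
       );

       CREATE TABLE users_by_device (
         deviceid uuid,
         userid uuid,
         platform text,
         last_use timestamp,
         PRIMARY KEY ((deviceid), userid)
       );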