
Cassandra / Read & Writes

Understanding how a database works allows you to use it more efficiently, and this is even more true for NoSQL databases. Let's see how Cassandra handles read and write requests on a ring, and how compactions impact those requests.

Nicolas DOUILLET

September 16, 2015


Transcript

  1. Basics : key/value
     ❖ Cassandra is a key/value store
     ❖ Write requests concern ONE partition key
     ❖ Read requests mainly concern ONE partition key
     ❖ You (almost) can’t filter on other columns
     ❖ No joins, no constraints …
  2. Basics : Schema
     ❖ A “database” is called a Keyspace
     ❖ A “table” is called a ColumnFamily
       (with cql3 we usually use the term table)
     ❖ The primary key has two parts :
       ❖ the Partition Key (let’s consider only this one for now)
         The only way to access your data
       ❖ the Cluster Key (we’ll talk about it later)
     ❖ A ColumnFamily contains a list of Rows
     ❖ A Row contains a list of Columns
     ❖ In other terms : Map<String, Map<String, ?>>
  3. Basics : Repartition
     ❖ The Ring is divided into VNodes
     ❖ Each VNode is a range of Tokens
     ❖ Each physical Node obtains a list of VNodes
     ❖ The placement of a row in the Ring is determined by the hash of the PartitionKey
       (using the Partitioner)
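
     A quick way to see this placement, as a minimal sketch (assuming the default Murmur3Partitioner and the user table defined on slide 6) :

       SELECT id, token(id) FROM user LIMIT 3;
       -- the token decides which VNode, and therefore which nodes, own each row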
  4. Basics : Replication
     ❖ Replication Factor (RF)
     ❖ The RF is defined per Keyspace
     ❖ 3 replication strategies : Local, Simple, NetworkTopology
     ❖ RF=3 : the data is replicated on 3 nodes
     ❖ 4 nodes and RF=3 : each node owns 75% of the data

     CREATE KEYSPACE myKeyspace WITH replication =
       {'class': 'SimpleStrategy', 'replication_factor': '3'};
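
     For a multi-datacenter cluster, the NetworkTopology strategy takes an RF per datacenter; a minimal sketch (the datacenter names DC1 and DC2 are made-up examples) :

       CREATE KEYSPACE myKeyspace WITH replication =
         {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2};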
  5. Basics : Consistency
     ❖ Cassandra is AP (according to the CAP theorem)
     ❖ Cassandra offers a tunable Consistency
       (a tradeoff between consistency and latency)
     ❖ The Consistency Level (CL) is defined on each request
     ❖ Levels : Any (for writes only), One, Two, Three, All and Quorum
     ❖ Quorum = RF/2 + 1 (integer division)
     ❖ RF=3 => 2 replicas involved in the query at Quorum
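
     In cqlsh the level is set for the session (drivers usually expose it per statement); a minimal sketch :

       CONSISTENCY QUORUM;
       -- with RF=3, every read or write below now waits for 2 replicas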
  6. Basics : Simple tables (with cql3)

     CREATE TABLE user (
       id uuid,
       firstname text,
       lastname text,
       birthdate timestamp,
       PRIMARY KEY (id)
     );

     -- Table parameters
     CREATE TABLE user (…)
       WITH bloom_filter_fp_chance = 0.1
       AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
       AND compaction = {'min_threshold': '4', 'class': 'LeveledCompactionStrategy', 'max_threshold': '32'}
       AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
       AND read_repair_chance = 0.0;

     CREATE TABLE device (
       id uuid,
       platform text,
       lastvisit timestamp,
       PRIMARY KEY ((id, platform))   -- composite PartitionKey
     );
  7. Basics : Wide Rows (with cql3)
     ❖ Store on one PartitionKey a set of records
     ❖ The ClusterKey is part of the PrimaryKey
     ❖ CQL3 hides the wide row mechanism : you keep working on records.
     ❖ The ClusterKey accepts comparison operators
       (unlike the PartitionKey, which only allows equality)
     ❖ The ClusterKey accepts ordering
     ❖ Huge number of rows per partition (up to 2 billion)
  8. Basics : ClusterKey (with cql3)

     CREATE TABLE user_device (
       userid uuid,
       deviceid uuid,
       platform text,
       last_use timestamp,
       version text,
       PRIMARY KEY ((userid), deviceid, platform)   -- PartitionKey : userid / ClusterKey : (deviceid, platform)
     );

     How values are stored : one partition per PartitionKey, one Composite Column (named ClusterKey:columnName) per value.

       userid = 1 :  1:ANDROID: = '' | 1:ANDROID:last_use = 2015-09-16 18:56:00Z |
                     1:IOS: = '' | 1:IOS:last_use = 2015-08-02 15:20:00Z | 1:IOS:version = 8.3
       userid = 2 :  3:ANDROID: = '' | 3:ANDROID:last_use = 2015-09-10 22:03:00Z | 3:ANDROID:version = 4.4
  9. Writes : Coordinator
     ❖ The client chooses a coordinator.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
     [Ring diagram : Client, Coordinator, nodes N1–N5]
  10. Writes : Replicas
     ❖ The coordinator determines the alive replicas involved, by hashing the Key with the Partitioner.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  11. Writes : Perform Write
     ❖ The coordinator sends a Write request to each replica.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  12. Writes : Wait for response
     ❖ The coordinator waits until enough nodes have responded, according to the CL : 2 in this example.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  13. Writes : Response
     ❖ Finally, the coordinator gives the response back to the client.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
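
     This whole path can be observed from cqlsh; a minimal sketch against the user table from slide 6 (the trace lists the coordinator and the replicas it contacted) :

       TRACING ON;
       CONSISTENCY QUORUM;
       INSERT INTO user (id, firstname, lastname) VALUES (uuid(), 'Ada', 'Lovelace');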
  14. Writes : on a node
     ❖ During the Write :
       ❖ The structure is updated in memory : the Memtable
       ❖ In the meantime, the write is appended to the Commit log (durability)
       ❖ A timestamp is associated to each updated column
     ❖ Asynchronously, when the contents exceed a threshold :
       ❖ The Memtable is flushed to disk : an SSTable
       ❖ Data in the Commit log is purged.
  15. Writes : SSTable (Sorted Strings Table)
     ❖ Data File : Key/Values sorted by keys
     ❖ Index File : (Key, offset) pairs
     ❖ Bloom Filter
     ❖ May contain only part of the data of a CF, or of a key
     ❖ Concerns one ColumnFamily ; multiple SSTables for one ColumnFamily
     ❖ Immutable
  16. More on Writes
     ❖ Writes are really fast
     ❖ Cassandra doesn’t check if the row already exists
     ❖ Inserts and updates are almost the same operation : upsert
     ❖ A delete creates a tombstone
     ❖ Cassandra supports Batches
     ❖ Batches can be atomic
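
     A minimal sketch of these three points against the user table from slide 6 (the uuid is a made-up example) :

       -- an UPDATE on a row that doesn't exist simply creates it : upsert
       UPDATE user SET firstname = 'Ada' WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

       -- a DELETE only writes a tombstone; the data really disappears at compaction time
       DELETE FROM user WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204;

       -- a logged batch is atomic (all or nothing); UNLOGGED batches are not
       BEGIN BATCH
         INSERT INTO user (id, firstname) VALUES (uuid(), 'Grace');
         INSERT INTO user (id, firstname) VALUES (uuid(), 'Alan');
       APPLY BATCH;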
  17. Writes : concerns
     ❖ Repartition on nodes (uuid and timeuuid)
     ❖ Watch out for hot spots
     ❖ Choose the RF according to the consistency you need
     ❖ Prefer idempotent inserts/updates
     ❖ Use timestamps on queries
     ❖ Prefer concurrent writes to a batch
     ❖ Trust your cluster
       (a low consistency level is sufficient most of the time)
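
     For instance, setting the write timestamp explicitly makes a retried insert idempotent; a sketch (the uuid and timestamp are made-up example values) :

       INSERT INTO user (id, firstname, lastname)
       VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Ada', 'Lovelace')
       USING TIMESTAMP 1442424960000000;   -- microseconds; the highest timestamp wins on conflict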
  18. Reads : Coordinator
     ❖ The client chooses a coordinator.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
     [Ring diagram : Client, Coordinator, nodes N1–N5]
  19. Reads : Replicas
     ❖ The coordinator determines the alive replicas involved, sorted by proximity.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  20. Reads : Replicas
     ❖ The coordinator determines the replicas to repair : read repair, according to the repair chance set on the ColumnFamily.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
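
     That chance is the read_repair_chance parameter seen on slide 6; a sketch of tuning it on the user table :

       ALTER TABLE user
         WITH read_repair_chance = 0.1            -- 10% of reads also check all replicas, in every DC
         AND dclocal_read_repair_chance = 0.0;    -- same idea, limited to the coordinator's datacenter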
  21. Reads : First pass
     ❖ The coordinator sends a Read Request to one replica and Digest Requests to the others.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  22. Reads : Wait for responses
     ❖ The coordinator waits until enough nodes have responded, according to the CL, and if the data is present.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  23. Reads : Second pass
     ❖ If the digests don’t match, the coordinator sends a Read Request to all replicas.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  24. Reads : Wait for responses (again)
     ❖ The coordinator waits until enough nodes have responded, according to the CL : 2 in this example.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  25. Reads : Resolve
     ❖ The coordinator resolves the data (the most recent timestamp wins) and sends Repair Requests to the outdated nodes.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  26. Reads : Response
     ❖ Finally, the coordinator gives the response back to the client.
     ❖ 5 nodes ❖ RF=3 ❖ CL = Quorum
  27. Reads : on a node
     ❖ First step : search in memory
       ❖ The Memtable contains the most recent data (updated lately)
     ❖ Second step : search in SSTables
       ❖ Use the bloom filter to filter out SSTables
       ❖ Use the index to find the row, and the columns
       ❖ Read & merge data
     ❖ Note :
       ❖ A KeyCache is used to speed up column index seeks
       ❖ A RowCache holding frequently asked partitions can be used
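
     Both caches are configured per table through the caching parameter seen on slide 6; a sketch mirroring that syntax (the exact form depends on the Cassandra version, and the row cache must also be enabled server-side) :

       ALTER TABLE user
         WITH caching = '{"keys":"ALL", "rows_per_partition":"100"}';
       -- keep all keys in the KeyCache, and the first 100 rows of each partition in the RowCache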
  28. Compaction
     ❖ Reduce the number of SSTables by merging them
     ❖ Mainly processed in the background (SSTables are immutable)
     ❖ 3 built-in strategies :
       ❖ Size-Tiered (STCS)
       ❖ Leveled (LCS)
       ❖ Date-Tiered (DTCS)
  29. Compaction : Size-Tiered
     ❖ From the Google BigTable paper
     ❖ Merge similar-sized SSTables (when > 4 of them, by default)
     ❖ Can need up to 100% of extra free space during compaction
     ❖ No guarantee on how many SSTables are read for one row
     ❖ Useful for :
       ❖ Write-heavy workloads
       ❖ Rows that are write-once
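
     Switching a table to it is a plain ALTER (shown on the user table, mirroring the thresholds from slide 6) :

       ALTER TABLE user WITH compaction =
         {'class': 'SizeTieredCompactionStrategy', 'min_threshold': '4', 'max_threshold': '32'};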
  30. Compaction : Leveled
     ❖ Based on LevelDB, from the Chromium team at Google
     ❖ Fixed, small-sized SSTables (5MB by default)
     ❖ Non-overlapping within each level
     ❖ Read from a single SSTable 90% of the time
     ❖ Twice as much I/O as Size-Tiered
     ❖ Useful for :
       ❖ Rows that are frequently updated
       ❖ High sensitivity to read latency
       ❖ Deletions or TTLed columns in wide rows
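
     A sketch of enabling it with an explicit SSTable target size (sstable_size_in_mb is the strategy's size option; the value is just an example) :

       ALTER TABLE user WITH compaction =
         {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};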
  31. Compaction : Date-Tiered
     ❖ By Björn Hegerfors at Spotify
     ❖ Designed for time-series-like data
     ❖ The only strategy based on the age of the data
     ❖ Groups SSTables into windows based on how old the data is
     ❖ Each window is 4 times (by default) the size of the previous one
     ❖ If using TTLs, an entire SSTable can be dropped at once
     ❖ Written data time should be continuous : don’t write old data
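
     A sketch on a hypothetical time-series table named events (base_time_seconds and max_sstable_age_days are DTCS options; a table-level TTL lets whole expired SSTables be dropped) :

       ALTER TABLE events
         WITH compaction = {'class': 'DateTieredCompactionStrategy',
                            'base_time_seconds': '3600',      -- size of the most recent window
                            'max_sstable_age_days': '365'}    -- stop compacting SSTables older than this
         AND default_time_to_live = 2592000;                  -- 30 days; fully expired SSTables can be dropped whole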
  32. Read Concerns
     ❖ Model your tables for reads (and stay concerned by writes)
     ❖ Use denormalization
     ❖ Choose your primary keys carefully
     ❖ Keep the number of SSTables low (choose your compaction strategy)
     ❖ Use high consistency only if really needed
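
     Denormalization usually means one table per query; a sketch with hypothetical table names, derived from user_device on slide 8 :

       -- the same data stored twice, each copy keyed for one access pattern
       CREATE TABLE devices_by_user (
         userid uuid,
         deviceid uuid,
         platform text,
         last_use timestamp,
         PRIMARY KEY ((userid), deviceid)
       );

       CREATE TABLE users_by_device (
         deviceid uuid,
         userid uuid,
         platform text,
         last_use timestamp,
         PRIMARY KEY ((deviceid), userid)
       );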