Cassandra Storage Engine in 3.0 Lightning Talk

Cassandra Storage Engine Changes in 3.0 Andy Tolbert

Introduction
• The Cassandra data storage model as it exists today does not map well to CQL.
• More oriented to the storage format used back in the Thrift days.
• This is a problem as CQL is the primary interface!
• Users have observed a wide variance in disk used between COMPACT STORAGE and the default format.
• 30X: We Didn’t Use COMPACT STORAGE, You Won’t Believe What Happened Next
• 30%: Understanding the impact of compact storage
• Storage engine has been refactored in 3.0 to rectify this and close the gap with compact storage.
• But does it? Let’s find out!
© 2015 DataStax, All Rights Reserved.

CQL Schema & Glossary

create table if not exists financial.symbol_history (
    symbol text,
    year int,
    month int,
    day int,
    volume bigint,
    close double,
    open double,
    low double,
    high double,
    primary key ((symbol, year), month, day)
) with CLUSTERING ORDER BY (month desc, day desc);

• Partition – A division of rows by some shared identifier.
• Primary key – A column or collection of columns that uniquely defines a row.
• Row – A collection of columns sharing a primary key.
• Partition key – The first column or columns in a primary key, which define what partition data belongs to.
• Clustering column – A column following the partition key.
• Clustering – The collection of clustering column(s) that defines the ordering of data within a partition.

SSTable
• SSTable (Sorted String Table) – An immutable file containing data for a table.
• When data is inserted/updated/deleted (‘mutated’) it is added in memory to a memtable.
• C* will eventually flush a memtable to disk into an SSTable.
• SSTables are merged and cleaned up through a process called ‘compaction’.
• In C* < 3.0 an SSTable is a series of Keys and their ‘Cells’.
   – Key is the combination of the partition key columns, e.g. symbol: IBM, year: 2004 maps to ‘IBM:2004’.
   – A Cell represents a column value or tombstone:
      • value includes an identifier for the data being changed, the data, the timestamp of the operation, and a TTL (if expiring).
      • tombstone includes a deletion identifier which indicates what is being deleted, the timestamp of the delete, and the time it should be deleted. A tombstone can be at a cell level or for a grouping of cells (like a row).

sstable2json
• sstable2json is a useful tool for getting a human readable sstable.

{"key": "IBM:2004",
 "cells": [["12:31:close","98.58",1449463706349096],
           ["12:31:high","98.91",1449463706349096],
           ["12:31:low","98.49",1449463706349096],
           ["12:31:open","98.6",1449463706349096],
           ["12:31:volume","2801200",1449463706349096],
           ["12:30:close","98.3",1449463706349095],
           ["12:30:high","99.0",1449463706349095],
           ["12:30:low","98.07",1449463706349095],
           ["12:30:open","98.1",1449463706349095],
           ["12:30:volume","3812400",1449463706349095],
           ...

IBM:2004    12:31:close 98.58 | 12:31:high 98.91 | 12:31:low 98.49 | 12:31:open 98.6 | 12:31:volume 2801200 | 12:30:close 98.3 | …
YHOO:2005   … | … | … | … | … | … | …
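To make the naming scheme in the output above concrete, here is a minimal Python sketch of how one CQL row could be flattened into pre-3.0 style cells. The function name `row_to_legacy_cells` is hypothetical, and the sketch only mimics what sstable2json prints; it is not Cassandra's actual serialization code.

```python
# Conceptual sketch only: mimics the cell names sstable2json prints for the
# pre-3.0 format. This is not Cassandra's real serialization logic.

def row_to_legacy_cells(clustering, columns, timestamp):
    """Flatten one CQL row into [cell name, value, timestamp] triples.

    clustering: clustering column values in order, e.g. (12, 31)
    columns:    regular (non-primary-key) column name -> value
    """
    # Every cell name starts with the full clustering prefix, e.g. "12:31".
    prefix = ":".join(str(v) for v in clustering)
    return [
        [f"{prefix}:{name}", str(value), timestamp]
        for name, value in sorted(columns.items())  # cells sorted by column name
    ]


cells = row_to_legacy_cells(
    clustering=(12, 31),  # month, day
    columns={"volume": 2801200, "close": 98.58, "open": 98.6,
             "low": 98.49, "high": 98.91},
    timestamp=1449463706349096,
)
for cell in cells:
    print(cell)
```

Each of the five cells carries its own copy of the "12:31" prefix and of the write timestamp, which is the repetition visible in the dump.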

Observations
• Each cell needed all clustering column values in addition to the column name.
• Even though the timestamp was the same across rows, it was duplicated.
• The association of columns is loose (but fortunately the data was ordered).
• How can we overcome this overhead?

{"key": "IBM:2004",
 "cells": [["12:31:close","98.58",1449463706349096],
           ["12:31:high","98.91",1449463706349096],
           ["12:31:low","98.49",1449463706349096],
           ["12:31:open","98.6",1449463706349096],
           ["12:31:volume","2801200",1449463706349096],
           ["12:30:close","98.3",1449463706349095],
           ["12:30:high","99.0",1449463706349095],
           ["12:30:low","98.07",1449463706349095],
           ["12:30:open","98.1",1449463706349095],
           ["12:30:volume","3812400",1449463706349095],
           ...
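The repetition called out in these observations can be counted directly on the dumped cells. The sketch below is a rough illustration only: it counts characters in the JSON-style view, which is just a proxy for the real on-disk encoding, and `repetition_summary` is a hypothetical helper.

```python
# Rough illustration: character counts in the JSON-style dump are only a
# proxy for the bytes Cassandra really writes to disk.

cells = [
    ["12:31:close", "98.58", 1449463706349096],
    ["12:31:high", "98.91", 1449463706349096],
    ["12:31:low", "98.49", 1449463706349096],
    ["12:31:open", "98.6", 1449463706349096],
    ["12:31:volume", "2801200", 1449463706349096],
]


def repetition_summary(cells):
    """Count how often clustering prefixes and timestamps repeat across cells."""
    # Everything before the last ':' is the clustering prefix of the cell name.
    prefixes = [name.rsplit(":", 1)[0] for name, _, _ in cells]
    timestamps = [ts for _, _, ts in cells]
    return {
        "cells": len(cells),
        "distinct_prefixes": len(set(prefixes)),
        "distinct_timestamps": len(set(timestamps)),
        "name_chars": sum(len(name) for name, _, _ in cells),
        "value_chars": sum(len(value) for _, value, _ in cells),
    }


summary = repetition_summary(cells)
print(summary)
```

For this one row, five cells share a single clustering prefix and a single timestamp, and the cell names take more characters than the values they label.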

Compact Storage
• Exists mostly for legacy purposes (old Thrift storage format), but is still usable with CQL.
• Limitations
   • Can only have 1 column that isn’t part of the primary key.
   • Can’t alter the table afterwards.

create table if not exists financial.symbol_history (
    symbol text,
    year int,
    month int,
    day int,
    volume bigint,
    close double,
    open double,
    low double,
    high double,
    primary key ((symbol, year), month, day, close, open, low, high)
) with CLUSTERING ORDER BY (month desc, day desc)
  and COMPACT STORAGE;

Compact Storage visualized

{"key": "IBM:2004",
 "cells": [["12:31:98.58:98.6:98.49:98.91","2801200",1449464448388015],
           ["12:30:98.3:98.1:98.07:99.0","3812400",1449464448388014],
           ...

IBM:2004    12:31:98.58:98.6:98.49:98.91 2801200 | 12:30:98.3:98.1:98.07:99.0 3812400 | …
YHOO:2005   … | … | …
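As a rough sketch of why this dump is so much shorter: with COMPACT STORAGE and the schema above, every column except volume is part of the clustering, so a whole row collapses into one cell whose name is the joined clustering values. The helper `row_to_compact_cell` is hypothetical and only reproduces the printed form, not the real encoding.

```python
# Conceptual sketch only: reproduces the cell shape shown in the dump above,
# not Cassandra's real compact storage serialization.

def row_to_compact_cell(clustering, value, timestamp):
    """Build the single [name, value, timestamp] cell for one compact row."""
    # The cell name is just the clustering values joined with ':';
    # no column name is stored because there is only one non-key column.
    name = ":".join(str(v) for v in clustering)
    return [name, str(value), timestamp]


cell = row_to_compact_cell(
    clustering=(12, 31, 98.58, 98.6, 98.49, 98.91),  # month, day, close, open, low, high
    value=2801200,                                   # volume, the only non-key column
    timestamp=1449464448388015,
)
print(cell)
```

One cell, one timestamp and no column names per row, compared with five of each in the default pre-3.0 layout.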

COMPACT STORAGE summarized
• On disk, less data is stored. This will help us in many ways:
   • Reduces storage cost.
   • Less data to read/write, less I/O, can do more.
• But still, you are using a legacy format, which means:
   • It’s clumsy with CQL.
   • You are not using it as intended.
   • It might not be around forever.

Format             Size on Disk    % delta default
Default            972.30 MB       --
Compact Storage    429.65 MB       -55.82%

Storage Engine Rewrite in 3.0
• CASSANDRA-8099 – Overview
• Substantial refactor of the storage engine.
• No longer a collection of Keys and their Cells, instead a collection of Partitions and their Rows.
• Tombstones at Row level are no longer Cells, they are now at the same level as Rows.
• You can now do range deletes! (DELETE FROM symbol_history WHERE symbol='IBM' AND year=2004 AND month >= 7 AND month <= 9)
• Should open up opportunity for more nice enhancements.
• Common elements are shared between Rows and their cells (column names, timestamps, TTLs, column metadata).
• Delta encoding for shared data.
• Static columns grouped under a ‘Row’ at the beginning of the key, instead of cells.
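The "Partitions and their Rows" model with shared elements and delta encoding can be pictured with a small Python sketch. All names here (`Partition`, `Row`, `min_timestamp`, `timestamp_delta`) are hypothetical and chosen for illustration; this is a conceptual model under those assumptions, not Cassandra's classes or on-disk format.

```python
from dataclasses import dataclass, field

# Conceptual model only; the class and field names are made up for this sketch.


@dataclass
class Row:
    clustering: tuple      # clustering column values, e.g. (12, 31)
    timestamp_delta: int   # row timestamp minus the partition-wide base
    values: list           # cell values, in the column order of the header


@dataclass
class Partition:
    key: str
    columns: list          # column names, stored once instead of per cell
    min_timestamp: int     # base value that per-row deltas are relative to
    rows: list = field(default_factory=list)

    def add_row(self, clustering, timestamp, values_by_name):
        self.rows.append(Row(
            clustering=clustering,
            timestamp_delta=timestamp - self.min_timestamp,
            values=[values_by_name[name] for name in self.columns],
        ))

    def timestamp_of(self, row):
        """Recover the full timestamp from the base plus the stored delta."""
        return self.min_timestamp + row.timestamp_delta


partition = Partition("IBM:2004", ["close", "high", "low", "open", "volume"],
                      min_timestamp=1449469247948010)
partition.add_row((12, 31), 1449469247948011,
                  {"close": 98.58, "high": 98.91, "low": 98.49,
                   "open": 98.6, "volume": 2801200})
partition.add_row((12, 30), 1449469247948010,
                  {"close": 98.3, "high": 99.0, "low": 98.07,
                   "open": 98.1, "volume": 3812400})
print([row.timestamp_delta for row in partition.rows])
```

The point of the sketch is that column names live once in the partition-level header and each row keeps only a small delta against a shared base, rather than repeating a name and a full 16-digit timestamp on every cell.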

sstabledump in 3.0
• Added in 3.0.4, via CASSANDRA-7464

{
  "key": "IBM:2004",
  "rows": [
    {
      "clustering": {"month": "12", "day": "31"},
      "cells": {
        ["close","98.58",1449469247948011],
        ["high","98.91",1449469247948011],
        ["low","98.49",1449469247948011],
        ["open","98.6",1449469247948011],
        ["volume","2801200",1449469247948011]
      }
    },
    {
      "clustering": {"month": "12", "day": "30"},
      "cells": {
        ["close","98.3",1449469247948010],
        ["high","99.0",1449469247948010],
        ["low","98.07",1449469247948010],
        ["open","98.1",1449469247948010],
        ["volume","3812400",1449469247948010]
      }
    },

Partition    Header               Rows
IBM:2004     <column metadata>    Clustering (12, 31) | LiveInfo * | Cells: 98.58|98.91|98.49|98.6|2801200
                                  Clustering (12, 30) | LiveInfo * | Cells: 98.3|99.0|98.07|98.1|3812400
                                  …
YHOO:2005    …                    …
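To relate this row-oriented output to the older cell-oriented dump, the following Python sketch regroups sstable2json-style cells into rows keyed by their clustering values. `cells_to_rows` and `split_name` are hypothetical helpers; the sketch assumes cells arrive already ordered (as they are in an SSTable) and that clustering values contain no ':' characters.

```python
from itertools import groupby

# Illustrative regrouping of the printed cells; not how Cassandra itself
# upgrades SSTables.


def split_name(cell_name, n_clustering):
    """Split '12:31:close' into (('12', '31'), 'close')."""
    parts = cell_name.split(":", n_clustering)
    return tuple(parts[:n_clustering]), parts[n_clustering]


def cells_to_rows(cells, clustering_names):
    """Group ordered legacy cells into rows with top-level clustering values."""
    n = len(clustering_names)
    rows = []
    # groupby only merges adjacent items, so this relies on cells being sorted.
    for clustering, group in groupby(cells, key=lambda c: split_name(c[0], n)[0]):
        rows.append({
            "clustering": dict(zip(clustering_names, clustering)),
            "cells": [[split_name(name, n)[1], value, ts] for name, value, ts in group],
        })
    return rows


legacy_cells = [
    ["12:31:close", "98.58", 1449463706349096],
    ["12:31:high", "98.91", 1449463706349096],
    ["12:30:close", "98.3", 1449463706349095],
    ["12:30:high", "99.0", 1449463706349095],
]
rows = cells_to_rows(legacy_cells, ["month", "day"])
for row in rows:
    print(row)
```

After regrouping, the clustering values appear once per row and each cell is identified by its column name alone, which is the shape sstabledump prints.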

Observations
• Since column names are kept with the partition header and not duplicated per cell, this saves a lot of space and closes the gap (almost) with COMPACT STORAGE:

Format                    Size on Disk    % delta default
Default                   972.30 MB       --
Compact Storage           429.65 MB       -55.82%
C* 3.0                    473.31 MB       -51.33%
C* 3.0 Compact Storage    383.36 MB       -60.58%

• Column Cells are now grouped by rows, nice!
• Clustering column values are top-level to the row and not individual cells.
• ‘LiveInfo’ was originally not populated with a timestamp even though all columns shared the same timestamp?
   • This could be a bug, and not duplicating timestamps would save a lot of space.
   • It has been fixed – the numbers should be much closer to compact storage now.
• Other
   • The API is much nicer; it was really easy to update sstabledump to work with it. It is a lot easier to scan to a partition in an SSTable.
   • I encountered a nasty bug, CASSANDRA-10822, where SSTables containing row tombstones ended up causing data to be omitted during upgradesstables. This has since been fixed.
   • Without sstabledump, I would not have noticed ‘LiveInfo’ not being populated and also would not have understood what was causing CASSANDRA-10822.
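The "% delta default" column is the relative size change against the default pre-3.0 format. A small Python sketch of that arithmetic, using the sizes from the table; because those sizes are rounded to two decimals, the computed values can differ from the slide in the last digit.

```python
# Sizes in MB, copied from the table above.
sizes_mb = {
    "Default": 972.30,
    "Compact Storage": 429.65,
    "C* 3.0": 473.31,
    "C* 3.0 Compact Storage": 383.36,
}


def pct_delta(size, baseline):
    """Relative change of size against baseline, in percent."""
    return (size - baseline) / baseline * 100


baseline = sizes_mb["Default"]
deltas = {name: pct_delta(size, baseline) for name, size in sizes_mb.items()}
for name, delta in deltas.items():
    print(f"{name:<24} {sizes_mb[name]:>8.2f} MB  {delta:+.2f}%")
```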

Summary
• COMPACT STORAGE has been used as an alternative to get around storage engine inefficiency, but this is not a good long term solution.
• C* 3.0 refactors the storage engine to map better to CQL and optimize storage.
• Tools are useful for debugging: sstable2json (1.x, 2.x) -> sstabledump (3.x+).
• Bugs were encountered, but have been fixed since then. You can safely upgrade from 2.x to 3.x.
• … questions?
• Thanks!

Notes – May 2016
• A lot has changed since this talk was given in Dec 2015.
• sstabledump has replaced sstable2json in C* 3.0. Read about it here.
• Implementation from this talk was much improved and put into C* with much assistance from Chris Lohfink and Yuki Morishita, thanks guys!
• There are a lot of good resources on the storage engine change, see end of blog post for links.
• The issues I mentioned in previous talks no longer exist.
   • Upgrade issue around row tombstones causing other cells to not be included in sstables has been fixed (CASSANDRA-10822).
   • The upgrade process now properly coalesces cells with the same timestamps to the clustering liveness info level, giving more space benefit.
• The slides have been updated to reflect all of this.
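The coalescing mentioned in these notes can be sketched as follows: when every cell of a row carries the same timestamp, that timestamp is stored once at the row level and dropped from the cells. This is a simplified illustration with a hypothetical `coalesce_timestamps` helper and a made-up output shape; Cassandra's actual liveness info handling has more cases (TTLs, deletions) than shown here.

```python
# Simplified illustration of hoisting a shared timestamp to the row level.
# The dictionary layout is an assumption made for this sketch.

def coalesce_timestamps(cells):
    """cells: list of (name, value, timestamp) tuples belonging to one row."""
    timestamps = {ts for _, _, ts in cells}
    if len(timestamps) == 1:
        # All cells agree: keep the timestamp once and omit it per cell.
        return {
            "liveness_timestamp": timestamps.pop(),
            "cells": [(name, value, None) for name, value, _ in cells],
        }
    # Mixed (or no) timestamps: leave the per-cell timestamps in place.
    return {"liveness_timestamp": None, "cells": list(cells)}


same = coalesce_timestamps([
    ("close", "98.58", 1449469247948011),
    ("high", "98.91", 1449469247948011),
])
mixed = coalesce_timestamps([
    ("close", "98.58", 1449469247948011),
    ("high", "98.91", 1449469247948020),
])
print(same)
print(mixed)
```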
