Cassandra Storage Engine in 3.0 Lightning Talk

Lightning Talk given at Cassandra MSP Meetup on 7 Dec 2015 (http://www.meetup.com/Minneapolis-St-Paul-Cassandra-Meetup/events/225345254/)

Andrew Tolbert

December 07, 2015

Transcript

  1. Introduction

    •  The Cassandra data storage model as it exists today does not map well to CQL.
    •  It is more oriented to the storage format used back in the Thrift days.
    •  This is a problem, as CQL is the primary interface!
    •  Users have observed a wide variance in disk usage between COMPACT STORAGE and the default format:
       –  30X: "We Didn’t Use COMPACT STORAGE, You Won’t Believe What Happened Next"
       –  30%: "Understanding the impact of compact storage"
    •  The storage engine has been refactored in 3.0 to rectify this and close the gap with compact storage.
    •  But does it? Let’s find out!

    © 2015 DataStax, All Rights Reserved.
  2. CQL Schema & Glossary

    create table if not exists financial.symbol_history (
        symbol text,
        year int,
        month int,
        day int,
        volume bigint,
        close double,
        open double,
        low double,
        high double,
        primary key ((symbol, year), month, day)
    ) with CLUSTERING ORDER BY (month desc, day desc);

    •  Partition – A division of rows by some shared identifier.
    •  Primary key – A column or collection of columns that uniquely defines a row.
    •  Row – A collection of columns sharing a primary key.
    •  Partition key – The first column or columns in a primary key, which define what partition the data belongs to.
    •  Clustering column – A primary key column following the partition key.
    •  Clustering – The collection of clustering column(s) that defines the ordering of data within a partition.
  3. SSTable

    •  SSTable (Sorted String Table) – An immutable file containing data for a table.
    •  When data is inserted/updated/deleted (‘mutated’), it is added in memory to a memtable.
    •  C* will eventually flush a memtable to disk into an SSTable.
    •  SSTables are merged and cleaned up through a process called ‘compaction’.
    •  Before C* 3.0, an SSTable is a series of Keys and their ‘Cells’.
       –  The Key is a combination of the partition key columns, e.g. symbol: IBM, year: 2004 maps to ‘IBM:2004’.
       –  A Cell represents a column value or a tombstone:
          •  A value includes an identifier for the data being changed, the data, the timestamp of the operation, and a TTL (if expiring).
          •  A tombstone includes a deletion identifier which indicates what is being deleted, the timestamp of the delete, and the time it should be deleted. A tombstone can be at a cell level or for a grouping of cells (like a row).
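The memtable, flush, and compaction steps on this slide can be pictured with a toy model. This is only a sketch: `ToyTable` and its methods are hypothetical names, tombstones and TTLs are left out, and real Cassandra does far more (commit log, indexes, compaction strategies). The one rule it illustrates is that the cell with the newest timestamp wins when versions are merged.

```python
# Toy model of the write path: memtable -> immutable SSTables -> compaction.
# All names are illustrative; this is not Cassandra's implementation.

class ToyTable:
    def __init__(self):
        self.memtable = {}   # (partition_key, cell_name) -> (value, timestamp)
        self.sstables = []   # each flushed SSTable is an immutable, sorted tuple

    def write(self, key, name, value, timestamp):
        # Mutations land in memory first; the newest timestamp wins.
        current = self.memtable.get((key, name))
        if current is None or timestamp >= current[1]:
            self.memtable[(key, name)] = (value, timestamp)

    def flush(self):
        # Freeze the memtable into a sorted, immutable SSTable.
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def compact(self):
        # Merge all SSTables, keeping only the newest version of each cell.
        merged = {}
        for sstable in self.sstables:
            for cell_id, (value, ts) in sstable:
                if cell_id not in merged or ts >= merged[cell_id][1]:
                    merged[cell_id] = (value, ts)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []


table = ToyTable()
table.write("IBM:2004", "12:31:close", "98.00", 1)
table.flush()                      # first SSTable holds the old value
table.write("IBM:2004", "12:31:close", "98.58", 2)
table.flush()                      # second SSTable holds the new value
table.compact()                    # one SSTable remains, with the newest cell
```

Until `compact()` runs, both versions of the cell sit on disk in separate SSTables, which is why a read has to reconcile them by timestamp.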
  4. sstable2json

    •  sstable2json is a useful tool for getting a human-readable view of an SSTable.

    {"key": "IBM:2004",
     "cells": [["12:31:close","98.58",1449463706349096],
               ["12:31:high","98.91",1449463706349096],
               ["12:31:low","98.49",1449463706349096],
               ["12:31:open","98.6",1449463706349096],
               ["12:31:volume","2801200",1449463706349096],
               ["12:30:close","98.3",1449463706349095],
               ["12:30:high","99.0",1449463706349095],
               ["12:30:low","98.07",1449463706349095],
               ["12:30:open","98.1",1449463706349095],
               ["12:30:volume","3812400",1449463706349095],
    ...

    [Diagram: partition IBM:2004 laid out as one flat sequence of cells (12:31:close = 98.58, 12:31:high = 98.91, 12:31:low = 98.49, 12:31:open = 98.6, 12:31:volume = 2801200, 12:30:close = 98.3, …), followed by partition YHOO:2005 (…).]
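The cell names in this dump are the clustering values joined with the column name. A minimal sketch of that flattening, assuming a hypothetical helper `legacy_cells` (not part of any Cassandra tool) and the `symbol_history` schema from slide 2:

```python
def legacy_cells(row, clustering_columns, regular_columns, timestamp):
    """Flatten one CQL row into pre-3.0 style cells: [name, value, timestamp]."""
    # Every cell name repeats the full clustering prefix, e.g. "12:31".
    prefix = ":".join(str(row[c]) for c in clustering_columns)
    cells = []
    for column in sorted(regular_columns):      # cells are ordered by name
        cells.append([f"{prefix}:{column}", str(row[column]), timestamp])
    return cells


row = {"month": 12, "day": 31, "close": 98.58, "high": 98.91,
       "low": 98.49, "open": 98.6, "volume": 2801200}
cells = legacy_cells(row, ["month", "day"],
                     ["volume", "close", "open", "low", "high"],
                     1449463706349096)
for cell in cells:
    print(cell)
```

Each of the five cells carries its own copy of the `12:31` prefix and of the timestamp, which is the overhead the next slide discusses.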
  5. Observations

    •  Each cell needed all clustering column values in addition to the column name.
    •  Even though the timestamp was the same for every cell in a row, it was duplicated per cell.
    •  The association of columns is loose (but fortunately the data was ordered).
    •  How can we overcome this overhead?

    {"key": "IBM:2004",
     "cells": [["12:31:close","98.58",1449463706349096],
               ["12:31:high","98.91",1449463706349096],
               ["12:31:low","98.49",1449463706349096],
               ["12:31:open","98.6",1449463706349096],
               ["12:31:volume","2801200",1449463706349096],
               ["12:30:close","98.3",1449463706349095],
               ["12:30:high","99.0",1449463706349095],
               ["12:30:low","98.07",1449463706349095],
               ["12:30:open","98.1",1449463706349095],
               ["12:30:volume","3812400",1449463706349095],
    ...
  6. Compact Storage

    •  Exists mostly for legacy purposes (the old Thrift storage format), but is still usable with CQL.
    •  Limitations:
       –  Can only have one column that isn’t part of the primary key.
       –  Can’t alter the table afterwards.

    create table if not exists financial.symbol_history (
        symbol text,
        year int,
        month int,
        day int,
        volume bigint,
        close double,
        open double,
        low double,
        high double,
        primary key ((symbol, year), month, day, close, open, low, high)
    ) with CLUSTERING ORDER BY (month desc, day desc) and COMPACT STORAGE;
  7. Compact Storage visualized

    {"key": "IBM:2004",
     "cells": [["12:31:98.58:98.6:98.49:98.91","2801200",1449464448388015],
               ["12:30:98.3:98.1:98.07:99.0","3812400",1449464448388014],
    ...

    [Diagram: partition IBM:2004 laid out as one cell per row (12:31:98.58:98.6:98.49:98.91 = 2801200, 12:30:98.3:98.1:98.07:99.0 = 3812400, …), followed by partition YHOO:2005 (…).]
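With every column except `volume` moved into the primary key, each row collapses to a single cell whose name is the joined clustering values. A small sketch of that layout, using a hypothetical `compact_cell` helper written only for illustration:

```python
def compact_cell(row, clustering_columns, value_column, timestamp):
    """Build the single [name, value, timestamp] cell a compact-storage row maps to."""
    # All clustering values are packed into the cell name; no column name is stored.
    name = ":".join(str(row[c]) for c in clustering_columns)
    return [name, str(row[value_column]), timestamp]


row = {"month": 12, "day": 31, "close": 98.58, "open": 98.6,
       "low": 98.49, "high": 98.91, "volume": 2801200}
cell = compact_cell(row, ["month", "day", "close", "open", "low", "high"],
                    "volume", 1449464448388015)
print(cell)
```

One name, one value, and one timestamp per row replace the five prefixed cells of the default layout, which is where the space saving comes from.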
  8. COMPACT STORAGE summarized

    •  On disk, less data is stored. This helps us in many ways:
       –  Reduces storage cost.
       –  Less data to read/write means less I/O, so the cluster can do more.
    •  But still, you are using a legacy format, which means:
       –  It’s clumsy with CQL.
       –  You are not using CQL as intended.
       –  It might not be around forever.

    Format             Size on Disk   % delta vs. default
    Default            972.30 MB      --
    Compact Storage    429.65 MB      -55.82%
  9. Storage Engine Rewrite in 3.0

    •  CASSANDRA-8099 – Overview
       –  Substantial refactor of the storage engine.
       –  No longer a collection of Keys and their Cells; instead, a collection of Partitions and their Rows.
       –  Tombstones at the Row level are no longer Cells; they now sit at the same level as Rows.
          •  You can now do range deletes! (DELETE FROM symbol_history WHERE symbol='IBM' AND year=2004 AND month >= 7 AND month <= 9)
          •  This should open up opportunities for more nice enhancements.
       –  Common elements are shared between Rows and their cells (column names, timestamps, TTLs, column metadata).
       –  Delta encoding for shared data.
       –  Static columns are grouped under a ‘Row’ at the beginning of the key, instead of being stored as cells.
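The range delete on this slide can be pictured as a single marker covering a span of clustering values rather than one tombstone per row. The following is a toy sketch under stated assumptions: the `(start, end, deleted_at)` tuple and both function names are hypothetical, bounds are inclusive as in the example query, and a row counts as shadowed when its write timestamp is not newer than the delete.

```python
def covers(tombstone, clustering):
    """True if the clustering values fall inside the tombstone's inclusive range."""
    start, end, _ = tombstone
    # Compare only as many leading clustering values as each bound specifies.
    return start <= clustering[:len(start)] and clustering[:len(end)] <= end


def visible_rows(rows, tombstone):
    """rows: {clustering_tuple: write_timestamp}. Drop rows shadowed by the tombstone."""
    _, _, deleted_at = tombstone
    return {c: ts for c, ts in rows.items()
            if not (covers(tombstone, c) and ts <= deleted_at)}


# Models: DELETE ... WHERE month >= 7 AND month <= 9, issued at timestamp 100.
tombstone = ((7,), (9,), 100)
rows = {(6, 30): 50, (7, 1): 50, (8, 15): 120, (9, 30): 50, (10, 1): 50}
print(visible_rows(rows, tombstone))
```

The row for `(8, 15)` survives because it was written after the delete, while the two older rows inside months 7 to 9 are hidden by the one marker.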
  10. sstabledump in 3.0

    •  Added in 3.0.4, via CASSANDRA-7464

    {
      "key": "IBM:2004",
      "rows": [
        {
          "clustering": {"month": "12", "day": "31"},
          "cells": {
            ["close","98.58",1449469247948011],
            ["high","98.91",1449469247948011],
            ["low","98.49",1449469247948011],
            ["open","98.6",1449469247948011],
            ["volume","2801200",1449469247948011]
          }
        },
        {
          "clustering": {"month": "12", "day": "30"},
          "cells": {
            ["close","98.3",1449469247948010],
            ["high","99.0",1449469247948010],
            ["low","98.07",1449469247948010],
            ["open","98.1",1449469247948010],
            ["volume","3812400",1449469247948010]
          }
        },

    Partition Header              Row                                      Row                                      Row
    IBM:2004 <column metadata>    Clustering (12, 31)                      Clustering (12, 30)                      …
                                  LiveInfo *                               LiveInfo *
                                  Cells: 98.58|98.91|98.49|98.6|2801200    Cells: 98.3|99.0|98.07|98.1|3812400
    YHOO:2005 …                   …                                        …                                        …
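To see how the row-oriented layout relates to the flat pre-3.0 cells, the regrouping can be sketched in a few lines. This is an illustration only: `group_into_rows` is a hypothetical helper, and the real 3.0 on-disk format is binary rather than Python dictionaries. Column names end up stored once, and clustering values once per row.

```python
def group_into_rows(cells, n_clustering):
    """Regroup flat pre-3.0 style cells into rows keyed by clustering values."""
    columns = []   # column names, kept once (like the partition header)
    rows = {}      # clustering tuple -> {column: (value, timestamp)}
    for name, value, timestamp in cells:
        parts = name.split(":")
        clustering, column = tuple(parts[:n_clustering]), parts[n_clustering]
        if column not in columns:
            columns.append(column)
        rows.setdefault(clustering, {})[column] = (value, timestamp)
    return columns, rows


flat = [["12:31:close", "98.58", 1449463706349096],
        ["12:31:high", "98.91", 1449463706349096],
        ["12:30:close", "98.3", 1449463706349095],
        ["12:30:high", "99.0", 1449463706349095]]
columns, rows = group_into_rows(flat, n_clustering=2)
print(columns)
print(rows)
```

Four cell names with repeated prefixes become two column names and two clustering tuples, which mirrors the Partition Header and Row boxes in the diagram.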
  11. Observations

    •  Since column names are kept with the partition header and not duplicated per cell, this saves a lot of space and (almost) closes the gap with COMPACT STORAGE:

       Format                     Size on Disk   % delta vs. default
       Default                    972.30 MB      --
       Compact Storage            429.65 MB      -55.82%
       C* 3.0                     473.31 MB      -51.33%
       C* 3.0 Compact Storage     383.36 MB      -60.58%

    •  Column cells are now grouped by rows, nice!
    •  Clustering column values are top-level to the row and not repeated in individual cells.
    •  ‘LiveInfo’ was originally not populated with a timestamp, even though all columns shared the same timestamp.
       –  This could be a bug; not duplicating timestamps would save a lot of space.
       –  It has been fixed, so the numbers should be much closer to compact storage now.
    •  Other
       –  The API is much nicer; it was really easy to update sstabledump to work with it. It is a lot easier to scan to a partition in an SSTable.
       –  I encountered a nasty bug, CASSANDRA-10822, where SSTables containing row tombstones ended up causing data to be omitted during upgradesstables. This has since been fixed.
       –  Without sstabledump, I would not have noticed ‘LiveInfo’ not being populated, and also would not have understood what was causing CASSANDRA-10822.
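The ‘LiveInfo’ observation amounts to hoisting a timestamp shared by every cell up to the row. A simplified sketch of that idea follows; `coalesce_timestamps` is a hypothetical name, and it only handles the case where all cells in a row share one timestamp, which is an assumption made to keep the example short.

```python
def coalesce_timestamps(cells):
    """cells: {column: (value, timestamp)} for one row.

    Returns (row_timestamp, cells) where a cell timestamp of None means
    "use the row's timestamp". If the cells disagree, nothing is hoisted.
    """
    timestamps = {ts for _, ts in cells.values()}
    if len(timestamps) != 1:
        return None, dict(cells)
    (shared,) = timestamps
    return shared, {col: (value, None) for col, (value, _) in cells.items()}


row_cells = {"close": ("98.58", 1449469247948011),
             "high": ("98.91", 1449469247948011),
             "low": ("98.49", 1449469247948011)}
row_ts, slim_cells = coalesce_timestamps(row_cells)
print(row_ts, slim_cells)
```

Here three copies of the same timestamp shrink to one at the row level, the kind of saving the fixed ‘LiveInfo’ handling is meant to deliver.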
  12. Summary

    •  COMPACT STORAGE has been used as an alternative to get around storage engine inefficiency, but this is not a good long-term solution.
    •  C* 3.0 refactors the storage engine to map better to CQL and optimize storage.
    •  Tools are useful for debugging: sstable2json (1.x, 2.x) -> sstabledump (3.x+).
    •  Bugs were encountered, but have been fixed since then. You can safely upgrade from 2.x to 3.x.
    •  … questions?
    •  Thanks!
  13. Notes – May 2016

    •  A lot has changed since this talk was given in Dec 2015.
    •  sstabledump has replaced sstable2json in C* 3.0. Read about it here.
       –  The implementation from this talk was much improved and put into C* with much assistance from Chris Lohfink and Yuki Morishita, thanks guys!
       –  There are a lot of good resources on the storage engine change; see the end of the blog post for links.
    •  The issues I mentioned in previous talks no longer exist.
       –  The upgrade issue around a row tombstone causing other cells to not be included in SSTables has been fixed (CASSANDRA-10822).
       –  The upgrade process now properly coalesces cells with the same timestamps to the clustering liveness info level, giving more space benefit.
    •  The slides have been updated to reflect all of this.