Cassandra for Data Analytics Backends

@ifesdjeen

Cassandra Monitoring

Precision

is not the same as

Semantics

is not the same as

Anomaly detection

Do you see the elephant being swallowed by the snake?

Agenda

Ad-hoc queries

Fast Aggregations

Machine Learning

Step 1: parallel queries

+---------------+---------------+
| timestamp     | sequenceId    |
+---------------+---------------+

Sequence ID

Used to avoid timestamp resolution collisions
To ensure sub-resolution order
Snapshot the data on overflow or timeout
Ensures idempotence
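The role of the sequence ID can be illustrated with a small Haskell sketch. The names `EventKey` and `assignKeys`, the field widths, and the rule that the counter restarts on every new timestamp are assumptions made for illustration, not something the slides specify.

```haskell
import Data.Word (Word32, Word64)

-- Hypothetical composite key: a timestamp plus a counter that separates
-- events landing on the same timestamp tick.
data EventKey = EventKey
  { timestamp  :: Word64  -- resolution is an assumption (e.g. milliseconds)
  , sequenceId :: Word32  -- order within a single timestamp value
  } deriving (Eq, Ord, Show)

-- Assign keys to (timestamp, payload) pairs that arrive in order.
-- The counter restarts at 0 whenever the timestamp changes.
assignKeys :: [(Word64, a)] -> [(EventKey, a)]
assignKeys = go Nothing
  where
    go _ [] = []
    go prev ((ts, x) : rest) =
      let sq = case prev of
                 Just (EventKey ts' s) | ts' == ts -> s + 1
                 _                                 -> 0
          key = EventKey ts sq
      in (key, x) : go (Just key) rest
```

The derived `Ord` instance compares `timestamp` first and `sequenceId` second, so two events with the same timestamp still sort in arrival order. Writing the same key twice addresses the same row, which is one plausible reading of the idempotence point above.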

Fighting Dispersion

Range Tables

[Figure: timestamps ts1 to ts13 laid out across range tables]

Full Table Scan

[Figure: Start and End markers over ts1 to ts13; ts11 ts12 ts13 shown separately]

Open Range

[Figure: Start and End markers over ts1 to ts13; ts11 ts12 ts13 shown separately]

“Between” Range

[Figure: Start and End markers over ts1 to ts13; ts11 ts12 ts13 shown separately]
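One way to read the range-table slides is that each table holds a fixed-width window of timestamps, so a query only needs to touch the tables its range overlaps. The sketch below assumes exactly that; the `Query` type, the `width` parameter and the interpretation of an open range as "from a start onward" are hypothetical and not taken from the talk.

```haskell
-- Hypothetical query shapes matching the three slide titles.
data Query
  = FullScan                 -- read every table
  | From Integer             -- open range: everything at or after the start
  | Between Integer Integer  -- closed range: start and end both given
  deriving Show

-- Index of the range table that holds a timestamp, assuming fixed-width tables.
tableIndex :: Integer -> Integer -> Integer
tableIndex width ts = ts `div` width

-- Select which of the existing table indices a query has to read.
tablesFor :: Integer -> [Integer] -> Query -> [Integer]
tablesFor _ existing FullScan = existing
tablesFor width existing (From s) =
  filter (>= tableIndex width s) existing
tablesFor width existing (Between s e) =
  filter (\t -> t >= tableIndex width s && t <= tableIndex width e) existing
```

Because the selected tables are independent of each other, they can be read in parallel and the partial results merged afterwards.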

Step 2: add some algebra (rich query API)

Stream Fusion for rich ad-hoc queries

What even is Stream Fusion?

map filter reduce

single step mapFilterReduce

data Step a cursor = Yield a !cursor
                   | Skip !cursor
                   | Done

data Stream a = ∃cursor. Stream (cursor → Step a cursor) cursor

Stream Beginning: reading from the DB

map

maps :: (a → b) → Stream a → Stream b

Yield x cursor → Yield (f x) cursor
Skip cursor    → Skip cursor
Done           → Done

filter

filters :: (a → Bool) → Stream a → Stream a

Yield x cursor | p x       → Yield x cursor
               | otherwise → Skip cursor
Skip cursor                → Skip cursor
Done                       → Done

reduce/fold

foldls :: (Monoid acc) => (acc → a → acc) → acc → Stream a → acc

Yield x cursor → loop (f acc x) cursor
Skip cursor    → loop acc cursor
Done           → acc
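The map, filter and fold fragments above can be put together into one compilable sketch. It follows the standard stream-fusion formulation; `fromList` and `sumOfEvenSquares` are illustrative helpers that are not on the slides (in the talk the stream starts from a database read), and the `Monoid` constraint is dropped from `foldls` because a plain left fold does not need it.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE BangPatterns #-}

-- One step of a stream: produce a value, skip, or finish.
data Step a s = Yield a !s | Skip !s | Done

-- A stream is a step function plus its current state (the "cursor").
data Stream a = forall s. Stream (s -> Step a s) s

-- Illustrative source: a list stands in for a database cursor.
fromList :: [a] -> Stream a
fromList = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

maps :: (a -> b) -> Stream a -> Stream b
maps f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Yield x s' -> Yield (f x) s'
      Skip s'    -> Skip s'
      Done       -> Done

filters :: (a -> Bool) -> Stream a -> Stream a
filters p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Yield x s'
        | p x       -> Yield x s'
        | otherwise -> Skip s'
      Skip s' -> Skip s'
      Done    -> Done

-- The only recursive function: map and filter merely wrap the step function.
foldls :: (acc -> a -> acc) -> acc -> Stream a -> acc
foldls f z (Stream next s0) = loop z s0
  where
    loop !acc s = case next s of
      Yield x s' -> loop (f acc x) s'
      Skip s'    -> loop acc s'
      Done       -> acc

sumOfEvenSquares :: [Int] -> Int
sumOfEvenSquares = foldls (+) 0 . filters even . maps (\x -> x * x) . fromList
```

Since `maps` and `filters` never loop themselves, the whole pipeline is driven by the single loop in `foldls`, which is the "single step mapFilterReduce" idea: no intermediate collection is built between the stages.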

Append

class Monoid a where
  mempty  :: a            -- ^ Identity of 'mappend'
  mappend :: a -> a -> a  -- ^ An associative operation

Combine

class (Monoid intermediate) => Aggregate intermediate end where
  combine :: intermediate -> end

Count Example

data Count = Count Int

instance Monoid Count where
  mempty = Count 0
  mappend (Count a) (Count b) = Count $ a + b

instance Aggregate Count Int where
  combine (Count a) = a
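On current GHC versions (base 4.11 and later) `Monoid` requires a `Semigroup` instance, and a two-parameter class needs `MultiParamTypeClasses`, so the Count example needs small adjustments to compile. The sketch below makes them; `Mean`, `countAll` and `meanOf` are hypothetical additions that show a second aggregate whose intermediate and final types differ.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

-- An aggregate has a mergeable intermediate state and a final result.
class Monoid intermediate => Aggregate intermediate end where
  combine :: intermediate -> end

newtype Count = Count Int

instance Semigroup Count where
  Count a <> Count b = Count (a + b)

instance Monoid Count where
  mempty = Count 0

instance Aggregate Count Int where
  combine (Count a) = a

-- Hypothetical second aggregate: running sum and count, finished as a mean.
data Mean = Mean Double Int

instance Semigroup Mean where
  Mean s n <> Mean s' n' = Mean (s + s') (n + n')

instance Monoid Mean where
  mempty = Mean 0 0

instance Aggregate Mean Double where
  combine (Mean s n) = if n == 0 then 0 else s / fromIntegral n

countAll :: [a] -> Int
countAll = combine . foldMap (const (Count 1))

meanOf :: [Double] -> Double
meanOf = combine . foldMap (\x -> Mean x 1)
```

Because the intermediate values form a monoid, partial results computed over separate ranges can be merged with `<>` before `combine` is applied once at the end.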

Step 3: add some ML

Storing Models

Support Vector Machines

Hyperplane α·x - φ = 1

[ α1 α2 α3 ... αn ] φ

Option 1: list<double>

CREATE TABLE support_vectors(
  path varchar,
  alpha list<double>,
  phi int,
  PRIMARY KEY(path))

Problems

High deserialisation overhead
Need to add PK specifiers for multiple SVs

Alternative: blob & byte buffers

Vector Representation

0    8    16   24   32   40        n*8
+----+----+----+----+----+---------+----+
| α0 | α1 | α2 | α3 | α4 |   ...   | αn |
+----+----+----+----+----+---------+----+
byte address points

Matrix Representation

         0     8     16    24    32    40        n*8
         +-----+-----+-----+-----+-----+---------+-----+
         | α00 | α01 | α02 | α03 | α04 |   ...   | α0n |
         +-----+-----+-----+-----+-----+---------+-----+

n*8+     0     8     16    24    32    40        n*8
         +-----+-----+-----+-----+-----+---------+-----+
         | α10 | α11 | α12 | α13 | α14 |   ...   | α1n |
         +-----+-----+-----+-----+-----+---------+-----+

m*n*8+   0     8     16    24    32    40        n*8
         +-----+-----+-----+-----+-----+---------+-----+
         | αm0 | αm1 | αm2 | αm3 | αm4 |   ...   | αmn |
         +-----+-----+-----+-----+-----+---------+-----+

Advantages

No serialisation overhead
Fast relative access
Easy to go multi-dimensional
Easy to implement atomic in-memory operations
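The offset arithmetic behind the vector and matrix layouts can be sketched in a few lines. This is a minimal illustration under stated assumptions: a plain `[Word8]` stands in for the blob or byte buffer, values are stored big-endian, and `n` is taken to be the number of columns per row; none of these details are confirmed by the slides.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word64)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- The 8 bytes of one IEEE-754 double, most significant byte first.
doubleBytes :: Double -> [Word8]
doubleBytes d = [ fromIntegral (w `shiftR` s) | s <- [56, 48 .. 0] ]
  where w = castDoubleToWord64 d

-- Flatten a row-major matrix into one byte sequence (the blob contents).
encodeMatrix :: [[Double]] -> [Word8]
encodeMatrix = concatMap (concatMap doubleBytes)

-- Byte offset of element (row, col) in a matrix with n columns per row.
offsetOf :: Int -> Int -> Int -> Int
offsetOf n row col = (row * n + col) * 8

-- Read the double stored at a byte offset, without decoding anything else.
readAt :: Int -> [Word8] -> Double
readAt off bytes = castWord64ToDouble (foldl step 0 (take 8 (drop off bytes)))
  where step acc b = (acc `shiftL` 8) .|. fromIntegral b
```

For example, `readAt (offsetOf 2 1 0) (encodeMatrix [[1.5, 2.5], [3.5, 4.5]])` reads the first element of the second row directly by its offset. A real backend would use an array-backed byte buffer so that the `drop` becomes constant-time indexing.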

Bayesian Classifiers

P(X | blue) = (Number of Blue near X) / (Total number of Blue)

P(X | red) = (Number of Red near X) / (Total number of Red)

[[Mean(x1), Var(x1)] [Mean(x2), Var(x2)] ... [Mean(xn), Var(xn)]]
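A per-feature list of mean and variance pairs is what a Gaussian naive Bayes classifier stores for each class. The sketch below assumes that is the intended model; the type `ClassModel`, the function names and the use of log-probabilities are illustrative choices rather than the talk's implementation.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Per-feature summary for one class: (mean, variance).
type ClassModel = [(Double, Double)]

-- Log-density of a normal distribution with the given mean and variance.
gaussianLogPdf :: Double -> Double -> Double -> Double
gaussianLogPdf mean var x =
  negate (log (2 * pi * var)) / 2 - (x - mean) ^ (2 :: Int) / (2 * var)

-- Features are treated as independent, so their log-likelihoods add up.
logLikelihood :: ClassModel -> [Double] -> Double
logLikelihood model xs =
  sum (zipWith (\(m, v) x -> gaussianLogPdf m v x) model xs)

-- Pick the class with the highest log prior plus log-likelihood.
-- The class list must be non-empty.
classify :: [(String, Double, ClassModel)] -> [Double] -> String
classify classes xs = fst (maximumBy (comparing snd) scored)
  where
    scored = [ (label, log prior + logLikelihood model xs)
             | (label, prior, model) <- classes ]
```

With two classes centred at 0 and 5 on every feature, a point near the origin is assigned to the first class and a point near 5 to the second. The model for each class is just a flat sequence of doubles, so it fits the same byte-buffer layout used for the support vectors.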

Step 4: make it rocket-fast

Approximate Data Structures

Bloom Filters are basically long arrays / vectors

BitSet

0                               8
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
8                               16
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
16                              24
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
24                              32
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
bit address

Advantages

64 bits per 8-byte Long
Easy to represent by the long-array using offsets, bit shifts and masks
Easy to implement atomic in-memory operations
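The offset, shift and mask arithmetic mentioned above can be sketched as follows. A list of `Word64` stands in for the long array, and the names `BitSet`, `locate`, `setBitAt` and `testBitAt` are hypothetical; a production version would use a mutable array to get the atomic in-memory operations.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- A bit set stored as 64-bit words; a list stands in for a long array.
type BitSet = [Word64]

-- Allocate enough zeroed words to hold the requested number of bits.
emptyBitSet :: Int -> BitSet
emptyBitSet nBits = replicate ((nBits + 63) `div` 64) 0

-- Word index (bit index / 64) and single-bit mask (bit index mod 64).
locate :: Int -> (Int, Word64)
locate i = (i `shiftR` 6, 1 `shiftL` (i .&. 63))

-- Set one bit by OR-ing the mask into the word that contains it.
setBitAt :: Int -> BitSet -> BitSet
setBitAt i ws =
  [ if j == wordIx then w .|. mask else w | (j, w) <- zip [0 ..] ws ]
  where (wordIx, mask) = locate i

-- Test one bit; indices past the end of the set read as unset.
testBitAt :: Int -> BitSet -> Bool
testBitAt i ws = case drop wordIx ws of
    (w : _) -> w .&. mask /= 0
    []      -> False
  where (wordIx, mask) = locate i
```

A Bloom filter built on this would call `setBitAt` once per hash function on insert and `testBitAt` once per hash function on lookup.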

Count-min sketches are basically int matrices

Histograms are basically long vectors

Conclusions

Ad-hoc queries
Parallelism
Lightweight DSs representation
Optimisations and good API fits

@ifesdjeen
