Cassandra for Data Analytics Backends
αλεx π
September 24, 2015
Transcript
@ifesdjeen
Cassandra Monitoring
Precision
is not the same as
Semantics
is not the same as
Anomaly detection
Do you see the elephant being swallowed by the snake?
Agenda
Ad-hoc queries
Aggregations Fast
Machine Learning
Step 1: parallel queries
+---------------+---------------+
| timestamp     | sequenceId    |
+---------------+---------------+
Sequence ID
Used to avoid timestamp resolution collisions
To ensure sub-resolution order
Snapshot the data on overflow or timeout
Ensures idempotence
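The composite key on this slide can be sketched roughly as follows. This is only an illustration: the names `EventKey` and `assignKeys` are hypothetical, timestamps are assumed to be integer milliseconds, and the snapshot-on-overflow behaviour is not modelled.

```haskell
import Data.List (sort)

-- Hypothetical composite key: events are ordered by timestamp first and,
-- within one timestamp tick, by a per-tick sequence ID.
data EventKey = EventKey
  { timestamp  :: Int  -- assumed: milliseconds since the epoch
  , sequenceId :: Int  -- position among events sharing the same timestamp
  } deriving (Eq, Ord, Show)

-- Assign keys to timestamps given in arrival order: the sequence ID restarts
-- at 0 whenever the timestamp changes and increments on a collision.
assignKeys :: [Int] -> [EventKey]
assignKeys = go Nothing
  where
    go _ [] = []
    go prev (t : ts) =
      let sq = case prev of
                 Just (EventKey t' s) | t' == t -> s + 1
                 _                              -> 0
          key = EventKey t sq
      in key : go (Just key) ts

main :: IO ()
main = do
  let keys = assignKeys [100, 100, 100, 101]
  mapM_ print keys
  -- The derived Ord compares timestamp first, then sequenceId,
  -- so sorting restores the sub-resolution arrival order.
  print (sort (reverse keys) == keys)
```

Three events that collide on timestamp 100 receive sequence IDs 0, 1 and 2, so their relative order survives even though the clock resolution cannot distinguish them.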
Fighting Dispersion
Range Tables
[Figure: a row of timestamps ts1 ... ts13]
Full Table Scan
[Figure: timestamps ts1 ... ts13 with Start and End markers]
Open Range
[Figure: timestamps ts1 ... ts13 with Start and End markers]
“Between” Range
[Figure: timestamps ts1 ... ts13 with Start and End markers]
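The three kinds of scan named on these slides might be modelled as below. This is a sketch under assumptions: rows are held in a list already sorted by timestamp, "open range" is taken to mean a range with only a start bound, the "between" bounds are treated as inclusive, and `Row`, `Range` and `scan` are hypothetical names rather than anything from the talk.

```haskell
-- Hypothetical row type: a timestamp paired with a payload.
type Row = (Int, String)

-- Bounds of a scan over rows kept sorted by timestamp.
data Range
  = FullScan         -- every row
  | OpenFrom Int     -- open range: from a start timestamp onwards (assumed)
  | Between Int Int  -- "between" range: start and end, both inclusive (assumed)

scan :: Range -> [Row] -> [Row]
scan FullScan      rows = rows
scan (OpenFrom s)  rows = dropWhile ((< s) . fst) rows
scan (Between s e) rows = takeWhile ((<= e) . fst) (dropWhile ((< s) . fst) rows)

main :: IO ()
main = do
  let rows = [(t, "ts" ++ show t) | t <- [1 .. 13]]
  print (map snd (scan FullScan rows))
  print (map snd (scan (OpenFrom 9) rows))
  print (map snd (scan (Between 4 7) rows))
```

Because the rows are ordered, each bounded scan only has to skip a prefix and stop at the first row past the end bound.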
Step 2: add some algebra (rich query API)
Stream Fusion for rich ad-hoc queries
What even is Stream Fusion?
map filter reduce
single step mapFilterReduce
    -- requires the ExistentialQuantification extension
    data Step a cursor = Yield a !cursor | Skip !cursor | Done
    data Stream a = forall cursor. Stream (cursor -> Step a cursor) cursor
Stream Beginning: reading from the DB
map
    maps :: (a -> b) -> Stream a -> Stream b
    Yield x cursor -> Yield (f x) cursor
    Skip cursor    -> Skip cursor
    Done           -> Done
filter
    filters :: (a -> Bool) -> Stream a -> Stream a
    Yield x cursor | p x       -> Yield x cursor
                   | otherwise -> Skip cursor
    Skip cursor                -> Skip cursor
    Done                       -> Done
reduce/fold
    foldls :: (Monoid acc) => (acc -> a -> acc) -> acc -> Stream a -> acc
    Yield x cursor -> loop (f acc x) cursor
    Skip cursor    -> loop acc cursor
    Done           -> acc
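Put together, the map, filter and fold slides can be turned into a small runnable sketch. A few things here are assumptions rather than the talk's code: `fromList` stands in for "reading from the DB", the type variable is renamed from `data` (a reserved word) to `a`, and the `Monoid` constraint on `foldls` is dropped because the initial accumulator is passed explicitly.

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: emit a value, skip, or stop.
data Step a cursor = Yield a !cursor | Skip !cursor | Done

-- A stream hides its cursor type behind an existential.
data Stream a = forall cursor. Stream (cursor -> Step a cursor) cursor

-- Stand-in for "reading from the DB": stream the elements of a list.
fromList :: [a] -> Stream a
fromList = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

maps :: (a -> b) -> Stream a -> Stream b
maps f (Stream next start) = Stream step start
  where
    step cursor = case next cursor of
      Yield x cursor' -> Yield (f x) cursor'
      Skip cursor'    -> Skip cursor'
      Done            -> Done

filters :: (a -> Bool) -> Stream a -> Stream a
filters p (Stream next start) = Stream step start
  where
    step cursor = case next cursor of
      Yield x cursor'
        | p x       -> Yield x cursor'
        | otherwise -> Skip cursor'
      Skip cursor'  -> Skip cursor'
      Done          -> Done

-- The only loop in the pipeline: it drives the composed step function.
foldls :: (acc -> a -> acc) -> acc -> Stream a -> acc
foldls f z (Stream next start) = loop z start
  where
    loop !acc cursor = case next cursor of
      Yield x cursor' -> loop (f acc x) cursor'
      Skip cursor'    -> loop acc cursor'
      Done            -> acc

main :: IO ()
main = print (foldls (+) 0 (maps (* 2) (filters even (fromList [1 .. 10 :: Int]))))
```

Neither `maps` nor `filters` builds an intermediate collection; each only wraps the step function, so the whole query runs as one pass inside `foldls`. Here the even numbers from 1 to 10 are doubled and summed, which prints 60.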
Append
    class Monoid a where
      mempty  :: a            -- ^ Identity of 'mappend'
      mappend :: a -> a -> a  -- ^ An associative operation
Combine
    class (Monoid intermediate) => Aggregate intermediate end where
      combine :: intermediate -> end
Count Example
    data Count = Count Int
    instance Monoid Count where
      mempty = Count 0
      mappend (Count a) (Count b) = Count $ a + b
    instance Aggregate Count Int where
      combine (Count a) = a
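On a current GHC the Count example needs a `Semigroup` instance as well, so a runnable variant might look like this. The helper `countAll` is hypothetical and only illustrates why a monoid suits parallel queries: partial results from independent partitions merge with `mconcat`, and `combine` produces the final answer.

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}

-- Turns an intermediate, mergeable state into the final answer.
class Monoid intermediate => Aggregate intermediate end where
  combine :: intermediate -> end

newtype Count = Count Int

-- Since GHC 8.4 a Monoid instance requires a Semigroup instance;
-- (<>) plays the role of mappend.
instance Semigroup Count where
  Count a <> Count b = Count (a + b)

instance Monoid Count where
  mempty = Count 0

instance Aggregate Count Int where
  combine (Count a) = a

-- Hypothetical helper: count each partition independently (for example one
-- per parallel query), then merge the partial counts and finalise.
countAll :: [[a]] -> Int
countAll partitions = combine (mconcat [Count (length p) | p <- partitions])

main :: IO ()
main = print (countAll ["abc", "de", ""])
```

Associativity of `<>` is what allows the partial counts to be merged in any grouping, and `mempty` covers partitions that return nothing.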
Step 3: add some ML
Storing Models
Support Vector Machines
Hyperplane α·x - φ = 1
[ α1 α2 α3 ... αn ] ρ
Option 1: list<double>
    CREATE TABLE support_vectors(
      path varchar,
      alpha list<double>,
      phi int,
      PRIMARY KEY(path))
Problems
High deserialisation overhead
Need to add PK specifiers for multiple SVs
Alternative: blob & byte buffers
Vector Representation
byte address
0    8    16   24   32   40         n*8
+----+----+----+----+----+-----+----+
| α0 | α1 | α2 | α3 | α4 | ... | αn |
+----+----+----+----+----+-----+----+
points
Matrix Representation
Row 0
0     8     16    24    32    40          n*8
+-----+-----+-----+-----+-----+-----+-----+
| α00 | α01 | α02 | α03 | α04 | ... | α0n |
+-----+-----+-----+-----+-----+-----+-----+
Row 1 (byte addresses offset by n*8)
0     8     16    24    32    40          n*8
+-----+-----+-----+-----+-----+-----+-----+
| α10 | α11 | α12 | α13 | α14 | ... | α1n |
+-----+-----+-----+-----+-----+-----+-----+
Row m (byte addresses offset by m*n*8)
0     8     16    24    32    40          n*8
+-----+-----+-----+-----+-----+-----+-----+
| αm0 | αm1 | αm2 | αm3 | αm4 | ... | αmn |
+-----+-----+-----+-----+-----+-----+-----+
Advantages
No serialisation overhead
Fast relative access
Easy to go multi-dimensional
Easy to implement atomic in-memory operations
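The offset arithmetic behind the vector and matrix layouts can be sketched with a raw buffer from `base`. This assumes a row-major layout with a fixed number of 8-byte doubles per row; `offsetOf`, `writeMatrix` and `readEntry` are hypothetical names, and a plain memory buffer stands in for the blob column.

```haskell
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- Each entry is an 8-byte double.
entrySize :: Int
entrySize = 8

-- Byte offset of entry (row, col) in a row-major matrix with `cols` columns.
-- A vector is the special case row = 0.
offsetOf :: Int -> Int -> Int -> Int
offsetOf cols row col = (row * cols + col) * entrySize

-- Write a row-major matrix into a raw buffer.
writeMatrix :: Ptr () -> Int -> [[Double]] -> IO ()
writeMatrix buf cols rows =
  sequence_
    [ pokeByteOff buf (offsetOf cols r c) x
    | (r, row) <- zip [0 ..] rows
    , (c, x)   <- zip [0 ..] row
    ]

-- Read one entry straight from its offset, without decoding anything else.
readEntry :: Ptr () -> Int -> Int -> Int -> IO Double
readEntry buf cols row col = peekByteOff buf (offsetOf cols row col)

main :: IO ()
main = do
  let rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
      cols = 3
  allocaBytes (length rows * cols * entrySize) $ \buf -> do
    writeMatrix buf cols rows
    x <- readEntry buf cols 1 2
    print x
```

Reading entry (1, 2) touches only the 8 bytes at offset 40, which is the "fast relative access" the slide refers to.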
Bayesian Classifiers
P(X | blue) = (number of blue near X) / (total number of blue)
P(X | red) = (number of red near X) / (total number of red)
[[Mean(x1), Var(x1)] [Mean(x2), Var(x2)] ... [Mean(xn), Var(xn)]]
byte address
0          8          16
+----------+----------+
| Mean(x0) | Var(x0)  |
+----------+----------+
16         24         32
+----------+----------+
| Mean(x1) | Var(x1)  |
+----------+----------+
2n*8       (2n+1)*8
+----------+----------+
| Mean(xn) | Var(xn)  |
+----------+----------+
payloads
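One way to use the stored (mean, variance) pairs is a Gaussian naive Bayes classifier. The slides do not say which likelihood is used, so the Gaussian density here is an assumption, as are equal class priors and the names `meanOffset`, `varOffset`, `gaussian` and `likelihood`.

```haskell
-- Byte offsets of the (mean, variance) pair for feature i, 8 bytes per double,
-- following the layout in the diagram.
meanOffset, varOffset :: Int -> Int
meanOffset i = 2 * i * 8
varOffset  i = (2 * i + 1) * 8

-- Assumption: each feature is modelled with a Gaussian density.
gaussian :: Double -> Double -> Double -> Double
gaussian mean var x =
  exp (negate ((x - mean) ^ (2 :: Int)) / (2 * var)) / sqrt (2 * pi * var)

-- Likelihood of an observation under one class, given that class's
-- per-feature (mean, variance) pairs and assuming independent features.
likelihood :: [(Double, Double)] -> [Double] -> Double
likelihood params xs = product [gaussian m v x | ((m, v), x) <- zip params xs]

main :: IO ()
main = do
  let blue = [(0.0, 1.0), (5.0, 2.0)]  -- made-up parameters for illustration
      red  = [(3.0, 1.0), (1.0, 2.0)]
      obs  = [0.2, 4.5]
  putStrLn (if likelihood blue obs > likelihood red obs then "blue" else "red")
```

With equal priors, picking the class with the larger likelihood is the whole classification step, and each parameter can be fetched directly from its byte offset.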
Step 4: make it rocket-fast
Approximate Data Structures
Bloom Filters are basically long arrays / vectors
BitSet
bit address
0                               8
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
8                               16
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
16                              24
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
24                              32
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
Advantages
64 bits per 8-byte Long
Easy to represent as a long array using offsets, bit shifts and masks
Easy to implement atomic in-memory operations
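The "offsets, bit shifts and masks" idea can be sketched with `Data.Bits`. This is a simplified illustration: an immutable list of `Word64` stands in for the long array, the names `BitSet`, `setBitAt` and `testBitAt` are hypothetical, and the atomic in-memory operations from the slide are not shown.

```haskell
import Data.Bits (setBit, shiftR, testBit, (.&.))
import Data.Word (Word64)

-- A bit set stored as a vector of 64-bit words.
type BitSet = [Word64]

-- Allocate enough zeroed words to hold nBits bits.
emptyBits :: Int -> BitSet
emptyBits nBits = replicate ((nBits + 63) `div` 64) 0

-- Bit i lives in word (i / 64) at position (i mod 64);
-- the shift and the mask compute exactly that.
wordIndex, bitIndex :: Int -> Int
wordIndex i = i `shiftR` 6
bitIndex  i = i .&. 63

setBitAt :: Int -> BitSet -> BitSet
setBitAt i ws =
  [ if k == wordIndex i then w `setBit` bitIndex i else w
  | (k, w) <- zip [0 ..] ws
  ]

testBitAt :: Int -> BitSet -> Bool
testBitAt i ws = testBit (ws !! wordIndex i) (bitIndex i)

main :: IO ()
main = do
  let bits = setBitAt 70 (setBitAt 3 (emptyBits 128))
  print (map (`testBitAt` bits) [3, 4, 70])
  print bits
```

A Bloom filter would call `setBitAt` once per hash function on insert and `testBitAt` once per hash function on lookup; a production version would use a mutable unboxed array instead of a list.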
Count-min sketches are basically int matrices
Histograms are basically long vectors
Conclusions
Ad-hoc queries
Parallelism
Lightweight DSs representation
Optimisations and good API fits
@ifesdjeen