What Reading 5 Papers can yield for your Business

The answer to the question of life, the Universe and everything

But what’s even the question?

Somehow:

numbers revolve around percentiles
clustering revolves around words & spam filters
regressions are about predicting page views
combinations/permutations are not even mentioned

What reading 5 papers can yield for business

Minimal toolchain to yield info

Get fields (+ groups/permutations)
Counts for categorical values
Distributions for numerical values

Getting first impression of data

How many?

Did you see that one?

What’s the distribution?

How different are these?

Does it belong here?

Getting info

`someone` said “Measure everything”
Let’s just say that you can:
efficiently measure locally, merge on stat machines
percentiles are cool and everything, but mostly for latency

Same approaches are used in:
databases
HFT
high-scale data analysis

So should work for pretty much anyone.

OLAP Principles

Consolidation

Time: Tumbling

2 - 1 = 1, less than 5: buffer

3 - 1 = 2, less than 5: buffer

OK, time to emit!

Emit the buffered values, start a new buffer, and so on.
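The buffering steps above can be sketched roughly as follows. This is a minimal illustration, not the deck's implementation: the function name `tumbling_windows`, the `(timestamp, value)` event shape and the window size of 5 (read off the "less than 5" checks) are all assumptions.

```python
def tumbling_windows(events, size=5):
    """Yield lists of (timestamp, value) events, one list per window.

    Assumption: a window opens at the timestamp of its first event and
    closes when an event arrives `size` or more time units after that.
    """
    buffer = []
    for timestamp, value in events:
        if buffer and timestamp - buffer[0][0] >= size:
            yield buffer          # time to emit
            buffer = []           # new buffer
        buffer.append((timestamp, value))
    if buffer:
        yield buffer              # flush the last, possibly partial, window


events = [(1, "a"), (2, "b"), (3, "c"), (6, "d"), (7, "e"), (12, "f")]
windows = list(tumbling_windows(events))
```

Each event lands in exactly one window, which is what separates tumbling from sliding.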

Time: Sliding

2 - 1 = 1, less than 5: buffer

3 - 1 = 2, less than 5: buffer

OK, time to emit!

Emit the values in the window, and so on.
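One possible reading of the sliding variant, sketched under assumptions: the window always covers the last 5 time units, old events are evicted as new ones arrive, and a snapshot is emitted after every event. The name `sliding_windows` and the emit-per-event policy are illustrative choices, not taken from the slides.

```python
from collections import deque


def sliding_windows(events, size=5):
    """After every event, yield the events seen in the last `size` time units."""
    window = deque()
    for timestamp, value in events:
        window.append((timestamp, value))
        # Evict events that are `size` or more time units older than the newest.
        while timestamp - window[0][0] >= size:
            window.popleft()
        yield list(window)


events = [(1, "a"), (2, "b"), (3, "c"), (6, "d"), (7, "e")]
snapshots = list(sliding_windows(events))
```

Unlike the tumbling case, the same event can appear in several consecutive snapshots.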

Grouping

Joining

(R ⋈ S): a join keeps the matching pairs out of the Cartesian product R × S

Observations (if group / reduce won’t do)

Count-min Sketch (how many?)

Millions of IPs: how many requests per IP?
Millions of object allocations: how many objects of type N were allocated?
Efficient for a large number of “keys”
Estimates each count as the minimum over its counters, so it can overcount but never undercount

Hashing
Non-cryptographic hash function (like Murmur3)
We need N independent hash functions, so we seed one function with N random seeds

Update
Given: K (rows), M (width), table (K x M), seeds (K)

for i in (0..K-1)
  h = hash(item, seeds[i])
  position = h % M
  table[i][position] += 1
end

Lookup
Given: K (rows), M (width), table (K x M), seeds (K)

estimate = infinity
for i in (0..K-1)
  h = hash(item, seeds[i])
  position = h % M
  estimate = min(estimate, table[i][position])
end
return estimate
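The update and lookup steps can be put together in a small runnable sketch. Two assumptions to flag: the slides suggest Murmur3, which is not in the Python standard library, so this sketch swaps in BLAKE2b with the row index as salt (playing the role of the per-row seed); the class name and default sizes are illustrative.

```python
import hashlib


class CountMinSketch:
    def __init__(self, rows=4, width=1024):
        self.rows = rows
        self.width = width
        self.table = [[0] * width for _ in range(rows)]

    def _position(self, item, row):
        # Assumption: BLAKE2b stands in for Murmur3; the row index is the seed.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Bump one counter per row.
        for row in range(self.rows):
            self.table[row][self._position(item, row)] += count

    def lookup(self, item):
        # Collisions only ever add to a counter, so the minimum is the best guess.
        return min(
            self.table[row][self._position(item, row)] for row in range(self.rows)
        )


sketch = CountMinSketch()
for ip in ["10.0.0.1"] * 3 + ["10.0.0.2"] * 2 + ["10.0.0.3"]:
    sketch.update(ip)
```

`sketch.lookup("10.0.0.1")` is then at least 3; it can only exceed the true count when another key collides with it in every row.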

Top-K
Hold a CM-sketch
Do an update on each event occurrence
Do a lookup in CM-sketch
Add looked up key + value to the top-list
If top-list is over capacity, trim the smallest value
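A rough sketch of the Top-K recipe above. The `TopK` class, its `capacity` parameter and the dict used as the top-list are hypothetical; the compact count-min sketch uses BLAKE2b in place of Murmur3 because it ships with Python.

```python
import hashlib


class CountMinSketch:
    def __init__(self, rows=4, width=1024):
        self.rows, self.width = rows, width
        self.table = [[0] * width for _ in range(rows)]

    def _position(self, item, row):
        # Assumption: BLAKE2b salted with the row index replaces seeded Murmur3.
        salt = row.to_bytes(8, "little")
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item):
        for row in range(self.rows):
            self.table[row][self._position(item, row)] += 1

    def lookup(self, item):
        return min(self.table[r][self._position(item, r)] for r in range(self.rows))


class TopK:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.sketch = CountMinSketch()
        self.top = {}  # key -> latest estimated count

    def add(self, key):
        self.sketch.update(key)                    # update on each event
        self.top[key] = self.sketch.lookup(key)    # look up, put into the top-list
        if len(self.top) > self.capacity:          # over capacity: trim the smallest
            del self.top[min(self.top, key=self.top.get)]

    def items(self):
        return sorted(self.top.items(), key=lambda kv: kv[1], reverse=True)


top = TopK(capacity=2)
for key in ["a", "b", "a", "c", "a", "b", "d", "a", "b", "c", "a"]:
    top.add(key)
```

The sketch keeps counting keys that were trimmed from the top-list, so a key that becomes frequent later re-enters with its full estimate.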

Streaming Histogram

Update
Given: h = {(p1, m1), ..., (pB, mB)}, new point p

if p = pi for some i
  mi = mi + 1
else
  add (p, 1) as a new bin (expanding to B + 1 bins)
  if (B + 1 > max-size)
    find the min distance between neighbouring points pi and pi+1
    merge (pi, mi) and (pi+1, mi+1)
    replace both with the merged bin
  end
end
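The update procedure might look like this in runnable form. One detail is an assumption, since the slide does not spell out the merge: the merged bin sits at the count-weighted mean of the two points and carries the sum of their counts. Class and method names are illustrative.

```python
import bisect


class StreamingHistogram:
    def __init__(self, max_bins=3):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [point, count]

    def update(self, p):
        points = [b[0] for b in self.bins]
        i = bisect.bisect_left(points, p)
        if i < len(self.bins) and self.bins[i][0] == p:
            self.bins[i][1] += 1          # p = pi: bump the existing bin
            return
        self.bins.insert(i, [p, 1])       # expand with a new bin
        if len(self.bins) > self.max_bins:
            self._merge_closest()

    def _merge_closest(self):
        # Find the two neighbouring bins with the smallest gap between points.
        gaps = [self.bins[j + 1][0] - self.bins[j][0] for j in range(len(self.bins) - 1)]
        j = gaps.index(min(gaps))
        (p1, m1), (p2, m2) = self.bins[j], self.bins[j + 1]
        # Assumed merge rule: count-weighted mean point, summed count.
        merged = [(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]
        self.bins[j:j + 2] = [merged]


hist = StreamingHistogram(max_bins=3)
for value in [1, 10, 20, 2, 10]:
    hist.update(value)
```

With three bins allowed, the points 1 and 2 get merged into a single bin at 1.5 with count 2, while the repeated 10 just increments its bin.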

Histogram
Initially written for streaming decision-trees
Exact stats while the distinct values fit into the bins, approximate once bins get merged
Lightweight, since only has to keep N bins
Very useful when you don’t know value bounds

HyperLogLog

Number of distinct items in X
To use with any of the tricks from `consolidation` section
Approximate, although error is known
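A compact, hedged sketch of a HyperLogLog counter, to show why it combines well with the consolidation tricks: two sketches merge by taking the per-register maximum. The class name, the default precision of 12 bits (4096 registers, roughly 1.6% standard error) and the use of BLAKE2b as the 64-bit hash are assumptions of this example.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, precision=12):
        self.p = precision                 # must be >= 7 for the alpha formula below
        self.m = 1 << precision            # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        x = int.from_bytes(digest, "big")              # 64-bit hash
        index = x >> (64 - self.p)                     # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Assumes both sketches were built with the same precision.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return round(estimate)


a, b = HyperLogLog(), HyperLogLog()
for i in range(0, 30000):
    a.add(f"user-{i}")
for i in range(20000, 50000):
    b.add(f"user-{i}")
a.merge(b)
distinct = a.count()
```

Here `distinct` should land close to the 50000 distinct users seen across both sketches, even though 10000 of them were counted on both sides.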

K-means

Operates on vectors (double)
Assign each observation to the nearest cluster, by Euclidean distance over all dimensions
Calculate new means to be centroids
Repeat until converges
Use results to yield a model
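The loop above is short enough to sketch in plain Python. This is a minimal illustration: the function name `kmeans`, the caller-supplied starting centroids and the iteration cap are assumptions, and the toy points are made up for the example.

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Basic k-means; `centroids` is the caller-chosen starting guess."""
    for _ in range(max_iterations):
        # Assignment step: each observation goes to the nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: each centroid becomes the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
model = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The returned centroids are the model: a new observation is classified by finding the centroid closest to it.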

Naive Bayes

Operates on vectors (double)
Model = class, variance, cluster mean
Find the cluster closest to the observation
100% streamable
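To illustrate the "100% streamable" point, here is a sketch that assumes the Gaussian flavour of naive Bayes: per class it keeps only a count, a mean and a running sum of squared deviations per dimension, updated one observation at a time with Welford's method. The class name, the variance smoothing constant and the toy data are assumptions of this example.

```python
import math
from collections import defaultdict


class StreamingGaussianNB:
    def __init__(self, dimensions):
        self.count = defaultdict(int)
        self.mean = defaultdict(lambda: [0.0] * dimensions)
        self.m2 = defaultdict(lambda: [0.0] * dimensions)  # sums of squared deviations

    def update(self, vector, label):
        """Welford's online update: no observation is ever stored."""
        self.count[label] += 1
        n = self.count[label]
        mean, m2 = self.mean[label], self.m2[label]
        for d, x in enumerate(vector):
            delta = x - mean[d]
            mean[d] += delta / n
            m2[d] += delta * (x - mean[d])

    def predict(self, vector):
        total = sum(self.count.values())
        best_label, best_score = None, -math.inf
        for label, n in self.count.items():
            score = math.log(n / total)  # log prior of the class
            for d, x in enumerate(vector):
                variance = self.m2[label][d] / n + 1e-9  # smoothing avoids zero variance
                score -= 0.5 * math.log(2 * math.pi * variance)
                score -= (x - self.mean[label][d]) ** 2 / (2 * variance)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = StreamingGaussianNB(dimensions=2)
for vector in [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2)]:
    model.update(vector, "ham")
for vector in [(5.0, 6.0), (5.5, 5.5), (4.5, 6.5)]:
    model.update(vector, "spam")
```

Because the state per class is just counts, means and squared-deviation sums, it can be built incrementally on a stream and never needs the raw observations again.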

Conclusions
Minimum amount of memory
100% commutative
(Many) more algorithms available out there
Data will show the way, no golden hammer

Learn yourself some statistics and probability theory
Don’t wait for a use case, find profit in data now

@ifesdjeen
