Algebird : Abstract Algebra for Big Data Analytics.

Slide 1

Slide 1 text

Algebird: Abstract Algebra for Analytics
Sam BESSALAH, @samklr

Slide 5

Slide 5 text

Abstract Algebra

Slide 6

Slide 6 text

From Wikipedia

Slide 8

Slide 8 text

Algebraic Structure

"A set of values, coupled with one or more finite operations, and a set of laws those operations must obey."

E.g. Sum, Magma, Semigroup, Group, Monoid, Abelian Group, Semilattice, Ring, Monad, etc.

Slide 10

Slide 10 text

Semigroup

Semigroup law:
(x <> y) <> z = x <> (y <> z)   (associativity)

    trait Semigroup[T] {
      def aggregate(x: T, y: T): T
    }
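To make the law concrete, here is a minimal sketch in plain Scala (no Algebird dependency). The trait has the same shape as the one on the slide; the instances `intMax` and `stringConcat` and the helper `isAssociative` are hypothetical names introduced only for illustration.

```scala
object SemigroupSketch {
  // Same shape as the trait on the slide.
  trait Semigroup[T] {
    def aggregate(x: T, y: T): T
  }

  // Hypothetical instances, named here for illustration only.
  val intMax: Semigroup[Int] = new Semigroup[Int] {
    def aggregate(x: Int, y: Int): Int = math.max(x, y)
  }

  val stringConcat: Semigroup[String] = new Semigroup[String] {
    def aggregate(x: String, y: String): String = x + y
  }

  // Checks the associativity law for one triple of values.
  // Passing for a few triples is evidence, not a proof, that the law holds.
  def isAssociative[T](sg: Semigroup[T], x: T, y: T, z: T): Boolean =
    sg.aggregate(sg.aggregate(x, y), z) == sg.aggregate(x, sg.aggregate(y, z))
}
```

Note that string concatenation is associative but not commutative, so it is a semigroup even though argument order matters.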

Slide 11

Slide 11 text

Monoids

Monoid laws:
(x <> y) <> z = x <> (y <> z)   (associativity)
identity <> x = x
x <> identity = x               (identity)

Slide 12

Slide 12 text

Monoids

Monoid laws:
(x <> y) <> z = x <> (y <> z)   (associativity)
identity <> x = x
x <> identity = x               (identity / zero)

    trait Monoid[T] {
      def identity: T
      def aggregate(x: T, y: T): T
    }

Slide 13

Slide 13 text

Monoids

Monoid laws:
(x <> y) <> z = x <> (y <> z)
identity <> x = x
x <> identity = x

    trait Monoid[T] extends Semigroup[T] {
      def identity: T
    }
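One practical consequence of having an identity is that aggregating an empty collection is well defined. The following is a small sketch in plain Scala under that reading; `intAddition`, `setUnion` and `aggregateAll` are hypothetical names, not Algebird API.

```scala
object MonoidSketch {
  trait Semigroup[T] { def aggregate(x: T, y: T): T }
  trait Monoid[T] extends Semigroup[T] { def identity: T }

  // Int addition: the identity is 0.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
  }

  // Set union: the identity is the empty set.
  def setUnion[A]: Monoid[Set[A]] = new Monoid[Set[A]] {
    def identity: Set[A] = Set.empty[A]
    def aggregate(x: Set[A], y: Set[A]): Set[A] = x union y
  }

  // Folds from the identity, so an empty input yields the identity
  // instead of failing the way a bare reduce would.
  def aggregateAll[T](values: Seq[T], m: Monoid[T]): T =
    values.foldLeft(m.identity)(m.aggregate)
}
```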

Slide 14

Slide 14 text

Groups

Group laws:
(x <> y) <> z = x <> (y <> z)   (associativity)
identity <> x = x
x <> identity = x               (identity)
x <> inverse x = identity
inverse x <> x = identity       (invertibility)

Slide 15

Slide 15 text

Groups

Group laws:
(x <> y) <> z = x <> (y <> z)
identity <> x = x
x <> identity = x
x <> inverse x = identity
inverse x <> x = identity

    trait Group[T] extends Monoid[T] {
      def inverse(v: T): T
    }
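As an illustration of what the inverse buys you, the sketch below uses Int addition as a group and removes a value from a running total by aggregating its inverse. It is plain Scala; `intAddition` and `retract` are hypothetical names chosen for this example.

```scala
object GroupSketch {
  trait Semigroup[T] { def aggregate(x: T, y: T): T }
  trait Monoid[T] extends Semigroup[T] { def identity: T }
  trait Group[T] extends Monoid[T] { def inverse(v: T): T }

  // Int addition forms a group: the inverse of v is -v.
  val intAddition: Group[Int] = new Group[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
    def inverse(v: Int): Int = -v
  }

  // Removes a previously aggregated value by aggregating its inverse.
  def retract[T](total: T, v: T, g: Group[T]): T =
    g.aggregate(total, g.inverse(v))
}
```

By contrast, max over Int has no inverse: once a maximum is taken, the smaller value cannot be recovered, so max is not a group.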

Slide 16

Slide 16 text

Many More
- Abelian groups (commutative groups)
- Rings
- Semilattices
- Ordered semigroups
- Fields
...

Many of those are in Algebird.

Slide 17

Slide 17 text

Examples
- (a min b) min c = a min (b min c), with Int
- a max (b max c) = (a max b) max c
- a or (b or c) = (a or b) or c
- a and (b and c) = (a and b) and c
- Int addition
- Set union
- Harmonic sum
- Integer mean
- Priority queue

Slide 19

Slide 19 text

Why do we need those algebraic structures ?

Slide 20

Slide 20 text

We want to:
- Build scalable analytics systems
- Leverage distributed computing to perform aggregation on really large data sets

At the end of the day, a lot of operations in analytics are just sorting and counting.

Slide 22

Slide 22 text

Distributed Computing → Parallelism
Associativity → enables parallelism

Slide 24

Slide 24 text

Distributed Computing → Parallelism
Associativity enables parallelism
Identity means we can ignore some data
Commutativity helps us ignore order
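The three properties above can be sketched in plain Scala by simulating workers with chunks of a collection. This is only an illustration under simplifying assumptions (no real cluster, no threads); `sequential`, `chunked` and `chunkedAnyOrder` are hypothetical helper names.

```scala
object ParallelAggregationSketch {
  trait Monoid[T] {
    def identity: T
    def aggregate(x: T, y: T): T
  }

  val intAddition: Monoid[Int] = new Monoid[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
  }

  // Baseline: one machine folds over everything, left to right.
  def sequential[T](values: Seq[T], m: Monoid[T]): T =
    values.foldLeft(m.identity)(m.aggregate)

  // Each chunk stands in for one worker. Associativity guarantees that
  // combining the per-chunk partial results equals the sequential result.
  // The identity covers empty chunks and empty inputs.
  def chunked[T](values: Seq[T], chunkSize: Int, m: Monoid[T]): T = {
    val partials = values.grouped(chunkSize).map(c => sequential(c, m)).toList
    sequential(partials, m)
  }

  // If the operation is also commutative, partial results may arrive in
  // any order; reversing them here simulates out-of-order arrival.
  def chunkedAnyOrder[T](values: Seq[T], chunkSize: Int, m: Monoid[T]): T = {
    val partials = values.grouped(chunkSize).map(c => sequential(c, m)).toList
    sequential(partials.reverse, m)
  }
}
```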

Slide 25

Slide 25 text

Typical Map Reduce ...

Slide 26

Slide 26 text

Finding Top-K Elements in Scalding ...

    class TopKJob(args: Args) extends Job(args) {
      Tsv(args("input"), visitScheme)
        .filter(...)
        .leftJoinWithTiny(…)
        .filter(…)
        .groupBy('fieldOne) { _.sortWithTake(visitScheme -> top)(biggerSale) }
        .write(Tsv(...))
    }

Slide 27

Slide 27 text

.sortWithTake( … )

Looking into .sortWithTake in Scalding, there's one nice thing:

    class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T])
      extends Monoid[PriorityQueue[T]]

Slide 28

Slide 28 text

    class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T])
      extends Monoid[PriorityQueue[T]]

Let's take a look:

PQ1: 55, 45, 21, 3
PQ2: 100, 80, 40, 3
top-4 (PQ1 U PQ2): 100, 80, 55, 45

Priority Queue:
- Can be empty
- Two Priority Queues can be "added" in any order
- Associative + Commutative
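The PQ1/PQ2 example can be reproduced with a simplified stand-in for the priority queue monoid. This sketch assumes a sorted, bounded `List` instead of a real priority queue, so it shows the algebra rather than Algebird's actual implementation; `BoundedTopMonoid` and `build` are hypothetical names.

```scala
object TopKSketch {
  trait Monoid[T] {
    def identity: T
    def aggregate(x: T, y: T): T
  }

  // Keeps at most `max` largest elements, stored in descending order.
  // A real implementation would use a heap; a sorted list keeps the sketch short.
  class BoundedTopMonoid[T](max: Int)(implicit ord: Ordering[T]) extends Monoid[List[T]] {
    def identity: List[T] = Nil

    // Merge both sides, sort descending, keep the top `max`.
    def aggregate(x: List[T], y: List[T]): List[T] =
      (x ++ y).sorted(ord.reverse).take(max)

    // Builds a bounded queue from raw values.
    def build(values: Iterable[T]): List[T] = aggregate(values.toList, Nil)
  }
}
```

With `max = 4`, aggregating the two queues from the slide yields 100, 80, 55, 45 in either argument order, and aggregating with the empty queue leaves a queue unchanged.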

Slide 30

Slide 30 text

Stream Mining Challenges
- Update predictions after each observation
- Single pass: can't read old data or replay the stream
- Full size of the stream often unknown
- Limited time for computation per observation
- O(1) memory size

Slide 31

Slide 31 text

Stream Mining Challenges http://radar.oreilly.com/2013/10/stream-mining-essentials.html

Slide 33

Slide 33 text

Tradeoff : Space and speed over accuracy.

Use sketches.

Slide 34

Slide 34 text

Sketches

Probabilistic data structures that store a summary (mostly hashed) of a data set that would be costly to store in its entirety, and that therefore usually offer sublinear space and time.

E.g. Bloom Filters, Counter Sketch, KMV counters, Count Min Sketch, HyperLogLog, Min Hashes

Slide 35

Slide 35 text

Bloom filters

Approximate data structure for set membership. Behaves like an approximate set.

BloomFilter.contains(x) => NO | Maybe

P(False Positive) > 0
P(False Negative) = 0

Slide 36

Slide 36 text

Internally: a bit array of fixed size.

add(x): for each hash function i, set b[h(x, i)] = 1 (Boolean OR => associative)
contains(x): TRUE if b[h(x, i)] == 1 for all i (Boolean AND => associative)

Both are associative => BF can be designed as a Monoid
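The description above can be sketched as a small immutable Bloom filter in plain Scala. This is an illustration, not Algebird's implementation: it assumes the k hash functions can be simulated by seeding `MurmurHash3.stringHash` with the index i, and it stores the set bits as a `Set[Int]` rather than a packed bit array. The names `Bloom`, `positions`, `merge` and `empty` are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object BloomFilterDemo {
  // `bits` holds the indices of the bits currently set to 1.
  final case class Bloom(numHashes: Int, width: Int, bits: Set[Int]) {
    // One bit position per simulated hash function h(x, i).
    private def positions(x: String): Seq[Int] =
      (0 until numHashes).map { i =>
        val h = MurmurHash3.stringHash(x, i)
        ((h % width) + width) % width // map a possibly negative hash into [0, width)
      }

    // add(x): set b[h(x, i)] = 1 for every i.
    def add(x: String): Bloom = copy(bits = bits ++ positions(x))

    // contains(x): true only if every b[h(x, i)] is set (Boolean AND).
    def contains(x: String): Boolean = positions(x).forall(bits.contains)

    // Monoid "plus": bitwise OR of the two bit arrays, here a set union.
    def merge(other: Bloom): Bloom = {
      require(numHashes == other.numHashes && width == other.width)
      copy(bits = bits union other.bits)
    }
  }

  // Monoid identity: a filter with no bits set.
  def empty(numHashes: Int, width: Int): Bloom = Bloom(numHashes, width, Set.empty[Int])
}
```

Because merging is a set union, two filters built on separate partitions combine into exactly the filter that a single pass over all the data would have produced.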

Slide 37

Slide 37 text

Bloom filters

    import com.twitter.algebird._
    import com.twitter.algebird.Operators._

    // generate a list
    val A = (1 to 300).toList

    // generate a Bloom filter
    val NUM_HASHES = 6
    val WIDTH = 6000 // bits
    val SEED = 1
    implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)

    // approximate set with a Bloom filter
    val A_bf = A.map { i => bfm.create(i.toString) }.reduce(_ + _)
    val approxBool = A_bf.contains("150")
    // ---> ApproximateBoolean(true, 0.9995…)

Slide 38

Slide 38 text

Count Min Sketch

Gives an approximation of the number of occurrences of an element in a data stream.

Slide 39

Slide 39 text

Count Min Sketch

- Adding an element is a numerical addition.
- Querying uses a MIN function.
- Both are associative.
- Useful for detecting heavy hitters, top-K, LSH.

Algebird ships an implementation.
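A minimal Count-Min sketch in plain Scala, written to show where the addition and the MIN appear. It is a sketch under assumptions, not Algebird's implementation: rows are simulated by seeding `MurmurHash3.stringHash` with the row index, and the names `CMS`, `column`, `estimate`, `merge` and `empty` are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object CountMinSketchDemo {
  // `table` has `depth` rows (one per hash function) and `width` counters per row.
  final case class CMS(depth: Int, width: Int, table: Vector[Vector[Long]]) {
    // Counter index for x in the given row.
    private def column(x: String, row: Int): Int = {
      val h = MurmurHash3.stringHash(x, row)
      ((h % width) + width) % width
    }

    // Adding an element is a numerical addition on one counter per row.
    def add(x: String, count: Long = 1L): CMS = {
      val updated = table.zipWithIndex.map { case (cells, row) =>
        val c = column(x, row)
        cells.updated(c, cells(c) + count)
      }
      copy(table = updated)
    }

    // Querying takes the MIN over rows. Collisions can only inflate a
    // counter, so the estimate never undercounts.
    def estimate(x: String): Long =
      (0 until depth).map(row => table(row)(column(x, row))).min

    // Monoid "plus": cell-wise addition of two tables of the same shape.
    def merge(other: CMS): CMS = {
      require(depth == other.depth && width == other.width)
      val summed = table.zip(other.table).map { case (a, b) =>
        a.zip(b).map { case (p, q) => p + q }
      }
      copy(table = summed)
    }
  }

  // Monoid identity: all counters at zero.
  def empty(depth: Int, width: Int): CMS =
    CMS(depth, width, Vector.fill(depth, width)(0L))
}
```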

Slide 40

Slide 40 text

HyperLogLog

Popular sketch for cardinality estimation. Estimates, within a probabilistic error bound, the number of distinct values in a data set.

HLL.size = Approx[Number]

Intuition
- Long runs of trailing 0s in a random bit string are rare.
- But the more bit strings you look at, the more likely you are to find a long one.
- The longest run of trailing 0 bits seen can be used to estimate the number of unique bit strings observed.

Slide 41

Slide 41 text

Adding an element uses a Max and a Sum function. Both are associative, and both are Monoids. (In Algebird, Max is really an ordered semigroup.)

Querying the estimate uses a harmonic mean, which is a Monoid.

Algebird ships an implementation.
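To show where the Max and the harmonic mean appear, here is a deliberately simplified HyperLogLog in plain Scala. It is a sketch under several assumptions and not Algebird's implementation: it hashes with a 32-bit `MurmurHash3.stringHash`, uses the raw estimator with the usual constant for 128 or more registers, and omits the small-range and large-range corrections. The names `HLL`, `merge`, `estimate` and `empty` are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HyperLogLogDemo {
  // p index bits give m = 2^p registers; each register stores the largest
  // "rank" (trailing zero run length + 1) seen among hashes routed to it.
  final case class HLL(p: Int, registers: Vector[Int]) {
    private val m: Int = 1 << p

    def add(x: String): HLL = {
      val h = MurmurHash3.stringHash(x)
      val index = h >>> (32 - p)           // top p bits pick the register
      val rest = h & ((1 << (32 - p)) - 1) // remaining low bits
      val rank =
        if (rest == 0) 32 - p + 1
        else Integer.numberOfTrailingZeros(rest) + 1
      // Adding an element is a Max on one register.
      copy(registers = registers.updated(index, math.max(registers(index), rank)))
    }

    // Monoid "plus": register-wise Max.
    def merge(other: HLL): HLL = {
      require(p == other.p)
      copy(registers = registers.zip(other.registers).map { case (a, b) => math.max(a, b) })
    }

    // Raw estimate: alpha * m^2 divided by the sum of 2^(-register),
    // i.e. m times a scaled harmonic mean. No range corrections applied.
    def estimate: Double = {
      val alpha = 0.7213 / (1.0 + 1.079 / m)
      val harmonicSum = registers.map(r => math.pow(2.0, -r)).sum
      alpha * m * m / harmonicSum
    }
  }

  // Monoid identity: all registers at zero.
  def empty(p: Int): HLL = HLL(p, Vector.fill(1 << p)(0))
}
```

Since Max is idempotent as well as associative and commutative, adding the same element twice, or merging overlapping sketches, does not change the registers. That is what makes the structure count distinct values.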

Slide 42

Slide 42 text

Many More juicy sketches ...
- MinHashes to compute Jaccard similarity
- QTree for quantile estimation. Neat for anomaly detection.
- SpaceSaverMonoid. Awesome for finding the approximate most frequent and top-K elements.
- TopKMonoid
- SGD, PriorityQueues, Histograms, etc.

Slide 43

Slide 43 text

SummingBird : Lambda in a box

Slide 44

Slide 44 text

Heard of Lambda Architecture ?

Slide 46

Slide 46 text

SummingBird Same code, for both batch and real time processing. But works only on Monoids. Uses Storehaus, as a mergeable store layer.

Slide 47

Slide 47 text

http://github.com/twitter/algebird

Slide 49

Slide 49 text

These slides : http://bit.ly/1szncAZ http://slidesha.re/1zhhXKU

Slide 51

Slide 51 text

Links
- Algebra for analytics, by Oscar Boykin (creator of Algebird): http://speakerdeck.com/johnynek/algebra-for-analytics
- Take a look at HLearn: https://github.com/mikeizbicki/HLearn
- Great intro to Algebird, by Michael Noll: http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
- Aggregate Knowledge: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure
- Probabilistic data structures for web analytics: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
- http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure-for.html
- http://infolab.stanford.edu/~ullman/mmds/ch3.pdf