Algebird : Abstract Algebra for Big Data Analytics.

Sam Bessalah
November 11, 2014

Devoxx 2014, Antwerp, Belgium.
Tools in Action Room 4
Antwerp, Tue 11th Nov. 2014

Transcript

  1. Algebird
    Abstract Algebra
    for
    Analytics
    Sam BESSALAH
    @samklr

  5. Abstract Algebra

  6. From Wikipedia

  7. Algebraic Structure
    “Set of values, coupled with one or
    more finite operations, and a set of
    laws those operations must obey.”

  8. Algebraic Structure
    “Set of values, coupled with one or more
    finite operations, and a set of laws those
    operations must obey.”
    e.g. Sum, Magma, Semigroup, Groups, Monoid,
    Abelian Group, Semi Lattices, Rings, Monads,
    etc.

  9. Semigroup
    Semigroup Law :
    (x <> y) <> z = x <> (y <> z)
    (associativity)

  10. Semigroup
    Semigroup Law :
    (x <> y) <> z = x <> (y <> z)
    (associativity)
    trait Semigroup[T] {
      def aggregate(x: T, y: T): T
    }
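
A minimal sketch of the trait above with two instances, using only the Scala standard library. The names `intAddition`, `stringConcat` and `isAssociative` are hypothetical and chosen for illustration; they are not Algebird's API.

```scala
object SemigroupSketch {
  // Same shape as the slide's trait: one associative binary operation.
  trait Semigroup[T] {
    def aggregate(x: T, y: T): T
  }

  // Hypothetical instances, named here for illustration only.
  val intAddition: Semigroup[Int] = new Semigroup[Int] {
    def aggregate(x: Int, y: Int): Int = x + y
  }
  val stringConcat: Semigroup[String] = new Semigroup[String] {
    def aggregate(x: String, y: String): String = x + y
  }

  // Checks the associativity law for one triple of values.
  def isAssociative[T](sg: Semigroup[T], x: T, y: T, z: T): Boolean =
    sg.aggregate(sg.aggregate(x, y), z) == sg.aggregate(x, sg.aggregate(y, z))
}
```

Checking a single triple does not prove the law; it only shows what the law states for concrete values.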

  11. Monoids
    Monoid Laws :
    (x <> y) <> z = x <> (y <> z)
    (associativity)
    identity <> x = x
    x <> identity = x
    (identity)

  12. Monoids
    Monoid Laws :
    (x <> y) <> z = x <> (y <> z)
    (associativity)
    identity <> x = x
    x <> identity = x
    (identity / zero)
    trait Monoid[T] {
      def identity: T
      def aggregate(x: T, y: T): T
    }

  13. Monoids
    Monoid Laws :
    (x <> y) <> z = x <> (y <> z)
    (associativity)
    identity <> x = x
    x <> identity = x
    trait Monoid[T] extends Semigroup[T] {
      def identity: T
    }
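
One practical use of the identity element is that aggregating an empty collection has a well-defined result. The sketch below assumes the trait shapes from the slides; `aggregateAll` and `intAddition` are hypothetical names, not part of Algebird.

```scala
object MonoidSketch {
  trait Semigroup[T] { def aggregate(x: T, y: T): T }
  trait Monoid[T] extends Semigroup[T] { def identity: T }

  // Hypothetical instance: integer addition with 0 as the identity.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
  }

  // Folds from the identity, so an empty input returns the identity itself.
  def aggregateAll[T](values: Seq[T], m: Monoid[T]): T =
    values.foldLeft(m.identity)(m.aggregate)
}
```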

  14. Groups
    Group Laws:
    (x <> y) <> z = x <> (y <> z)
    (associativity)
    identity <> x = x
    x <> identity = x
    (identity)
    x <> inverse x = identity
    inverse x <> x = identity
    (invertibility)

  15. Groups
    Group Laws
    (x <> y) <> z = x <> (y <> z)
    identity <> x = x
    x <> identity = x
    x <> inverse x = identity
    inverse x <> x = identity
    trait Group[T] extends Monoid[T] {
      def inverse(v: T): T
    }
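
The inverse is what lets an aggregate "forget" a contribution: removing a value is the same as aggregating its inverse. A minimal sketch under that reading, where `remove` and `intAddition` are hypothetical names introduced here:

```scala
object GroupSketch {
  trait Semigroup[T] { def aggregate(x: T, y: T): T }
  trait Monoid[T] extends Semigroup[T] { def identity: T }
  trait Group[T] extends Monoid[T] { def inverse(v: T): T }

  // Hypothetical instance: integers under addition, with negation as inverse.
  val intAddition: Group[Int] = new Group[Int] {
    def identity: Int = 0
    def aggregate(x: Int, y: Int): Int = x + y
    def inverse(v: Int): Int = -v
  }

  // Removes a contribution from a running total by aggregating its inverse.
  def remove[T](g: Group[T], total: T, contribution: T): T =
    g.aggregate(total, g.inverse(contribution))
}
```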

  16. Many More
    - Abelian groups (commutative groups)
    - Rings
    - Semi Lattices
    - Ordered Semigroups
    - Fields ...
    Many of those are in Algebird.

  17. Examples
    - (a min b) min c = a min (b min c) with Int
    - a max (b max c) = (a max b) max c
    - a or (b or c) = (a or b) or c
    - a and (b and c) = (a and b) and c
    - int addition
    - set union
    - harmonic sum
    - Integer mean
    - Priority queue
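
The mean is the least obvious entry in this list: averaging two averages is not associative, but a (count, sum) pair is, and the mean can be derived from it at the end. The sketch below uses a hypothetical `Avg` type to show this; Algebird has its own type for the same purpose (`AveragedValue`), which is not reproduced here.

```scala
object MeanSketch {
  // Hypothetical carrier type: keep count and sum, derive the mean on demand.
  final case class Avg(count: Long, sum: Long) {
    def mean: Double = if (count == 0) 0.0 else sum.toDouble / count
  }

  // Identity element: no observations yet.
  val zero: Avg = Avg(0L, 0L)

  // Lifts a single observation into the carrier type.
  def of(x: Int): Avg = Avg(1L, x.toLong)

  // Associative and commutative: counts and sums are simply added.
  def plus(a: Avg, b: Avg): Avg = Avg(a.count + b.count, a.sum + b.sum)
}
```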

  19. Why do we need those algebraic
    structures ?

  20. We want to :
    - Build scalable analytics systems
    - Leverage distributed computing to perform aggregation
    on really large data sets.
    - A lot of operations in analytics are just sorting and
    counting at the end of the day

  21. Distributed Computing → Parallelism

  22. Distributed Computing → Parallelism
    Associativity → enables parallelism

  24. Distributed Computing → Parallelism
    Associativity enables parallelism
    Identity means we can ignore some data
    Commutativity helps us ignore order
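
To make the associativity claim concrete, the sketch below splits the input into chunks, reduces each chunk on its own (as separate workers could), then combines the partial results. `chunkedAggregate` is a hypothetical helper written for this illustration and runs sequentially; it only demonstrates that the grouping does not change the answer when the operation is associative with an identity.

```scala
object ParallelSketch {
  // Reduces each chunk independently, then combines the partial results.
  // The result matches a plain left fold only if `op` is associative
  // and `zero` is its identity.
  def chunkedAggregate[T](values: Seq[T], chunkSize: Int, zero: T)(op: (T, T) => T): T =
    values.grouped(chunkSize)        // split the data as a cluster would
      .map(_.foldLeft(zero)(op))     // each chunk could run on a different machine
      .foldLeft(zero)(op)            // combine the partial results
}
```

With addition, every chunk size gives the same total. With subtraction, which is not associative, the chunked result differs from the sequential one.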

  25. Typical Map Reduce ...

  26. Finding Top-K Elements in Scalding ...
    class TopKJob(args: Args) extends Job(args) {
      Tsv(args("input"), visitScheme)
        .filter(...)
        .leftJoinWithTiny(…)
        .filter(…)
        .groupBy('fieldOne) { _.sortWithTake(visitScheme -> top)(biggerSale) }
        .write(Tsv(...))
    }

  27. .sortWithTake( … )
    Looking into .sortWithTake in Scalding, there’s one
    nice thing :
    class PriorityQueueMonoid[T](max : Int)
    (implicit order : Ordering[T])
    extends Monoid[PriorityQueue[T]]

  28. class PriorityQueueMonoid[T](max : Int)
    (implicit order : Ordering[T])
    extends Monoid[PriorityQueue[T]]
    Let’s take a look :
    PQ1 : 55, 45, 21, 3
    PQ2: 100, 80, 40, 3
    top-4 (PQ1 U PQ2 ): 100, 80, 55, 45
    Priority Queue :
    Can be empty
    Two Priority Queues can be “added” in any order
    Associative + Commutative

  29. class PriorityQueueMonoid[T](max : Int)
    (implicit order : Ordering[T])
    extends Monoid[PriorityQueue[T]]
    Let’s take a look :
    PQ1 : 55, 45, 21, 3
    PQ2: 100, 80, 40, 3
    top-4 (PQ1 U PQ2 ): 100, 80, 55, 45
    Priority Queue :
    Can be empty
    Two Priority Queues can be “added” in any order
    Associative + Commutative
    Makes Scalding go fast by doing the sorting, filtering
    and extracting in one single “map” step.
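
The PQ1/PQ2 example above can be reproduced with a small stand-in. This is not Algebird's `PriorityQueueMonoid`; it is a hypothetical `TopK` class that represents a bounded queue as a descending list of at most `max` items, which is enough to show the monoid behaviour.

```scala
object TopKSketch {
  // Hypothetical stand-in for a bounded priority queue:
  // a descending list holding at most `max` items.
  final class TopK(max: Int) {
    // Identity element: the empty queue.
    val zero: List[Int] = Nil

    // Keeps only the `max` largest items, largest first.
    def build(items: Seq[Int]): List[Int] =
      items.sorted(Ordering[Int].reverse).take(max).toList

    // "Adding" two queues: merge them and keep the top `max` items.
    def plus(a: List[Int], b: List[Int]): List[Int] = build(a ++ b)
  }
}
```

Because `plus` is associative and commutative, each mapper can keep its own top-k and the partial results can be merged in any order.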

  30. Stream Mining Challenges
    - Update predictions after each observation
    - Single pass : can’t read old data or replay
    the stream
    - Full size of the stream often unknown
    - Limited time for computation per
    observation
    - O(1) memory size

  31. Stream Mining Challenges
    http://radar.oreilly.com/2013/10/stream-mining-essentials.html

  32. Tradeoff : Space and speed over
    accuracy.

  33. Tradeoff : Space and speed over
    accuracy.
    Use sketches.

  34. Sketches
    Probabilistic data structures that store a summary
    (mostly hashed) of a data set that would be costly to
    store in its entirety, which usually gives them
    sublinear space and time costs.
    E.g. Bloom Filters, Counter Sketch, KMV counters,
    Count Min Sketch, HyperLogLog, Min Hashes

  35. Bloom filters
    Approximate data structure for set membership
    Behaves like an approximate set
    BloomFilter.contains(x) => NO | Maybe
    P(False Positive) > 0
    P(False Negative) = 0

  36. Internally :
    Bit Array of fixed size
    add(x) : for each hash function i, set b[h(x,i)] = 1
    contains(x) : TRUE if b[h(x,i)] == 1 for all i.
    (Boolean AND => associative)
    Both are associative => BF can be designed as a Monoid
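
A minimal sketch of that internal structure, using only the standard library. The `Bloom` case class, its parameters and the seeded use of `MurmurHash3.stringHash` are assumptions made for this illustration and do not mirror Algebird's implementation. Merging two filters is a bitwise OR, with the empty filter as identity.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object BloomSketch {
  // Hypothetical parameters: width in bits and number of hash functions.
  final case class Bloom(width: Int, numHashes: Int, bits: BitSet) {
    // One bit position per hash function, folded into [0, width).
    private def positions(x: String): Seq[Int] =
      (0 until numHashes).map { i =>
        val h = MurmurHash3.stringHash(x, i) // seed i plays the role of h(x, i)
        ((h % width) + width) % width
      }

    // Sets every bit selected by the hash functions.
    def add(x: String): Bloom = copy(bits = positions(x).foldLeft(bits)(_ + _))

    // True only if all selected bits are set (Boolean AND).
    def contains(x: String): Boolean = positions(x).forall(bits.contains)

    // Monoid "plus": bitwise OR of the two bit arrays.
    def merge(other: Bloom): Bloom = copy(bits = bits | other.bits)
  }

  // Monoid identity: a filter with no bits set.
  def empty(width: Int, numHashes: Int): Bloom = Bloom(width, numHashes, BitSet.empty)
}
```

An element added to either filter is always found in the merged one, so merging preserves the "no false negatives" property.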

  37. Bloom filters
    import com.twitter.algebird._
    import com.twitter.algebird.Operators._
    // Generate a list
    val A = (1 to 300).toList
    // Generate a Bloom filter
    val NUM_HASHES = 6
    val WIDTH = 6000 // bits
    val SEED = 1
    implicit val bfm = new BloomFilterMonoid(NUM_HASHES, WIDTH, SEED)
    // Approximate set with Bloom filter
    val A_bf = A.map { i => bfm.create(i.toString) }.reduce(_ + _)
    val approxBool = A_bf.contains("150") // ---> ApproximateBoolean(true, 0.9995…)

  38. Count Min Sketch
    Gives an approximation of the number of occurrences of an
    element in a set.

  39. Count Min Sketch
    Adding an element is a numerical addition.
    Querying uses a MIN function.
    Both are associative.
    Useful for detecting heavy hitters, topK, LSH.
    Algebird ships a Count Min Sketch monoid.
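
The two operations named above can be sketched without Algebird. The `Cms` case class below, its `depth` and `width` parameters and its seeded hashing are assumptions made for illustration, not Algebird's API. Each row has its own hash; adding increments one counter per row, querying takes the minimum across rows, and merging adds the tables cell by cell.

```scala
import scala.util.hashing.MurmurHash3

object CmsSketch {
  // Hypothetical parameters: `depth` rows (one hash each), `width` counters per row.
  final case class Cms(depth: Int, width: Int, table: Vector[Vector[Long]]) {
    // Column selected for item x in the given row, folded into [0, width).
    private def col(x: String, row: Int): Int = {
      val h = MurmurHash3.stringHash(x, row)
      ((h % width) + width) % width
    }

    // Adding an element: increment one counter in every row.
    def add(x: String): Cms =
      copy(table = table.zipWithIndex.map { case (r, i) =>
        val c = col(x, i)
        r.updated(c, r(c) + 1L)
      })

    // Querying: minimum over rows. Collisions only inflate counters,
    // so the estimate never undercounts.
    def estimate(x: String): Long =
      table.zipWithIndex.map { case (r, i) => r(col(x, i)) }.min

    // Monoid "plus": cell-wise addition of the two tables.
    def merge(o: Cms): Cms =
      copy(table = table.zip(o.table).map { case (a, b) =>
        a.zip(b).map { case (p, q) => p + q }
      })
  }

  // Monoid identity: all counters at zero.
  def empty(depth: Int, width: Int): Cms =
    Cms(depth, width, Vector.fill(depth, width)(0L))
}
```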

  40. HyperLogLog
    Popular sketch for cardinality estimation.
    Estimates the number of distinct values in a data
    set, within a probabilistic error bound.
    HLL.size = Approx[Number]
    Intuition
    Long runs of trailing 0s in a random bit
    chain are rare.
    But the more bit chains you look at, the more
    likely you are to find a long one.
    The longest run of trailing 0-bits seen can be
    an estimator of the number of unique bit chains
    observed.

  41. Adding an element uses a Max and a Sum function.
    Both are associative and Monoids. (Max is really an
    ordered semigroup in Algebird.)
    Querying the cardinality uses a harmonic mean,
    which is a Monoid.
    Algebird provides this as HyperLogLogMonoid.
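
A simplified sketch of the register mechanics, using only the standard library. The `Hll` case class, the 32-bit `MurmurHash3` hash and the register layout are assumptions made for illustration; this is not Algebird's `HyperLogLogMonoid`. The estimate is the raw harmonic-mean formula without the small-range and large-range corrections that production implementations apply.

```scala
import scala.util.hashing.MurmurHash3

object HllSketch {
  // Hypothetical sizing: 2^p registers, indexed by the low p bits of a 32-bit hash.
  final case class Hll(p: Int, registers: Vector[Int]) {
    private val m = 1 << p

    // Adding an element: keep the max rank seen in the selected register.
    def add(x: String): Hll = {
      val h = MurmurHash3.stringHash(x)
      val idx = h & (m - 1)   // low p bits pick a register
      val rest = h >>> p      // remaining 32 - p bits
      // Rank = 1 + number of trailing zeros in the remaining bits.
      val rank =
        if (rest == 0) 32 - p + 1
        else Integer.numberOfTrailingZeros(rest) + 1
      copy(registers = registers.updated(idx, math.max(registers(idx), rank)))
    }

    // Monoid "plus": register-wise max.
    def merge(o: Hll): Hll =
      copy(registers = registers.zip(o.registers).map { case (a, b) => math.max(a, b) })

    // Raw estimate based on the harmonic mean of 2^register values.
    def estimate: Double = {
      val alpha = 0.7213 / (1.0 + 1.079 / m) // standard constant, valid for m >= 128
      alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    }
  }

  // Monoid identity: all registers at zero.
  def empty(p: Int): Hll = Hll(p, Vector.fill(1 << p)(0))
}
```

Because max is associative, commutative and idempotent, merging sketches built from overlapping data gives exactly the sketch of the union.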

  42. Many More juicy sketches ...
    - MinHashes to compute Jaccard similarity
    - QTree for quantile estimation. Neat for anomaly
    detection.
    - SpaceSaverMonoid, awesome for finding the approximate
    most frequent and top K elements.
    - TopKMonoid
    - SGD, PriorityQueues, Histograms, etc.

  43. SummingBird : Lambda in a box

  44. Heard of Lambda Architecture ?

  45. SummingBird
    Same code for both batch and real time processing.

  46. SummingBird
    Same code, for both batch and real time processing.
    But works only on Monoids.
    Uses Storehaus as a mergeable store layer.
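
Why the monoid requirement matters can be shown without SummingBird itself. The sketch below is not SummingBird's or Storehaus's API; `count` and `plus` are hypothetical helpers. The same counting logic runs over a "batch" and a "real time" slice of the data, and the two partial results are merged with a map monoid, giving the same answer as a single pass over everything.

```scala
object LambdaMergeSketch {
  // Map monoid for counts: the identity is the empty map,
  // and plus adds the values key by key.
  def plus(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }

  // The same counting logic is used for both layers.
  def count(words: Seq[String]): Map[String, Long] =
    words.map(w => Map(w -> 1L)).foldLeft(Map.empty[String, Long])(plus)
}
```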

  47. http://github.com/twitter/algebird

  49. These slides :
    http://bit.ly/1szncAZ
    http://slidesha.re/1zhhXKU

  51. Links
    - Algebra for analytics by Oscar Boykin (Creator of Algebird)
    http://speakerdeck.com/johnynek/algebra-for-analytics
    - Take a look into HLearn https://github.com/mikeizbicki/HLearn
    - Great intro into Algebird by Michael Noll
    http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
    - Aggregate Knowledge http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure
    - Probabilistic data structures for web analytics.
    http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
    - http://debasishg.blogspot.fr/2014/01/count-min-sketch-data-structure-for.html
    - http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
