
Algebra for analytics

Open World Forum 2014
Paris ML Meetup

Sam Bessalah

October 30, 2014

Transcript

  1. Abstract Algebra for
    Analytics
    Sam BESSALAH
    @samklr

  4. What do we want?
    • We want to build scalable systems.
    • Preferably by leveraging distributed computing
    • A lot of analytics amounts to counting or adding, in some form.

  5. • Example: finding the top-K elements
    Read input → Sort, filter and take the top K records → Write output
    Input: 11, 12, 0, 3, 56, 48   K = 3   Output: 56, 48, 12

  6. • Example: finding the top-K elements, as a Hadoop Map-Reduce job
    Read input → Sort, filter and take the top K records → Write output

  8. In Scalding

  10. Problems
    • Curse of the last reducer
    • Network chatter hinders performance
    • Inefficient ordering of the map and reduce steps
    • Multiple jobs, with a synchronization barrier at the reducer

  13. But in Scalding, « sortWithTake » uses a Priority Queue:
    Can be empty
    Two priority queues can be added in any order
    Associative + Commutative
    PQ1: 55, 45, 21, 3
    PQ2: 100, 80, 40, 3
    K = 4
    PQ1 (+) PQ2: 100, 80, 55, 45
    In a single pass
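
    A minimal Scala sketch of why this works (not Scalding's actual implementation): keep the top K in a bounded structure, and merge two partial results by re-truncating to K. The merge is associative and commutative, so partial top-Ks can be combined in any order.

    case class TopK(k: Int, items: List[Int]) {
      // Merge two partial top-K results: concatenate, sort descending, keep K.
      def +(other: TopK): TopK =
        TopK(k, (items ++ other.items).sorted(Ordering[Int].reverse).take(k))
    }

    val pq1 = TopK(4, List(55, 45, 21, 3))
    val pq2 = TopK(4, List(100, 80, 40, 3))
    // pq1 + pq2 and pq2 + pq1 both give TopK(4, List(100, 80, 55, 45))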

  14. Why is it better and faster?

  15. Associativity allows parallelism
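
    A small sketch of the idea in plain Scala: because addition is associative, each partition can be reduced independently (on a different worker) and the partial results combined afterwards.

    val data       = (1 to 1000).toList
    val partitions = data.grouped(250).toList      // e.g. four workers
    val partials   = partitions.map(_.sum)         // each partition reduced independently
    val total      = partials.sum                  // partial results combined afterwards
    // total == data.sum (== 500500), regardless of how the data was split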

  17. Do we have data structures that
    are intrinsically parallelizable?

  18. Abstract Algebra Redux
    • Semigroup
    Set with an associative operation (grouping doesn't matter)
    • Monoid
    Semigroup with a zero (zeros get ignored)
    • Group
    Monoid with an inverse
    • Abelian Group
    Commutative group (ordering doesn't matter)
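
    The hierarchy above, written as a minimal sketch of Scala type classes (Algebird and Cats define richer versions of the same idea):

    trait Semigroup[T] { def plus(a: T, b: T): T }                // associative plus
    trait Monoid[T]    extends Semigroup[T] { def zero: T }       // identity for plus
    trait Group[T]     extends Monoid[T] { def negate(a: T): T }  // inverse for plus
    // An abelian group is a group whose plus is also commutative.

    object IntAddition extends Group[Int] {
      def plus(a: Int, b: Int) = a + b
      def zero                 = 0
      def negate(a: Int)       = -a
    }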

  21. Stream mining challenges
    • Update predictions after every observation
    • Single pass: can't read old data or replay the stream
    • Limited time for computation per observation
    • o(n) memory: can't store the whole stream

  22. Existing solutions
    • Knuth's reservoir sampling works on an evolving stream of data and in fixed memory.
    • Stream subsampling
    • Adaptive sliding windows: build decision trees on these windows, e.g. Hoeffding Trees
    • Use time series analysis methods …
    • Etc.

  23. Approximate algorithms for
    stream analytics

  24. Idea:
    Hash, don't Sample

  25. Bloom filters
    • Approximate data structure for set membership
    • Like an approximate set
    BloomFilter.contains(x) => Maybe | NO
    P(False Positive) > 0
    P(False Negative) = 0

  26. • Bit array of fixed size, with k hash functions
    add(x): for each hash function i, set b[h_i(x)] = 1
    contains(x): TRUE if b[h_i(x)] == 1 for all i
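
    A toy Scala sketch of those two operations (a fixed-size bit array indexed by k seeded hash functions; real implementations use proper hash functions and sizing formulas):

    case class Bloom(bits: Vector[Boolean], numHashes: Int) {
      // i-th hash of x, folded into an index of the bit array
      private def index(x: String, i: Int): Int =
        math.abs((x, i).hashCode % bits.length)
      def add(x: String): Bloom =
        copy(bits = (0 until numHashes).foldLeft(bits)((b, i) => b.updated(index(x, i), true)))
      def contains(x: String): Boolean =   // "maybe" if true, definitely absent if false
        (0 until numHashes).forall(i => bits(index(x, i)))
    }

    val empty = Bloom(Vector.fill(1024)(false), numHashes = 3)
    empty.add("foo").contains("foo")   // true
    empty.contains("bar")              // false (no false negatives)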

  29. • Bloom Filters
    Adding an element uses a boolean OR
    Querying uses a boolean AND
    Both are Monoids
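
    Using the toy Bloom from the sketch above: merging two filters built with the same parameters is a bitwise OR, and the all-zero filter is the identity, which is what makes adding a monoid.

    def or(a: Bloom, b: Bloom): Bloom =
      a.copy(bits = a.bits.zip(b.bits).map { case (x, y) => x || y })
    // or(a, empty) == a, and or(or(a, b), c) == or(a, or(b, c))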

  30. HyperLogLog

  31. Intuition
    • Long runs of trailing 0s in a random bit string are rare
    • But the more bit strings you look at, the more likely
    you are to find a long one
    • The longest run of trailing 0-bits seen can be used as an
    estimator of the number of unique bit strings observed.
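
    A tiny sketch of that intuition in Scala: hash each element, track the longest run of trailing zero bits, and use 2^max as a (very rough) distinct-count estimate. HyperLogLog refines this with many registers and a harmonic mean.

    def trailingZeros(h: Int): Int = java.lang.Integer.numberOfTrailingZeros(h)

    def roughDistinctEstimate(xs: Iterator[String]): Double = {
      val longestRun = xs.map(x => trailingZeros(x.hashCode)).foldLeft(0)(_ max _)
      math.pow(2, longestRun)   // a single estimator is very noisy; HLL averages many
    }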

  32. HyperLogLog
    • Popular sketch for cardinality estimation
    HLL.size = Approx[Number]
    We know the distribution of the error.

  35. http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

  36. • HyperLogLog
    Adding an element uses MAX, which is a
    monoid (an ordered semigroup, really ...)
    Querying uses a harmonic sum: also a monoid.
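
    Conceptually an HLL sketch is an array of registers, each holding the longest run seen among the elements routed to it. A minimal sketch of the merge is an element-wise MAX, which is associative, commutative and idempotent:

    def mergeRegisters(a: Array[Int], b: Array[Int]): Array[Int] =
      a.zip(b).map { case (x, y) => math.max(x, y) }
    // Partial sketches from different machines or days can be merged in any
    // order (and even merged twice) without changing the estimate.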

  37. MinHash
    • Estimates how similar two sets are.
    • Essentially amounts to
    |A ∩ B| / |A ∪ B|
    • The Jaccard similarity
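
    A toy MinHash sketch: keep, for each of k seeded hash functions, the minimum hash value over the set; the fraction of positions where two signatures agree estimates the Jaccard similarity.

    def signature(set: Set[String], k: Int): Vector[Int] =
      Vector.tabulate(k)(seed => set.map(x => (x, seed).hashCode).min)

    def estimatedJaccard(sigA: Vector[Int], sigB: Vector[Int]): Double =
      sigA.zip(sigB).count { case (a, b) => a == b }.toDouble / sigA.length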

  39. Count-Min Sketch
    Gives an approximation of the number of occurrences of an element in a stream.

  40. • Count-Min sketch
    Adding an element is a numerical addition
    Querying uses a MIN function.
    Both are associative.
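
    A toy count-min sketch illustrating both operations: add() does one numeric increment per row, estimate() takes the MIN over the rows, so hash collisions can only over-count, never under-count.

    class CountMin(depth: Int, width: Int) {
      private val table = Array.ofDim[Long](depth, width)
      private def col(x: String, row: Int): Int = math.abs((x, row).hashCode % width)
      def add(x: String, count: Long = 1L): Unit =
        (0 until depth).foreach(r => table(r)(col(x, r)) += count)
      def estimate(x: String): Long =
        (0 until depth).map(r => table(r)(col(x, r))).min
    }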

  41. Anomaly Detection

  42. - Online summarizer: an approximate data
    structure for finding quantiles in a continuous
    stream of data.
    - Many exist: Q-Tree, Q-Digest, T-Digest
    - All of them are associative.
    - Another neat thing: it types your data
    uniformly.

  43. Many more sketches and tricks
    • FM Counters, KMV
    • Histograms
    • Ball sketches: streaming k-means, clustering
    • SGD: fit online machine learning algorithms

  45. Algebird
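
    A hedged example of what using Algebird looks like; the map monoid shown here is part of the library, but the snippet is written from memory, so check the method names against the version you use.

    import com.twitter.algebird._

    val day1 = Map("home" -> 3L, "cart" -> 1L)
    val day2 = Map("home" -> 2L, "search" -> 5L)

    // Maps over a monoid are themselves a monoid: values under the same key are summed.
    val merged = Monoid.plus(day1, day2)
    // Map("home" -> 5, "cart" -> 1, "search" -> 5)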

  46. Conclusion
    • Hashed data structures map onto familiar data structures
    like Set, Map, etc., which are easier for developers to
    reason about.
    • As data size grows, sampling becomes painful; hashing
    provides a more cost-effective solution.
    • Abstract algebra over sketched data is a no-brainer: it
    comes with well-understood error guarantees and better
    scalability for analytics systems.
    http://speakerdeck.com/samklr

  47. DON’T BE SCARED ANYMORE.

  48. Bibliography
    • Great intro to Algebird:
    http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
    • Aggregate Knowledge:
    http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
    • Probabilistic data structures for web analytics:
    http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
    • Algebird: github.com/twitter/algebird
    • Algebra for Analytics: https://speakerdeck.com/johnynek/algebra-for-analytics
    • Mining of Massive Datasets, ch. 3: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
