Algebra for analytics

Open world Forum 2014
Paris ML Meetup

Sam Bessalah

October 30, 2014

Transcript

  1. What do we want?
    • We want to build scalable systems.
    • Preferably by leveraging distributed computing.
    • A lot of analytics amounts to counting or adding in some way.
  2. Example: Finding TopK Elements
    Read Input → Sort, Filter and take top K records → Write Output
    Input: 11, 12, 0, 3, 56, 48; K = 3; Output: 56, 48, 12
  3. Example: Finding TopK Elements
    Read Input → Sort, Filter and take top K records → Write Output
    Hadoop Map-Reduce
  5. Problems
    • Curse of the last reducer
    • Network chatter, which hinders performance
    • Inefficient ordering of the map and reduce steps
    • Multiple jobs, with a sync barrier at the reducer
  6. But in Scalding, « sortWithTake » uses: Priority Queue
    • Can be empty
    • Two Priority Queues can be added in any order
    • Associative + Commutative
    PQ1: 55, 45, 21, 3
    PQ2: 100, 80, 40, 3
    K = 4
    PQ1 (+) PQ2: 100, 80, 55, 45
  7. But in Scalding, « sortWithTake » uses: Priority Queue
    • Can be empty
    • Two Priority Queues can be added in any order
    • Associative + Commutative
    PQ1: 55, 45, 21, 3
    PQ2: 100, 80, 40, 3
    K = 4
    PQ1 (+) PQ2: 100, 80, 55, 45
    In a single pass
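The priority-queue addition on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Scalding's or Algebird's implementation; the names `pq_zero`, `pq_plus` and `top_k` are hypothetical, and a bounded queue is modelled as a plain list sorted in descending order.

```python
import heapq
from functools import reduce


def pq_zero():
    """The empty queue: the identity element of the monoid."""
    return []


def pq_plus(a, b, k=4):
    """Add two bounded queues: keep only the k largest values of both."""
    return heapq.nlargest(k, a + b)


def top_k(partitions, k):
    """Fold any number of partitions into one top-k result in a single pass."""
    return reduce(lambda acc, part: pq_plus(acc, part, k), partitions, pq_zero())


pq1 = [55, 45, 21, 3]
pq2 = [100, 80, 40, 3]
print(pq_plus(pq1, pq2))  # [100, 80, 55, 45]
print(pq_plus(pq2, pq1))  # same result: the operation is commutative
print(top_k([[11, 12, 0], [3, 56, 48]], 3))  # [56, 48, 12]
```

Because the addition is associative and commutative, each mapper can keep its own small queue and the partial queues can be combined in any order, which is what removes the global sort.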
  8. Abstract Algebra Redux
    • Semigroup: a set with an associative operation (grouping doesn’t matter)
    • Monoid: a semigroup with a zero (zeros get ignored)
    • Group: a monoid with inverses
    • Abelian group: a group whose operation is commutative (ordering doesn’t matter)
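As a rough illustration of why these definitions matter for distributed computation, the sketch below models a monoid as a zero plus an associative operation and shows that a sum over chunks equals the sum over the whole input. The `Monoid` class and the instance names are hypothetical and are not taken from Algebird.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Monoid:
    zero: Any                        # identity element
    plus: Callable[[Any, Any], Any]  # associative binary operation

    def fold(self, items: Iterable[Any]) -> Any:
        """Combine all items, starting from zero."""
        return reduce(self.plus, items, self.zero)


int_add = Monoid(0, lambda a, b: a + b)
int_max = Monoid(float("-inf"), max)
set_union = Monoid(frozenset(), lambda a, b: a | b)

data = [11, 12, 0, 3, 56, 48]
chunks = [data[:2], data[2:5], data[5:]]  # e.g. three mappers

# Associativity: summing per chunk, then summing the partial results,
# gives the same answer as one sequential sum.
partials = [int_add.fold(chunk) for chunk in chunks]
print(int_add.fold(partials))  # 130
print(int_add.fold(data))      # 130
```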
  9. Stream mining challenges
    • Update predictions after every observation
    • Single pass: can’t read old data or replay the stream
    • Limited time for computation per observation
    • O(n) memory size
  10. Existing solutions
    • Knuth’s Reservoir Sampling works on an evolving stream of data and in fixed memory.
    • Stream subsampling
    • Adaptive sliding windows: build decision trees on these windows, e.g. Hoeffding Trees
    • Use time series analysis methods …
    • Etc.
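Reservoir sampling, the first item on this slide, can be sketched as follows. This is a minimal version of the classic single-pass algorithm (often called Algorithm R); the function name and the optional `rng` parameter are assumptions made for the example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # The n-th item replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), 10, random.Random(42))
print(len(sample))  # 10
```

Memory stays fixed at k items no matter how long the stream is, and each observation costs constant time.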
  11. Bloom filters
    • Approximate data structure for set membership
    • Like an approximate set
    BloomFilter.contains(x) => Maybe | NO
    P(False Positive) > 0
    P(False Negative) = 0
  12. • Bit array b of fixed size
    add(x): for every hash index i, set b[h(x,i)] = 1
    contains(x): TRUE if b[h(x,i)] == 1 for all i
  13. Bloom Filters
    • Adding an element uses a boolean OR
    • Querying uses a boolean AND
    • Both are Monoids
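The add, contains and OR-merge operations described on the Bloom filter slides can be sketched like this. It is a toy version under stated assumptions: the bit array is a Python integer, the hash family h(x, i) is derived from SHA-256 with the index i as a prefix, and the class and parameter names are hypothetical rather than Algebird's API.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits  # Python int used as a fixed-size bit array

    def _indexes(self, item):
        # h(x, i): one bit position per hash function i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits |= 1 << idx  # boolean OR

    def contains(self, item):
        # Boolean AND over all positions: True means "maybe", False means "no".
        return all((self.bits >> idx) & 1 for idx in self._indexes(item))

    def __or__(self, other):
        # Monoid addition: the union of two filters is a bitwise OR.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("filters must share the same parameters")
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)


left, right = BloomFilter(), BloomFilter()
left.add("apple")
right.add("pear")
union = left | right
print(union.contains("apple"), union.contains("pear"))  # True True
```

An empty filter (all bits zero) plays the role of the monoid zero, so filters built on separate partitions can be merged in any order.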
  14. Intuition
    • Long runs of trailing 0s in a random bit string are rare
    • But the more bit strings you look at, the more likely you are to find a long one
    • The longest run of trailing 0-bits seen can be an estimator of the number of unique bit strings observed.
  15. HyperLogLog
    • Adding an element uses MAX, which is a monoid (an ordered semigroup, really ...)
    • Querying uses a harmonic sum: a monoid.
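A simplified HyperLogLog can make the MAX and harmonic-sum steps concrete. The sketch below is illustrative only: it uses SHA-256 as the hash, counts trailing zeros as in the intuition slide, applies the commonly published bias constant for 128 or more registers, and includes only the small-range correction. The class and method names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                 # number of bits used to pick a register
        self.m = 1 << p            # number of registers (1024 by default)
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        bucket = h & (self.m - 1)              # low p bits select the register
        rest = h >> self.p
        # rank = 1 + number of trailing zero bits in the remaining bits.
        rank = 1
        while rank <= 64 - self.p and rest & 1 == 0:
            rank += 1
            rest >>= 1
        # Adding an element is a MAX on one register.
        self.registers[bucket] = max(self.registers[bucket], rank)

    def merge(self, other):
        # Monoid addition: element-wise MAX of the registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        # Querying: a normalised harmonic mean of 2^register.
        alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


a, b = HyperLogLog(), HyperLogLog()
for i in range(0, 6000):
    a.add(i)
for i in range(4000, 10000):
    b.add(i)
print(round(a.merge(b).estimate()))  # close to 10000 distinct values
```

Merging two sketches gives exactly the registers that a single sketch would hold after seeing both streams, so overlapping values are not double counted.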
  16. Min Hash
    • Estimates how similar two sets are.
    • Essentially amounts to |A ∩ B| / |A ∪ B|
    • Jaccard Similarity
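One way to sketch MinHash is to keep, for each of several hash functions, the minimum hash value seen in a set; the fraction of positions where two signatures agree then estimates the Jaccard similarity. The function names and the SHA-256 based hash family below are assumptions for the example, and the input sets must be non-empty.

```python
import hashlib


def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """One minimum per hash function; items must be non-empty."""
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def merge_signatures(a, b):
    """Signature of the union of two sets: element-wise MIN (associative)."""
    return [min(x, y) for x, y in zip(a, b)]


def estimate_jaccard(a, b):
    """Fraction of hash functions on which the two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


set_a = set(range(0, 100))
set_b = set(range(50, 150))
sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)
print(estimate_jaccard(sig_a, sig_b))  # approximates 50 / 150
```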
  17. Count min Sketch
    Gives an approximation of the number of occurrences of an element in a set.
  18. Count min sketch
    • Adding an element is a numerical addition
    • Querying uses a MIN function.
    • Both are associative.
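A minimal count-min sketch, assuming a small table of counters and SHA-256 based row hashes, could look like the following. The class name, default sizes and method names are hypothetical and chosen only to mirror the slide: addition to update, MIN to query, cell-wise addition to merge.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One column per row, each chosen by a different hash function.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count  # numerical addition

    def estimate(self, item):
        # MIN over rows: collisions can only inflate a counter,
        # so the estimate never undercounts.
        return min(self.table[row][col] for row, col in self._cells(item))

    def merge(self, other):
        # Monoid addition: cell-wise sum of two sketches of the same shape.
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must share the same dimensions")
        merged = CountMinSketch(self.width, self.depth)
        merged.table = [
            [x + y for x, y in zip(r1, r2)]
            for r1, r2 in zip(self.table, other.table)
        ]
        return merged


morning, evening = CountMinSketch(), CountMinSketch()
for word in ["spark", "hadoop", "spark"]:
    morning.add(word)
for word in ["spark", "storm"]:
    evening.add(word)
print(morning.merge(evening).estimate("spark"))  # at least 3
```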
  19. - Online summarizer: approximate data structure to find quantiles in a continuous stream of data.
    - Many exist: Q-Tree, Q-Digest, T-Digest
    - All of those are associative.
    - Another neat thing: types your data uniformly.
  20. Many more sketches and tricks
    • FM Counters, KMV
    • Histograms
    • Ball Sketches: streaming k-means, clustering
    • SGD: fit online machine learning algorithms
  21. Conclusion
    • Hashed data structures can be resolved to usual data structures like Set, Map, etc., which are easier to reason about as developers
    • As data size grows, sampling becomes painful; hashing provides a more cost-effective solution
    • Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.
    http://speakerdeck.com/samklr
  22. Bibliography
    • Great intro into Algebird: http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
    • Aggregate Knowledge: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
    • Probabilistic data structures for web analytics: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
    • Algebird: github.com/twitter/algebird
    • Algebra for analytics: https://speakerdeck.com/johnynek/algebra-for-analytics
    • http://infolab.stanford.edu/~ullman/mmds/ch3.pdf