Algebra for analytics

Open World Forum 2014
Paris ML Meetup

Sam Bessalah

October 30, 2014
Transcript

  1. Abstract Algebra for Analytics Sam BESSALAH @samklr

  2. None
  3. None
  4. What do we want?

    • We want to build scalable systems.
    • Preferably by leveraging distributed computing.
    • A lot of analytics amounts to counting or adding in some way.
  5. • Example: Finding TopK Elements

    Read Input → Sort, Filter and take top K records → Write Output
    Input: 11, 12, 0, 3, 56, 48
    K = 3
    Output: 56, 48, 12
  6. • Example: Finding TopK Elements

    Read Input → Sort, Filter and take top K records → Write Output
    Hadoop Map-Reduce
  7. • Example: Finding TopK Elements

    Read Input → Sort, Filter and take top K records → Write Output
    Hadoop Map-Reduce
  8. In Scalding

  9. In Scalding

  10. Problems

    • Curse of the last reducer
    • Network chatter hinders performance
    • Inefficient ordering of map and reduce steps
    • Multiple jobs, with a sync barrier at the reducer
  11. But in Scalding, « sortWithTake » uses :

  12. But in Scalding, « sortWithTake » uses : Priority Queue

    Can be empty
    Two Priority Queues can be added in any order
    Associative + Commutative
    PQ1: 55, 45, 21, 3
    PQ2: 100, 80, 40, 3
    K = 4
    PQ1 (+) PQ2: 100, 80, 55, 45
  13. But in Scalding, « sortWithTake » uses : Priority Queue

    Can be empty
    Two Priority Queues can be added in any order
    Associative + Commutative
    PQ1: 55, 45, 21, 3
    PQ2: 100, 80, 40, 3
    K = 4
    PQ1 (+) PQ2: 100, 80, 55, 45
    In a single pass
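
The priority-queue addition on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not Scalding's or Algebird's implementation: the names `top_k`, `plus`, and `ZERO` are hypothetical, and a sorted tuple stands in for the bounded priority queue.

```python
import heapq

K = 4  # queue capacity, as on the slide

def top_k(values, k=K):
    """Summarize an iterable as its k largest values, in descending order."""
    return tuple(heapq.nlargest(k, values))

def plus(a, b, k=K):
    """Add two top-k summaries; the result is again a top-k summary."""
    return tuple(heapq.nlargest(k, a + b))

ZERO = ()  # the empty queue acts as the identity element

pq1 = top_k([55, 45, 21, 3])
pq2 = top_k([100, 80, 40, 3])
merged = plus(pq1, pq2)
print(merged)  # (100, 80, 55, 45)
```

Because `plus` is associative and commutative and `ZERO` is its identity, each mapper can keep its own small queue and the partial queues can be added in any order, which is what allows the single pass.
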
  14. Why is it better and faster?

  15. Associativity allows parallelism

  16. None
  17. Do we have data structures that are intrinsically parallelizable?

  18. Abstract Algebra Redux

    • Semigroup: a set with an associative operation (grouping doesn’t matter)
    • Monoid: a semigroup with a zero (zeros get ignored)
    • Group: a monoid in which every element has an inverse
    • Abelian group: a group whose operation is commutative (ordering doesn’t matter)
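
As a rough illustration of why these definitions matter for parallelism, the Python sketch below models a monoid as a zero plus an associative operation and folds a list chunk by chunk. The `Monoid` class and `chunked_sum` helper are hypothetical names for this example, not Algebird's API.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Callable, Generic, Iterable, TypeVar

T = TypeVar("T")

@dataclass(frozen=True)
class Monoid(Generic[T]):
    zero: T                      # identity element
    plus: Callable[[T, T], T]    # associative binary operation

    def sum(self, items: Iterable[T]) -> T:
        return reduce(self.plus, items, self.zero)

def chunked_sum(monoid, items, chunk_size):
    """Fold each chunk independently (as parallel workers could),
    then fold the partial results. Associativity makes this equal
    to a plain left-to-right fold."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    partials = [monoid.sum(chunk) for chunk in chunks]
    return monoid.sum(partials)

int_add = Monoid(0, lambda a, b: a + b)
int_max = Monoid(float("-inf"), max)
set_union = Monoid(frozenset(), lambda a, b: a | b)

data = list(range(1, 101))
print(chunked_sum(int_add, data, 7))  # 5050
```

The chunks here are processed sequentially; the point is only that the grouping of the fold does not change the result, so the per-chunk work could be distributed.
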
  19. None
  20. None
  21. Stream mining challenges

    • Update predictions after every observation
    • Single pass: can’t read old data or replay the stream
    • Limited time for computation per observation
    • O(n) memory size
  22. Existing solutions

    • Knuth’s Reservoir Sampling works on an evolving stream of data and in fixed memory.
    • Stream subsampling
    • Adaptive sliding windows: build decision trees on these windows, e.g. Hoeffding Trees
    • Use time series analysis methods …
    • Etc.
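
For reference, reservoir sampling in its simplest textbook form (commonly called Algorithm R) can be sketched as follows. The function name and the optional `rng` parameter are choices made for this example.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using O(k) memory and a single pass."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)   # fill the reservoir first
        else:
            j = rng.randrange(n)     # uniform in [0, n)
            if j < k:
                reservoir[j] = item  # replace with probability k/n
    return reservoir

sample = reservoir_sample(range(1000), 10, random.Random(42))
```
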
  23. Approximate algorithms for stream analytics

  24. Idea : Hash, don’t Sample

  25. Bloom filters

    • Approximate data structure for set membership
    • Like an approximate set
    BloomFilter.contains(x) => Maybe | No
    P(False Positive) > 0
    P(False Negative) = 0
  26. • Bit array of fixed size

    add(x): for every hash function i, set b[h(x,i)] = 1
    contains(x): TRUE if b[h(x,i)] == 1 for all i
  27. None
  28. None
  29. • Bloom Filters

    Adding an element uses a boolean OR
    Querying uses a boolean AND
    Both are Monoids
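
A minimal Bloom filter sketch in Python makes the OR/AND structure concrete. It assumes SHA-256 with a per-function prefix as the hash family and a Python integer as the bit array; both are choices for this example, not something the slides specify.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits  # a Python int used as a fixed-size bit array

    def _positions(self, item):
        # One bit position per hash function i, i.e. h(x, i).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos          # boolean OR

    def __contains__(self, item):
        # Boolean AND over all probed bits: "maybe" if all set, else "no".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        """Monoid plus: bitwise OR of two filters with the same parameters."""
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        return BloomFilter(self.num_bits, self.num_hashes, self.bits | other.bits)

a, b = BloomFilter(), BloomFilter()
a.add("alpha"); a.add("beta")
b.add("gamma")
both = a.union(b)
```

The empty filter is the zero of this monoid, so filters built on separate partitions of the data can be merged in any order.
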
  30. HyperLogLog

  31. Intuition

    • Long runs of trailing 0s in a random bit string are rare
    • But the more bit strings you look at, the more likely you are to find a long one
    • The longest run of trailing 0-bits seen can be used as an estimator of the number of unique bit strings observed.
  32. HyperLogLog

    • Popular sketch for cardinality estimation
    HLL.size = Approx[Number]
    We know the distribution of the error.
  33. None
  34. None
  35. http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/

  36. • HyperLogLog

    Adding an element uses MAX, which is a monoid (ordered semigroup, really ...)
    Querying uses a harmonic sum: also a monoid.
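
The following Python sketch shows the MAX-based update and merge, plus the harmonic-sum query, in a deliberately simplified HyperLogLog. Assumptions for this example: SHA-256 truncated to 64 bits as the hash, 2^10 registers, and the raw estimator only, without the small-range and large-range corrections that production implementations apply.

```python
import hashlib

class HyperLogLog:
    def __init__(self, p=10, registers=None):
        self.p = p
        self.m = 1 << p  # number of registers
        self.registers = list(registers) if registers is not None else [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        index = h & (self.m - 1)   # low p bits choose the register
        rest = h >> self.p         # remaining 64 - p bits
        if rest == 0:
            rank = 64 - self.p + 1
        else:
            rank = (rest & -rest).bit_length()  # trailing zeros + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Monoid plus: element-wise MAX of the registers."""
        assert self.p == other.p
        return HyperLogLog(self.p, [max(x, y) for x, y in zip(self.registers, other.registers)])

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        harmonic = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / harmonic

left, right, whole = HyperLogLog(), HyperLogLog(), HyperLogLog()
for i in range(20000):
    left.add(i)
for i in range(10000, 30000):
    right.add(i)
for i in range(30000):
    whole.add(i)
merged = left.merge(right)
```

Merging two sketches gives exactly the registers that a single sketch over the union of the inputs would hold, so overlapping elements are not double counted.
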
  37. Min Hash

    • Estimates how similar two sets are.
    • Essentially estimates |A ∩ B| / |A ∪ B|
    • Jaccard Similarity
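
A small MinHash sketch in Python, for illustration. It assumes a family of hash functions obtained by prefixing a seed to the item before applying SHA-256, and 256 hash functions; these are choices for the example only. The input sets must be non-empty.

```python
import hashlib

NUM_HASHES = 256

def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(items, num_hashes=NUM_HASHES):
    """Signature: for each hash function, the minimum hash over the set."""
    items = list(items)
    return tuple(min(_hash(seed, x) for x in items) for seed in range(num_hashes))

def merge(sig_a, sig_b):
    """Signature of the union of two sets: element-wise minimum."""
    return tuple(min(a, b) for a, b in zip(sig_a, sig_b))

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

A = set(range(0, 150))
B = set(range(50, 200))
sig_a, sig_b = minhash(A), minhash(B)
approx = estimate_jaccard(sig_a, sig_b)  # should be close to exact_jaccard(A, B) = 0.5
```

For a single hash function, the probability that the two minima coincide equals the Jaccard similarity, which is why the fraction of matching positions serves as the estimate. The element-wise minimum is associative and commutative, so signatures can be merged across partitions.
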
  38. None
  39. Count-Min Sketch

    Gives an approximation of the number of occurrences of an element in a stream.
  40. • Count-Min Sketch

    Adding an element is a numerical addition
    Querying uses a MIN function.
    Both are associative.
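
The addition and MIN operations can be seen in a compact Count-Min Sketch written in Python. The table dimensions and the seeded SHA-256 hashing are assumptions made for this example.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4, table=None):
        self.width = width
        self.depth = depth
        self.table = table if table is not None else [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # One hash function per row, derived by prefixing the row index.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count  # numerical addition

    def estimate(self, item):
        # MIN over rows: collisions can only inflate a counter, never shrink it.
        return min(self.table[row][self._column(row, item)] for row in range(self.depth))

    def merge(self, other):
        """Monoid plus: cell-wise addition of two sketches with equal dimensions."""
        assert (self.width, self.depth) == (other.width, other.depth)
        table = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(self.table, other.table)]
        return CountMinSketch(self.width, self.depth, table)

part1 = ["a", "b", "a", "c"]
part2 = ["a", "c", "c"]
s1, s2, s_all = CountMinSketch(), CountMinSketch(), CountMinSketch()
for w in part1:
    s1.add(w)
for w in part2:
    s2.add(w)
for w in part1 + part2:
    s_all.add(w)
merged = s1.merge(s2)
```

The estimate never undercounts; it can only overcount when other items collide with the queried one in every row.
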
  41. Anomaly Detection

  42. - Online Summarizer: approximate data structure to find quantiles in a continuous stream of data.

    - Many exist: Q-Tree, Q-Digest, T-Digest
    - All of those are associative.
    - Another neat thing: types your data uniformly.
  43. Many more sketches and tricks

    • FM Counters, KMV
    • Histograms
    • Ball Sketches: streaming k-means, clustering
    • SGD: fit online machine learning algorithms
  44. None
  45. Algebird

  46. Conclusion

    • Hashed data structures can be treated like the usual data structures such as Set, Map, etc., which are easier for developers to reason about
    • As data size grows, sampling becomes painful; hashing provides a better, more cost-effective solution
    • Abstract algebra with sketched data is a no-brainer, and guarantees less error and better scalability of analytics systems.
    http://speakerdeck.com/samklr
  47. DON’T BE SCARED ANYMORE.

  48. Bibliography

    • Great intro to Algebird: http://www.michael-noll.com/blog/2013/12/02/twitter-algebird-monoid-monad-for-large-scala-data-analytics/
    • Aggregate Knowledge: http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
    • Probabilistic data structures for web analytics: http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
    • Algebird: github.com/twitter/algebird
    • Algebra for analytics: https://speakerdeck.com/johnynek/algebra-for-analytics
    • http://infolab.stanford.edu/~ullman/mmds/ch3.pdf