What do we want?
• We want to build scalable systems.
• Preferably by leveraging distributed computing.
• A lot of analytics amounts to counting or adding, in one form or another.
Problems
• Curse of the last reducer
• Network chatter hinders performance
• Inefficient ordering of map and reduce steps
• Multiple jobs, with a sync barrier at the reducer
But in Scalding, « sortWithTake » uses a Priority Queue:
• Can be empty
• Two priority queues can be added in any order: associative + commutative
• Example with K = 4: PQ1: 55, 45, 21, 3 and PQ2: 100, 80, 40, 3 give PQ1 (+) PQ2: 100, 80, 55, 45
• All in a single pass
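A minimal Scala sketch of the idea (illustrative only, not Scalding's actual sortWithTake implementation; the TopK name and its fields are made up for the example): a bounded top-K container whose merge is associative and commutative, so partial results can be combined in any order and in a single pass.

```scala
// Illustrative sketch of a mergeable "top K" structure.
case class TopK(k: Int, items: List[Int]) {
  // Merging keeps only the K largest elements; the operation is
  // associative and commutative, and TopK(k, Nil) acts as a zero.
  def +(that: TopK): TopK =
    TopK(k, (items ++ that.items).sorted(Ordering[Int].reverse).take(k))
}

object TopKExample extends App {
  val pq1 = TopK(4, List(55, 45, 21, 3))
  val pq2 = TopK(4, List(100, 80, 40, 3))
  println(pq1 + pq2) // TopK(4, List(100, 80, 55, 45))
  println(pq2 + pq1) // same result: merge order doesn't matter
}
```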
Abstract Algebra Redux
• Semigroup: associative operation (grouping doesn't matter)
• Monoid: semigroup with a zero (zeros get ignored)
• Group: monoid with inverses
• Abelian group: commutative group (ordering doesn't matter)
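A compact Scala sketch of this hierarchy (Algebird defines equivalent type classes; the trait and object names here are illustrative):

```scala
// Illustrative type classes for the algebraic hierarchy.
trait Semigroup[T] {                    // plus is associative: (a+b)+c == a+(b+c)
  def plus(a: T, b: T): T
}
trait Monoid[T] extends Semigroup[T] {  // zero is an identity: plus(a, zero) == a
  def zero: T
}
trait Group[T] extends Monoid[T] {      // every element has an inverse
  def negate(a: T): T
}

// Integers under addition form an (abelian) group.
object IntAddition extends Group[Int] {
  def plus(a: Int, b: Int): Int = a + b
  def zero: Int = 0
  def negate(a: Int): Int = -a
}
```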
Stream mining challenges
• Update predictions after every observation
• Single pass: can't read old data or replay the stream
• Limited time for computation per observation
• Sub-linear, o(n), memory: you can't store the whole stream
Existing solutions
• Knuth's reservoir sampling works on an evolving stream of data in fixed memory (see the sketch below)
• Stream subsampling
• Adaptive sliding windows: build decision trees on these windows, e.g. Hoeffding Trees
• Use time series analysis methods …
• Etc.
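A minimal Scala sketch of reservoir sampling (Algorithm R); the object and function names are illustrative:

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

object ReservoirSampling {
  // Keeps a uniform random sample of size k from a stream,
  // in O(k) memory and a single pass.
  def reservoirSample[T](stream: Iterator[T], k: Int, rng: Random = new Random): Vector[T] = {
    val reservoir = ArrayBuffer.empty[T]
    var seen = 0L
    for (x <- stream) {
      seen += 1
      if (reservoir.size < k) reservoir += x
      else {
        // Keep the new element with probability k / seen, replacing a random slot.
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = x
      }
    }
    reservoir.toVector
  }
}

// Example: a uniform sample of 5 elements from a stream of a million numbers.
// println(ReservoirSampling.reservoirSample(Iterator.range(0, 1000000), 5))
```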
Bloom filters
• Approximate data structure for set membership
• Like an approximate set: BloomFilter.contains(x) => Maybe | NO
• P(false positive) > 0, P(false negative) = 0
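A toy Bloom filter in Scala to make the "Maybe | NO" behaviour concrete (illustrative only; libraries such as Algebird provide production-ready, mergeable implementations, and the class and method names here are made up):

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter: k hash functions set k bits per item in a bit array.
class ToyBloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  // Derive k positions from two seeded MurmurHash3 values (double hashing).
  private def positions(x: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(x, 0)
    val h2 = MurmurHash3.stringHash(x, h1)
    (0 until numHashes).map { i =>
      val combined = h1 + i * h2
      ((combined % numBits) + numBits) % numBits
    }
  }

  def add(x: String): Unit = positions(x).foreach(i => bits.set(i))

  // true means "maybe present" (false positives possible),
  // false means "definitely not present" (no false negatives).
  def mightContain(x: String): Boolean = positions(x).forall(i => bits.get(i))
}

// val bf = new ToyBloomFilter(1 << 16, numHashes = 5)
// bf.add("alice")
// bf.mightContain("alice") // true  => Maybe
// bf.mightContain("bob")   // false => definitely NO
```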
Intuition
• Long runs of trailing zeros in a random bit string are rare
• But the more bit strings you look at, the more likely you are to find a long one
• The longest run of trailing 0-bits seen can be an estimator of the number of unique bit strings observed
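A back-of-the-envelope Scala sketch of this intuition (the Flajolet–Martin idea underlying HyperLogLog); this naive single-estimator version is illustrative only, since the real algorithms average many such estimators:

```scala
import scala.util.hashing.MurmurHash3

object TrailingZerosEstimator {
  // Hash each item, track the longest run of trailing zero bits,
  // and use 2^maxRun as a rough estimate of the distinct count.
  def roughDistinctEstimate(items: Iterator[String]): Long = {
    var maxRun = 0
    for (x <- items) {
      val h = MurmurHash3.stringHash(x)
      maxRun = math.max(maxRun, Integer.numberOfTrailingZeros(h))
    }
    1L << maxRun
  }
}

// A single estimator has high variance; HyperLogLog keeps many buckets
// and combines them to tighten the estimate.
// println(TrailingZerosEstimator.roughDistinctEstimate(
//   Iterator.range(0, 10000).map(_.toString)))
```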
Online Summarizer: approximate data structure to find quantiles in a continuous stream of data.
• Many exist: Q-Tree, Q-Digest, T-Digest
• All of those are associative
• Another neat thing: it types your data uniformly
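To make "associative" concrete for quantile summarizers, here is a toy mergeable estimator in Scala (a fixed-bucket histogram over a known range, not an actual Q-Tree/Q-Digest/T-Digest; names and structure are illustrative): merging two summaries is bucket-wise addition, so partial summaries can be combined in any order.

```scala
// Toy mergeable quantile summary: a fixed-bucket histogram over a known range.
case class Histogram(min: Double, max: Double, counts: Vector[Long]) {
  private def bucketOf(x: Double): Int =
    math.min(counts.size - 1,
      math.max(0, ((x - min) / (max - min) * counts.size).toInt))

  def add(x: Double): Histogram = {
    val i = bucketOf(x)
    copy(counts = counts.updated(i, counts(i) + 1))
  }

  // Bucket-wise addition: associative and commutative, like the real sketches.
  def merge(that: Histogram): Histogram =
    copy(counts = counts.zip(that.counts).map { case (a, b) => a + b })

  // Approximate quantile: walk buckets until the cumulative count reaches q.
  def quantile(q: Double): Double = {
    val target = math.max(1L, (counts.sum * q).toLong)
    var cum = 0L
    val i = counts.indexWhere { c => cum += c; cum >= target }
    min + (i + 0.5) * (max - min) / counts.size
  }
}

// val h = (1 to 1000).map(_.toDouble)
//   .foldLeft(Histogram(0, 1000, Vector.fill(100)(0L)))(_ add _)
// h.quantile(0.5)  // close to 500
```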
Conclusion
• Hashed data structures can be resolved to usual data structures like Set, Map, etc., which are easier for developers to reason about
• As data size grows, sampling becomes painful; hashing provides a more cost-effective solution
• Abstract algebra over sketched data is a no-brainer, and guarantees bounded error and better scalability of analytics systems.
http://speakerdeck.com/samklr