Slide 1

Slide 1 text

PROBABILISTIC ALGORITHMS for fun and pseudorandom profit Tyler Treat / 12.5.2015

Slide 2

Slide 2 text

ABOUT THE SPEAKER ➤ Backend engineer at Workiva ➤ Messaging platform tech lead ➤ Distributed systems ➤ bravenewgeek.com @tyler_treat
 [email protected]

Slide 3

Slide 3 text

Time vs. data (two axes):
➤ Time: Batch (days, hours) → Streaming (minutes, seconds) → Real-Time™ (I need it now, dammit!)
➤ Data: Meh, data (I can store this on my laptop) → Oi, data! (We're gonna need a bigger boat…) → Big data™ (IoT, sensors) → /dev/null

Slide 4

Slide 4 text

Time vs. data (two axes):
➤ Time: Batch (days, hours) → Streaming (minutes, seconds) → Real-Time™ (I need it now, dammit!)
➤ Data: Meh, data (I can store this on my laptop) → Oi, data! (We're gonna need a bigger boat…) → Big data™ (IoT, sensors) → /dev/null
➤ Regions of the chart range from Not Interesting to Kinda Interesting to Pretty Interesting

Slide 5

Slide 5 text

http://bravenewgeek.com/stream-processing-and-probabilistic-methods/

Slide 6

Slide 6 text

THIS TALK IS NOT ➤ About Samza, Storm, Spark Streaming et al. ➤ Strictly about stream-processing techniques ➤ Mathy ➤ Statistics-y

Slide 7

Slide 7 text

THIS TALK IS ➤ About basic probability theory ➤ About practical design trade-offs ➤ About algorithms & data structures ➤ About dealing with large or unbounded datasets ➤ A marriage of CS & engineering

Slide 8

Slide 8 text

OUTLINE ➤ Terminology & context ➤ Why probabilistic algorithms? ➤ Bloom filters & variants ➤ Count-min sketch ➤ HyperLogLog

Slide 9

Slide 9 text

Randomized Algorithms (random input)
➤ Las Vegas algorithms: correct result, gamble on speed
➤ Monte Carlo algorithms: deterministic speed, gamble on result

Slide 10

Slide 10 text

Randomized Algorithms (random input)
➤ Las Vegas algorithms: correct result, gamble on speed
➤ Monte Carlo algorithms: deterministic speed, gamble on result

Slide 11

Slide 11 text

DEFINING SOME TERMINOLOGY ➤ Online - processing elements as they arrive ➤ Offline - entire dataset is known ahead of time ➤ Real-time - hard constraint on response time ➤ A priori knowledge - something known beforehand

Slide 12

Slide 12 text

BATCH VS STREAMING ➤ Batch ➤ Offline ➤ Heuristics/multiple passes ➤ Data structures less important (e.g. documents → search index)

Slide 13

Slide 13 text

BATCH VS STREAMING ➤ Streaming ➤ Online, one pass ➤ Usually real-time (but not necessarily) ➤ Potentially unbounded (e.g. transactions → caches, fraud detection, analytics)

Slide 14

Slide 14 text

3 DATA INTEGRATION QUESTIONS ➤ How do you get the data? ➤ How do you disseminate the data? ➤ How do you process the data?

Slide 15

Slide 15 text

3 DATA INTEGRATION QUESTIONS ➤ How do you get the data (quickly)? ➤ How do you disseminate the data (quickly)? ➤ How do you process the data (quickly)?

Slide 16

Slide 16 text

Denormalization is critical to performance at scale.

Slide 17

Slide 17 text

How to count the number of distinct document views across Wikipedia?

Slide 18

Slide 18 text

16-byte GUID → 8-byte integer:
10b531cb-914c-4b3e-ac1d-11678dd72f7a → 3,042,568

Slide 19

Slide 19 text

10b531cb-914c-4b3e-ac1d-11678dd72f7a → 3,042,568
5d5d5a78-f98f-4eee-bc83-762b3c78f1ea → 1,250,763
3558d299-45ef-4fc9-b9ec-902e4943c7f8 → 982,531
6febb745-c987-4c51-afd2-90a55f357d7b → 24,703,289
6f3f199e-4cc3-4c68-9d2a-00c31eb199f3 → 7,401,050

Slide 20

Slide 20 text

Wikipedia has ~38 million pages.

Slide 21

Slide 21 text

38,000,000 pages × (16-byte GUID + 8-byte integer) ≈ 1 GB
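A quick sanity check of the arithmetic above (back-of-the-envelope only; it ignores hash-table overhead, which makes the real footprint larger):

```python
# One entry per page: 16-byte GUID + 8-byte view counter.
pages = 38_000_000
bytes_per_entry = 16 + 8

total_bytes = pages * bytes_per_entry
print(total_bytes)                  # 912000000
print(round(total_bytes / 1e9, 2))  # 0.91, i.e. roughly 1 GB
```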

Slide 22

Slide 22 text

➤ Not unreasonable for modern hardware ➤ Held in memory for lifetime of process so will move to old GC generations—expensive to collect! ➤ Now we want to track views per unique IP address ➤ >4 billion IPv4 addresses ➤ Naive solutions quickly become intractable

Slide 23

Slide 23 text

DISTRIBUTED SYSTEMS TRADE-OFFS Consistency Availability Partition Tolerance

Slide 24

Slide 24 text

DATA PROCESSING TRADE-OFFS Time Accuracy Space

Slide 25

Slide 25 text

HAVE YOUR CAKE AND EAT IT TOO? ➤ The “Lambda Architecture”: stream processing and batch processing running side by side, feeding the same app

Slide 26

Slide 26 text

Probabilistic algorithms trade accuracy for space and performance.

Slide 27

Slide 27 text

“Sketching” data structures make this trade by storing a summary of the dataset when storing it entirely is prohibitively expensive.

Slide 28

Slide 28 text

Bloom Filters B. H. Bloom.
 Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970.

Slide 29

Slide 29 text

Answers a simple question: is this element a member of a set? Given a set S ⊆ U, is x ∈ S?

Slide 30

Slide 30 text

SET MEMBERSHIP ➤ Is this URL malicious? ➤ Is this IP address blacklisted? ➤ Is this word contained in the document? ➤ Is this record in the database? ➤ Has this transaction been processed?

Slide 31

Slide 31 text

Hash table: one entry for each member of the set

Slide 32

Slide 32 text

Bit array: one bit for each element in the universe 0101101000101110010100100110101…

Slide 33

Slide 33 text

BLOOM FILTERS ➤ Bloom filters store set memberships ➤ Lookups answer “not in set” or “probably in set”

Slide 34

Slide 34 text

Bloom filter in front of a secondary store:
➤ “Do you have key 1?” → filter says no → answer “no” with no store access
➤ “Do you have key 2?” → filter says yes → store returns key 2 (necessary access)
➤ “Do you have key 3?” → filter says yes, store says no (unnecessary access: a false positive)

Slide 35

Slide 35 text

BLOOM FILTERS ➤ 2 operations: add, lookup ➤ Allocate bit array of length m ➤ k hash functions ➤ Configure m and k for desired false-positive rate
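Configuring m and k for a desired false-positive rate is usually done from the expected element count n. A minimal sizing helper, using the standard optimal formulas (these are well-known results, not from the slides):

```python
import math

def bloom_params(n, p):
    """Optimal Bloom filter sizing for n elements and target
    false-positive rate p: m = -n*ln(p) / (ln 2)^2 bits,
    k = (m/n)*ln(2) hash functions."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

# ~9.6 bits and 7 hash functions per element buys a 1% false-positive rate.
m, k = bloom_params(n=1_000_000, p=0.01)
print(m, k)
```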

Slide 36

Slide 36 text

BLOOM FILTERS ➤ Add element: ➤ Hash with k functions to get k indices ➤ Set bits at each index ➤ Lookup: ➤ Hash with k functions to get k indices ➤ Check bit at each index ➤ If any bit is unset, element not in set
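The add/lookup steps above, sketched in Python. This is illustrative only: m and k are hard-coded rather than derived from a target false-positive rate, and the double-hashing trick stands in for k independent hash functions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit array, k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _indices(self, item):
        # Derive k indices from two halves of one digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Hash with k functions, set the bit at each index.
        for i in self._indices(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def lookup(self, item):
        # Any unset bit means "definitely not in set";
        # all bits set means "probably in set".
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indices(item))

bf = BloomFilter(m=1024, k=7)
bf.add("evil.example.com")
print(bf.lookup("evil.example.com"))  # True
print(bf.lookup("good.example.com"))  # False (almost certainly)
```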

Slide 37

Slide 37 text

BLOOM FILTERS ➤ Benefits: ➤ More space-efficient than hash table or bit array ➤ Can determine trade-off between accuracy and space ➤ Drawbacks: ➤ Some elements potentially more sensitive to false positives than others (solvable by partitioning) ➤ Can’t remove elements ➤ Requires a priori knowledge of the dataset ➤ Over-provisioned filter wastes space

Slide 38

Slide 38 text

Bloom filters are great for efficient offline processing, but what about streaming?

Slide 39

Slide 39 text

BLOOM FILTERS WITH A TWIST ➤ Rotating Bloom filters ➤ e.g. remember everything in the last hour
 ➤ Scalable Bloom Filters ➤ Dynamically allocating chained filters
 ➤ Stable Bloom Filters ➤ Continuously evict stale data

Slide 40

Slide 40 text

Scalable Bloom Filters P. S. Almeida, C. Baquero, N. Preguiça, D. Hutchison.
 Scalable Bloom Filters. 2007.

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

No content

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

l filters, each with error probability P0:
P0 = error prob. of 1 filter
l = # filters
P = compound error prob.
P = 1 - ∏_{i=0}^{l-1} (1 - P0) = 1 - (1 - P0)^l

Slide 47

Slide 47 text

P0 = 0.1: P = 1 - ∏_{i=0}^{l-1} (1 - P0) = 1 - 0.9^l, which approaches 1 as filters are added

Slide 48

Slide 48 text

SCALABLE BLOOM FILTERS ➤ Questions: ➤ When to add a new filter? ➤ How to place a tight upper bound on P?

Slide 49

Slide 49 text

SCALABLE BLOOM FILTERS ➤ When to add a new filter? ➤ Fill ratio p = # set bits / # bits ➤ Add new filter when target p is reached ➤ Optimal target p = 0.5 (math follows from paper)

Slide 50

Slide 50 text

SCALABLE BLOOM FILTERS ➤ How to place a tight upper bound on P? ➤ Apply tightening ratio r to P0, where 0 < r < 1 ➤ Start with 1 filter, error probability P0 ➤ When full, add new filter, error probability P1 = P0·r ➤ Results in a geometric series: P ≤ P0 + P0·r + P0·r² + … = P0 / (1 - r) ➤ Series converges on target error probability P
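A numeric check of the convergence claim, using P0 = 0.1 and r = 0.5 as example values: however many filters are chained, the compound error stays below the geometric-series bound P0 / (1 - r).

```python
P0, r = 0.1, 0.5
bound = P0 / (1 - r)  # geometric-series bound: 0.2

# Compound error of l chained filters: 1 - prod(1 - P0 * r^i).
compound = 1.0
for i in range(100):  # 100 filters, far more than any real deployment
    compound *= 1 - P0 * r ** i
P = 1 - compound

print(P < bound)  # True: converges below the bound
```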

Slide 51

Slide 51 text

P0 = 0.1, r = 0.5
P = 1 - ∏_{i=0}^{l-1} (1 - P0·rⁱ)

Slide 52

Slide 52 text

SCALABLE BLOOM FILTERS ➤ Add elements to last filter ➤ Check each filter on lookups ➤ Tightening ratio r controls m and k for new filters

Slide 53

Slide 53 text

SCALABLE BLOOM FILTERS ➤ Benefits: ➤ Can grow dynamically to accommodate dataset ➤ Provides tight upper bound on false-positive rate ➤ Can control growth rate ➤ Drawbacks: ➤ Size still proportional to dataset ➤ Additional computation on adds (negligible amortized)

Slide 54

Slide 54 text

Stable Bloom Filters F. Deng, D. Rafiei.
 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters. 2006.

Slide 55

Slide 55 text

DUPLICATE DETECTION ➤ Query processing ➤ URL crawling ➤ Monitoring distinct IP addresses ➤ Advertiser click streams ➤ Graph processing

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

Bloom filters are remarkably useful for dealing with graph data.

Slide 58

Slide 58 text

GRAPH PROCESSING ➤ Detecting cycles ➤ Pruning search space ➤ E.g. often used in bioinformatics ➤ Storing chemical structures, properties, and molecular fingerprints in filters to optimize searches and determine structural similarities ➤ Rapid classification of DNA sequences as large as the human genome

Slide 59

Slide 59 text

GRAPH PROCESSING ➤ Store crawled nodes in memory ➤ Set of nodes may be too large to fit in memory ➤ Store crawled nodes in secondary storage ➤ Too many searches to perform in limited time

Slide 60

Slide 60 text

Precisely eliminating duplicates in an unbounded stream isn’t feasible with limited space and time.

Slide 61

Slide 61 text

Efficacy/Efficiency Conjecture:
 In many situations, a quick answer with an allowable error rate is better than a precise one that is slow.

Slide 62

Slide 62 text

Staleness Conjecture:
 In many situations, more recent data has more value than stale data.

Slide 63

Slide 63 text

STABLE BLOOM FILTERS ➤ Discards old data to make room for new data ➤ Replace bit array with array of d-bit counters ➤ Initialize counters to zero ➤ Maximum counter value Max = 2^d - 1

Slide 64

Slide 64 text

STABLE BLOOM FILTERS ➤ Add element: ➤ Select P random counters and decrement by one ➤ Hash with k functions to get k indices ➤ Set counters at each index to Max ➤ Lookup: ➤ Hash with k functions to get k indices ➤ Check counter at each index ➤ If any counter is zero, element not in set
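The add/lookup steps above, sketched in Python. Parameters here are illustrative, not the paper's optimal settings, and the double-hashing trick again stands in for k independent hash functions.

```python
import random
import hashlib

class StableBloomFilter:
    """d-bit counters with random decrements, so stale entries fade out."""

    def __init__(self, m, k, d, p):
        self.m, self.k, self.p = m, k, p
        self.max = 2 ** d - 1
        self.cells = [0] * m

    def _indices(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Evict: decrement p randomly chosen counters (floored at zero)...
        for i in random.sample(range(self.m), self.p):
            if self.cells[i] > 0:
                self.cells[i] -= 1
        # ...then set this element's k counters to Max.
        for i in self._indices(item):
            self.cells[i] = self.max

    def lookup(self, item):
        # Any zero counter means "not in set" (or evicted: a false negative).
        return all(self.cells[i] > 0 for i in self._indices(item))

sbf = StableBloomFilter(m=1000, k=3, d=3, p=10)
sbf.add("foo")
print(sbf.lookup("foo"))  # True: just added
```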

Slide 65

Slide 65 text

STABLE BLOOM FILTERS ➤ Classic Bloom filter a special case of SBF w/ d=1, P=0 ➤ Tight upper bound on false positives ➤ FP rate asymptotically approaches configurable fixed constant (stable-point property) ➤ See paper for math and parameter settings ➤ Evicting data introduces false negatives

Slide 66

Slide 66 text

STABLE BLOOM FILTERS ➤ Benefits: ➤ Fixed memory allocation ➤ Evicts old data to make room for new data ➤ Provides tight upper bound on false positives ➤ Drawbacks: ➤ Introduces false negatives ➤ Additional computation on adds

Slide 67

Slide 67 text

Count-Min Sketch G. Cormode, S. Muthukrishnan.
 An Improved Data Stream Summary: The Count-Min Sketch and its Applications. 2003.

Slide 68

Slide 68 text

Can we count element frequencies using sub-linear space?
Page views per IP: 94.136.205.1 → 7, 132.208.90.15 → 4, 54.222.151.15 → 11

Slide 69

Slide 69 text

COUNT-MIN SKETCH ➤ Approximates frequencies in sub-linear space ➤ Matrix with w columns and d rows ➤ Each row has a hash function ➤ Each cell initialized to zero ➤ When element arrives: ➤ Hash for each row ➤ Increment each counter by 1 ➤ freq(element) = min counter value
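The scheme above, sketched in Python (illustrative only: w and d are hard-coded, whereas real implementations derive them from target error and confidence bounds):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters, one hash function per row."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _index(self, row, item):
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item):
        # Hash for each row, increment that row's counter.
        for row in range(self.d):
            self.table[row][self._index(row, item)] += 1

    def freq(self, item):
        # Collisions only inflate counters, so the minimum over rows
        # never underestimates and is the tightest estimate available.
        return min(self.table[row][self._index(row, item)]
                   for row in range(self.d))

cms = CountMinSketch(w=256, d=4)
for ip in ["94.136.205.1"] * 7 + ["132.208.90.15"] * 4:
    cms.add(ip)
print(cms.freq("94.136.205.1"))  # at least 7, never less
```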

Slide 70

Slide 70 text

COUNT-MIN SKETCH ➤ Why the minimum? ➤ Possibility for collisions between elements ➤ Counter may be incremented by multiple elements ➤ Taking minimum counter value gives closer approximation

Slide 71

Slide 71 text

COUNT-MIN SKETCH ➤ Benefits: ➤ Simple! ➤ Sub-linear space ➤ Useful for detecting “heavy hitters” ➤ Easy to track top-k by adding a min-heap ➤ Drawbacks: ➤ Biased estimator: may overestimate, never underestimates ➤ Better suited to Zipfian distributions & rare events

Slide 72

Slide 72 text

HyperLogLog P. Flajolet, É. Fusy, O. Gandouet, F. Meunier.
 HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007.

Slide 73

Slide 73 text

How do we count distinct things in a stream?

Slide 74

Slide 74 text

COUNTING PROBLEMS ➤ E.g. how many different words are used in Wikipedia? ➤ Counter per element explodes memory ➤ Usually requires memory proportional to cardinality ➤ Can we approximate cardinality with constant space?

Slide 75

Slide 75 text

HYPERLOGLOG ➤ The name: can estimate cardinality of set w/ cardinality Nmax using log log(Nmax) + O(1) bits ➤ Hash element to integer ➤ Count number of leading 0’s in binary form of hash ➤ Track highest number of leading 0’s, n ➤ Cardinality ≈ 2^(n+1)

Slide 76

Slide 76 text

HYPERLOGLOG ➤ stream = [“foo”, “bar”, “baz”, “qux”] ➤ h(“foo”) = 10100001 ➤ h(“bar”) = 01110111 ➤ h(“baz”) = 01110100 ➤ h(“qux”) = 10100011 ➤ n = 1 ➤ |stream| ≈ 2^(n+1) = 2^2 = 4
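The toy example above, executed (the 8-bit "hashes" are the slide's made-up values, not the output of a real hash function):

```python
def leading_zeros(bits):
    """Number of leading 0's in a fixed-width bit string."""
    return len(bits) - len(bits.lstrip("0"))

# The slide's toy stream with its made-up 8-bit hashes.
hashes = {"foo": "10100001", "bar": "01110111",
          "baz": "01110100", "qux": "10100011"}

n = max(leading_zeros(h) for h in hashes.values())
estimate = 2 ** (n + 1)
print(n, estimate)  # 1 4
```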

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

It’s actually not magic but just a few really clever observations.

Slide 79

Slide 79 text

With 50/50 odds, how long will it take to flip 3 heads in a row? 20? 100?

Slide 80

Slide 80 text

HYPERLOGLOG ➤ Replace “heads” and “tails” with 0’s and 1’s ➤ Count leading consecutive 0’s in binary form of hash ➤ E.g. imagine a 4-bit hash, 16 possible values: ➤ 0000 → 4 leading 0’s ➤ 0001 → 3 leading 0’s ➤ 0011, 0010 → 2 leading 0’s ➤ 0100, 0111, 0110, 0101 → 1 leading 0 ➤ 1111, 1110, 1001, 1010, 1101, 1100, 1011, 1000 → 0 leading 0’s ➤ Assume good hash function → 1/16 odds for each permutation

Slide 81

Slide 81 text

HYPERLOGLOG ➤ Track highest number of leading 0’s, n ➤ n = 0 → 8/16 = 1/2 odds ➤ n = 1 → 4/16 = 1/4 odds ➤ n = 2 → 2/16 = 1/8 odds ➤ n = 3 → 1/16 odds ➤ Cardinality ≈ how many things did we have to look at? ➤ E.g. highest count = 1 → 1/4 odds → cardinality 4

Slide 82

Slide 82 text

No content

Slide 83

Slide 83 text

HYPERLOGLOG ➤ 1/2 of all binary numbers start with 1 ➤ Each additional bit cuts the probability in half: ➤ 1/4 start with 01 ➤ 1/8 start with 001 ➤ 1/16 start with 0001 ➤ etc. ➤ P(run of length n) = 1/2^(n+1) ➤ Seeing 001 has 1/8 probability, meaning we had to look at approximately 8 things until we saw it (cardinality 8) ➤ Cardinality ≈ prob⁻¹ (reciprocal of the probability)

Slide 84

Slide 84 text

What about outliers?

Slide 85

Slide 85 text

HYPERLOGLOG ➤ Use multiple buckets ➤ Use first few bits of hash to determine bucket ➤ Use remaining bits to count 0’s ➤ Each bucket tracks its own count ➤ Take harmonic mean of all buckets to get cardinality ➤ min(x1…xn) ≤ H(x1…xn) ≤ n·min(x1…xn)
01011010001011100101001001101010 → first bits select the bucket, remaining bits are the counting space
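The bucketing scheme above, sketched in Python. This is a deliberately simplified estimator: it splits the hash into bucket bits and counting bits and combines buckets with a harmonic mean, but it omits the paper's bias-correction constant and small/large-range corrections, so it is illustrative rather than accurate.

```python
import hashlib

def rough_hll(stream, b=6):
    """Rough bucketed cardinality estimate with 2^b buckets."""
    m = 2 ** b
    buckets = [0] * m
    for item in stream:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        bucket = h >> (64 - b)            # first b bits -> bucket index
        rest = h & ((1 << (64 - b)) - 1)  # remaining bits -> counting space
        # Rank = leading zeros in the counting space + 1.
        rank = (64 - b) - rest.bit_length() + 1
        buckets[bucket] = max(buckets[bucket], rank)
    # Harmonic mean of the per-bucket estimates: m / sum(2^-rank).
    harmonic = m / sum(2.0 ** -r for r in buckets)
    return m * harmonic

stream = [f"user-{i}" for i in range(10_000)]
# Rough estimate of the true cardinality (10,000); somewhat inflated
# because the bias correction is omitted.
print(round(rough_hll(stream)))
```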

Slide 86

Slide 86 text

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html Number of distinct words in all of Shakespeare's work

Slide 87

Slide 87 text

HYPERLOGLOG ➤ Benefits: ➤ Constant memory ➤ Super fast (calculating MSB is cheap) ➤ Can give accurate count with <1% error ➤ Drawbacks: ➤ Has a margin of error (albeit small)

Slide 88

Slide 88 text

What did we learn?

Slide 89

Slide 89 text

Data processing has trade-offs.

Slide 90

Slide 90 text

Probabilistic algorithms trade accuracy for speed and space.

Slide 91

Slide 91 text

Often we only care about answers that are mostly correct but available now.

Slide 92

Slide 92 text

Sometimes the “right” answer is impossible to compute or simply doesn’t exist.

Slide 93

Slide 93 text

But mostly…

Slide 94

Slide 94 text

Probabilistic algorithms are just damn cool.

Slide 95

Slide 95 text

What about the code?

Slide 96

Slide 96 text

ALGORITHM IMPLEMENTATIONS ➤ Algebird - https://github.com/twitter/algebird ➤ Bloom filter ➤ Count-min sketch ➤ HyperLogLog ➤ stream-lib - https://github.com/addthis/stream-lib ➤ Bloom filter ➤ Count-min sketch ➤ HyperLogLog ➤ Boom Filters - https://github.com/tylertreat/BoomFilters ➤ Bloom filter ➤ Scalable Bloom filter ➤ Stable Bloom filter ➤ Count-min sketch ➤ HyperLogLog

Slide 97

Slide 97 text

OTHER COOL PROBABILISTIC ALGORITHMS ➤ Counting Bloom filter (and many other Bloom variations) ➤ Bloomier filter (encode functions instead of sets) ➤ Cuckoo filter (Bloom filter w/ cuckoo hashing) ➤ q-digest (quantile approximation) ➤ t-digest (online accumulation of rank-based statistics) ➤ Locality-sensitive hashing (hash similar items to same buckets) ➤ MinHash (set similarity) ➤ Miller–Rabin (primality testing) ➤ Karger’s algorithm (min cut of connected graph)

Slide 98

Slide 98 text

@tyler_treat github.com/tylertreat bravenewgeek.com Thanks We’re hiring!

Slide 99

Slide 99 text

BIBLIOGRAPHY
Almeida, P., Baquero, C., Preguiça, N., Hutchison, D. 2007. Scalable Bloom Filters; http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
Bloom, B. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors; https://www.cs.upc.edu/~diaz/p422-bloom.pdf
Cormode, G., & Muthukrishnan, S. 2003. An Improved Data Stream Summary: The Count-Min Sketch and its Applications; http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
Deng, F., & Rafiei, D. 2006. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters; https://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm; http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13); http://bioinformatics.oxfordjournals.org/content/26/13/1595.full.pdf
Tarkoma, S., Rothenberg, C., & Lagerspetz, E. 2011. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, 14(1); https://gnunet.org/sites/default/files/TheoryandPracticeBloomFilter2011Tarkoma.pdf
Treat, T. 2015. Stream Processing and Probabilistic Methods: Data at Scale; http://bravenewgeek.com/stream-processing-and-probabilistic-methods