Probabilistic algorithms for fun and pseudorandom profit

Tyler Treat
December 05, 2015

There's an increasing demand for real-time data ingestion and processing. Systems like Apache Kafka, Samza, and Storm have become popular for this reason. This type of high-volume, online data processing presents an interesting set of new challenges, namely, how do we drink from the firehose without getting drenched? Explore some of the fundamental primitives used in stream processing and, specifically, how we can use probabilistic methods to solve the problem.

Transcript

  1. PROBABILISTIC ALGORITHMS for fun and pseudorandom profit. Tyler Treat / 12.5.2015
  2. ABOUT THE SPEAKER
    ➤ Backend engineer at Workiva
    ➤ Messaging platform tech lead
    ➤ Distributed systems
    ➤ bravenewgeek.com, @tyler_treat, tyler.treat@workiva.com
  3. [Chart: Time vs. Data]
    ➤ Time: Batch (days, hours), Streaming (minutes, seconds), Real-Time™ (I need it now, dammit!)
    ➤ Data: Meh, data (I can store this on my laptop), Oi, data! (We’re gonna need a bigger boat…), Big data™ (IoT, sensors)
    ➤ /dev/null
  4. [Chart: Time vs. Data, with regions marked]
    ➤ Time: Batch (days, hours), Streaming (minutes, seconds), Real-Time™ (I need it now, dammit!)
    ➤ Data: Meh, data (I can store this on my laptop), Oi, data! (We’re gonna need a bigger boat…), Big data™ (IoT, sensors)
    ➤ Regions: Not Interesting, Kinda Interesting, Pretty Interesting
    ➤ /dev/null
  5. http://bravenewgeek.com/stream-processing-and-probabilistic-methods/

  6. THIS TALK IS NOT
    ➤ About Samza, Storm, Spark Streaming et al.
    ➤ Strictly about stream-processing techniques
    ➤ Mathy
    ➤ Statistics-y
  7. THIS TALK IS
    ➤ About basic probability theory
    ➤ About practical design trade-offs
    ➤ About algorithms & data structures
    ➤ About dealing with large or unbounded datasets
    ➤ A marriage of CS & engineering
  8. OUTLINE
    ➤ Terminology & context
    ➤ Why probabilistic algorithms?
    ➤ Bloom filters & variants
    ➤ Count-min sketch
    ➤ HyperLogLog
  9. [Diagram: Randomized Algorithms, with Random Input]
    ➤ Las Vegas Algorithms: correct result, gamble on speed
    ➤ Monte Carlo Algorithms: deterministic speed, gamble on result
  11. DEFINING SOME TERMINOLOGY
    ➤ Online: processing elements as they arrive
    ➤ Offline: entire dataset is known ahead of time
    ➤ Real-time: hard constraint on response time
    ➤ A priori knowledge: something known beforehand
  12. BATCH VS STREAMING
    ➤ Batch
      ➤ Offline
      ➤ Heuristics/multiple passes
      ➤ Data structures less important
    [Diagram labels: documents, search index]
  13. BATCH VS STREAMING
    ➤ Streaming
      ➤ Online, one pass
      ➤ Usually real-time (but not necessarily)
      ➤ Potentially unbounded
    [Diagram labels: transactions caches fraud analytics]
  14. 3 DATA INTEGRATION QUESTIONS
    ➤ How do you get the data?
    ➤ How do you disseminate the data?
    ➤ How do you process the data?
  15. 3 DATA INTEGRATION QUESTIONS
    ➤ How do you get the data (quickly)?
    ➤ How do you disseminate the data (quickly)?
    ➤ How do you process the data (quickly)?
  16. Denormalization is critical to performance at scale.

  17. How to count the number of distinct document views across Wikipedia?
  18. 10b531cb-914c-4b3e-ac1d-11678dd72f7a (16-byte GUID), 3,042,568 (8-byte integer)

  19. 16-byte GUID                            8-byte integer
      10b531cb-914c-4b3e-ac1d-11678dd72f7a    3,042,568
      5d5d5a78-f98f-4eee-bc83-762b3c78f1ea    1,250,763
      3558d299-45ef-4fc9-b9ec-902e4943c7f8    982,531
      6febb745-c987-4c51-afd2-90a55f357d7b    24,703,289
      6f3f199e-4cc3-4c68-9d2a-00c31eb199f3    7,401,050
  20. Wikipedia has ~38 million pages.

  21. 38,000,000 pages x (16-byte guid + 8-byte integer) ≈ 1GB
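
The estimate on this slide is easy to check by hand. A minimal sketch of the arithmetic, counting only the raw key and counter bytes (per-entry hash table overhead is ignored here, which is an assumption that makes the real figure larger):

```python
# Back-of-the-envelope memory estimate for one view counter per Wikipedia page.
PAGES = 38_000_000   # ~38 million pages (figure from the slides)
GUID_BYTES = 16      # page identifier
COUNTER_BYTES = 8    # 64-bit view counter

total_bytes = PAGES * (GUID_BYTES + COUNTER_BYTES)
total_gb = total_bytes / 1e9        # decimal gigabytes
total_gib = total_bytes / 2 ** 30   # binary gibibytes

print(f"{total_bytes:,} bytes = {total_gb:.2f} GB = {total_gib:.2f} GiB")
# → 912,000,000 bytes = 0.91 GB = 0.85 GiB
```
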

  22. ➤ Not unreasonable for modern hardware
    ➤ Held in memory for lifetime of process, so will move to old GC generations (expensive to collect!)
    ➤ Now we want to track views per unique IP address
    ➤ >4 billion IPv4 addresses
    ➤ Naive solutions quickly become intractable
  23. DISTRIBUTED SYSTEMS TRADE-OFFS Consistency Availability Partition Tolerance

  24. DATA PROCESSING TRADE-OFFS Time Accuracy Space

  25. HAVE YOUR CAKE AND EAT IT TOO? The “Lambda Architecture”
    [Diagram labels: Stream Processing, Batch Processing, App]
  26. Probabilistic algorithms trade accuracy for space and performance.

  27. “Sketching” data structures make this trade by storing a summary of the dataset when storing it entirely is prohibitively expensive.
  28. Bloom Filters. B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970.
  29. Answers a simple question: is this element a member of

    a set? S ⊆ 
 x ∈ S
  30. SET MEMBERSHIP
    ➤ Is this URL malicious?
    ➤ Is this IP address blacklisted?
    ➤ Is this word contained in the document?
    ➤ Is this record in the database?
    ➤ Has this transaction been processed?
  31. Hash Table entry for each member

  32. Bit Array bit for each element in universe 0101101000101110010100100110101…

  33. BLOOM FILTERS
    ➤ Bloom filters store set memberships
    ➤ Answers “not in set” or “probably in set”
  34. [Diagram: Bloom filter in front of a secondary store]
    ➤ Do you have key 1? Filter: no (store is not accessed)
    ➤ Do you have key 2? Filter: yes. Necessary access; store: yes, here’s key 2
    ➤ Do you have key 3? Filter: yes. Unnecessary access; store: no
  35. BLOOM FILTERS
    ➤ 2 operations: add, lookup
    ➤ Allocate bit array of length m
    ➤ k hash functions
    ➤ Configure m and k for desired false-positive rate
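
The slides don't say how to configure m and k. A minimal sketch using the standard textbook sizing formulas, m = -n·ln(p) / (ln 2)² and k = (m/n)·ln 2, where n is the expected number of elements and p the target false-positive rate. The function name `bloom_parameters` and the example numbers are illustrative assumptions, not something from the talk:

```python
import math

def bloom_parameters(n: int, p: float) -> tuple:
    """Return (m, k) for n expected elements and target false-positive rate p.

    Uses the standard sizing formulas (not taken from the slides):
      m = -n * ln(p) / (ln 2)^2   bits in the array
      k = (m / n) * ln 2          hash functions
    """
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# Hypothetical workload: one million elements at a 1% false-positive rate.
m, k = bloom_parameters(n=1_000_000, p=0.01)
print(m, k, f"{m / 8 / 1e6:.2f} MB")
```

Note that n has to be known (or guessed) up front, which is the "a priori knowledge" drawback that motivates the streaming variants.
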
  36. BLOOM FILTERS
    ➤ Add element:
      ➤ Hash with k functions to get k indices
      ➤ Set bits at each index
    ➤ Lookup:
      ➤ Hash with k functions to get k indices
      ➤ Check bit at each index
      ➤ If any bit is unset, element not in set
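
The add and lookup steps can be sketched in a few lines. This is a minimal illustration, not the BoomFilters implementation; deriving the k indices from one SHA-256 digest via double hashing is an assumption made here to keep the example short:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: bit array of length m, k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, all initially unset

    def _indices(self, item: bytes):
        # Assumption: simulate k hash functions with double hashing,
        # index_i = (h1 + i * h2) mod m, using two halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item: bytes) -> bool:
        # False means "not in set"; True means "probably in set".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))

bf = BloomFilter(m=1024, k=3)
bf.add(b"10.0.0.1")
print(bf.lookup(b"10.0.0.1"))  # True: an added element is always found
```

A lookup for something never added usually returns False, but can return True when other elements happen to have set all k of its bits. That is the false positive.
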
  37. BLOOM FILTERS
    ➤ Benefits:
      ➤ More space-efficient than hash table or bit array
      ➤ Can determine trade-off between accuracy and space
    ➤ Drawbacks:
      ➤ Some elements potentially more sensitive to false positives than others (solvable by partitioning)
      ➤ Can’t remove elements
      ➤ Requires a priori knowledge of the dataset
      ➤ Over-provisioned filter wastes space
  38. Bloom filters are great for efficient offline processing, but what about streaming?
  39. BLOOM FILTERS WITH A TWIST
    ➤ Rotating Bloom filters
      ➤ e.g. remember everything in the last hour
    ➤ Scalable Bloom Filters
      ➤ Dynamically allocating chained filters
    ➤ Stable Bloom Filters
      ➤ Continuously evict stale data
  40. Scalable Bloom Filters. P. S. Almeida, C. Baquero, N. Preguiça, D. Hutchison. Scalable Bloom Filters. 2007.
  46. [Diagram: l chained filters, each labeled P0]
    ➤ P0 = error prob. of 1 filter
    ➤ l = # filters
    ➤ P = compound error prob.
    ➤ $P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$
  47. $P_0 = 0.1$, $P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$
  48. SCALABLE BLOOM FILTERS
    ➤ Questions:
      ➤ When to add a new filter?
      ➤ How to place a tight upper bound on P?
  49. SCALABLE BLOOM FILTERS
    ➤ When to add a new filter?
      ➤ Fill ratio p = # set bits / # bits
      ➤ Add new filter when target p is reached
      ➤ Optimal target p = 0.5 (math follows from paper)
  50. SCALABLE BLOOM FILTERS
    ➤ How to place a tight upper bound on P?
      ➤ Apply tightening ratio r to $P_0$, where 0 < r < 1
      ➤ Start with 1 filter, error probability $P_0$
      ➤ When full, add new filter, error probability $P_1 = P_0 r$
      ➤ Results in geometric series:
        ➤ Series converges on target error probability P
  51. $P_0 = 0.1$, $r = 0.5$, $P = 1 - \prod_{i=0}^{l-1} (1 - P_0 r^i)$
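
To see why the tightening ratio matters, it helps to evaluate the compound error probability both ways with the slide's numbers (P0 = 0.1, r = 0.5). A small sketch; the function name `compound_error` is made up for illustration, and r = 1 stands for "no tightening":

```python
def compound_error(p0: float, l: int, r: float = 1.0) -> float:
    """P = 1 - prod_{i=0}^{l-1} (1 - p0 * r**i) for l chained filters.

    r = 1.0 means every filter has error probability p0 (no tightening).
    """
    prod = 1.0
    for i in range(l):
        prod *= 1.0 - p0 * r ** i
    return 1.0 - prod

# Compare growth of P as filters are added, without and with tightening.
for l in (1, 2, 5, 10, 20):
    plain = compound_error(0.1, l)
    tightened = compound_error(0.1, l, r=0.5)
    print(f"l={l:2d}  no tightening: {plain:.4f}  r=0.5: {tightened:.4f}")
```

Without tightening, P keeps climbing toward 1 as filters are added. With r = 0.5 it levels off below P0 / (1 - r) = 0.2, the sum of the geometric series.
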
  52. SCALABLE BLOOM FILTERS
    ➤ Add elements to last filter
    ➤ Check each filter on lookups
    ➤ Tightening ratio r controls m and k for new filters
  53. SCALABLE BLOOM FILTERS
    ➤ Benefits:
      ➤ Can grow dynamically to accommodate dataset
      ➤ Provides tight upper bound on false-positive rate
      ➤ Can control growth rate
    ➤ Drawbacks:
      ➤ Size still proportional to dataset
      ➤ Additional computation on adds (negligible amortized)
  54. Stable Bloom Filters. F. Deng, D. Rafiei. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters. 2006.
  55. DUPLICATE DETECTION
    ➤ Query processing
    ➤ URL crawling
    ➤ Monitoring distinct IP addresses
    ➤ Advertiser click streams
    ➤ Graph processing
  57. Bloom filters are remarkably useful for dealing with graph data.

  58. GRAPH PROCESSING
    ➤ Detecting cycles
    ➤ Pruning search space
    ➤ E.g. often used in bioinformatics
      ➤ Storing chemical structures, properties, and molecular fingerprints in filters to optimize searches and determine structural similarities
      ➤ Rapid classification of DNA sequences as large as the human genome
  59. GRAPH PROCESSING
    ➤ Store crawled nodes in memory
      ➤ Set of nodes may be too large to fit in memory
    ➤ Store crawled nodes in secondary storage
      ➤ Too many searches to perform in limited time
  60. Precisely eliminating duplicates in an unbounded stream isn’t feasible with limited space and time.
  61. Efficacy/Efficiency Conjecture: In many situations, a quick answer with an allowable error rate is better than a precise one that is slow.
  62. Staleness Conjecture: In many situations, more recent data has more value than stale data.
  63. STABLE BLOOM FILTERS
    ➤ Discards old data to make room for new data
    ➤ Replace bit array with array of d-bit counters
    ➤ Initialize counters to zero
    ➤ Maximum counter value $Max = 2^d - 1$
  64. STABLE BLOOM FILTERS
    ➤ Add element:
      ➤ Select P random counters and decrement by one
      ➤ Hash with k functions to get k indices
      ➤ Set counters at each index to Max
    ➤ Lookup:
      ➤ Hash with k functions to get k indices
      ➤ Check counter at each index
      ➤ If any counter is zero, element not in set
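
The add and lookup steps translate fairly directly into code. A minimal sketch that follows the steps as listed on the slide (P independently chosen random counters are decremented); the class name, the double-hashing trick for the k indices, and the parameter values are assumptions for illustration, so see the paper for the exact decrement scheme and parameter settings:

```python
import hashlib
import random

class StableBloomFilter:
    """Sketch of a Stable Bloom filter: m counters of d bits each."""

    def __init__(self, m: int, k: int, d: int, p: int, seed=None):
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1          # Max = 2^d - 1
        self.cells = [0] * m             # counters start at zero
        self.rng = random.Random(seed)

    def _indices(self, item: bytes):
        # Assumption: derive k indices from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        # Evict: decrement P randomly selected counters (floor at zero).
        for _ in range(self.p):
            j = self.rng.randrange(self.m)
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Insert: set the element's k counters to Max.
        for idx in self._indices(item):
            self.cells[idx] = self.max

    def lookup(self, item: bytes) -> bool:
        # Any zero counter means "not in set" (possibly a false negative).
        return all(self.cells[idx] > 0 for idx in self._indices(item))

sbf = StableBloomFilter(m=1000, k=3, d=2, p=10, seed=42)
sbf.add(b"click-123")
print(sbf.lookup(b"click-123"))  # True: counters were just set to Max
```

Because every add decrements other counters, an element seen long ago eventually decays to zero. That is where both the fixed memory footprint and the false negatives come from.
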
  65. STABLE BLOOM FILTERS
    ➤ Classic Bloom filter a special case of SBF w/ d=1, P=0
    ➤ Tight upper bound on false positives
    ➤ FP rate asymptotically approaches configurable fixed constant (stable-point property)
    ➤ See paper for math and parameter settings
    ➤ Evicting data introduces false negatives
  66. STABLE BLOOM FILTERS
    ➤ Benefits:
      ➤ Fixed memory allocation
      ➤ Evicts old data to make room for new data
      ➤ Provides tight upper bound on false positives
    ➤ Drawbacks:
      ➤ Introduces false negatives
      ➤ Additional computation on adds
  67. Count-Min Sketch. G. Cormode, S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications. 2003.
  68. Can we count element frequencies using sub-linear space?
      page views
      94.136.205.1     7
      132.208.90.15    4
      54.222.151.15    11
  69. COUNT-MIN SKETCH
    ➤ Approximates frequencies in sub-linear space
    ➤ Matrix with w columns and d rows
    ➤ Each row has a hash function
    ➤ Each cell initialized to zero
    ➤ When element arrives:
      ➤ Hash for each row
      ➤ Increment each counter by 1
    ➤ freq(element) = min counter value
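
The whole structure fits in a short class. A minimal sketch, assuming the per-row hash functions can be simulated by prefixing the row number to the input of SHA-256 (an illustrative choice, not what any particular library does); the IP addresses and counts are the ones from the page-views slide:

```python
import hashlib

class CountMinSketch:
    """Sketch of a count-min sketch: d rows of w counters, one hash per row."""

    def __init__(self, w: int, d: int):
        self.w = w
        self.d = d
        self.table = [[0] * w for _ in range(d)]  # every cell starts at zero

    def _column(self, row: int, item: bytes) -> int:
        # Assumption: one hash function per row, simulated by salting
        # SHA-256 with the row number.
        digest = hashlib.sha256(row.to_bytes(4, "big") + item).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, item: bytes, count: int = 1) -> None:
        for row in range(self.d):
            self.table[row][self._column(row, item)] += count

    def estimate(self, item: bytes) -> int:
        # Collisions only ever inflate counters, so the minimum is closest.
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.d))

views = {b"94.136.205.1": 7, b"132.208.90.15": 4, b"54.222.151.15": 11}
cms = CountMinSketch(w=1024, d=4)
for ip, n in views.items():
    for _ in range(n):
        cms.add(ip)

for ip, n in views.items():
    print(ip.decode(), "true:", n, "estimate:", cms.estimate(ip))
```

The estimate can exceed the true count when elements collide in every row, but it can never fall below it.
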
  70. COUNT-MIN SKETCH
    ➤ Why the minimum?
      ➤ Possibility for collisions between elements
      ➤ Counter may be incremented by multiple elements
      ➤ Taking minimum counter value gives closer approximation
  71. COUNT-MIN SKETCH
    ➤ Benefits:
      ➤ Simple!
      ➤ Sub-linear space
      ➤ Useful for detecting “heavy hitters”
      ➤ Easy to track top-k by adding a min-heap
    ➤ Drawbacks:
      ➤ Biased estimator: may overestimate, never underestimates
      ➤ Better suited to Zipfian distributions & rare events
  72. HyperLogLog. P. Flajolet, É. Fusy, O. Gandouet, F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007.
  73. How do we count distinct things in a stream?

  74. COUNTING PROBLEMS
    ➤ E.g. how many different words are used in Wikipedia?
    ➤ Counter per element explodes memory
    ➤ Usually requires memory proportional to cardinality
    ➤ Can we approximate cardinality with constant space?
  75. HYPERLOGLOG
    ➤ The name: can estimate cardinality of set w/ cardinality $N_{max}$ using $\log\log(N_{max}) + O(1)$ bits
    ➤ Hash element to integer
    ➤ Count number of leading 0’s in binary form of hash
    ➤ Track highest number of leading 0’s, n
    ➤ Cardinality ≈ $2^{n+1}$
  76. HYPERLOGLOG
    ➤ stream = [“foo”, “bar”, “baz”, “qux”]
    ➤ h(“foo”) = 10100001
    ➤ h(“bar”) = 01110111
    ➤ h(“baz”) = 01110100
    ➤ h(“qux”) = 10100011
    ➤ n = 1
    ➤ |stream| ≈ $2^{n+1} = 2^2 = 4$
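
The worked example on this slide can be replayed directly. A small sketch that takes the four 8-bit hash values as given (they are the slide's example values, not the output of a real hash function) and applies the single-counter estimate:

```python
def leading_zeros(bits: str) -> int:
    """Number of consecutive '0' characters at the start of a bit string."""
    return len(bits) - len(bits.lstrip("0"))

# Hash values copied from the slide.
hashes = {
    "foo": "10100001",
    "bar": "01110111",
    "baz": "01110100",
    "qux": "10100011",
}

n = max(leading_zeros(h) for h in hashes.values())  # highest run of leading 0's
estimate = 2 ** (n + 1)                             # cardinality ≈ 2^(n+1)
print(n, estimate)  # → 1 4
```

With a single counter the estimate is always a power of two and a single lucky hash can throw it off badly, which is the outlier problem the bucketed version addresses.
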
  78. It’s actually not magic but just a few really clever observations.
  79. With 50/50 odds, how long will it take to flip 3 heads in a row? 20? 100?
  80. HYPERLOGLOG
    ➤ Replace “heads” and “tails” with 0’s and 1’s
    ➤ Count leading consecutive 0’s in binary form of hash
    ➤ E.g. imagine a 4-bit hash, 16 possible values:
      ➤ 0000: 4 leading 0’s
      ➤ 0001: 3 leading 0’s
      ➤ 0011, 0010: 2 leading 0’s
      ➤ 0100, 0111, 0110, 0101: 1 leading 0
      ➤ 1111, 1110, 1001, 1010, 1101, 1100, 1011, 1000: 0 leading 0’s
    ➤ Assume good hash function → 1/16 odds for each permutation
  81. HYPERLOGLOG
    ➤ Track highest number of leading 0’s, n
      ➤ n = 0 → 8/16 = 1/2 odds
      ➤ n = 1 → 4/16 = 1/4 odds
      ➤ n = 2 → 2/16 = 1/8 odds
      ➤ n = 3 → 1/16 odds
    ➤ Cardinality ≈ how many things did we have to look at?
    ➤ E.g. highest count = 1 → 1/4 odds → cardinality 4
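
The odds for the 4-bit example can be verified by brute force. A short sketch that enumerates all 16 values and tallies how many have each number of leading zeros:

```python
from collections import Counter
from fractions import Fraction

BITS = 4  # the 4-bit hash from the example

counts = Counter()
for value in range(2 ** BITS):
    bits = format(value, f"0{BITS}b")            # e.g. 3 -> "0011"
    zeros = len(bits) - len(bits.lstrip("0"))    # leading 0's
    counts[zeros] += 1

for zeros in sorted(counts):
    print(zeros, "leading 0's:", Fraction(counts[zeros], 2 ** BITS))
```

The tally comes out as 8, 4, 2 and 1 values for 0 through 3 leading zeros, matching the odds listed above, plus the single all-zero value 0000.
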
  83. HYPERLOGLOG
    ➤ 1/2 of all binary numbers start with 1
    ➤ Each additional bit cuts the probability in half:
      ➤ 1/4 start with 01
      ➤ 1/8 start with 001
      ➤ 1/16 start with 0001
      ➤ etc.
    ➤ $P(\text{run of length } n) = 1 / 2^{n+1}$
    ➤ Seeing 001 has 1/8 probability, meaning we had to look at approximately 8 things until we saw it (cardinality 8)
    ➤ Cardinality ≈ $prob^{-1}$ (reciprocal of probability)
  84. What about outliers?

  85. HYPERLOGLOG
    ➤ Use multiple buckets
    ➤ Use first few bits of hash to determine bucket
    ➤ Use remaining bits to count 0’s
    ➤ Each bucket tracks its own count
    ➤ Take harmonic mean of all buckets to get cardinality
    ➤ $\min(x_1 \ldots x_n) \le H(x_1 \ldots x_n) \le n \min(x_1 \ldots x_n)$
    [Diagram: hash 01011010001011100101001001101010 split into bucket bits and counting space]
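
Putting the bucket idea together gives something close to the real algorithm. A minimal sketch of the raw HyperLogLog estimate: the bias constant alpha = 0.7213 / (1 + 1.079/m) comes from the Flajolet et al. paper (for m ≥ 128), each register stores the position of the first 1-bit (leading zeros plus one) as in the paper, and the paper's small- and large-range corrections are deliberately left out. The use of SHA-256 truncated to 64 bits and the `user-<i>` test data are assumptions for illustration:

```python
import hashlib

class HyperLogLog:
    """Sketch of HyperLogLog with m = 2**bucket_bits registers (no corrections)."""

    def __init__(self, bucket_bits: int = 10):
        self.b = bucket_bits
        self.m = 1 << bucket_bits
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, item: bytes) -> None:
        # Assumption: 64-bit hash taken from the first 8 bytes of SHA-256.
        h = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
        rest_bits = 64 - self.b
        bucket = h >> rest_bits               # first b bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)     # remaining bits are the counting space
        rank = rest_bits - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def estimate(self) -> float:
        # Harmonic-mean-based raw estimate: alpha * m^2 / sum(2^-register).
        harmonic_sum = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / harmonic_sum

hll = HyperLogLog(bucket_bits=10)
for i in range(100_000):
    hll.add(f"user-{i}".encode())
    hll.add(f"user-{i}".encode())  # duplicates do not change the registers
print(round(hll.estimate()))
```

With 1,024 registers the expected standard error is roughly 1.04 / √1024 ≈ 3%, and the memory used stays fixed no matter how many elements are added.
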
  86. http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html Number of distinct words in all of Shakespeare's work

  87. HYPERLOGLOG
    ➤ Benefits:
      ➤ Constant memory
      ➤ Super fast (calculating MSB is cheap)
      ➤ Can give accurate count with <1% error
    ➤ Drawbacks:
      ➤ Has a margin of error (albeit small)
  88. What did we learn?

  89. Data processing has trade-offs.

  90. Probabilistic algorithms trade accuracy for speed and space.

  91. Often we only care about answers that are mostly correct but available now.
  92. Sometimes the “right” answer is impossible to compute or simply doesn’t exist.
  93. But mostly…

  94. Probabilistic algorithms are just damn cool.

  95. What about the code?

  96. ALGORITHM IMPLEMENTATIONS
    ➤ Algebird - https://github.com/twitter/algebird
      ➤ Bloom filter
      ➤ Count-min sketch
      ➤ HyperLogLog
    ➤ stream-lib - https://github.com/addthis/stream-lib
      ➤ Bloom filter
      ➤ Count-min sketch
      ➤ HyperLogLog
    ➤ Boom Filters - https://github.com/tylertreat/BoomFilters
      ➤ Bloom filter
      ➤ Scalable Bloom filter
      ➤ Stable Bloom filter
      ➤ Count-min sketch
      ➤ HyperLogLog
  97. OTHER COOL PROBABILISTIC ALGORITHMS
    ➤ Counting Bloom filter (and many other Bloom variations)
    ➤ Bloomier filter (encode functions instead of sets)
    ➤ Cuckoo filter (Bloom filter w/ cuckoo hashing)
    ➤ q-digest (quantile approximation)
    ➤ t-digest (online accumulation of rank-based statistics)
    ➤ Locality-sensitive hashing (hash similar items to same buckets)
    ➤ MinHash (set similarity)
    ➤ Miller–Rabin (primality testing)
    ➤ Karger’s algorithm (min cut of connected graph)
  98. @tyler_treat github.com/tylertreat bravenewgeek.com Thanks We’re hiring!

  99. BIBLIOGRAPHY
    ➤ Almeida, P., Baquero, C., Preguiça, N., Hutchison, D. 2007. Scalable Bloom Filters; http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
    ➤ Bloom, B. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors; https://www.cs.upc.edu/~diaz/p422-bloom.pdf
    ➤ Cormode, G., & Muthukrishnan, S. 2003. An Improved Data Stream Summary: The Count-Min Sketch and its Applications; http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
    ➤ Deng, F., & Rafiei, D. 2006. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters; https://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
    ➤ Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm; http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
    ➤ Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13); http://bioinformatics.oxfordjournals.org/content/26/13/1595.full.pdf
    ➤ Tarkoma, S., Rothenberg, C., & Lagerspetz, E. 2011. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, 14(1); https://gnunet.org/sites/default/files/TheoryandPracticeBloomFilter2011Tarkoma.pdf
    ➤ Treat, T. 2015. Stream Processing and Probabilistic Methods: Data at Scale; http://bravenewgeek.com/stream-processing-and-probabilistic-methods