Probabilistic algorithms for fun and pseudorandom profit

Tyler Treat
December 05, 2015

There's an increasing demand for real-time data ingestion and processing. Systems like Apache Kafka, Samza, and Storm have become popular for this reason. This type of high-volume, online data processing presents an interesting set of new challenges, namely, how do we drink from the firehose without getting drenched? Explore some of the fundamental primitives used in stream processing and, specifically, how we can use probabilistic methods to solve the problem.

Transcript

  1. PROBABILISTIC ALGORITHMS
    for fun and pseudorandom profit
    Tyler Treat / 12.5.2015

  2. ABOUT THE SPEAKER
    ➤ Backend engineer at Workiva
    ➤ Messaging platform tech lead
    ➤ Distributed systems
    ➤ bravenewgeek.com
    @tyler_treat

    [email protected]

  3. [Chart: Time vs. Data]
    Time: Batch (days, hours) → Streaming (minutes, seconds) → Real-Time™ (I need it now, dammit!)
    Data: Meh, data (I can store this on my laptop) → Oi, data! (We’re gonna need a bigger boat…) → Big data™ (IoT, sensors)
    /dev/null

  4. [Chart: Time vs. Data, with regions labeled]
    Time: Batch (days, hours) → Streaming (minutes, seconds) → Real-Time™ (I need it now, dammit!)
    Data: Meh, data (I can store this on my laptop) → Oi, data! (We’re gonna need a bigger boat…) → Big data™ (IoT, sensors)
    Regions: Not Interesting, Kinda Interesting, Pretty Interesting
    /dev/null

  5. http://bravenewgeek.com/stream-processing-and-probabilistic-methods/

  6. THIS TALK IS NOT
    ➤ About Samza, Storm, Spark Streaming et al.
    ➤ Strictly about stream-processing techniques
    ➤ Mathy
    ➤ Statistics-y

  7. THIS TALK IS
    ➤ About basic probability theory
    ➤ About practical design trade-offs
    ➤ About algorithms & data structures
    ➤ About dealing with large or unbounded datasets
    ➤ A marriage of CS & engineering

  8. OUTLINE
    ➤ Terminology & context
    ➤ Why probabilistic algorithms?
    ➤ Bloom filters & variants
    ➤ Count-min sketch
    ➤ HyperLogLog

  9. Randomized Algorithms
    Random Input
    ➤ Las Vegas Algorithms: correct result, gamble on speed
    ➤ Monte Carlo Algorithms: deterministic speed, gamble on result

  11. DEFINING SOME TERMINOLOGY
    ➤ Online - processing elements as they arrive
    ➤ Offline - entire dataset is known ahead of time
    ➤ Real-time - hard constraint on response time
    ➤ A priori knowledge - something known beforehand

  12. BATCH VS STREAMING
    ➤ Batch
      ➤ Offline
      ➤ Heuristics/multiple passes
      ➤ Data structures less important
    [Diagram: documents → search index]

  13. BATCH VS STREAMING
    ➤ Streaming
      ➤ Online, one pass
      ➤ Usually real-time (but not necessarily)
      ➤ Potentially unbounded
    [Diagram labels: transactions, caches, fraud, analytics]

  14. 3 DATA INTEGRATION QUESTIONS
    ➤ How do you get the data?
    ➤ How do you disseminate the data?
    ➤ How do you process the data?

  15. 3 DATA INTEGRATION QUESTIONS
    ➤ How do you get the data (quickly)?
    ➤ How do you disseminate the data (quickly)?
    ➤ How do you process the data (quickly)?

  16. Denormalization is critical to
    performance at scale.

  17. How to count the number of
    distinct document views
    across Wikipedia?

  18. 16-byte GUID → 8-byte integer
    10b531cb-914c-4b3e-ac1d-11678dd72f7a → 3,042,568

  19. 10b531cb-914c-4b3e-ac1d-11678dd72f7a → 3,042,568
    5d5d5a78-f98f-4eee-bc83-762b3c78f1ea → 1,250,763
    3558d299-45ef-4fc9-b9ec-902e4943c7f8 → 982,531
    6febb745-c987-4c51-afd2-90a55f357d7b → 24,703,289
    6f3f199e-4cc3-4c68-9d2a-00c31eb199f3 → 7,401,050

  20. Wikipedia has ~38 million pages.

  21. 38,000,000 pages × (16-byte GUID + 8-byte integer) ≈ 1 GB
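
As a rough check of that estimate, here is a back-of-the-envelope calculation. It is only a sketch: the function and constant names are hypothetical, and it counts raw payload bytes, ignoring hash-table and object overhead, which would make the real footprint larger.

```python
PAGES = 38_000_000   # approximate page count quoted on the slide
KEY_BYTES = 16       # one GUID per page
VALUE_BYTES = 8      # one 64-bit view counter per page


def naive_counter_bytes(entries: int,
                        key_bytes: int = KEY_BYTES,
                        value_bytes: int = VALUE_BYTES) -> int:
    """Raw payload size of one key/counter pair per entry (no container overhead)."""
    return entries * (key_bytes + value_bytes)


total = naive_counter_bytes(PAGES)
print(f"{total:,} bytes = {total / 1e9:.2f} GB")  # 912,000,000 bytes = 0.91 GB
```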

  22. ➤ Not unreasonable for modern hardware
    ➤ Held in memory for lifetime of process, so will move to old GC generations, which are expensive to collect!
    ➤ Now we want to track views per unique IP address
    ➤ >4 billion IPv4 addresses
    ➤ Naive solutions quickly become intractable

  23. DISTRIBUTED SYSTEMS TRADE-OFFS
    Consistency
    Availability
    Partition Tolerance

  24. DATA PROCESSING TRADE-OFFS
    Time
    Accuracy
    Space

  25. HAVE YOUR CAKE AND EAT IT TOO?
    The “Lambda Architecture”
    [Diagram labels: Stream Processing, Batch Processing, App]

  26. Probabilistic algorithms trade
    accuracy for space and performance.

  27. “Sketching” data structures make this
    trade by storing a summary of the dataset
    when storing it entirely is prohibitively
    expensive.

  28. Bloom Filters
    B. H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. 1970.

  29. Answers a simple question:
    is this element a member of a set?
    S ⊆ U (the universe of elements), x ∈ S?

  30. SET MEMBERSHIP
    ➤ Is this URL malicious?
    ➤ Is this IP address blacklisted?
    ➤ Is this word contained in the document?
    ➤ Is this record in the database?
    ➤ Has this transaction been processed?

  31. Hash Table
    entry for each member

  32. Bit Array
    bit for each element in universe
    0101101000101110010100100110101…

  33. BLOOM FILTERS
    ➤ Bloom filters store set memberships
    ➤ Answers “not in set” or “probably in set”

  34. [Diagram: Bloom Filter in front of Secondary Store]
    ➤ Do you have key 1? Filter: no → answer: no (store not accessed)
    ➤ Do you have key 2? Filter: yes → store: yes, “Here’s key 2” (necessary access)
    ➤ Do you have key 3? Filter: yes → store: no → answer: no (unnecessary access)

  35. BLOOM FILTERS
    ➤ 2 operations: add, lookup
    ➤ Allocate bit array of length m
    ➤ k hash functions
    ➤ Configure m and k for desired false-positive rate
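
One common way to pick m and k uses the standard Bloom filter sizing formulas, m = −n·ln(p) / (ln 2)² and k = (m/n)·ln 2, where n is the expected number of elements and p the target false-positive rate. The sketch below assumes those textbook formulas (they are not spelled out on the slide), and the function names are hypothetical.

```python
import math
from typing import Tuple


def bloom_parameters(n: int, p: float) -> Tuple[int, int]:
    """Return (m, k): bit-array length and hash count for n items at false-positive rate p."""
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


def expected_fp_rate(m: int, k: int, n: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k of the false-positive rate."""
    return (1.0 - math.exp(-k * n / m)) ** k


m, k = bloom_parameters(1_000_000, 0.01)
print(m, k, expected_fp_rate(m, k, 1_000_000))
```

For a million elements at a 1% target this works out to a little under 10 bits per element and 7 hash functions.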

  36. BLOOM FILTERS
    ➤ Add element:
      ➤ Hash with k functions to get k indices
      ➤ Set bits at each index
    ➤ Lookup:
      ➤ Hash with k functions to get k indices
      ➤ Check bit at each index
      ➤ If any bit is unset, element not in set
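
The add and lookup steps above can be sketched in a few lines. This is a minimal illustration, not the implementation from any of the libraries mentioned in the talk. Deriving the k indices from two halves of one SHA-256 digest (double hashing) is an assumption made here for brevity, and the class and method names are hypothetical.

```python
import hashlib
from typing import List


class BloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m = m                            # number of bits
        self.k = k                            # number of hash functions
        self.bits = bytearray((m + 7) // 8)   # bit array, all zero

    def _indices(self, item: bytes) -> List[int]:
        # Simulate k hash functions with double hashing: h1 + i*h2 (mod m).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def lookup(self, item: bytes) -> bool:
        # False means "definitely not in set"; True means "probably in set".
        return all(self.bits[idx // 8] & (1 << (idx % 8))
                   for idx in self._indices(item))


bf = BloomFilter(m=1024, k=3)
bf.add(b"foo")
print(bf.lookup(b"foo"))  # True: an added element is never reported missing
```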

  37. BLOOM FILTERS
    ➤ Benefits:
      ➤ More space-efficient than hash table or bit array
      ➤ Can determine trade-off between accuracy and space
    ➤ Drawbacks:
      ➤ Some elements potentially more sensitive to false positives than others (solvable by partitioning)
      ➤ Can’t remove elements
      ➤ Requires a priori knowledge of the dataset
      ➤ Over-provisioned filter wastes space

  38. Bloom filters are great for efficient offline
    processing, but what about streaming?

  39. BLOOM FILTERS WITH A TWIST
    ➤ Rotating Bloom filters
      ➤ e.g. remember everything in the last hour
    ➤ Scalable Bloom Filters
      ➤ Dynamically allocating chained filters
    ➤ Stable Bloom Filters
      ➤ Continuously evict stale data

  40. Scalable Bloom Filters
    P. S. Almeida, C. Baquero, N. Preguiça, D. Hutchison. Scalable Bloom Filters. 2007.

  41. [Diagram: l chained filters, each with error probability P0]
    P0 = error prob. of 1 filter
    l = # filters
    P = compound error prob.
    $P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$

  42. P0 = 0.1
    $P = 1 - \prod_{i=0}^{l-1} (1 - P_0)$

  43. SCALABLE BLOOM FILTERS
    ➤ Questions:
    ➤ When to add a new filter?
    ➤ How to place a tight upper bound on P?

  44. SCALABLE BLOOM FILTERS
    ➤ When to add a new filter?
      ➤ Fill ratio p = # set bits / # bits
      ➤ Add new filter when target p is reached
      ➤ Optimal target p = 0.5 (math follows from paper)

  45. SCALABLE BLOOM FILTERS
    ➤ How to place a tight upper bound on P?
      ➤ Apply tightening ratio r to P0, where 0 < r < 1
      ➤ Start with 1 filter, error probability P0
      ➤ When full, add new filter, error probability $P_1 = P_0 r$
      ➤ Results in geometric series of per-filter error probabilities ($P_0 r^i$ for the i-th added filter)
      ➤ Series converges on target error probability P

  46. P0 = 0.1, r = 0.5
    $P = 1 - \prod_{i=0}^{l-1} (1 - P_0 r^i)$
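
To see how the tightening ratio changes the picture, the compound error probability can be evaluated directly. The sketch below implements the product formula; the function name is hypothetical. Setting r = 1 gives the untightened case, where every filter has error probability P0. By the union bound, P can never exceed the sum of the per-filter error probabilities, which for 0 < r < 1 is at most P0 / (1 − r).

```python
def compound_error(p0: float, r: float, l: int) -> float:
    """P = 1 - prod_{i=0}^{l-1} (1 - p0 * r**i) for l chained filters."""
    no_false_positive = 1.0
    for i in range(l):
        no_false_positive *= 1.0 - p0 * r ** i
    return 1.0 - no_false_positive


for l in (1, 5, 10, 50):
    untightened = compound_error(0.1, 1.0, l)   # every filter at P0 = 0.1
    tightened = compound_error(0.1, 0.5, l)     # P0 = 0.1, r = 0.5
    print(l, round(untightened, 4), round(tightened, 4))
```

Without tightening, P keeps climbing toward 1 as filters are added. With r = 0.5 it levels off below P0 / (1 − r) = 0.2.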

  47. SCALABLE BLOOM FILTERS
    ➤ Add elements to last filter
    ➤ Check each filter on lookups
    ➤ Tightening ratio r controls m and k for new filters
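
A compact sketch of these three rules follows. It is a simplification, not the paper's exact construction or the BoomFilters implementation. It assumes each new filter gets k = ⌈log2(1/P_i)⌉ hash functions (so a filter kept at or below a fill ratio of 0.5 has a false-positive rate of roughly 0.5^k ≤ P_i), and that filter sizes grow by a fixed factor. All names, the growth factor, and the SHA-256 double hashing are illustrative assumptions.

```python
import hashlib
import math
from typing import List


class _Filter:
    """Plain Bloom filter that also tracks how many bits are set."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)   # one byte per bit, for simplicity
        self.set_bits = 0

    def _indices(self, item: bytes) -> List[int]:
        d = hashlib.sha256(item).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for idx in self._indices(item):
            if not self.bits[idx]:
                self.bits[idx] = 1
                self.set_bits += 1

    def lookup(self, item: bytes) -> bool:
        return all(self.bits[idx] for idx in self._indices(item))

    def fill_ratio(self) -> float:
        return self.set_bits / self.m


class ScalableBloomFilter:
    def __init__(self, p0: float = 0.01, r: float = 0.5,
                 initial_bits: int = 1024, growth: int = 2,
                 target_fill: float = 0.5) -> None:
        self.p0, self.r = p0, r
        self.initial_bits, self.growth = initial_bits, growth
        self.target_fill = target_fill
        self.filters: List[_Filter] = []
        self._grow()

    def _grow(self) -> None:
        i = len(self.filters)
        p_i = self.p0 * self.r ** i               # tightened error probability
        k = math.ceil(math.log2(1.0 / p_i))       # 0.5**k <= p_i at fill ratio 0.5
        m = self.initial_bits * self.growth ** i  # assumed geometric size growth
        self.filters.append(_Filter(m, k))

    def add(self, item: bytes) -> None:
        # Add a new filter once the current one reaches the target fill ratio.
        if self.filters[-1].fill_ratio() >= self.target_fill:
            self._grow()
        self.filters[-1].add(item)                # always add to the last filter

    def lookup(self, item: bytes) -> bool:
        return any(f.lookup(item) for f in self.filters)  # check every filter
```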

  48. SCALABLE BLOOM FILTERS
    ➤ Benefits:
      ➤ Can grow dynamically to accommodate dataset
      ➤ Provides tight upper bound on false-positive rate
      ➤ Can control growth rate
    ➤ Drawbacks:
      ➤ Size still proportional to dataset
      ➤ Additional computation on adds (negligible amortized)

  49. Stable Bloom Filters
    F. Deng, D. Rafiei. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters. 2006.

  50. DUPLICATE DETECTION
    ➤ Query processing
    ➤ URL crawling
    ➤ Monitoring distinct IP addresses
    ➤ Advertiser click streams
    ➤ Graph processing

  51. Bloom filters are remarkably useful for
    dealing with graph data.

  52. GRAPH PROCESSING
    ➤ Detecting cycles
    ➤ Pruning search space
    ➤ E.g. often used in bioinformatics
      ➤ Storing chemical structures, properties, and molecular fingerprints in filters to optimize searches and determine structural similarities
      ➤ Rapid classification of DNA sequences as large as the human genome

  53. GRAPH PROCESSING
    ➤ Store crawled nodes in memory
      ➤ Set of nodes may be too large to fit in memory
    ➤ Store crawled nodes in secondary storage
      ➤ Too many searches to perform in limited time

  54. Precisely eliminating duplicates in an
    unbounded stream isn’t feasible with
    limited space and time.

  55. Efficacy/Efficiency Conjecture:

    In many situations, a quick answer with
    an allowable error rate is better than a
    precise one that is slow.

  56. Staleness Conjecture:

    In many situations, more recent data has
    more value than stale data.

  57. STABLE BLOOM FILTERS
    ➤ Discards old data to make room for new data
    ➤ Replace bit array with array of d-bit counters
    ➤ Initialize counters to zero
    ➤ Maximum counter value Max = $2^d - 1$

  58. STABLE BLOOM FILTERS
    ➤ Add element:
      ➤ Select P random counters and decrement by one
      ➤ Hash with k functions to get k indices
      ➤ Set counters at each index to Max
    ➤ Lookup:
      ➤ Hash with k functions to get k indices
      ➤ Check counter at each index
      ➤ If any counter is zero, element not in set
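
The steps above translate fairly directly into code. The following is a minimal sketch under stated assumptions, not the paper's reference implementation. Counters are stored as a plain Python list instead of packed d-bit cells, the P counters are drawn with `random.sample` (distinct positions), and the k indices again come from SHA-256 double hashing. All names are hypothetical.

```python
import hashlib
import random
from typing import List, Optional


class StableBloomFilter:
    def __init__(self, m: int, k: int, d: int, p: int,
                 seed: Optional[int] = None) -> None:
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1        # Max = 2^d - 1
        self.cells = [0] * m           # m counters, initialized to zero
        self._rng = random.Random(seed)

    def _indices(self, item: bytes) -> List[int]:
        dgst = hashlib.sha256(item).digest()
        h1 = int.from_bytes(dgst[:8], "big")
        h2 = int.from_bytes(dgst[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        # 1. Decay: decrement P randomly chosen counters (never below zero).
        for idx in self._rng.sample(range(self.m), self.p):
            if self.cells[idx] > 0:
                self.cells[idx] -= 1
        # 2. Set the k hashed counters to Max.
        for idx in self._indices(item):
            self.cells[idx] = self.max

    def lookup(self, item: bytes) -> bool:
        # Any zero counter means "not in set" (or evicted: a false negative).
        return all(self.cells[idx] > 0 for idx in self._indices(item))


sbf = StableBloomFilter(m=1000, k=3, d=3, p=10, seed=1)
sbf.add(b"x")
print(sbf.lookup(b"x"))  # True right after the add; may decay to False later
```

Because the decrement happens before the new element's counters are set, an element is always found immediately after it is added. False negatives only appear once later adds have decayed its counters to zero.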

  59. STABLE BLOOM FILTERS
    ➤ Classic Bloom filter a special case of SBF w/ d=1, P=0
    ➤ Tight upper bound on false positives
      ➤ FP rate asymptotically approaches configurable fixed constant (stable-point property)
      ➤ See paper for math and parameter settings
    ➤ Evicting data introduces false negatives

  60. STABLE BLOOM FILTERS
    ➤ Benefits:
      ➤ Fixed memory allocation
      ➤ Evicts old data to make room for new data
      ➤ Provides tight upper bound on false positives
    ➤ Drawbacks:
      ➤ Introduces false negatives
      ➤ Additional computation on adds

  61. Count-Min Sketch
    G. Cormode, S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications. 2003.

  62. Can we count element frequencies using sub-linear space?
    [Table: page views per IP address]
    94.136.205.1 → 7
    132.208.90.15 → 4
    54.222.151.15 → 11

  63. COUNT-MIN SKETCH
    ➤ Approximates frequencies in sub-linear space
    ➤ Matrix with w columns and d rows
      ➤ Each row has a hash function
      ➤ Each cell initialized to zero
    ➤ When element arrives:
      ➤ Hash for each row
      ➤ Increment each counter by 1
    ➤ freq(element) = min counter value
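
The structure described above fits in a short class. This is a sketch only. The per-row hash is simulated by prefixing the row number to the input of SHA-256, which is an assumption made for brevity, and the names are hypothetical. The example data reuses the page-view counts from the earlier slide.

```python
import hashlib
from typing import List


class CountMinSketch:
    def __init__(self, w: int, d: int) -> None:
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]   # d rows, w columns, all zero

    def _columns(self, item: bytes) -> List[int]:
        # One hash function per row, simulated by salting with the row number.
        cols = []
        for row in range(self.d):
            digest = hashlib.sha256(row.to_bytes(4, "big") + item).digest()
            cols.append(int.from_bytes(digest[:8], "big") % self.w)
        return cols

    def add(self, item: bytes, count: int = 1) -> None:
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item: bytes) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))


views = {b"94.136.205.1": 7, b"132.208.90.15": 4, b"54.222.151.15": 11}
cms = CountMinSketch(w=1024, d=4)
for ip, n in views.items():
    for _ in range(n):
        cms.add(ip)
for ip in views:
    print(ip.decode(), cms.estimate(ip))
```

Each estimate is at least the true count. With a table this much larger than the input, collisions are unlikely, so the estimates will typically be exact.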

  64. COUNT-MIN SKETCH
    ➤ Why the minimum?
      ➤ Possibility for collisions between elements
      ➤ Counter may be incremented by multiple elements
      ➤ Taking minimum counter value gives closer approximation

  65. COUNT-MIN SKETCH
    ➤ Benefits:
      ➤ Simple!
      ➤ Sub-linear space
      ➤ Useful for detecting “heavy hitters”
      ➤ Easy to track top-k by adding a min-heap
    ➤ Drawbacks:
      ➤ Biased estimator: may overestimate, never underestimates
      ➤ Better suited to Zipfian distributions & rare events

  66. HyperLogLog
    P. Flajolet, É. Fusy, O. Gandouet, F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. 2007.

  67. How do we count distinct things in a stream?

  68. COUNTING PROBLEMS
    ➤ E.g. how many different words are used in Wikipedia?
    ➤ Counter per element explodes memory
    ➤ Usually requires memory proportional to cardinality
    ➤ Can we approximate cardinality with constant space?

  69. HYPERLOGLOG
    ➤ The name: can estimate cardinality of set w/ cardinality Nmax using loglog(Nmax) + O(1) bits
    ➤ Hash element to integer
    ➤ Count number of leading 0’s in binary form of hash
    ➤ Track highest number of leading 0’s, n
    ➤ Cardinality ≈ $2^{n+1}$

  70. HYPERLOGLOG
    ➤ stream = [“foo”, “bar”, “baz”, “qux”]
    ➤ h(“foo”) = 10100001
    ➤ h(“bar”) = 01110111
    ➤ h(“baz”) = 01110100
    ➤ h(“qux”) = 10100011
    ➤ n = 1
    ➤ |stream| ≈ $2^{n+1} = 2^2 = 4$
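
The worked example can be replayed in code. This sketch takes the four 8-bit hash values from the slide as given (they are illustrative, not the output of a real hash function), and the helper names are hypothetical.

```python
from typing import Iterable


def leading_zeros(bits: str) -> int:
    """Number of consecutive '0' characters at the start of a bit string."""
    return len(bits) - len(bits.lstrip("0"))


def naive_estimate(hashes: Iterable[str]) -> int:
    """Single-bucket estimate: 2^(n+1), where n is the longest run of leading zeros."""
    n = max(leading_zeros(h) for h in hashes)
    return 2 ** (n + 1)


hashes = {
    "foo": "10100001",
    "bar": "01110111",
    "baz": "01110100",
    "qux": "10100011",
}
print(naive_estimate(hashes.values()))  # 4
```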

  71. It’s actually not magic but just a few
    really clever observations.

  72. With 50/50 odds, how long will it take to flip
    3 heads in a row? 20? 100?

  73. HYPERLOGLOG
    ➤ Replace “heads” and “tails” with 0’s and 1’s
    ➤ Count leading consecutive 0’s in binary form of hash
    ➤ E.g. imagine a 4-bit hash, 16 possible values:
      ➤ 0000: 4 leading 0’s
      ➤ 0001: 3 leading 0’s
      ➤ 0010, 0011: 2 leading 0’s
      ➤ 0100, 0101, 0110, 0111: 1 leading 0
      ➤ 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111: 0 leading 0’s
    ➤ Assume good hash function → 1/16 odds for each value

  74. HYPERLOGLOG
    ➤ Track highest number of leading 0’s, n
      ➤ n = 0 → 8/16 = 1/2 odds
      ➤ n = 1 → 4/16 = 1/4 odds
      ➤ n = 2 → 2/16 = 1/8 odds
      ➤ n = 3 → 1/16 odds
    ➤ Cardinality ≈ how many things did we have to look at?
    ➤ E.g. highest count = 1 → 1/4 odds → cardinality 4

  75. HYPERLOGLOG
    ➤ 1/2 of all binary numbers start with 1
    ➤ Each additional bit cuts the probability in half:
      ➤ 1/4 start with 01
      ➤ 1/8 start with 001
      ➤ 1/16 start with 0001
      ➤ etc.
    ➤ P(run of length n) = $1 / 2^{n+1}$
    ➤ Seeing 001 has 1/8 probability, meaning we had to look at approximately 8 things until we saw it (cardinality 8)
    ➤ Cardinality ≈ $\text{prob}^{-1}$ (reciprocal of probability)

  76. What about outliers?

  77. HYPERLOGLOG
    ➤ Use multiple buckets
      ➤ Use first few bits of hash to determine bucket
      ➤ Use remaining bits to count 0’s
      ➤ Each bucket tracks its own count
    ➤ Take harmonic mean of all buckets to get cardinality
      ➤ $\min(x_1, \ldots, x_n) \le H(x_1, \ldots, x_n) \le n \cdot \min(x_1, \ldots, x_n)$
    [Diagram: hash 01011010001011100101001001101010, leading bits labeled “bucket”, remaining bits labeled “counting space”]
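
Putting buckets and the harmonic mean together gives something close to the published algorithm. The sketch below is a simplified illustration, not the code of any library named in this talk. It assumes a 64-bit hash taken from SHA-256, follows the paper's convention of storing the position of the first 1 bit (leading zeros + 1) per bucket, and uses the paper's bias constant α_m = 0.7213 / (1 + 1.079/m) plus its linear-counting correction for small cardinalities. Class and parameter names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b: int = 10) -> None:
        if not 7 <= b <= 16:
            raise ValueError("b must be in [7, 16] for this alpha formula")
        self.b = b                      # number of hash bits used for the bucket
        self.m = 1 << b                 # number of buckets (registers)
        self.registers = [0] * self.m

    def add(self, item: bytes) -> None:
        h = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")  # 64-bit hash
        rest_bits = 64 - self.b
        bucket = h >> rest_bits                    # first b bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)          # remaining bits: counting space
        rank = rest_bits - rest.bit_length() + 1   # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rank)

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        # Normalized harmonic mean of 2^register across all buckets.
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # linear counting
        return raw


hll = HyperLogLog(b=10)
for i in range(100_000):
    hll.add(str(i).encode())
print(round(hll.estimate()))
```

With 1,024 buckets the standard error is about 1.04 / √1024 ≈ 3%, so the printed estimate should land within a few percent of 100,000. Adding the same elements again leaves the registers, and therefore the estimate, unchanged.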

  78. http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
    Number of distinct words in all of Shakespeare's work

  79. HYPERLOGLOG
    ➤ Benefits:
      ➤ Constant memory
      ➤ Super fast (calculating MSB is cheap)
      ➤ Can give accurate count with <1% error
    ➤ Drawbacks:
      ➤ Has a margin of error (albeit small)

  80. What did we learn?

  81. Data processing has trade-offs.

  82. Probabilistic algorithms trade accuracy for
    speed and space.

  83. Often we only care about answers that are
    mostly correct but available now.

  84. Sometimes the “right” answer is impossible
    to compute or simply doesn’t exist.

  85. But mostly…

  86. Probabilistic algorithms are
    just damn cool.

  87. What about the code?

  88. ALGORITHM IMPLEMENTATIONS
    ➤ Algebird - https://github.com/twitter/algebird
      ➤ Bloom filter
      ➤ Count-min sketch
      ➤ HyperLogLog
    ➤ stream-lib - https://github.com/addthis/stream-lib
      ➤ Bloom filter
      ➤ Count-min sketch
      ➤ HyperLogLog
    ➤ Boom Filters - https://github.com/tylertreat/BoomFilters
      ➤ Bloom filter
      ➤ Scalable Bloom filter
      ➤ Stable Bloom filter
      ➤ Count-min sketch
      ➤ HyperLogLog

  89. OTHER COOL PROBABILISTIC ALGORITHMS
    ➤ Counting Bloom filter (and many other Bloom variations)
    ➤ Bloomier filter (encode functions instead of sets)
    ➤ Cuckoo filter (Bloom filter w/ cuckoo hashing)
    ➤ q-digest (quantile approximation)
    ➤ t-digest (online accumulation of rank-based statistics)
    ➤ Locality-sensitive hashing (hash similar items to same buckets)
    ➤ MinHash (set similarity)
    ➤ Miller–Rabin (primality testing)
    ➤ Karger’s algorithm (min cut of connected graph)

  90. @tyler_treat
    github.com/tylertreat
    bravenewgeek.com
    Thanks
    We’re hiring!

  91. BIBLIOGRAPHY
    Almeida, P., Baquero, C., Preguiça, N., Hutchison, D. 2007. Scalable Bloom Filters; http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf
    Bloom, B. 1970. Space/Time Trade-offs in Hash Coding with Allowable Errors; https://www.cs.upc.edu/~diaz/p422-bloom.pdf
    Cormode, G., & Muthukrishnan, S. 2003. An Improved Data Stream Summary: The Count-Min Sketch and its Applications; http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf
    Deng, F., & Rafiei, D. 2006. Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters; https://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf
    Flajolet, P., Fusy, É., Gandouet, O., Meunier, F. 2007. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm; http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
    Stranneheim, H., Käller, M., Allander, T., Andersson, B., Arvestad, L., Lundeberg, J. 2010. Classification of DNA sequences using Bloom filters. Bioinformatics, 26(13); http://bioinformatics.oxfordjournals.org/content/26/13/1595.full.pdf
    Tarkoma, S., Rothenberg, C., & Lagerspetz, E. 2011. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, 14(1); https://gnunet.org/sites/default/files/TheoryandPracticeBloomFilter2011Tarkoma.pdf
    Treat, T. 2015. Stream Processing and Probabilistic Methods: Data at Scale; http://bravenewgeek.com/stream-processing-and-probabilistic-methods