A Look Into Bloom Filters

Fernando Mendes

October 07, 2016

Transcript

  1. a look into bloom filters

  3. @fribmendes @frmendes

  4. @cesiuminho

  6. @coderdojominho

  7. We design and develop thoughtful digital products. BRAGA & BOSTON

  8. @mirrorconf @rubyconfpt

  9. wat the wtf is a bloom filter

  10. “A bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016

  11. A funky array with hash functions that’s supposed to be really really small.

  12. bloom filter, do you have ‘abc’ in there?

  13. do you have ‘abc’ in there? bloom filter: i definitely do not

  14. bloom filter: i definitely do not. how about some ‘xyz’?

  15. how about some ‘xyz’? bloom filter: i mean, yeah, probably

  16. SERVER

  17. to SERVER: Can I visit “pixels.camp”?

  19. Can I visit “pixels.camp”? SERVER CLIENT bloom filter

  20. Pre-filling the bloom filter

  21. add(‘totallynotfake.com’)

  22. hash(‘totallynotfake.com’)

  24. hash(‘clickformoney.com’)

  25. to CLIENT: Can I visit “pixels.camp”?

  26. to CLIENT: Can I visit “pixels.camp”? CLIENT: hash(‘pixels.camp’)

  27. to CLIENT: Can I visit “pixels.camp”? CLIENT: yes!

  28. to CLIENT: Can I visit “github.com”?

  29. to CLIENT: Can I visit “github.com”? CLIENT: hash(‘github.com’)

  30. to CLIENT: Can I visit “github.com”? CLIENT: nope.

  31. to SERVER: Can I visit “github.com”?

  32. to SERVER: Can I visit “github.com”? SERVER: you’re good to go
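One way to read the exchange above: the client answers locally whenever its filter says "definitely not", and only falls back to the server when the filter says "possibly". The sketch below is a minimal illustration of that reading, not the speaker's code. The `Server` and `Client` classes, `can_visit?`, `server_lookups`, and the use of CRC32 as the hash are all assumptions made for the example.

```ruby
require "set"
require "zlib"

# Hypothetical stand-in for the server: holds the authoritative blocklist.
class Server
  def initialize(malicious_urls)
    @malicious = Set.new(malicious_urls)
  end

  def malicious?(url)
    @malicious.include?(url)
  end
end

# Hypothetical client: keeps a small bit set pre-filled from the blocklist.
class Client
  attr_reader :server_lookups

  def initialize(server, malicious_urls, size: 1024)
    @server = server
    @size = size
    @bits = 0 # an Integer used as a bit array
    @server_lookups = 0
    malicious_urls.each { |url| @bits |= 1 << index_for(url) }
  end

  def can_visit?(url)
    # Bit not set: the URL was definitely never added, so answer locally.
    return true if @bits[index_for(url)].zero?

    # Bit set: possibly malicious (or just a collision), so ask the server.
    @server_lookups += 1
    !@server.malicious?(url)
  end

  private

  # CRC32 is an assumption, picked only because it ships with Ruby.
  def index_for(url)
    Zlib.crc32(url) % @size
  end
end

blocklist = ["totallynotfake.com", "clickformoney.com"]
client = Client.new(Server.new(blocklist), blocklist)

puts client.can_visit?("pixels.camp")        # allowed, locally or via the server
puts client.can_visit?("totallynotfake.com") # refused after a server lookup
```

A false positive only costs one extra round trip to the server; it never lets a blocklisted URL through.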

  33. “A bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016

  36. Things to consider: bloom filters do inclusion testing

  37. Things to consider: bloom filters turn big data into tiny data

  38. Things to consider: bloom filters turn false into true

  39. Things to consider: your application must allow false positives

  40. diving into it

  41. module MaliciousUrl
        class Filter
        end
      end

  42. module MaliciousUrl
        class Filter
          def initialize
            @filter = Hash.new
          end
        end
      end

  43. module MaliciousUrl
        class Filter
          def add(url)
            @filter[url] = true
          end
        end
      end

  44. module MaliciousUrl
        class Filter
          def test(url)
            @filter[url]
          end
        end
      end

  45. instant access™

  46. instant access™; space complexity: saving key-value tuples

  47. instant access™; space complexity: saving key-value tuples; solution: bit arrays

  48. module MaliciousUrl
        class Filter
          def initialize(size: 1024)
            @bits = BitArray.new(size)
            @fnv = FNV.new
            @size = size
          end
        end
      end

  49. module MaliciousUrl
        class Filter
          def hash(str)
            @fnv.fnv1a_32(str) % @size
          end
        end
      end

  50. module MaliciousUrl
        class Filter
          def add(str)
            index = hash(str)
            @bits[index] = 1
          end
        end
      end

  51. module MaliciousUrl
        class Filter
          def test(str)
            index = hash(str)
            @bits[index] == 1
          end
        end
      end
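The single-hash filter from the slides above relies on `BitArray` and `FNV`, which come from gems. As a rough, dependency-free sketch of the same idea, the version below swaps them for a plain Integer used as a bit set and a hand-rolled 32-bit FNV-1a. The class name `SingleHashFilter` and the helper `index_for` are hypothetical; `index_for` is used instead of `hash` so the example does not override `Object#hash`.

```ruby
module MaliciousUrl
  # Sketch of the one-hash version: one bit per added string.
  class SingleHashFilter
    FNV_OFFSET_BASIS = 0x811c9dc5 # 32-bit FNV offset basis
    FNV_PRIME = 0x01000193        # 32-bit FNV prime

    def initialize(size: 1024)
      @size = size
      @bits = 0 # Integer standing in for the bit array
    end

    def add(str)
      @bits |= 1 << index_for(str)
    end

    def test(str)
      @bits[index_for(str)] == 1
    end

    private

    # FNV-1a: XOR each byte in, then multiply by the prime, kept to 32 bits.
    def fnv1a_32(str)
      str.each_byte.reduce(FNV_OFFSET_BASIS) do |hash, byte|
        ((hash ^ byte) * FNV_PRIME) & 0xffffffff
      end
    end

    def index_for(str)
      fnv1a_32(str) % @size
    end
  end
end

filter = MaliciousUrl::SingleHashFilter.new
filter.add("totallynotfake.com")
puts filter.test("totallynotfake.com") # an added value always tests positive
```

Anything that was added always tests positive; anything else tests positive only when it happens to land on an already-set bit.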
  52. instant access™

  53. instant access™; space-efficiency

  54. instant access™; space-efficiency; small universe == more collisions

  55. instant access™; space-efficiency; small universe == more collisions; solution: more hashes

  56. def initialize(size: 1024, iterations: 3)
        @bits = BitArray.new(size)
        @size = size
        @seeds = seed(iterations)
      end
  59. def seed(n)
        seeds = []
        n.times do
          seed = SecureRandom.hex(3).to_i(16)
          seeds.push(seed)
        end
        seeds
      end

  60. def seed(iterations)
        (1..iterations).map do
          SecureRandom.hex(3).to_i(16)
        end
      end
      because Ruby

  62. def hash(str, seed)
        hash = MurmurHash3::V32.str_hash(str, seed)
        hash % @size
      end

  63. def indices_of(str)
        @seeds.map { |seed| hash(str, seed) }
      end

  64. def add(str)
        indices_of(str).each { |i| @bits[i] = 1 }
      end

  65. def test(str)
        indices_of(str).all? { |i| @bits[i] == 1 }
      end
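Put together, the multi-hash methods above might look like the following sketch. It is not the speaker's implementation: to stay runnable with the standard library alone, it replaces MurmurHash3 with a seed-prefixed SHA-256 digest and `BitArray` with an Integer bit set. `SeededFilter` and `index_for` are hypothetical names.

```ruby
require "digest"
require "securerandom"

module MaliciousUrl
  # Sketch of the seeded version: each string sets one bit per seed.
  class SeededFilter
    def initialize(size: 1024, iterations: 3)
      @size = size
      @bits = 0 # Integer standing in for the bit array
      @seeds = Array.new(iterations) { SecureRandom.hex(3).to_i(16) }
    end

    def add(str)
      indices_of(str).each { |i| @bits |= 1 << i }
    end

    # True only if every one of the string's bits is set.
    def test(str)
      indices_of(str).all? { |i| @bits[i] == 1 }
    end

    private

    def indices_of(str)
      @seeds.map { |seed| index_for(str, seed) }
    end

    # Assumption: a SHA-256 digest salted with the seed stands in for
    # MurmurHash3; the first 32 bits are reduced to an index.
    def index_for(str, seed)
      Digest::SHA256.hexdigest("#{seed}:#{str}")[0, 8].to_i(16) % @size
    end
  end
end

filter = MaliciousUrl::SeededFilter.new
filter.add("clickformoney.com")
puts filter.test("clickformoney.com") # an added value always tests positive
```

A cryptographic digest is slower than MurmurHash3, so this trade is only for portability of the example.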

  66. a test drive

  67. A benchmark:
      create a bloom filter with 1024 bits
      insert 900 values
      test 2048 values
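The benchmark script itself is not in the transcript. A rough sketch of what such a run could look like, under the assumption that the tested values include the inserted ones, is shown here. `TinyBloom`, `run_benchmark`, and the `value-N` test strings are all made up for the example, and the hash is a seed-prefixed SHA-256 rather than FNV or MurmurHash3, so the counts it prints will not match the speaker's output.

```ruby
require "digest"

# Compact stand-in filter so the benchmark runs on its own.
class TinyBloom
  def initialize(size:, hashes:)
    @size = size
    @hashes = hashes
    @bits = 0
  end

  def add(str)
    indices_of(str).each { |i| @bits |= 1 << i }
  end

  def test(str)
    indices_of(str).all? { |i| @bits[i] == 1 }
  end

  private

  def indices_of(str)
    (0...@hashes).map do |seed|
      Digest::SHA256.hexdigest("#{seed}:#{str}")[0, 8].to_i(16) % @size
    end
  end
end

# Inserts the first `inserted` values, then tests all `tested` values.
# Assumes inserted <= tested, so every inserted value is also tested.
def run_benchmark(size:, hashes:, inserted:, tested:)
  filter = TinyBloom.new(size: size, hashes: hashes)
  values = Array.new(tested) { |i| "value-#{i}" }
  members = values.first(inserted)
  members.each { |v| filter.add(v) }

  positives = values.count { |v| filter.test(v) }
  # Members always test positive, so the surplus is the false positives.
  { positives: positives, false_positives: positives - members.size }
end

{ "V1" => 1, "V2" => 3 }.each do |label, hashes|
  result = run_benchmark(size: 1024, hashes: hashes, inserted: 900, tested: 2048)
  puts "### #{label}"
  puts "Positive tests: #{result[:positives]}. " \
       "False positives: #{result[:false_positives]}."
end
```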
  68. $ ruby benchmark.rb
      ### V1
      Bloom filter size: 1024.
      Inserted values: 900.
      Tested values: 2048.
      Positive tests: 1532.
      False positives: 632.
      ### V2
      Bloom filter size: 1024.
      Inserted values: 900.
      Tested values: 2048.
      Positive tests: 1816.
      False positives: 916.
  70. $ ruby benchmark.rb
      ### V1
      Bloom filter size: 1024.
      Inserted values: 900.
      Tested values: 2048.
      Positive tests: 1532.
      False positives: 632.
      ### V2
      Bloom filter size: 1024.
      Inserted values: 900. (* 3 hash functions = 2700 bit writes)
      Tested values: 2048.
      Positive tests: 1816.
      False positives: 916.
  71. $ ruby benchmark_v2.rb
      ### V1
      Bloom filter size: 1024.
      Inserted values: 300.
      Tested values: 2048.
      Positive tests: 729.
      False positives: 429.
      ### V2
      Bloom filter size: 1024.
      Inserted values: 300.
      Tested values: 2048.
      Positive tests: 627.
      False positives: 327.
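The benchmark numbers above line up with the standard approximation for a bloom filter's false positive rate, p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, n the number of inserted values, and k the number of hash functions. The formula is not on the slides; it is added here as a sanity check, and the method name `expected_false_positive_rate` is made up for the example. In each run, positive tests minus false positives equals the inserted count, so the observed rate is taken over the values that were never inserted (2048 - 900 = 1148, or 2048 - 300 = 1748).

```ruby
# Standard approximation: p ≈ (1 - e^(-k*n/m))^k
def expected_false_positive_rate(bits:, items:, hashes:)
  (1 - Math.exp(-hashes.to_f * items / bits))**hashes
end

[[900, 1], [900, 3], [300, 1], [300, 3]].each do |items, hashes|
  rate = expected_false_positive_rate(bits: 1024, items: items, hashes: hashes)
  puts format("n=%d k=%d -> %.2f", items, hashes, rate)
end
# n=900 k=1 -> 0.58   (slides: 632 / 1148 ≈ 0.55)
# n=900 k=3 -> 0.80   (slides: 916 / 1148 ≈ 0.80)
# n=300 k=1 -> 0.25   (slides: 429 / 1748 ≈ 0.25)
# n=300 k=3 -> 0.20   (slides: 327 / 1748 ≈ 0.19)
```

With 900 values in 1024 bits, three hashes fill the array so heavily that extra hashes hurt; with 300 values the same three hashes help.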
  73. Things to consider: the expected amount of entries influences performance

  74. Things to consider: the number of hash functions influences performance

  75. Things to consider: calculating the optimal size & number of hash functions is a solved problem

  76. Things to consider: calculating the optimal size & number of hash functions is a solved problem
      • false positive rate
      • expected number of items

  77. Things to consider: benchmark, benchmark, benchmark; estimate, estimate, estimate
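The "solved problem" mentioned above usually refers to two textbook formulas: for n expected items and a target false positive rate p, the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. The slides do not spell these out, so treat the sketch below as an illustration of the standard result; `optimal_parameters` is a hypothetical helper name.

```ruby
# m = ceil(-n * ln(p) / (ln 2)^2), k = round((m / n) * ln 2)
def optimal_parameters(expected_items:, false_positive_rate:)
  bits = (-expected_items * Math.log(false_positive_rate) / Math.log(2)**2).ceil
  # Always use at least one hash function.
  hashes = [(bits.to_f / expected_items * Math.log(2)).round, 1].max
  { bits: bits, hashes: hashes }
end

params = optimal_parameters(expected_items: 900, false_positive_rate: 0.01)
puts "bits: #{params[:bits]}, hash functions: #{params[:hashes]}"
```

For 900 items at a 1% target this comes out at roughly 8.6 thousand bits and 7 hash functions, far more room than the 1024 bits used in the benchmark.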

  78. into the wild

  82. id: 1 “fernando” “mendes”; id: 2 “miguel” “palhas”

  83. id: 1 “fernando” “mendes”; id: 2 “miguel” “palhas”; add(“m”) add(“p”)

  85. @fribmendes @frmendes Fernando Mendes