Bloom Filters: A Look Into Ruby

Fernando Mendes

July 29, 2016

Transcript

  1. BLOOM FILTERS

    or: that one time I was hella bored
  2. Bloom Filters Or: How I Learned To Stop Procrastinating And

    Benchmark The Code
  3. THE FiLTERiNG: A MASTERPIECE OF MODERN HORROR

  4. 2016: a space-efficient odyssey

    An epic drama of boredom and exploration
  5. BLOOM FILTERS

    or: that one time I was hella bored
  6. “a bloom filter is a space-efficient probabilistic data structure, conceived

    by Burton Howard Bloom in 1970 (…) a query returns either "possibly in set" or "definitely not in set"” - Wikipedia, 2016
  7. bloom filter

  8. bloom filter do you have the element 3?

  9. bloom filter yeah, probably

  10. bloom filter do you have the element 4?

  11. bloom filter I most certainly do not

  12. bloom filter I most certainly do not “Why do people

    even like this thing?”
  13. add ‘subvisual’

  14. hash(‘subvisual’)

  15. add ‘rubyconf’

  16. hash(‘rubyconf’)

  17. test ‘subvisual’

  18. hash(‘subvisual’) all are 1?

  19. test ‘subvisual’ true

  20. test ‘office’

  21. hash(‘office’) all are 1?

  22. test ‘office’ false

  23. test ‘mirrorconf’

  24. hash(‘mirrorconf’) all are 1?

  25. test ‘mirrorconf’ true
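
The add/test walkthrough on the slides above can be sketched in a few lines of plain Ruby. This is a minimal sketch, not the code from the talk: the 16-slot array, the two SHA-256 based hash functions, and the helper names `indexes`, `add` and `test?` are all assumptions made for illustration.

```ruby
require 'digest'

SIZE = 16                  # tiny on purpose, so collisions show up quickly
BITS = Array.new(SIZE, 0)  # plain Array standing in for a real bit array

# Two hash functions derived from SHA-256 by prefixing a different salt
# (an assumption for this sketch, not the hash used in the talk).
def indexes(str)
  [0, 1].map do |salt|
    Digest::SHA256.hexdigest("#{salt}:#{str}")[0, 8].to_i(16) % SIZE
  end
end

def add(str)
  indexes(str).each { |i| BITS[i] = 1 }
end

# "all are 1?" means possibly in set; any 0 means definitely not in set.
def test?(str)
  indexes(str).all? { |i| BITS[i] == 1 }
end

add('subvisual')
add('rubyconf')

puts test?('subvisual') # true: an added element never tests false

# Look for a word that was never added but still tests true (a false positive).
false_positive = (1..10_000).map { |n| "word#{n}" }.find { |w| test?(w) }
puts false_positive.inspect
```

The last two lines play the role of the ‘mirrorconf’ slide: with so few slots, some word that was never added lands only on bits that are already set.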

  26. test and add
      play with hash functions
      get to say smart stuff like “so I wrote this bloom filter”
  27. diving into it with Ruby

  28. module DumbFilter end

  29. module DumbFilter
        class Array
          def initialize
            @data = []
          end
        end
      end

  30. module DumbFilter
        class Array
          def add(str)
            @data << str
          end
        end
      end

  31. module DumbFilter
        class Array
          def test(str)
            @data.include? str
          end
        end
      end
  32. you don’t play with hash functions
      sequential access
      space wastefulness

  33. module DumbFilter
        class Hash
          def initialize
            @data = {}
          end
        end
      end

  34. module DumbFilter
        class Hash
          def add(str)
            @data[str] = true
          end
        end
      end

  35. module DumbFilter
        class Hash
          def test(str)
            @data[str]
          end
        end
      end

  36. you kinda play with hash functions
      instant access

  37. “a bloom filter is a space-efficient probabilistic data structure, conceived

    by Burton Howard Bloom in 1970 (…) a query returns either "possibly in set" or "definitely not in set"” - Wikipedia, 2016
  38. /peterc/bitarray

  39. def initialize(size: 1024)
        @bits = BitArray.new(size)
        @fnv = FNV.new
        @size = size
      end

  40. def add(str)
        @bits[i(str)] = 1
      end

      def i(str)
        @fnv.fnv1a_64(str) % @size
      end

  41. def test(str)
        @bits[i(str)] == 1
      end
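
To try the single-hash version without installing any gems, something like the following should work. It is a sketch under assumptions: a hand-written FNV-1a (64-bit) replaces the `fnv` gem, a plain Integer used as a bit set replaces `BitArray`, and the class name `SingleHashFilter` and helper `index` are made up for this example.

```ruby
# Pure-Ruby FNV-1a (64-bit), standing in for the fnv gem's fnv1a_64.
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME  = 0x100000001b3
MASK_64    = 0xffffffffffffffff

def fnv1a_64(str)
  # XOR each byte into the hash, then multiply by the prime (mod 2**64).
  str.each_byte.reduce(FNV_OFFSET) do |hash, byte|
    ((hash ^ byte) * FNV_PRIME) & MASK_64
  end
end

class SingleHashFilter
  def initialize(size: 1024)
    @size = size
    @bits = 0 # an Integer used as a bit set (assumption: replaces BitArray)
  end

  def add(str)
    @bits |= (1 << index(str))
  end

  def test(str)
    @bits[index(str)] == 1 # Integer#[] reads a single bit
  end

  private

  def index(str)
    fnv1a_64(str) % @size
  end
end

filter = SingleHashFilter.new
filter.add('subvisual')
puts filter.test('subvisual') # true
```

With a single hash function, two different strings that land on the same bit are indistinguishable, which is what the next slide is getting at.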

  42. you do play with hash functions
      instant access
      space-efficient
      small universe == more collisions
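
A rough way to see the “space-efficient” point: the filter never stores the keys, so its size is fixed by the number of bits. The sketch below compares raw byte counts only and ignores Ruby object overhead; the `entry-N` keys are hypothetical, and 300 entries in 1024 bits is borrowed from the collision-counting slide further down.

```ruby
entries = (1..300).map { |n| "entry-#{n}" } # hypothetical keys

# Storing the keys themselves (as the Array and Hash versions do) costs at
# least the bytes of every string, before any container overhead.
key_bytes = entries.sum(&:bytesize)

# A 1024-bit filter packed into bytes has a fixed size, whatever the keys are.
filter_bytes = 1024 / 8

puts "keys alone: #{key_bytes} bytes"   # keys alone: 2592 bytes
puts "bit array:  #{filter_bytes} bytes" # bit array:  128 bytes
```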
  43. def initialize(size: 1024, iterations: 3)
        @bits = BitArray.new(size)
        @size = size
        @seeds = seed(iterations)
      end
  46. def seed(nr)
        (1..nr).each_with_object([]) do |n, s|
          s << SecureRandom.hex(3).to_i(16)
        end
      end

  47. def hash(str, seed)
        MurmurHash3::V32.str_hash(str, seed)
      end

  48. def i(str)
        @seeds.map { |s| hash(str, s) % @size }
      end

  49. def add(str)
        set i(str)
      end

      def set(indexes)
        indexes.each { |i| @bits[i] = 1 }
      end

  50. def test(str)
        get i(str)
      end

      def get(indexes)
        indexes.all? { |i| @bits[i] == 1 }
      end
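
Put together, the seeded version from the last few slides might look roughly like this. It is a sketch, not the talk's code: SHA-256 over `"seed:str"` is swapped in for MurmurHash3 and an Integer for `BitArray` so that it runs on the standard library alone, and the class name `SeededFilter` and the merged `indexes` helper are made up for this example.

```ruby
require 'digest'
require 'securerandom'

class SeededFilter
  def initialize(size: 1024, iterations: 3)
    @size  = size
    @bits  = 0 # Integer as bit set (assumption: replaces BitArray)
    # One random seed per hash function, as on the seed slide.
    @seeds = Array.new(iterations) { SecureRandom.hex(3).to_i(16) }
  end

  def add(str)
    indexes(str).each { |i| @bits |= (1 << i) }
  end

  def test(str)
    indexes(str).all? { |i| @bits[i] == 1 }
  end

  private

  # One index per seed; SHA-256 replaces MurmurHash3 in this sketch.
  def indexes(str)
    @seeds.map do |seed|
      Digest::SHA256.hexdigest("#{seed}:#{str}")[0, 16].to_i(16) % @size
    end
  end
end

filter = SeededFilter.new
%w[subvisual rubyconf].each { |word| filter.add(word) }
puts filter.test('subvisual') # true
```

Because the seeds are random per instance, two filters built from the same words will generally set different bits, so false positives differ from run to run.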
  51. demo (yes, yet another goddamned Rails blog app)

  54. test-drive

  55. 5 million random inserts
      probabilistic universe of 10 million
      5 million random accesses
      /igrigorik/bloomfilter-rb
  56. fnv is really slow
      ruby string hashing is optimized
      bloomfilter-rb uses C extensions
  57. Collision counting
      ruby’s hash is not probabilistic nor space-efficient
      “what about bf_v2’s poor result?”
  58. you do play with hash functions
      instant access
      space-efficient
      small universe == more collisions
  59. Collision counting: 1024 bits & 300 entries
      optimal number of hash functions: m (bits) / n (entries) * ln(2)
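
Plugging the slide's numbers into that formula is a one-liner. The sketch below also prints an estimated false-positive rate for a few values of k; that estimate uses the standard approximation (1 - e^(-kn/m))^k, which is not on the slides, and the helper name `false_positive_rate` is made up for this example.

```ruby
m = 1024 # bits
n = 300  # entries

optimal_k = (m.to_f / n) * Math.log(2)
puts optimal_k.round(2) # 2.37

# Standard approximation of the false-positive rate with k hash functions
# (an addition for illustration, not taken from the talk).
def false_positive_rate(m, n, k)
  (1 - Math.exp(-k * n.to_f / m))**k
end

(1..5).each do |k|
  puts format('k=%d  p=%.3f', k, false_positive_rate(m, n, k))
end
```

For this configuration the formula points at two or three hash functions; going well beyond that fills the bit array faster than it helps.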
  60. in the field

  61. Article tailoring - Quora & Medium
      Type-ahead queries - Facebook
      I/O Filter - Apache HBase
      Malicious URL Check - bit.ly
      Checking node communications in IoT sensors
  62. BLOOM FILTERS

    or: that one time I was hella bored