Bloom Filters: A Look Into Ruby

Fernando Mendes

July 29, 2016

Transcript

  1. BLOOM
    FILTERS
    or: that one time
    I was hella bored

  2. Bloom Filters
    Or:
    How
    I Learned
    To
    Stop
    Procrastinating
    And
    Benchmark
    The
    Code

  3. THE FiLTERiNG
    A MASTERPIECE
    OF MODERN HORROR

  4. 2016: a space-efficient odyssey
    An epic drama of boredom
    and exploration

  5. BLOOM
    FILTERS
    or: that one time
    I was hella bored

  6. “a bloom filter is a space-efficient probabilistic
    data structure, conceived by Burton Howard
    Bloom in 1970 (…) a query returns either
    "possibly in set" or "definitely not in set"”
    - Wikipedia, 2016

  7. bloom filter

  8. bloom filter
    do you have
    the element 3?

  9. bloom filter
    yeah, probably

  10. bloom filter
    do you have
    the element 4?

  11. bloom filter
    I most
    certainly do not

  12. bloom filter
    I most
    certainly do not
    “Why do people
    even like this thing?”

  13. add ‘subvisual’

  14. hash(‘subvisual’)

  15. add ‘rubyconf’

  16. hash(‘rubyconf’)

  17. test ‘subvisual’

  18. hash(‘subvisual’)
    all are 1?

  19. test ‘subvisual’
    true

  20. test ‘office’

  21. hash(‘office’)
    all are 1?

  22. test ‘office’
    false

  23. test ‘mirrorconf’

  24. hash(‘mirrorconf’)
    all are 1?

  25. test ‘mirrorconf’
    true
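
The add/test walkthrough above, including the false positive on an element that was never added, can be reproduced with a toy filter. This is only a sketch: the class name `ToyBloom`, the 16-slot size and the two deliberately weak hash functions are assumptions made so the collision can be checked by hand, not something taken from the slides.

```ruby
# Toy Bloom filter. The two hash functions are intentionally weak (they only
# look at the byte sum and the length), so collisions are easy to verify.
class ToyBloom
  def initialize(size = 16)
    @size = size
    @bits = Array.new(size, 0)
  end

  # Set every slot the element hashes to.
  def add(str)
    indexes(str).each { |i| @bits[i] = 1 }
  end

  # true means "possibly in set", false means "definitely not in set".
  def test(str)
    indexes(str).all? { |i| @bits[i] == 1 }
  end

  private

  def indexes(str)
    sum = str.bytes.sum
    [sum % @size, (sum * 7 + str.length) % @size]
  end
end

toy = ToyBloom.new
toy.add('ab')
puts toy.test('ab') # true: it was added
puts toy.test('ba') # true: never added, but same byte sum and length as 'ab'
puts toy.test('zz') # false: its first slot was never set
```

'ab' and 'ba' land on the same two slots, so the filter cannot tell them apart. That is the same situation as the ‘mirrorconf’ test returning true.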

  26. test and add
    play with hash functions
    get to say smart stuff like
    “so I wrote this bloom filter”

  27. diving into it
    with Ruby

  28. module DumbFilter
    end

  29. module DumbFilter
      class Array
        def initialize
          @data = []
        end
      end
    end

  30. module DumbFilter
      class Array
        def add(str)
          @data << str
        end
      end
    end

  31. module DumbFilter
      class Array
        def test(str)
          @data.include? str
        end
      end
    end

  32. you don’t play with hash functions
    sequential access
    space wastefulness
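
Put together, the three Array slides give the class below. The `stored_bytes` helper is a hypothetical addition, not part of the talk; it is only there to make the space cost visible.

```ruby
module DumbFilter
  class Array
    def initialize
      @data = []
    end

    def add(str)
      @data << str
    end

    # Walks the array element by element until it finds a match.
    def test(str)
      @data.include?(str)
    end

    # Hypothetical helper: every added string is kept in full.
    def stored_bytes
      @data.sum(&:bytesize)
    end
  end
end

filter = DumbFilter::Array.new
filter.add('subvisual')
filter.add('rubyconf')
puts filter.test('subvisual') # true
puts filter.test('office')    # false
puts filter.stored_bytes      # 17 (9 + 8 bytes of string data)
```

Answers are always exact, but lookup time and memory both grow with the number and size of the stored strings.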

  33. module DumbFilter
      class Hash
        def initialize
          @data = {}
        end
      end
    end

  34. module DumbFilter
      class Hash
        def add(str)
          @data[str] = true
        end
      end
    end

  35. module DumbFilter
      class Hash
        def test(str)
          @data.key?(str) # true or false, never nil
        end
      end
    end

  36. you kinda play with hash functions
    instant access
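
A rough way to see "sequential access" versus "instant access" is to time the same lookups against an Array and a Hash. The sizes and key names below are arbitrary assumptions and the timings will vary by machine, so treat this as a sketch rather than a benchmark.

```ruby
require 'benchmark'

words = Array.new(20_000) { |n| "word-#{n}" }

array_store = words.dup
hash_store  = words.each_with_object({}) { |w, h| h[w] = true }

# Half of the probes exist (near the end of the array), half do not.
probes = words.last(200) + Array.new(200) { |n| "missing-#{n}" }

array_time = Benchmark.realtime { probes.each { |w| array_store.include?(w) } }
hash_time  = Benchmark.realtime { probes.each { |w| hash_store.key?(w) } }

puts format('Array#include?: %.4fs', array_time)
puts format('Hash#key?:      %.4fs', hash_time)
```

Both stores give the same answers. The Hash gets there by hashing the key instead of scanning, which is the "kinda play with hash functions" part.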

  37. “a bloom filter is a space-efficient probabilistic
    data structure, conceived by Burton Howard
    Bloom in 1970 (…) a query returns either
    "possibly in set" or "definitely not in set"”
    - Wikipedia, 2016

  38. /peterc/bitarray

  39. def initialize(size: 1024)
      @bits = BitArray.new(size)
      @fnv = FNV.new
      @size = size
    end

  40. def add(str)
      @bits[i(str)] = 1
    end
    def i(str)
      @fnv.fnv1a_64(str) % @size
    end

  41. def test(str)
      @bits[i(str)] == 1
    end

  42. you do play with hash functions
    instant access
    space-efficient
    small universe == more collisions
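
The single-hash version can be tried without installing any gems. The sketch below swaps in two stand-ins, which are assumptions rather than the talk's code: a pure-Ruby FNV-1a (64-bit) function in place of the `fnv` gem, and a plain Integer used as a bit field in place of `BitArray`. The class name `SingleHashFilter` is also made up for illustration.

```ruby
FNV_OFFSET_64 = 0xcbf29ce484222325
FNV_PRIME_64  = 0x100000001b3
MASK_64       = 0xffffffffffffffff

# FNV-1a, 64-bit: XOR each byte into the hash, multiply by the prime,
# and keep only the low 64 bits.
def fnv1a_64(str)
  str.each_byte.reduce(FNV_OFFSET_64) do |hash, byte|
    ((hash ^ byte) * FNV_PRIME_64) & MASK_64
  end
end

class SingleHashFilter
  def initialize(size: 1024)
    @size = size
    @bits = 0 # Integer used as a bit field, one bit per slot
  end

  def add(str)
    @bits |= (1 << index(str))
  end

  def test(str)
    @bits[index(str)] == 1 # Integer#[] reads a single bit
  end

  private

  def index(str)
    fnv1a_64(str) % @size
  end
end

filter = SingleHashFilter.new
filter.add('subvisual')
puts filter.test('subvisual') # true

# The extreme small universe: one slot, so everything collides.
tiny = SingleHashFilter.new(size: 1)
tiny.add('subvisual')
puts tiny.test('office') # true, a false positive
```

With `size: 1` every string maps to slot 0, which is the "small universe == more collisions" point taken to its limit.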

  43. def initialize(size: 1024, iterations: 3)
      @bits = BitArray.new(size)
      @size = size
      @seeds = seed(iterations)
    end

  44. def initialize(size: 1024, iterations: 3)
      @bits = BitArray.new(size)
      @size = size
      @seeds = seed(iterations)
    end

  45. def initialize(size: 1024, iterations: 3)
      @bits = BitArray.new(size)
      @size = size
      @seeds = seed(iterations)
    end

  46. def seed(nr)
      # nr random 24-bit seeds, one per hash function
      Array.new(nr) { SecureRandom.hex(3).to_i(16) }
    end

  47. def hash(str, seed)
      MurmurHash3::V32.str_hash(str, seed)
    end

  48. def i(str)
      @seeds.map { |s| hash(str, s) % @size }
    end

  49. def add(str)
      set i(str)
    end
    def set(indexes)
      indexes.each { |i| @bits[i] = 1 }
    end

  50. def test(str)
      get i(str)
    end
    def get(indexes)
      indexes.all? { |i| @bits[i] == 1 }
    end
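
Assembled into one class, the seeded version might look like the sketch below. To keep it runnable with only the standard library, two pieces are swapped out and should be read as assumptions: an Array of 0/1 replaces `BitArray`, and a seed-prefixed MD5 digest (`seeded_hash`) stands in for `MurmurHash3::V32.str_hash`. The class name `SketchBloomFilter` is hypothetical.

```ruby
require 'digest/md5'
require 'securerandom'

class SketchBloomFilter
  def initialize(size: 1024, iterations: 3)
    @size  = size
    @bits  = Array.new(size, 0)
    # One random 24-bit seed per hash function, as in the slides.
    @seeds = Array.new(iterations) { SecureRandom.hex(3).to_i(16) }
  end

  def add(str)
    indexes(str).each { |i| @bits[i] = 1 }
  end

  def test(str)
    indexes(str).all? { |i| @bits[i] == 1 }
  end

  private

  # Stand-in for a seeded Murmur hash: mix the seed into the input and read
  # the first 4 digest bytes as an unsigned 32-bit integer.
  def seeded_hash(str, seed)
    Digest::MD5.digest("#{seed}:#{str}").unpack1('N')
  end

  def indexes(str)
    @seeds.map { |seed| seeded_hash(str, seed) % @size }
  end
end

filter = SketchBloomFilter.new
%w[subvisual rubyconf].each { |word| filter.add(word) }
puts filter.test('subvisual') # true: added elements always test true
puts filter.test('office')    # most likely false; true would be a false positive
```

The helper is called `seeded_hash` here on purpose. A two-argument instance method named `hash`, as on the slide, overrides `Object#hash`, so using the filter object as a Hash key would raise an ArgumentError.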

  51. demo
    (yes, yet another goddamned Rails blog app)

  52. View Slide

  53. View Slide

  54. test-drive

  55. 5 million random inserts
    probabilistic universe of 10 million
    5 million random accesses
    /igrigorik/bloomfilter-rb
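
A scaled-down harness in the spirit of this test-drive could look like the following. Everything specific in it is an assumption: the sizes are far smaller than the 5 million / 10 million used in the talk so that it finishes quickly, and `CrcFilter` is a minimal single-hash filter built on `Zlib.crc32`, included only to give the harness something to measure.

```ruby
require 'benchmark'
require 'set'
require 'zlib'

UNIVERSE = 100_000
INSERTS  = 5_000
LOOKUPS  = 5_000

# Minimal single-hash filter; Zlib.crc32 is a stand-in hash from the stdlib.
class CrcFilter
  def initialize(size)
    @size = size
    @bits = Array.new(size, false)
  end

  def add(str)
    @bits[Zlib.crc32(str) % @size] = true
  end

  def test(str)
    @bits[Zlib.crc32(str) % @size]
  end
end

rng      = Random.new(42) # fixed seed so runs are repeatable
inserted = Array.new(INSERTS) { rng.rand(UNIVERSE).to_s }
probes   = Array.new(LOOKUPS) { rng.rand(UNIVERSE).to_s }
truth    = Set.new(inserted)
filter   = CrcFilter.new(65_536)

insert_time = Benchmark.realtime { inserted.each { |s| filter.add(s) } }
lookup_time = Benchmark.realtime { probes.each { |s| filter.test(s) } }

# A false positive is a probe the filter accepts that was never inserted.
absent          = probes.reject { |s| truth.include?(s) }
false_positives = absent.count { |s| filter.test(s) }

puts format('inserts: %.4fs, lookups: %.4fs', insert_time, lookup_time)
puts format('false positives: %d of %d absent probes', false_positives, absent.size)
```

Keeping an exact `Set` next to the filter is what makes collision counting possible: the filter alone cannot say which of its "yes" answers are wrong.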

  56. fnv is really slow
    ruby string hashing is optimized
    bloomfilter-rb uses C extensions

  57. Collision counting
    ruby’s hash is not probabilistic
    nor space-efficient
    “what about bf_v2’s poor result?”

  58. you do play with hash functions
    instant access
    space-efficient
    small universe == more collisions

  59. Collision counting:
    1024 bits & 300 entries
    optimal number of hash functions:
    m(bits)/n(entries) * ln(2)
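
Plugging the slide's numbers into the formula takes a few lines. The method names are hypothetical, and `false_positive_rate` uses the standard approximation (1 - e^(-k*n/m))^k, which is not on the slide and is added here only for context.

```ruby
# Optimal number of hash functions: k = (m / n) * ln(2)
def optimal_hashes(bits, entries)
  (bits.to_f / entries) * Math.log(2)
end

# Approximate false-positive rate with k hash functions: (1 - e^(-k*n/m))^k
def false_positive_rate(bits, entries, hashes)
  (1 - Math.exp(-hashes * entries.to_f / bits))**hashes
end

m = 1024 # bits
n = 300  # entries
k = optimal_hashes(m, n)

puts format('optimal k = %.2f (use %d)', k, k.round)
(1..4).each do |h|
  puts format('k=%d -> p = %.3f', h, false_positive_rate(m, n, h))
end
```

For 1024 bits and 300 entries the optimum comes out at about 2.37, so two hash functions. The default of three iterations is close, at roughly a 20% false-positive rate in both cases, and a filter this full needs more bits rather than more hash functions.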

  60. in the field

  61. Article tailoring - Quora & Medium
    Type-ahead queries - Facebook
    I/O Filter - Apache HBase
    Malicious URL Check - bit.ly
    Checking node communications in IoT sensors

  62. BLOOM
    FILTERS
    or: that one time
    I was hella bored
