A Look Into Bloom Filters

@fribmendes @frmendes

@cesiuminho

@coderdojominho

We design and develop thoughtful digital products. BRAGA & BOSTON

@mirrorconf @rubyconfpt

wat the wtf is a bloom filter

“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016

A funky array with hash functions that’s supposed to be really really small.

do you have ‘abc’ in there?
bloom filter: i definitely do not

how about some ‘xyz’?
bloom filter: i mean, yeah, probably

Can I visit “pixels.camp”? → SERVER

Can I visit “pixels.camp”? → CLIENT (bloom filter) → SERVER

Pre-filling the bloom filter

add(‘totallynotfake.com’)

hash(‘totallynotfake.com’)

hash(‘clickformoney.com’)

Can I visit “pixels.camp”? → CLIENT
CLIENT: hash(‘pixels.camp’)
CLIENT: yes!

Can I visit “github.com”? → CLIENT
CLIENT: hash(‘github.com’)
CLIENT: nope.

Can I visit “github.com”? → SERVER
SERVER: you’re good to go
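The exchange above can be sketched as a two-step check: the client consults its local filter first and only asks the server when the filter answers "possibly in set". This is a rough sketch rather than the talk's code: `TinyFilter`, `BLOCKLIST`, `server_allows?` and `can_visit?` are hypothetical names, a plain `Array` holds the bits, and `Zlib.crc32` stands in for a real hash function.

```ruby
require "set"
require "zlib"

# Hypothetical stand-in for the bloom filter shipped to the client.
class TinyFilter
  def initialize(size: 64)
    @size = size
    @bits = Array.new(size, 0)
  end

  def add(str)
    @bits[index_of(str)] = 1
  end

  # true means "possibly in set", false means "definitely not in set".
  def test(str)
    @bits[index_of(str)] == 1
  end

  private

  def index_of(str)
    Zlib.crc32(str) % @size
  end
end

# The server keeps the full list (example data borrowed from the slides).
BLOCKLIST = Set["totallynotfake.com", "clickformoney.com"]

def server_allows?(url)
  !BLOCKLIST.include?(url)
end

# Returns [allowed, who_answered].
def can_visit?(url, filter)
  return [true, :client] unless filter.test(url) # definitely not malicious
  [server_allows?(url), :server]                 # possibly malicious: double-check
end

filter = TinyFilter.new
BLOCKLIST.each { |url| filter.add(url) }

allowed, answered_by = can_visit?("pixels.camp", filter)
puts "pixels.camp: allowed=#{allowed} (answered by #{answered_by})"
```

A false positive in the filter only costs one extra round trip to the server; a blocklisted URL can never slip through, because anything that was added always tests positive.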

“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016

Things to consider: bloom filters do inclusion testing

Things to consider: bloom filters turn big data into tiny data

Things to consider: bloom filters turn false into true

Things to consider: your application must allow false positives
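The last two points can be seen in a few lines. The sketch below uses a deliberately tiny 8-bit array so that collisions are easy to find; `SIZE`, `index_of` and the `candidate-N` strings are made up for illustration, and `Zlib.crc32` is only a convenient stand-in hash.

```ruby
require "zlib"

SIZE = 8 # deliberately tiny so collisions are easy to find
bits = Array.new(SIZE, 0)
index_of = ->(str) { Zlib.crc32(str) % SIZE }

# "Insert" one value by setting its bit.
bits[index_of.call("abc")] = 1

# Anything inserted always tests positive: no false negatives.
puts bits[index_of.call("abc")] == 1 # prints true

# Look for a value that was never inserted but lands on the same bit.
false_positive = (0...1000).map { |i| "candidate-#{i}" }
                           .find { |s| bits[index_of.call(s)] == 1 }
puts "never inserted, still reported as present: #{false_positive.inspect}"
```

The filter turns a "false" into a "true" for that candidate, which is why the calling application has to tolerate (or double-check) positive answers.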

diving into it

module MaliciousUrl
  class Filter
    def initialize
      @filter = Hash.new
    end

    # Remember the URL as a key.
    def add(url)
      @filter[url] = true
    end

    # Returns true for a known URL, nil otherwise.
    def test(url)
      @filter[url]
    end
  end
end

instant access™
space complexity: saving key-value tuples
solution: bit arrays
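To make the space argument concrete, here is a rough pure-Ruby sketch of a bit array packed into a binary string, one bit per slot. `PackedBits` is a hypothetical name used only for illustration; it is not the bit array class used in the talk, whose internals may differ.

```ruby
# Hypothetical minimal bit array: 1 bit per slot, packed 8 to a byte.
class PackedBits
  attr_reader :size

  def initialize(size)
    @size = size
    @bytes = ("\x00" * ((size + 7) / 8)).b # binary string, all bits cleared
  end

  # Set (value == 1) or clear (anything else) the bit at index.
  def []=(index, value)
    byte, offset = index.divmod(8)
    current = @bytes.getbyte(byte)
    updated = value == 1 ? current | (1 << offset) : current & ~(1 << offset)
    @bytes.setbyte(byte, updated & 0xff)
  end

  # Returns 0 or 1.
  def [](index)
    byte, offset = index.divmod(8)
    (@bytes.getbyte(byte) >> offset) & 1
  end

  def bytesize
    @bytes.bytesize
  end
end

bits = PackedBits.new(1024)
bits[42] = 1
puts bits[42]      # 1
puts bits[43]      # 0
puts bits.bytesize # 128
```

1024 slots fit in 128 bytes regardless of how long the inserted strings are, whereas the Hash version has to keep every key around.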

require "bitarray" # assumed source of BitArray (bitarray gem)
require "fnv"      # assumed source of FNV (fnv gem)

module MaliciousUrl
  class Filter
    def initialize(size: 1024)
      @bits = BitArray.new(size)
      @fnv = FNV.new
      @size = size
    end

    # Map a string to a bit index in 0...@size.
    def hash(str)
      @fnv.fnv1a_32(str) % @size
    end

    def add(str)
      index = hash(str)
      @bits[index] = 1
    end

    def test(str)
      index = hash(str)
      @bits[index] == 1
    end
  end
end
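For readers without the gems at hand, the single-hash version can be approximated with the standard library only. This is a sketch under assumptions: `fnv1a_32` below is a hand-written 32-bit FNV-1a (offset basis 2166136261, prime 16777619), a plain `Array` stands in for `BitArray`, and `SingleHashFilter` and `index_of` are hypothetical names.

```ruby
# Hand-written 32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
def fnv1a_32(str)
  str.each_byte.reduce(2_166_136_261) do |hash, byte|
    ((hash ^ byte) * 16_777_619) & 0xffffffff
  end
end

class SingleHashFilter
  def initialize(size: 1024)
    @size = size
    @bits = Array.new(size, 0) # stand-in for BitArray
  end

  def add(str)
    @bits[index_of(str)] = 1
  end

  def test(str)
    @bits[index_of(str)] == 1
  end

  private

  # Named index_of rather than hash to avoid shadowing Object#hash.
  def index_of(str)
    fnv1a_32(str) % @size
  end
end

filter = SingleHashFilter.new
filter.add("totallynotfake.com")
puts filter.test("totallynotfake.com") # true
```

Only the bit position is stored, so a lookup is one hash computation plus one array read, whatever the size of the original URL list.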

instant access™
space-efficiency
small universe == more collisions
solution: more hashes

require "securerandom"
require "murmurhash3" # assumed source of MurmurHash3 (murmurhash3 gem)

def initialize(size: 1024, iterations: 3)
  @bits = BitArray.new(size)
  @size = size
  @seeds = seed(iterations)
end

# One random seed per hash function.
def seed(n)
  seeds = []
  n.times do
    seed = SecureRandom.hex(3).to_i(16)
    seeds.push(seed)
  end
  seeds
end

# The same thing, because Ruby:
def seed(iterations)
  (1..iterations).map do
    SecureRandom.hex(3).to_i(16)
  end
end

def hash(str, seed)
  hash = MurmurHash3::V32.str_hash(str, seed)
  hash % @size
end

# One bit index per seed.
def indices_of(str)
  @seeds.map { |seed| hash(str, seed) }
end

def add(str)
  indices_of(str).each { |i| @bits[i] = 1 }
end

# Positive only if every one of the bits is set.
def test(str)
  indices_of(str).all? { |i| @bits[i] == 1 }
end
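The multi-hash version can also be tried with the standard library alone. In the sketch below, which is an approximation and not the talk's implementation, `SeededFilter` is a hypothetical name, a plain `Array` replaces the bit array, and the first four bytes of an MD5 digest of `"seed:string"` stand in for a seeded MurmurHash3.

```ruby
require "digest"

class SeededFilter
  def initialize(size: 1024, seeds: [1, 2, 3])
    @size = size
    @seeds = seeds
    @bits = Array.new(size, 0)
  end

  def add(str)
    indices_of(str).each { |i| @bits[i] = 1 }
  end

  # Positive only if all of the string's bits are set.
  def test(str)
    indices_of(str).all? { |i| @bits[i] == 1 }
  end

  # One index per seed; the seed is mixed into the hashed input.
  def indices_of(str)
    @seeds.map do |seed|
      Digest::MD5.digest("#{seed}:#{str}").unpack1("N") % @size
    end
  end
end

filter = SeededFilter.new
filter.add("clickformoney.com")
puts filter.test("clickformoney.com") # true
puts filter.indices_of("clickformoney.com").inspect
```

Fixed seeds are used here so that two filters built from the same data agree bit for bit, which matters if a server builds the filter and a client queries it. Seeds drawn at random per instance, as in the slides, would have to be shipped along with the bits.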

a test drive

A benchmark
create a bloom filter with 1024 bits
insert 900 values
test 2048 values
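A benchmark along these lines could look like the sketch below. It is not the `benchmark.rb` from the talk: `BenchFilter` and `run_benchmark` are hypothetical, the values are generated strings, and MD5 replaces the hash functions used in the slides, so the exact counts will differ from the ones reported in the talk.

```ruby
require "digest"

# Hypothetical filter with a configurable number of hashes.
class BenchFilter
  def initialize(size:, hashes:)
    @size = size
    @hashes = hashes
    @bits = Array.new(size, 0)
  end

  def add(str)
    indices_of(str).each { |i| @bits[i] = 1 }
  end

  def test(str)
    indices_of(str).all? { |i| @bits[i] == 1 }
  end

  private

  def indices_of(str)
    (0...@hashes).map do |seed|
      Digest::MD5.digest("#{seed}:#{str}").unpack1("N") % @size
    end
  end
end

def run_benchmark(size:, hashes:, inserted:, tested:)
  values = (0...tested).map { |i| "value-#{i}" }
  members = values.first(inserted) # the tested values include the inserted ones

  filter = BenchFilter.new(size: size, hashes: hashes)
  members.each { |v| filter.add(v) }

  positives = values.count { |v| filter.test(v) }
  # Every member tests positive, so the surplus is all false positives.
  { positives: positives, false_positives: positives - inserted }
end

[1, 3].each do |k|
  result = run_benchmark(size: 1024, hashes: k, inserted: 900, tested: 2048)
  puts "hashes: #{k}. Positive tests: #{result[:positives]}. " \
       "False positives: #{result[:false_positives]}."
end
```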

$ ruby benchmark.rb
### V1
Bloom filter size: 1024.
Inserted values: 900.
Tested values: 2048.
Positive tests: 1532.
False positives: 632.
### V2
Bloom filter size: 1024.
Inserted values: 900. * 3 = 2700
Tested values: 2048.
Positive tests: 1816.
False positives: 916.

$ ruby benchmark_v2.rb
### V1
Bloom filter size: 1024.
Inserted values: 300.
Tested values: 2048.
Positive tests: 729.
False positives: 429.
### V2
Bloom filter size: 1024.
Inserted values: 300.
Tested values: 2048.
Positive tests: 627.
False positives: 327.
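The two runs above line up with the usual approximation for a bloom filter's false positive rate, p ≈ (1 - e^(-kn/m))^k, for m bits, n inserted items and k hash functions. The helper below is a small sketch for checking that; `expected_false_positive_rate` is a hypothetical name, and the formula assumes well-distributed, independent hashes.

```ruby
# Approximate false positive rate for m bits, n inserted items, k hashes:
#   p ≈ (1 - e^(-k * n / m))^k
def expected_false_positive_rate(m:, n:, k:)
  (1 - Math.exp(-k * n.to_f / m))**k
end

[[900, 1], [900, 3], [300, 1], [300, 3]].each do |n, k|
  rate = expected_false_positive_rate(m: 1024, n: n, k: k)
  puts format("n=%d k=%d expected rate: %.2f", n, k, rate)
end
```

The estimates come out at roughly 0.58 and 0.80 for 900 items (one and three hashes) and 0.25 and 0.20 for 300 items. Assuming the tested values include the inserted ones, as the counts suggest, the measured rates are 632/1148 ≈ 0.55, 916/1148 ≈ 0.80, 429/1748 ≈ 0.25 and 327/1748 ≈ 0.19. With 900 items three hashes saturate the 1024 bits and hurt; with 300 items they help.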

Things to consider: the expected number of entries influences performance

Things to consider: the number of hash functions influences performance

Things to consider: calculating the optimal size & number of hash functions is a solved problem
• false positive rate
• expected number of items
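The standard sizing formulas take exactly those two inputs: for n expected items and a target false positive rate p, the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. A minimal sketch, with `optimal_parameters` and its keyword names chosen only for illustration:

```ruby
# Standard sizing formulas:
#   m = -n * ln(p) / (ln 2)^2   bits
#   k = (m / n) * ln 2          hash functions
def optimal_parameters(expected_items:, false_positive_rate:)
  ln2 = Math.log(2)
  m = (-expected_items * Math.log(false_positive_rate) / (ln2**2)).ceil
  k = (m.to_f / expected_items * ln2).round
  { bits: m, hashes: [k, 1].max } # always use at least one hash
end

p optimal_parameters(expected_items: 900, false_positive_rate: 0.01)
```

For 900 items at a 1% target this suggests a bit over 8,600 bits and 7 hash functions, a long way from the 1024 bits used in the benchmark.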

Things to consider: benchmark, benchmark, benchmark. estimate, estimate, estimate.

into the wild

id: 1 “fernando” “mendes”
id: 2 “miguel” “palhas”
add(“m”) add(“p”)

@fribmendes @frmendes Fernando Mendes
