Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A Look Into Bloom Filters
Search
Fernando Mendes
October 07, 2016
Programming
530
0
Share
A Look Into Bloom Filters
Fernando Mendes
October 07, 2016
More Decks by Fernando Mendes
See All by Fernando Mendes
you. and the morals of technology
fribmendes
1
140
Knee-Deep Into P2P: A Tale of Fail (PWL Porto)
fribmendes
0
69
Knee-Deep Into P2P: A Tale of Fail (ElixirConf EU 2018 version)
fribmendes
0
180
Knee-Deep Into P2P: A Tale of Fail (non-Elixir)
fribmendes
0
200
Bloom Filters: A Look Into Ruby
fribmendes
0
130
Programming WTF: HTML & CSS
fribmendes
4
170
Ruby: A (pointless) Workshop
fribmendes
1
170
Elixir: A Talk For College Students
fribmendes
0
180
Riding Rails
fribmendes
0
110
Other Decks in Programming
See All in Programming
サプライチェーン攻撃対策「層を重ねて落ちない壁」を10日間で組み上げた話 #TechLeadConf2026
kashewnuts
1
260
運転動画を検索可能にする〜Cosmos-Embed1とDatabricks Vector Searchで〜/cosmos-embed1-databricks-vector-search
studio_graph
2
850
Liberating Ruby's Parser from Lexer Hacks
ydah
2
2.7k
要はバランスからの卒業 #yumemi_grow
kajitack
0
150
My daily life on Ruby
a_matsuda
3
330
クラウドネイティブなエンジニアに向ける Raycastの魅力と実際の活用事例
nealle
2
260
Kingdom of the Machine
yui_knk
2
1.5k
Import assertionsが消えた日~ECMAScriptの仕様はどう決まり、なぜ覆るのか~
bicstone
2
180
〜バイブコーディングを超えて〜 チームで実験し続けたAI駆動開発
tigertora7571
0
200
属人化しないコード品質の作り方_2026.04.07.pdf
muraaano
0
350
Firefoxにコントリビューションして得られた学び
ken7253
2
160
ふにゃっとしない名前の付け方 〜哲学で茹で上げる、コシのあるソフトウェア設計〜
shimomura
0
120
Featured
See All Featured
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Leveraging LLMs for student feedback in introductory data science courses - posit::conf(2025)
minecr
1
250
It's Worth the Effort
3n
188
29k
Reality Check: Gamification 10 Years Later
codingconduct
0
2.1k
Agile Actions for Facilitating Distributed Teams - ADO2019
mkilby
0
190
Bootstrapping a Software Product
garrettdimon
PRO
307
120k
Deep Space Network (abreviated)
tonyrice
0
130
Exploring anti-patterns in Rails
aemeredith
3
350
Performance Is Good for Brains [We Love Speed 2024]
tammyeverts
12
1.6k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1.1k
The Curse of the Amulet
leimatthew05
1
12k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
47
8.1k
Transcript
bloom filters a look into
a look into bloom filters
@fribmendes @frmendes
@cesiuminho
@cesiuminho
@coderdojominho
We design and develop thoughtful digital products. BRAGA & BOSTON
@mirrorconf @rubyconfpt
wat the wtf is a bloom filter
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
A funky array with hash functions that’s supposed to be
really really small.
bloom filter do you have ‘abc’ in there?
bloom filter i definitely do not do you have ‘abc’
in there?
how about some ‘xyz’? bloom filter i definitely do not
i mean, yeah, probably bloom filter how about some ‘xyz’?
SERVER
Can I visit “pixels.camp”? SERVER
SERVER Can I visit “pixels.camp”?
Can I visit “pixels.camp”? SERVER CLIENT bloom filter
Pre-filling the bloom filter
add(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘clickformoney.com’)
Can I visit “pixels.camp”? CLIENT
hash(‘pixels.camp’) Can I visit “pixels.camp”? CLIENT
yes! Can I visit “pixels.camp”? CLIENT
Can I visit “github.com”? CLIENT
hash(‘github.com’) CLIENT Can I visit “github.com”?
nope. Can I visit “github.com”? CLIENT
SERVER Can I visit “github.com”?
you’re good to go Can I visit “github.com”? SERVER
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
Things to consider: bloom filters do inclusion testing
Things to consider: bloom filters turn big data into tiny
data
Things to consider: bloom filters turn false into true
Things to consider: your application must allow false positives
diving into it
module MaliciousUrl class Filter end end
module MaliciousUrl class Filter def initialize @filter = Hash.new end
end end
module MaliciousUrl class Filter def add(url) @filter[url] = true end
end end
module MaliciousUrl class Filter def test(url) @filter[url] end end end
instant access™
instant access™ space complexity: saving key-value tuples
instant access™ space complexity: saving key-value tuples solution: bit arrays
module MaliciousUrl class Filter def initialize(size: 1024) @bits = BitArray.new(size)
@fnv = FNV.new @size = size end end end
module MaliciousUrl class Filter def hash(str) @fnv.fnv1a_32(str) % @size end
end end
module MaliciousUrl class Filter def add(str) index = hash(str) @bits[index]
= 1 end end end
module MaliciousUrl class Filter def test(str) index = hash(str) @bits[index]
== 1 end end end
instant access™
instant access™ space-efficiency
instant access™ space-efficiency small universe == more collisions
instant access™ space-efficiency small universe == more collisions solution: more
hashes
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def seed(n) seeds = [] n.times do seed = SecureRandom.hex(3).to_i(16)
seeds.push(seed) end seeds end
def seed(iterations) (1..iterations).map do SecureRandom.hex(3).to_i(16) end end because Ruby
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def hash(str, seed) hash = MurmurHash3::V32.str_hash(str, seed) hash % @size
end
def indices_of(str) @seeds.map { |seed| hash(str, seed) } end
def add(str) indices_of(str).each { |i| @bits[i] = 1 } end
def test(str) indices_of(str).all? { |i| @bits[i] == 1 } end
a test drive
A benchmark create a bloom filter with 1024 bits insert
900 values test 2048 values
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. * 3 = 2700 Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark_v2.rb ### V1 Bloom filter size: 1024. Inserted
values: 300. Tested values: 2048. Positive tests: 729. False positives: 429. ### V2 Bloom filter size: 1024. Inserted values: 300. Tested values: 2048. Positive tests: 627. False positives: 327.
$ ruby benchmark_v2.rb ### V1 Bloom filter size: 1024. Inserted
values: 300. Tested values: 2048. Positive tests: 729. False positives: 429. ### V2 Bloom filter size: 1024. Inserted values: 300. Tested values: 2048. Positive tests: 627. False positives: 327.
Things to consider: the expected amount of entries influences performance
the number of hash functions influences performance Things to consider:
calculating the optimal size & number of hash functions is
a solved problem Things to consider:
calculating the optimal size & number of hash functions is
a solved problem • false positive rate • expected number of items Things to consider:
benchmark, benchmark, benchmark estimate, estimate, estimate Things to consider:
into the wild
None
None
None
id: 1 id: 2 “fernando” “mendes” “miguel” “palhas”
id: 1 id: 2 “fernando” “mendes” “miguel” “palhas” add(“m”) add(“p”)
None
@fribmendes @frmendes Fernando Mendes