Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A Look Into Bloom Filters
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Fernando Mendes
October 07, 2016
Programming
0
510
A Look Into Bloom Filters
Fernando Mendes
October 07, 2016
Tweet
Share
More Decks by Fernando Mendes
See All by Fernando Mendes
you. and the morals of technology
fribmendes
1
140
Knee-Deep Into P2P: A Tale of Fail (PWL Porto)
fribmendes
0
65
Knee-Deep Into P2P: A Tale of Fail (ElixirConf EU 2018 version)
fribmendes
0
170
Knee-Deep Into P2P: A Tale of Fail (non-Elixir)
fribmendes
0
180
Bloom Filters: A Look Into Ruby
fribmendes
0
120
Programming WTF: HTML & CSS
fribmendes
4
160
Ruby: A (pointless) Workshop
fribmendes
1
170
Elixir: A Talk For College Students
fribmendes
0
170
Riding Rails
fribmendes
0
110
Other Decks in Programming
See All in Programming
ぼくの開発環境2026
yuzneri
0
240
例外処理とどう使い分ける?Result型を使ったエラー設計 #burikaigi
kajitack
16
6.1k
KIKI_MBSD Cybersecurity Challenges 2025
ikema
0
1.3k
余白を設計しフロントエンド開発を 加速させる
tsukuha
7
2.1k
責任感のあるCloudWatchアラームを設計しよう
akihisaikeda
3
180
なるべく楽してバックエンドに型をつけたい!(楽とは言ってない)
hibiki_cube
0
140
Grafana:建立系統全知視角的捷徑
blueswen
0
330
生成AIを使ったコードレビューで定性的に品質カバー
chiilog
1
270
【卒業研究】会話ログ分析によるユーザーごとの関心に応じた話題提案手法
momok47
0
200
ノイジーネイバー問題を解決する 公平なキューイング
occhi
0
110
Automatic Grammar Agreementと Markdown Extended Attributes について
kishikawakatsumi
0
200
AI によるインシデント初動調査の自動化を行う AI インシデントコマンダーを作った話
azukiazusa1
1
740
Featured
See All Featured
Redefining SEO in the New Era of Traffic Generation
szymonslowik
1
220
Exploring the relationship between traditional SERPs and Gen AI search
raygrieselhuber
PRO
2
3.6k
HDC tutorial
michielstock
1
390
SEO in 2025: How to Prepare for the Future of Search
ipullrank
3
3.3k
We Analyzed 250 Million AI Search Results: Here's What I Found
joshbly
1
740
Discover your Explorer Soul
emna__ayadi
2
1.1k
Balancing Empowerment & Direction
lara
5
890
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
62
50k
Become a Pro
speakerdeck
PRO
31
5.8k
Marketing Yourself as an Engineer | Alaka | Gurzu
gurzu
0
130
Groundhog Day: Seeking Process in Gaming for Health
codingconduct
0
94
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
1
300
Transcript
bloom filters a look into
a look into bloom filters
@fribmendes @frmendes
@cesiuminho
@cesiuminho
@coderdojominho
We design and develop thoughtful digital products. BRAGA & BOSTON
@mirrorconf @rubyconfpt
wat the wtf is a bloom filter
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
A funky array with hash functions that’s supposed to be
really really small.
bloom filter do you have ‘abc’ in there?
bloom filter i definitely do not do you have ‘abc’
in there?
how about some ‘xyz’? bloom filter i definitely do not
i mean, yeah, probably bloom filter how about some ‘xyz’?
SERVER
Can I visit “pixels.camp”? SERVER
SERVER Can I visit “pixels.camp”?
Can I visit “pixels.camp”? SERVER CLIENT bloom filter
Pre-filling the bloom filter
add(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘clickformoney.com’)
Can I visit “pixels.camp”? CLIENT
hash(‘pixels.camp’) Can I visit “pixels.camp”? CLIENT
yes! Can I visit “pixels.camp”? CLIENT
Can I visit “github.com”? CLIENT
hash(‘github.com’) CLIENT Can I visit “github.com”?
nope. Can I visit “github.com”? CLIENT
SERVER Can I visit “github.com”?
you’re good to go Can I visit “github.com”? SERVER
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
Things to consider: bloom filters do inclusion testing
Things to consider: bloom filters turn big data into tiny
data
Things to consider: bloom filters turn false into true
Things to consider: your application must allow false positives
diving into it
module MaliciousUrl class Filter end end
module MaliciousUrl class Filter def initialize @filter = Hash.new end
end end
module MaliciousUrl class Filter def add(url) @filter[url] = true end
end end
module MaliciousUrl class Filter def test(url) @filter[url] end end end
instant access™
instant access™ space complexity: saving key-value tuples
instant access™ space complexity: saving key-value tuples solution: bit arrays
module MaliciousUrl class Filter def initialize(size: 1024) @bits = BitArray.new(size)
@fnv = FNV.new @size = size end end end
module MaliciousUrl class Filter def hash(str) @fnv.fnv1a_32(str) % @size end
end end
module MaliciousUrl class Filter def add(str) index = hash(str) @bits[index]
= 1 end end end
module MaliciousUrl class Filter def test(str) index = hash(str) @bits[index]
== 1 end end end
instant access™
instant access™ space-efficiency
instant access™ space-efficiency small universe == more collisions
instant access™ space-efficiency small universe == more collisions solution: more
hashes
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def seed(n) seeds = [] n.times do seed = SecureRandom.hex(3).to_i(16)
seeds.push(seed) end seeds end
def seed(iterations) (1..iterations).map do SecureRandom.hex(3).to_i(16) end end because Ruby
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def hash(str, seed) hash = MurmurHash3::V32.str_hash(str, seed) hash % @size
end
def indices_of(str) @seeds.map { |seed| hash(str, seed) } end
def add(str) indices_of(str).each { |i| @bits[i] = 1 } end
def test(str) indices_of(str).all? { |i| @bits[i] == 1 } end
a test drive
A benchmark create a bloom filter with 1024 bits insert
900 values test 2048 values
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. * 3 = 2700 Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark_v2.rb ### V1 Bloom filter size: 1024. Inserted
values: 300. Tested values: 2048. Positive tests: 729. False positives: 429. ### V2 Bloom filter size: 1024. Inserted values: 300. Tested values: 2048. Positive tests: 627. False positives: 327.
$ ruby benchmark_v2.rb ### V1 Bloom filter size: 1024. Inserted
values: 300. Tested values: 2048. Positive tests: 729. False positives: 429. ### V2 Bloom filter size: 1024. Inserted values: 300. Tested values: 2048. Positive tests: 627. False positives: 327.
Things to consider: the expected amount of entries influences performance
the number of hash functions influences performance Things to consider:
calculating the optimal size & number of hash functions is
a solved problem Things to consider:
calculating the optimal size & number of hash functions is
a solved problem • false positive rate • expected number of items Things to consider:
benchmark, benchmark, benchmark estimate, estimate, estimate Things to consider:
into the wild
None
None
None
id: 1 id: 2 “fernando” “mendes” “miguel” “palhas”
id: 1 id: 2 “fernando” “mendes” “miguel” “palhas” add(“m”) add(“p”)
None
@fribmendes @frmendes Fernando Mendes