Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A Look Into Bloom Filters
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Fernando Mendes
October 07, 2016
Programming
0
510
A Look Into Bloom Filters
Fernando Mendes
October 07, 2016
Tweet
Share
More Decks by Fernando Mendes
See All by Fernando Mendes
you. and the morals of technology
fribmendes
1
140
Knee-Deep Into P2P: A Tale of Fail (PWL Porto)
fribmendes
0
65
Knee-Deep Into P2P: A Tale of Fail (ElixirConf EU 2018 version)
fribmendes
0
170
Knee-Deep Into P2P: A Tale of Fail (non-Elixir)
fribmendes
0
180
Bloom Filters: A Look Into Ruby
fribmendes
0
120
Programming WTF: HTML & CSS
fribmendes
4
160
Ruby: A (pointless) Workshop
fribmendes
1
170
Elixir: A Talk For College Students
fribmendes
0
170
Riding Rails
fribmendes
0
110
Other Decks in Programming
See All in Programming
AIエージェントのキホンから学ぶ「エージェンティックコーディング」実践入門
masahiro_nishimi
5
470
カスタマーサクセス業務を変革したヘルススコアの実現と学び
_hummer0724
0
720
CSC307 Lecture 02
javiergs
PRO
1
780
CSC307 Lecture 04
javiergs
PRO
0
660
CSC307 Lecture 08
javiergs
PRO
0
670
AI巻き込み型コードレビューのススメ
nealle
2
420
コマンドとリード間の連携に対する脅威分析フレームワーク
pandayumi
1
460
フロントエンド開発の勘所 -複数事業を経験して見えた判断軸の違い-
heimusu
7
2.8k
Automatic Grammar Agreementと Markdown Extended Attributes について
kishikawakatsumi
0
200
2026年 エンジニアリング自己学習法
yumechi
0
140
AIによる開発の民主化を支える コンテキスト管理のこれまでとこれから
mulyu
3
380
Unicodeどうしてる? PHPから見たUnicode対応と他言語での対応についてのお伺い
youkidearitai
PRO
1
2.6k
Featured
See All Featured
The World Runs on Bad Software
bkeepers
PRO
72
12k
Efficient Content Optimization with Google Search Console & Apps Script
katarinadahlin
PRO
1
330
The #1 spot is gone: here's how to win anyway
tamaranovitovic
2
940
What’s in a name? Adding method to the madness
productmarketing
PRO
24
3.9k
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
270
Understanding Cognitive Biases in Performance Measurement
bluesmoon
32
2.8k
How to Get Subject Matter Experts Bought In and Actively Contributing to SEO & PR Initiatives.
livdayseo
0
67
Test your architecture with Archunit
thirion
1
2.2k
The Anti-SEO Checklist Checklist. Pubcon Cyber Week
ryanjones
0
69
Documentation Writing (for coders)
carmenintech
77
5.3k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
254
22k
[RailsConf 2023] Rails as a piece of cake
palkan
59
6.3k
Transcript
bloom filters a look into
a look into bloom filters
@fribmendes @frmendes
@cesiuminho
@cesiuminho
@coderdojominho
We design and develop thoughtful digital products. BRAGA & BOSTON
@mirrorconf @rubyconfpt
wat the wtf is a bloom filter
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
A funky array with hash functions that’s supposed to be
really really small.
bloom filter do you have ‘abc’ in there?
bloom filter i definitely do not do you have ‘abc’
in there?
how about some ‘xyz’? bloom filter i definitely do not
i mean, yeah, probably bloom filter how about some ‘xyz’?
SERVER
Can I visit “pixels.camp”? SERVER
SERVER Can I visit “pixels.camp”?
Can I visit “pixels.camp”? SERVER CLIENT bloom filter
Pre-filling the bloom filter
add(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘clickformoney.com’)
Can I visit “pixels.camp”? CLIENT
hash(‘pixels.camp’) Can I visit “pixels.camp”? CLIENT
yes! Can I visit “pixels.camp”? CLIENT
Can I visit “github.com”? CLIENT
hash(‘github.com’) CLIENT Can I visit “github.com”?
nope. Can I visit “github.com”? CLIENT
SERVER Can I visit “github.com”?
you’re good to go Can I visit “github.com”? SERVER
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
Things to consider: bloom filters do inclusion testing
Things to consider: bloom filters turn big data into tiny
data
Things to consider: bloom filters turn false into true
Things to consider: your application must allow false positives
diving into it
module MaliciousUrl class Filter end end
module MaliciousUrl class Filter def initialize @filter = Hash.new end
end end
module MaliciousUrl class Filter def add(url) @filter[url] = true end
end end
module MaliciousUrl class Filter def test(url) @filter[url] end end end
instant access™
instant access™ space complexity: saving key-value tuples
instant access™ space complexity: saving key-value tuples solution: bit arrays
module MaliciousUrl class Filter def initialize(size: 1024) @bits = BitArray.new(size)
@fnv = FNV.new @size = size end end end
module MaliciousUrl class Filter def hash(str) @fnv.fnv1a_32(str) % @size end
end end
module MaliciousUrl class Filter def add(str) index = hash(str) @bits[index]
= 1 end end end
module MaliciousUrl class Filter def test(str) index = hash(str) @bits[index]
== 1 end end end
instant access™
instant access™ space-efficiency
instant access™ space-efficiency small universe == more collisions
instant access™ space-efficiency small universe == more collisions solution: more
hashes
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def seed(n) seeds = [] n.times do seed = SecureRandom.hex(3).to_i(16)
seeds.push(seed) end seeds end
def seed(iterations) (1..iterations).map do SecureRandom.hex(3).to_i(16) end end because Ruby
def initialize(size: 1024, iterations: 3) @bits = BitArray.new(size) @size =
size @seeds = seed(iterations) end
def hash(str, seed) hash = MurmurHash3::V32.str_hash(str, seed) hash % @size
end
def indices_of(str) @seeds.map { |seed| hash(str, seed) } end
def add(str) indices_of(str).each { |i| @bits[i] = 1 } end
def test(str) indices_of(str).all? { |i| @bits[i] == 1 } end
a test drive
A benchmark create a bloom filter with 1024 bits insert
900 values test 2048 values
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark.rb ### V1 Bloom filter size: 1024. Inserted
values: 900. Tested values: 2048. Positive tests: 1532. False positives: 632. ### V2 Bloom filter size: 1024. Inserted values: 900. * 3 = 2700 Tested values: 2048. Positive tests: 1816. False positives: 916.
$ ruby benchmark_v2.rb ### V1 Bloom filter size: 1024. Inserted
values: 300. Tested values: 2048. Positive tests: 729. False positives: 429. ### V2 Bloom filter size: 1024. Inserted values: 300. Tested values: 2048. Positive tests: 627. False positives: 327.
$ ruby benchmark_v2.rb ### V1 Bloom filter size: 1024. Inserted
values: 300. Tested values: 2048. Positive tests: 729. False positives: 429. ### V2 Bloom filter size: 1024. Inserted values: 300. Tested values: 2048. Positive tests: 627. False positives: 327.
Things to consider: the expected amount of entries influences performance
the number of hash functions influences performance Things to consider:
calculating the optimal size & number of hash functions is
a solved problem Things to consider:
calculating the optimal size & number of hash functions is
a solved problem • false positive rate • expected number of items Things to consider:
benchmark, benchmark, benchmark estimate, estimate, estimate Things to consider:
into the wild
None
None
None
id: 1 id: 2 “fernando” “mendes” “miguel” “palhas”
id: 1 id: 2 “fernando” “mendes” “miguel” “palhas” add(“m”) add(“p”)
None
@fribmendes @frmendes Fernando Mendes