A Look Into Bloom Filters
Fernando Mendes
October 07, 2016
Transcript
a look into bloom filters
@fribmendes @frmendes
@cesiuminho
@coderdojominho
We design and develop thoughtful digital products. BRAGA & BOSTON
@mirrorconf @rubyconfpt
wat the wtf is a bloom filter
“A bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
A funky array with hash functions that’s supposed to be really really small.
do you have ‘abc’ in there?
bloom filter: i definitely do not
how about some ‘xyz’?
bloom filter: i mean, yeah, probably
SERVER: Can I visit “pixels.camp”?
SERVER + CLIENT bloom filter: Can I visit “pixels.camp”?
Pre-filling the bloom filter
add(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘clickformoney.com’)
CLIENT: Can I visit “pixels.camp”?
hash(‘pixels.camp’)
yes!
CLIENT: Can I visit “github.com”?
hash(‘github.com’)
nope.
SERVER: Can I visit “github.com”?
you’re good to go
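The exchange above can be sketched roughly as follows. This is only an illustration of the flow, not code from the talk: `StubFilter`, `Server` and `can_visit?` are hypothetical names, and the stub simulates a false positive for github.com instead of hashing anything.

```ruby
require "set"

# Hypothetical stand-in for a bloom filter: anything with a #test(url) method
# that may also answer true for URLs that were never added (false positives).
class StubFilter
  def initialize(possibly_in_set)
    @possibly_in_set = Set.new(possibly_in_set)
  end

  def test(url)
    @possibly_in_set.include?(url)
  end
end

# Hypothetical authoritative lookup living on the server.
class Server
  attr_reader :requests

  def initialize(malicious)
    @malicious = Set.new(malicious)
    @requests = 0
  end

  def malicious?(url)
    @requests += 1
    @malicious.include?(url)
  end
end

# "Definitely not in set" is answered locally;
# "possibly in set" is double-checked with the server.
def can_visit?(url, filter:, server:)
  return true unless filter.test(url)

  !server.malicious?(url)
end

malicious = ["totallynotfake.com", "clickformoney.com"]
server = Server.new(malicious)
# github.com is listed here only to simulate a false positive.
filter = StubFilter.new(malicious + ["github.com"])

puts can_visit?("pixels.camp", filter: filter, server: server)        # answered by the filter alone
puts can_visit?("github.com", filter: filter, server: server)         # false positive, cleared by the server
puts can_visit?("totallynotfake.com", filter: filter, server: server) # confirmed malicious by the server
```

Only the "possibly in set" answers cost a round trip to the server, which is the point of keeping the filter on the client.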
Things to consider:
• bloom filters do inclusion testing
• bloom filters turn big data into tiny data
• bloom filters turn false into true
• your application must allow false positives
diving into it
module MaliciousUrl
  class Filter
    def initialize
      @filter = Hash.new
    end

    def add(url)
      @filter[url] = true
    end

    def test(url)
      @filter[url]
    end
  end
end
instant access™
space complexity: saving key-value tuples
solution: bit arrays
# assumes the bitarray and fnv gems provide BitArray and FNV
require 'bitarray'
require 'fnv'

module MaliciousUrl
  class Filter
    def initialize(size: 1024)
      @bits = BitArray.new(size)
      @fnv = FNV.new
      @size = size
    end

    # map a string to a single bit index
    def hash(str)
      @fnv.fnv1a_32(str) % @size
    end

    def add(str)
      index = hash(str)
      @bits[index] = 1
    end

    def test(str)
      index = hash(str)
      @bits[index] == 1
    end
  end
end
instant access™
space-efficiency
small universe == more collisions
solution: more hashes
# assumes the bitarray and murmurhash3 gems provide BitArray and MurmurHash3
require 'bitarray'
require 'murmurhash3'
require 'securerandom'

def initialize(size: 1024, iterations: 3)
  @bits = BitArray.new(size)
  @size = size
  @seeds = seed(iterations)
end

# one random seed per hash function
def seed(n)
  seeds = []
  n.times do
    seed = SecureRandom.hex(3).to_i(16)
    seeds.push(seed)
  end
  seeds
end

# the same thing, because Ruby
def seed(iterations)
  (1..iterations).map do
    SecureRandom.hex(3).to_i(16)
  end
end

def hash(str, seed)
  hash = MurmurHash3::V32.str_hash(str, seed)
  hash % @size
end

# one bit index per seed
def indices_of(str)
  @seeds.map { |seed| hash(str, seed) }
end

def add(str)
  indices_of(str).each { |i| @bits[i] = 1 }
end

def test(str)
  indices_of(str).all? { |i| @bits[i] == 1 }
end
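For trying the idea without installing any gems, here is a minimal sketch that uses only the standard library. It is not the implementation from the slides: `TinyBloomFilter` and `index_for` are made-up names, the bit array is a plain Integer, the seeds are fixed instead of random, and MD5 over `"seed:str"` stands in for a seeded MurmurHash3.

```ruby
require "digest/md5"

# Stdlib-only sketch of a bloom filter with several hash functions.
class TinyBloomFilter
  def initialize(size: 1024, iterations: 3)
    @size = size
    @bits = 0                      # an Integer used as a bit array (assumption)
    @seeds = (1..iterations).to_a  # fixed seeds keep runs reproducible (assumption)
  end

  # Set one bit per seed.
  def add(str)
    indices_of(str).each { |i| @bits |= (1 << i) }
  end

  # True only if every bit for this string is set:
  # "possibly in set" when true, "definitely not in set" when false.
  def test(str)
    indices_of(str).all? { |i| @bits[i] == 1 }
  end

  private

  # Seeded hash reduced to a bit index.
  def index_for(str, seed)
    Digest::MD5.hexdigest("#{seed}:#{str}").to_i(16) % @size
  end

  def indices_of(str)
    @seeds.map { |seed| index_for(str, seed) }
  end
end

filter = TinyBloomFilter.new
filter.add("totallynotfake.com")
puts filter.test("totallynotfake.com")       # true: added values are never reported missing
puts TinyBloomFilter.new.test("pixels.camp") # false: an empty filter has no bits set
```

Any value that was added always tests positive; the only possible error is a positive answer for a value that was never added.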
a test drive
A benchmark:
• create a bloom filter with 1024 bits
• insert 900 values
• test 2048 values
$ ruby benchmark.rb
### V1
Bloom filter size: 1024.
Inserted values: 900.
Tested values: 2048.
Positive tests: 1532.
False positives: 632.
### V2
Bloom filter size: 1024.
Inserted values: 900. (* 3 hashes = 2700 bit writes)
Tested values: 2048.
Positive tests: 1816.
False positives: 916.
$ ruby benchmark_v2.rb
### V1
Bloom filter size: 1024.
Inserted values: 300.
Tested values: 2048.
Positive tests: 729.
False positives: 429.
### V2
Bloom filter size: 1024.
Inserted values: 300.
Tested values: 2048.
Positive tests: 627.
False positives: 327.
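The numbers above can be compared with the usual textbook estimate of the false positive rate, (1 - e^(-kn/m))^k for m bits, n inserted values and k hash functions. The sketch below assumes independent, uniformly distributed hashes; `false_positive_rate` is a name made up for this example.

```ruby
# Estimated false positive rate for m bits, n inserted values and
# k hash functions, assuming independent and uniform hashes.
def false_positive_rate(m:, n:, k:)
  (1 - Math.exp(-k * n.to_f / m))**k
end

[[900, 1], [900, 3], [300, 1], [300, 3]].each do |n, k|
  rate = false_positive_rate(m: 1024, n: n, k: k)
  puts format("n=%d k=%d estimated false positive rate: %.2f", n, k, rate)
end
```

The estimates come out at roughly 0.58, 0.80, 0.25 and 0.20. If the inserted values are among the tested ones (positive tests minus false positives equals the number of inserted values in every run), the observed rates are 632/1148 ≈ 0.55, 916/1148 ≈ 0.80, 429/1748 ≈ 0.25 and 327/1748 ≈ 0.19. Three hashes hurt when the filter is nearly full and help once it has room.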
Things to consider:
• the expected number of entries influences performance
• the number of hash functions influences performance
• calculating the optimal size & number of hash functions is a solved problem, given:
  • false positive rate
  • expected number of items
• benchmark, benchmark, benchmark; estimate, estimate, estimate
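As a sketch of that "solved problem", the standard formulas are m = -n·ln(p) / (ln 2)² bits and k = (m/n)·ln 2 hash functions, for n expected items and a target false positive rate p. The method and parameter names below are made up for illustration, and the result is an estimate to benchmark against, not a guarantee.

```ruby
# Standard sizing formulas for a bloom filter:
#   bits   m = -n * ln(p) / (ln 2)^2
#   hashes k = (m / n) * ln 2
def optimal_parameters(expected_items:, false_positive_rate:)
  bits = (-expected_items * Math.log(false_positive_rate) / (Math.log(2)**2)).ceil
  hashes = [(bits.to_f / expected_items * Math.log(2)).round, 1].max
  { bits: bits, hashes: hashes }
end

params = optimal_parameters(expected_items: 900, false_positive_rate: 0.01)
puts "bits: #{params[:bits]}, hash functions: #{params[:hashes]}"
```

For a 1% false positive rate this works out to about 9.6 bits per item and 7 hash functions, far more room than the 1024 bits used for 900 values in the benchmark.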
into the wild
id: 1 id: 2 “fernando” “mendes” “miguel” “palhas” add(“m”) add(“p”)
@fribmendes @frmendes Fernando Mendes