A Look Into Bloom Filters
Fernando Mendes
October 07, 2016
Programming
Transcript
a look into bloom filters
@fribmendes @frmendes
@cesiuminho
@coderdojominho
We design and develop thoughtful digital products. BRAGA & BOSTON
@mirrorconf @rubyconfpt
wat the wtf is a bloom filter
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
A funky array with hash functions that’s supposed to be really, really small.
“do you have ‘abc’ in there?”
bloom filter: “i definitely do not”
“how about some ‘xyz’?”
bloom filter: “i mean, yeah, probably”
SERVER: Can I visit “pixels.camp”?
SERVER, with a bloom filter on the CLIENT: Can I visit “pixels.camp”?
Pre-filling the bloom filter
add(‘totallynotfake.com’)
hash(‘totallynotfake.com’)
hash(‘clickformoney.com’)
CLIENT: Can I visit “pixels.camp”?
hash(‘pixels.camp’)
yes!
CLIENT: Can I visit “github.com”?
hash(‘github.com’)
nope.
SERVER: Can I visit “github.com”?
you’re good to go
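The exchange above can be sketched in plain Ruby. This is only an illustration under assumptions: `ClientFilter`, `BLOCKLIST`, `server_allows?` and `can_visit?` are hypothetical names, the "server" is simulated with an in-memory set, and SHA-256 from the standard library stands in for whatever hash the real client would use.

```ruby
require "digest"
require "set"

# Hypothetical stand-in for the remote service: the authoritative blocklist.
BLOCKLIST = Set["totallynotfake.com", "clickformoney.com"].freeze

# Tiny client-side filter: one hash function over an array of bits.
class ClientFilter
  def initialize(size: 64)
    @size = size
    @bits = Array.new(size, 0)
  end

  def add(url)
    @bits[index_of(url)] = 1
  end

  # false => definitely not in the blocklist; true => possibly in it
  def maybe_listed?(url)
    @bits[index_of(url)] == 1
  end

  private

  def index_of(url)
    Digest::SHA256.hexdigest(url).to_i(16) % @size
  end
end

# Simulated server round trip (assumption: the server holds the full list).
def server_allows?(url)
  !BLOCKLIST.include?(url)
end

# Returns [allowed, asked_server].
def can_visit?(filter, url)
  # Bit not set: the URL was never added, so answer locally.
  return [true, false] unless filter.maybe_listed?(url)

  # Bit set: either a listed URL or a false positive, so ask the server.
  [server_allows?(url), true]
end

filter = ClientFilter.new
BLOCKLIST.each { |url| filter.add(url) }

p can_visit?(filter, "totallynotfake.com") # => [false, true]
p can_visit?(filter, "pixels.camp")
```

A false positive only costs one extra round trip to the server, which is why this use case tolerates them.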
“A bloom filter is a space-efficient probabilistic data structure, conceived
by Burton Howard Bloom in 1970 (…) a query returns either possibly in set or definitely not in set.” - Wikipedia, 2016
Things to consider:
• bloom filters do inclusion testing
• bloom filters turn big data into tiny data
• bloom filters turn false into true
• your application must allow false positives
diving into it
module MaliciousUrl
  class Filter
    def initialize
      @filter = Hash.new
    end

    def add(url)
      @filter[url] = true
    end

    def test(url)
      @filter[url]
    end
  end
end
instant access™
space complexity: saving key-value tuples
solution: bit arrays
require "bitarray" # assuming the bitarray gem
require "fnv"      # assuming the fnv gem

module MaliciousUrl
  class Filter
    def initialize(size: 1024)
      @bits = BitArray.new(size)
      @fnv = FNV.new
      @size = size
    end

    # Map a string to a bit position.
    def hash(str)
      @fnv.fnv1a_32(str) % @size
    end

    def add(str)
      index = hash(str)
      @bits[index] = 1
    end

    def test(str)
      index = hash(str)
      @bits[index] == 1
    end
  end
end
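To try the bit-array version without installing gems, here is a minimal sketch that swaps in two stand-ins: a hand-written 32-bit FNV-1a and a plain Ruby Integer used as the bit field. `TinyFilter`, `fnv1a_32` and `index_of` are names made up for this sketch, not part of the slides' code.

```ruby
# FNV-1a, 32-bit: XOR each byte into the hash, then multiply by the FNV prime.
def fnv1a_32(str)
  str.each_byte.reduce(0x811c9dc5) do |hash, byte|
    ((hash ^ byte) * 0x01000193) & 0xffffffff
  end
end

class TinyFilter
  def initialize(size: 1024)
    @size = size
    @bits = 0 # an Integer used as a bit field; bit i is the i-th slot
  end

  def add(str)
    @bits |= (1 << index_of(str))
  end

  # true => possibly added; false => definitely never added
  def test(str)
    @bits[index_of(str)] == 1
  end

  private

  def index_of(str)
    fnv1a_32(str) % @size
  end
end

filter = TinyFilter.new
filter.add("totallynotfake.com")
filter.test("totallynotfake.com") # => true
```

With a single hash function, any two strings that land on the same index are indistinguishable, which is where the false positives come from.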
instant access™
space-efficiency
small universe == more collisions
solution: more hashes
def initialize(size: 1024, iterations: 3)
  @bits = BitArray.new(size)
  @size = size
  @seeds = seed(iterations)
end
def seed(n)
  seeds = []
  n.times do
    seed = SecureRandom.hex(3).to_i(16)
    seeds.push(seed)
  end
  seeds
end

because Ruby:

def seed(iterations)
  (1..iterations).map do
    SecureRandom.hex(3).to_i(16)
  end
end
def hash(str, seed)
  hash = MurmurHash3::V32.str_hash(str, seed)
  hash % @size
end

def indices_of(str)
  @seeds.map { |seed| hash(str, seed) }
end

def add(str)
  indices_of(str).each { |i| @bits[i] = 1 }
end

def test(str)
  indices_of(str).all? { |i| @bits[i] == 1 }
end
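The pieces above fit together roughly as follows. This is a sketch using only the standard library, so two things differ from the slides and are assumptions: an Array of 0/1 replaces `BitArray`, and SHA-256 over a seed-prefixed string replaces the seeded MurmurHash3 call. `SeededFilter` and `index_for` are hypothetical names.

```ruby
require "digest"
require "securerandom"

class SeededFilter
  def initialize(size: 1024, iterations: 3)
    @size = size
    @bits = Array.new(size, 0)
    # One random seed per hash function, as in the seed method above.
    @seeds = Array.new(iterations) { SecureRandom.hex(3).to_i(16) }
  end

  # Set every bit the item hashes to.
  def add(str)
    indices_of(str).each { |i| @bits[i] = 1 }
  end

  # An item is "possibly in the set" only if all of its bits are set.
  def test(str)
    indices_of(str).all? { |i| @bits[i] == 1 }
  end

  private

  # Assumption: seed-prefixed SHA-256 stands in for a seeded MurmurHash3.
  def index_for(str, seed)
    Digest::SHA256.hexdigest("#{seed}:#{str}").to_i(16) % @size
  end

  def indices_of(str)
    @seeds.map { |seed| index_for(str, seed) }
  end
end

filter = SeededFilter.new
filter.add("totallynotfake.com")
filter.test("totallynotfake.com") # => true
```

Because the seeds are random per instance, two filters built from the same data will generally set different bits, so a filter has to be shipped together with its seeds.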
a test drive
A benchmark
• create a bloom filter with 1024 bits
• insert 900 values
• test 2048 values
$ ruby benchmark.rb
### V1
Bloom filter size: 1024.
Inserted values: 900.
Tested values: 2048.
Positive tests: 1532.
False positives: 632.
### V2
Bloom filter size: 1024.
Inserted values: 900.
Tested values: 2048.
Positive tests: 1816.
False positives: 916.

V2 writes 3 bits per value: 900 * 3 = 2700 writes into 1024 bits.
$ ruby benchmark_v2.rb
### V1
Bloom filter size: 1024.
Inserted values: 300.
Tested values: 2048.
Positive tests: 729.
False positives: 429.
### V2
Bloom filter size: 1024.
Inserted values: 300.
Tested values: 2048.
Positive tests: 627.
False positives: 327.
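The two benchmark runs line up with the usual textbook approximation for a bloom filter's false positive rate, (1 - e^(-k*n/m))^k, for m bits, n inserted items and k hash functions. The sketch below only evaluates that formula for the four configurations above; `expected_false_positive_rate` is a name made up here, and the benchmark scripts themselves are not reproduced.

```ruby
# Approximate false positive rate for m bits, n inserted items, k hash functions.
def expected_false_positive_rate(m:, n:, k:)
  (1 - Math.exp(-k * n.to_f / m))**k
end

# The four benchmark configurations: V1 uses 1 hash, V2 uses 3.
[[900, 1], [900, 3], [300, 1], [300, 3]].each do |n, k|
  rate = expected_false_positive_rate(m: 1024, n: n, k: k)
  puts format("n=%d k=%d -> %.3f", n, k, rate)
end
```

At 900 items the formula predicts a worse rate for three hashes than for one, and at 300 items a better one, which is the same direction the measured numbers move in.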
Things to consider:
• the expected number of entries influences performance
• the number of hash functions influences performance
• calculating the optimal size & number of hash functions is a solved problem, given:
  • false positive rate
  • expected number of items
• benchmark, benchmark, benchmark
• estimate, estimate, estimate
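The "solved problem" refers to the standard sizing formulas: m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions, for n expected items and a target false positive rate p. A minimal sketch follows, with hypothetical method names and an example target of 1% that is not taken from the slides.

```ruby
# m = -n * ln(p) / (ln 2)^2, rounded up to whole bits.
def optimal_size(expected_items:, false_positive_rate:)
  (-expected_items * Math.log(false_positive_rate) / (Math.log(2)**2)).ceil
end

# k = (m / n) * ln 2, with at least one hash function.
def optimal_hash_count(size:, expected_items:)
  [(size.to_f / expected_items * Math.log(2)).round, 1].max
end

# Example inputs (assumed for illustration): 300 items, 1% false positives.
size = optimal_size(expected_items: 300, false_positive_rate: 0.01)
hashes = optimal_hash_count(size: size, expected_items: 300)
puts "bits: #{size}, hash functions: #{hashes}"
```

These are estimates, so it is still worth benchmarking the resulting filter against real data.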
into the wild
id: 1 “fernando” “mendes”
id: 2 “miguel” “palhas”
add(“m”) add(“p”)
@fribmendes @frmendes Fernando Mendes