Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Building a search engine - in less than 15 minutes
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Ward Bekker
July 12, 2012
Technology
140
3
Share
Building a search engine - in less than 15 minutes
In this talk we dive into two popular compression algorithms in search engines.
Ward Bekker
July 12, 2012
More Decks by Ward Bekker
See All by Ward Bekker
Automated testing with Erlang - these go to eleven -
wardbekker
3
270
Erltricity
wardbekker
1
85
Other Decks in Technology
See All in Technology
え!?初参加で 300冊以上 も頒布!? これは大成功!そのはずなのに わいの財布は 赤字 の件
hellohazime
0
120
Azure Lifecycle with Copilot CLI
torumakabe
0
190
Digitization部 紹介資料
sansan33
PRO
1
7.2k
組織的なAI活用を阻む 最大のハードルは コンテキストデザインだった
ixbox
6
1.7k
2026年度新卒技術研修 サイバーエージェントのデータベース 活用事例とパフォーマンス調査入門
cyberagentdevelopers
PRO
6
7.6k
Hooks, Filters & Now Context: Why MCPs Are the “Hooks” of the AI Era
miriamschwab
0
140
AIがコードを書く時代の ジェネレーティブプログラミング
polidog
PRO
3
680
プロジェクトマネジメントは AIでどう変わるか?
mkg5383
0
220
あるアーキテクチャ決定と その結果/architecture-decision-and-its-result
hanhan1978
2
570
Introduction to Sansan Meishi Maker Development Engineer
sansan33
PRO
0
380
Introduction to Sansan, inc / Sansan Global Development Center, Inc.
sansan33
PRO
0
3k
Kubernetes基盤における開発者体験 とセキュリティの両⽴ / Balancing developer experience and security in a Kubernetes-based environment
chmikata
0
250
Featured
See All Featured
職位にかかわらず全員がリーダーシップを発揮するチーム作り / Building a team where everyone can demonstrate leadership regardless of position
madoxten
62
53k
Lightning Talk: Beautiful Slides for Beginners
inesmontani
PRO
1
510
Build your cross-platform service in a week with App Engine
jlugia
234
18k
BBQ
matthewcrist
89
10k
How to optimise 3,500 product descriptions for ecommerce in one day using ChatGPT
katarinadahlin
PRO
1
3.5k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
31
2.7k
Between Models and Reality
mayunak
3
260
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
100
Learning to Love Humans: Emotional Interface Design
aarron
275
41k
Docker and Python
trallard
47
3.8k
Evolving SEO for Evolving Search Engines
ryanjones
0
180
Conquering PDFs: document understanding beyond plain text
inesmontani
PRO
4
2.6k
Transcript
Me • Ward Bekker • TTY.nl • Product owner •
@wardbekker • https://github.com/wardbekker maandag 16 juli 12
Building a search engine in less than 15 minutes maandag
16 juli 12
Documents maandag 16 juli 12
Index maandag 16 juli 12
Querying maandag 16 juli 12
Result maandag 16 juli 12
maandag 16 juli 12
maandag 16 juli 12
Tuples: InvertedIndex: maandag 16 juli 12
Querying maandag 16 juli 12
maandag 16 juli 12
Elias ɣ Encoding • 1, 2, 3 • 00000000 00000001,
00000000 00000010, 00000000 00000011 • 6 bytes = 48 bits maandag 16 juli 12
Elias ɣ Encoding n => .. => length + 1
value 1 => 1 => 1 2 => 10 => 0 10 3 => 11 => 0 11 4 => 100 => 00 100 5 => 101 => 00 101 maandag 16 juli 12
Elias ɣ Encoding • 0000000000000001, 0000000000000010, 0000000000000011 • 48 bits
• 1 010 011 • 7 bits • compression ratio: 0.15 maandag 16 juli 12
Elias ɣ Encoding • erts_debug:size([1,2,3,4,5]) => • 10 words =
10 * 8 bytes = 640 bits • erlang:byte_size(binary:encode_unsigned(2#10100110010000101)). • 3 bytes = 24 bits • compression ratio: 0.04 maandag 16 juli 12
Delta Gap Compression • [3, 7, 8, 9, 12, 13,
14, 15, 16] • 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 • 3 - 1 - 3 - 3 - 2 - 5 • compression ratio: 0.67 maandag 16 juli 12
Combined • 64-bit Erlang • [3, 7, 8, 9, 12,
13, 14, 15, 16] • [3, 1, 3 , 3, 2, 5] • 011101101100101 • 1152 vs 16 bits • compression ratio: 0.014 maandag 16 juli 12
More information • https://github.com/wardbekker/compression/ • Modern Information Retrieval • http://www.mir2ed.org/
• @wardbekker maandag 16 juli 12