Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
俺の全文検索エンジン(Go製)を作り始めた
Search
kotaroooo0
November 11, 2020
Programming
0
99
俺の全文検索エンジン(Go製)を作り始めた
kotaroooo0
November 11, 2020
Tweet
Share
More Decks by kotaroooo0
See All by kotaroooo0
検索エンジン自作入門 Go Conference 2021 Spring
kotaroooo0
17
7.1k
転置インデックスでどう検索しているか
kotaroooo0
0
220
ぼくのかんがえたさいきょうのDocker Build
kotaroooo0
0
79
Other Decks in Programming
See All in Programming
Streams APIとTCPフロー制御 / Web Streams API and TCP flow control
tasshi
2
360
Snowflake x dbtで作るセキュアでアジャイルなデータ基盤
tsoshiro
2
520
React CompilerとFine Grained Reactivityと宣言的UIのこれから / The next chapter of declarative UI
ssssota
1
110
EMになってからチームの成果を最大化するために取り組んだこと/ Maximize team performance as EM
nashiusagi
0
100
3rd party scriptでもReactを使いたい! Preact + Reactのハイブリッド開発
righttouch
PRO
1
610
距離関数を極める! / SESSIONS 2024
gam0022
0
290
Better Code Design in PHP
afilina
PRO
0
130
Enabling DevOps and Team Topologies Through Architecture: Architecting for Fast Flow
cer
PRO
0
350
デザインパターンで理解するLLMエージェントの作り方 / How to develop an LLM agent using agentic design patterns
rkaga
3
370
.NET のための通信フレームワーク MagicOnion 入門 / Introduction to MagicOnion
mayuki
1
1.8k
A Journey of Contribution and Collaboration in Open Source
ivargrimstad
0
1.1k
AI時代におけるSRE、 あるいはエンジニアの生存戦略
pyama86
6
1.2k
Featured
See All Featured
Intergalactic Javascript Robots from Outer Space
tanoku
269
27k
GraphQLとの向き合い方2022年版
quramy
43
13k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
27
840
What's in a price? How to price your products and services
michaelherold
243
12k
Design and Strategy: How to Deal with People Who Don’t "Get" Design
morganepeng
126
18k
Ruby is Unlike a Banana
tanoku
97
11k
4 Signs Your Business is Dying
shpigford
180
21k
Into the Great Unknown - MozCon
thekraken
32
1.5k
Put a Button on it: Removing Barriers to Going Fast.
kastner
59
3.5k
The Power of CSS Pseudo Elements
geoffreycrofte
73
5.3k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
109
49k
YesSQL, Process and Tooling at Scale
rocio
169
14k
Transcript
2020/11/11 @kotaroooo0 ԶͷશจݕࡧΤϯδϯ(Go) Λ࡞Γ࢝Ίͨ
͜ͷLTΛฉ͘ͱ… 1. ͳΜͱͳ͘શจݕࡧΤϯδϯͷΈ͕͔ Δ 2. GoͰશจݕࡧΤϯδϯΛ࡞Γ࢝ΊΒΕΔ
-શจݕࡧΤϯδϯͷΈΛΔ -ElasticsearchͰΘΕ͍ͯΔApache LuceneΆ͍༷ͷͷΛ࡞Δ -GoΛֶΜͰ͍ΔͷͰԿ͔࡞Γ͍ͨ -Twitter BotΛ࡞͍ͬͯΔ͕ɺͪΐ͏Ͳ͍͍શจݕࡧΤϯδϯ͕ͳ͍ -ܰྔɺ͔ͭॏΈ͖ϨʔϕϯγϡλΠϯڑΛܭࢉͰ͖Δͭ શจݕࡧΤϯδϯΛ࡞Δཧ༝
3Ͱ͔ΔશจݕࡧΤϯδϯ
શจݕࡧͷΈ INDEXING ୯ޠ จॻ have 1,2 pen 1 we 2
Desk 2 จॻ1 “I have a pen.” จॻ2 “We have desk.” CHAR FILTER TOKENIZER TOKEN FILTER Analyzer
શจݕࡧͷΈ SEARCH ୯ޠ จॻ have 1,2 pen 1 we 2
Desk 2 ݕࡧϫʔυ: “pen” จॻ1͕ώοτ CHAR FILTER TOKENIZER TOKEN FILTER Analyzer
ANALYZERͳͥඞཁ? - τʔΫϯׂͯ͘͠ΕΔͨΊ - “I have a pen.” ͜ͷ··ͰసஔΠϯσοΫε Λ࡞Ͱ͖ͳ͍ͷͰɺI,
have, a, penͱτʔΫ ϯׂ͍ͨ͠ - ΫΤϦͷදه༳ΕΛٵऩͨ͠Γ͢ΔͨΊ - “GOD”ͱ͍͏୯ޠΛؚΉυΩϡϝϯτ ɺ”god”Ͱώοτ͢ΔΑ͏ʹখจࣈʹ౷Ұ ͍ͤͨ͞ - ແବͳτʔΫϯͷϑΟϧλϦϯά - theͳͲΠϯσΩγϯάͯ͠ແବ Analyzeલ Analyzeޙ “I have a BIG pen!” have, big, pen
۩ମతͳANALYZERྫ - Char Filter(Tokenizerͷલʹॲཧ͢Δ) 0ݸҎ্ - Mapping: إจࣈΛ୯ޠʹมͳͲ - HTMLstrip:
HTMLΛύʔε - Tokenizer(τʔΫϯׂ͢Δ) 1ݸ - Standard: εϖʔεͳͲϧʔϧʹैׂͬͯ - Kuromoji: ܗଶૉղੳͰׂ - Ngram: Nจࣈ͝ͱʹׂ - Token Filter(Tokenizerͷޙʹॲཧ͢Δ) 0ݸҎ্ - Lowercase: খจࣈ - Stopword: ετοϓϫʔυআڈ - Stemming: දه༳Ε CHAR FILTER TOKENIZER TOKEN FILTER Analyzer
࣮
ANALYZERͷ࣮ߦΠϝʔδ Analyzeલ MappingCharFilter StandardTokenizer LowercaseFilter StopWordFilter StemmerFilter I have a
lot of TASKs. I am very sad :( I have a lot of TASKs. I am very sad _sad_ I, have, a, lot, of, TASKs, I, am, very, sad, sad I, have, a, lot, of, tasks, I, am, very, sad, sad lot, tasks, am, very, sad, sad lot, task, am, very, sad, sad ॲཧͷྲྀΕ
ANALYZERͷ࣮
CHAR FILTERͷ࣮
TOKENIZERͷ࣮
TOKEN FILTERͷ࣮
INDEX - సஔΠϯσοΫε map[string][]int - సஔΠϯσοΫεΛϑΟʔϧυ ໊͝ͱʹ࣋ͭ - υΩϡϝϯτɺIDͱϑΟʔϧ υΛ࣋ͭ
- Indexing͢Δͱ͖AnalyzerΛ ௨͢
SEARCH -ANDݕࡧͱORݕࡧ -ΫΤϦ: “pink orange blue” -ANDݕࡧ: 3 -OR: ݕࡧ1,2,3,4,5,6
pink Orange blue 6 3 4 5 2 1
ಈ͔ͯ͠ΈΔ
ݕࡧྫ - ※Analyzer͖ͬ͞ͱಉ༷ - IndexSearchAnalyzerΛ௨͢ - ”foxes”Ͱ”fox”ΛؚΉυΩϡϝϯτ͕Ϛον - ”happy”Ͱ”:)”ΛؚΉυΩϡϝϯτ͕Ϛον -
ANDݕࡧ - fine,FaX,foxes,happy͕શؚͯ·Ε͍ͯΔυΩϡϝ ϯτ1,2͕ώοτ͍ͯ͠Δ
ࠓޙ -͍͋·͍ݕࡧΛ࣮͢Δ -Fuzzy QueryɺSuggesters -είΞܭࢉΛ࣮͢Δ -IFIDF, BM25
ࢀߟ -https://github.com/kotaroooo0/stalefish -https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full-text-search-engine/