Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
転置インデックスでどう検索しているか
Search
kotaroooo0
November 19, 2020
Technology
380
0
Share
転置インデックスでどう検索しているか
kotaroooo0
November 19, 2020
More Decks by kotaroooo0
See All by kotaroooo0
データ鮮度を落とさずに安全にReindexしたい
kotaroooo0
0
110
検索エンジン自作入門 Go Conference 2021 Spring
kotaroooo0
17
7.6k
俺の全文検索エンジン(Go製)を作り始めた
kotaroooo0
0
130
ぼくのかんがえたさいきょうのDocker Build
kotaroooo0
0
110
Other Decks in Technology
See All in Technology
Platform engineering for developers, architects & the rest of us (AI agents)
danielbryantuk
0
150
個人AIからチームAIへ:開発における品質と生産性の再設計
moongift
PRO
0
320
oracle-to-databricks-migration-with-llm-and-dbt
casek
1
380
Unlocking the Apps
pimterry
0
120
マーケットプレイス版Oracle WebCenter Content For OCI
oracle4engineer
PRO
5
1.7k
JJUG CCC 2026 Spring AI時代の開発こそ標準化を武器に! ― 方式・プロセス・プラットフォームの標準化
s27watanabe
2
630
テストコードのないプロジェクトにテストを根付かせる
tttol
0
230
組織の中で自分を経営する技術
shoota
0
230
AI時代の私の技術インプットとアウトプット術
tonkotsuboy_com
15
7.9k
Spring Boot における AOT Cache 活用テクニックと 起動時間改善事例
ntt_dsol_java
0
180
A Harness for Behaviour: how to get AI to generate code that does what we intend, or "TDD in the age of AI"
xpmatteo
1
520
イベントで大活躍する電子ペーパー名札 〜その3〜 / ビジュアルプログラミングIoTLT vol.23
you
PRO
0
170
Featured
See All Featured
How to build a perfect <img>
jonoalderson
1
5.5k
Amusing Abliteration
ianozsvald
1
190
Future Trends and Review - Lecture 12 - Web Technologies (1019888BNR)
signer
PRO
0
3.6k
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
2
380
Test your architecture with Archunit
thirion
1
2.3k
Sam Torres - BigQuery for SEOs
techseoconnect
PRO
0
280
Claude Code のすすめ
schroneko
67
220k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
254
22k
How Software Deployment tools have changed in the past 20 years
geshan
0
34k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1.2k
Automating Front-end Workflow
addyosmani
1370
210k
jQuery: Nuts, Bolts and Bling
dougneiner
66
8.5k
Transcript
2020/11/19 @kotaroooo0 సஔΠϯσοΫεͰ Ͳ͏ݕࡧ͍ͯ͠Δ͔
ࣗݾհ
సஔΠϯσοΫεɺͪΖΜͬͯΔΑ? grepΈͨ͘ஞ࣍ݕࡧͯ͠ΔͱΊͪΌͪ͘Ό͕͔͔࣌ؒΔ͔ ΒɺݕࡧΛૣ͘͢ΔͨΊʹࣄલʹຊͷ࣍Έ͍ͨͳͷΛ ࡞͓ͬͯ͘ΜͰ͠ΐ? ͰɺͲ͏ͬͯݕࡧ͍ͯ͠Δ͔·Ͱ… ఆฉ͖ख
సஔΠϯσοΫεΛ Δ
సஔΠϯσοΫε netflix prime amazon - సஔΠϯσοΫε = ࣙॻ + సஔϦετ
1 1 2 3 4 5 5 ࣙॻ సஔϦετ ϙεςΟϯάϦετ
సஔΠϯσοΫε netflix prime amazon - Word-level inverted list ͱݺΕɺ୯ޠ͕จॻͷԿ୯ޠ͔อଘ͢Δ͜ͱ͋Δ -
DocID;offset1,ofset2… 1;2 1;3 2;3 3;5 4;1 5;2 5;3 2;5
୯ޠͷҐஔใͳʹʹ͏? - ϑϨʔζΛ୳͢߹ - ʮAmazon Primeʯͱݕࡧͨ͠߹ - D1: “a prime
concern of Amazon” - D2: “Amazon Prime movies” - Ґஔใ͕͋Ε୯ޠͷॱংΛߟྀ͢Δ͜ͱ͕Ͱ͖ΔͷͰɺD2ͷΈΛώοτ͞ ͤΔ͜ͱ͕Ͱ͖ͨΓɺD2ͷείΞΛେ͖ͨ͘͠Γ͢Δ͜ͱ͕Ͱ͖Δ
ݕࡧ͢Δ
ANDݕࡧͱORݕࡧ -ΫΤϦ: “pink orange blue” -ANDݕࡧ: 3 -ORݕࡧ: 1,2,3,4,5,6 pink
Orange blue 6 3 4 5 2 1
φΠʔϒͳݕࡧઓུ - ϙεςΟϯάϦετΛࠪ͢Δํࣜ - TAAT(Term At A Time) - ϙεςΟϯάϦετΛ̍ͭͣͭॲཧ͢Δɻಉ࣌ʹ։͘ϙεςΟϯάϦετͷΧʔι
ϧ͚̍ͭͩɻ - ୯ޠ͝ͱʹࠪ͢Δ - DAAT(Document At A Time) - શ୯ޠͷϙεςΟϯάϦετΛಉ࣌ʹॲཧ͢ΔɻΫΤϦʹؚ·ΕΔ୯ޠͷϙε ςΟϯάϦετͷΧʔιϧΛͯ͢։͖ɼಉ࣌ʹਐΊ͍ͯ͘ɻ - υΩϡϝϯτ͝ͱʹࠪ͢Δ
TAATͰͷANDݕࡧ 1. ϙεςΟϯάϦετͷαΠζ͕࠷খͷͷ(prime)Λબ͠ΛɺAccumulator࡞ [2, 5] 2. amazonͷϦετΛࠪ ɾ2ؚ·Ε͍ͯΔ͔?5ؚ·Ε͍ͯΔ͔?ͷΈͷνΣοΫͰOK netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
TAATͰͷORݕࡧ 1. ͲͷΩʔͰྑ͍ͷͰAccumlatorΛ࡞ [1,2,5] (amazon) 2. ॏෳ͠ͳ͍શͯͷΩʔΛݟͯϚʔδ ✌(‘ω'✌ ) ݪ࢝త
( ✌'ω')✌ netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulated = [] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulater = [2] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulate = [2, 5]
netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷORݕࡧ - ΧʔιϧΛಈ͔ͯ͠ɺશͯͷཁૉΛॏෳͳ͘AccumulatorʹՃ ✌(‘ω'✌ ) ݪ࢝త ( ✌'ω')✌ netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͱTAAT - DAATͷϝϦοτ - DAATͷํ͕ɺলϝϞϦͰࡁΉ(ྫ: τοϓ10݅ݕࡧ) - DAATͷํ͕ɺΫΤϦ༻ޠ͕จॻͷಛఆͷ݅Λຬ͍ͨͯ͠Δ͔Ͳ͏͔Λ؆୯ʹࣝ ผͰ͖Δ(ྫ: ϑϨʔζݕࡧɺϑΟϧλϦϯά)
- ElasticsearchͰར༻͞Ε͍ͯΔLuceneDAATํࣜ - ORݕࡧݪ࢝తͳΈͰ͋ΓɺANDݕࡧΑΓଟ͘ͷυΩϡϝϯτΛࠪ͢ΔͨΊɺ ॏ͍ͨ - ݕࡧΤϯδϯORݕࡧʹ࠷దԽ͞Ε͍ͯΔ
ORݕࡧͷ ࠷దԽ४උ
Ͳ͏ߴԽ͢Δ͔ - DAATΛϕʔεʹվળ͢Δ - ݕࡧ݁Ռ্͕Ґ͚݅ͩඞཁͰ͋Δɺ্ҐʹདྷΔՄೳੑ͕ͳ͍จষͷධՁΛεΩοϓ ͢Δ͜ͱʹΑΓɺॲཧͷߴԽ͕Մೳ - ͍߹Θͤʹର্ͯ͠Ґk݅ͷΈΛऔΓग़͢͜ͱΛtop-k query processingͱݺͿ
จষͷϥϯΫ͚ - ্Ґk݅Λग़ྗ͢ΔͨΊʹɺΫΤϦʹରͯ͠ͲͷจষͷॱҐ͕ߴ͍͔Λܾఆ͢Δඞཁ͕͋ Δ - TF-IDF, Okapi BM25
సஔΠϯσοΫεͷ֦ு - సஔΠϯσοΫεʹରͯ͠ɺ֤୯ޠͷείΞ࠷େͱυΩϡϝϯτ͝ͱͷ୯ޠͷείΞ Λ༩͢Δ - ܗࣜ DocID;Score netflix prime amazon
1;5 1;3 2;1 3;2 4;1 5;2 5;1 2;3 max_score=5 max_score=3 max_score=2
Top-k Query ɾmax-score ɾinterval-based running
max-score - ্ҐkҐʹೖΔͨΊͷείΞ5ඞཁͱ͢Δͱɺ”amazon”,”prime”ͷmax_score5ະຬͳͷ Ͱɺ”amazon”ͷΈ”prime”ͷΈΛؚΉจষީิ͔Β֎ΕΔ netflix prime amazon max_score=5 max_score=3 max_score=2
1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - ্Ґ1݅Λऔಘ͍ͨ͠ - จॻ1: score6 - ඞͣ”netflix”ΛؚΉυΩϡϝϯτ͡Όͳ͍ͱμϝ netflix prime
amazon max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - “netflix”ΛؚΉจॻ5·ͰΧʔιϧΛඈ͢ - จॻ5είΞ4 - จॻ1,5ͷΈΛධՁ͢Δ͚ͩͰऴྃ netflix prime amazon
max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 2;3 5;2 3;3 5;1 5;1
Interval-base - max_scoreΑΓ͞ΒʹεΩοϓͰ͖Δ - ͕͜͜ΒΜͰྗਚ͖ͨ࣌ؒͪ͠ΐ͏Ͳ͍͍ͩΖ͏…
LuceneͰ - Lucent 8Ͱmax-scoreͷൃలܗͰ͋ΔWAND͕ΘΕ͍ͯ·͢ - Ding, Shuai, and Torsten Suel.
"Faster top-k document retrieval using block-max indexes." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011. - 2019/3/14ϦϦʔεͷLucene 8Ͱ্ͷΞϧΰϦζϜ͕࣮͞ΕΔͳͲࠓͰޮతͳΞϧ ΰϦζϜͷݚڀ͕ଓ͍͍ͯΔ APA
సஔΠϯσοΫεͷ ࣮
·ͱΊ - సஔΠϯσοΫε༷ʑͳϝλใΛՃ͢Δ͜ͱͰ֦ு͞ΕΔ(୯ޠͷΦϑηοτɺε ίΞɺϙΠϯλ) - సஔΠϯσοΫεʹରͯ͠ANDݕࡧɺORݕࡧ͢ΔࡍͷφΠʔϒͳํ๏ - TAAT: ୯ޠ͝ͱʹࠪ͢Δ -
DAAT: จॻ͝ͱʹࠪ͢Δ - DAATʹର͢ΔORݕࡧʹ࠷దԽ - max-score: ୯ޠ͝ͱͷ࠷େείΞͱݱࡏͷ࠷େείΞΛอ࣋͢Δ͜ͱͰࢬמΓΛߦ ͍୳ࡧΛεΩοϓ͢Δ - LuceneͰDAAT͕࠾༻͞Ε͓ͯΓɺݱࡏͰΞϧΰϦζϜ͕վળ͞Ε͍ͯΔ