Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
転置インデックスでどう検索しているか
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
kotaroooo0
November 19, 2020
Technology
0
350
転置インデックスでどう検索しているか
kotaroooo0
November 19, 2020
Tweet
Share
More Decks by kotaroooo0
See All by kotaroooo0
データ鮮度を落とさずに安全にReindexしたい
kotaroooo0
0
100
検索エンジン自作入門 Go Conference 2021 Spring
kotaroooo0
17
7.5k
俺の全文検索エンジン(Go製)を作り始めた
kotaroooo0
0
120
ぼくのかんがえたさいきょうのDocker Build
kotaroooo0
0
95
Other Decks in Technology
See All in Technology
バクラクのSREにおけるAgentic AIへの挑戦/Our Journey with Agentic AI
taddy_919
2
1k
男(監査)はつらいよ - Policy as CodeからAIエージェントへ
ken5scal
5
740
Exadata Database Service on Dedicated Infrastructure(ExaDB-D) UI スクリーン・キャプチャ集
oracle4engineer
PRO
7
7.1k
開発組織の課題解決を加速するための権限委譲 -する側、される側としての向き合い方-
daitasu
5
260
どこで打鍵するのが良い? IaCの実行基盤選定について
nrinetcom
PRO
2
170
ブラックボックス観測に基づくAI支援のプロトコルのリバースエンジニアリングと再現~AIを用いたリバースエンジニアリング~ @ SECCON 14 電脳会議 / Reverse Engineering and Reproduction of an AI-Assisted Protocol Based on Black-Box Observation @ SECCON 14 DENNO-KAIGI
chibiegg
0
150
生成AIの利用とセキュリティ /gen-ai-and-security
mizutani
1
1.2k
ソフトウェアアーキテクトのための意思決定術: Create Decision Readiness—The Real Skill Behind Architectural Decision
snoozer05
PRO
30
9.1k
Sansan Engineering Unit 紹介資料
sansan33
PRO
1
4k
All About Sansan – for New Global Engineers
sansan33
PRO
1
1.4k
Oracle Database@Azure:サービス概要のご紹介
oracle4engineer
PRO
4
1.1k
モブプログラミング再入門 ー 基本から見直す、AI時代のチーム開発の選択肢 ー / A Re-introduction of Mob Programming
takaking22
1
120
Featured
See All Featured
More Than Pixels: Becoming A User Experience Designer
marktimemedia
3
340
Public Speaking Without Barfing On Your Shoes - THAT 2023
reverentgeek
1
330
How to build an LLM SEO readiness audit: a practical framework
nmsamuel
1
660
The Illustrated Guide to Node.js - THAT Conference 2024
reverentgeek
1
280
Building AI with AI
inesmontani
PRO
1
760
Test your architecture with Archunit
thirion
1
2.2k
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
280
Embracing the Ebb and Flow
colly
88
5k
Ecommerce SEO: The Keys for Success Now & Beyond - #SERPConf2024
aleyda
1
1.8k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
254
22k
Primal Persuasion: How to Engage the Brain for Learning That Lasts
tmiket
0
280
The Organizational Zoo: Understanding Human Behavior Agility Through Metaphoric Constructive Conversations (based on the works of Arthur Shelley, Ph.D)
kimpetersen
PRO
0
260
Transcript
2020/11/19 @kotaroooo0 సஔΠϯσοΫεͰ Ͳ͏ݕࡧ͍ͯ͠Δ͔
ࣗݾհ
సஔΠϯσοΫεɺͪΖΜͬͯΔΑ? grepΈͨ͘ஞ࣍ݕࡧͯ͠ΔͱΊͪΌͪ͘Ό͕͔͔࣌ؒΔ͔ ΒɺݕࡧΛૣ͘͢ΔͨΊʹࣄલʹຊͷ࣍Έ͍ͨͳͷΛ ࡞͓ͬͯ͘ΜͰ͠ΐ? ͰɺͲ͏ͬͯݕࡧ͍ͯ͠Δ͔·Ͱ… ఆฉ͖ख
సஔΠϯσοΫεΛ Δ
సஔΠϯσοΫε netflix prime amazon - సஔΠϯσοΫε = ࣙॻ + సஔϦετ
1 1 2 3 4 5 5 ࣙॻ సஔϦετ ϙεςΟϯάϦετ
సஔΠϯσοΫε netflix prime amazon - Word-level inverted list ͱݺΕɺ୯ޠ͕จॻͷԿ୯ޠ͔อଘ͢Δ͜ͱ͋Δ -
DocID;offset1,ofset2… 1;2 1;3 2;3 3;5 4;1 5;2 5;3 2;5
୯ޠͷҐஔใͳʹʹ͏? - ϑϨʔζΛ୳͢߹ - ʮAmazon Primeʯͱݕࡧͨ͠߹ - D1: “a prime
concern of Amazon” - D2: “Amazon Prime movies” - Ґஔใ͕͋Ε୯ޠͷॱংΛߟྀ͢Δ͜ͱ͕Ͱ͖ΔͷͰɺD2ͷΈΛώοτ͞ ͤΔ͜ͱ͕Ͱ͖ͨΓɺD2ͷείΞΛେ͖ͨ͘͠Γ͢Δ͜ͱ͕Ͱ͖Δ
ݕࡧ͢Δ
ANDݕࡧͱORݕࡧ -ΫΤϦ: “pink orange blue” -ANDݕࡧ: 3 -ORݕࡧ: 1,2,3,4,5,6 pink
Orange blue 6 3 4 5 2 1
φΠʔϒͳݕࡧઓུ - ϙεςΟϯάϦετΛࠪ͢Δํࣜ - TAAT(Term At A Time) - ϙεςΟϯάϦετΛ̍ͭͣͭॲཧ͢Δɻಉ࣌ʹ։͘ϙεςΟϯάϦετͷΧʔι
ϧ͚̍ͭͩɻ - ୯ޠ͝ͱʹࠪ͢Δ - DAAT(Document At A Time) - શ୯ޠͷϙεςΟϯάϦετΛಉ࣌ʹॲཧ͢ΔɻΫΤϦʹؚ·ΕΔ୯ޠͷϙε ςΟϯάϦετͷΧʔιϧΛͯ͢։͖ɼಉ࣌ʹਐΊ͍ͯ͘ɻ - υΩϡϝϯτ͝ͱʹࠪ͢Δ
TAATͰͷANDݕࡧ 1. ϙεςΟϯάϦετͷαΠζ͕࠷খͷͷ(prime)Λબ͠ΛɺAccumulator࡞ [2, 5] 2. amazonͷϦετΛࠪ ɾ2ؚ·Ε͍ͯΔ͔?5ؚ·Ε͍ͯΔ͔?ͷΈͷνΣοΫͰOK netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
TAATͰͷORݕࡧ 1. ͲͷΩʔͰྑ͍ͷͰAccumlatorΛ࡞ [1,2,5] (amazon) 2. ॏෳ͠ͳ͍શͯͷΩʔΛݟͯϚʔδ ✌(‘ω'✌ ) ݪ࢝త
( ✌'ω')✌ netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulated = [] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulater = [2] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulate = [2, 5]
netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷORݕࡧ - ΧʔιϧΛಈ͔ͯ͠ɺશͯͷཁૉΛॏෳͳ͘AccumulatorʹՃ ✌(‘ω'✌ ) ݪ࢝త ( ✌'ω')✌ netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͱTAAT - DAATͷϝϦοτ - DAATͷํ͕ɺলϝϞϦͰࡁΉ(ྫ: τοϓ10݅ݕࡧ) - DAATͷํ͕ɺΫΤϦ༻ޠ͕จॻͷಛఆͷ݅Λຬ͍ͨͯ͠Δ͔Ͳ͏͔Λ؆୯ʹࣝ ผͰ͖Δ(ྫ: ϑϨʔζݕࡧɺϑΟϧλϦϯά)
- ElasticsearchͰར༻͞Ε͍ͯΔLuceneDAATํࣜ - ORݕࡧݪ࢝తͳΈͰ͋ΓɺANDݕࡧΑΓଟ͘ͷυΩϡϝϯτΛࠪ͢ΔͨΊɺ ॏ͍ͨ - ݕࡧΤϯδϯORݕࡧʹ࠷దԽ͞Ε͍ͯΔ
ORݕࡧͷ ࠷దԽ४උ
Ͳ͏ߴԽ͢Δ͔ - DAATΛϕʔεʹվળ͢Δ - ݕࡧ݁Ռ্͕Ґ͚݅ͩඞཁͰ͋Δɺ্ҐʹདྷΔՄೳੑ͕ͳ͍จষͷධՁΛεΩοϓ ͢Δ͜ͱʹΑΓɺॲཧͷߴԽ͕Մೳ - ͍߹Θͤʹର্ͯ͠Ґk݅ͷΈΛऔΓग़͢͜ͱΛtop-k query processingͱݺͿ
จষͷϥϯΫ͚ - ্Ґk݅Λग़ྗ͢ΔͨΊʹɺΫΤϦʹରͯ͠ͲͷจষͷॱҐ͕ߴ͍͔Λܾఆ͢Δඞཁ͕͋ Δ - TF-IDF, Okapi BM25
సஔΠϯσοΫεͷ֦ு - సஔΠϯσοΫεʹରͯ͠ɺ֤୯ޠͷείΞ࠷େͱυΩϡϝϯτ͝ͱͷ୯ޠͷείΞ Λ༩͢Δ - ܗࣜ DocID;Score netflix prime amazon
1;5 1;3 2;1 3;2 4;1 5;2 5;1 2;3 max_score=5 max_score=3 max_score=2
Top-k Query ɾmax-score ɾinterval-based running
max-score - ্ҐkҐʹೖΔͨΊͷείΞ5ඞཁͱ͢Δͱɺ”amazon”,”prime”ͷmax_score5ະຬͳͷ Ͱɺ”amazon”ͷΈ”prime”ͷΈΛؚΉจষީิ͔Β֎ΕΔ netflix prime amazon max_score=5 max_score=3 max_score=2
1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - ্Ґ1݅Λऔಘ͍ͨ͠ - จॻ1: score6 - ඞͣ”netflix”ΛؚΉυΩϡϝϯτ͡Όͳ͍ͱμϝ netflix prime
amazon max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - “netflix”ΛؚΉจॻ5·ͰΧʔιϧΛඈ͢ - จॻ5είΞ4 - จॻ1,5ͷΈΛධՁ͢Δ͚ͩͰऴྃ netflix prime amazon
max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 2;3 5;2 3;3 5;1 5;1
Interval-base - max_scoreΑΓ͞ΒʹεΩοϓͰ͖Δ - ͕͜͜ΒΜͰྗਚ͖ͨ࣌ؒͪ͠ΐ͏Ͳ͍͍ͩΖ͏…
LuceneͰ - Lucent 8Ͱmax-scoreͷൃలܗͰ͋ΔWAND͕ΘΕ͍ͯ·͢ - Ding, Shuai, and Torsten Suel.
"Faster top-k document retrieval using block-max indexes." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011. - 2019/3/14ϦϦʔεͷLucene 8Ͱ্ͷΞϧΰϦζϜ͕࣮͞ΕΔͳͲࠓͰޮతͳΞϧ ΰϦζϜͷݚڀ͕ଓ͍͍ͯΔ APA
సஔΠϯσοΫεͷ ࣮
·ͱΊ - సஔΠϯσοΫε༷ʑͳϝλใΛՃ͢Δ͜ͱͰ֦ு͞ΕΔ(୯ޠͷΦϑηοτɺε ίΞɺϙΠϯλ) - సஔΠϯσοΫεʹରͯ͠ANDݕࡧɺORݕࡧ͢ΔࡍͷφΠʔϒͳํ๏ - TAAT: ୯ޠ͝ͱʹࠪ͢Δ -
DAAT: จॻ͝ͱʹࠪ͢Δ - DAATʹର͢ΔORݕࡧʹ࠷దԽ - max-score: ୯ޠ͝ͱͷ࠷େείΞͱݱࡏͷ࠷େείΞΛอ࣋͢Δ͜ͱͰࢬמΓΛߦ ͍୳ࡧΛεΩοϓ͢Δ - LuceneͰDAAT͕࠾༻͞Ε͓ͯΓɺݱࡏͰΞϧΰϦζϜ͕վળ͞Ε͍ͯΔ