Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
転置インデックスでどう検索しているか
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
kotaroooo0
November 19, 2020
Technology
0
360
転置インデックスでどう検索しているか
kotaroooo0
November 19, 2020
Tweet
Share
More Decks by kotaroooo0
See All by kotaroooo0
データ鮮度を落とさずに安全にReindexしたい
kotaroooo0
0
100
検索エンジン自作入門 Go Conference 2021 Spring
kotaroooo0
17
7.5k
俺の全文検索エンジン(Go製)を作り始めた
kotaroooo0
0
120
ぼくのかんがえたさいきょうのDocker Build
kotaroooo0
0
97
Other Decks in Technology
See All in Technology
中央集権型を脱却した話 分散型をやめて、連邦型にたどり着くまで
sansantech
PRO
1
340
Agent Skill 是什麼?對軟體產業帶來的變化
appleboy
0
220
俺の/私の最強アーキテクチャ決定戦開催 ― チームで新しいアーキテクチャに適合していくために / 20260322 Naoki Takahashi
shift_evolve
PRO
1
430
Phase06_ClaudeCode実践
overflowinc
0
1.8k
FastMCP OAuth Proxy with Cognito
hironobuiga
3
180
Phase07_実務適用
overflowinc
0
1.7k
大規模ECサイトのあるバッチのパフォーマンスを改善するために僕たちのチームがしてきたこと
panda_program
1
380
Zero Data Loss Autonomous Recovery Service サービス概要
oracle4engineer
PRO
4
13k
Escape from Excel方眼紙 ~マークダウンで繋ぐ、人とAIの架け橋~ /nikkei-tech-talk44
nikkei_engineer_recruiting
0
200
ABEMAのバグバウンティの取り組み
kurochan
1
610
SSoT(Single Source of Truth)で「壊して再生」する設計
kawauso
2
320
開発チームとQAエンジニアの新しい協業モデル -年末調整開発チームで実践する【QAリード施策】-
kaomi_wombat
0
240
Featured
See All Featured
SERP Conf. Vienna - Web Accessibility: Optimizing for Inclusivity and SEO
sarafernandez
1
1.4k
YesSQL, Process and Tooling at Scale
rocio
174
15k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
249
1.3M
Redefining SEO in the New Era of Traffic Generation
szymonslowik
1
250
StorybookのUI Testing Handbookを読んだ
zakiyama
31
6.6k
Designing Dashboards & Data Visualisations in Web Apps
destraynor
231
54k
Deep Space Network (abreviated)
tonyrice
0
95
Applied NLP in the Age of Generative AI
inesmontani
PRO
4
2.2k
Raft: Consensus for Rubyists
vanstee
141
7.4k
Are puppies a ranking factor?
jonoalderson
1
3.1k
Imperfection Machines: The Place of Print at Facebook
scottboms
269
14k
B2B Lead Gen: Tactics, Traps & Triumph
marketingsoph
0
84
Transcript
2020/11/19 @kotaroooo0 సஔΠϯσοΫεͰ Ͳ͏ݕࡧ͍ͯ͠Δ͔
ࣗݾհ
సஔΠϯσοΫεɺͪΖΜͬͯΔΑ? grepΈͨ͘ஞ࣍ݕࡧͯ͠ΔͱΊͪΌͪ͘Ό͕͔͔࣌ؒΔ͔ ΒɺݕࡧΛૣ͘͢ΔͨΊʹࣄલʹຊͷ࣍Έ͍ͨͳͷΛ ࡞͓ͬͯ͘ΜͰ͠ΐ? ͰɺͲ͏ͬͯݕࡧ͍ͯ͠Δ͔·Ͱ… ఆฉ͖ख
సஔΠϯσοΫεΛ Δ
సஔΠϯσοΫε netflix prime amazon - సஔΠϯσοΫε = ࣙॻ + సஔϦετ
1 1 2 3 4 5 5 ࣙॻ సஔϦετ ϙεςΟϯάϦετ
సஔΠϯσοΫε netflix prime amazon - Word-level inverted list ͱݺΕɺ୯ޠ͕จॻͷԿ୯ޠ͔อଘ͢Δ͜ͱ͋Δ -
DocID;offset1,ofset2… 1;2 1;3 2;3 3;5 4;1 5;2 5;3 2;5
୯ޠͷҐஔใͳʹʹ͏? - ϑϨʔζΛ୳͢߹ - ʮAmazon Primeʯͱݕࡧͨ͠߹ - D1: “a prime
concern of Amazon” - D2: “Amazon Prime movies” - Ґஔใ͕͋Ε୯ޠͷॱংΛߟྀ͢Δ͜ͱ͕Ͱ͖ΔͷͰɺD2ͷΈΛώοτ͞ ͤΔ͜ͱ͕Ͱ͖ͨΓɺD2ͷείΞΛେ͖ͨ͘͠Γ͢Δ͜ͱ͕Ͱ͖Δ
ݕࡧ͢Δ
ANDݕࡧͱORݕࡧ -ΫΤϦ: “pink orange blue” -ANDݕࡧ: 3 -ORݕࡧ: 1,2,3,4,5,6 pink
Orange blue 6 3 4 5 2 1
φΠʔϒͳݕࡧઓུ - ϙεςΟϯάϦετΛࠪ͢Δํࣜ - TAAT(Term At A Time) - ϙεςΟϯάϦετΛ̍ͭͣͭॲཧ͢Δɻಉ࣌ʹ։͘ϙεςΟϯάϦετͷΧʔι
ϧ͚̍ͭͩɻ - ୯ޠ͝ͱʹࠪ͢Δ - DAAT(Document At A Time) - શ୯ޠͷϙεςΟϯάϦετΛಉ࣌ʹॲཧ͢ΔɻΫΤϦʹؚ·ΕΔ୯ޠͷϙε ςΟϯάϦετͷΧʔιϧΛͯ͢։͖ɼಉ࣌ʹਐΊ͍ͯ͘ɻ - υΩϡϝϯτ͝ͱʹࠪ͢Δ
TAATͰͷANDݕࡧ 1. ϙεςΟϯάϦετͷαΠζ͕࠷খͷͷ(prime)Λબ͠ΛɺAccumulator࡞ [2, 5] 2. amazonͷϦετΛࠪ ɾ2ؚ·Ε͍ͯΔ͔?5ؚ·Ε͍ͯΔ͔?ͷΈͷνΣοΫͰOK netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
TAATͰͷORݕࡧ 1. ͲͷΩʔͰྑ͍ͷͰAccumlatorΛ࡞ [1,2,5] (amazon) 2. ॏෳ͠ͳ͍શͯͷΩʔΛݟͯϚʔδ ✌(‘ω'✌ ) ݪ࢝త
( ✌'ω')✌ netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulated = [] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulater = [2] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulate = [2, 5]
netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷORݕࡧ - ΧʔιϧΛಈ͔ͯ͠ɺશͯͷཁૉΛॏෳͳ͘AccumulatorʹՃ ✌(‘ω'✌ ) ݪ࢝త ( ✌'ω')✌ netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͱTAAT - DAATͷϝϦοτ - DAATͷํ͕ɺলϝϞϦͰࡁΉ(ྫ: τοϓ10݅ݕࡧ) - DAATͷํ͕ɺΫΤϦ༻ޠ͕จॻͷಛఆͷ݅Λຬ͍ͨͯ͠Δ͔Ͳ͏͔Λ؆୯ʹࣝ ผͰ͖Δ(ྫ: ϑϨʔζݕࡧɺϑΟϧλϦϯά)
- ElasticsearchͰར༻͞Ε͍ͯΔLuceneDAATํࣜ - ORݕࡧݪ࢝తͳΈͰ͋ΓɺANDݕࡧΑΓଟ͘ͷυΩϡϝϯτΛࠪ͢ΔͨΊɺ ॏ͍ͨ - ݕࡧΤϯδϯORݕࡧʹ࠷దԽ͞Ε͍ͯΔ
ORݕࡧͷ ࠷దԽ४උ
Ͳ͏ߴԽ͢Δ͔ - DAATΛϕʔεʹվળ͢Δ - ݕࡧ݁Ռ্͕Ґ͚݅ͩඞཁͰ͋Δɺ্ҐʹདྷΔՄೳੑ͕ͳ͍จষͷධՁΛεΩοϓ ͢Δ͜ͱʹΑΓɺॲཧͷߴԽ͕Մೳ - ͍߹Θͤʹର্ͯ͠Ґk݅ͷΈΛऔΓग़͢͜ͱΛtop-k query processingͱݺͿ
จষͷϥϯΫ͚ - ্Ґk݅Λग़ྗ͢ΔͨΊʹɺΫΤϦʹରͯ͠ͲͷจষͷॱҐ͕ߴ͍͔Λܾఆ͢Δඞཁ͕͋ Δ - TF-IDF, Okapi BM25
సஔΠϯσοΫεͷ֦ு - సஔΠϯσοΫεʹରͯ͠ɺ֤୯ޠͷείΞ࠷େͱυΩϡϝϯτ͝ͱͷ୯ޠͷείΞ Λ༩͢Δ - ܗࣜ DocID;Score netflix prime amazon
1;5 1;3 2;1 3;2 4;1 5;2 5;1 2;3 max_score=5 max_score=3 max_score=2
Top-k Query ɾmax-score ɾinterval-based running
max-score - ্ҐkҐʹೖΔͨΊͷείΞ5ඞཁͱ͢Δͱɺ”amazon”,”prime”ͷmax_score5ະຬͳͷ Ͱɺ”amazon”ͷΈ”prime”ͷΈΛؚΉจষީิ͔Β֎ΕΔ netflix prime amazon max_score=5 max_score=3 max_score=2
1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - ্Ґ1݅Λऔಘ͍ͨ͠ - จॻ1: score6 - ඞͣ”netflix”ΛؚΉυΩϡϝϯτ͡Όͳ͍ͱμϝ netflix prime
amazon max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - “netflix”ΛؚΉจॻ5·ͰΧʔιϧΛඈ͢ - จॻ5είΞ4 - จॻ1,5ͷΈΛධՁ͢Δ͚ͩͰऴྃ netflix prime amazon
max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 2;3 5;2 3;3 5;1 5;1
Interval-base - max_scoreΑΓ͞ΒʹεΩοϓͰ͖Δ - ͕͜͜ΒΜͰྗਚ͖ͨ࣌ؒͪ͠ΐ͏Ͳ͍͍ͩΖ͏…
LuceneͰ - Lucent 8Ͱmax-scoreͷൃలܗͰ͋ΔWAND͕ΘΕ͍ͯ·͢ - Ding, Shuai, and Torsten Suel.
"Faster top-k document retrieval using block-max indexes." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011. - 2019/3/14ϦϦʔεͷLucene 8Ͱ্ͷΞϧΰϦζϜ͕࣮͞ΕΔͳͲࠓͰޮతͳΞϧ ΰϦζϜͷݚڀ͕ଓ͍͍ͯΔ APA
సஔΠϯσοΫεͷ ࣮
·ͱΊ - సஔΠϯσοΫε༷ʑͳϝλใΛՃ͢Δ͜ͱͰ֦ு͞ΕΔ(୯ޠͷΦϑηοτɺε ίΞɺϙΠϯλ) - సஔΠϯσοΫεʹରͯ͠ANDݕࡧɺORݕࡧ͢ΔࡍͷφΠʔϒͳํ๏ - TAAT: ୯ޠ͝ͱʹࠪ͢Δ -
DAAT: จॻ͝ͱʹࠪ͢Δ - DAATʹର͢ΔORݕࡧʹ࠷దԽ - max-score: ୯ޠ͝ͱͷ࠷େείΞͱݱࡏͷ࠷େείΞΛอ࣋͢Δ͜ͱͰࢬמΓΛߦ ͍୳ࡧΛεΩοϓ͢Δ - LuceneͰDAAT͕࠾༻͞Ε͓ͯΓɺݱࡏͰΞϧΰϦζϜ͕վળ͞Ε͍ͯΔ