Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
転置インデックスでどう検索しているか
Search
kotaroooo0
November 19, 2020
Technology
0
220
転置インデックスでどう検索しているか
kotaroooo0
November 19, 2020
Tweet
Share
More Decks by kotaroooo0
See All by kotaroooo0
検索エンジン自作入門 Go Conference 2021 Spring
kotaroooo0
17
7.1k
俺の全文検索エンジン(Go製)を作り始めた
kotaroooo0
0
99
ぼくのかんがえたさいきょうのDocker Build
kotaroooo0
0
79
Other Decks in Technology
See All in Technology
Can We Measure Developer Productivity?
ewolff
1
150
Lambda10周年!Lambdaは何をもたらしたか
smt7174
2
110
個人でもIAM Identity Centerを使おう!(アクセス管理編)
ryder472
3
200
フルカイテン株式会社 採用資料
fullkaiten
0
40k
The Role of Developer Relations in AI Product Success.
giftojabu1
0
120
社内で最大の技術的負債のリファクタリングに取り組んだお話し
kidooonn
1
550
AWS Media Services 最新サービスアップデート 2024
eijikominami
0
200
【若手エンジニア応援LT会】ソフトウェアを学んできた私がインフラエンジニアを目指した理由
kazushi_ohata
0
150
IBC 2024 動画技術関連レポート / IBC 2024 Report
cyberagentdevelopers
PRO
0
110
Lexical Analysis
shigashiyama
1
150
OCI Vault 概要
oracle4engineer
PRO
0
9.7k
BLADE: An Attempt to Automate Penetration Testing Using Autonomous AI Agents
bbrbbq
0
300
Featured
See All Featured
Building Adaptive Systems
keathley
38
2.3k
Mobile First: as difficult as doing things right
swwweet
222
8.9k
For a Future-Friendly Web
brad_frost
175
9.4k
Sharpening the Axe: The Primacy of Toolmaking
bcantrill
38
1.8k
Designing for Performance
lara
604
68k
Designing Experiences People Love
moore
138
23k
Building Flexible Design Systems
yeseniaperezcruz
327
38k
Building Applications with DynamoDB
mza
90
6.1k
A Tale of Four Properties
chriscoyier
156
23k
Adopting Sorbet at Scale
ufuk
73
9.1k
Bootstrapping a Software Product
garrettdimon
PRO
305
110k
jQuery: Nuts, Bolts and Bling
dougneiner
61
7.5k
Transcript
2020/11/19 @kotaroooo0 సஔΠϯσοΫεͰ Ͳ͏ݕࡧ͍ͯ͠Δ͔
ࣗݾհ
సஔΠϯσοΫεɺͪΖΜͬͯΔΑ? grepΈͨ͘ஞ࣍ݕࡧͯ͠ΔͱΊͪΌͪ͘Ό͕͔͔࣌ؒΔ͔ ΒɺݕࡧΛૣ͘͢ΔͨΊʹࣄલʹຊͷ࣍Έ͍ͨͳͷΛ ࡞͓ͬͯ͘ΜͰ͠ΐ? ͰɺͲ͏ͬͯݕࡧ͍ͯ͠Δ͔·Ͱ… ఆฉ͖ख
సஔΠϯσοΫεΛ Δ
సஔΠϯσοΫε netflix prime amazon - సஔΠϯσοΫε = ࣙॻ + సஔϦετ
1 1 2 3 4 5 5 ࣙॻ సஔϦετ ϙεςΟϯάϦετ
సஔΠϯσοΫε netflix prime amazon - Word-level inverted list ͱݺΕɺ୯ޠ͕จॻͷԿ୯ޠ͔อଘ͢Δ͜ͱ͋Δ -
DocID;offset1,ofset2… 1;2 1;3 2;3 3;5 4;1 5;2 5;3 2;5
୯ޠͷҐஔใͳʹʹ͏? - ϑϨʔζΛ୳͢߹ - ʮAmazon Primeʯͱݕࡧͨ͠߹ - D1: “a prime
concern of Amazon” - D2: “Amazon Prime movies” - Ґஔใ͕͋Ε୯ޠͷॱংΛߟྀ͢Δ͜ͱ͕Ͱ͖ΔͷͰɺD2ͷΈΛώοτ͞ ͤΔ͜ͱ͕Ͱ͖ͨΓɺD2ͷείΞΛେ͖ͨ͘͠Γ͢Δ͜ͱ͕Ͱ͖Δ
ݕࡧ͢Δ
ANDݕࡧͱORݕࡧ -ΫΤϦ: “pink orange blue” -ANDݕࡧ: 3 -ORݕࡧ: 1,2,3,4,5,6 pink
Orange blue 6 3 4 5 2 1
φΠʔϒͳݕࡧઓུ - ϙεςΟϯάϦετΛࠪ͢Δํࣜ - TAAT(Term At A Time) - ϙεςΟϯάϦετΛ̍ͭͣͭॲཧ͢Δɻಉ࣌ʹ։͘ϙεςΟϯάϦετͷΧʔι
ϧ͚̍ͭͩɻ - ୯ޠ͝ͱʹࠪ͢Δ - DAAT(Document At A Time) - શ୯ޠͷϙεςΟϯάϦετΛಉ࣌ʹॲཧ͢ΔɻΫΤϦʹؚ·ΕΔ୯ޠͷϙε ςΟϯάϦετͷΧʔιϧΛͯ͢։͖ɼಉ࣌ʹਐΊ͍ͯ͘ɻ - υΩϡϝϯτ͝ͱʹࠪ͢Δ
TAATͰͷANDݕࡧ 1. ϙεςΟϯάϦετͷαΠζ͕࠷খͷͷ(prime)Λબ͠ΛɺAccumulator࡞ [2, 5] 2. amazonͷϦετΛࠪ ɾ2ؚ·Ε͍ͯΔ͔?5ؚ·Ε͍ͯΔ͔?ͷΈͷνΣοΫͰOK netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
TAATͰͷORݕࡧ 1. ͲͷΩʔͰྑ͍ͷͰAccumlatorΛ࡞ [1,2,5] (amazon) 2. ॏෳ͠ͳ͍શͯͷΩʔΛݟͯϚʔδ ✌(‘ω'✌ ) ݪ࢝త
( ✌'ω')✌ netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulated = [] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulater = [2] netflix
prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷANDݕࡧ - Amazon AND prime - Accumulate = [2, 5]
netflix prime amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͰͷORݕࡧ - ΧʔιϧΛಈ͔ͯ͠ɺશͯͷཁૉΛॏෳͳ͘AccumulatorʹՃ ✌(‘ω'✌ ) ݪ࢝త ( ✌'ω')✌ netflix prime
amazon 1;2 1;3 2;3 4;1 5;2 5;3 2;5
DAATͱTAAT - DAATͷϝϦοτ - DAATͷํ͕ɺলϝϞϦͰࡁΉ(ྫ: τοϓ10݅ݕࡧ) - DAATͷํ͕ɺΫΤϦ༻ޠ͕จॻͷಛఆͷ݅Λຬ͍ͨͯ͠Δ͔Ͳ͏͔Λ؆୯ʹࣝ ผͰ͖Δ(ྫ: ϑϨʔζݕࡧɺϑΟϧλϦϯά)
- ElasticsearchͰར༻͞Ε͍ͯΔLuceneDAATํࣜ - ORݕࡧݪ࢝తͳΈͰ͋ΓɺANDݕࡧΑΓଟ͘ͷυΩϡϝϯτΛࠪ͢ΔͨΊɺ ॏ͍ͨ - ݕࡧΤϯδϯORݕࡧʹ࠷దԽ͞Ε͍ͯΔ
ORݕࡧͷ ࠷దԽ४උ
Ͳ͏ߴԽ͢Δ͔ - DAATΛϕʔεʹվળ͢Δ - ݕࡧ݁Ռ্͕Ґ͚݅ͩඞཁͰ͋Δɺ্ҐʹདྷΔՄೳੑ͕ͳ͍จষͷධՁΛεΩοϓ ͢Δ͜ͱʹΑΓɺॲཧͷߴԽ͕Մೳ - ͍߹Θͤʹର্ͯ͠Ґk݅ͷΈΛऔΓग़͢͜ͱΛtop-k query processingͱݺͿ
จষͷϥϯΫ͚ - ্Ґk݅Λग़ྗ͢ΔͨΊʹɺΫΤϦʹରͯ͠ͲͷจষͷॱҐ͕ߴ͍͔Λܾఆ͢Δඞཁ͕͋ Δ - TF-IDF, Okapi BM25
సஔΠϯσοΫεͷ֦ு - సஔΠϯσοΫεʹରͯ͠ɺ֤୯ޠͷείΞ࠷େͱυΩϡϝϯτ͝ͱͷ୯ޠͷείΞ Λ༩͢Δ - ܗࣜ DocID;Score netflix prime amazon
1;5 1;3 2;1 3;2 4;1 5;2 5;1 2;3 max_score=5 max_score=3 max_score=2
Top-k Query ɾmax-score ɾinterval-based running
max-score - ্ҐkҐʹೖΔͨΊͷείΞ5ඞཁͱ͢Δͱɺ”amazon”,”prime”ͷmax_score5ະຬͳͷ Ͱɺ”amazon”ͷΈ”prime”ͷΈΛؚΉจষީิ͔Β֎ΕΔ netflix prime amazon max_score=5 max_score=3 max_score=2
1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - ্Ґ1݅Λऔಘ͍ͨ͠ - จॻ1: score6 - ඞͣ”netflix”ΛؚΉυΩϡϝϯτ͡Όͳ͍ͱμϝ netflix prime
amazon max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 5;1 2;3 5;2 5;1 3;3
max-score - “netflix”ΛؚΉจॻ5·ͰΧʔιϧΛඈ͢ - จॻ5είΞ4 - จॻ1,5ͷΈΛධՁ͢Δ͚ͩͰऴྃ netflix prime amazon
max_score=5 max_score=3 max_score=2 1;5 1;1 2;1 4;1 2;3 5;2 3;3 5;1 5;1
Interval-base - max_scoreΑΓ͞ΒʹεΩοϓͰ͖Δ - ͕͜͜ΒΜͰྗਚ͖ͨ࣌ؒͪ͠ΐ͏Ͳ͍͍ͩΖ͏…
LuceneͰ - Lucent 8Ͱmax-scoreͷൃలܗͰ͋ΔWAND͕ΘΕ͍ͯ·͢ - Ding, Shuai, and Torsten Suel.
"Faster top-k document retrieval using block-max indexes." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. 2011. - 2019/3/14ϦϦʔεͷLucene 8Ͱ্ͷΞϧΰϦζϜ͕࣮͞ΕΔͳͲࠓͰޮతͳΞϧ ΰϦζϜͷݚڀ͕ଓ͍͍ͯΔ APA
సஔΠϯσοΫεͷ ࣮
·ͱΊ - సஔΠϯσοΫε༷ʑͳϝλใΛՃ͢Δ͜ͱͰ֦ு͞ΕΔ(୯ޠͷΦϑηοτɺε ίΞɺϙΠϯλ) - సஔΠϯσοΫεʹରͯ͠ANDݕࡧɺORݕࡧ͢ΔࡍͷφΠʔϒͳํ๏ - TAAT: ୯ޠ͝ͱʹࠪ͢Δ -
DAAT: จॻ͝ͱʹࠪ͢Δ - DAATʹର͢ΔORݕࡧʹ࠷దԽ - max-score: ୯ޠ͝ͱͷ࠷େείΞͱݱࡏͷ࠷େείΞΛอ࣋͢Δ͜ͱͰࢬמΓΛߦ ͍୳ࡧΛεΩοϓ͢Δ - LuceneͰDAAT͕࠾༻͞Ε͓ͯΓɺݱࡏͰΞϧΰϦζϜ͕վળ͞Ε͍ͯΔ