A talk on adding multilingual search support with Elasticsearch (Elasticsearchで多言語検索対応してみた話.pdf)
motsat
July 19, 2018
Programming
Transcript
A talk on adding multilingual search support with Elasticsearch
Self-introduction
- Name: 佐藤 元紀
- At MedPeer since 2017/9, working as a web software engineer
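What this talk covers
- Searching across languages, such as searching English documents in Japanese
- What to do when translating the documents is not feasible
- One way to do this with Elasticsearch, and its advantages and disadvantages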
MedPeer, a website exclusively for physicians
Developing a new service
"Search PubMed papers efficiently in Japanese"
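What are PubMed papers?
A database of bibliographic information on medical literature from outside Japan
- In English (very rarely in another language)
- Available through an API and as files over FTP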
What we want to achieve
- Search English documents in Japanese
- Search targets: title and body
If we translate everything into Japanese beforehand, there should be no problem.
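APIs for translation
- Google Translation API: $20 per 1 million characters
  → Few unnatural renderings when translating PubMed text
- Microsoft Translator API: $10 (¥1,120) per 1 million characters
  → Cheap, but a fair number of unnatural renderings when translating PubMed text
- Amazon Translate: Japanese not supported (as of around autumn 2017)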
We put translation quality first and chose the Google Translation API, since it produced few unnatural renderings of PubMed text.
Translation cost
- Number of PubMed papers: 17 million (the set imported by MedPeer; growing daily)
- Average length: 1,300 characters (title 100, body 1,200)
  ↓
- Total: 22.1 billion characters = (100 + 1,200) * 17 million papers
Google Translation API: $20 per 1 million characters
- Cost: (22.1 billion / 1 million characters) * $20 = $442,000
  In Japanese yen: ¥48.89 million (as of 2017/07/07)
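The arithmetic on this slide can be checked with a short script. This is only a sketch: the paper count, character counts and price come from the slides, while `ASSUMED_JPY_PER_USD` is an assumption back-derived from the slide's own totals ($442,000 against roughly ¥48.89 million), not a rate the slides state.

```python
# Figures taken from the slides.
PAPERS = 17_000_000
TITLE_CHARS = 100
BODY_CHARS = 1_200
USD_PER_MILLION_CHARS = 20

# Assumption: rate inferred from the slide's totals, not stated in the talk.
ASSUMED_JPY_PER_USD = 110.6


def translation_cost_usd(papers: int, chars_per_paper: int) -> float:
    """Cost in USD of translating `papers` documents of the given average length."""
    total_chars = papers * chars_per_paper
    return total_chars / 1_000_000 * USD_PER_MILLION_CHARS


total_chars = PAPERS * (TITLE_CHARS + BODY_CHARS)
full_cost_usd = translation_cost_usd(PAPERS, TITLE_CHARS + BODY_CHARS)
full_cost_jpy = full_cost_usd * ASSUMED_JPY_PER_USD

print(f"total characters: {total_chars:,}")
print(f"cost in USD:      {full_cost_usd:,.0f}")
print(f"cost in 万円:      {full_cost_jpy / 10_000:,.0f}")
```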
¥48.89 million is expensive (by our company's standards)
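¥48.89 million is expensive
- Translating everything costs too much
- Even so, we do not want to reduce the number of papers
→ So how do we "search English documents in Japanese"?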
1. Make documents searchable even when the originals are not translated
→ Use a dictionary for Japanese-language search: Elasticsearch's "Synonym Token Filter"
Elasticsearch Synonym Token Filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
Synonym Token Filter
A feature for configuring synonyms and near-synonyms.
Example: given the definition
  i-pod, i pod => ipod
- searching for "i-pod" hits "i pod" and "ipod"
- searching for "i pod" hits "i-pod" and "ipod"
高血圧 => hypertension
インフルエンザ => influenza
Use rules like these to link Japanese terms to English terms.
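Rules in this form can be generated from a bilingual word list. The following is a minimal sketch, assuming the dictionary is available as (Japanese, English) pairs; `build_synonym_rules` is a hypothetical helper, and the talk does not say how the rules were actually produced.

```python
from typing import Iterable, List, Tuple


def build_synonym_rules(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (japanese, english) pairs into "ja => en" explicit-mapping rules.

    Entries that are empty, or that contain the rule syntax characters
    "," or "=>", are skipped rather than escaped (a simplification).
    """
    rules: List[str] = []
    seen = set()
    for japanese, english in pairs:
        japanese, english = japanese.strip(), english.strip()
        if not japanese or not english:
            continue
        if any(mark in term for term in (japanese, english) for mark in (",", "=>")):
            continue
        rule = f"{japanese} => {english}"
        if rule not in seen:  # drop exact duplicates, keep first-seen order
            seen.add(rule)
            rules.append(rule)
    return rules


sample_pairs = [
    ("高血圧", "hypertension"),
    ("インフルエンザ", "influenza"),
    ("高血圧", "hypertension"),  # duplicate entry
    ("", "orphan"),              # incomplete entry
]
rules = build_synonym_rules(sample_pairs)
print(rules)
```

Skipping entries that contain rule syntax keeps the sketch short; a real dictionary import would probably need to escape or normalize them instead.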
Settings for Elasticsearch: index properties (mapping)
- Prepare Japanese fields (title_ja/body_ja) and English fields (title_en/body_en), one of each for the title and the body
- Use different analyzers for the English fields and the Japanese fields
  pubmed: {
    properties: {
      title_en: { type: "text", analyzer: "english_analyzer" },
      title_ja: { type: "text", analyzer: "ja_analyzer" },
      body_en: { type: "text", analyzer: "english_analyzer" },
      body_ja: { type: "text", analyzer: "ja_analyzer" },
    },
  },
Settings for Elasticsearch: index analysis settings
- Add a filter with type: "synonym"
- Configure the filter list of "english_analyzer", the analyzer for the English fields, to use the synonym filter (the rest is based on the official English settings)
  {
    "index": {
      "analysis": {
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms": ["高血圧 => hypertension", …]
          }
        },
        "analyzer": {
          "english_analyzer": {
            "tokenizer": "standard",
            "filter": ["synonym", "english_possessive_stemmer", "lowercase", …]
          },
          …
        }
      }
    }
  }
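Put together, the two settings slides amount to a single index-creation body. The sketch below builds that body as a Python dict and prints it as JSON; it does not contact a cluster. Several parts are assumptions rather than the speaker's configuration: the filters after `lowercase` follow the rebuilt `english` analyzer described in the Elasticsearch reference, `ja_analyzer` is not shown in the slides and is assumed here to use `kuromoji_tokenizer` (which needs the analysis-kuromoji plugin), and the `pubmed` mapping type from the slide is omitted because newer Elasticsearch versions no longer use mapping types.

```python
import json
from typing import Dict, List


def build_index_body(synonym_rules: List[str]) -> Dict:
    """Build an index-creation body with synonym-aware English analysis."""
    analysis = {
        "filter": {
            # Name and type as on the slide.
            "synonym": {"type": "synonym", "synonyms": synonym_rules},
            # Assumption: definitions mirror the documented rebuilt `english` analyzer.
            "english_possessive_stemmer": {"type": "stemmer", "language": "possessive_english"},
            "english_stop": {"type": "stop", "stopwords": "_english_"},
            "english_stemmer": {"type": "stemmer", "language": "english"},
        },
        "analyzer": {
            "english_analyzer": {
                "tokenizer": "standard",
                # The synonym filter comes first, as on the slide.
                "filter": ["synonym", "english_possessive_stemmer",
                           "lowercase", "english_stop", "english_stemmer"],
            },
            # Assumption: the slides never show how ja_analyzer is defined.
            "ja_analyzer": {"type": "custom", "tokenizer": "kuromoji_tokenizer"},
        },
    }
    text_field = lambda analyzer: {"type": "text", "analyzer": analyzer}
    return {
        "settings": {"index": {"analysis": analysis}},
        "mappings": {
            "properties": {
                "title_en": text_field("english_analyzer"),
                "title_ja": text_field("ja_analyzer"),
                "body_en": text_field("english_analyzer"),
                "body_ja": text_field("ja_analyzer"),
            }
        },
    }


index_body = build_index_body(["高血圧 => hypertension", "インフルエンザ => influenza"])
print(json.dumps(index_body, ensure_ascii=False, indent=2))
```

Keeping the rules as a function argument lets the same body be rebuilt whenever the dictionary changes.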
Analyze result (Kibana)
"高血圧" becomes the token "hypertension"
(in practice, the token is further processed by the other filters in the preceding sample)
Japanese/English search is now possible with the keyword "高血圧"
- "高血圧" in Japanese text
- "hypertension" in English text
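The slides do not show the search request itself. One plausible shape, offered only as an assumption, is a `multi_match` query over all four fields: each field analyzes the keyword with its own analyzer, so the English fields see the synonym-mapped token while the Japanese fields see the Japanese tokens. `build_search_query` is a hypothetical helper name.

```python
from typing import Dict

# Field names as defined in the mapping slide.
SEARCH_FIELDS = ["title_ja", "body_ja", "title_en", "body_en"]


def build_search_query(keyword: str, size: int = 10) -> Dict:
    """Build a search body that looks for `keyword` in Japanese and English fields."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": SEARCH_FIELDS,
            }
        },
    }


query = build_search_query("高血圧")
print(query)
```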
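Where to get a dictionary
A well-known dictionary
- JMdict (Japanese-Multilingual Dictionary) http://www.edrdg.org/jmdict/j_jmdict.html
  It seems to contain languages other than English too, so configuring multiple languages may also be possible.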
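Will any dictionary do?
- Technical terms are hard
  For technical (medical) terms, a dictionary that is not specialized often fails to cover the vocabulary.
  → Look for a dictionary for the specialist domain, update the dictionary from search logs, and so on
MedPeer signed a contract with 株式会社ロゼッタ (Rozetta), which provided a dictionary specialized for medicine.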
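Multilingual search using a dictionary
Advantages
- Search works without translating everything
Disadvantages
- A dictionary has to be prepared
- Search precision depends heavily on the quality of the dictionary
  → There is a high chance the dictionary cannot cover every word in the text
- Strong at word-based search, weak at sentence search
- Search works, but results are displayed in English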
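2. Translate part in advance, the rest at display time
It depends on the user interface, but:
- Titles are translated in advance
  Waiting for the translation API is painful (titles are only skimmed in a list)
  Titles have fewer characters than bodies, so translating all of them is fairly cheap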
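2. Translate part in advance, the rest at display time (continued)
- Bodies are translated when the detail page is shown
  The detail page displays the translation result, which is processed asynchronously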
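Translating a body only when its detail page is first viewed could look roughly like the sketch below. Everything here is hypothetical: `LazyBodyTranslator`, the in-memory cache and `fake_translate` stand in for whatever storage and translation API a real system would use, and the talk does not say whether translated bodies are stored for reuse.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


class LazyBodyTranslator:
    """Translate a paper body the first time its detail page is shown."""

    def __init__(self, translate: Callable[[str], Awaitable[str]]) -> None:
        self._translate = translate
        self._cache: Dict[str, str] = {}  # assumption: results are kept for reuse
        self.api_calls = 0

    async def body_ja(self, paper_id: str, body_en: str) -> str:
        """Return the Japanese body, calling the translation API only on a cache miss."""
        if paper_id not in self._cache:
            self.api_calls += 1
            self._cache[paper_id] = await self._translate(body_en)
        return self._cache[paper_id]


async def fake_translate(text: str) -> str:
    """Stand-in for a real translation API call (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[ja] {text}"


async def demo() -> Tuple[str, str, int]:
    translator = LazyBodyTranslator(fake_translate)
    first = await translator.body_ja("pmid-1", "Hypertension is common.")
    second = await translator.body_ja("pmid-1", "Hypertension is common.")
    return first, second, translator.api_calls


demo_result = asyncio.run(demo())
print(demo_result)
```

Only the pages that users actually open trigger a paid API call, which is where the variable cost on the next slide comes from.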
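Cost
- Translation fee (fixed)
  We decided to translate only the titles in advance
  17 million papers * title only (100 characters) = ¥3.76 million (Google Translation API)
- Translation fee (variable)
  We decided to translate bodies at display time = α 万円 (units of ¥10,000; page views are not public so the figure cannot be shown, but only some pages are ever searched for and displayed, so this is quite cheap)
- Dictionary
  株式会社ロゼッタ (Rozetta), English-Japanese dictionary specialized for medicine = b 万円 (this figure cannot be disclosed, but it is provided at a very low price)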
Final cost
4,889 万円 (¥48.89 million) → 376 + α + b 万円
Fine (by our company's standards)
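The fixed part of the final figure follows from the same price list. In this sketch the exchange rate is again an assumption inferred from the slides' totals, and α and b are left out because the talk does not disclose them.

```python
PAPERS = 17_000_000
USD_PER_MILLION_CHARS = 20
ASSUMED_JPY_PER_USD = 110.6  # assumption inferred from the slides' totals


def cost_in_man_yen(chars_per_paper: int) -> float:
    """Translation cost in units of 10,000 yen for every paper at this length."""
    usd = PAPERS * chars_per_paper / 1_000_000 * USD_PER_MILLION_CHARS
    return usd * ASSUMED_JPY_PER_USD / 10_000


title_only = cost_in_man_yen(100)          # titles translated in advance
everything = cost_in_man_yen(100 + 1_200)  # titles and bodies translated in advance
fixed_share = title_only / everything

print(f"title only: {title_only:.0f} 万円")
print(f"everything: {everything:.0f} 万円")
print(f"fixed cost is {fixed_share:.1%} of the full-translation cost")
```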
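Supplement
For papers beyond the first 17 million (PubMed content grows daily), we want more medically specialized translation, so in addition to Google Translate we have contracts with the partners below and use their APIs.
- 株式会社シェアメディカル (Share Medical): translation API specialized for medicine
  https://www.ikotoba.jp/
- 株式会社ロゼッタ (Rozetta): translation API specialized for PubMed
  https://www.rozetta.jp/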
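Summary
- Used Elasticsearch's Synonym Token Filter to search documents written in a different language
- Translation fees can be reduced, but the approach has both advantages and disadvantages
- Easy to apply when search is mainly by single words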
Thank you for your attention.