Elasticsearch 2.x
tsuyoshi nakamura
August 31, 2016
Trying out Japanese search with Elasticsearch 2.x
Transcript
Practical Elasticsearch (2.1.1), 2016-01-22, in-house study session, Tsuyoshi Nakamura
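Full-text search engines have gone through a varied history, but these days "Elasticsearch" seems to be the name that comes up first when people talk about full-text search, and it has appeared on AWS ...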
Agenda
• From install to config settings
• Kuromoji
• Analysis module
• Main modules
• Demo
• Remaining items to investigate
Install to config
• Install Java with yum
• Import the key from the official Elasticsearch repositories
• Configure yum, then yum install pulls in the latest version (2.1.1)
• Install the plugin needed for Japanese full-text search
Kuromoji plugin install
    bin/plugin install analysis-kuromoji
※ https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-kuromoji.html
※ https://github.com/elastic/elasticsearch-analysis-kuromoji
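Kuromoji
• What does the name (Kuromoji?) mean? The kuromoji (黒文字) that toothpicks are made from?
• Not sure, but it is extremely handy
• Back when we used Solr, we had to install a Japanese dictionary (MeCab, ChaSen, and so on) ourselves, which was a hassle in all sorts of ways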
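Analysis module
Analyzer
• Multiple analyzers can be defined
• Tokenization (morphological analysis)
• Filtering
This processing runs both when an index is built and when a search is executed.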
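Analysis module
Tokenizer
• Sets the tokenization method
• For example, tokenize with Kuromoji
• Or tokenize into n-grams
Token Filters
• Apply additional processing to the tokens produced by tokenization
• For example, convert full-width Latin characters to half-width and half-width katakana to full-width
Char Filters
• Apply additional processing to the text before tokenization
• Used, for example, to strip symbols or to normalize iteration marks such as 「々」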
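The order described above (char filters, then one tokenizer, then token filters) can be sketched in plain Python. This is only an illustration of the data flow, not how Elasticsearch implements it: the function names are made up for this example, and a whitespace split stands in for Kuromoji, which does real morphological analysis.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def strip_html(text: str) -> str:
    """Char filter: runs on the raw text, before tokenization."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer stand-in (assumption): Kuromoji would segment Japanese here."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: runs on the tokens, after tokenization."""
    return [t.lower() for t in tokens]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    # 1. char filters rewrite the text
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. exactly one tokenizer splits it into tokens
    tokens = tokenizer(text)
    # 3. token filters rewrite the token stream, in order
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("<p>Hello <b>Elasticsearch</b></p>",
              [strip_html], whitespace_tokenizer, [lowercase]))
```

The same chain runs at index time and at search time, which is why a query only matches when both sides are normalized the same way.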
Main modules
Ngram Tokenizer
• Tokenizes into N-grams. Built into Elasticsearch
cjk_width Token Filter
• Unifies half-width and full-width forms. Built into Elasticsearch
Lowercase Token Filter
• Unifies upper- and lowercase Latin letters. Built into Elasticsearch
Synonym Token Filter
• Links synonyms together. Built into Elasticsearch
Stop Token Filter
• Removes words of your choosing. Built into Elasticsearch
HTML Strip Char Filter
• Strips HTML tags. Built into Elasticsearch
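To make the effect of three of these filters concrete, here is a rough Python approximation. It rests on assumptions: Unicode NFKC normalization is used as a stand-in for cjk_width (NFKC folds more than width variants), the synonym filter is simplified to a one-to-one mapping onto a canonical term, and the function names and sample words are invented for the example.

```python
import unicodedata
from typing import Dict, List, Set


def cjk_width_like(tokens: List[str]) -> List[str]:
    """Fold full-width Latin to half-width and half-width katakana to full-width.

    Assumption: NFKC is a superset of what the real cjk_width filter does.
    """
    return [unicodedata.normalize("NFKC", t) for t in tokens]


def stop_filter(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def synonym_filter(tokens: List[str], synonyms: Dict[str, str]) -> List[str]:
    """Map each token to its canonical form (simplified synonym handling)."""
    return [synonyms.get(t, t) for t in tokens]


tokens = ["ＥＳ", "の", "ｻｰﾁ"]
tokens = cjk_width_like(tokens)
tokens = stop_filter(tokens, {"の"})
tokens = synonym_filter(tokens, {"ES": "Elasticsearch"})
print(tokens)
```

Order matters here: the synonym lookup for "ES" only succeeds because width folding ran first.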
The config created for this talk (elasticsearch.yml)
    # ---------------------------------- Index -----------------------------------
    index:
      analysis:
        analyzer:
          ja:
            type: custom
            tokenizer: ja_tokenizer
            char_filter: [html_strip, kuromoji_iteration_mark]
            filter: [lowercase, cjk_width, katakana_stemmer, kuromoji_part_of_speech]
          ja_ngram:
            type: custom
            tokenizer: ngram_ja_tokenizer
            char_filter: [html_strip]
            filter: [cjk_width, lowercase]
        tokenizer:
          ja_tokenizer:
            type: kuromoji_tokenizer
            mode: search
            user_dictionary: /etc/elasticsearch/userdict_ja.txt
          ngram_ja_tokenizer:
            type: nGram
            min_gram: 2
            max_gram: 3
            token_chars: [letter, digit]
        filter:
          katakana_stemmer:
            type: kuromoji_stemmer
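As a rough picture of what the ngram_ja_tokenizer settings (min_gram 2, max_gram 3, token_chars letter and digit) produce, the sketch below splits text on anything that is not a letter or digit and emits every 2- and 3-character gram from each run. It is an approximation written for this note: Python's str.isalpha and str.isdigit stand in for Lucene's character classes, and the emission order is this sketch's own, not necessarily the tokenizer's.

```python
from typing import List


def ngram_tokens(text: str, min_gram: int = 2, max_gram: int = 3) -> List[str]:
    """Emit all grams of length min_gram..max_gram from letter/digit runs."""
    # Split the text into runs of letters and digits (token_chars).
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isalpha() or ch.isdigit():
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))

    # Slide a window of each allowed size over every run.
    tokens: List[str] = []
    for run in runs:
        for start in range(len(run)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(run):
                    tokens.append(run[start:start + size])
    return tokens


print(ngram_tokens("全文検索"))
```

Runs shorter than min_gram yield no tokens at all, so a single-character query would not match a field analyzed this way.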
The index mapping created for this talk
    {
      "order": 0,
      "template": "projects01-*",
      "settings": {
        "index": {
          "number_of_shards": "1",
          "number_of_replicas": "0"
        }
      },
      "mappings": {
        "project": {
          "_source": { "enabled": false },
          "_all": { "analyzer": "ja", "enabled": true },
          "properties": {
            "update_time": { "format": "YYYY-MM-dd HH:mm:ss", "type": "date" },
            "project_id": { "index": "not_analyzed", "type": "string" },
            "detail": { "analyzer": "ja", "type": "string" },
            "suggest": { "search_analyzer": "ja", "analyzer": "ja", "type": "completion" },
            "detail_ngram": { "analyzer": "ja_ngram", "type": "string" },
            "title": { "analyzer": "ja", "type": "string" },
            "title_ngram": { "analyzer": "ja_ngram", "type": "string" }
          }
        }
      },
      "aliases": { }
    }
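One way such a mapping might be queried is to search the Kuromoji-analyzed fields and their n-gram counterparts together, so that morphological matches and partial matches both contribute. The sketch below only builds the request body. The helper name, the choice of multi_match with most_fields, and the title^2 boost are assumptions made for illustration; the slides do not show the actual query used.

```python
import json
from typing import Any, Dict, List


def build_search_body(text: str, size: int = 10) -> Dict[str, Any]:
    """Build a search body over the analyzed and n-gram fields (hypothetical)."""
    # Field names come from the mapping; the boost on title is an assumption.
    fields: List[str] = ["title^2", "detail", "title_ngram", "detail_ngram"]
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": fields,
            }
        },
    }


print(json.dumps(build_search_body("全文検索"), ensure_ascii=False, indent=2))
```

The printed JSON would be sent as the body of a _search request against an index matching projects01-*.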
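Demo
• Look at the Elasticsearch admin tool (kopf)
• It has a lot of features.
• The mapping and so on were registered here
• Handy, and it looks good
• Try creating an index
• Try searching
• Try the suggest feature (completion) as well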
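For the completion suggester step, a request body might look like the sketch below, which targets the suggest field from the mapping. It assumes the body shape used by the _suggest endpoint in the 2.x line; the helper name and the suggestion name project-suggest are hypothetical, and later Elasticsearch versions changed this API.

```python
from typing import Any, Dict


def build_suggest_body(
    prefix: str, name: str = "project-suggest", size: int = 5
) -> Dict[str, Any]:
    """Build a completion-suggest body for the 'suggest' field (hypothetical)."""
    return {
        name: {
            "text": prefix,  # what the user has typed so far
            "completion": {
                "field": "suggest",  # completion field from the mapping
                "size": size,        # maximum number of suggestions
            },
        }
    }


print(build_suggest_body("ela"))
```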
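Areas that still need investigation
• Index operations and the update flow
• Python curator
• Score
• Requirements are likely to come up, such as ranking currently active projects higher in the results for the same search
• Thresholds for things like slow queries
• ES_HEAP_SIZE, swap
• cluster, shard, replica
• Index backup and restore
• Python tools, _snapshot, binary backup
• Facets? Can Aggregations cover this?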