
Elasticsearch 2.x


Trying Japanese search with Elasticsearch 2.x

tsuyoshi nakamura

August 31, 2016


Transcript

  1. Install through config
     •  Java: installed with yum
     •  Elasticsearch: import the key from the official repositories
     •  Configure the yum repository; yum install then pulls in the latest version (2.1.1)
     •  Install the plugin needed for Japanese full-text search
     Kuromoji plugin install
         bin/plugin install analysis-kuromoji
     ※ https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-kuromoji.html
     ※ https://github.com/elastic/elasticsearch-analysis-kuromoji
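  2. Analysis modules
     Tokenizer
     •  Sets how text is split into tokens
     •  For example, tokenize with Kuromoji
     •  Or tokenize into n-grams
     Token Filters
     •  Post-process the tokens produced by the tokenizer
     •  For example, convert full-width alphanumerics to half-width and half-width katakana to full-width
     Char Filters
     •  Pre-process the characters before tokenization
     •  Used to remove symbols, or to deal with the iteration mark 「々」 (the kuromoji_iteration_mark filter normalizes it to the repeated character)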
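The order of the three stages above (char filter, then tokenizer, then token filter) can be sketched in plain Python. This is only an illustration of the ordering, not Elasticsearch code: the function names are made up for this example, a whitespace split stands in for Kuromoji or n-gram tokenization, and NFKC normalization is used as a rough stand-in for width folding.

```python
import re
import unicodedata


def char_filter(text):
    """Char filter stage: runs on the raw text before tokenizing (here: drop HTML tags)."""
    return re.sub(r"<[^>]+>", "", text)


def tokenizer(text):
    """Tokenizer stage: a whitespace split standing in for Kuromoji or n-gram."""
    return text.split()


def token_filter(tokens):
    """Token filter stage: runs on each token after tokenizing.

    NFKC folds full-width alphanumerics to half-width and half-width
    katakana to full-width; lower() unifies letter case.
    """
    return [unicodedata.normalize("NFKC", t).lower() for t in tokens]


def analyze(text):
    """Apply the stages in the order an analyzer does."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<b>ＥＳ２</b> ｶﾀｶﾅ"))
```

Swapping the order (for example, tokenizing before stripping tags) would leave tag fragments inside tokens, which is why the char filter comes first.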
  3. Main modules
     Ngram Tokenizer
     •  Tokenizes into n-grams. Built into Elasticsearch
     cjk_width Token Filter
     •  Unifies half-width and full-width forms. Built into Elasticsearch
     Lowercase Token Filter
     •  Unifies upper- and lower-case Latin letters. Built into Elasticsearch
     Synonym Token Filter
     •  Links synonyms together. Built into Elasticsearch
     Stop Token Filter
     •  Removes arbitrary words. Built into Elasticsearch
     HTML Strip Char Filter
     •  Removes HTML tags. Built into Elasticsearch
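To make the n-gram idea concrete, here is a small Python sketch of what an n-gram tokenizer with a minimum length of 2 and a maximum of 3 produces. It is an approximation written for this example: `ngram_tokens` is a hypothetical helper, `str.isalnum` only approximates the letter and digit character classes, and the emission order (by start offset, then by length) is an assumption rather than a guarantee about Elasticsearch.

```python
import itertools


def ngram_tokens(text, min_gram=2, max_gram=3):
    """Emit n-grams from each run of letters/digits, by start offset, then length."""
    tokens = []
    # Any character that is not a letter or digit splits the input into runs,
    # so no gram ever spans punctuation or whitespace.
    runs = (
        "".join(group)
        for is_word, group in itertools.groupby(text, key=str.isalnum)
        if is_word
    )
    for run in runs:
        for start in range(len(run)):
            for n in range(min_gram, max_gram + 1):
                if start + n <= len(run):
                    tokens.append(run[start:start + n])
    return tokens


print(ngram_tokens("東京都"))
print(ngram_tokens("検索、ES2"))
```

Because every 2- and 3-character window is indexed, a query such as 京都 matches inside 東京都, which a dictionary-based tokenizer would normally keep apart. That is the usual trade-off between the two approaches: n-grams favor recall, morphological analysis favors precision.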
  4. The config created this time (elasticsearch.yml)
     # ---------------------------------- Index -----------------------------------
     index:
       analysis:
         analyzer:
           ja:
             type: custom
             tokenizer: ja_tokenizer
             char_filter: [html_strip, kuromoji_iteration_mark]
             filter: [lowercase, cjk_width, katakana_stemmer, kuromoji_part_of_speech]
           ja_ngram:
             type: custom
             tokenizer: ngram_ja_tokenizer
             char_filter: [html_strip]
             filter: [cjk_width, lowercase]
         tokenizer:
           ja_tokenizer:
             type: kuromoji_tokenizer
             mode: search
             user_dictionary: /etc/elasticsearch/userdict_ja.txt
           ngram_ja_tokenizer:
             type: nGram
             min_gram: 2
             max_gram: 3
             token_chars: [letter, digit]
         filter:
           katakana_stemmer:
             type: kuromoji_stemmer
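One way to check what the ja and ja_ngram analyzers actually emit is the _analyze API. The sketch below only builds the HTTP request with the Python standard library and does not send it. The host, port, and index name (projects01-test) are assumptions made for this example, and the helper name is hypothetical.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ja",
                          base_url="http://localhost:9200",  # assumed local node
                          index="projects01-test"):          # hypothetical index name
    """Build (but do not send) an _analyze request for the given analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_analyze_request("東京都の検索", analyzer="ja_ngram")
print(req.get_method(), req.full_url)
# To run it against a live node: urllib.request.urlopen(req).read()
```

Comparing the token lists returned for ja and ja_ngram on the same input is a quick way to see how the two analyzers differ before indexing real data.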
  5. The index mapping created this time
     {
       "order": 0,
       "template": "projects01-*",
       "settings": {
         "index": {
           "number_of_shards": "1",
           "number_of_replicas": "0"
         }
       },
       "mappings": {
         "project": {
           "_source": { "enabled": false },
           "_all": { "analyzer": "ja", "enabled": true },
           "properties": {
             "update_time": { "format": "YYYY-MM-dd HH:mm:ss", "type": "date" },
             "project_id": { "index": "not_analyzed", "type": "string" },
             "detail": { "analyzer": "ja", "type": "string" },
             "suggest": { "search_analyzer": "ja", "analyzer": "ja", "type": "completion" },
             "detail_ngram": { "analyzer": "ja_ngram", "type": "string" },
             "title": { "analyzer": "ja", "type": "string" },
             "title_ngram": { "analyzer": "ja_ngram", "type": "string" }
           }
         }
       },
       "aliases": { }
     }
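The mapping indexes title and detail twice, once with the ja analyzer and once with ja_ngram. A possible way to use both at query time is a multi_match query over all four fields, sketched below as a Python dict. The boost values and the most_fields type are assumptions chosen for illustration, not settings taken from the slides.

```python
import json


def build_search_body(keyword, size=10):
    """Build a search body that queries the ja and ja_ngram fields together."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keyword,
                # most_fields adds up the scores of every matching field, so a
                # document matching both the morphological and n-gram fields ranks higher.
                "type": "most_fields",
                # Boosts are illustrative: morphological matches weigh more than n-gram ones.
                "fields": ["title^3", "detail^2", "title_ngram", "detail_ngram"],
            }
        },
    }


print(json.dumps(build_search_body("検索"), ensure_ascii=False, indent=2))
```

Since _source is disabled in this mapping, hits come back without the original document body, so the application would look the record up elsewhere using the returned document ID.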
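  6. Areas that still need investigation
     •  Index operations and update flow
     •  Python curator
     •  Score
     •  Requirements such as "for the same search, rank currently active projects higher in the results" are likely to come up
     •  Thresholds for slow queries and similar settings
     •  ES_HEAP_SIZE, swap
     •  cluster, shard, replica
     •  Index backup and restore
     •  Python-based tools, _snapshot, binary backup
     •  Facet? Can Aggregations cover it?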
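For the scoring requirement (ranking currently active projects higher), one candidate worth investigating is the function_score query. The sketch below wraps an existing query in a function_score with a filter-based weight. It assumes a boolean field named in_progress, which does not exist in the mapping shown in this deck, and the weight of 2 is arbitrary.

```python
def boost_active_projects(base_query, weight=2.0):
    """Wrap a query so documents matching the filter get their score multiplied."""
    return {
        "function_score": {
            "query": base_query,
            "functions": [
                {
                    # Hypothetical field: the mapping in this deck has no in_progress flag.
                    "filter": {"term": {"in_progress": True}},
                    "weight": weight,
                }
            ],
            # Multiply the original relevance score by the function result.
            "boost_mode": "multiply",
        }
    }


query = boost_active_projects({"match": {"title": "検索"}})
print(query)
```

With this shape, relevance still decides the order among active projects and among inactive ones, while active projects get a uniform lift over otherwise equal matches.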