
Elasticsearch 2.x


Trying out Japanese search with Elasticsearch 2.x


tsuyoshi nakamura

August 31, 2016


Transcript

  1. Install through config
    •  Install Java with yum
    •  For Elasticsearch, import the key from the official repositories
    •  Configure yum, then yum install pulls in the latest version (2.1.1)
    •  Install the plugin needed for Japanese full-text search
    Kuromoji plugin install
        bin/plugin install analysis-kuromoji
    ※ https://www.elastic.co/guide/en/elasticsearch/plugins/master/analysis-kuromoji.html
    ※ https://github.com/elastic/elasticsearch-analysis-kuromoji
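  2. Analysis module
    Tokenizer
    •  Sets the tokenization method
    •  For example, tokenizing with Kuromoji
    •  Or tokenizing n-gram style
    Token Filters
    •  Post-process the tokens produced by tokenization
    •  For example, converting full-width alphanumerics to half-width and half-width katakana to full-width
    Char Filters
    •  Pre-process the characters before tokenization
    •  Used to remove symbols, 「々」, and the like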
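The ordering described on this slide (char filters, then the tokenizer, then token filters) can be sketched in plain Python. This is only an illustration of the pipeline order, not Elasticsearch code: `analyze`, `strip_symbols`, `whitespace_tokenizer`, and `lowercase` are hypothetical stand-ins, and the whitespace tokenizer is used in place of Kuromoji purely to keep the example self-contained.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Apply char filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)          # works on raw characters
    tokens = tokenizer(text)              # splits the text into tokens
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # works on the token list
    return tokens


# Hypothetical stand-ins for real Elasticsearch components.
def strip_symbols(text: str) -> str:
    # Replace anything that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


tokens = analyze("Hello, Elasticsearch!",
                 [strip_symbols], whitespace_tokenizer, [lowercase])
print(tokens)
```

The point of the sketch is only that a char filter sees the whole string, while a token filter sees tokens one by one after splitting.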
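  3. Main modules
    Ngram Tokenizer
    •  Tokenizes into N-grams. Built into Elasticsearch
    cjk_width Token Filter
    •  Filter that normalizes half-width and full-width forms. Built into Elasticsearch
    Lowercase Token Filter
    •  Filter that normalizes upper and lower case in alphabetic characters. Built into Elasticsearch
    Synonym Token Filter
    •  Filter that links synonyms together. Built into Elasticsearch
    Stop Token Filter
    •  Filter that removes arbitrary words. Built into Elasticsearch
    HTML Strip Char Filter
    •  Filter that removes HTML tags. Built into Elasticsearch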
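To make N-gram tokenization and width normalization concrete, here is a small Python sketch. It is an approximation, not the Elasticsearch implementation: `ngrams` and `normalize_width` are hypothetical helper names, and Unicode NFKC normalization is swapped in for cjk_width (NFKC folds more characters than cjk_width does). The defaults of 2 and 3 are chosen to mirror min_gram and max_gram in the config on the next slide.

```python
import unicodedata
from typing import List


def normalize_width(text: str) -> str:
    """Approximate cjk_width: full-width ASCII to half-width,
    half-width katakana to full-width (NFKC is broader than cjk_width)."""
    return unicodedata.normalize("NFKC", text)


def ngrams(text: str, min_gram: int = 2, max_gram: int = 3) -> List[str]:
    """Emit every substring of length min_gram..max_gram,
    ordered by start position and then by length."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(normalize_width("ＥＳ２ｶﾅ"))  # full-width letters/digit, half-width katakana
print(ngrams("東京都"))
```

Because every 2- and 3-character window is indexed, an N-gram field can match substrings that a dictionary-based tokenizer splits differently, at the cost of a larger index.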
  4. Config created this time (elasticsearch.yml)
    # ---------------------------------- Index -----------------------------------
    index:
      analysis:
        analyzer:
          ja:
            type: custom
            tokenizer: ja_tokenizer
            char_filter: [html_strip, kuromoji_iteration_mark]
            filter: [lowercase, cjk_width, katakana_stemmer, kuromoji_part_of_speech]
          ja_ngram:
            type: custom
            tokenizer: ngram_ja_tokenizer
            char_filter: [html_strip]
            filter: [cjk_width, lowercase]
        tokenizer:
          ja_tokenizer:
            type: kuromoji_tokenizer
            mode: search
            user_dictionary: /etc/elasticsearch/userdict_ja.txt
          ngram_ja_tokenizer:
            type: nGram
            min_gram: 2
            max_gram: 3
            token_chars: [letter, digit]
        filter:
          katakana_stemmer:
            type: kuromoji_stemmer
  5. Index mapping created this time
    {
      "order": 0,
      "template": "projects01-*",
      "settings": {
        "index": {
          "number_of_shards": "1",
          "number_of_replicas": "0"
        }
      },
      "mappings": {
        "project": {
          "_source": { "enabled": false },
          "_all": { "analyzer": "ja", "enabled": true },
          "properties": {
            "update_time": { "format": "YYYY-MM-dd HH:mm:ss", "type": "date" },
            "project_id": { "index": "not_analyzed", "type": "string" },
            "detail": { "analyzer": "ja", "type": "string" },
            "suggest": { "search_analyzer": "ja", "analyzer": "ja", "type": "completion" },
            "detail_ngram": { "analyzer": "ja_ngram", "type": "string" },
            "title": { "analyzer": "ja", "type": "string" },
            "title_ngram": { "analyzer": "ja_ngram", "type": "string" }
          }
        }
      },
      "aliases": { }
    }
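The deck does not show a query, so the following is only a guess at how the paired fields (title / title_ngram, detail / detail_ngram) might be searched together. It builds a multi_match request body in Python without sending it anywhere; `build_search_body` and the boost values are assumptions made for illustration.

```python
import json
from typing import Any, Dict


def build_search_body(keywords: str, size: int = 10) -> Dict[str, Any]:
    """Build a search body that queries the Kuromoji-analyzed fields and
    their n-gram counterparts together (boost values are illustrative)."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keywords,
                "type": "most_fields",
                # Kuromoji fields are weighted above the n-gram fields (assumed).
                "fields": ["title^3", "detail^2", "title_ngram", "detail_ngram"],
            }
        },
    }


body = build_search_body("検索", size=5)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Since the template disables _source, hits from such a query would not carry the original document body, so the caller would rely on the hit's _id to look the project up elsewhere.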
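  6. Areas that still need investigation
    •  Index operations and update flow
    •  Python curator
    •  Score
    •  Requirements such as "for the same search, rank projects currently in progress higher in the results" are likely to come up
    •  Thresholds for things like slow queries
    •  ES_HEAP_SIZE, swap
    •  cluster, shard, replica
    •  Index backup and restore
    •  Python-based tool, _snapshot, binary backup
    •  Facets? Can Aggregations cover this?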
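For the scoring item above, one option worth investigating is wrapping the search in a function_score query so that in-progress projects get a weight multiplier. The sketch below only builds the request body. The mapping in this deck has no such field, so the `status` field and its `in_progress` value are hypothetical, as are the `boost_in_progress` name and the default weight.

```python
from typing import Any, Dict


def boost_in_progress(base_query: Dict[str, Any], weight: float = 2.0) -> Dict[str, Any]:
    """Wrap base_query so that documents matching a (hypothetical)
    status == "in_progress" filter have their score multiplied by weight."""
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "functions": [
                    {
                        # "status" is an assumed not_analyzed field, not in the deck's mapping.
                        "filter": {"term": {"status": "in_progress"}},
                        "weight": weight,
                    }
                ],
                "boost_mode": "multiply",
            }
        }
    }


wrapped = boost_in_progress({"match": {"title": "検索"}}, weight=1.5)
print(wrapped)
```

Documents that do not match the filter keep their original relevance score, so the text match still decides ordering within each group.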