
Similar Document Search in Elasticsearch and the More Like This Query API in Detail

Slides from a talk given at the Elasticsearch study group in Kyoto
(same content as the talk given at the Elasticsearch study group in Osaka)

Takuya Asano

July 14, 2015

Transcript

  1. Similar Document Search in Elasticsearch and the More Like This Query API in Detail
     Elasticsearch study group in Kyoto @ Hatena Kyoto office
     Takuya Asano, id:takuya-a, @takuya_a
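  2. id:takuya-a
     Platform & Ad Tech team
     • Joined the company in April 2015
     • Responsible for improving Hatena Bookmark full-text search
     Interests
     • Information retrieval
     • Natural language processing
     • Machine learning
     OSS activities
     • Develops JavaScript libraries such as kuromoji.js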
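  3. What is similar document search (More Like This)?
     Input: a document
     Output: a set of documents (similar documents)
     Example titles: "Elasticsearch 1.6.0 Released", "Elasticsearch 2.0.0.beta1 Release Imminent", "Elasticsearch CheatSheet", "[Translation] From Solr to elasticsearch", "elasticsearch Study Group | Doorkeeper"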
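  4. Overview of today's talk
     1. Similar document search with the More Like This Query API
        • The similar document search API that ships with Elasticsearch
        • A detailed look at the algorithm behind the More Like This Query API (how Elasticsearch computes similar documents)
     2. Similar document search with the Term Vectors API
        • An API that returns statistics about the terms contained in a document
        • Allows more flexible customization (you compute similarity yourself from the term statistics)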
  5. More Like This Query API
     https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-mlt-query.html
     Input: a document
     Output: a set of documents (similar documents)
     Example titles: "Elasticsearch 1.6.0 Released", "Elasticsearch 2.0.0.beta1 Release Imminent", "Elasticsearch CheatSheet", "[Translation] From Solr to elasticsearch", "elasticsearch Study Group | Doorkeeper"
     This is the API that makes this possible.
  6. More Like This Query API
     https://www.elastic.co/guide/en/elasticsearch/reference/1.x/query-dsl-mlt-query.html
     • Specify the fields
     • Specify the document IDs
     • Specify the parameters
     The names are confusingly similar, but the Search API's More Like This API is deprecated as of 1.6.0 and scheduled for removal in 2.0.
     https://www.elastic.co/guide/en/elasticsearch/reference/current/search-more-like-this.html
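The three things this slide says you specify (fields, document IDs, parameters) can be sketched as a request body. This is a minimal sketch, not the query shown in the talk: the parameter names `fields`, `ids`, `min_term_freq`, and `max_query_terms` are assumed from the 1.x More Like This Query reference, and the helper name, field name, and values are hypothetical.

```python
import json


def more_like_this_query(fields, doc_ids, min_term_freq=1, max_query_terms=12):
    """Build a More Like This query body (hypothetical helper).

    Assumes the 1.x parameter names `fields`, `ids`, `min_term_freq`
    and `max_query_terms`; the default values here are illustrative only.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),              # fields to take terms from
                "ids": list(doc_ids),                # documents to find neighbours of
                "min_term_freq": min_term_freq,      # ignore terms rarer than this in the input
                "max_query_terms": max_query_terms,  # how many important terms to keep
            }
        }
    }


body = more_like_this_query(["title"], ["1"])
print(json.dumps(body, indent=2))
```

The resulting body would be sent to a search endpoint of the index; which client or HTTP library does that is left open here.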
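  7. Theoretical background of similar document search - Bag of words model -
     • Treat a document as a set of words (word order is ignored)
     • The basic idea is the vector space model built on Bag of words (BOW)
     Documents that share many of the same terms are defined as similar documents.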
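  8. Example of similar documents under the Bag of Words model
     "Elasticsearch 1.6.0 Released" / "Elasticsearch 2.0.0.beta1 Release Imminent"
     Terms shown for the two documents: Elasticsearch / Elasticsearch, release / release, Lucene / Lucene, synced-flush, cluster / cluster, ... / ..., Pipeline
     Compute how many of the same terms the two documents contain.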
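Counting "how many of the same terms" two documents contain can be sketched with a multiset intersection. This is only an illustration of the Bag of Words idea; the function name and the two term lists (loosely based on the terms on this slide) are hypothetical.

```python
from collections import Counter


def shared_term_count(terms_a, terms_b):
    """Count the terms two bags of words have in common.

    Word order is ignored; a term repeated in both documents is counted
    as many times as it occurs in both (multiset intersection).
    """
    overlap = Counter(terms_a) & Counter(terms_b)
    return sum(overlap.values())


# Hypothetical term lists for the two example documents.
doc_160 = ["elasticsearch", "release", "lucene", "synced-flush", "cluster"]
doc_200 = ["elasticsearch", "release", "lucene", "cluster", "pipeline"]
print(shared_term_count(doc_160, doc_200))
```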
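  9. Similar document search - Extracting document features -
     A single document contains a large number of terms.
     • Computing similarity over every term and every document is not realistic
     • Select a handful of important terms and use them as the document's features
       (the Bag of Words similarity is approximated by computing it only over the important terms)
     "How to choose the important terms" is the key point of similar document search.
     Compute an importance score for each term from its statistics, then select the Top-K terms.
     (Examples of term scoring functions: TF-IDF, IDF, RIDF, Gain, and so on)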
  10. More Like This Query API algorithm - Fetching term statistics -
      Uses the IndexReader class
      • The class for reading Lucene's inverted index (also used for ordinary search)
      • Can also return term statistics
      • Pass a document ID to IndexReader#getTermVectors() to obtain the term vector
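  11. More Like This Query API algorithm - Processing flow -
      1. Fetch the statistics (TF, DF, etc.) of the terms contained in the document
      2. Compute a score for each term from the statistics (scoring with TF-IDF or similar)
      3. Sort the terms by score and extract the top terms (= important terms)
      4. Build an OR query from the important terms
      5. Search with that query and return the top documents as similar documents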
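  12. More Like This Query API algorithm - Term score calculation -
      More Like This uses TF-IDF as the term scoring function.
      1. Count TF from the term vector
      2. Use the TFIDFSimilarity class to compute IDF
      3. Use TF * IDF as the score of the term
      TF: term frequency f_{i,j} → higher for terms that occur many times in the document
      IDF: inverse of the document frequency n_i (N is the number of documents) → higher for rare terms
      f_{i,j}: number of occurrences (frequency) of the term in entry j = term_freq
      N: total number of entries in the index = doc_count
      n_i: number of entries in the whole index in which the term appears = doc_freq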
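The TF * IDF score described on this slide can be sketched as follows. The slide's own formula did not survive extraction, so the IDF variant used here, `1 + ln(N / (n_i + 1))`, is an assumption based on Lucene's classic TF-IDF similarity rather than a quote from the talk; the function names are hypothetical, and the argument names follow the slide's `term_freq`, `doc_freq`, and `doc_count`.

```python
import math


def idf(doc_freq, doc_count):
    """Inverse document frequency: rarer terms get a higher value.

    Assumed form: 1 + ln(N / (n_i + 1)), modelled on Lucene's classic
    TF-IDF similarity; other IDF variants would also fit the slide.
    """
    return 1.0 + math.log(doc_count / (doc_freq + 1.0))


def tf_idf(term_freq, doc_freq, doc_count):
    """Score of one term: raw term frequency times IDF."""
    return term_freq * idf(doc_freq, doc_count)


# A term occurring twice scores higher when it is rare in the index.
rare = tf_idf(term_freq=2, doc_freq=5, doc_count=1000)
common = tf_idf(term_freq=2, doc_freq=500, doc_count=1000)
print(rare > common)
```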
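  13. More Like This Query API algorithm - Processing flow -
      1. Fetch the statistics (TF, DF, etc.) of the terms contained in the document
      2. Compute a score for each term from the statistics (scoring with TF-IDF or similar)
      3. Sort the terms by score and extract the top terms (= important terms)
      4. Build an OR query from the important terms
      5. Search with that query and return the top documents as similar documents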
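  14. More Like This Query API algorithm - Sorting terms by score -
      The number of terms is fairly large.
      Holding every term in memory and sorting them all is expensive in both space and time.
      When only the Top-K is needed:
      keep just the top K items and sort while discarding the rest.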
  15. Priority queue (PriorityQueue)
      A data structure that finds the Top-K quickly
      • Included in the Java standard library
      • Often implemented as a binary heap
      • Elasticsearch uses the org.apache.lucene.util.PriorityQueue class from Lucene's util package
        https://github.com/apache/lucene-solr/blob/trunk/lucene/core/src/java/org/apache/lucene/util/PriorityQueue.java
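The "keep only the top K and discard the rest" idea can be sketched with a bounded binary min-heap. This uses Python's standard `heapq` module as a stand-in for Lucene's PriorityQueue, so it illustrates the technique rather than the actual Lucene code; the function name and sample scores are hypothetical.

```python
import heapq


def top_k_terms(scored_terms, k):
    """Return the k highest-scoring terms, best first.

    Keeps at most k items in a min-heap, so memory stays O(k) and each
    of the n terms costs O(log k) instead of sorting all n terms.
    """
    if k <= 0:
        return []
    heap = []  # min-heap of (score, term); heap[0] is the weakest kept term
    for term, score in scored_terms:
        if len(heap) < k:
            heapq.heappush(heap, (score, term))
        elif score > heap[0][0]:
            # The new term beats the weakest kept term: swap them.
            heapq.heapreplace(heap, (score, term))
    return [term for _, term in sorted(heap, reverse=True)]


# Hypothetical term scores.
scores = [("elasticsearch", 5.0), ("release", 2.0), ("lucene", 4.0),
          ("the", 0.1), ("cluster", 3.0)]
print(top_k_terms(scores, 3))
```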
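  16. More Like This Query API algorithm - Processing flow -
      1. Fetch the statistics (TF, DF, etc.) of the terms contained in the document
      2. Compute a score for each term from the statistics (scoring with TF-IDF or similar)
      3. Sort the terms by score and extract the top terms (= important terms)
      4. Build an OR query from the important terms
      5. Search with that query and return the top documents as similar documents
      (omitted)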
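  17. More Like This Query API algorithm - Summary -
      1. Fetch the statistics (TF, DF, etc.) of the terms contained in the document (retrieveTerms())
         Read the term vectors (term statistics) with the IndexReader method getTermVectors(docNum)
      2. Compute a score for each term from the statistics (scoring with TF-IDF)
         1. Count TF from the term vector
         2. Pass DF and numDocs (the total number of documents) to the TFIDFSimilarity class to compute IDF
         3. Compute TF * IDF (this is the term's score)
      3. Sort the terms by score and extract the top terms (= important terms)
         Select the Top-K terms (important terms) with the PriorityQueue class
      4. Build an OR query from the important terms
      5. Search with that query and return the top documents as similar documents
      All of this is implemented in the XMoreLikeThis class.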
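  18. Similar document search algorithm using the Term Vectors API
      1. Fetch the statistics (TF, DF, etc.) of the terms contained in the document
         • Term Vectors API
      2. Compute a score for each term from the statistics (scoring with TF-IDF or similar)
         • Compute it from the statistics with your own scoring function
      3. Sort the terms by score and extract the top terms (= important terms)
         • If the language you want to write in has no library for this, implement it yourself
      4. Build an OR query from the important terms
         • List the terms (important terms) in the should clause of a Boolean Query
      5. Search with that query and return the top documents as similar documents
         • Send the Boolean Query to Elasticsearch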
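Steps 2 to 4 of this client-side approach can be sketched as below. It assumes a Term Vectors API response, requested with term statistics enabled, shaped as `term_vectors.<field>.field_statistics.doc_count` plus `terms.<term>.term_freq` and `doc_freq`. The sample response, the field name `title`, the function names, and the IDF form `1 + ln(N / (n_i + 1))` are all illustrative assumptions; fetching the response and sending the query are left out.

```python
import heapq
import math


def important_terms(term_vectors_response, field, k):
    """Score every term of one field with TF * IDF and keep the top k."""
    field_data = term_vectors_response["term_vectors"][field]
    doc_count = field_data["field_statistics"]["doc_count"]
    scored = []
    for term, stats in field_data["terms"].items():
        # Assumed IDF form; swap in any custom scoring function here.
        idf = 1.0 + math.log(doc_count / (stats["doc_freq"] + 1.0))
        scored.append((stats["term_freq"] * idf, term))
    return [term for _, term in heapq.nlargest(k, scored)]


def build_should_query(field, terms):
    """OR the important terms together in the should clause of a bool query."""
    return {"query": {"bool": {"should": [{"term": {field: t}} for t in terms]}}}


# Hypothetical response fragment, not real index data.
sample_response = {
    "term_vectors": {
        "title": {
            "field_statistics": {"doc_count": 1000},
            "terms": {
                "elasticsearch": {"term_freq": 3, "doc_freq": 50},
                "release": {"term_freq": 2, "doc_freq": 200},
                "the": {"term_freq": 4, "doc_freq": 999},
                "pipeline": {"term_freq": 1, "doc_freq": 5},
            },
        }
    }
}

terms = important_terms(sample_response, "title", 2)
query = build_should_query("title", terms)
print(query)
```

Because the scoring lives in application code, the TF-IDF line can be replaced by any other function of the statistics, which is the flexibility this approach trades for extra network and CPU cost.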
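  19. Summary
      Two ways of doing similar document search were introduced.
      1. Similar document search with the More Like This Query API
         • Built in from the start and easy to use through the API alone
         • The query expansion is computed entirely inside Elasticsearch (no unnecessary IO)
         • Tuning is limited to adjusting a few parameters
      2. Similar document search with the Term Vectors API
         • Allows more flexible query expansion (easy to combine with natural language processing or machine learning)
         • The response returned by the Term Vectors API is large (high network load)
         • The application server has to process a large number of terms (high CPU load)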