Improving Search Result Quality

Materials from the "Bootcamp Special Lecture" held in April and May 2017.

Recruit Technologies

June 02, 2017

Transcript

  1. Elasticsearch AP

  2. • • • • Term • index • • •

  4. The Art of Computer Programming, Volume 3: Sorting and Searching

    • • •
  6. Introduction to Information Retrieval Modern Information Retrieval

  7. Introduction to Information Retrieval Modern Information Retrieval : CS 276

    Information Retrieval and Web Search http://web.stanford.edu/class/cs276/ …
  8. Java $ grep index index

  9. Index • grep • Boyer-Moore • • N-gram •

  10. Java $ grep index index Index

  14. Index

  15. Index Indexing

  16. Index Search (AND )

  17. Index Search (OR )
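
The indexing and AND/OR search steps named on the preceding slides can be sketched roughly as follows. This is a minimal Python illustration, not the deck's own code: the sample documents, the whitespace tokenization, and the function names `build_index`, `search_and`, and `search_or` are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace tokenization is used only to keep the sketch short.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_and(index, terms):
    """Return documents that contain every term (intersection of posting lists)."""
    lists = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*lists)) if lists else []


def search_or(index, terms):
    """Return documents that contain at least one term (union of posting lists)."""
    result = set()
    for term in terms:
        result.update(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents.
docs = {
    1: "java index search",
    2: "grep index",
    3: "java grep",
}
index = build_index(docs)
print(search_and(index, ["java", "index"]))
print(search_or(index, ["java", "index"]))
```

AND search intersects the posting lists of the query terms, while OR search takes their union, so neither needs to scan the document text at query time.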

  18. Index

  19. Index index

  20. • • • • Term • index • • •

  21. Inverted Index

  25. Term At A Time

  31. Term At A Time
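
As a rough illustration of Term At A Time evaluation, the sketch below processes one posting list completely before moving to the next, adding each posting's contribution to a per-document accumulator. The posting lists, their weights, and the function name `term_at_a_time` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical posting lists: term -> [(doc_id, score contribution), ...]
POSTINGS = {
    "linux": [(1, 2.0), (3, 1.0), (4, 0.5)],
    "os": [(1, 1.0), (2, 3.0), (4, 1.5)],
}


def term_at_a_time(postings, query_terms):
    """Score documents by walking the posting lists one term at a time."""
    accumulators = defaultdict(float)
    for term in query_terms:
        # The whole list for this term is consumed before the next term starts.
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    # Highest score first; ties are broken by the smaller document ID.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))


ranking = term_at_a_time(POSTINGS, ["linux", "os"])
print(ranking)
```

The accumulator table holds an entry for every document that matches any query term, which is the main memory cost of this strategy.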

  32. Inverted Index

  33. Inverted Index

  34. Document At A Time

  39. Document At A Time
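
Document At A Time evaluation can be sketched as below: all posting lists are walked in parallel in document-ID order, each document is scored completely before the next one is considered, and only the best `k` results are kept in a small heap. The posting lists, weights, and the function name `document_at_a_time` are assumptions for illustration, and the sketch assumes every posting list is sorted by document ID.

```python
import heapq

# Hypothetical posting lists, each sorted by doc_id: term -> [(doc_id, weight), ...]
POSTINGS = {
    "linux": [(1, 2.0), (3, 1.0), (4, 0.5)],
    "os": [(1, 1.0), (2, 3.0), (4, 1.5)],
}


def document_at_a_time(postings, query_terms, k=10):
    """Score one document at a time, keeping only the top k results."""
    lists = [postings.get(term, []) for term in query_terms]
    positions = [0] * len(lists)
    top = []  # min-heap of (score, -doc_id)
    while True:
        current = [lst[pos][0] for lst, pos in zip(lists, positions) if pos < len(lst)]
        if not current:
            break
        doc_id = min(current)  # the next unscored document across all lists
        score = 0.0
        for i, lst in enumerate(lists):
            if positions[i] < len(lst) and lst[positions[i]][0] == doc_id:
                score += lst[positions[i]][1]
                positions[i] += 1
        heapq.heappush(top, (score, -doc_id))
        if len(top) > k:
            heapq.heappop(top)  # drop the current worst result
    ordered = sorted(top, key=lambda entry: (-entry[0], -entry[1]))
    return [(-neg_id, score) for score, neg_id in ordered]


print(document_at_a_time(POSTINGS, ["linux", "os"], k=2))
```

Because a document's score is final as soon as it has been visited, only `k` candidates need to be held in memory at any time.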

  40. Inverted Index

  41. – – – –

  43. No

  44. Java

  45. Lucene • – • – • –

  46. Document At A Time

  47. Elasticsearch & Solr • • •

  49. Elasticsearch

  50. • • • • Term • index • • •

  51. • •

  52. Movies

  53. • • – • –

  55. A A C C B B • • – •

  60. DAAT

  69. Index

  73. AND

  74. AND

  75. • • • • Term • index • • •

  78. • • • • Term • index • • •

  79. Term At A Time

  80. Term • • • •

  81. Lucene Term Token lucene/core/src/java/org/apache/lucene/index/Term.java lucene/core/src/java/org/apache/lucene/analysis/Token.java

  82. Lucene Tokenizer • ClassicTokenizerFactory • EdgeNGramTokenizerFactory • HMMChineseTokenizerFactory • ICUTokenizerFactory

    • JapaneseTokenizerFactory • KeywordTokenizerFactory • LetterTokenizerFactory • LowerCaseTokenizerFactory • NGramTokenizerFactory • PathHierarchyTokenizerFactory • PatternTokenizerFactory • StandardTokenizerFactory • ThaiTokenizerFactory • UAX29URLEmailTokenizerFactory • UIMAAnnotationsTokenizerFactory • UIMATypeAwareAnnotationsTokenizerFactory • WhitespaceTokenizerFactory • WikipediaTokenizerFactory
  85. Tokenize WhitespaceTokenizer

  86. Tokenize …

  87. Lucene Tokenizer • ClassicTokenizerFactory • EdgeNGramTokenizerFactory • HMMChineseTokenizerFactory • ICUTokenizerFactory

    • JapaneseTokenizerFactory • KeywordTokenizerFactory • LetterTokenizerFactory • LowerCaseTokenizerFactory • NGramTokenizerFactory • PathHierarchyTokenizerFactory • PatternTokenizerFactory • StandardTokenizerFactory • ThaiTokenizerFactory • UAX29URLEmailTokenizerFactory • UIMAAnnotationsTokenizerFactory • UIMATypeAwareAnnotationsTokenizerFactory • WhitespaceTokenizerFactory • WikipediaTokenizerFactory
  88. NgramTokenizer • • • Q: Bigram A: No… ⇒ N-gram

    2-1gram
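
A character N-gram tokenizer can be sketched in a few lines. The function `char_ngrams` and the sample strings are assumptions for illustration and are not taken from the slides or from Lucene's `NGramTokenizer`, which offers more options such as separate minimum and maximum gram sizes.

```python
def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (bigrams by default)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("search", 3))
print(char_ngrams("東京都"))
```

With bigrams, the string "東京都" (Tokyo Metropolis) yields the grams "東京" and "京都", so a query for "京都" (Kyoto) also matches it. This is a commonly cited example of how N-gram indexing avoids missed hits at the cost of some unwanted ones.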
  89. Japanese Tokenizer: Kuromoji

    Output fields: Surface form, Part-of-Speech, Base form, Reading, Pronunciation
  90. Ngram vs Ngram Good! hit! Bad… No hit. Bad… hit!

    Good! No hit. Index Bad… Good! index
  91. Analyzer

  92. Analyzer in Elasticsearch

  93. Lucene Tokenizer • ClassicTokenizerFactory • EdgeNGramTokenizerFactory • HMMChineseTokenizerFactory • ICUTokenizerFactory

    • JapaneseTokenizerFactory • KeywordTokenizerFactory • LetterTokenizerFactory • LowerCaseTokenizerFactory • NGramTokenizerFactory • PathHierarchyTokenizerFactory • PatternTokenizerFactory • StandardTokenizerFactory • ThaiTokenizerFactory • UAX29URLEmailTokenizerFactory • UIMAAnnotationsTokenizerFactory • UIMATypeAwareAnnotationsTokenizerFactory • WhitespaceTokenizerFactory • WikipediaTokenizerFactory
  94. • • • • Term • index • • •

  95. index

  96. Ngram vs Ngram Good! hit! Bad… No hit. Bad… hit!

    Good! No hit. Index Bad… Good! index
  98. • • • • Term • index • • •

  100. C B A B A C A

  101. C B A B A B A

  102. C B A C A C A

  103. C B A C A B A • •

  104. C B A C A B A ↑ ↑

  105. C B A C A B A ↓ ↓

  106. C B A C A B A ↓ →

  107. C B A C A B A ↑ →

  108. C B A C A B A

  109. C B A C A B A • •

  110. C B A C A B A • • •

  111. C B A C A B A • • •

  112. C B A C A B A

  113. C A B A F-measure F-measure
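
The F-measure is conventionally the harmonic mean of precision and recall. The sketch below computes all three from a set of retrieved documents and a set of relevant documents; the document IDs and the function name `precision_recall_f` are hypothetical, and the balanced F1 form is assumed because the slide text does not say which weighting is used.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 4 documents returned, 3 documents actually relevant.
p, r, f = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(p, r, f)
```

Returning more documents tends to raise recall and lower precision, which is why a single combined figure is useful when comparing index configurations.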

  115. • – • – • –

  117. [Chart: one axis from 0 to 25,000,000, the other from 1 to 25]
  121. Ngram vs Ngram Good! hit! Bad… No hit. Bad… hit!

    Good! No hit. Index Bad… Good! index
  122. index • – • – • – •

  124. index ngram index • • • •

  127. index ngram index

  130. • • • • Term • index • • •

  132. • – • – • – • –

  133. • – • – • – • –

  135. Linux Linux Linux Unix OS OS : Operating System OS

    OS … OS Linux FreeBSD… … quantitative
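  137. TF-IDF

    • TF: Term Frequency, how often a term appears in a document
    • IDF: Inverse Document Frequency, which is higher for terms that appear in fewer documents
    • TF-IDF = TF × IDF, for a document D and a term T (TF depends on D and T; IDF depends on T only)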
  138. TF-IDF TF Term Frequency Term Linux Linux Unix OS

  139. TF-IDF: IDF (Inverse Document Frequency)

    $\mathrm{IDF}(\text{Linux}) = \log\frac{5}{2}$, $\mathrm{IDF}(\,\cdot\,) = \log\frac{5}{3}$
  140. TF-IDF IDF Inverse Document Frequency • – • log –

    TF IDF log Term T
  141. TF-IDF
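
A small worked example of TF-IDF = TF × IDF, assuming raw term counts for TF, the form IDF = log(N / df), and natural logarithms (the slide residue does not state the log base). The five-document corpus is hypothetical; it is arranged so that "linux" appears in 2 of 5 documents, which gives the log(5/2) value shown on the IDF slide.

```python
import math


def tf(term, doc_tokens):
    """Raw term frequency: how many times the term occurs in the document."""
    return doc_tokens.count(term)


def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    df = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical corpus of five tokenized documents.
corpus = [
    ["linux", "linux", "unix", "os"],
    ["linux", "kernel"],
    ["os", "windows"],
    ["os", "freebsd"],
    ["database"],
]

print(idf("linux", corpus))                # log(5/2)
print(tf_idf("linux", corpus[0], corpus))  # 2 * log(5/2)
```

A term that occurs in every document gets an IDF of log(1) = 0, so it contributes nothing to the score regardless of how often it appears.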

  143. BM25: TF-IDF

    Document 1: "Linux Linux Unix OS", TF(Linux) = 2
    Document 2: "Linux Linux Linux Linux …" ("Linux" repeated throughout), TF(Linux) = 14,352
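  145. BM25 Linux Linux Unix OS

    $\mathrm{BM25}(\text{Linux}) = \mathrm{IDF} \cdot \dfrac{\mathrm{TF} \cdot (k_1 + 1)}{\mathrm{TF} + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{DL}}{\mathrm{avgDL}}\right)}$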
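  146. BM25

    $\mathrm{BM25}(\text{Linux}) = \mathrm{IDF} \cdot \dfrac{\mathrm{TF} \cdot (k_1 + 1)}{\mathrm{TF} + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{DL}}{\mathrm{avgDL}}\right)}$

    Annotation: points are deducted when a word simply appears many times. (Remaining labels: "word", "word".)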
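
The effect of the BM25 term score can be checked numerically with the sketch below. It uses the standard single-term BM25 form; the parameter values `k1 = 1.2` and `b = 0.75` are commonly used defaults rather than values from the slides, and the IDF, document lengths, and average length are hypothetical. Only the two TF values, 2 and 14,352, come from the deck.

```python
def bm25_term_score(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Single-term BM25: IDF * TF*(k1+1) / (TF + k1*(1 - b + b*DL/avgDL))."""
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Hypothetical setting: IDF fixed at 1.0 and an average document length of 100 terms.
short_doc = bm25_term_score(tf=2, idf=1.0, doc_len=4, avg_doc_len=100)
stuffed_doc = bm25_term_score(tf=14352, idf=1.0, doc_len=14352, avg_doc_len=100)

print(short_doc, stuffed_doc)
```

Under these assumptions the raw TF values differ by a factor of more than 7,000, yet the two BM25 scores stay close together, because the TF component saturates below `k1 + 1` and long documents are normalized by `DL / avgDL`.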
  147. BM25 and TF-IDF Linux OS … FreeBSD Unix … •

    Linux: 23 • OS: 11 • : 17 • : 0.331 • : 3.65 • : 0.003 • : 0.0001 • : 0.000053 • : 0.023 • FreeBSD: 65 • OS: 9 • : 5 • : 42 • : 58 • : 2 • :0.003 • : 0.00428 • : 0.00084 • : 90 • : 3 • : 1.8 • : 0.2 • : 0.00189
  149. index • – • – • – •

  155. • • • • Term • index • • •

  156. nDCG • • • • •

  157. nDCG

  164. nDCG S1 S2 S3 S5 S4

  165. nDCG DCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_1 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots$
  166. nDCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_1 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots$
  167. nDCG DCG=iDCG (ideal DCG) S1 S2 S4 S5 S3

  169. DCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_1 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots = 439.23$
  170. iDCG S1 S2 S4 S5 S3

    $\mathrm{DCG}_5 = S_4 + \frac{S_1}{\log 2} + \frac{S_2}{\log 3} + \dots = 518.78$
  171. $\mathrm{nDCG} = \frac{439.23}{518.78} \approx 0.847$

  172. DCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_4 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots = 489.23$
  173. $\mathrm{nDCG} = \frac{489.23}{518.78} \approx 0.943$
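
The DCG, iDCG, and nDCG calculation can be sketched as follows. The sketch assumes base-2 logarithms with the first position left undiscounted, which is one common convention and is not confirmed by the slide residue. The relevance scores are hypothetical, since the actual S1 to S5 values are not in the transcript, so the result will not reproduce 439.23 or 518.78.

```python
import math


def dcg(scores):
    """Discounted cumulative gain: the score at rank i >= 2 is divided by log2(i)."""
    return sum(
        score if rank == 1 else score / math.log2(rank)
        for rank, score in enumerate(scores, start=1)
    )


def ndcg(scores):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal else 0.0


# Hypothetical relevance scores, listed in the order the engine returned the results.
ranking_scores = [3, 2, 3, 0, 1]
print(dcg(ranking_scores), ndcg(ranking_scores))
```

An nDCG of 1.0 means the results are already in the ideal order. Dividing by iDCG makes scores comparable across queries whose relevance totals differ.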

  174. nDCG S1 S2 S3 S5 S4 S1 S2 S3 S5

    S4
  176. • • • • Term • index • • •

  178. MLR = Machine Learning Ranking S1 S2 S3 S5 S4

    S2 S3 S5 S4 S1