Improving Search Result Quality

These are the materials from the "Boot Camp Special Lecture" held in April and May 2017.

Recruit Technologies

June 02, 2017

Transcript

  1. Introduction to Information Retrieval Modern Information Retrieval : CS 276

    Information Retrieval and Web Search http://web.stanford.edu/class/cs276/ …
  2. No

  3. AND

  4. AND

  5. Lucene Tokenizer • ClassicTokenizerFactory • EdgeNGramTokenizerFactory • HMMChineseTokenizerFactory • ICUTokenizerFactory

    • JapaneseTokenizerFactory • KeywordTokenizerFactory • LetterTokenizerFactory • LowerCaseTokenizerFactory • NGramTokenizerFactory • PathHierarchyTokenizerFactory • PatternTokenizerFactory • StandardTokenizerFactory • ThaiTokenizerFactory • UAX29URLEmailTokenizerFactory • UIMAAnnotationsTokenizerFactory • UIMATypeAwareAnnotationsTokenizerFactory • WhitespaceTokenizerFactory • WikipediaTokenizerFactory
  9. Ngram vs Ngram Good! hit! Bad… No hit. Bad… hit!

    Good! No hit. Index Bad… Good! index
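
The hit / no-hit comparison above concerns N-gram indexing, as provided by the NGramTokenizerFactory in the tokenizer list. As a minimal sketch of the idea, and not Lucene's implementation, the following Python splits text into character bigrams and checks whether a query would match. The function names and the sample strings are assumptions chosen for illustration; they do not come from the slides.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_match(query: str, document: str, n: int = 2) -> bool:
    """True if every n-gram of the query also occurs in the document.

    Simplification: positions are ignored, so this is looser than a
    real phrase query over an n-gram index.
    """
    query_grams = char_ngrams(query, n)
    if not query_grams:
        # Query shorter than n: no n-grams to look up.
        return False
    doc_grams = set(char_ngrams(document, n))
    return all(gram in doc_grams for gram in query_grams)


if __name__ == "__main__":
    print(char_ngrams("東京都"))          # ['東京', '京都']
    print(ngram_match("京都", "東京都"))  # True
```

The second call shows the usual trade-off: an N-gram index rarely misses a substring, but it also reports a hit for "京都" (Kyoto) inside "東京都" (Tokyo Metropolis), which a reader would probably consider a bad hit.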
  12. [Chart: axis ticks from 0 to 25,000,000 and from 1 to 25]
  14. Linux Linux Linux Unix OS OS : Operating System OS

    OS … OS Linux FreeBSD… … quantitative
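  15. TF-IDF • TF: Term Frequency, how often a Term occurs • IDF: Inverse Document Frequency, how rare a Term is

    • TF-IDF = TF × IDF, for a document D and a Term T. TF: frequency of T in D. IDF: inverse document frequency of T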
  16. TF-IDF: IDF (Inverse Document Frequency) • Linux • IDF uses a log

    Term T • Linux • $\mathrm{IDF}(\text{Linux}) = \log\frac{5}{2}$ • $\mathrm{IDF}(\ ) = \log\frac{5}{3}$
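
As a rough illustration of the TF-IDF slides, the sketch below computes TF, IDF and their product over a tiny corpus. It reads the IDF example as five documents of which two contain "Linux", and it assumes the natural logarithm. The corpus contents and the use of "os" as the second term are hypothetical stand-ins, since the original documents and the second term's name are not recoverable from the transcript.

```python
import math


def tf(term: str, doc: list[str]) -> int:
    """Term frequency: number of times term occurs in doc."""
    return doc.count(term)


def idf(term: str, docs: list[list[str]]) -> float:
    """Inverse document frequency: log(N / number of docs containing term)."""
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        return 0.0  # term never occurs; treat it as carrying no weight
    return math.log(len(docs) / df)


def tf_idf(term: str, doc: list[str], docs: list[list[str]]) -> float:
    """TF-IDF = TF x IDF for one term in one document."""
    return tf(term, doc) * idf(term, docs)


# Hypothetical corpus: 5 documents, "linux" in 2 of them, "os" in 3.
docs = [
    ["linux", "linux", "unix", "os"],
    ["linux", "kernel"],
    ["os", "freebsd"],
    ["unix", "os"],
    ["freebsd"],
]

if __name__ == "__main__":
    print(idf("linux", docs))              # log(5/2)
    print(idf("os", docs))                 # log(5/3)
    print(tf_idf("linux", docs[0], docs))  # 2 * log(5/2)
```

Because "os" appears in more documents than "linux", it gets the smaller IDF and contributes less to the score.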
  17. BM25: TF-IDF Linux Linux Unix OS

    Linux Linux Linux Linux Linux Linux Linux Linux … TF(Linux) = 14,352, TF(Linux) = 2
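  19. BM25 $\mathrm{BM25}(\text{Linux}) = \mathrm{TF} \cdot \mathrm{IDF} \cdot \frac{k_1 + 1}{\mathrm{TF} + k_1 \left(1 - b + b \, \frac{DL}{avgDL}\right)}$

    Points are deducted when a word merely appears many times • DL, avgDL: counted in words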
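
The BM25 formula above can be sketched in a few lines. The defaults `k1 = 1.2` and `b = 0.75` are commonly used values and are an assumption here, as are the document lengths in the usage example. Only the two TF values (14,352 and 2) come from the slides.

```python
import math


def bm25_term_score(tf: float, idf: float, doc_len: float,
                    avg_doc_len: float, k1: float = 1.2,
                    b: float = 0.75) -> float:
    """BM25 contribution of one term to one document's score.

    tf:          occurrences of the term in the document
    idf:         inverse document frequency of the term
    doc_len:     document length in words (DL)
    avg_doc_len: average document length in the collection (avgDL)
    """
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    idf_linux = math.log(5 / 2)  # IDF(Linux) from the earlier slide
    # Hypothetical lengths: a 4-word document, a keyword-stuffed one,
    # and an assumed collection average of 100 words.
    normal = bm25_term_score(tf=2, idf=idf_linux, doc_len=4, avg_doc_len=100)
    stuffed = bm25_term_score(tf=14_352, idf=idf_linux, doc_len=14_355,
                              avg_doc_len=100)
    print(normal, stuffed)
```

With raw TF-IDF the stuffed document would score 7,176 times higher than the normal one. Under BM25 the term's contribution can never exceed `idf * (k1 + 1)`, so the two scores end up within a small factor of each other.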
  20. BM25 and TF-IDF Linux OS … FreeBSD Unix … •

    Linux: 23 • OS: 11 • : 17 • : 0.331 • : 3.65 • : 0.003 • : 0.0001 • : 0.000053 • : 0.023 • FreeBSD: 65 • OS: 9 • : 5 • : 42 • : 58 • : 2 • :0.003 • : 0.00428 • : 0.00084 • : 90 • : 3 • : 1.8 • : 0.2 • : 0.00189
  21. nDCG: DCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_1 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots$
  22. nDCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_1 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots$
  23. DCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_1 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots = 439.23$
  24. iDCG S1 S2 S4 S5 S3

    $\mathrm{DCG}_5 = S_4 + \frac{S_1}{\log 2} + \frac{S_2}{\log 3} + \dots = 518.78$
  25. DCG S1 S2 S3 S5 S4

    $\mathrm{DCG}_5 = S_4 + \frac{S_2}{\log 2} + \frac{S_3}{\log 3} + \dots = 489.23$
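
To make the DCG, iDCG and nDCG slides concrete, here is a minimal sketch. It assumes the base-2 logarithm, which is the common convention but is not confirmed by the transcript, and it uses made-up relevance scores, because the individual values S1 to S5 behind the slide totals are not given.

```python
import math


def dcg(scores: list[float]) -> float:
    """Discounted cumulative gain of scores listed in ranked order.

    Rank 1 is taken as-is; rank i >= 2 is divided by log2(i).
    """
    total = 0.0
    for rank, score in enumerate(scores, start=1):
        total += score if rank == 1 else score / math.log2(rank)
    return total


def ndcg(scores: list[float]) -> float:
    """DCG divided by the ideal DCG (scores sorted best-first)."""
    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    ranking = [3, 2, 3, 0, 1]  # hypothetical relevance scores by rank
    print(dcg(ranking), ndcg(ranking))
    # Using the totals on the slides: nDCG = DCG / iDCG.
    print(439.23 / 518.78)
```

Taking the slide totals at face value, the ranking with DCG 439.23 against an ideal of 518.78 has an nDCG of roughly 0.85. A ranking that is already in ideal order scores exactly 1.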