Improving the Accuracy of Hatena Bookmark Full-Text Search

Slides presented at Hatena Engineer Seminar #5

Takuya Asano

June 16, 2015

Transcript

  1. Improving the Accuracy of Hatena Bookmark Full-Text Search
     Hatena Engineer Seminar #5
     Takuya Asano (id:takuya-a, @takuya_a)

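  2. id:takuya-a
     Platform & Ad-Tech Team; joined the company in April 2015
     Interests: • information retrieval • natural language processing • machine learning
     OSS activity: develops JavaScript libraries such as kuromoji.js
     (My icon has changed)
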
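  3. Enter a search keyword

  4. This talk covers body-text search only

  5. Results can be sorted by newest or by popularity

  6. The problem with body-text search: accuracy is poor

  7. Before the improvement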

  8. None
  9. None
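  10. Challenge: accuracy of body-text search
      At a glance, the results contain a lot of noise. For example:
      • The entry's main topic has nothing to do with Kyoto
      • Body-text extraction errors (e.g. "Kyoto" appears only among the related-article titles)

  11. The task
      A search for "Kyoto" should return "Kyoto-like" entries

  12. So what does "Kyoto-like" mean in the first place?

  13. "Kyoto-like" (meaning what?)
      • Try to imagine the search needs and use cases
      • What does someone who types the query "Kyoto" want to do?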

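  14. "Kyoto-like" (meaning what?)
      Search needs (imagined):
      • Wants to know Kyoto's famous places, sights, and place names
      • Wants to know about lodging that is unique to Kyoto
      • Wants to know about food in Kyoto
      • Wants to know about news from Kyoto

  15. But with only the single keyword "Kyoto"

  16. Challenge 1: usability of query writing
      Writing a query is hard work for end users:
      • They cannot think of appropriate keywords
      • They do not know much about the thing they want to learn about in the first place
      • Thinking up and typing a query is a hassle (especially on a smartphone)
      Users want the system to do the right thing from the keyword "Kyoto" alone
      (in practice, single-keyword searches are overwhelmingly the most common)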
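  17. Challenge 2: search accuracy when sorting by newest
      In full-text search, it is hard to get good accuracy with any ordering other than score order.
      Ordinary full-text search ranks by a score that takes into account things like how often a word occurs (= its frequency).

  18. What is accuracy?
      Two evaluation metrics that are fundamental to many tasks:
      • Precision = how little noise the results contain
      • Recall = how few relevant results are missed
      The two are a trade-off:
      • Trying to raise precision lowers recall
      • Trying to raise recall lowers precision

  19. What is accuracy? Extreme examples
      • Return only the one entry you are confident about
        → precision is 100%, but recall is low
      • Return every entry
        → recall is 100%, but precision is low

  20. Challenge 3: balancing precision and recall
      Little noise (precision) is the priority, but few missed results (recall) matters too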
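
The two extreme cases on slide 19 can be checked with a few lines of Python. This is a minimal sketch, not code from the talk: the collection of 10 entries and the set of 4 relevant ones are made-up values chosen only to make the trade-off visible.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: 10 entries, 4 of which are truly "Kyoto-like".
all_entries = [f"e{i}" for i in range(10)]
relevant = {"e0", "e1", "e2", "e3"}

# Extreme 1: return only one entry we are confident about.
print(precision_recall(["e0"], relevant))       # (1.0, 0.25)

# Extreme 2: return every entry.
print(precision_recall(all_entries, relevant))  # (0.4, 1.0)
```
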

  21. Summary of the challenges
      1. Usability of query writing
      2. Search accuracy when sorting by newest
      3. Balancing precision and recall

  22. Done!

  23. None
  24. None
  25. http://b.hatena.ne.jp/search/text?q=京都
      Sightseeing, food information, and news about Kyoto keep coming up!

  26. http://b.hatena.ne.jp/search/text?q=コーヒー
      Information and news about coffee keep coming up!

  27. http://b.hatena.ne.jp/search/text?q=機械学習
      Information and news about machine learning keep coming up!

  28. How was this achieved?

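  29. Raising both precision and recall (with a single keyword, sorted by newest, no less)
      The straightforward approach cannot do it: even with a carefully tuned score function, ordinary keyword search has its limits.
      From the single keyword "Kyoto", we need to express the concept of "Kyoto-likeness".

  30. Idea: use Hatena Bookmark tags
      Use the metadata that has accumulated in Hatena Bookmark over 10 years.
      • Hatena Bookmark tags = human-made "ground-truth data"
      • Entries tagged "Kyoto" should be "Kyoto-like entries"!
        → Extract the concept of "Kyoto-likeness" from those entries
        → Search using the extracted concept of "Kyoto-likeness"
      (Even without tag information, the idea itself should have many applications)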
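  31. Concept search
      Search engines that use concept search:
      • Autonomy (Hewlett-Packard)
      • GETA (NII)
      • ConceptBase (JustSystems)
      • Others, such as patent search engines
      "Concept search (also called natural-sentence search, natural-language search, similar-document search, or associative search) is an automated information retrieval method used to search stored unstructured data (electronic archives, e-mail, scientific literature, etc.) for information that is conceptually similar to the search query."
      Quoted from http://ja.wikipedia.org/wiki/概念検索

  32. What is a concept?
      In concept search, a concept is a set of keywords.
      • A concept is considered to be made up of a collection of related keywords
      • It is approximated by a set of multiple related keywords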

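  33. How concept search works
      The three key algorithms:
      1. Related-keyword extraction: extract keywords that are related to (co-occur with) the original query
      2. Query expansion: add the related keywords to the query
      3. Concept search: run a full-text search with the expanded query
      Once the related keywords can be extracted, all that remains is adding them to the query.

  34. Related-keyword extraction by feature-word selection

  35. Results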

  36. Related keywords for Kyoto (京都)
      1 祇園 (Gion), 2 寺 (temple), 3 075, 4 神社 (shrine), 5 桜 (cherry blossom),
      6 京 (Kyo), 7 近鉄 (Kintetsu), 8 小路 (alley), 9 紅葉 (autumn leaves), 10 美術館 (art museum),
      11 上京 (Kamigyo), 12 伝統 (tradition), 13 烏丸 (Karasuma), 14 通 (street), 15 町家 (machiya townhouse),
      16 エリア (area), 17 旅館 (ryokan), 18 まち (town), 19 博物館 (museum), 20 展 (exhibition),
      21 観光 (sightseeing), 22 河原町 (Kawaramachi), 23 kyoto, 24 カフェ (cafe), 25 中京 (Nakagyo)

  37. Synonyms of Kyoto (same list as slide 36)

  38. Place names in Kyoto (hyponyms) (same list as slide 36)

  39. Words associated with Kyoto (same list as slide 36)
      Annotation: Kyoto's area code (075)

  40. Kyoto's history and sights (same list as slide 36)

  41. Kyoto's sightseeing, lodging, and transportation (same list as slide 36)

  42. Kyoto's food information (same list as slide 36)

  43. The effect of query expansion with related keywords

  44. Related keywords for Panasonic (パナソニック)
      1 lumix, 2 dmc, 3 viera, 4 津賀 (Tsuga), 5 diga,
      6 プラズマ (plasma), 7 pdp, 8 パナソニック (Panasonic), 9 三洋 (Sanyo), 10 btob,
      11 有機 (organic), 12 三洋電機 (Sanyo Electric), 13 松下 (Matsushita), 14 大坪 (Otsubo), 15 社名 (company name),
      16 el, 17 電工 (electric works), 18 電池 (battery), 19 ces, 20 電機 (electric),
      21 avc, 22 sanyo, 23 モバイルコミュニケーションズ (Mobile Communications), 24 パネル (panel), 25 液晶 (LCD)

  45. Effect of query expansion: synonyms are acquired automatically (same list as slide 44)
  46. Related keywords for Splatoon (スプラトゥーン)
      1 ブキ (weapon), 2 スプラトゥーン (Splatoon), 3 splatoon, 4 シュータ (shooter), 5 amiibo,
      6 wiiu, 7 ナワバリバトル (Turf War), 8 インク (ink), 9 わかば (Wakaba), 10 tps,
      11 試射 (test firing), 12 塗る (to paint), 13 シオカラーズ (Squid Sisters), 14 チャージャ (charger), 15 fps,
      16 ローラ (roller), 17 pic, 18 立ち回り (positioning), 19 イカ (squid), 20 ランク (rank),
      21 マップ (map), 22 対戦 (versus match), 23 エイム (aim), 24 マリオ (Mario), 25 敵 (enemy)
      シューター -> シュータ and ローラー -> ローラ appear in this form because of the stemming step in the Elasticsearch analyzer.

  47. Effect of query expansion: synonyms are acquired even for newly coined words (same list as slide 46)
      Conversely, extracting the related keywords for the alphabetic "splatoon" yields the katakana "スプラトゥーン"!
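  48. Related keywords for Uniqlo (ユニクロ)
      1 しまむら (Shimamura), 2 柳井 (Yanai), 3 ユニ (Uni), 4 ヒートテック (Heattech), 5 クロ (Kuro),
      6 uniqlock, 7 uniqlo, 8 無印 (Muji), 9 良品 (Ryohin), 10 衣料 (clothing),
      11 旗艦 (flagship), 12 チノ (chino), 13 ダサい (uncool), 14 ファーストリテイリング (Fast Retailing), 15 残業 (overtime),
      16 セータ (sweater), 17 ジーンズ (jeans), 18 tシャツ (T-shirt), 19 店長 (store manager), 20 アウタ (outerwear),
      21 離職 (job turnover), 22 服 (clothes), 23 品質 (quality), 24 百貨店 (department store), 25 フリース (fleece)

  49. Effect of query expansion: compensating for morphological-analysis errors (same list as slide 48)
      • Morphological-analysis errors can be compensated for (ユニクロ is not in the dictionary)
      • The output of morphological analysis is not stable: even if the word was split as "ユニ/クロ" when it was indexed and the query becomes the single token "ユニクロ", the search still works
      • Patterns that the morphological analyzer handles badly are learned and compensated for automatically

  50. The limits of query expansion with related keywords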

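  51. Not every search need can be covered
      (Same list of 25 related keywords for Kyoto as on slide 36.)
      • The expansion is meaningless for someone who wants to know "the weather in Kyoto"
      • Because of constraints on time and space cost, the query cannot be expanded without limit

  52. Summary so far
      • Related-keyword extraction appears to be working well
      • Query expansion can assist users in writing queries
      • Query expansion also has limits
      • Still, benefits such as synonym expansion and compensating for morphological-analysis errors are large
      How were the related keywords extracted?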

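  53. Related-keyword extraction by feature-word selection (algorithm part)

  54. What is a related keyword?
      Related keywords (we assumed) have the following properties:
      • They are words that represent the content of a document about the concept (= feature words)
      • They appear in multiple documents about the concept
      First, consider extracting feature words from a single document (entry).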

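  55. Policy for feature-word extraction
      Do not use machine learning (e.g. learning to rank):
      • Explaining it to stakeholders and interpreting the results become difficult
      • It is hard to combine with hand-written rules and heuristics
      • Little data can be prepared, so the risk of overfitting is high
      Rely on classical information retrieval (IR) methods:
      • Use the term statistics available from the Elasticsearch Term Vectors API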
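  56. Elasticsearch Term Vectors API
      Term statistics available from Term Vectors.
      Term = a unit produced by the Elasticsearch analyzer (note that it does not always coincide with what we would call a word).
      term_freq   number of occurrences (frequency) of the term in the entry
      doc_freq    number of entries in the whole index in which the term appears
      ttf         sum of the term's occurrences (frequency) over the whole index
      doc_count   total number of entries in the index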
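
As a rough illustration of where these four numbers live, the sketch below builds a Term Vectors request body and flattens a response into per-term statistics. It makes no network call. The field name `body` and every number in `sample_response` are invented for the example, and the endpoint name has differed between Elasticsearch versions, so check the documentation for the version in use.

```python
def term_vectors_body(field):
    """Request body asking for term and field statistics of one field."""
    return {
        "fields": [field],
        "term_statistics": True,   # adds doc_freq and ttf per term
        "field_statistics": True,  # adds doc_count for the field
        "positions": False,
        "offsets": False,
    }


def flatten_term_stats(response, field):
    """Map each term to the four statistics listed on slide 56."""
    vec = response["term_vectors"][field]
    doc_count = vec["field_statistics"]["doc_count"]
    return {
        term: {
            "term_freq": s["term_freq"],
            "doc_freq": s["doc_freq"],
            "ttf": s["ttf"],
            "doc_count": doc_count,
        }
        for term, s in vec["terms"].items()
    }


# Hand-written response in the shape the API returns (values are made up).
sample_response = {
    "term_vectors": {
        "body": {
            "field_statistics": {"doc_count": 1000, "sum_doc_freq": 52000, "sum_ttf": 90000},
            "terms": {
                "祇園": {"term_freq": 3, "doc_freq": 12, "ttf": 40},
                "to": {"term_freq": 5, "doc_freq": 900, "ttf": 2500},
            },
        }
    }
}

stats = flatten_term_stats(sample_response, "body")
print(stats["祇園"])
```
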
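  57. Related-keyword extraction flow in Elasticsearch
      1. (Tag search) Run the input query against the tag field as a filtering search with a Filtered Query
      2. (Fetch statistics) Get term statistics with the Term Vectors API (strictly speaking, the Multi Term Vectors API)
      3. (Feature-word extraction) For each document, compute term importance on the server and extract the Top-25 feature words
      4. (Related-keyword extraction) Drop terms that appear in only a few documents, then extract the Top-25 terms with the highest scores
      (Once the related keywords are extracted, all that remains is adding them to the query)
      How should term importance be computed?
      Tip: speed up the Top-K computation with a binary heap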
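
The binary-heap tip can be sketched with Python's standard `heapq` module. This is an illustration of the general technique, not the talk's implementation: a min-heap of size K keeps the K best terms seen so far, so each candidate costs O(log K) instead of sorting the whole list. The sample scores are made up.

```python
import heapq


def top_k(scored_terms, k):
    """Return the k highest-scoring (term, score) pairs, best first."""
    heap = []  # min-heap of (score, term); heap[0] is the weakest survivor
    for term, score in scored_terms:
        if len(heap) < k:
            heapq.heappush(heap, (score, term))
        elif score > heap[0][0]:
            # Replace the current weakest entry with the better candidate.
            heapq.heapreplace(heap, (score, term))
    return [(term, score) for score, term in sorted(heap, reverse=True)]


scores = [("a", 1.0), ("b", 5.0), ("c", 3.0), ("d", 4.0)]
print(top_k(scores, 2))  # [('b', 5.0), ('d', 4.0)]
```

For small inputs `heapq.nlargest(k, ...)` does the same job in one call.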
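  58. Algorithm for computing feature words
      In information retrieval, the following two are the de facto measures of term importance (weighting):
      • TF-IDF (the product of TF, term frequency, and IDF, inverse document frequency)
      • BM25 (based on a probabilistic model; also takes document length into account)
      We adopted TF-IDF: BM25 is more expensive to compute and requires tuning two parameters.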
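  59. TF-IDF
      The TF-IDF formula (several variations exist)
      TF  : term frequency fi,j → higher for terms that occur many times in the document
      IDF : inverse of the document frequency ni (N is the number of documents) → higher for rare terms
      fi,j : number of occurrences (frequency) of the term in entry j = term_freq
      N    : total number of entries in the index = doc_count
      ni   : number of entries in the whole index in which the term appears = doc_freq
      TF-IDF alone does not work well:
      • Numbers that carry little meaning, such as "1"
      • English stop words such as "to" (Japanese stop words are rejected by an Elasticsearch filter)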
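
The formula on this slide was an image and did not survive in the transcript. One common variant that is consistent with the symbol definitions above is shown below. The logarithm base and any smoothing terms are assumptions, since the slide itself notes that several variations exist.

```latex
\mathrm{tfidf}_{i,j} \;=\; \underbrace{f_{i,j}}_{\mathrm{TF}} \times \underbrace{\log \frac{N}{n_i}}_{\mathrm{IDF}}
```
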
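  60. Dynamic stop word list
      Preparing and maintaining a stop-word dictionary in advance only goes so far (Japanese, English, ...).
      Compute stop words dynamically from the term statistics
      (this adapts flexibly to whatever entries are registered in the index).
      IDF   the classical method
      RIDF  a method that focuses on the difference between function words and content words
      Gain  a method that focuses on information gain
      Terms for which these measures fall below a threshold are excluded as stop words.

  61. IDF
      Inverse Document Frequency
      Terms that appear in almost every document get a low score.
      N  : total number of entries in the index = doc_count
      ni : number of entries in the whole index in which the term appears = doc_freq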
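  62. RIDF
      Residual IDF
      Church, K. W. and Gale, W. A. (1995a). “Inverse Document Frequency (IDF): A Measure of Deviation from Poisson.” In Proc. of the 3rd Workshop on Very Large Corpora, pp. 121–130.
      N  : total number of entries in the index = doc_count
      ni : number of entries in the whole index in which the term appears = doc_freq
      Fi : total number of occurrences of the term in the whole index = ttf
      The number of occurrences of a term in one document is modeled with a Poisson distribution.
      Function words: spread over many documents (distributed uniformly)
      Content words: concentrated in a few documents (distributed unevenly)
      RIDF = the difference between the estimated IDF and the actual IDF
      The parameter (= expected value) λi of the Poisson distribution P(k; λi) is estimated as the term's total occurrences (Fi) / number of documents (N), assuming the term is uniformly distributed.
      P(0; λi) is the probability that the term never occurs, so 1 - P(0; λi) is the probability that it occurs at least once.
      → The denominator N (1 - P(0; λi)) is an estimate of the document frequency ni
      → The right-hand term is an estimate of IDFi
      Low RIDF: small difference from the estimate. High RIDF: large difference from the estimate.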
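
The RIDF formula itself is missing from the transcript. Following the description above (actual IDF minus the IDF predicted by the Poisson model, with denominator N(1 - P(0; λi))), it can be reconstructed as follows. The logarithm base is an assumption.

```latex
\begin{align*}
\lambda_i &= \frac{F_i}{N}, \qquad P(0;\lambda_i) = e^{-\lambda_i} \\
\mathrm{RIDF}_i &= \log\frac{N}{n_i} \;-\; \log\frac{N}{N\,\bigl(1 - P(0;\lambda_i)\bigr)} \\
                &= \mathrm{IDF}_i + \log\bigl(1 - e^{-\lambda_i}\bigr)
\end{align*}
```

A function word that is spread evenly has an actual document frequency close to the Poisson estimate, so its RIDF is near zero. A content word concentrated in a few documents has a much smaller ni than predicted, so its RIDF is large.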
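  63. Gain
      The score becomes low both for extremely high-frequency terms (appearing in almost every entry) and for extremely low-frequency terms (appearing in only a few entries).
      Result: it had no effect. High-frequency terms have already been filtered out by Elasticsearch.
      N  : total number of entries in the index = doc_count
      ni : number of entries in the whole index in which the term appears = doc_freq
      Papineni, K. (2001). “Why Inverse Document Frequency?” In Proc. of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), pp. 25–32.

  64. Summary: feature-word extraction
      • Explained the algorithms for feature-word extraction
        • TF-IDF
      • Dynamic stop words remove the need to maintain a stop-word dictionary
        • IDF
        • RIDF
        • Gain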
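
The measures summarized above can be put together in a short sketch. Several things here are assumptions rather than the talk's code: natural logarithms are used, a term is treated as a stop word when any one measure falls below its threshold, and the Gain formula is the form commonly attributed to Papineni (2001), because the slide's own formula did not survive in the transcript. The statistics and thresholds in the example are made up.

```python
import math


def idf(doc_count, doc_freq):
    """Inverse document frequency: log(N / n_i)."""
    return math.log(doc_count / doc_freq)


def ridf(doc_count, doc_freq, ttf):
    """Residual IDF: actual IDF minus the IDF expected under a Poisson model."""
    lam = ttf / doc_count                               # lambda_i = F_i / N
    expected_idf = -math.log(1.0 - math.exp(-lam))      # -log(1 - P(0; lambda_i))
    return idf(doc_count, doc_freq) - expected_idf


def gain(doc_count, doc_freq):
    """Assumed Gain form: p * (p - 1 - log p) with p = n_i / N."""
    p = doc_freq / doc_count
    return p * (p - 1.0 - math.log(p))


def tf_idf(stats):
    return stats["term_freq"] * idf(stats["doc_count"], stats["doc_freq"])


def is_stop_word(stats, idf_min, ridf_min, gain_min):
    """Assumption: one measure below its threshold is enough to drop the term."""
    n, df, ttf = stats["doc_count"], stats["doc_freq"], stats["ttf"]
    return (
        idf(n, df) < idf_min
        or ridf(n, df, ttf) < ridf_min
        or gain(n, df) < gain_min
    )


# Made-up statistics: an evenly spread function word vs. a concentrated content word.
function_word = {"term_freq": 5, "doc_freq": 900, "ttf": 2000, "doc_count": 1000}
content_word = {"term_freq": 3, "doc_freq": 10, "ttf": 200, "doc_count": 1000}

for stats in (function_word, content_word):
    print(is_stop_word(stats, idf_min=1.0, ridf_min=0.55, gain_min=0.0), round(tf_idf(stats), 3))
```
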
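  65. Evaluating and optimizing feature-word selection
      • Evaluate on the Top-50
      • Treat it as binary classification: relevant (+1) or not relevant (-1)
      • Prepare ground-truth data (about 500 items)
      Example, for an entry about art: いく ("go") -1, ゴッホ (Van Gogh) +1, あんまり ("not really") -1, ...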
  66. This time: evaluation and optimization with MAP
      Achieved a MAP of over 90% (on the training data)
      Evaluation metrics: P@n / AP / MAP
      • P@n (Precision at n): precision over the top n ranks
      • AP (Average Precision): P@n averaged up to rank n
      • MAP (Mean Average Precision): AP averaged over all entries
      Optimization by MAP:
        Max MAP        : 0.9173
        -----------------------
        IDF threshold  : 6.0
        RIDF threshold : 0.55
        Gain threshold : 0.0
      (Good kids should use cross-validation)
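
The three metrics can be sketched as below. This is a minimal illustration that follows the common textbook definition of AP (the mean of P@k taken at the ranks where a relevant item appears, normalized by the number of relevant items in the ranking), which may differ in detail from what the talk used. Labels follow the +1 / -1 convention of slide 65, and the two rankings are made up.

```python
def precision_at(labels, n):
    """P@n: fraction of relevant (+1) items among the top n."""
    top = labels[:n]
    return sum(1 for y in top if y > 0) / len(top) if top else 0.0


def average_precision(labels):
    """AP: mean of P@k over the ranks k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, y in enumerate(labels, start=1):
        if y > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: AP averaged over all rankings (one ranking per entry)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical rankings of extracted feature words.
rankings = [[+1, -1, +1], [-1, +1]]
print(precision_at(rankings[0], 1))        # 1.0
print(average_precision(rankings[1]))      # 0.5
print(mean_average_precision(rankings))
```

Thresholds such as the IDF, RIDF, and Gain cut-offs can then be tuned by evaluating MAP for each candidate setting and keeping the best one, ideally on held-out data rather than the training set.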
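  67. Summary: evaluation and optimization
      • Introduced three evaluation metrics (P@n / AP / MAP)
      • Optimized the parameters to maximize the evaluation metric (MAP)
      • Achieved high accuracy by combining three score functions (TF-IDF, IDF, RIDF)

  68. Review: outline of concept search
      1. Related-keyword extraction: extract keywords that are related to (co-occur with) the original query
      2. Query expansion: add the related keywords to the query
      3. Concept search: run a full-text search with the expanded query
      Once the related keywords can be extracted, all that remains is adding them to the query.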
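  69. Query expansion & concept search with Elasticsearch
      • The original keyword "must be contained" (the must clause of a Bool Query)
      • Related keywords only add to the score (the should clause of a Bool Query)
      • A threshold on the score (specify min_score in the query)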
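
The three points above map onto a search request body roughly as follows. This is a sketch of the query shape only: the field name `body`, the use of `match` clauses, and the `min_score` value are illustrative assumptions, not values from the talk.

```python
def expanded_query(keyword, related_keywords, field="body", min_score=1.0):
    """Build a search body: original keyword required, related keywords optional."""
    return {
        # Hits whose score is below this threshold are dropped.
        "min_score": min_score,
        "query": {
            "bool": {
                # The original keyword must match.
                "must": [{"match": {field: keyword}}],
                # Related keywords are optional and only raise the score.
                "should": [{"match": {field: term}} for term in related_keywords],
            }
        },
    }


body = expanded_query("京都", ["祇園", "寺", "神社"])
print(body["query"]["bool"]["should"])
```

Because every hit still has to satisfy the `must` clause, the expansion does not add documents that lack the original keyword. It changes which of them clear the score threshold.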
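  70. Outline of concept search
      1. Related-keyword extraction: extract keywords that are related to (co-occur with) the original query
      2. Query expansion: add the related keywords to the query
      3. Concept search: run a full-text search with the expanded query

  71. Overall summary
      • Raised precision through query expansion based on related-keyword extraction
      • Even a single input keyword is complemented with Hatena Bookmark tag information
      • High precision even when sorting by newest

  72. The future of Hatena Bookmark full-text search
      • Evaluate search accuracy and optimize the parameters
      • Improve recall while keeping precision at a high level
      • Add fields and dictionaries for analysis to Elasticsearch
      • Reduce analysis errors for further gains in accuracy
      • Consider a new-word dictionary such as neologd
      • Related-keyword extraction can be expected to have many applications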
  73. References
      Variations of TF-IDF, BM25, feature selection, information gain, pseudo-relevance feedback, learning to rank, etc.:
        R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
      Recall, precision, P@n, AP, MAP, Top-K with a binary heap, etc.:
        S. Büttcher, C. Clarke, and G. V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press, 2010.
      Poisson distribution model, IDF, RIDF, etc.:
        C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
      Dynamic stop word list using IDF and RIDF:
        G. Amati, C. Carpineto, and G. Romano (Eds.). Advances in Information Retrieval, 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007, Proceedings. Lecture Notes in Computer Science, Volume 4425, Springer, 2007.
      Index-term weighting with IDF and RIDF:
        北, 津田, 獅々堀. 情報検索アルゴリズム (Information Retrieval Algorithms), 共立出版 (Kyoritsu Shuppan), 2002.