Improving the Accuracy of Hatena Bookmark Full-Text Search (はてなブックマーク全文検索の精度改善)

Slides presented at Hatena Engineer Seminar #5

Takuya Asano

June 16, 2015

Transcript

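  1. Hatena Bookmark

    Improving full-text search accuracy
    Hatena Engineer Seminar #5
    Takuya Asano (浅野 卓也)
    id:takuya-a

    @takuya_a

  2. id:takuya-a
    Platform & Ad Tech team
    Joined in April 2015
    Interests
    • Information retrieval
    • Natural language processing
    • Machine learning
    OSS activities
    Develops JavaScript libraries such as kuromoji.js
    (My icon has changed)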

  3. Enter a search keyword

  4. This talk covers body-text search only

  5. Results can be sorted by newest or by popularity

  6. The problem with body-text search:

    accuracy is poor

  7. Before the improvement

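  10. Problem: accuracy of body-text search
    Even at a glance, the results contain a lot of noise
    For example:
    • The entry's main topic has nothing to do with Kyoto
    • Body-text extraction errors
      (e.g. "京都" (Kyoto) appears only among the titles of related articles)

  11. Goal
    When searching for "京都" (Kyoto),

    we want "Kyoto-like" entries to come up

  12. But what does "Kyoto-like" mean in the first place?

  13. What is "Kyoto-like"?
    • Imagine the search needs and use cases
    • What does someone who enters the query "京都" want to do?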

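  14. What is "Kyoto-like"?
    Search needs (imagined)
    • Want to know Kyoto's famous sights, highlights, and place names
    • Want to know about lodging unique to Kyoto
    • Want to know about food in Kyoto
    • Want to know about news from Kyoto

  15. However,

    all of this from the single keyword "京都"

  16. Problem 1: usability of query writing
    For end users, writing a query is hard
    • They cannot think of suitable keywords
    • They do not know much about the subject in the first place
    • Thinking up and typing a query is tedious (especially on smartphones)
    Users want the system to do the right thing given only the keyword "京都"

    (in practice, single-keyword searches are by far the most common)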

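  17. Problem 2: search accuracy when sorting by newest
    In full-text search, accuracy is hard to achieve in any order other than score order
    Ordinary full-text search ranks by a score that takes into account things such as term occurrence counts (= frequency)

  18. What is accuracy?
    Two evaluation metrics that are fundamental to many tasks
    • Precision (適合率) = how little noise the results contain
    • Recall (再現率) = how few relevant results are missed
    The two trade off against each other
    • Raising precision tends to lower recall
    • Raising recall tends to lower precision

  19. What is accuracy?
    Extreme examples
    • Return only the one entry you are most confident about
      → precision is 100%, but recall is low
    • Return every entry
      → recall is 100%, but precision is low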
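
The two extreme cases on slide 19 can be checked with a few lines of Python. This is only an illustrative sketch: the entry IDs and the set of relevant entries are made up, and `precision_recall` is a hypothetical helper, not something from the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical corpus of 10 entries, 4 of which are relevant to the query.
all_entries = ["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"]
relevant = ["e1", "e2", "e3", "e4"]

# Extreme 1: return a single entry we are confident about.
one_hit = precision_recall(["e1"], relevant)

# Extreme 2: return every entry in the corpus.
everything = precision_recall(all_entries, relevant)

print(one_hit)      # high precision, low recall
print(everything)   # full recall, low precision
```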

  20. Problem 3: balancing precision and recall
    Low noise (precision) is the priority
    but few misses (recall) also matter

  21. Summary of the problems
    1. Usability of query writing
    2. Search accuracy when sorting by newest
    3. Balancing precision and recall

  22. Done!

  25. http://b.hatena.ne.jp/search/text?q=京都
    Sightseeing, food, and news about Kyoto

    keep coming up!

  26. http://b.hatena.ne.jp/search/text?q=コーヒー
    Information and news about coffee

    keep coming up!

  27. http://b.hatena.ne.jp/search/text?q=機械学習
    Information and news about machine learning

    keep coming up!

  28. How was this achieved?

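  29. Raise both precision and recall

    (and do so with a single keyword, sorted by newest)
    A straightforward approach cannot do this
    Even with a tuned scoring function, ordinary keyword search has its limits
    From the single keyword "京都", we need to express the concept of "Kyoto-likeness"

  30. Idea: use Hatena Bookmark tags
    Use the metadata accumulated in Hatena Bookmark over 10 years
    • Hatena Bookmark tags = human-made "ground-truth data"
    • Entries tagged 京都 should be "Kyoto-like" entries!
      → extract the concept of "Kyoto-likeness" from those entries
      → search using the extracted concept
    (Even without tag data, the idea in this talk should be applicable in many ways)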

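  31. Concept search
    Search engines that use concept search
    • Autonomy (Hewlett-Packard)
    • GETA (NII)
    • ConceptBase (JustSystems)
    • Others, such as patent search engines
    "Concept search (also known in Japanese as コンセプトサーチ, コンセプト検索, 自然文検索, 自然言語文検索, 類似文書検索, or 連想検索) is an automated information retrieval method used to search stored unstructured data (electronic archives, e-mail, scientific literature, and so on) for information that is conceptually similar to the search query."
    Quoted from http://ja.wikipedia.org/wiki/概念検索

  32. What is a concept?
    In concept search, a concept is a set of keywords
    • A concept is regarded as being made up of a collection of related keywords
    • It is approximated by a set of multiple related keywords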

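  33. How concept search works
    The three key algorithms are:
    1. Related-keyword extraction
       Extract keywords related to (co-occurring with) the original query
    2. Query expansion
       Add the related keywords to the query
    3. Concept search
       Run a full-text search with the expanded query
    Once the related keywords are extracted, all that remains is to add them to the query

  34. Related-keyword extraction by feature-term selection

  35. Results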

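  36. Related keywords for 京都 (Kyoto)
    1 祇園 (Gion)
    2 寺 (temple)
    3 075
    4 神社 (shrine)
    5 桜 (cherry blossoms)
    6 京 (Kyō)
    7 近鉄 (Kintetsu)
    8 小路 (kōji, alley)
    9 紅葉 (autumn leaves)
    10 美術館 (art museum)
    11 上京 (Kamigyō)
    12 伝統 (tradition)
    13 烏丸 (Karasuma)
    14 通 (dōri, street)
    15 町家 (machiya townhouse)
    16 エリア (area)
    17 旅館 (ryokan inn)
    18 まち (town)
    19 博物館 (museum)
    20 展 (exhibition)
    21 観光 (sightseeing)
    22 河原町 (Kawaramachi)
    23 kyoto
    24 カフェ (café)
    25 中京 (Nakagyō)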

  37. Synonyms of 京都 (Kyoto)
    (same 25 keywords as slide 36)

  38. Place names in Kyoto (hyponyms)
    (same 25 keywords as slide 36)

  39. Words associated with 京都
    (same 25 keywords as slide 36)
    075: Kyoto's area code

  40. Kyoto's history and highlights
    (same 25 keywords as slide 36)

  41. Sightseeing, lodging, and transportation in Kyoto
    (same 25 keywords as slide 36)

  42. Food in Kyoto
    (same 25 keywords as slide 36)

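  43. Effects of query expansion with related keywords

  44. Related keywords for パナソニック (Panasonic)
    1 lumix
    2 dmc
    3 viera
    4 津賀 (Tsuga)
    5 diga
    6 プラズマ (plasma)
    7 pdp
    8 パナソニック (Panasonic)
    9 三洋 (Sanyo)
    10 btob
    11 有機 (organic)
    12 三洋電機 (Sanyo Electric)
    13 松下 (Matsushita)
    14 大坪 (Ōtsubo)
    15 社名 (company name)
    16 el
    17 電工 (Denkō, electric works)
    18 電池 (battery)
    19 ces
    20 電機 (electric)
    21 avc
    22 sanyo
    23 モバイルコミュニケーションズ (Mobile Communications)
    24 パネル (panel)
    25 液晶 (LCD)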

  45. Effect of query expansion:

    synonyms are acquired automatically
    (same 25 keywords as slide 44)

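  46. Related keywords for スプラトゥーン (Splatoon)
    1 ブキ (weapon)
    2 スプラトゥーン (Splatoon)
    3 splatoon
    4 シュータ (shooter)
    5 amiibo
    6 wiiu
    7 ナワバリバトル (Turf War)
    8 インク (ink)
    9 わかば (Wakaba)
    10 tps
    11 試射 (test firing)
    12 塗る (to paint)
    13 シオカラーズ (Squid Sisters)
    14 チャージャ (charger)
    15 fps
    16 ローラ (roller)
    17 pic
    18 立ち回り (positioning)
    19 イカ (squid)
    20 ランク (rank)
    21 マップ (map)
    22 対戦 (versus match)
    23 エイム (aim)
    24 マリオ (Mario)
    25 敵 (enemy)
    シューター -> シュータ and ローラー -> ローラ are the result of stemming by the Elasticsearch analyzer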

  47. Effect of query expansion:

    synonyms are acquired even for newly coined words
    (same 25 keywords as slide 46)
    Conversely, extracting the related keywords for the alphabetic "splatoon" yields the katakana "スプラトゥーン"!

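  48. Related keywords for ユニクロ (UNIQLO)
    1 しまむら (Shimamura)
    2 柳井 (Yanai)
    3 ユニ (uni)
    4 ヒートテック (Heattech)
    5 クロ (qlo)
    6 uniqlock
    7 uniqlo
    8 無印 (Muji)
    9 良品 (ryōhin, as in 無印良品)
    10 衣料 (clothing)
    11 旗艦 (flagship)
    12 チノ (chino)
    13 ダサい (uncool)
    14 ファーストリテイリング (Fast Retailing)
    15 残業 (overtime)
    16 セータ (sweater)
    17 ジーンズ (jeans)
    18 tシャツ (T-shirt)
    19 店長 (store manager)
    20 アウタ (outerwear)
    21 離職 (job turnover)
    22 服 (clothes)
    23 品質 (quality)
    24 百貨店 (department store)
    25 フリース (fleece)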

  49. Effect of query expansion:

    compensating for morphological-analysis errors
    (same 25 keywords as slide 48)
    Morphological-analysis errors can be compensated for
    (ユニクロ is not in the dictionary)
    Morphological analysis gives inconsistent results:
    even if the word was split as "ユニ/クロ" when indexed
    and the query is tokenized as the single token "ユニクロ",
    the search still works
    Patterns the morphological analyzer handles poorly
    are learned and compensated for automatically

  50. Limits of query expansion with related keywords

  51. Not every search need can be covered
    (same 25 keywords for 京都 as slide 36)
    Useless for someone who wants to know "the weather in Kyoto"
    Because of time and space cost constraints,
    the query cannot be expanded indefinitely

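  52. Summary so far
    • Related-keyword extraction appears to work well
    • Query expansion can assist users in writing queries
    • Query expansion also has limits
    • Still, the benefits are large: synonym expansion, compensation for morphological-analysis errors, and so on
    How were the related keywords extracted?

  53. Related-keyword extraction by feature-term selection

    (algorithm part)

  54. What is a related keyword?
    Related keywords have the following properties (or so we assumed)
    • They represent the content of documents written about the concept
      (= feature terms)
    • They appear in multiple documents written about the concept
    First, consider extracting feature terms from a single document (entry)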

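  55. Approach to feature-term extraction
    No machine learning (e.g. learning to rank)
    • Explaining it to stakeholders and interpreting the results becomes harder
    • It is hard to accommodate hand-written rules and heuristics
    • Little data can be prepared, so overfitting is likely
    • Rely on classical information retrieval (IR) methods
    • Use the term statistics available from Elasticsearch's Term Vectors API

  56. Elasticsearch Term Vectors API
    Term statistics available from Term Vectors
    Term = a unit produced by the Elasticsearch analyzer
    (note that it does not always coincide with a word)
    term_freq   Number of occurrences (frequency) of the term in the entry
    doc_freq    Number of entries in the whole index in which the term appears
    ttf         Total number of occurrences (frequency) of the term across the whole index
    doc_count   Total number of entries in the index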
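
As a rough illustration of how the four statistics on slide 56 might be pulled out of a Term Vectors response, here is a small Python sketch. It assumes the response layout I understand the API to return when term statistics are requested (`term_vectors` → field → `field_statistics` / `terms`); the sample response, the field name `body`, and the helper `term_stats` are all made up for this example and no request is actually sent.

```python
def term_stats(response, field):
    """Flatten a Term Vectors-style response into {term: statistics}."""
    vector = response["term_vectors"][field]
    doc_count = vector["field_statistics"]["doc_count"]
    return {
        term: {
            "term_freq": info["term_freq"],  # occurrences in this entry
            "doc_freq": info["doc_freq"],    # entries containing the term
            "ttf": info["ttf"],              # occurrences across the index
            "doc_count": doc_count,          # entries in the index
        }
        for term, info in vector["terms"].items()
    }


# Hand-made sample shaped like a Term Vectors response (not real data).
sample_response = {
    "term_vectors": {
        "body": {
            "field_statistics": {"doc_count": 10000, "sum_doc_freq": 0, "sum_ttf": 0},
            "terms": {
                "kyoto": {"term_freq": 3, "doc_freq": 120, "ttf": 450},
                "to": {"term_freq": 7, "doc_freq": 9000, "ttf": 80000},
            },
        }
    }
}

stats = term_stats(sample_response, "body")
print(stats["kyoto"])
```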

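  57. Related-keyword extraction flow with Elasticsearch
    1. (Tag search) Use the input query to narrow down entries by searching the tag field with a Filtered Query
    2. (Fetch statistics) Fetch term statistics with the Term Vectors API (strictly speaking, the Multi Term Vectors API)
    3. (Feature-term extraction) For each document, compute term importance on the server and extract the Top-25 feature terms
    4. (Related-keyword extraction) Drop terms that appear in only a few documents and extract the Top-25 terms with the highest scores
    (Once the related keywords are extracted, all that remains is to add them to the query)
    How should term importance be computed?
    Tips
    Use a binary heap to speed up the Top-K computation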
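
Steps 3 and 4 and the binary-heap tip could look roughly like the Python sketch below, which uses the standard library's `heapq`. The talk does not say how per-document scores are combined across documents, so summing them is an assumption here, as are the `min_docs` cutoff and the function names.

```python
import heapq
from collections import Counter


def top_k(scores, k):
    """Return the k highest-scoring (term, score) pairs via a binary heap."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


def related_keywords(per_doc_scores, k=25, min_docs=2):
    """Aggregate per-document feature terms into related keywords.

    per_doc_scores: one {term: importance} dict per tagged document.
    Terms found among the top-k of fewer than min_docs documents are dropped
    (assumed cutoff); the remaining terms are ranked by summed score
    (assumed aggregation).
    """
    doc_hits = Counter()
    totals = Counter()
    for scores in per_doc_scores:
        for term, score in top_k(scores, k):
            doc_hits[term] += 1
            totals[term] += score
    kept = {t: s for t, s in totals.items() if doc_hits[t] >= min_docs}
    return [term for term, _ in top_k(kept, k)]


# Made-up importance scores for two documents tagged with the query.
docs = [
    {"gion": 3.0, "temple": 2.0, "weather": 0.5},
    {"gion": 2.5, "temple": 1.0, "sakura": 4.0},
]
keywords = related_keywords(docs, k=3, min_docs=2)
print(keywords)
```

`heapq.nlargest` keeps only k items on the heap at a time, which is what makes Top-K cheaper than sorting every term.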

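  58. Algorithm for computing feature terms
    In information retrieval, the following two are the de facto standard measures of term importance (term weighting)
    • TF-IDF (the product of TF, term frequency, and IDF, inverse document frequency)
    • BM25 (based on a probabilistic model; also takes document length into account)
    TF-IDF was adopted
    BM25 is computationally expensive and requires tuning two parameters

  59. TF-IDF
    TF-IDF formula (there are several variants)
    TF: term frequency f_{i,j} → higher for terms that occur many times in the document
    IDF: inverse of the document frequency n_i (N is the number of documents) → higher for rarer terms
    f_{i,j}: number of occurrences (frequency) of term i in entry j = term_freq
    N: total number of entries in the index = doc_count
    n_i: number of entries in the whole index in which term i appears = doc_freq
    TF-IDF alone does not work well. Problem terms include:
    • Numbers that carry little meaning, such as "1"
    • English stop words, such as "to"
      (Japanese stop words are removed by an Elasticsearch filter)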
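
The formula on slide 59 was an image and is not part of this transcript. For orientation only, one common TF-IDF variant written with the slide's symbols is shown below; which variant and which logarithm base the talk actually used is not known, so treat this as an assumption.

```latex
\mathrm{tfidf}_{i,j} = f_{i,j} \cdot \mathrm{idf}_i,
\qquad
\mathrm{idf}_i = \log \frac{N}{n_i}
```

Here $f_{i,j}$ is `term_freq`, $N$ is `doc_count`, and $n_i$ is `doc_freq`, as defined on the slide.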

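  60. Dynamic stop words

    (dynamic stop word list)
    Preparing and maintaining stop-word dictionaries in advance only goes so far (Japanese, English, ...)
    Compute stop words dynamically from term statistics

    (this adapts flexibly to whatever entries are in the index)
    IDF    Classical method
    RIDF   Focuses on the difference between function words and content words
    Gain   Focuses on information gain
    Terms for which these measures fall below a threshold are excluded as stop words

  61. IDF
    Inverse Document Frequency
    Terms that appear in almost every document get a low score
    N: total number of entries in the index = doc_count
    n_i: number of entries in the whole index in which term i appears = doc_freq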

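  62. RIDF
    Residual IDF (残差 IDF)
    Church, K. W. and Gale, W. A. (1995a). “Inverse Document Frequency (IDF): A Measure of Deviation from Poisson.” In Proc. of the 3rd Workshop on Very Large Corpora, pp. 121–130.
    N: total number of entries in the index = doc_count
    n_i: number of entries in the whole index in which term i appears = doc_freq
    F_i: total number of occurrences of term i across the whole index = ttf
    Model the number of occurrences of a term in one document with a Poisson distribution
    Function words: spread across many documents (uniformly distributed)
    Content words: concentrated in a few documents (unevenly distributed)
    RIDF = difference between the estimated IDF and the actual IDF
    The parameter (= expected value) λ_i of the Poisson distribution P(k; λ_i) is estimated as
    total occurrences of the term (F_i) / number of documents (N)
    (assuming the term is uniformly distributed)
    P(0; λ_i) is the probability that the term does not occur even once
    so 1 - P(0; λ_i) is the probability that it occurs at least once
    → the denominator N (1 - P(0; λ_i)) is an estimate of the document frequency n_i
    → the right-hand side is an estimate of IDF_i
    Low RIDF: small difference from the estimate
    High RIDF: large difference from the estimate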
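
The description on slide 62 determines RIDF up to the logarithm base: the actual IDF is log(N / n_i), and the Poisson estimate replaces n_i with N (1 - P(0; λ_i)), where P(0; λ_i) = exp(-λ_i) and λ_i = F_i / N. A minimal Python sketch follows, assuming base-2 logarithms (the slide's formula image is not in the transcript) and using hypothetical function names and made-up counts.

```python
import math


def idf(n_i, N):
    """Actual IDF from the observed document frequency."""
    return math.log2(N / n_i)


def expected_idf(F_i, N):
    """IDF predicted by a Poisson model with rate lambda_i = F_i / N."""
    p_zero = math.exp(-F_i / N)          # P(0; lambda_i)
    estimated_n_i = N * (1.0 - p_zero)   # estimated document frequency
    return math.log2(N / estimated_n_i)


def ridf(n_i, F_i, N):
    """Residual IDF: actual IDF minus the Poisson-estimated IDF."""
    return idf(n_i, N) - expected_idf(F_i, N)


N = 1000
# Made-up function-like word: 1000 occurrences spread over 632 entries.
function_word = ridf(n_i=632, F_i=1000, N=N)
# Made-up content-like word: 1000 occurrences packed into 100 entries.
content_word = ridf(n_i=100, F_i=1000, N=N)
print(function_word, content_word)
```

With these made-up counts the evenly spread term gets an RIDF close to zero and the concentrated term gets a clearly positive one, which is the behaviour the slide describes.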

  63. Gain
    The score is low both for extremely high-frequency terms (appearing in almost every entry)
    and for extremely low-frequency terms (appearing in only a handful of entries)
    Result: it had no effect
    High-frequency terms are already filtered out by Elasticsearch
    N: total number of entries in the index = doc_count
    n_i: number of entries in the whole index in which term i appears = doc_freq
    Papineni, K. (2001). “Why Inverse Document Frequency?” In Proc. of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2001), pp. 25–32.
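
The Gain formula itself is missing from the transcript. As far as I recall, the measure from the cited Papineni paper is usually written as below; this is offered as a reference point under that assumption, not as the exact expression shown on the slide.

```latex
\mathrm{Gain}_i = \frac{n_i}{N}\left(\frac{n_i}{N} - 1 - \log\frac{n_i}{N}\right)
```

This form is consistent with the slide's description: it is zero when $n_i = N$ and tends to zero as $n_i / N \to 0$, so both extremely frequent and extremely rare terms score low.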

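  64. Summary: feature-term extraction
    • Explained the feature-term extraction algorithms
    • TF-IDF
    • Dynamic stop words remove the need to maintain a stop-word dictionary
    • IDF
    • RIDF
    • Gain

  65. Evaluating and optimizing feature-term selection
    • Evaluate on the Top-50
    • Treat it as binary classification: relevant (+1) or not relevant (-1)
    • Prepare ground-truth data (about 500 items)
    Example labels for an entry about art:
    いく (to go) -1
    ゴッホ (Van Gogh) +1
    あんまり (not really) -1
    ...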

  66. This time: evaluation and optimization with MAP
    Achieved a MAP above 90%

    (on the training data)
    Evaluation metrics: P@n / AP / MAP
    • P@n: Precision at n
      Precision over the top n ranks
    • AP: Average Precision
      P@n averaged up to rank n
    • MAP: Mean Average Precision
      AP averaged over all entries
    Optimization by MAP
    Max MAP : 0.9173

    -----------------------

    IDF threshold : 6.0

    RIDF threshold : 0.55

    Gain threshold : 0.0
    (Be good and use cross-validation)
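
The three metrics can be sketched in Python over the +1 / -1 labels from slide 65. This sketch uses the standard definition of AP (the mean of P@k over the ranks k that hold a relevant item); the slide's wording is briefer, so the exact averaging used in the talk is an assumption, and the example rankings are made up.

```python
def precision_at(labels, n):
    """P@n: fraction of the top-n ranked terms labelled relevant (+1)."""
    return sum(1 for label in labels[:n] if label == 1) / n


def average_precision(labels):
    """AP: mean of P@k taken at every rank k that holds a relevant term."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: AP averaged over all entries (one ranked label list per entry)."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Made-up ranked labels for the extracted terms of two entries.
rankings = [[1, -1, 1], [-1, 1]]
map_score = mean_average_precision(rankings)
print(precision_at(rankings[0], 2), average_precision(rankings[0]), map_score)
```

Optimizing the thresholds then amounts to recomputing the MAP for each candidate combination of IDF, RIDF, and Gain thresholds and keeping the best one.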

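  67. Summary: evaluation and optimization
    • Introduced three evaluation metrics (P@n / AP / MAP)
    • Optimized the parameters to maximize the evaluation metric (MAP)
    • Achieved high accuracy by combining three scoring functions (TF-IDF, IDF, RIDF)

  68. Review: outline of concept search
    1. Related-keyword extraction
       Extract keywords related to (co-occurring with) the original query
    2. Query expansion
       Add the related keywords to the query
    3. Concept search
       Run a full-text search with the expanded query
    Once the related keywords are extracted, all that remains is to add them to the query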

  69. Query expansion & concept search with Elasticsearch
    • The original keyword "must be contained"
      (must clause of a Bool Query)
    • Related keywords only add to the score
      (should clause of a Bool Query)
    • Threshold on the score
      (specify min_score in the query)
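
A request body following slide 69 might be assembled as in the Python sketch below. The field name `body`, the `min_score` value, and the helper `expanded_query` are placeholders chosen for illustration; the talk does not show the actual query, and the structure follows the Elasticsearch Query DSL as I understand it.

```python
import json


def expanded_query(keyword, related, field="body", min_score=1.0):
    """Build a search body: original keyword required, related terms optional."""
    return {
        # Entries scoring below this threshold are dropped (placeholder value).
        "min_score": min_score,
        "query": {
            "bool": {
                # The original keyword must match.
                "must": [{"match": {field: keyword}}],
                # Related keywords only raise the score when they match.
                "should": [{"match": {field: term}} for term in related],
            }
        },
    }


query = expanded_query("京都", ["祇園", "寺", "神社"])
print(json.dumps(query, ensure_ascii=False, indent=2))
```

Because the related keywords sit in `should`, an entry that mentions only the original keyword still matches but scores lower, and `min_score` is what removes such weak matches.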

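  70. Outline of concept search
    1. Related-keyword extraction
       Extract keywords related to (co-occurring with) the original query
    2. Query expansion
       Add the related keywords to the query
    3. Concept search
       Run a full-text search with the expanded query

  71. Overall summary
    • Raised precision through query expansion based on related-keyword extraction
    • Even a single input keyword is supplemented with Hatena Bookmark tag information
    • High precision even when sorted by newest

  72. The future of Hatena Bookmark full-text search
    • Evaluate search accuracy and optimize parameters
    • Improve recall while keeping precision at a high level
    • Add fields and dictionaries for analysis to Elasticsearch
    • Reduce analysis errors for further accuracy gains
    • Consider a new-word dictionary such as neologd
    • Related-keyword extraction can be expected to have many applications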

  73. References
    Variants of TF-IDF, BM25, feature selection, information gain, pseudo-relevance feedback, learning to rank, etc.
    R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, New York, 1999.
    Recall, precision, P@n, AP, MAP, Top-K with a binary heap, etc.
    Büttcher, S., Clarke, C., Cormack, G. V. Information Retrieval: Implementing and Evaluating Search Engines. The MIT Press, 2010.
    Poisson distribution model, IDF, RIDF, etc.
    Manning, C. D., & Schütze, H. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
    Dynamic stop word list based on IDF and RIDF
    Amati, G., Carpineto, C., Romano, G. (Eds.). Advances in Information Retrieval, 29th European Conference on IR Research, ECIR 2007, Rome, Italy, April 2-5, 2007, Proceedings. Lecture Notes in Computer Science, Springer, Volume 4425, 2007.
    Index-term weighting with IDF and RIDF
    北, 津田, 獅々堀 (Kita, Tsuda, Shishibori). 情報検索アルゴリズム (Information Retrieval Algorithms), 共立出版 (Kyoritsu Shuppan), 2002.