極大部分文字列による関連フレーズ抽出とその応用 / Related Keyphrase Extraction by Maximal Substrings

Takuya Asano

January 26, 2017

Transcript

  1. Related Keyphrase Extraction by Maximal Substrings and Its Applications
    id:takuya-a
    @takuya_a
    Tech study session
    2017-01-26

  2. id:takuya-a
    Application engineer
    Joined the company in April 2015
    Interests
    • Information retrieval
    • Natural language processing
    • Machine learning
    OSS activities
    kuromoji.js and others

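  3. Automatic keyphrase extraction
    • An NLP task that automatically selects, from a document's body text, phrases that are important and match its topic (Turney 2000)
    何なんだよ日本。
    一億総活躍社会じゃねーのかよ。
    昨日見事に保育園落ちたわ。
    (Roughly: "What the hell, Japan. Wasn't this supposed to be a society where all hundred million of us play an active part? My kid got rejected from daycare yesterday.")
    http://anond.hatelabo.jp/20160215171759
    • Named entity recognition (NER)
      • Also extracts entities unrelated to the topic
    • Entity linking
      • Links keyphrases to entities (such as Wikipedia titles)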

  4. Automatic keyphrase extraction
    • Individual words (morphemes) are too fine-grained a unit
    • Morphological analysis alone sometimes fails to capture meaningful units
      • Person name: 上_白石_萌_音
      • Place name: 烏丸_御池
      • Title of a work: あの_日_見_た_花_の_名前_を_僕達_は_まだ_知ら_ない_。

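  5. Automatic keyphrase extraction
    • A highly practical NLP task
    • Any service that handles text has many uses for it
      • Bookmark, Blog, Kakuyomu, BS Hatena...
    • Handling words in units closer to their meaning enables "deeper" analysis
      • (Fine-grained units are sometimes sufficient, so it depends on the use case)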

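  6. Automatic keyphrase extraction
    • Hard to call it a solved problem
    • Even for English, the best-performing methods score low (Hasan & Ng 2014)
      • Precision 27.2 ~ 35.0%
      • Recall 27.8 ~ 66.0%
      • F-score 27.5 ~ 45.7%
      • (Varies considerably by corpus)
    • Likely to be even harder for Japanese
    Goal: accuracy good enough for production use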

  7. Well known APIs
    • Yahoo! JAPAN Text Analysis Web API
      • Keyphrase Extraction API
    • Microsoft Cognitive Services
      • Text Analytics API
    • There may be others

  8. Yahoo! Keyphrase Extraction API
    • http://developer.yahoo.co.jp/webapi/jlp/keyphrase/v1/extract.html

  9. Yahoo! Keyphrase Extraction API
    • Tried it on the "保育園落ちた" (rejected from daycare) post
    "保育園": 100,
    "児童手当": 79,
    "賄賂": 76,
    "ゴマン": 63,
    "半分位クビ": 58,
    "子供産むやつなんかい": 55,
    "エンブレム": 55,
    "ウチワ": 54,
    "国会議員": 48,
    "少子化": 47,
    "オリンピック": 45,
    "デザイナー": 44,
    "日本": 42,
    "ムシ": 41,
    "財源": 40,
    "児童手当20万": 39,
    "無償": 36,
    "一億総活躍社会": 35,
    "費用全て": 34,
    "税金": 33

  10. Yahoo! Keyphrase Extraction API
    『あの日見た花の名前を僕達はまだ知らない。』は、A-1 Pictures制作の日本のテレビアニメ作品。2011年4月から6月までフジテレビ・ノイタミナ枠などで放送された。全11話。略称は「あの花」、「あのはな」。2012年に漫画化、2013年8月31日には劇場版が公開された。
    https://ja.wikipedia.org/wiki/あの日見た花の名前を僕達はまだ知らない。
    The title is not extracted

  11. Yahoo! Keyphrase Extraction API

  12. MS Cognitive Services
    • https://text-analytics-demo.azurewebsites.net/

  13. http://anond.hatelabo.jp/20160215171759

  14. MS Cognitive Services
    https://ja.wikipedia.org/wiki/あの日見た花の名前を僕達はまだ知らない。
    • 「あの日見た…」 is not extracted here either

  15. MS Cognitive Services

  16. Implementing keyphrase extraction in-house

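  17. Implementing keyphrase extraction
    • We want to implement it ourselves after all
      • We want to tune it per use case
    • Hatena Bookmark's large-scale text data is available
      • Every document is already in Elasticsearch!!
    • A job for string algorithms
      • Achieve practical accuracy in practical computation time
      • Keep costs down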

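  18. Cases where computational cost becomes a problem
    1. Handling very large texts
    2. Extracting keyphrases from an entire document set
      • A document set narrowed down by some criterion
      • Example: compute keyphrases from the entries of the last 3 days
      • Example: compute keyphrases from the search results for a keyword (e.g. あの花)
    It can be applied to the results of any Elasticsearch query, so it has many applications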

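  19. Keyphrase extraction pipeline
    1. Candidate keyphrase extraction
      • Taking every substring as a candidate gives on the order of n^2 candidates
        • n choices of start position x n choices of length
      • Need a way to cover the correct keyphrases while keeping computation down
      • Want the later pipeline stages to run fast
    2. Keyphrase scoring
      • Assign a score to each candidate keyphrase
      • Select the top-scoring keyphrases by a score threshold, a count limit, etc.
    (For details, see Hasan & Ng 2014, 3. Keyphrase Extraction Approaches)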

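  20. Phase 1: Candidate keyphrase extraction
    • Naive method: considering every substring is O(n^2)
      • In practice you end up cutting off at something like 5 tokens
      • Long phrases cannot be captured
    • Existing methods: heuristic, rule-based
      1. Reject stop words using a dictionary (Liu+ 2009)
      2. Select spans that match part-of-speech sequence patterns (Mihalcea & Tarau 2004, Wan & Xiao 2008, Liu+ 2009)
      3. Select spans that match lexico-syntactic patterns (Nguyen and Phan 2009)
      4. Select spans that match substrings (n-grams) of Wikipedia titles (Grineva+ 2009)
    (For details, see Hasan & Ng 2014, 3.1 Selecting Candidate Words and Phrases)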
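To make the cost of the naive method concrete, here is a minimal sketch of enumerating every contiguous token span, with an optional length cutoff. The function name `ngram_candidates` and the `max_len` parameter are illustrative assumptions, not part of the talk; the token sequence is the segmented title from slide 4.

```python
def ngram_candidates(tokens, max_len=None):
    """Enumerate every contiguous token span, optionally capped at max_len tokens."""
    n = len(tokens)
    candidates = []
    for start in range(n):
        # Longest span allowed from this start position.
        longest = n - start if max_len is None else min(max_len, n - start)
        for length in range(1, longest + 1):
            candidates.append(tuple(tokens[start:start + length]))
    return candidates


tokens = ["あの", "日", "見", "た", "花", "の", "名前", "を",
          "僕達", "は", "まだ", "知ら", "ない", "。"]
all_spans = ngram_candidates(tokens)          # n(n+1)/2 spans: quadratic in n
capped = ngram_candidates(tokens, max_len=5)  # cheaper, but drops long phrases
full_title = tuple(tokens)
print(len(all_spans), len(capped), full_title in capped)
```

With the cutoff, the 14-token title can no longer appear as a candidate, which is the "long phrases cannot be captured" problem noted above.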

  21. Challenges in enumerating candidate keyphrases
    • Accuracy does not improve unless parts of speech are examined in fine detail

  22. Guess-the-part-of-speech game
    • Noun? Verb? Particle? Auxiliary verb? Adverb? Adnominal?
      • 「そう」: ???
      • 「いわく」: ???
      • 「ご覧」: ???
      • 「きらびやか」: ???

  23. Guess-the-part-of-speech game
    • Noun? Verb? Particle? Auxiliary verb? Adverb? Adnominal?
      • 「そう」: noun
      • 「いわく」: noun
      • 「ご覧」: noun
      • 「きらびやか」: noun
    All nouns!
    それはたぶんあなたの欲しかった名詞ではない - 押してダメならふて寝しろ http://ikawaha.hateblo.jp/entry/2016/05/20/155504

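  24. Challenges in enumerating candidate keyphrases
    • Accuracy does not improve unless parts of speech are examined in fine detail
      • 「そう」「いわく」「ご覧」「きらびやか」
      • According to IPADic, these are all nouns
    • Including part-of-speech subcategories in the patterns leaves too little data per pattern
      • The curse of dimensionality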

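  25. Challenges in enumerating candidate keyphrases (2)
    • Keeping up with the new words and new patterns that keep appearing is hard
      • 君_の_名_は_。
      • Keep updating the dictionary with weekly or daily batches??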

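  26. Challenges in enumerating candidate keyphrases: summary
    • Maintaining dictionaries and rules is hard
    • Parts of speech must be examined in fine detail, but
      • the finer the distinctions, the less data there is
    • We want the method to adapt to the domain from the text collection itself
    • In this phase, coverage (recall) matters most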

  27. Candidate keyphrase extraction by maximal substrings

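  28. Candidate keyphrase extraction by maximal substrings
    • We want to enumerate frequent phrases without missing any
      • In Phase 1, coverage (recall) matters most
      • Filtering happens later, in Phase 2
    • Consider a "representative" for each group of substrings, which reduces the candidates while preserving coverage
      • Look at the containment relation between substrings in terms of their occurrence positions
      • Two substrings count as having the same positions if they coincide after shifting by the difference in their lengths (explained later)
    • Group all substrings by this containment relation
      • The longest substring in each group is a maximal substring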

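  29. Maximal substring example (1)
    Text: Yabre-Kabre
    • The only maximal substring is abre
    • Every substring that appears two or more times is contained in abre
      • { a, b, r, e, ab, br, re, abr, bre, abre }
      • These substrings form one group
      • The longest one, abre, is the maximal substring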

  32. Maximal substring example (2)
    Text: abracadabra
    • The maximal substrings are abra and a
    • a also occurs outside of abra, so it forms a separate group

  34. Maximal substring example (2)
    Text: abracadabra
    • The maximal substrings are abra and a
    • a also occurs outside of abra, so it forms a separate group
    • ab is in the same group as abra

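  35. Maximal substring example (3)
    Text: shimobayashi
    • The maximal substrings are shi and a
    • Shifting i by two characters makes its occurrence positions coincide with those of shi, so i is in the same group as shi
    Text: akawakami
    • The maximal substrings are aka and a
    • The a that appears second inside aka occurs at different positions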
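The grouping rule can be checked by brute force. The sketch below assumes the following reading of the definition: a substring that occurs at least twice is maximal if extending it by one character on either side always lowers its occurrence count. It is quadratic and meant only for small inputs; the function name `maximal_substrings_naive` is made up for this illustration.

```python
from collections import Counter


def maximal_substrings_naive(text, min_count=2):
    """Brute-force maximal substrings: {substring: occurrence count}."""
    # Count every occurrence of every substring (overlaps included).
    counts = Counter(
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    )
    alphabet = set(text)
    result = {}
    for sub, count in counts.items():
        if count < min_count:
            continue
        # All one-character extensions to the left and to the right.
        extended = [ch + sub for ch in alphabet] + [sub + ch for ch in alphabet]
        # Maximal: no extension occurs as often as the substring itself.
        if all(counts.get(ext, 0) < count for ext in extended):
            result[sub] = count
    return result


for example in ["Yabre-Kabre", "abracadabra", "shimobayashi", "akawakami"]:
    print(example, maximal_substrings_naive(example))
```

On the four example texts this reproduces the groups described on the slides: abre; abra and a; shi and a; aka and a.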

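  36. Enumerating candidate keyphrases by maximal substrings
    • With maximal substrings, frequent strings of arbitrary length can be enumerated
      • Maximal substrings are also called ∞-grams
    • During enumeration, the occurrence count of each maximal substring is known
      • Candidates can be filtered by occurrence count
    • Internal nodes of the suffix tree correspond to maximal substrings
      • However, not every node is a maximal substring (explained later)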

  37. Suffix tree
    • A Patricia trie of all suffixes of the text
    • Example: the suffix tree of abracadabra$
    [岡野原 & 辻井 08] 岡野原 大輔, 辻井 潤一. "全ての部分文字列を考慮した文書分類", NL187 自然言語処理研究会 2008

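  38. Suffix tree
    • Looking at the BWT tells you whether a node is a maximal substring [1]
      • BWT here means the character immediately preceding each suffix
      • If the BWT characters of the suffixes under a suffix tree node comprise two or more distinct characters, that node is a maximal substring
    [1] 極大部分文字列 - アスペ日記 http://d.hatena.ne.jp/takeda25/20101202/1291269994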

  39. Enhanced suffix array (ESA)
    • A data structure that supports suffix tree operations with the same computational complexity
      • For example, enumerating the nodes of the suffix tree
    • Enhanced suffix array (ESA) = suffix array (SA) + longest common prefix array (LCP)
      • 9n bytes for a text of length n [1]
      • More compact than a suffix tree (20n bytes~) [2]
    [1] D. Okanohara and J. Tsujii. 2009. Text Categorization with All Substring Features. In the SIAM International Conference on Data Mining (SDM).
    [2] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algs, 2:53–86.
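As a small illustration of the two arrays that make up an ESA, the sketch below builds a suffix array by plain sorting (an assumption made for brevity; this is not the linear-time SA-IS construction) and then computes the LCP array with Kasai's algorithm. The function names are illustrative.

```python
def build_suffix_array(text):
    """Suffix array by direct sorting. Simple, but not linear time."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    """Kasai's algorithm: lcp[r] is the common prefix length of the
    suffixes at ranks r-1 and r (lcp[0] is 0)."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix just before suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        h = max(h - 1, 0)  # the next suffix shares at least h-1 characters
    return lcp


text = "abracadabra"
sa = build_suffix_array(text)
lcp = build_lcp_array(text, sa)
for r in range(len(text)):
    print(r, sa[r], lcp[r], text[sa[r]:])
```

Runs of adjacent suffixes whose LCP value stays at or above some depth correspond to the internal nodes of the suffix tree, which is what the enumeration step relies on.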

  40. Enumerating maximal substrings
    1. Build the ESA (SA + LCP) with the SA-IS algorithm and Kasai's algorithm
    2. Check at which suffixes the BWT changes [2]
    3. Enumerate the internal nodes using the LCP
      • While doing so, check the BWT and enumerate only the maximal substrings
    • Enumerating the internal nodes of the suffix tree runs in time linear in the text length T [1]
    • Checking for BWT changes can also be done in linear time
    [1] T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications", CPM 2001
    [2] 極大部分文字列 - アスペ日記 http://d.hatena.ne.jp/takeda25/20101202/1291269994
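The three steps can be sketched in one self-contained function. This is an illustration under stated assumptions, not the esaxx implementation: the suffix array is built by plain sorting instead of SA-IS, the start of the text is treated as a BWT character distinct from all others, and the name `maximal_substrings_esa` is made up here.

```python
from itertools import accumulate


def maximal_substrings_esa(text, min_count=2):
    """Maximal substrings via SA + LCP + BWT check: {substring: count}."""
    n = len(text)
    if n == 0:
        return {}
    # Step 1: suffix array (by sorting, for brevity) and LCP array (Kasai).
    sa = sorted(range(n), key=lambda i: text[i:])
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        h = max(h - 1, 0)
    # Step 2: BWT = character preceding each suffix (None at the text start),
    # plus a prefix sum of the positions where the BWT changes.
    bwt = [text[i - 1] if i > 0 else None for i in sa]
    changes = [0] + [int(bwt[r] != bwt[r - 1]) for r in range(1, n)]
    prefix = list(accumulate(changes))
    # Step 3: enumerate LCP intervals (internal nodes) with a stack of
    # (depth, left boundary) and keep those whose BWT is not constant.
    result = {}
    stack = [(0, 0)]
    for r in range(1, n + 1):
        cur = lcp[r] if r < n else 0
        left = r - 1
        while stack[-1][0] > cur:
            depth, left = stack.pop()
            right = r - 1
            count = right - left + 1
            if count >= min_count and prefix[right] > prefix[left]:
                result[text[sa[left]:sa[left] + depth]] = count
        if stack[-1][0] < cur:
            stack.append((cur, left))
    return result


print(maximal_substrings_esa("abracadabra"))
```

For abracadabra the internal nodes are a, abra, bra and ra; the BWT check discards bra and ra because both are always preceded by the same character.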

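  41. esaxx
    • A C++ library that enumerates the internal nodes of a suffix tree
      • Builds an enhanced suffix array (ESA)
      • The check for whether a node is a maximal substring has to be implemented yourself, using [1] as a reference
    • https://code.google.com/archive/p/esaxx/
    [1] 極大部分文字列 - アスペ日記 http://d.hatena.ne.jp/takeda25/20101202/1291269994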

  42. Phase 2: Keyphrase scoring

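  43. Phase 2: Keyphrase scoring
    • How should the score (weight) of a phrase be computed?
    • There are many methods for weighting single words
      • TF-IDF
      • JLH score
      • Mutual information
      • Chi-squared value
    • Naive methods
      1. Sum the weights of the words in the phrase
      2. Compute the weight for the phrase itself (in the same way as for a word)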

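  44. Experiment
    1. Search Elasticsearch with a keyword to obtain a document set
      - 「あの花」「君の名は」「任天堂」 and so on
    2. Fetch the body text of each document
      - This time only the first 300 characters
    3. Run morphological analysis with MeCab
      - Use morphemes rather than characters as the basic unit (to reduce noise)
    4. Compute the maximal substrings and enumerate candidate phrases
      - Only those that occur 5 or more times
    5. Score the candidate phrases with the chi-squared value
      - Uses Elasticsearch phrase search
      - Can be computed from the following statistics
        - Total number of documents
        - Number of documents that hit the keyword
        - Number of documents in each of those that contain the candidate phrase
    6. Return the maximal substrings with the top-K scores
      - This time, 500 of them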
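Step 5 can be illustrated with the standard chi-squared statistic for a 2x2 contingency table. The sketch assumes four document counts are available (total, keyword hits, documents containing the phrase overall, and documents containing the phrase among the keyword hits); the function and parameter names are illustrative, and the exact formula used in the talk is not stated, so this is one plausible reading.

```python
def chi_squared_score(total_docs, keyword_docs, phrase_docs, phrase_in_keyword_docs):
    """Chi-squared statistic of the 2x2 table (keyword hit?) x (contains phrase?)."""
    a = phrase_in_keyword_docs                # keyword hit, phrase present
    b = keyword_docs - a                      # keyword hit, phrase absent
    c = phrase_docs - a                       # no keyword hit, phrase present
    d = total_docs - keyword_docs - c         # no keyword hit, phrase absent
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: no evidence either way
    return total_docs * (a * d - b * c) ** 2 / denominator


# Hypothetical counts, for illustration only.
related = chi_squared_score(1000, 100, 50, 40)    # phrase concentrated in the hits
unrelated = chi_squared_score(1000, 100, 50, 5)   # phrase spread evenly
print(round(related, 3), unrelated)
```

A phrase that appears mostly inside the keyword's result set gets a large score, while a phrase distributed independently of the keyword scores zero.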

  45. Results for 「あの花」
    • Top 20
    142684.106 ゆき あつ
    135512.226 長井 龍 雪
    121007.208 アニメ 「 あの 日 見 た 花 の 名前 を
    119079.563 「 めん ま 」
    118675.949 　 あの 花
    118675.949 『 あの 花
    118675.949 「 あの 花 」
    118675.949 『 あの 花 』
    118675.949 「 あの 花
    94760.745 「 あの 日 見 た 花 の 名前 を

  46. Results for 「あの花」
    86305.143 「 あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない
    86305.143 あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない 。
    86305.143 あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない
    86305.143 『 あの 日 見 た 花 の 名前 を 僕達 は まだ 知ら ない
    55090.753 田中 将 賀
    38692.776 岡田 麿 里
    35751.909 長井
    29098.469 「 心 が 叫び た がっ てる ん だ 。
    29098.469 『 心 が 叫び た がっ てる ん だ 。 』
    29098.469 「 心 が 叫び た がっ てる ん だ 。 」

  47. Results for 「君の名は」
    • Top 80
    390586.996 瀧 と 三 葉
    126664.722 」 「 君 の 名 は 。 」
    123792.563 映画 「 君 の 名 は 。 」
    123792.563 映画 『 君 の 名 は 。 』
    109792.731 新海
    106256.894 、 新海
    106256.894 の 新海
    106256.894 。 新海
    104401.937 。 新海 誠 監督
    103768.965 の 新海 誠
    103768.965 た 新海 誠

  48. ʮ܅ͷ໊͸ʯʹର͢Δ݁Ռ
    83881.345 ɻ ʮ ܅ ͷ ໊ ͸ ɻ


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͷ


    83881.345 ɻ ʰ ܅ ͷ ໊ ͸


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͕ ɺ


    83881.345 ʰ ܅ ͷ ໊ ͸ ɻ ʱ


    83881.345 ɺ ʮ ܅ ͷ ໊ ͸ ɻ ʯ


    79635.412 ͷ ʮ ܅ ͷ ໊ ͸ ɻ


    79635.412 ͨ ɻ ʮ ܅ ͷ ໊ ͸


    79635.412 ͨ ʰ ܅ ͷ ໊ ͸


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͸


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ Λ


    83881.345 ɻ ʮ ܅ ͷ ໊ ͸


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ


    83881.345 ɻ ʰ ܅ ͷ ໊ ͸ ɻ ʱ


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͕


    83881.345 ܅ ͷ ໊ ͸


    83881.345 ɻ ʰ ܅ ͷ ໊ ͸ ɻ


    83881.345 ɺ ܅ ͷ ໊ ͸


    83881.345 ɺ ܅ ͷ ໊ ͸ ɻ


    83881.345 ܅ ͷ ໊ ͸ ʁ


    83881.345 ʮ ܅ ͷ ໊ ͸


    83881.345 ʮ ܅ ͷ ໊ ͸ ɻ ʯ ͸ ɺ


    83881.345 ɻ ܅ ͷ ໊ ͸


    83881.345 ʮ ܅ ͷ ໊ ͸ ʯ ͱ

    View Slide

  49. Results for 「君の名は」
    68454.408 、 三葉
    66923.168 糸 守
    52875.263 「 君 の 名 は 」
    48620.102 三葉
    40170.705 映画 『 君 の 名 は 。 』 で ヒロイン の
    38947.540 宮 水
    38243.976 三 葉 は
    37993.993 新海 誠
    29207.758 「 君 の 名 は 。 」 を 見
    29207.758 「 君 の 名 は 。 」 を 見 て
    22217.179 新海 誠 監督 最新 作
    22161.230 新海 誠 の
    21938.554 「 前前 前世
    21155.167 新海 誠 作品
    19854.987 新海 監督
    19222.009 宮 水 三 葉
    17993.221 立花 瀧

  50. Results for 「君の名は」
    17612.465 「 君 の 名 は 。 」 を 観
    13315.859 「 スパークル
    12497.399 新海 誠 監督 最新 作 『 君 の 名 は 。 』
    12043.648 糸 守 町
    11079.889 上 白石
    9984.243 「 秒速 5 センチメートル 」
    6407.954 新海 誠 監督 の アニメ 映画
    5858.545 新海 誠 監督
    5662.690 「 シン ・ ゴジラ 」
    5472.189 シン ・ ゴジラ
    3677.951 瀧 と
    3454.078 入れ替わり

  51. Results for 「君の名は」
    3407.773 「 秒速
    2634.372 。 「 君 の
    2634.372 君 の
    2634.372 。 君
    2634.372 。 君 の
    2492.065 「 秒速 5 センチメートル
    2036.764 上 白石 萌 音
    1511.662 の アニメ 映画
    1511.662 の アニメ 映画 「
    1456.590 「 君 の
    1274.915 2016 - 08 -
    1222.903 億 円 を 突破
    1134.913 の 大 ヒット
    1129.179 前前 前世

  52. Results for 「任天堂」
    • Top 50
    956576.307 … 任天堂
    608643.451 は 、 任天堂 は
    473775.714 岩田 聡
    464686.595 岩田 聡 社長 が
    458018.217 し 、 任天堂
    458018.217 任天堂 の
    458018.217 。 任天堂 、
    458018.217 まし た 。 任天堂
    458018.217 任天堂 を
    458018.217 任天堂 *
    458018.217 に 任天堂
    458018.217 た 任天堂
    458018.217 。 任天堂 は
    404220.379 Ｗｉｉ

  53. Results for 「任天堂」
    386131.784 岩田 社長 の
    385544.541 は 、 任天堂 が
    314103.697 岩田 聡 社長
    309061.454 マリオ 」
    275932.322 岩田 社長
    271259.444 amiibo を
    229298.880 岩田 社長 が
    226600.614 です が 、 任天堂
    219915.400 し た 。 岩田 社長
    217359.056 。 　 岩田 社長 は
    217359.056 。 　 岩田 社長
    217359.056 。 岩田 社長
    205515.499 ファミリー コンピュータ
    169882.335 て いる 。 任天堂
    167779.063 New ニンテンドー 3 DS

  54. Results for 「任天堂」
    166283.753 マリオ の
    162672.396 型 ゲーム
    143406.225 Miitomo
    139477.360 マリオ メーカー
    133729.277 、 任天堂
    130604.759 株主 総会 を 欠席
    124955.944 任天堂 の 岩田 聡
    124945.709 マリオ カート
    119639.084 。 「 マリオ
    112837.829 据え置き 型 ゲーム 機
    99299.695 ミートモ
    89169.832 ３ ＤＳ
    83948.098 米国 任天堂 の

  55. Results for 「任天堂」
    78462.915 「 マリオ 」
    67171.541 年末 商戦
    66552.483 New ニンテンドー 3 DS /
    66088.000 Mii
    64830.347 据え置き 型
    63918.676 「 ゼルダ の 伝説 」
    61309.998 君島 氏
    60804.122 宮本 氏
    58416.532 大 乱闘 スマッシュブラザーズ

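  56. Discussion
    • The score seems to correlate, loosely, with how good a phrase is
      • There are no good phrases near the bottom of the ranking
      • On the other hand, some good phrases sit around the middle
    • Symbols add a fair amount of noise
      • Filtering out candidates that begin with a symbol or a particle looks like it would help
    • Do TF-IDF, the chi-squared value, and similar measures remain meaningful when applied to phrases?
      • Better-performing phrase scoring methods may already have been studied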
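The filtering idea could look like the sketch below. It assumes phrases are space-separated token strings, as in the result listings, detects symbols through Unicode general categories, and uses a small hand-written particle list; that list and the function names are illustrative stand-ins for part-of-speech tags from the morphological analyzer, which the talk does not specify.

```python
import unicodedata

# Illustrative stand-in for a part-of-speech check (assumption, not from the talk).
LEADING_PARTICLES = {"の", "は", "が", "を", "に", "で", "と", "た", "て", "し"}


def starts_with_noise(phrase):
    """True if the first token is a symbol, whitespace, or a listed particle."""
    tokens = phrase.split(" ")
    first = tokens[0]
    if first == "" or first in LEADING_PARTICLES:
        return True
    # Unicode categories P* (punctuation), S* (symbols), Z* (separators).
    return all(unicodedata.category(ch)[0] in "PSZ" for ch in first)


def filter_candidates(phrases):
    """Keep only phrases that do not start with a noisy token."""
    return [p for p in phrases if not starts_with_noise(p)]


candidates = ["「 あの 花 」", "の 新海 誠", "新海 誠 監督", "。 任天堂 は", "岩田 聡 社長"]
kept = filter_candidates(candidates)
print(kept)
```

A symmetric check on the last token would handle phrases that end in a particle or an unclosed bracket.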

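  57. Survey of scoring methods
    • Survey paper on existing methods: [Hasan & Ng 2014]
      • The state of the art in keyphrase extraction (as of 2014)
    • There are supervised and unsupervised methods
    • Supervised methods do not necessarily perform better
      • On 3 of the 4 datasets, the SOTA method is unsupervised [Hasan & Ng 2014]
      • Supervised methods also bring overhead such as preparing training data and managing models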

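  58. Existing methods (unsupervised)
    1. Graph-based ranking
      • TextRank
    2. Topic-based clustering
      1. KeyCluster
      2. Topical PageRank (TPR)
      3. Community Cluster
    3. Language-model-based
    Approaches 1 and 2 look expensive to train
    1. Graph-based: the graph is built over pairs of words, so the cost is quadratic
    2. Topic-based: running a topic model is heavy
    3. Language-model-based: it only counts words, so the cost is linear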

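  59. Language-model-based phrase scoring
    • Score phrases by the difference between language models (= probability models)
    • Measured with the Kullback-Leibler divergence
    • Not tried yet, so that is for another occasion...
    Takashi Tomokiyo and Matthew Hurst. "A Language Model Approach to Keyphrase Extraction"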
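As a rough illustration of scoring by the difference between two language models, the sketch below computes a pointwise KL divergence term, p(w) log(p(w)/q(w)), where p is a foreground unigram model and q a background one. Treating phrases as single units with maximum-likelihood probabilities, and the function name itself, are simplifying assumptions; the talk did not evaluate this approach.

```python
import math


def pointwise_kl(fg_count, fg_total, bg_count, bg_total):
    """Contribution of one phrase to KL(p || q): p(w) * log(p(w) / q(w)).

    p is estimated from the foreground collection and q from the background
    collection. Both counts must be positive in this unsmoothed sketch.
    """
    if fg_count <= 0 or bg_count <= 0:
        raise ValueError("counts must be positive (no smoothing in this sketch)")
    p = fg_count / fg_total
    q = bg_count / bg_total
    return p * math.log(p / q)


# Hypothetical counts: frequent in the keyword's documents, rare elsewhere.
informative = pointwise_kl(10, 100, 1, 1000)
# Equally frequent in both collections: contributes nothing.
neutral = pointwise_kl(10, 100, 100, 1000)
print(round(informative, 4), neutral)
```

Because it needs only counts, this kind of score fits the linear-cost argument made for the language-model-based approach.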

  60. References
    • [Turney 00] Peter D. Turney. "Learning algorithms for keyphrase extraction", Information retrieval 2.4 (2000): 303-336
      • https://arxiv.org/pdf/cs/0212020.pdf
    • [Hasan & Ng 14] Kazi Saidul Hasan and Vincent Ng. "Automatic Keyphrase Extraction: A Survey of the State of the Art", Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2014, pages 1262-1273
      • https://www.aclweb.org/anthology/P/P14/P14-1119.xhtml
      • A systematic review paper on automatic keyphrase extraction
    • [Liu+ 09] Z. Liu, P. Li, Y. Zheng and M. Sun. "Clustering to find exemplar terms for keyphrase extraction", 2009, pp. 257–266
      • Rejects stop words with a stop-word dictionary when building candidate keyphrases

  61. References
    • [岡野原 & 辻井 08] 岡野原 大輔, 辻井 潤一. "全ての部分文字列を考慮した文書分類", NL187 自然言語処理研究会 2008
      • http://ci.nii.ac.jp/naid/110006980330
    • [Okanohara & Tsujii 09] D. Okanohara and J. Tsujii. "Text Categorization with All Substring Features", In the SIAM International Conference on Data Mining (SDM) 2009
      • http://epubs.siam.org/doi/abs/10.1137/1.9781611972795.72
    • [Abouelhoda+ 04] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. "Replacing suffix trees with enhanced suffix arrays.", J. Discrete Algs 2004, 2:53–86.
      • https://pdfs.semanticscholar.org/4ca9/ea95a0a9846965e86619e646d9ca36930c18.pdf
    • [Kasai+ CPM 01] T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park "Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications", CPM 2001