Research on Technical Term Extraction Methods and
Development of an Extraction Application

Koga Kobayashi

September 27, 2018
Transcript

  1. Research on Technical Term Extraction Methods and Development of an Extraction Application

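  2. Self-introduction
    • Koga Kobayashi: @kajyuuen
    • University of Tsukuba, School of Informatics, 4th year
    • Research: natural language processing and machine learning
    • For development I mostly use Ruby on Rails
    • Hobbies: the internet, listening to music, motorcycles (currently on a break)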

  3. Purpose
    Develop a system and a method that can extract technical terms from specialized-domain text for which little labeled training data is available.

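  4. What is a technical term?
    "A technical term (専門用語, senmon yōgo) is a word or set of terms that is used and understood only among people engaged in a particular occupation, or within a particular academic field or industry. It is also called テクニカルターム (English: technical term)." (quoted from Wikipedia)
    Example: call centers
    • operator, FAQ, VoC, average call time
    Example: cooking
    • いちょう切り (quarter-round slicing), 桂剥き (rotary peeling), 三枚おろし (three-piece filleting)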

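  5. Background
    A dictionary of technical terms improves the accuracy of morphological analysis and search.
    However...
    • A model trained on a general domain does not extract terms well when it is adapted to a specialized domain
    • Extracting technical terms takes a great deal of expert time and labor
    Because of these problems, technical term extraction has been difficult.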

  6. Background
    Making technical term extraction possible at low cost would therefore improve the performance of Retrieva products.

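  7. Proposed method
    • Extract technical term candidates using occurrence frequency and connection frequency
    • Classify the candidates with supervised learning driven by active learning
    Combining these two methods makes low-cost technical term extraction possible.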

  8. Flow up to technical term extraction
    Candidate extraction by occurrence and connection frequency → supervised learning with active learning → technical term extraction

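  9. Candidate extraction by occurrence and connection frequency [Nakagawa+ 2003]
    • Assume that a technical term is either a single noun or a compound made of several nouns
    • Define a single noun (単名詞) as the smallest unit that makes up a compound
    • The more often a single noun joins with other single nouns to form compounds, the more important it is ⇒ technical term
    自然言語処理 (natural language processing) = 自然 (natural) + 言語 (language) + 処理 (processing)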

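  10. Candidate extraction by occurrence and connection frequency [Nakagawa+ 2003]
    Example: 自然言語処理 (natural language processing)
    Table: single noun (自然, 言語, 処理) | times joined to a preceding word | times joined to a following word
    Importance = geometric mean of the connection counts of the single nouns that make up the compound
    $\sqrt[6]{1 \cdot 2 \cdot 2 \cdot 3 \cdot 1 \cdot 1} \approx 1.51$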
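
The importance score above can be reproduced with a few lines of Python. This is a minimal sketch, not the author's implementation: the function names and the mini-corpus are hypothetical, and adding 1 to every count (so that a noun with no connections does not zero the product) is an assumption based on how this scoring method is usually described. Only the six factors 1, 2, 2, 3, 1, 1 come from the slide.

```python
import math
from collections import Counter


def connection_counts(compounds):
    """Count how often each single noun joins a word before it / after it."""
    joined_to_prev, joined_to_next = Counter(), Counter()
    for nouns in compounds:
        for first, second in zip(nouns, nouns[1:]):
            joined_to_next[first] += 1   # `first` has a noun right after it
            joined_to_prev[second] += 1  # `second` has a noun right before it
    return joined_to_prev, joined_to_next


def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))


def importance(nouns, joined_to_prev, joined_to_next):
    """Geometric mean of (count + 1) over both sides of every single noun."""
    factors = []
    for noun in nouns:
        factors.append(joined_to_prev[noun] + 1)  # +1 smoothing is an assumption
        factors.append(joined_to_next[noun] + 1)
    return geometric_mean(factors)


# The six factors shown on the slide for 自然言語処理.
print(round(geometric_mean([1, 2, 2, 3, 1, 1]), 2))  # 1.51

# Hypothetical mini-corpus of compounds, already split into single nouns.
corpus = [["自然", "言語", "処理"], ["言語", "処理"], ["言語", "モデル"]]
prev_counts, next_counts = connection_counts(corpus)
print(importance(["自然", "言語", "処理"], prev_counts, next_counts))
```

In practice the compounds would come from a morphological analyzer that groups consecutive nouns; that step is left out here.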

  11. Flow up to technical term extraction
    Candidate extraction by occurrence and connection frequency → supervised learning with active learning → technical term extraction

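  12. What is active learning?
    A method that picks, from a large pool of unlabeled data, the items whose labels are most likely to improve the model, recommends them to the user, and trains the model as the user annotates them.
    Model performance improves with little labeled training data.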

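  13. What is active learning?
    Figure legend: technical term / non-technical term / unlabeled; candidate points 1 and 2
    Which data point's label do we want to know?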

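  14. What is active learning?
    Figure legend: technical term / non-technical term / unlabeled; candidate points 1 and 2
    Which data point's label do we want to know?
    Learning is not effective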

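  15. What is active learning?
    Figure legend: technical term / non-technical term / unlabeled; candidate points 1 and 2
    Which data point's label do we want to know?
    Learning proceeds effectively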

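  16. Building the feature vector
    The feature vector is built from:
    • the surface form, part of speech, and character type of the two words before and the two words after the candidate
    • the importance score from unsupervised learning
    Example sentence: 研究 は 自然言語処理 と 機械 学習 です ("My research is natural language processing and machine learning")
    Parts of speech: noun / particle / technical term candidate / particle / noun / noun / auxiliary verb
    Importance of the candidate 自然言語処理: 1.51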
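
One way to turn the description above into sparse features is sketched below. The feature names, the `char_type` heuristic (script class of the first character, read from its Unicode name), and the English part-of-speech tags are all assumptions made for illustration; the slide only states which kinds of information are used.

```python
import unicodedata


def char_type(token):
    """Coarse script class of a token, judged from its first character."""
    name = unicodedata.name(token[0], "")
    for script in ("HIRAGANA", "KATAKANA", "CJK", "LATIN", "DIGIT"):
        if script in name:
            return script.lower()
    return "other"


def context_features(tokens, pos_tags, index, importance, window=2):
    """Features for the candidate at `index`: its score plus its neighbours."""
    feats = {"importance": importance}
    for offset in range(-window, window + 1):
        j = index + offset
        if offset == 0 or not 0 <= j < len(tokens):
            continue  # skip the candidate itself and positions outside the sentence
        feats[f"surface[{offset:+d}]={tokens[j]}"] = 1.0
        feats[f"pos[{offset:+d}]={pos_tags[j]}"] = 1.0
        feats[f"chartype[{offset:+d}]={char_type(tokens[j])}"] = 1.0
    return feats


# The example sentence from the slide; the candidate is token 2.
tokens = ["研究", "は", "自然言語処理", "と", "機械", "学習", "です"]
pos_tags = ["noun", "particle", "candidate", "particle", "noun", "noun", "aux"]
feats = context_features(tokens, pos_tags, index=2, importance=1.51)
for name in sorted(feats):
    print(name, feats[name])
```

A dictionary like this can be mapped to a fixed-length vector by assigning each feature name a column index.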

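  17. Training the model
    Logistic regression
    • A model that classifies each word as a technical term or a non-technical term
    • A simple model was chosen because active learning repeats training and prediction many times
    • The active learning method used here requires predicted probabilities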
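
To make the role of the predicted probability concrete, here is a minimal logistic regression written with the standard library only. It is a sketch under stated assumptions, not the model used in the talk: the training routine (plain batch gradient descent), its hyperparameters, and the toy one-dimensional data are all hypothetical.

```python
import math


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, x):
    """Probability that feature vector x is a technical term (class 1)."""
    *w, bias = weights
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + bias)


def train_logreg(X, y, epochs=2000, lr=1.0):
    """Fit weights (bias stored last) by plain batch gradient descent."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            error = predict_proba(weights, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += error * xj
            grad[-1] += error
        weights = [wj - lr * gj / len(X) for wj, gj in zip(weights, grad)]
    return weights


# Hypothetical toy data: one feature, label 1 for the larger values.
X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]
weights = train_logreg(X, y)
print(predict_proba(weights, [0.9]), predict_proba(weights, [0.1]))
```

Because the model is this small, retraining it after every few new labels stays cheap, which matches the reason given on the slide for choosing it.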

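  18. Data selection and model update
    Uncertainty Sampling (least confident)
    Recommend the data point that the current model is least certain about:
    $x^*_{LC} = \arg\max_{x \in U} \left(1 - P_\theta(\hat{y} \mid x)\right)$
    $\hat{y}$: the label with the highest predicted probability
    $U$: the set of unlabeled data
    $x^*_{LC}$: the data point recommended for labeling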
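
The least-confident rule can be written directly from the formula. In this sketch each pool item is represented by its predicted class probabilities; the function name and the example probabilities are hypothetical.

```python
def least_confident(pool_probs):
    """Index of the pool item whose most likely label has the lowest probability."""
    uncertainty = [1.0 - max(probs) for probs in pool_probs]
    return max(range(len(uncertainty)), key=uncertainty.__getitem__)


# Hypothetical predictions for three unlabeled candidates:
# [P(non-technical term), P(technical term)]
pool_probs = [[0.90, 0.10], [0.55, 0.45], [0.20, 0.80]]
print(least_confident(pool_probs))  # 1
```

For a binary classifier this is the same as picking the item whose predicted probability is closest to 0.5.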

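  19. Experiment 1: Technical term extraction from Wikipedia
    • Data
      • Extract technical terms from 61 Wikipedia texts
    • Settings
      • Annotate one third of the terms extracted by unsupervised learning
      • Retrain the model after every 5 labeled data points
      • Compare active learning with random sampling, and compare across dictionaries
    Goal: show that active learning outperforms random sampling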
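
The annotation protocol described above (label 5 items, retrain, repeat until the budget is spent, with random sampling as the baseline) could be organized as in the following sketch. Everything here is an assumption for illustration: `run_annotation`, its parameters, and the idea of passing the model in as two callables, `fit(X, y)` and `proba(model, x)`, are not taken from the talk.

```python
import random


def run_annotation(pool, oracle, fit, proba, budget, batch_size=5,
                   strategy="least_confident", seed=0):
    """Label `budget` pool items in batches, retraining after every batch."""
    rng = random.Random(seed)
    labeled = {}   # pool index -> label supplied by the annotator (oracle)
    model = None
    while len(labeled) < budget:
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        if strategy == "random" or model is None:
            rng.shuffle(unlabeled)  # baseline, and the first batch before any model exists
        else:
            # Least confident first: predicted probability closest to 0.5.
            unlabeled.sort(key=lambda i: abs(proba(model, pool[i]) - 0.5))
        take = min(batch_size, budget - len(labeled))
        for i in unlabeled[:take]:
            labeled[i] = oracle(pool[i])
        indices = sorted(labeled)
        model = fit([pool[i] for i in indices], [labeled[i] for i in indices])
    return model, labeled
```

Calling it twice with the same pool and budget, once with `strategy="least_confident"` and once with `strategy="random"`, gives the two annotation runs that the experiment compares. A budget of `len(pool) // 3` corresponds to annotating one third of the extracted terms.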

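  20. Experiment 1: Results
    Two tables, one for IPAdic and one for NEologd, each with the columns Model | Precision | Recall | F-value and the rows unsupervised learning, random sampling, active learning
    • With both dictionaries, active learning outperformed random sampling
    • Performance was higher with NEologd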
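
For reference, the three metrics in the result tables can be computed from a set of extracted terms and a set of correct terms as sketched below. The function name and the placeholder terms are hypothetical and are not data from the experiment.

```python
def precision_recall_f(extracted, gold):
    """Precision, recall and F-value of extracted terms against gold terms."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Placeholder terms, for illustration only.
extracted = {"term_a", "term_b", "term_c", "term_d"}
gold = {"term_a", "term_b", "term_e"}
precision, recall, f_value = precision_recall_f(extracted, gold)
print(precision, recall, f_value)
```

Here 2 of the 4 extracted terms are correct (precision 0.5) and 2 of the 3 gold terms are found (recall 2/3), so the F-value is 4/7.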

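  21. Experiment 2: Technical term extraction for the FAQ domain
    • Training data
      • FAQ text (5,113 characters) taken from the スカパー！ help content
    • Settings
      • Compare with a model whose annotation targets are chosen at random
      • Retrain the model after every 5 labeled data points
      • With 0 annotations, treat every extracted word as a technical term
    Goal: check how much annotation is needed to obtain a practically useful model
    https://helpcenter.skyperfectv.co.jp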

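  22. Experiment 2: Precision and recall
    • Precision saturates earlier with active learning than with random sampling
    • In recall, active learning is far ahead of random sampling
    Plot annotations: without annotation (unsupervised learning only), precision is low; unsupervised learning covers 72.7% of the technical terms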

  23. Experiment 2: F-value
    Annotating only about 40% was enough for the F-value to exceed 70%
    A difference of up to about 20 points

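  24. Experiment 2: Examples of words extracted by unsupervised learning
    Technical terms extracted successfully
    • スカパー！, プレミアムサービス光マンション向けサービス
    Technical terms that could not be extracted
    • TZ-WR4KP, SP-HR200H, アンテナサポートプラン
    Words extracted by mistake
    • 番組 (program), チャンネル (channel), Myチャンネル1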

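  25. Interface
    To make annotation more efficient, the interface was developed as a web application.
    Features
    • Highlighting / extraction of technical terms
    • Training with active learning and recommendation of data to annotate
    • CSV export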

  26. DEMO

  27. Application architecture

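  28. Summary
    Purpose
    Extract technical terms from a small amount of text data
    Method
    Provide a web application that uses unsupervised learning + active learning
    Future work
    Speed up the extraction algorithm by reimplementing it
    Evaluate performance in applications such as search, and investigate new methods and features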

  29. References
    [1] 中川 裕志, 湯本 紘彰, 森 辰則 (Nakagawa, H., Yumoto, H., Mori, T.). 出現頻度と連接頻度に基づく専門用語抽出 [Technical term extraction based on occurrence frequency and connection frequency]. 自然言語処理. 2003, 10(1), p.27-45.
    [2] 中川 裕志, 湯本 紘彰, 森 辰則. 日本語マニュアル文における名詞間の連接情報を用いたハイパーテキスト化のための索引語の抽出 [Extraction of index terms for hypertext generation using connection information between nouns in Japanese manual text]. 情報処理学会研究報告自然言語処理. 1996, (114), p.65-72.
    [3] "専門用語（キーワード）自動抽出用Perlモジュール". "専門用語（キーワード）自動抽出システム"のページへようこそ. http://gensen.dl.itc.u-tokyo.ac.jp/termextract.html, (accessed 2018-9-4).
    [4] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648. 2010. http://burrsettles.com/pub/settles.activelearning.pdf, (accessed 2018-9-4).
    [5] Burr Settles, Mark Craven. An Analysis of Active Learning Strategies for Sequence Labeling Tasks. EMNLP. 2008.