The Effect of Preprocessing on Word Embeddings: A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks

uchi_k
August 17, 2020

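I read the paper "A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks", accepted at ACL 2020. It examines how preprocessing affects word embeddings, particularly in emotion-recognition tasks, and tests whether commonly used experimental settings are actually sound.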

Transcript

  1. 15min LT:
    A Comprehensive Analysis of Preprocessing for
    Word Representation Learning in Affective
    Tasks

  2. Kenshi Uchihashi (内橋 堅志)
    uchi_k @__uchi_k__
    About me
    Representative, yuni, inc.
    Organizer, nlpaper.challenge
    Freelance Machine Learning Engineer / Researcher
    Former: Kyoto University Graduate School of Informatics, MITOU 16
    FreakOut Machine Learning Engineer

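  3. About yuni
    We have done contract development of machine-learning systems for clients ranging from large enterprises to startups and research institutions.
    As a data-driven manufacturing business, we also produce personalized bedding based on an online questionnaire.
    We are a seed-stage startup founded at the end of last year ([…] employees).
    We develop services at the intersection of machine learning and marketing.
    Our main work is analyzing UGC (user-generated content), and today's paper is on sentiment analysis, which relates to that work.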

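  4. #distributional hypothesis #word embedding
    Limits of word embeddings based on the distributional hypothesis
    Similarities can be counterintuitive: for example, the pair "happy"/"sad" can score higher than the pair "happy"/"joy". Word embeddings therefore need to be adjusted for each task.
    The Distributional Hypothesis is that words that occur in the same contexts tend to have similar meanings [Harris, 1954].
    That is, words that frequently appear in similar contexts are assumed to be semantically similar, and so should also lie close together in the embedding space.
    The distributional hypothesis is one way of determining the meaning of a word.
    It obtains word meaning statistically. Approaches include inference-based models such as word2vec and count-based methods that simply apply dimensionality reduction to statistical information.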
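
The counterintuitive similarity described on this slide can be reproduced with a minimal count-based sketch. The toy corpus, the window size, and the helper names `context_vector` and `cosine` are all assumptions made for illustration; nothing here comes from the paper.

```python
from collections import Counter
from math import sqrt

# Toy corpus invented for illustration: "happy" and "sad" share contexts,
# while "joy" appears in a different one.
corpus = [
    "i feel happy today",
    "i feel sad today",
    "she was happy yesterday",
    "she was sad yesterday",
    "joy filled the room",
]

def context_vector(target, sentences, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

happy, sad, joy = (context_vector(w, corpus) for w in ("happy", "sad", "joy"))
sim_happy_sad = cosine(happy, sad)  # identical contexts in this toy corpus
sim_happy_joy = cosine(happy, joy)  # no shared context words
```

In this toy setting the antonym pair ends up more similar than the synonym pair, because similarity here reflects shared contexts rather than polarity.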

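  5. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks
    #abstract
    Nastaran Babanejad (Department of Electrical Engineering and Computer Science) et al., ACL 2020
    Examines how preprocessing affects word embeddings, particularly in emotion-recognition tasks, and tests whether commonly used experimental settings are really sound.
    Pretrained word embeddings are the usual default, but can emotion recognition really be solved with embeddings in which the pair "happy"/"sad" is more similar than the pair "happy"/"joy"?
    Isn't the way preprocessing such as stopwords, negation, POS, and lemmatization is used what fundamentally matters?
    The goal is to measure how strongly preprocessing affects word embeddings and to revisit conventional experimental settings.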

  6. Related work (word2vec, UGC)
    • Improving Distributional Similarity with Lessons Learned from Word Embeddings
    ◦ Omer Levy (Bar-Ilan University) et al., TACL 2015
    ◦ Showed that, for word embeddings, count-based methods can beat inference-based methods such as word2vec, depending on hyperparameter tuning.
    • NL-FIIT at IEST-2018: Emotion Recognition utilizing Neural Networks and Multi-level Preprocessing
    ◦ Samuel Pecar (Slovak University of Technology) et al., EMNLP 2018
    ◦ Investigates the importance of preprocessing when working with user-generated content. In particular, it improves the score by handling emoticons and emoji in detail.
    #recent study #ugc #word2vec

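  7. Related work (preprocessing)
    • On stopwords, filtering and data sparsity for sentiment analysis of Twitter
    ◦ Hassan Saif (Knowledge Media Institute, The Open University) et al., LREC 2014
    ◦ Whether stopword removal helps depends heavily on how the word list is built and on the task. For Twitter sentiment, the common approaches were shown to do more harm than good.
    • A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis
    ◦ Symeon Symeonidis, Expert Systems with Applications, 2018
    ◦ After trying a range of preprocessing techniques, found that lemmatization, number removal, and replacing contractions contributed most to the score in sentiment analysis.
    #recent study #preprocessing #emotion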

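  8. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks
    #abstract
    Nastaran Babanejad (Department of Electrical Engineering and Computer Science) et al., ACL 2020
    Examines how preprocessing affects word embeddings, particularly in emotion-recognition tasks, and tests whether commonly used experimental settings are really sound.
    Task-specific fine-tuning and model improvements matter, but prior work shows that the effects of preprocessing and hyperparameters cannot be ignored.
    The paper applies preprocessing either before or after the word embeddings are trained on the training data, and tries both matching and not matching the preprocessing of the test data.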

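  9. #key points
    Why I am presenting this paper
    There are many interesting new methods, but I had felt that few of them deliver accuracy in production.
    In the end, the choice of preprocessing and differences in technique strongly affect scores, yet almost no papers discuss this.
    I thought this would be a good opportunity to organize what is largely tacit knowledge about preprocessing.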

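  10. #key points
    What they did
    On affective tasks using 2 training corpora and 9 test datasets, they compared applying preprocessing to the training data, to the classification data, and to both.
    What they examined
    What effect does integrating preprocessing into word embeddings have?
    Which preprocessing techniques are effective for sentiment-analysis tasks?
    Do the results improve on pretrained embeddings?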

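  11. #preprocessing #pipeline
    The preprocessing flow in NLP
    Cleaning
    Segmentation
    Normalization
    Compression
    Vectorization
    Removal of tags and symbols, punctuation
    Morphological analysis, adding dictionary entries, dependency parsing
    Number replacement, recognition of emoticons and the like, spellcheck
    Orthographic variants, lowercasing, replacement with a canonical term, abbreviations
    lemmatization, stemming, negation, ontology
    Stopword removal, POS
    CBOW, skip-gram, BERT
    Coverage check, bringing the vocabulary closer to the classification data, etc.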
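
A few of the stages above can be chained as in the following minimal sketch. The function names, the `<num>` placeholder, and the stopword list are assumptions made for illustration, not the pipeline used in the paper.

```python
import re

# Illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"the", "a", "an", "is", "of", "to"}

def clean(text):
    """Cleaning: replace HTML-like tags and punctuation with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^\w\s]", " ", text)

def tokenize(text):
    """Segmentation: whitespace tokenization (enough for English text)."""
    return text.split()

def normalize(tokens):
    """Normalization: lowercase words and replace numbers with a placeholder."""
    return ["<num>" if t.isdigit() else t.lower() for t in tokens]

def preprocess(text, drop_stopwords=True):
    """Run cleaning, segmentation, normalization, and optional stopword removal."""
    tokens = normalize(tokenize(clean(text)))
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

tokens = preprocess("<b>The</b> movie is GREAT, 10 of 10!")
```

Stopword removal is kept behind a flag because, as the later result slides report, it does not always help.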

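  12. #preprocessing #negation
    Negation
    • Building an antonym dictionary
    ◦ Build an antonym dictionary from the WordNet corpus.
    ◦ If a word has no antonym or exactly one, it is used as-is. If it has several, either take the antonym with the highest frequency in the ukWaC corpus or simply choose one at random.
    • Replacing negated words with their antonyms
    ◦ When a negation word is found, extract the word that follows it and look up its antonym in the antonym dictionary. If an antonym is found, replace the negation word and the negated word with it.
    ◦ For example, in a sentence containing 'not happy', identify the negation word ('not') and the word it applies to ('happy'), look up the antonym of 'happy' ('sad') in the antonym dictionary, and replace 'not happy' with 'sad'.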
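
The replacement step described above can be sketched as follows. The small hand-written `ANTONYMS` dictionary stands in for the WordNet-derived one, and the `NEGATION_WORDS` set and the function name `replace_negation` are assumptions made for illustration.

```python
# Hypothetical hand-written antonym dictionary; the slide describes building
# one from WordNet instead.
ANTONYMS = {"happy": "sad", "good": "bad", "like": "dislike"}

# Assumed set of negation words (not taken from the paper).
NEGATION_WORDS = {"not", "never", "no"}

def replace_negation(tokens, antonyms=ANTONYMS, negations=NEGATION_WORDS):
    """Replace each 'negation word + word' pair with the word's antonym.

    If the word after the negation has no antonym in the dictionary,
    both tokens are left unchanged.
    """
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if token in negations and nxt in antonyms:
            out.append(antonyms[nxt])  # 'not happy' -> 'sad'
            i += 2                     # skip the negated word as well
        else:
            out.append(token)
            i += 1
    return out

result = replace_negation("i am not happy today".split())
```

This sketch only looks at the token immediately after the negation word, which matches the description on the slide but ignores longer-range negation scope.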

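  13. #corpus #training #dataset
    Training Corpus
    […] preprocessing techniques are applied to two corpora that differ in size and character.
    News
    […] articles from […] American publications, covering the years […] to […].
    Wikipedia
    A corpus of […] Wikipedia articles, roughly […] times larger than News.
    Overall, stopword removal and POS change the vocab size very little but greatly reduce the corpus size.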

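  14. #corpus #evaluation #dataset
    Evaluating Corpus
    Evaluated on three tasks: sentiment analysis, emotion classification, and sarcasm detection.
    Sentiment analysis (positive/negative polarity)
    • IMDB
    ◦ […] movie reviews. Positive/negative ratio: […]
    • SemEval
    ◦ About […] tweets. Positive/negative ratio: […]
    • Airline
    ◦ About […] tweets about […] airlines.
    Emotion Detection (emotion class classification)
    • ISEAR
    ◦ About […] personal accounts of emotion-evoking experiences
    • Alm
    ◦ About […] fairy tales
    • SSEC
    ◦ About […] tweets from SemEval, re-annotated
    Sarcasm Detection
    • Onion
    ◦ About […] news headlines collected from sarcastic and non-sarcastic media outlets
    • IAC
    ◦ About […] utterance-response pairs
    • Reddit
    ◦ About […] × 10,000 Reddit posts labeled by their authors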

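  15. #result
    Negation was the most effective technique on all datasets
    F-score when word embeddings are built after preprocessing the News corpus
    Even when combinations of the other preprocessing techniques are included, negation alone always had the […] highest score.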

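  16. #result
    (Commonly used) stopwords and stemming may look useful on their own, but they do not contribute to the score when combined with other preprocessing.
    Applying all preprocessing techniques gives a score equal to or slightly below negation alone (Onion, Reddit, SSEC).
    Stopwords and POS greatly reduce the corpus size, but POS does not reduce the score.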

  17. #result
    F-score comparison of CBOW, Skip-gram, and BERT using the Wikipedia corpus
    The same trend appears on Wikipedia, and more strongly.

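  18. #result #preprocess #postprocess
    Comparison of applying preprocessing to the training corpus (pre) and applying it to the classification dataset (post)
    As intuition suggests, post alone gave the lowest score in every case.
    There was no large score difference between pre and both, which shows that pre matters most.
    In general, when word embeddings are given, I think people usually adapt the classification data to them, so this result is unexpected.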
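
The three settings compared on this slide can be laid out as in the sketch below. The helper `make_settings` and the toy `lowercase` preprocessing function are hypothetical; in the paper's setup the first element of each pair would be used to train embeddings and the second would be fed to the classifier.

```python
def lowercase(sentence):
    """Toy stand-in for a real preprocessing function (an assumption)."""
    return sentence.lower()

def make_settings(train_corpus, clf_data, preprocess):
    """Return (embedding-training corpus, classification data) for each setting."""
    processed_train = [preprocess(s) for s in train_corpus]
    processed_clf = [preprocess(s) for s in clf_data]
    return {
        "pre": (processed_train, clf_data),        # preprocess the training corpus only
        "post": (train_corpus, processed_clf),     # preprocess the classification data only
        "both": (processed_train, processed_clf),  # preprocess both
    }

train = ["I am NOT Happy"]
clf = ["Not Good At All"]
settings = make_settings(train, clf, lowercase)
```

Keeping the three pairs side by side makes it easy to train one embedding per setting and compare scores under otherwise identical conditions.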

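  19. #result #compare with SoTA
    Evaluation of the proposed models against SoTA baselines
    On every task, the group of proposed methods outperforms the SoTA.
    BERT being the strongest is only to be expected, so the comparison is a little unfair, but the proposed methods as a whole win in many cases (IMDB, IAC, Onion, Reddit, SSEC).
    The SoTA pretrained models use far larger corpora than the proposed methods, which underlines the importance of pre.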

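  20. #result #relative improvement
    Absolute F-score and relative improvement over basic preprocessing
    The improvement is slightly larger on the multi-class classification task than on the two binary tasks, sentiment analysis and sarcasm detection.
    That said, I feel the differences are still marginal unless the comparison is extended to more datasets.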
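
As a rough sketch of the arithmetic, a relative improvement can be computed as the gain over the baseline score divided by the baseline score. This definition and the function name `relative_improvement` are assumptions, since the slide does not state the formula, and the numbers below are invented, not results from the paper.

```python
def relative_improvement(score, baseline):
    """Relative improvement of `score` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline * 100.0

# Invented numbers for illustration only.
gain = relative_improvement(score=0.84, baseline=0.80)  # roughly 5 percent
```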

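  21. #key points
    Summary
    Preprocessing at the word-embedding stage was shown to be the most effective timing.
    On its own, negation is the most effective technique, while commonly used stopwords and stemming often lower the score.
    Task-appropriate preprocessing applied at the right time can outscore pretrained word embeddings learned from a huge corpus.
    Many of the findings were already known as tacit knowledge, but thorough validation turned them into systematic knowledge.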