The Effect of Preprocessing on Word Embeddings: A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks

uchi_k
August 17, 2020

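I have been reading "A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks", a paper accepted at ACL 2020. It examines how preprocessing affects word embeddings, particularly in affective tasks such as emotion recognition, and checks whether commonly used experimental setups are actually sound.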


Transcript

  1. 15min LT: A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks
  2. 内橋 堅志 (uchi_k, @__uchi_k__) About me: representative of yuni, inc.; organizer of nlpaper.challenge; Freelance Machine Learning Engineer / Researcher; formerly Kyoto University Graduate School of Informatics, MITOU 16, and Machine Learning Engineer at FreakOut.
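  3. About yuni
    A seed-stage startup founded at the end of last year ([…] employees).
    We have done contract development in machine learning for clients ranging from large enterprises to startups and research institutions.
    As a data-driven manufacturing business, we also make personalized bedding based on an online questionnaire.
    We develop services at the intersection of machine learning and marketing.
    Analyzing UGC (user-generated content) is our main work, and today's paper on affective analysis is related to it.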

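  4. #distributional hypothesis #word embedding Limits of word embeddings based on the distributional hypothesis
    The distributional hypothesis is one way to determine the meaning of a word: "words that occur in the same contexts tend to have similar meanings" [Harris, 1954]. Words that frequently appear in similar contexts are assumed to be semantically similar, so they should also end up close together in the embedding space.
    It is a way of obtaining word meaning statistically. There are inference-based models such as word2vec, and count-based methods that simply apply dimensionality reduction to co-occurrence statistics.
    It can produce counterintuitive similarities, for example the pair "happy" and "sad" scoring higher than the pair "happy" and "joy", so word embeddings need to be adjusted for each task.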
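
To make the "happy"/"sad" problem concrete, here is a minimal count-based sketch in plain Python. The toy corpus, the window size, and the function names are assumptions made for illustration; they are not taken from the paper or the slides. Each word is represented by counts of the words around it, and the vectors are compared with cosine similarity.

```python
from collections import Counter
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words within `window` tokens of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Toy corpus (an assumption): "happy" and "sad" share contexts, "joy" does not.
corpus = [
    "i feel happy today",
    "i feel sad today",
    "she was full of joy",
]
vectors = cooccurrence_vectors(corpus)
sim_happy_sad = cosine(vectors["happy"], vectors["sad"])
sim_happy_joy = cosine(vectors["happy"], vectors["joy"])
print(sim_happy_sad, sim_happy_joy)
```

In this toy setting the antonym pair comes out more similar than the synonym pair, purely because "happy" and "sad" occur in the same contexts.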
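  5. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks #abstract
    Nastaran Babanejad (Department of Electrical Engineering and Computer Science) et al., ACL 2020
    Examines how preprocessing affects word embeddings, particularly in affective tasks such as emotion recognition, and checks whether commonly used experimental setups are actually sound.
    We tend to reach for pretrained word embeddings, but can emotion recognition really be solved when embeddings exist in which the pair "happy" and "sad" is more similar than the pair "happy" and "joy"?
    Isn't the real question how preprocessing such as stopwords, negation, POS, and lemmatization is used?
    The aim is to measure how strongly preprocessing affects word embeddings and to revisit conventional experimental setups.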
  6. Related work (word2vec, UGC) #recent study #ugc #word2vec
    • Improving Distributional Similarity with Lessons Learned from Word Embeddings
      ◦ Omer Levy (Bar-Ilan University) et al., ACL […]
      ◦ Showed that, for word embeddings, count-based methods can beat inference-based methods such as word2vec given the right hyperparameter tuning.
    • NL-FIIT at IEST-2018: Emotion Recognition utilizing Neural Networks and Multi-level Preprocessing
      ◦ Samuel Pecar (Slovak University of Technology) et al., EMNLP […]
      ◦ Studies the importance of preprocessing when working with user-generated content. In particular, it raised scores by carefully recognizing emoticons and emoji.
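  7. Related work (preprocessing) #recent study #preprocessing #emotion
    • On stopwords, filtering and data sparsity for sentiment analysis of twitter
      ◦ Hassan Saif (Knowledge Media Institute, The Open University) et al., LREC […]
      ◦ Whether stopword removal helps varies widely with how the word list is built and with the task; for Twitter sentiment, the paper showed that the common approach does more harm than good.
    • A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis
      ◦ Symeon Symeonidis, Expert Systems with Applications […]
      ◦ After trying a range of preprocessing techniques, lemmatization, number removal, and replacing contractions contributed most to sentiment analysis scores.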
  8. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks #abstract
    Task-specific fine-tuning and model improvements matter, but prior work suggests that the effects of preprocessing and hyperparameters cannot be ignored.
    The paper applies preprocessing before and after the word embeddings are trained on the training data, and tries both matching and not matching the preprocessing of the test data.
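  9. #key points Why I am presenting this paper
    There are plenty of interesting new methods, but I had felt that few of them deliver accuracy in production.
    In the end, the choice of preprocessing and differences in technique strongly affect scores, yet almost no papers discuss this.
    I thought this would be a good opportunity to collect the tacit knowledge around preprocessing.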

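  10. #key points What was done
    Using two training corpora and nine test datasets for affective tasks, the paper compares applying preprocessing to the training data, to the classification data, and to both.
    What was tested:
    What effect does integrating preprocessing into word embeddings have?
    Which preprocessing steps help in sentiment and emotion analysis tasks?
    Is the result better than pretrained embeddings?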

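  11. #preprocessing #pipeline The preprocessing flow in NLP
    Stages: cleaning, segmentation, normalization, compression, vectorization.
    Techniques shown: removal of tags and symbols, punctuation, morphological analysis, dictionary additions, dependency parsing, number replacement, recognition of emoticons and the like, spellcheck, spelling variants, lowercasing, replacement with a representative word, abbreviations, lemmatization, stemming, negation, ontology, stopword removal, POS, CBOW, skip-gram, BERT, checking coverage, bringing the vocabulary closer to the classification data, etc.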
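
As a rough illustration of how the first stages chain together, the sketch below runs cleaning, segmentation, normalization, and compression on one sentence. The stopword list, the `<num>` placeholder, and all function names are hypothetical choices for this example, not the paper's pipeline; vectorization is left out.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "was", "of", "to"}  # hypothetical toy list

def clean(text):
    """Cleaning: drop HTML-like tags, then punctuation (apostrophes are kept)."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^\w\s']", " ", text)

def tokenize(text):
    """Segmentation: whitespace split (enough for English; Japanese needs a morphological analyzer)."""
    return text.split()

def normalize(tokens):
    """Normalization: lowercase and replace digit runs with a placeholder."""
    return [re.sub(r"\d+", "<num>", t.lower()) for t in tokens]

def compress(tokens):
    """Compression: remove stopwords."""
    return [t for t in tokens if t not in STOPWORDS]

def preprocess(text):
    return compress(normalize(tokenize(clean(text))))

print(preprocess("<b>The</b> flight was delayed 3 hours!"))
# ['flight', 'delayed', '<num>', 'hours']
```

The order matters: normalizing numbers before cleaning, for instance, would let the punctuation filter strip the angle brackets from the placeholder.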
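  12. #preprocessing #negation Negation
    • Building an antonym dictionary
      ◦ An antonym dictionary is built from the WordNet corpus.
      ◦ If no antonym or exactly one is found, it is used as is; if there are several, either the one with the highest frequency in the ukWaC corpus is chosen, or one is simply picked at random.
    • Replacing negated words with antonyms
      ◦ When a negation word is found, the following word is extracted and looked up in the antonym dictionary. If an antonym is found, the negation word and the negated word are replaced by it.
      ◦ For example, in the sentence ['I', 'am', 'not', 'happy', 'today'], the negation word ('not') and the word it applies to ('happy') are identified. The antonym of 'happy' ('sad') is looked up in the antonym dictionary, and 'not happy' is replaced with 'sad'.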
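
The replacement step described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the negation list and the tiny hard-coded antonym dictionary are hypothetical stand-ins for the WordNet-derived dictionary, and only the word immediately after the negation is considered.

```python
NEGATION_WORDS = {"not", "never", "no"}      # hypothetical toy list
ANTONYMS = {"happy": "sad", "good": "bad"}   # hypothetical stand-in for the WordNet-based dictionary

def replace_negations(tokens, antonyms=ANTONYMS, negations=NEGATION_WORDS):
    """Replace '<negation> <word>' with the antonym of <word> when one is known."""
    result = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        if token.lower() in negations and following is not None and following.lower() in antonyms:
            # Both the negation word and the negated word are consumed.
            result.append(antonyms[following.lower()])
            i += 2
        else:
            # No antonym available: leave the tokens untouched.
            result.append(token)
            i += 1
    return result

print(replace_negations(["I", "am", "not", "happy", "today"]))
# ['I', 'am', 'sad', 'today']
```

When the dictionary has no entry for the negated word, the sentence is left as it was, which matches the "use as is" fallback on the slide.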
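  13. #corpus #training #dataset Training Corpus
    The preprocessing steps are applied to two corpora that differ in size and character.
    • News
      ◦ […] articles from […] American publications, […]–[…]
    • Wikipedia
      ◦ Made up of […] Wikipedia articles; a corpus roughly […] times larger than News
    Overall, stopword removal and POS barely change the vocab size but greatly reduce the corpus size.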
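
The observation that stopword removal shrinks the corpus far more than the vocabulary can be checked on a toy example. The sentences and the stopword list below are made up for illustration. Removing stopwords can delete at most as many vocabulary entries as the list has words, while the number of tokens removed grows with corpus size.

```python
STOPWORDS = {"the", "a", "is", "of", "and", "to"}  # hypothetical toy list

def corpus_stats(sentences):
    """Return (corpus size in tokens, vocab size in distinct tokens)."""
    tokens = [t for s in sentences for t in s.split()]
    return len(tokens), len(set(tokens))

def remove_stopwords(sentences):
    return [" ".join(t for t in s.split() if t not in STOPWORDS) for s in sentences]

corpus = [
    "the plot of the movie is a mess",
    "the cast is great and the score is great",
    "a story of love and the cost of war",
]
before = corpus_stats(corpus)
after = corpus_stats(remove_stopwords(corpus))
print(before, after)
# (26, 15) (11, 10)
```

Here the corpus drops from 26 to 11 tokens while the vocabulary loses only the 5 stopwords that occur; on a real corpus with a large vocabulary the second effect is negligible.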
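  14. #corpus #evaluation #dataset Evaluating Corpus
    Evaluation covers three tasks: sentiment analysis, emotion classification, and sarcasm detection.
    Sentiment analysis: positive/negative sentiment
    • IMDB
      ◦ […] movie reviews. Positive/negative ratio […]
    • SemEval
      ◦ About […] tweets. Positive/negative ratio […]
    • Airline
      ◦ About […] tweets concerning […] airlines.
    Emotion Detection: emotion class classification
    • ISEAR
      ◦ About […] personal accounts of emotion-evoking experiences
    • Alm
      ◦ About […] fairy tales
    • SSEC
      ◦ About […] tweets, a re-annotation of SemEval […]
    Sarcasm Detection: detecting sarcasm
    • Onion
      ◦ About […] news headlines collected from sarcastic and non-sarcastic media
    • IAC
      ◦ About […] utterance-response pairs
    • Reddit
      ◦ About […]0,000 Reddit posts labeled by their authors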
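  15. #result Negation was the most effective on every dataset
    F-scores when the word embeddings are built after preprocessing the News corpus.
    Even when combinations of the preprocessing steps other than negation are included, negation alone always had the […]-highest score.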

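  16. #result
    (Common) stopwords and stemming may look useful on their own, but they turn out not to contribute to the score when combined with other preprocessing steps.
    Applying every preprocessing step gives the same score as negation alone, or a slightly lower one (Onion, Reddit, SSEC).
    Stopwords and POS greatly reduce the corpus size, but POS does not reduce the score.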

  17. #result Comparison of F-scores for CBOW, Skip-gram, and BERT using the Wikipedia corpus
    With Wikipedia, the same trends became even stronger.

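  18. #result #preprocess #postprocess Comparing preprocessing applied to the training corpus (pre) with preprocessing applied to the classification dataset (post)
    As intuition suggests, post alone gave the lowest score in every case.
    There was no large difference in score between pre and both, which shows that pre matters most.
    In general, when word embeddings are given, I think people mostly adapt the classification data to them, so this was a surprising result.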
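
The three settings differ only in which side the preprocessing touches. The sketch below makes that explicit; `build_inputs` and the lowercase-only `preprocess` are hypothetical names and simplifications for this example, and training the embeddings and the classifier is left out.

```python
def build_inputs(setting, training_corpus, classification_data, preprocess):
    """Return (texts for embedding training, texts for classification) under one setting.

    "pre"  : preprocess only the corpus the embeddings are trained on
    "post" : preprocess only the classification dataset
    "both" : preprocess both sides
    """
    if setting not in {"pre", "post", "both"}:
        raise ValueError(f"unknown setting: {setting}")
    train = [preprocess(s) for s in training_corpus] if setting in {"pre", "both"} else list(training_corpus)
    clf = [preprocess(s) for s in classification_data] if setting in {"post", "both"} else list(classification_data)
    return train, clf

def preprocess(sentence):
    """Stand-in preprocessing step (lowercasing only)."""
    return sentence.lower()

training_corpus = ["Breaking News Today"]
classification_data = ["What A Movie"]

for setting in ("pre", "post", "both"):
    print(setting, build_inputs(setting, training_corpus, classification_data, preprocess))
```

In the "pre" setting the classifier sees raw text while the embeddings were learned from preprocessed text; the slide's finding is that this still scores about as well as "both".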

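  19. #result #compare with SoTA Evaluating the proposed models against SoTA baselines
    On every task, the proposed methods as a group beat SoTA.
    BERT being the strongest is a given, so the comparison is a little unfair, but the proposed methods come out ahead overall in many cases (IMDB, IAC, Onion, Reddit, SSEC).
    The SoTA pretrained models use far larger corpora than the proposed methods, which shows how much pre matters.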
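  20. #result #relative improvement Absolute F-scores and relative improvement over basic preprocessing
    The gain is slightly larger on the multiclass classification task than on the two binary tasks, sentiment analysis and sarcasm detection.
    Still, I feel the differences are too small to call without comparing on more datasets.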

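  21. #key points Summary
    Preprocessing at the word embedding stage was shown to be the most effective timing.
    On its own, negation is the most effective step, while common stopword removal and stemming often lower the score.
    Rather than using word embeddings pretrained on a huge corpus, applying task-appropriate preprocessing at the right time can give higher scores.
    Much of this was already known as tacit knowledge, but careful experiments turned it into systematic knowledge.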