Slide 1

Slide 1 text

15min LT: A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks

Slide 2

Slide 2 text

About me
内橋 堅志 (uchi_k, @__uchi_k__)
• yuni, inc.: representative
• nlpaper.challenge: organizer
• Freelance Machine Learning Engineer / Researcher
• Former: Graduate School of Informatics, Kyoto University; MITOU 16; FreakOut Machine Learning Engineer

Slide 3

Slide 3 text

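About yuni
• We have done contract development of machine-learning systems for clients ranging from large enterprises to startups and research institutions.
• As a data-driven manufacturing business, we also make personalized bedding based on an online questionnaire.
• yuni is a seed-stage startup founded at the end of last year.
• We develop services at the intersection of machine learning and marketing.
• Our main work is analyzing UGC (user-generated content); today's paper covers sentiment analysis, which is closely related.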

Slide 4

Slide 4 text

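#distributional hypothesis #word embedding
Limits of word embeddings based on the distributional hypothesis
• Similarities can come out counterintuitive, for example the pair "happiness"/"sadness" scoring higher than the pair "happiness"/"joy", so word embeddings need to be adjusted for each task.
The Distributional Hypothesis is that words that occur in the same contexts tend to have similar meanings [Harris, 1954].
• The hypothesis: words that frequently appear in similar contexts are semantically similar, so they should also lie close together in the embedding space.
• The distributional hypothesis is one way of determining the meaning of a word.
• It derives word meaning statistically. Methods include inference-based models such as word2vec and count-based methods that simply reduce the dimensionality of statistical information.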
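
The counterintuitive similarity described on this slide can be reproduced with a minimal count-based sketch. The toy corpus, the window size, and the function names below are assumptions made for illustration; they are not taken from the paper or the talk.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=1):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus: "happy" and "sad" share contexts, "joy" does not.
corpus = [
    "i feel happy today".split(),
    "i feel sad today".split(),
    "joy fills the room".split(),
]
vecs = cooccurrence_vectors(corpus)
sim_happy_sad = cosine(vecs["happy"], vecs["sad"])
sim_happy_joy = cosine(vecs["happy"], vecs["joy"])
print(sim_happy_sad, sim_happy_joy)
```

Because "happy" and "sad" occur between the same neighbors, a purely distributional method places them closer together than "happy" and "joy", even though their affective meanings are opposite.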

Slide 5

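Slide 5 text

A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks
#abstract
Nastaran Babanejad, Department of Electrical Engineering and Computer Science, et al. ACL 2020
The paper examines how preprocessing affects word embeddings, particularly in emotion-recognition tasks, and tests whether commonly used experimental settings are really sound.
• We tend to reach for pretrained word embeddings, but can emotion recognition really be solved when embeddings exist in which the pair "happiness"/"sadness" scores a higher similarity than the pair "happiness"/"joy"?
• Isn't the way preprocessing such as stopwords, negation, POS, and lemmatization is used what essentially matters?
• The aim is to measure how strongly preprocessing affects word embeddings and to revisit the conventional experimental settings.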

Slide 6

Slide 6 text

Related work (word2vec, UGC)
• Improving Distributional Similarity with Lessons Learned from Word Embeddings
◦ Omer Levy, Bar-Ilan University, et al. ACL
◦ Showed that for word embeddings, count-based methods can beat inference-based methods such as word2vec, depending on hyperparameter tuning.
• NL-FIIT at IEST-2018: Emotion Recognition utilizing Neural Networks and Multi-level Preprocessing
◦ Samuel Pecar, Slovak University of Technology, et al. EMNLP
◦ Examines the importance of preprocessing when working with user-generated content. In particular, careful recognition of emoticons and emoji succeeded in raising the score.
#recent study #ugc #word2vec

Slide 7

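Slide 7 text

Related work (preprocessing)
• On stopwords, filtering and data sparsity for sentiment analysis of Twitter
◦ Hassan Saif, Knowledge Media Institute, The Open University, et al. LREC
◦ Whether stopword removal helps depends heavily on how the word list is built and on the task, but for Twitter sentiment the paper showed that the generic approach does more harm than good.
• A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis
◦ Symeon Symeonidis, Expert Systems with Applications
◦ After trying a range of preprocessing techniques, found that for sentiment analysis, lemmatization, number removal, and replacing contractions contributed most to the score.
#recent study #preprocessing #emotion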

Slide 8

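Slide 8 text

A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks
#abstract
Nastaran Babanejad, Department of Electrical Engineering and Computer Science, et al. ACL 2020
The paper examines how preprocessing affects word embeddings, particularly in emotion-recognition tasks, and tests whether commonly used experimental settings are really sound.
• Task-specific fine-tuning and model improvements matter, but prior work shows that the effects of preprocessing and hyperparameters cannot be ignored.
• The paper applies preprocessing both before and after the word embeddings are trained on the training data, and tries both matching and not matching it with the preprocessing of the test data.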

Slide 9

Slide 9 text

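#key points
Why I am presenting this paper
• There are plenty of interesting new methods, but I felt that few of them deliver accuracy in real-world operation.
• In the end, the choice of preprocessing and differences in technique strongly affect scores, yet almost no papers discuss this.
• I thought this could be a good opportunity to pull together the tacit knowledge around preprocessing.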

Slide 10

Slide 10 text

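#key points
What was examined
• What effect does integrating preprocessing into word embeddings have?
• Which preprocessing techniques are effective for sentiment-analysis-type tasks?
• Do the resulting embeddings improve on pretrained ones?
What was done
• On affective tasks using 2 training corpora and 9 test datasets, compared applying preprocessing to the training data, to the classification data, and to both.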

Slide 11

Slide 11 text

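#preprocessing #pipeline
The preprocessing flow in NLP
Stages, in order: cleaning, splitting, normalization, compression, vectorization
Steps shown: removal of tags and symbols, punctuation, morphological analysis, dictionary additions, dependency parsing, number replacement, recognition of emoticons and the like, spellcheck, spelling variants, lowercasing, replacement with a representative word, abbreviations, lemmatization, stemming, negation, ontology, stopword removal, POS, CBOW, skip-gram, BERT, coverage checks, bringing the vocabulary closer to the classification data, etc.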
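
As a rough illustration of how the first four stages chain together, here is a minimal sketch with one step per stage. The stopword list, the function names, and the assignment of each step to a stage are assumptions for illustration only; whitespace splitting stands in for real morphological analysis.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"i", "this", "the", "a", "is"}


def clean(text):
    """Cleaning: drop markup tags, then replace punctuation with spaces."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"[^\w\s]", " ", text)


def split(text):
    """Splitting: whitespace tokenization (a stand-in for morphological analysis)."""
    return text.split()


def normalize(tokens):
    """Normalization: lowercasing only."""
    return [t.lower() for t in tokens]


def compress(tokens):
    """Compression: stopword removal only (assumed to belong to this stage)."""
    return [t for t in tokens if t not in STOPWORDS]


def preprocess(text):
    """Run the four stages in order; vectorization would follow on the result."""
    return compress(normalize(split(clean(text))))


result = preprocess("<b>I LOVE this movie!!!</b>")
print(result)  # ['love', 'movie']
```

Keeping each stage as its own function makes it easy to switch individual steps on or off, which is the kind of comparison the paper runs.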

Slide 12

Slide 12 text

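#preprocessing #negation
Negation
• Building an antonym dictionary
◦ Build the antonym dictionary from the WordNet corpus.
◦ If a word has no antonym or exactly one, use that result as is; if it has several, either take the antonym with the highest frequency in the ukWaC corpus or simply pick one at random.
• Replacing negated words with antonyms
◦ When a negation word is found, extract the word that follows and look it up in the antonym dictionary. If an antonym is found, replace the negation word and the negated word with it.
◦ For example, in a sentence containing 'not happy', identify the negation word ('not') and the word it applies to ('happy'), look up the antonym of 'happy' ('sad') in the antonym dictionary, and replace 'not happy' with 'sad'.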
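
The two steps on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the candidate antonyms and the frequency table are hypothetical toy data standing in for WordNet and ukWaC, and the negator list and function names are assumptions.

```python
# Hypothetical toy data standing in for WordNet antonyms and ukWaC frequencies.
CANDIDATES = {"happy": ["sad", "unhappy"], "good": ["bad"]}
FREQUENCY = {"sad": 120, "unhappy": 45, "bad": 300}
NEGATORS = {"not", "never", "no"}  # assumed list of negation words


def build_antonym_dict(candidates, frequency):
    """Keep one antonym per word: the most frequent of its candidates."""
    return {
        word: max(options, key=lambda w: frequency.get(w, 0))
        for word, options in candidates.items()
        if options
    }


def replace_negation(tokens, antonyms, negators=NEGATORS):
    """Replace 'negator + word' with the word's antonym when one is known."""
    out = []
    i = 0
    while i < len(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tokens[i] in negators and nxt in antonyms:
            out.append(antonyms[nxt])
            i += 2  # skip both the negator and the negated word
        else:
            out.append(tokens[i])
            i += 1
    return out


antonyms = build_antonym_dict(CANDIDATES, FREQUENCY)
replaced = replace_negation("i am not happy today".split(), antonyms)
unchanged = replace_negation("i am not tired".split(), antonyms)
print(replaced)   # ['i', 'am', 'sad', 'today']
print(unchanged)  # ['i', 'am', 'not', 'tired']
```

When no antonym is known for the negated word, the tokens pass through untouched, matching the "use as is" rule above.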

Slide 13

Slide 13 text

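#corpus #training #dataset
Training Corpus
Preprocessing is applied to two corpora that differ in size and character.
• News
◦ Articles collected from American publications over a span of years.
• Wikipedia
◦ A corpus of Wikipedia articles, considerably larger than News.
Overall, stopword removal and POS barely change the vocab size but greatly reduce the corpus size.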
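
The difference between corpus size (token count) and vocab size (distinct words) can be checked with a few lines. The sentence and stopword list here are hypothetical; on a toy example the vocabulary shrinks far more than it would on a real corpus, so only the direction of the effect is meaningful.

```python
STOPWORDS = {"the", "on", "and"}  # hypothetical stopword list


def corpus_stats(tokens):
    """Return (corpus size in tokens, vocabulary size in distinct words)."""
    return len(tokens), len(set(tokens))


tokens = "the cat sat on the mat and the dog sat on the rug".split()
filtered = [t for t in tokens if t not in STOPWORDS]

size_before, vocab_before = corpus_stats(tokens)
size_after, vocab_after = corpus_stats(filtered)

# Fraction of tokens and of distinct words removed by the filter.
corpus_drop = 1 - size_after / size_before
vocab_drop = 1 - vocab_after / vocab_before
print(size_before, size_after, vocab_before, vocab_after)
```

Stopwords are few in kind but frequent in use, so removing them removes a larger share of tokens than of distinct words.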

Slide 14

Slide 14 text

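#corpus #evaluation #dataset
Evaluating Corpus
Evaluated on three tasks: sentiment analysis, emotion classification, and sarcasm detection.
Sentiment analysis (positive/negative sentiment)
• IMDB
◦ Movie reviews labeled positive or negative.
• SemEval
◦ Tweets labeled positive or negative.
• Airline
◦ Tweets about airlines.
Emotion detection (emotion class classification)
• ISEAR
◦ Personal accounts of emotion-evoking experiences.
• Alm
◦ Fairy tales.
• SSEC
◦ Tweets from SemEval, re-annotated.
Sarcasm detection
• Onion
◦ News headlines collected from sarcastic and non-sarcastic media outlets.
• IAC
◦ Utterance-response pairs.
• Reddit
◦ Reddit posts labeled by their authors.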

Slide 15

Slide 15 text

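#result
Negation was the most effective technique on every dataset
F-scores when word embeddings are built after preprocessing the News corpus
• Even when combinations of the other preprocessing techniques are included, negation alone consistently ranked among the highest-scoring settings.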

Slide 16

Slide 16 text

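#result
• (Generic) stopwords and stemming may look useful on their own, but they turn out not to contribute to the score when combined with other preprocessing.
• Applying all preprocessing gives the same score as negation alone or a slightly lower one (Onion, Reddit, SSEC).
• Stopwords and POS greatly shrink the corpus, yet POS causes no drop in score.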

Slide 17

Slide 17 text

#result
F-score comparison of CBOW, Skip-gram, and BERT using the Wikipedia corpus
• The same trend appeared with Wikipedia, and more strongly.

Slide 18

Slide 18 text

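#result #preprocess #postprocess
Comparison of applying preprocessing to the training corpus (pre) and to the classification dataset (post)
• As intuition suggests, post alone gave the lowest score in every case.
• pre and both showed no large difference in score, indicating that pre matters most.
• In general, when word embeddings are given, I think people more often adapt the classification data to them, so this result was unexpected.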
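
The three conditions compared on this slide differ only in which data the preprocessing touches. The sketch below spells that out; the function names, the stopword filter used as the example preprocessing step, and the toy sentences are assumptions for illustration.

```python
def build_conditions(train_corpus, eval_data, preprocess):
    """Return {condition: (corpus for training embeddings, classification data)}."""
    pre_corpus = [preprocess(s) for s in train_corpus]
    post_data = [preprocess(s) for s in eval_data]
    return {
        "pre": (pre_corpus, eval_data),     # preprocess the training corpus only
        "post": (train_corpus, post_data),  # preprocess the classification data only
        "both": (pre_corpus, post_data),    # preprocess both
    }


def drop_stopwords(tokens, stopwords=frozenset({"the", "a"})):
    """Example preprocessing step: remove a hypothetical set of stopwords."""
    return [t for t in tokens if t not in stopwords]


train = [["the", "movie", "was", "great"]]
evald = [["a", "dull", "movie"]]
conditions = build_conditions(train, evald, drop_stopwords)
for name, (emb_corpus, clf_data) in conditions.items():
    print(name, emb_corpus, clf_data)
```

In each condition the embeddings would be trained on the first element and the classifier evaluated on the second, which is what makes the pre/post/both scores comparable.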

Slide 19

Slide 19 text

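#result #compare with SoTA
Evaluation of the proposed models against SoTA baselines
• The proposed methods outperform SoTA on all tasks.
• BERT being the strongest is to be expected, so the comparison is a little unfair, but the proposed methods as a whole win on many datasets (IMDB, IAC, Onion, Reddit, SSEC).
• SoTA pretrained models use far larger corpora than the proposed methods, which underlines the importance of pre.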

Slide 20

Slide 20 text

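#result #relative improvement
Absolute f-scores and relative improvement over basic preprocessing
• The improvement is slightly larger on the multiclass classification task than on the two binary tasks, sentiment analysis and sarcasm detection.
• My sense is that the differences remain marginal until the comparison covers more datasets.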
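
One common way to express relative improvement is as a percentage change from the baseline score. The slide does not state the exact formula used, so the definition below is an assumption, and the F-scores are hypothetical values, not results from the paper.

```python
def relative_improvement(score, baseline):
    """Percentage change of `score` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (score - baseline) / baseline * 100.0


# Hypothetical F-scores, not taken from the paper.
baseline_f1 = 0.80  # basic preprocessing
best_f1 = 0.88      # best preprocessing setting

gain = relative_improvement(best_f1, baseline_f1)
print(f"absolute: {best_f1:.2f}, relative: {gain:+.1f}%")
```

Reporting both numbers matters because the same absolute gain is a larger relative gain when the baseline is low, as is typical for multiclass tasks.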

Slide 21

Slide 21 text

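#key points
Summary
• Preprocessing at the word-embedding stage was shown to be the most effective timing.
• Taken individually, negation is the most effective, while generic stopwords and stemming often lower the score.
• Rather than using word embeddings pretrained on a huge corpus, applying task-appropriate preprocessing at the right time can achieve higher scores.
• Much of this was already known as tacit knowledge, but rigorous verification turned it into systematic knowledge.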