Slide 1

Slide 1 text

Paper introduction: MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao†, Maxime Peyrard†, Fei Liu‡, Yang Gao†, Christian M. Meyer†, Steffen Eger†
EMNLP 2019
Presenter: Taichi Aida, Natural Language Processing Laboratory, Nagaoka University of Technology

Slide 2

Slide 2 text

Abstract
• Investigates robust evaluation metrics for text generation tasks
• The combination of contextualized word embeddings and Word Mover's Distance performed best
• Source code is public: https://github.com/AIPHES/emnlp19-moverscore

Slide 3

Slide 3 text

Related work
• Various evaluation methods (1)
  • Summarization: ROUGE (Lin 2004)
  • Machine translation: BLEU (Papineni 2002), RUSE (Shimanaka 2018)
  • Image captioning: BLEU, CIDEr (Vedantam 2015), SPICE (Anderson 2016)
• Note: BLEU is not a good fit

Slide 4

Slide 4 text

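Related work
• Various evaluation methods (2)
  • Semantic similarity: BERTScore (Zhang 2019), which appears as a baseline in the experiments
  • Translation: supervised and unsupervised metrics based on BERT embeddings (Mathur 2019)
  • Summarization and essay scoring: ELMo + Sentence Mover's Similarity (Clark 2019)
• Methods that use contextualized representations are becoming more common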

Slide 5

Slide 5 text

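Method
• Investigates a metric (MoverScore) that can evaluate a variety of generation tasks
• Measures the similarity (?) between the generated text and the reference text
  • Contextualized embeddings: BERT, ELMo
  • Semantic distance between the output and the reference: Word Mover's Distance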
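To make the "semantic distance between the output and the reference" concrete, here is a minimal sketch of Word Mover's Distance in a deliberately restricted setting. It assumes both sentences have the same number of words and that every word carries the same weight; under those assumptions an optimal transport plan is a one-to-one matching, so a brute-force search over permutations is exact. The function name `wmd_uniform` and the toy 2-D vectors are hypothetical and stand in for real BERT or ELMo embeddings; this is not the implementation in the released MoverScore code.

```python
from itertools import permutations
from math import dist


def wmd_uniform(hyp_vecs, ref_vecs):
    """WMD between two equal-length sentences with uniform word weights.

    Assumption of this sketch: with weight 1/n on every word on both sides,
    some optimal transport plan is a permutation, so we can search all n!
    one-to-one matchings. Only practical for very short sentences.
    """
    n = len(hyp_vecs)
    if n == 0 or n != len(ref_vecs):
        raise ValueError("this sketch needs two non-empty sentences of equal length")
    # cost[i][j] = Euclidean distance between hypothesis word i and reference word j
    cost = [[dist(h, r) for r in ref_vecs] for h in hyp_vecs]
    # Total cost of the cheapest one-to-one matching
    best = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )
    # Each matched pair moves a mass of 1/n
    return best / n


# Toy 2-D "embeddings" (hypothetical values, not real model outputs)
hypothesis = [(0.0, 0.0), (1.0, 0.0)]
reference = [(0.0, 1.0), (1.0, 1.0)]
print(wmd_uniform(hypothesis, reference))
```

A smaller value means the two sentences are closer in embedding space. Sentences of different lengths or with non-uniform word weights need a general optimal-transport solver instead of this permutation search.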

Slide 6

Slide 6 text

Method
• MoverScore variations
  • Granularity: n-gram (n = 1, 2, size-of-sentence)
  • Embedding: word2vec, BERT, ELMo
  • Fine-tuning (BERT): MultiNLI and QANLI (NLI), QQP (paraphrasing)
  • Aggregation (ELMo, BERT): power means, routing mechanism

Slide 7

Slide 7 text

Method
• MoverScore variations
  • Granularity: n-gram (n = 1, 2, size-of-sentence)
  • Embedding: word2vec, BERT, ELMo
  • Fine-tuning: MultiNLI, QANLI, QQP
  • Aggregation (BERT, ELMo): power means, routing mechanism

Slide 8

Slide 8 text

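Method
• Aggregation (how the layers are combined)
  • Contextualized embeddings: BERT, ELMo
  • Each word receives a different vector from each layer
  • Power means: take the power mean of the layer vectors with exponent p (p = 1, ±∞) and concatenate the results
  • Routing mechanism: see (Zhang 2018) for details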
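The power-means aggregation can be sketched as follows. The sketch assumes that one token has one vector per layer, that the mean is taken element-wise across layers, and that only the three exponents named on the slide are needed: p = 1 is the arithmetic mean, p = +∞ is the element-wise maximum, and p = -∞ is the element-wise minimum. The function names and the toy layer values are hypothetical.

```python
def power_mean(layer_vectors, p):
    """Element-wise power mean across the layer vectors of one token.

    Only the exponents from the slide are handled:
    p = 1 (arithmetic mean), p = +inf (maximum), p = -inf (minimum).
    """
    # Regroup so that each tuple holds one dimension's values across all layers
    per_dimension = list(zip(*layer_vectors))
    if p == float("inf"):
        return [max(values) for values in per_dimension]
    if p == float("-inf"):
        return [min(values) for values in per_dimension]
    if p == 1:
        return [sum(values) / len(values) for values in per_dimension]
    raise ValueError("this sketch supports only p = 1, +inf, -inf")


def aggregate_layers(layer_vectors, exponents=(1, float("inf"), float("-inf"))):
    """Concatenate the power means for every exponent into one token vector."""
    aggregated = []
    for p in exponents:
        aggregated.extend(power_mean(layer_vectors, p))
    return aggregated


# Two layers, two dimensions each (hypothetical values)
layers = [[1.0, 4.0], [3.0, 2.0]]
print(aggregate_layers(layers))  # [2.0, 3.0, 3.0, 4.0, 1.0, 2.0]
```

With three exponents, the aggregated vector is three times as long as a single layer vector, because the three means are concatenated rather than averaged again.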

Slide 9

Slide 9 text

Method
• Semantic distance between the output and the reference
  • Word Mover's Distance (WMD)
  • Sentence Mover's Distance (SMD)
• The combinations above are each evaluated with both WMD and SMD
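For reference, WMD is commonly written as an optimal-transport problem. The notation below is an assumption made for illustration and does not come from the slides: $u_i$ and $v_j$ are the embeddings of the $n$ output units and $m$ reference units, $a$ and $b$ are non-negative weights that each sum to 1, $C_{ij}$ is the cost of moving unit $i$ to unit $j$, and $F_{ij}$ is the amount of weight moved.

```latex
\mathrm{WMD}(x, y) \;=\; \min_{F \in \mathbb{R}_{\ge 0}^{n \times m}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} C_{ij} F_{ij}
\quad \text{s.t.} \quad
\sum_{j=1}^{m} F_{ij} = a_i \;\; (i = 1, \dots, n), \qquad
\sum_{i=1}^{n} F_{ij} = b_j \;\; (j = 1, \dots, m),
\qquad C_{ij} = \lVert u_i - v_j \rVert_2
```

If each text is represented by a single sentence vector, then $n = m = 1$, the only feasible plan is $F_{11} = 1$, and the expression reduces to the distance $\lVert u_1 - v_1 \rVert_2$ between the two sentence vectors.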

Slide 10

Slide 10 text

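Experiment
• Tasks
  • Machine translation
  • Summarization
  • Dialogue (task-oriented)
  • Image captioning
• Data: pairs of (reference text, outputs from multiple systems); the system outputs carry human judgments
• What the evaluation metrics, including MoverScore, do
  • Score the system outputs
  • Measure the correlation with the human judgments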
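The evaluation protocol above amounts to correlating metric scores with human scores over the same set of system outputs. The sketch below uses the Pearson correlation coefficient as one common choice; the slide does not say which coefficient is used for which task, so treat that as an assumption. The function name and all score values are hypothetical.

```python
from math import sqrt


def pearson(metric_scores, human_scores):
    """Pearson correlation between metric scores and human judgments."""
    n = len(metric_scores)
    if n != len(human_scores) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2 items)")
    mean_m = sum(metric_scores) / n
    mean_h = sum(human_scores) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((m - mean_m) * (h - mean_h) for m, h in zip(metric_scores, human_scores))
    var_m = sum((m - mean_m) ** 2 for m in metric_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    if var_m == 0 or var_h == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_m * var_h)


# Hypothetical scores for four system outputs
metric = [0.62, 0.48, 0.91, 0.30]
human = [3.5, 3.0, 4.8, 2.1]
print(pearson(metric, human))
```

A coefficient close to 1 means the metric ranks and spaces the outputs much like the human annotators do, which is the property the experiments compare across metrics.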

Slide 11

Slide 11 text

Experiment
• Machine translation
  • Dataset: WMT2017
  • Each participating system's output has human judgments from at least 15 annotators
  • Baselines: SentBLEU, METEOR++, RUSE, BERTScore (Zhang 2019)

Slide 12

Slide 12 text

Result
• WMD+BERT+MNLI+PMeans outperforms the baselines

Slide 13

Slide 13 text

Result
• Is information lost when using sentence representations?

Slide 14

Slide 14 text

Experiment
• Summarization
  • Dataset: TAC-2008, TAC-2009
  • Responsiveness: content plus grammatical quality
  • Pyramid: how much of the important content in the references is covered
  • Baselines: ROUGE-1, ROUGE-2, S3 best (Peyrard 2017; a supervised metric), BERTScore (Zhang 2019)

Slide 15

Slide 15 text

Result
• WMD+BERT+MNLI+PMeans outperforms the baselines

Slide 16

Slide 16 text

Experiment
• Dialogue (task-oriented)
  • Dataset: BAGEL, SFHOTEL
  • Informativeness (Inf): amount of information provided
  • Naturalness (Nat): closeness to a human response
  • Quality (Qual): fluency and grammar
  • Baselines: BLEU, METEOR, BERTScore (Zhang 2019)

Slide 17

Slide 17 text

Result
• Correlations are low overall, but the proposed method is among the higher ones

Slide 18

Slide 18 text

Experiment
• Image captioning
  • Dataset: MSCOCO
  • Evaluations M1 to M5 are available
  • This study uses M1 and M2, which concern overall quality
  • Baselines: CIDEr, SPICE, METEOR, LEIC (Cui 2018; a supervised metric), BERTScore (Zhang 2019)

Slide 19

Slide 19 text

Result
• Falls short of the LEIC baseline, but still shows high correlation
• M: BERT fine-tuned on MultiNLI
• P: Power Means used for ELMo / BERT aggregation

Slide 20

Slide 20 text

Discussion
• Comparison with BERTScore, which appeared as a baseline in the experiments

Slide 21

Slide 21 text

Discussion
• Comparison with BERTScore, which appeared as a baseline in the experiments
  • Strong one-to-one alignment
  • Weak many-to-one alignment
  • WMD obtains an appropriate distance

Slide 22

Slide 22 text

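Discussion
• Machine translation outputs are split into two groups, those with high human ratings (good) and those with low human ratings (bad), and the score distributions are examined
• Compared metrics
  • Baseline: SentBLEU
  • Proposal: MoverScore (WMD+BERT)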

Slide 23

Slide 23 text

Discussion
• SentBLEU scores are mostly distributed in the middle range even when the human rating is good
• MoverScore cleanly separates the two extremes

Slide 24

Slide 24 text

Conclusion
• Proposes an unsupervised evaluation metric for generation tasks
• Exceeds or comes close to the baselines on four generation tasks
• Source code is public: https://github.com/AIPHES/emnlp19-moverscore