[SNLP2020] The Curious Case of Neural Text Degeneration

Presented by: Shun Kiyono (RIKEN AIP / Inui Lab, Tohoku University)
Paper: The Curious Case of Neural Text Degeneration
Published as a conference paper at ICLR 2020
Ari Holtzman †‡, Jan Buys §†, Li Du †, Maxwell Forbes †‡, Yejin Choi †‡
† Paul G. Allen School of Computer Science & Engineering, University of Washington
‡ Allen Institute for Artificial Intelligence
§ Department of Computer Science, University of Cape Town
{ahai,dul2,mbforbes,yejin}@cs.washington.edu, [email protected]
Abstract (excerpt, cut off on the slide): Despite considerable advances in neural language modeling, it remains an open question what the best decoding strategy is for text generation from a language model (e.g. to generate a story). The counter-intuitive empirical observation is that even though the use of likelihood as training objective leads to high quality models for a broad range of language understanding tasks, maximization-based decoding methods such as beam search lead to degeneration: output text that is ...
Note: figures and tables without an annotation are quoted from the paper.

About this paper
• State of the art?
  • First appeared on arXiv in April 2019
  • Accepted at ICLR 2020, so let's call it state of the art
• Why I read it
  • Knowing the concept of degeneration may come in handy

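A caveat: what "generation" means here
• Note that this talk is not about encoder-decoder models
• "Generation" in this paper: context-conditioned generation with a language model
  • No encoder is involved
  • The authors call it open-ended generation
Annotation on the title "The Curious Case of Neural Text Degeneration": it says "Generation", but...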

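What kind of paper is it?
• Background / problem
  • Text generation with a pretrained language model (GPT-2) suffers from degeneration
  • Degeneration: repetitive generation and incoherent generation
• Idea
  • Prevent degeneration by changing the decoding method
  • Nucleus Sampling: adjust the k of top-k sampling dynamically
• Contributions
  • Showed that degeneration exists
  • Showed the benefits of Nucleus Sampling both qualitatively and quantitatively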

We are truly in the great age of language models
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

We are truly in the great age of language models
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
Slide annotations: "Today's talk is about this one"; "GPT-3 175b"

Generating text with a pretrained language model
• Try generating some text with GPT-2
• Use beam search for decoding
GPT-2 w/ beam search:
  An unprecedented number of mostly young whales have become stranded on the West Australian coast since 2008. The number of stranded whales has increased by more than 50 per cent in the past year, with the number of stranded whales on the West Australian coast increasing by more than 50 per cent in the past year. The number of whales stranded on the West Australian coast has increased by more than 50 per cent in the past year.
Observation: there seems to be a lot of repetition.
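For readers unfamiliar with beam search, here is a minimal sketch of the decoding loop. It is not the decoder used in the paper or in GPT-2: the function names, the toy model `TOY`, and its probabilities are all made up for illustration.

```python
import math

def beam_search(next_probs, start, beam_size, steps):
    """Keep the beam_size highest-scoring sequences at every step.

    next_probs(seq) must return a {token: probability} dict for the next token.
    """
    beams = [(0.0, [start])]  # (log-probability, sequence)
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            for tok, p in next_probs(seq).items():
                if p > 0:
                    candidates.append((score + math.log(p), seq + [tok]))
        # Maximization-based decoding: only the most probable sequences survive.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beams[0][1]

# Hypothetical toy model: the next token depends only on the last token.
TOY = {
    "The": {"number": 0.6, "whales": 0.4},
    "number": {"of": 0.5, "has": 0.5},
    "whales": {"stranded": 1.0},
}

def toy_next(seq):
    return TOY[seq[-1]]

# Only two steps are defined in the toy model, so steps must not exceed 2 here.
print(beam_search(toy_next, "The", beam_size=2, steps=2))
```

With `beam_size=1` this reduces to greedy decoding. The toy model is far too small to reproduce the repetition seen above; the sketch only shows that the search always favors the highest-probability continuation.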

Investigating why the repetition persists
• Feed the model a sequence that repeats "I don't know" 200 times
• The generation probability rises with every repetition
• The model ends up unable to stop saying "I don't know"
[Figure 4 from the paper: y-axis "probability" (0.4 to 1), tokens "I do n't know ."; slide annotation: predicted probability of each token]
Figure 4: The probability of a repeated phrase increases with each repetition, creating a positive feedback loop. We found this effect to hold for the vast majority of phrases we tested, regardless of phrase length or if the phrases were sampled randomly rather than taken from human text.

Improvement 1: generate by sampling
• Maybe choosing high-probability sequences with beam search is the problem?
• Try generating by sampling instead
  • Simply sample from the word probability distribution output by the model
GPT-2 w/ sampling:
  An unprecedented number of mostly young whales have become stranded on the West Australian coast since 2008. The Australian Food Safety Authority has warned Australia's beaches may be revitalised this year because healthy seabirds and seals have been on the move. More than 50,000 seabirds, sea mammals and seahorses have been swept into the sea by the Holden CS118 and Adelaide Airport CS300 from 2013.
Observation: phrases unrelated to the context appear.
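Sampling directly from the model's distribution can be sketched in a few lines. The distribution below is invented for illustration (it is not GPT-2 output), and `sample_token` is a hypothetical helper name.

```python
import random

def sample_token(probs, rng=random):
    """Draw one token from a {token: probability} mapping."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    # random.choices samples proportionally to the given weights.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution with a small unreliable tail.
next_token_probs = {"whales": 0.5, "dolphins": 0.3, "seals": 0.15, "Holden": 0.05}

rng = random.Random(0)
draws = [sample_token(next_token_probs, rng) for _ in range(1000)]
print({t: draws.count(t) for t in next_token_probs})
```

Because every token with nonzero probability can be drawn, low-probability tail tokens still show up now and then, which is one way to read the off-topic phrases in the sample above.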

Improvement 2: generate with top-k
• Sampling: completely unrelated words can also be selected
• Try generating with top-k sampling
  • Sample only from the top k words of the probability distribution
[Figure 5 from the paper: The probability mass assigned to partial human sentences. Slide annotation: what sampling from the top 15 looks like]
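Top-k sampling can be sketched as truncating the distribution to its k most probable tokens, renormalizing, and then sampling. The helper names and the example distribution are assumptions for illustration only.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample_top_k(probs, k, rng=random):
    """Sample one token from the renormalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]

# Hypothetical next-token distribution (made-up numbers).
next_token_probs = {"whales": 0.5, "dolphins": 0.3, "seals": 0.15, "Holden": 0.05}
print(top_k_filter(next_token_probs, 2))
print(sample_top_k(next_token_probs, 2, random.Random(0)))
```

With `k=2`, the tail tokens "seals" and "Holden" can no longer be drawn, whatever their probabilities were.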

Improvement 2: generate with top-k
• Sampling: completely unrelated words can also be selected
• Try generating with top-k sampling
  • Sample only from the top k words of the word probability distribution
GPT-2 w/ top-k sampling:
  An unprecedented number of mostly young whales have become stranded on the West Australian coast since 2008. Pumping Station #3 shut down due to construction damage Find more at: www.abc.net.au/environment/species-worry/in-the-top-10-killer-whale-catastrophes-in-history.html “In the top 10 killer whale catastrophes in history: 1) 1986: Up to 12 orcas struck by lightning; many drowned and many more badly injured.
Observation: a combination of repetitive generation and unrelated phrases.

Problem with top-k: k is fixed
• When the predicted distribution is flat: we want a large k
  • With a small k, the model tends to produce only similar words
  • We want it to be able to produce a variety of words
[Figure from the paper (cropped)]

Problem with top-k: k is fixed
• When the predicted distribution is peaked: we want a small k
  • With a large k, there is a risk of selecting low-probability words
[Figure from the paper (cropped); slide annotation: ideally only something like "hot" or "cooling" gets selected]
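The fixed-k problem on these two slides can be made concrete by checking how much probability mass the same k covers under a flat and a peaked distribution. Both distributions below are invented for illustration, and `top_k_mass` is a hypothetical helper.

```python
def top_k_mass(probs, k):
    """Probability mass covered by the k most probable tokens."""
    return sum(sorted(probs, reverse=True)[:k])

# Made-up distributions over a 10-word vocabulary (illustrative only).
flat = [0.1] * 10
peaked = [0.9, 0.05] + [0.00625] * 8

k = 3
print(f"flat:   top-{k} covers {top_k_mass(flat, k):.2f} of the mass")
print(f"peaked: top-{k} covers {top_k_mass(peaked, k):.2f} of the mass")
```

In the flat case, `k = 3` throws away most of the mass even though all ten words are equally plausible. In the peaked case, the same k still admits a third word whose probability is only 0.00625.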

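Proposed method: Nucleus Sampling
• Idea: why not determine k dynamically?
• Use the top-p vocabulary $V^{(p)}$
  • $V^{(p)}$: the smallest subset of $V$ that satisfies the inequality below
  • Intuition 1: when the distribution is flat, $V^{(p)}$ contains many words
  • Intuition 2: when the distribution is peaked, $V^{(p)}$ contains few words
Definition (from the paper): the top-p vocabulary $V^{(p)} \subset V$ is the smallest set such that
$$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \ge p$$
• $p$: a threshold (hyperparameter), e.g. 0.95
• Keep adding words until their cumulative probability reaches $p$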

Generating with Nucleus Sampling
• The repetition problem is solved (or so it seems)
• No unrelated words are generated (or so it seems)
GPT-2 w/ nucleus sampling:
  An unprecedented number of mostly young whales have become stranded on the West Australian coast since 2008. There has been an unprecedented number of calves caught in the nets of whaling stations that operate in WA. Pilot whales continue to migrate to feeding grounds to feed their calves. They are now vulnerable due to the decline of wild populations; they are restricted to one breeding site each year. Image copyright Yoon Bo Kim But, with sharp decline in wild populations the size of the Petrels are shrinking and dwindling population means there will only be room for a few new fowl.
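The top-p vocabulary behind Nucleus Sampling can be sketched as follows: sort tokens by probability, keep adding them until the cumulative mass reaches p, renormalize, then sample. This is a minimal illustration rather than the authors' implementation; the helper names and both example distributions (including the "hot"/"cooling" probabilities) are made up.

```python
import random

def nucleus(probs, p):
    """Smallest set of top tokens whose total probability reaches p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        mass += prob
        if mass >= p:
            break
    return {tok: prob / mass for tok, prob in kept}

def sample_nucleus(probs, p, rng=random):
    """Sample one token from the renormalized top-p vocabulary."""
    filtered = nucleus(probs, p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]

# Hypothetical distributions: one flat, one peaked.
flat = {f"w{i}": 0.1 for i in range(10)}
peaked = {"hot": 0.85, "cooling": 0.12, "warm": 0.02, "on": 0.01}

print(len(nucleus(flat, 0.95)), len(nucleus(peaked, 0.95)))  # → 10 2
print(sample_nucleus(peaked, 0.95, random.Random(0)))
```

The same threshold `p = 0.95` keeps all ten words in the flat case and only two in the peaked case, so the effective k adapts to the shape of the distribution.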

Experimental setup
• Task: context-conditioned generation
  • Input: the first 1 to 40 tokens of a passage
  • Output: generate the rest of the passage
• Data: 5000 passages taken from the web
• Model: GPT-2 (large)
  • Trained on 40GB of text
[Same "An unprecedented number of mostly young whales ..." example as on the previous slide]
Slide annotation: the experiment amounts to doing this for every method and every passage.

Experimental results
Method               Perplexity   Self-BLEU   Repetition (%)   HUSE
Human                     12.38        0.31             0.28    n/a
Beam Search (b=16)         1.48        0.44            28.94    n/a
Sampling                  22.73        0.28             0.22   0.67
Top-k (k=40)               6.88        0.39             0.78   0.19
Top-k (k=640)             13.82        0.32             0.28   0.94
Nucleus (p=0.95)          13.13        0.32             0.36   0.97

How to read the results (1): Human
• Human row (Perplexity 12.38, Self-BLEU 0.31, Repetition 0.28%, HUSE n/a): text from the original corpus, i.e. written by humans

How to read the results (2): evaluation metrics
• The closer a value is to the Human row, the better
  • Perplexity: the usual perplexity
  • Self-BLEU: BLEU against the model's own generations; a measure of diversity
  • Repetition (%): the proportion of texts that contain repetition
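As a reminder of what "the usual perplexity" computes, here is a minimal sketch that takes the model probability of each token in a text and returns the exponentiated average negative log-probability. The function name and the input probabilities are made up; the paper's exact evaluation pipeline is not reproduced here.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical per-token probabilities assigned by a model.
confident_text = [0.9, 0.8, 0.95, 0.85]  # mostly high-probability tokens
surprising_text = [0.2, 0.05, 0.1, 0.3]  # many low-probability tokens

print(perplexity(confident_text))
print(perplexity(surprising_text))
```

Text made only of high-probability tokens gets a perplexity close to 1, which is presumably why beam search lands far below the Human value, while text with many unlikely tokens scores much higher.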

How to read the results (2): evaluation metrics
• HUSE: the higher, the better
  • Roughly speaking, it corresponds to human evaluation

Results: is Nucleus Sampling really that good?
• Nucleus achieves good values on every metric (or so it seems)
  • In particular, it has the best HUSE (human evaluation) score: 0.97
• Is there hardly any difference from Top-k (k=640), whose HUSE is 0.94?
  • Then wouldn't top-k be good enough?
  • According to the authors, a k of 100 or more is not a "typical" value

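(Recap) What kind of paper is it?
• Background / problem
  • Text generation with a pretrained language model (GPT-2) suffers from degeneration
  • Degeneration: repetitive generation and incoherent generation
• Idea
  • Prevent degeneration by changing the decoding method
  • Nucleus Sampling: adjust the k of top-k sampling dynamically
• Contributions
  • Showed that degeneration exists
  • Showed the benefits of Nucleus Sampling both qualitatively and quantitatively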

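Impressions
• It is interesting that degeneration occurs even in a language model like GPT-2
  • Seems worth knowing about as a phenomenon
• It is unclear how far this generalizes to "language models" in general
  • Does the same problem occur with small-scale data and models?
  • What about GPT-3?
• Nucleus sampling looks like a good method, but it comes across as a well-crafted heuristic
  • I would rather revisit the training method itself → unlikelihood loss?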
