
[SNLP2020] The Curious Case of Neural Text Degeneration

Shun Kiyono
September 16, 2020

Slides for the 12th Saisentan NLP Benkyokai (SNLP, a study group on state-of-the-art NLP).

Transcript

  1. Presenter
    RIKEN AIP / Tohoku University, Inui Lab
    Shun Kiyono
    The Curious Case of
    Neural Text Degeneration
    Published as a conference paper at ICLR 2020
    THE CURIOUS CASE OF
    NEURAL TEXT DeGENERATION
    Ari Holtzman†‡, Jan Buys§†, Li Du†, Maxwell Forbes†‡, Yejin Choi†‡
    †Paul G. Allen School of Computer Science & Engineering, University of Washington
    ‡Allen Institute for Artificial Intelligence
    §Department of Computer Science, University of Cape Town
    {ahai,dul2,mbforbes,yejin}@cs.washington.edu, [email protected]
    ABSTRACT
    Despite considerable advances in neural language modeling, it remains an open
    question what the best decoding strategy is for text generation from a language
    model (e.g. to generate a story). The counter-intuitive empirical observation is
    that even though the use of likelihood as training objective leads to high quality
    models for a broad range of language understanding tasks, maximization-based
    decoding methods such as beam search lead to degeneration — output text that is
    Note: figures and tables without annotations are quoted from the paper

  2. About this paper
    • State of the art?
    • First appeared on arXiv in April 2019
    • Accepted at ICLR 2020, so let's call it state of the art
    • Why I read it
    • Knowing the concept of degeneration may come in handy

  3. A caveat about "Generation"
    • Note that this is not about encoder-decoder models
    • "Generation" in this paper: context-conditioned generation by a language model
    • No encoder is involved
    • The authors call this open-ended generation
    The Curious Case of
    Neural Text Degeneration
    It does say "Generation", but…

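  4. What is this paper about?
    • Background and problem
    • Text generation with a pretrained language model (GPT-2) suffers from degeneration
    • Degeneration: repetitive generation and incoherent generation
    • Idea
    • Prevent degeneration by changing the decoding method
    • Nucleus Sampling: dynamically adjust the k of top-k sampling
    • Contributions
    • Demonstrated that degeneration exists
    • Showed the benefits of Nucleus Sampling both qualitatively and quantitatively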

  5. The world is truly in a great age of language models
    https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

  6. The world is truly in a great age of language models
    https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
    Today's talk is about this one
    GPT-3 (175B parameters)

  7. Text generation with a pretrained language model
    • Try generating some arbitrary text with GPT-2
    • Try beam search as the search method
    An unprecedented number of
    mostly young whales have
    become stranded on the West
    Australian coast since 2008.
    The number of stranded whales has
    increased by more than 50 per cent in the
    past year, with the number of stranded
    whales on the West Australian coast
    increasing by more than 50 per cent in the
    past year. The number of whales stranded
    on the West Australian coast has increased
    by more than 50 per cent in the past year.
    GPT-2
    w/ beam search
    The output seems to contain a lot of repetition
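
As a rough illustration of why maximization-based decoding can fall into a loop, here is a minimal Python sketch of beam search over a hand-written bigram model. The names `toy_lm` and `beam_search` and all probabilities are assumptions made up for this example; this is not GPT-2 and not the paper's experimental setup.

```python
import math

def beam_search(next_probs, start, steps, beam_size):
    """Return the best (tokens, log-probability) pair found with a fixed-width beam."""
    beams = [([start], 0.0)]  # each beam: (token list, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # Expand every beam with every possible next token.
            for tok, p in next_probs[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the highest-scoring continuations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

# Hypothetical bigram model: next-token probabilities given the previous token.
toy_lm = {
    "whales": {"stranded": 0.6, "migrate": 0.4},
    "stranded": {"whales": 0.7, "calves": 0.3},
    "migrate": {"north": 0.5, "south": 0.5},
    "calves": {"whales": 1.0},
    "north": {"whales": 1.0},
    "south": {"whales": 1.0},
}

tokens, score = beam_search(toy_lm, "whales", steps=6, beam_size=2)
print(" ".join(tokens))
```

With these made-up probabilities the best-scoring path keeps cycling between "whales" and "stranded", because the loop has the highest cumulative probability at every step.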

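  8. Investigating why the repetition persists
    • Feed the model a sequence in which "I don't know" is repeated 200 times
    • The generation probability rises with every repetition
    • The model ends up unable to stop saying "I don't know"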
    [Figure: predicted probability of each token (y-axis "probability", 0.4 to 1.0) as the phrase "I do n't know ." is repeated]
    Figure 4: The probability of a repeated phrase increases with each repetition, creating a positive
    feedback loop. We found this effect to hold for the vast majority of phrases we tested, regardless of
    phrase length or if the phrases were sampled randomly rather than taken from human text.

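  9. Improvement 1: generate by sampling
    • Perhaps picking high-probability sequences with beam search is the problem?
    • Try generating by sampling instead
    • Simply sample from the word probability distribution output by the model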
    An unprecedented number of
    mostly young whales have
    become stranded on the West
    Australian coast since 2008.
    The Australian Food Safety Authority has
    warned Australia's beaches may be
    revitalised this year because healthy
    seabirds and seals have been on the move.
    More than 50,000 seabirds, sea mammals
    and seahorses have been swept into the
    sea by the Holden CS118 and Adelaide
    Airport CS300 from 2013.
    GPT-2
    w/ sampling
    Phrases unrelated to the context show up
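
The sampling step described on this slide can be sketched in a few lines of Python. The distribution `toy_probs` is a hypothetical stand-in for a model's next-token probabilities, and `sample_token` is an illustrative helper, not code from the paper.

```python
import random

def sample_token(probs, rng):
    """Draw one token from the full next-token distribution (pure sampling)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical toy distribution; a real model would supply these values.
toy_probs = {"whales": 0.5, "beaches": 0.3, "seals": 0.15, "Holden": 0.05}

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_token(toy_probs, rng) for _ in range(1000)]
tail_rate = draws.count("Holden") / len(draws)
print(f"share of low-probability draws: {tail_rate:.3f}")
```

Because every token keeps its original probability, the unlikely tail token is still drawn now and then, which is one way unrelated phrases can enter the text.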

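  10. Improvement 2: generate with top-k sampling
    • Sampling: completely unrelated words can also be selected
    • Try generating with top-k sampling
    • Sample only from the k most probable words in the distribution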
    [Figure from the paper (cropped): sampling from the top 15 tokens]
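
A minimal sketch of top-k sampling, assuming a hypothetical toy distribution in place of real model output. `top_k_filter` and `sample_top_k` are illustrative names; the renormalization simply rescales the kept probabilities so they sum to one.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample_top_k(probs, k, rng):
    """Sample one token from the renormalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical toy distribution; a real model would supply these values.
toy_probs = {"whales": 0.5, "beaches": 0.3, "seals": 0.15, "Holden": 0.05}

filtered = top_k_filter(toy_probs, k=2)
choice = sample_top_k(toy_probs, k=2, rng=random.Random(0))
print(filtered, choice)
```

With k=2 only "whales" and "beaches" survive, so the tail token can no longer be drawn.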

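  11. Improvement 2: generate with top-k sampling
    • Sampling: completely unrelated words can also be selected
    • Try generating with top-k sampling
    • Sample only from the k most probable words in the word probability distribution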
    An unprecedented number of
    mostly young whales have
    become stranded on the West
    Australian coast since 2008.
    Pumping Station #3 shut down due to
    construction damage Find more at:
    www.abc.net.au/environment/species-
    worry/ in-the-top-10-killer-whale-
    catastrophes-in-history.html “In the top 10
    killer whale catastrophes in history:
    1) 1986: Up to 12 orcas struck by lightning;
    many drowned and many more badly
    injured.
    GPT-2
    w/ top-k sampling
    A combination of repetitive generation and unrelated phrases

  12. The problem with top-k: k is fixed
    • When the predicted distribution is flat: we want a large k
    • With a small k, the model tends to produce only similar words
    • We want to keep a wide variety of words available
    [Figure from the paper (cropped)]

  13. The problem with top-k: k is fixed
    • When the predicted distribution is peaked: we want a small k
    • With a large k, there is a risk of selecting low-probability words
    [Figure from the paper (cropped)]
    Ideally only something like "hot" or "cooling" would be selected
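
The trade-off of a fixed k can be made concrete with a small Python sketch. The two distributions `flat` and `peaked` and their probabilities are invented for illustration (only the words "hot" and "cooling" come from the slide), and `top_k_summary` is a hypothetical helper.

```python
def top_k_summary(probs, k):
    """Return (probability mass covered, smallest kept probability) for a fixed k."""
    kept = sorted(probs.values(), reverse=True)[:k]
    return sum(kept), kept[-1]

# Hypothetical flat distribution: ten equally likely words.
flat = {f"w{i}": 0.1 for i in range(10)}
# Hypothetical peaked distribution: one word dominates.
peaked = {"hot": 0.9, "cooling": 0.07, "warm": 0.02, "the": 0.01}

flat_mass, _ = top_k_summary(flat, k=3)
peaked_mass, peaked_min = top_k_summary(peaked, k=3)
print(f"flat: top-3 covers {flat_mass:.2f} of the mass")
print(f"peaked: top-3 covers {peaked_mass:.2f}, smallest kept p = {peaked_min}")
```

With the same k=3, the flat case discards about 70% of the probability mass, while the peaked case already admits a word with probability 0.02.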

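  14. Proposed method: Nucleus Sampling
    • Idea: why not choose k dynamically?
    • Use the top-p vocabulary $V^{(p)}$
    • $V^{(p)}$: the smallest subset of $V$ that satisfies the inequality below
    • Intuition 1: when the distribution is flat, $V^{(p)}$ contains many words
    • Intuition 2: when the distribution is peaked, $V^{(p)}$ contains only a few words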
    … stochastic decoding method: Nucleus Sampling. The key … distribution to determine the set of tokens to be sampled from … its top-p vocabulary $V^{(p)} \subset V$ as the smallest set such that
    $$\sum_{x \in V^{(p)}} P(x \mid x_{1:i-1}) \geq p.$$
    p: threshold (a hyperparameter), e.g. 0.95
    Keep adding words until the cumulative probability exceeds p
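
The procedure on this slide (add words in order of probability until the cumulative mass reaches p, then sample from that set) can be sketched as follows. This is a minimal illustration, not the authors' implementation; `nucleus`, `sample_nucleus`, and both toy distributions are assumptions introduced here.

```python
import random

def nucleus(probs, p):
    """Return the smallest top-ranked token list whose cumulative probability is >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return kept

def sample_nucleus(probs, p, rng):
    """Sample one token from the nucleus; weights are renormalized implicitly."""
    kept = nucleus(probs, p)
    tokens = [t for t, _ in kept]
    weights = [w for _, w in kept]  # random.choices treats these as relative weights
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical distributions: ten equally likely words vs. one dominant word.
flat = {f"w{i}": 0.1 for i in range(10)}
peaked = {"hot": 0.9, "cooling": 0.07, "warm": 0.02, "the": 0.01}

print(len(nucleus(flat, 0.95)), [t for t, _ in nucleus(peaked, 0.95)])
print(sample_nucleus(peaked, 0.95, random.Random(0)))
```

With p = 0.95 the flat distribution keeps all ten words, while the peaked one keeps only "hot" and "cooling", so the effective k changes with the shape of the distribution.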

  15. Generating with Nucleus Sampling
    • The repetition problem is solved (or so it seems)
    • No unrelated words are generated (or so it seems)
    An unprecedented number of
    mostly young whales have
    become stranded on the West
    Australian coast since 2008.
    There has been an unprecedented number
    of calves caught in the nets of whaling
    stations that operate in WA. Pilot whales
    continue to migrate to feeding grounds to
    feed their calves. They are now vulnerable
    due to the decline of wild populations;
    they are restricted to one breeding site
    each year. Image copyright Yoon Bo Kim
    But, with sharp decline in wild populations
    the size of the Petrels are shrinking and
    dwindling population means there will only
    be room for a few new fowl.
    GPT-2
    w/ nucleus sampling

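  16. Experimental setup
    • Task: context-conditioned generation
    • Input: the first 1 to 40 tokens of a passage
    • Output: generate the rest of the passage
    • Data: 5,000 passages taken from the web
    • Model: GPT-2 (large)
    • Trained on 40 GB of text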
    An unprecedented number of
    mostly young whales have
    become stranded on the West
    Australian coast since 2008.
    There has been an unprecedented number
    of calves caught in the nets of whaling
    stations that operate in WA. Pilot whales
    continue to migrate to feeding grounds to
    feed their calves. They are now vulnerable
    due to the decline of wild populations;
    they are restricted to one breeding site
    each year. Image copyright Yoon Bo Kim
    But, with sharp decline in wild populations
    the size of the Petrels are shrinking and
    dwindling population means there will only
    be room for a few new fowl.
    This amounts to doing the above for every method and every passage

  17. Experimental results
    Method              Perplexity  Self-BLEU  Repetition (%)  HUSE
    Human               12.38       0.31       0.28            n/a
    Beam Search (b=16)  1.48        0.44       28.94           n/a
    Sampling            22.73       0.28       0.22            0.67
    Top-k (k=40)        6.88        0.39       0.78            0.19
    Top-k (k=640)       13.82       0.32       0.28            0.94
    Nucleus (p=0.95)    13.13       0.32       0.36            0.97

  18. How to read the results (1): Human
    Method              Perplexity  Self-BLEU  Repetition (%)  HUSE
    Human               12.38       0.31       0.28            n/a
    Beam Search (b=16)  1.48        0.44       28.94           n/a
    Sampling            22.73       0.28       0.22            0.67
    Top-k (k=40)        6.88        0.39       0.78            0.19
    Top-k (k=640)       13.82       0.32       0.28            0.94
    Nucleus (p=0.95)    13.13       0.32       0.36            0.97
    Human: text from the original corpus (written by humans)

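  19. How to read the results (2): evaluation metrics
    Method              Perplexity  Self-BLEU  Repetition (%)  HUSE
    Human               12.38       0.31       0.28            n/a
    Beam Search (b=16)  1.48        0.44       28.94           n/a
    Sampling            22.73       0.28       0.22            0.67
    Top-k (k=40)        6.88        0.39       0.78            0.19
    Top-k (k=640)       13.82       0.32       0.28            0.94
    Nucleus (p=0.95)    13.13       0.32       0.36            0.97
    The closer a value is to Human, the better
    Perplexity: the usual perplexity
    Self-BLEU: BLEU against the model's own generations; measures diversity
    Repetition (%): proportion of sentences containing repetition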
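
For reference, "the usual perplexity" is the exponential of the average negative log-probability per token. The sketch below assumes that standard definition and a hypothetical list of per-token probabilities; it is not the paper's evaluation script.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(-(1/N) * sum(log p_i)) over the N token probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# Hypothetical probabilities a model assigned to each token of a short text.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_guess))
```

A model that assigns probability 0.25 to every token has a perplexity of about 4, so the very low beam-search value in the table means the model finds that text far more predictable than human text.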

  20. How to read the results (2): evaluation metrics
    Method              Perplexity  Self-BLEU  Repetition (%)  HUSE
    Human               12.38       0.31       0.28            n/a
    Beam Search (b=16)  1.48        0.44       28.94           n/a
    Sampling            22.73       0.28       0.22            0.67
    Top-k (k=40)        6.88        0.39       0.78            0.19
    Top-k (k=640)       13.82       0.32       0.28            0.94
    Nucleus (p=0.95)    13.13       0.32       0.36            0.97
    HUSE: higher is better; roughly speaking, it corresponds to human evaluation

  21. Experimental results: is Nucleus Sampling really that good?
    Method              Perplexity  Self-BLEU  Repetition (%)  HUSE
    Human               12.38       0.31       0.28            n/a
    Beam Search (b=16)  1.48        0.44       28.94           n/a
    Sampling            22.73       0.28       0.22            0.67
    Top-k (k=40)        6.88        0.39       0.78            0.19
    Top-k (k=640)       13.82       0.32       0.28            0.94
    Nucleus (p=0.95)    13.13       0.32       0.36            0.97
    • Nucleus sampling achieves good values on every metric (or so it seems)
    • In particular, it has the best HUSE (human evaluation) score
    • But is there hardly any difference from Top-k (k=640)?
    • Couldn't we just use top-k?
    • According to the authors, a k of 100 or more is not a "normal" value

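  22. (Recap) What is this paper about?
    • Background and problem
    • Text generation with a pretrained language model (GPT-2) suffers from degeneration
    • Degeneration: repetitive generation and incoherent generation
    • Idea
    • Prevent degeneration by changing the decoding method
    • Nucleus Sampling: dynamically adjust the k of top-k sampling
    • Contributions
    • Demonstrated that degeneration exists
    • Showed the benefits of Nucleus Sampling both qualitatively and quantitatively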

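  23. Impressions
    • It is interesting that degeneration occurs even with a language model like GPT-2
    • Seems worth knowing as a phenomenon
    • It is unclear how far this generalizes to language models in general
    • Does the same problem occur with small-scale data and models?
    • What about GPT-3?
    • Nucleus sampling looks like a good method, but comes across as a well-crafted heuristic
    • I would rather revisit the training method itself → unlikelihood loss?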