
Reservoir Transformers

Shun Kiyono
September 07, 2021

Slides to be presented at the 最先端NLP勉強会 (Advanced NLP study group).


Transcript

  1. Presenter
    RIKEN AIP / Tohoku University, Inui Lab
    Shun Kiyono
    Reservoir Transformers
    Sheng Shen†, Alexei Baevski‡, Ari S. Morcos‡, Kurt Keutzer†,
    Michael Auli‡, Douwe Kiela‡
    †UC Berkeley; ‡Facebook AI Research
    [email protected], [email protected]
    Abstract (excerpt)
    We demonstrate that transformers obtain impressive performance even when some of
    the layers are randomly initialized and never updated. [...] What is more, we find
    that freezing layers may actually improve performance. Beyond desirable efficiency
    gains, random layers are interesting for several additional reasons.
    ※ Figures and tables without annotations are taken from the paper.


  2. What is this paper about?
    • Background / problem
      • We want to reduce the time it takes to train a Transformer
    • Idea
      • Freeze some of the Transformer's layers at their randomly initialized values
      • Skipping the parameter updates for those layers yields a speedup
    • Contributions
      • Model convergence becomes roughly 30% faster (in some cases)
      • Model performance improves (in some cases)

  3. We are truly living in the era of giant language models
    • All you need is Transformer?
      • High performance
      • Hugely popular in language, vision, and speech
    • The vanilla setup is almost too strong
      • There is little difference between vanilla Transformers and their variants [Narang+2021]
      • About the only thing that has changed since the original is the position of layer normalization?
    • Scaling laws [Kaplan+2020]
      • More data → better performance
      • Bigger models → better performance
    https://ruder.io/research-highlights-2020/
    [Figure 1 from [Kaplan+2020]; axes: Test Loss vs. Dataset Size (tokens), Parameters (non-embedding), and Compute (PF-days, non-embedding). Caption: "Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two."]
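    Schematically, the scaling laws above take a power-law form; the following is my own compact restatement (constants omitted; see [Kaplan+2020] for the fitted exponents), not a formula from the slides:

        $L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}$

    where $L$ is test loss, $N$ the number of non-embedding parameters, $D$ the dataset size in tokens, and $C$ the training compute, each measured when not bottlenecked by the other two factors.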


  4. Training giant neural networks is painful
    • There simply isn't enough hardware (GPUs)
      • A 32 GB card tops out at about 1.4B parameters [Ren+2021]
      • Research groups with limited compute cannot train such models
    • Training takes far too long
      • GPT-3: reportedly 355 years on a single GPU
      • Transformer (big): about a day on a DGX-2
      • A shared-task entry may need 16 models → 16 days???
    • The environmental cost (greenhouse gas emissions) cannot be ignored either
    → This work aims to address this problem (training time)


  5. Recap: the Transformer
    • The mainstream recipe is to stack layers of the same size to improve performance
    [Diagram: a stack of five trainable Transformer layers 🔥. Inside each layer: Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm, applied to the input with residual (+) connections.]
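    A minimal PyTorch sketch of this "stack identical layers" recipe (my own illustration, not the authors' code; the layer sizes are placeholders):

        import torch
        import torch.nn as nn

        # Five identical Transformer encoder layers, as in the diagram above.
        layer = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048, batch_first=True)
        encoder = nn.TransformerEncoder(layer, num_layers=5)  # deep-copies the layer 5 times

        x = torch.randn(2, 16, 512)   # (batch, sequence length, d_model)
        print(encoder(x).shape)       # torch.Size([2, 16, 512])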


  6. Idea: leave some layers un-updated
    • Some layers are kept frozen at their initial values ❄
    • The other layers are updated as usual 🔥
    • Updated and frozen layers are interleaved,
      so that the frozen layers end up as far apart from each other as possible
    [Diagram: Transformer layers stacked as 🔥, ❄, 🔥, ❄, 🔥, i.e. trainable and frozen layers interleaved.]
    Note: this "code" on the slide does not actually run; a runnable sketch follows below.
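    A minimal runnable sketch of the interleaving idea, under my own assumptions (plain nn.TransformerEncoderLayer blocks, every other layer frozen); this is not the authors' implementation:

        import torch.nn as nn

        def make_layer():
            # Placeholder sizes; the paper's actual configurations differ.
            return nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                              dim_feedforward=2048, batch_first=True)

        layers = nn.ModuleList([make_layer() for _ in range(5)])

        # Freeze layers 1 and 3 (0-indexed) at their random initial values,
        # so updated (🔥) and frozen (❄) layers are interleaved and spread apart.
        for i, layer in enumerate(layers):
            if i in {1, 3}:
                for p in layer.parameters():
                    p.requires_grad_(False)

        print([all(not p.requires_grad for p in l.parameters()) for l in layers])
        # [False, True, False, True, False]  ->  🔥 ❄ 🔥 ❄ 🔥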


  7. Proposed method: Reservoir Transformer
    • Reservoir Transformer
      • A Transformer that contains layers ❄ kept frozen at their randomly initialized values
    • The intuition behind the proposal
      • Skipping parameter updates for the frozen layers gives a speedup
      • We pray that the randomly initialized layers still do useful feature extraction
      • Put less charitably, it is a "we just tried it" kind of paper
    [Diagram: the same 🔥, ❄, 🔥, ❄, 🔥 stack as on the previous slide.]
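    The claimed speedup comes from skipping gradient computation and optimizer updates for the frozen layers' weights. A small self-contained sketch of that bookkeeping (my own illustration; sizes and the choice of frozen layers are placeholders):

        import torch
        import torch.nn as nn

        layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
            for _ in range(5)
        ])
        for i in (1, 3):                       # reservoir layers ❄
            for p in layers[i].parameters():
                p.requires_grad_(False)

        # The optimizer only ever sees the trainable (🔥) parameters, so the
        # frozen layers cost nothing in the weight-gradient and update steps.
        trainable = [p for p in layers.parameters() if p.requires_grad]
        optimizer = torch.optim.Adam(trainable, lr=5e-4)

        n_total = sum(p.numel() for p in layers.parameters())
        n_train = sum(p.numel() for p in trainable)
        print(f"updating {n_train}/{n_total} parameters ({100 * n_train / n_total:.0f}%)")
        # roughly 60% with 2 of 5 layers frozen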


  8. The intuition behind the reservoir layers
    Is it really safe to keep layers frozen at their initial values?
    • Neural networks are fairly forgiving; maybe random layers just behave like benign noise...?
      • Perhaps the residual connections make this safe?
    • We hope that the frozen layers ❄ produce useful features
      • If they do not: as long as the norm of the trained layers 🔥 exceeds the norm of the
        frozen layers ❄, the frozen layers can effectively be ignored?
    • A similar(?) technique: adapters
      • Small modules are added to a pretrained Transformer and then fine-tuned
      • Because the randomly initialized part starts out as an identity mapping,
        the information in the pretrained model is not destroyed
      • The position of the residual connection is reportedly very important
      • [Pfeiffer+2020]
    Excerpt from [Pfeiffer+2020], Section 2.2.1 "Single-task Adapters":
    "For each of the N tasks, the model is initialized with parameters $\Theta_0$. In addition,
    a set of new and randomly initialized parameters $\Phi_n$ are introduced (the adapter
    parameters). To share the same set of parameters $\Theta_0$ across all otherwise independent
    tasks, the parameters in $\Theta_0$ are fixed and only the parameters $\Phi_n$ are trained.
    This makes it possible to efficiently parallelize the training of adapters for all N tasks.
    The objective for each task $n \in \{1, \ldots, N\}$ is of the form:
        $\Phi_n \leftarrow \arg\min_{\Phi} L_n(D_n; \Theta_0, \Phi)$
    For common adapter architectures, $\Phi$ contains considerably fewer parameters than
    $\Theta$, e.g., only 3.6% of the parameters of the pre-trained model in (Houlsby et al., 2019)."
    [Figure 2 from [Pfeiffer+2020]: the architectural components of the adapter. A Transformer layer (Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm) with adapter blocks (LayerNorm, FF Down, FF Up) inserted at candidate positions.]
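    A minimal sketch of a bottleneck adapter in the spirit of the excerpt above (my own simplification of the Houlsby et al. 2019 / Pfeiffer et al. 2020 designs, not their exact architecture). Zero-initializing the up-projection makes the module start out as an identity mapping, which is the property the slide highlights:

        import torch
        import torch.nn as nn

        class Adapter(nn.Module):
            """Bottleneck adapter: LayerNorm -> FF down -> ReLU -> FF up, plus a residual."""
            def __init__(self, d_model: int = 512, bottleneck: int = 64):
                super().__init__()
                self.norm = nn.LayerNorm(d_model)
                self.down = nn.Linear(d_model, bottleneck)
                self.up = nn.Linear(bottleneck, d_model)
                # Zero-init the up-projection so the adapter initially computes
                # f(x) = x + 0, i.e. an identity map that does not disturb the
                # pretrained representations it is inserted into.
                nn.init.zeros_(self.up.weight)
                nn.init.zeros_(self.up.bias)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return x + self.up(torch.relu(self.down(self.norm(x))))

        x = torch.randn(2, 16, 512)
        print(torch.allclose(Adapter()(x), x))  # True: starts as the identity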


  9. Prior work on frozen layers ❄
    • A frozen, randomly initialized LSTM is a strong sentence encoder [Wieting+2019]
      • Showed that the gap to Skip-thought and InferSent is small
    • Dimensionality reduction with random projections [Sahlgren 2005]
      • The projection into a subspace preserves distances from the original space
        (see the numerical sketch after the table below)
    Model Dim MR CR MPQA SUBJ SST2 TREC SICK-R SICK-E MRPC STSB
    BOE 300 77.3(.2) 78.6(.3) 87.6(.1) 91.3(.1) 80.0(.5) 81.5(.8) 80.2(.1) 78.7(.1) 72.9(.3) 70.5(.1)
    BOREP 4096 77.4(.4) 79.5(.2) 88.3(.2) 91.9(.2) 81.8(.4) 88.8(.3) 85.5(.1) 82.7(.7) 73.9(.4) 68.5(.6)
    RandLSTM 4096 77.2(.3) 78.7(.5) 87.9(.1) 91.9(.2) 81.5(.3) 86.5(1.1) 85.5(.1) 81.8(.5) 74.1(.5) 72.4(.5)
    ESN 4096 78.1(.3) 80.0(.6) 88.5(.2) 92.6(.1) 83.0(.5) 87.9(1.0) 86.1(.1) 83.1(.4) 73.4(.4) 74.4(.3)
    InferSent-1 = paper, G 4096 81.1 86.3 90.2 92.4 84.6 88.2 88.3 86.3 76.2 75.6
    InferSent-2 = fixed pad, F 4096 79.7 84.2 89.4 92.7 84.3 90.8 88.8 86.3 76.0 78.4
    InferSent-3 = fixed pad, G 4096 79.7 83.4 88.9 92.6 83.5 90.8 88.5 84.1 76.4 77.3
    InferSent-3, BestRand - 1.6 3.4 0.4 0.0 0.5 2.0 2.4 1.0 2.3 2.9
    ST-LN 4800 79.4 83.1 89.3 93.7 82.9 88.4 85.8 79.5 73.2 68.9
    ST-LN, BestRand - 1.3 3.1 0.8 1.1 -0.1 0.5 -0.3 -3.6 -0.9 -5.5
    Table 1 [Wieting+2019]: Performance (accuracy for all tasks except SICK-R and STSB, for
    which Pearson's r is reported) on all ten downstream tasks; all models have 4096 dimensions
    except BOE (300) and ST-LN (4800). Standard deviations are shown in parentheses.
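    The random-projection bullet above can be checked numerically. A small sketch of the Johnson-Lindenstrauss-style intuition behind [Sahlgren 2005] (my own illustration with made-up dimensions, not the paper's method):

        import numpy as np
        from scipy.spatial.distance import pdist

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 4096))                 # 100 points in the original space
        R = rng.normal(size=(4096, 256)) / np.sqrt(256)  # random projection to 256 dims

        ratio = pdist(X @ R) / pdist(X)                  # projected vs. original pairwise distances
        print(f"mean ratio {ratio.mean():.3f}, std {ratio.std():.3f}")
        # mean close to 1.0 with a small std: distances are roughly preserved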


  10. Experimental setup
    • Goal
      • Verify the effect of the Reservoir Transformer
    • Baseline
      • A regular Transformer
    • Tasks
      • Machine translation (WMT, IWSLT)
      • Language modeling (enwik8)
      • Bidirectional language modeling (MNLI, SST-2, etc.)
    • Evaluation metrics
      • BLEU (n-gram agreement with the reference translations)
      • AUCC (a training-speed counterpart of AUC-ROC; described on the next slide)
    Due to time constraints, only part of this (the machine-translation experiments) is covered in the talk.


  11. Evaluation metric: AUCC
    We want to measure how turning the model into a reservoir transformer affects training speed.
    • AUCC: area under the convergence curve
      • Computed by integrating the learning curve
      • Larger value → faster convergence (and better performance)
    [Plot: BLEU (y-axis) vs. training time (x-axis); the area under the curve is integrated, truncated at a suitably chosen cutoff.]
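    A sketch of how such an AUCC value could be computed from a validation curve, based on my reading of the definition on this slide (hypothetical numbers; the paper's exact truncation and normalization may differ):

        import numpy as np

        # Hypothetical validation checkpoints: (training hours, BLEU).
        hours = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 24.0])
        bleu  = np.array([5.0, 15.0, 25.0, 30.0, 33.0, 34.0])

        cutoff = 20.0                             # truncate at a chosen training budget
        mask = hours <= cutoff
        aucc = np.trapz(bleu[mask], hours[mask])  # area under the convergence curve
        print(f"AUCC up to {cutoff:.0f}h: {aucc:.1f}")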


  12. Results: training speed
    [Figure 8: Validation BLEU AUCC and test BLEU on IWSLT (higher is better). Comparison of the regular transformer and the reservoir transformer with Transformer reservoir layers added.]
    • Depending on the number of encoder layers...
      • Few layers: the baseline appears to be faster?
      • Many layers: the proposed method appears to be faster?
    • Neither method is consistently better
    • At best this is "maybe a ~10% reduction in training time"
    (How many layers are frozen is a mystery, and so is where the data points are plotted.)


  13. Results: BLEU
    • Reservoir layers seem to help when the number of layers is large?
      • Perhaps they act as a regularizer?
      • May be effective when the model is over-parameterized
    • The honest reading is that performance is essentially unchanged
    [Figure: validation BLEU AUCC and test BLEU (higher is better), comparing the regular transformer against the reservoir transformer with FFN or Transformer reservoir layers added; a second, truncated caption refers to enwik8 results.]
    (Again, how many layers are frozen is a mystery.)


  14. A concern: the baseline
    • The shape of the baseline Transformer is odd
      • N encoder layers paired with a single decoder layer [Kasai et al. 2020]
      • Experiments normally use 6 encoder layers and 6 decoder layers
      • The authors: "we adopted it because it is a fast configuration"
    • This does not seem like a reasonable setting...
      • It is far removed from the encoder-decoder models used in the real world
    • Are two different notions of "speed" being conflated?
      • [Kasai et al. 2020] is about speeding up decoding
      • This paper is about speeding up training
      • Conflating the two in the argument already makes things look quite suspicious...


  15. Why does the baseline's shape matter?
    • The experimental findings cannot be trusted
      • Is the baseline's convergence fast enough?
        • Could the 6:6 setting actually converge faster?
      • Is the baseline's performance high enough?
        • Could the 6:6 setting actually perform better?
    • Findings we would actually need are missing
      • Is it even possible to apply the reservoir idea on the decoder side?


  16. Investigating the baseline
    Cross-checked against the replication experiments in [Xu+2021]
    • Is the baseline's convergence fast enough? → probably OK
      • 😃 Making the decoder shallower does make training faster
      • 😰 However, whether it converges faster is unclear
        • Whether its AUCC is actually higher is unknown
    • Is the baseline's performance high enough? → not OK
      • 😨 The 6:6 setting appears to perform better
    [Xu+2021]


  17. Summary (recap)
    • Background / problem
      • We want to reduce the time it takes to train a Transformer
    • Idea
      • Freeze some of the Transformer's layers at their randomly initialized values
      • Skipping the parameter updates for those layers yields a speedup
    • Contributions
      • Model convergence becomes roughly 30% faster (in some cases)
      • Model performance improves (in some cases)