
Reservoir Transformers

Shun Kiyono
September 07, 2021

Slides to be presented at the 最先端NLP勉強会 (an NLP paper reading group).


Transcript

1. Presenter: Shun Kiyono (RIKEN AIP / Tohoku University, Inui Lab)

   Reservoir Transformers
   Sheng Shen†, Alexei Baevski‡, Ari S. Morcos‡, Kurt Keutzer†, Michael Auli‡, Douwe Kiela‡
   †UC Berkeley; ‡Facebook AI Research
   [email protected], [email protected]

   Abstract (excerpt): "We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. [...] What is more, we find that freezing layers may actually improve performance. Beyond desirable efficiency gains, random layers are interesting for several additional reasons."

   Note: figures and tables without annotations are quoted from the paper.

2. What kind of paper is this?

   • Background / problem
     • Want to reduce the time it takes to train a Transformer
   • Idea
     • Freeze some of the Transformer's layers at their initial values
     • Skipping the parameter-update computation gives a speedup
   • Contributions
     • Model convergence becomes up to ~30% faster (in some cases)
     • Model performance improves (in some cases)

3. We are truly in the era of giant language models

   • All you need is Transformer?
     • High performance
     • Hugely popular in language, vision, and speech
   • The vanilla setup is very hard to beat
     • Vanilla and its variants show almost no difference [Narang+2021]
     • Since its debut, about the only thing that changed is the position of layer normalization?
   • Scaling laws [Kaplan+2020]
     • More data → better performance
     • Bigger models → better performance

   (Image: https://ruder.io/research-highlights-2020/)
   [Figure 1 of [Kaplan+2020]: "Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two." Axes: Test Loss vs. Dataset Size (tokens), Parameters (non-embedding), Compute (PF-days, non-embedding).]

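For reference, the power-law fits reported in [Kaplan+2020] have the simple form below, where N is the (non-embedding) parameter count, D the dataset size, and N_c, D_c, α_N, α_D are empirically fitted constants (numeric values omitted here):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
```
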
4. Training giant neural networks is painful

   • There isn't enough hardware (GPUs) in the first place
     • A 32GB card tops out at about 1.4B parameters [Ren+2021]
     • Research groups with limited compute cannot train such models
   • Training takes too long
     • GPT-3: 355 years on a single GPU (reportedly)
     • Transformer (big): one day on a DGX-2
     • A shared-task submission may need 16 models → 16 days???
   • The environmental cost (greenhouse gases) also cannot be ignored

   (The paper covered here targets the training-time problem.)

5. Recap: Transformer

   • The mainstream recipe is to stack layers of the same size to improve performance (a minimal sketch follows below)

   [Diagram: a stack of Transformer layers 🔥, each consisting of Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm, with residual connections from the input.]

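To make "stacking identical layers" concrete, here is a minimal PyTorch sketch using torch.nn's built-in encoder layer; the dimensions (d_model=512, 8 heads, 6 layers) are the usual Transformer-base values, not necessarily the configurations used in the paper:

```python
import torch
import torch.nn as nn

# A stack of identical Transformer encoder layers: every layer has the same
# shape (multi-head self-attention + feed-forward, each with Add & Norm),
# and in the standard setup all of them are trained.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # deep-copies the layer 6 times

x = torch.randn(10, 32, 512)  # (sequence length, batch size, d_model)
y = encoder(x)                # shape preserved: (10, 32, 512)
```
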
6. Idea: don't update some layers

   • Some layers are frozen at their initial values ❄
   • The other layers are updated as usual 🔥
   • Updated and frozen layers are interleaved
     • so that the frozen layers are spaced as far apart as possible

   [Diagram: alternating Transformer layers 🔥 and ❄.]
   Note: the code shown on the slide does not run. (A runnable sketch follows below.)

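A minimal PyTorch sketch of the freezing idea, assuming an encoder-only stack; which layers get frozen (here every other one) is an illustrative choice, not necessarily the paper's exact placement:

```python
import torch.nn as nn

# Interleave trainable 🔥 and frozen ❄ layers: frozen layers keep their random
# initial weights for the whole of training ("reservoir" layers).
num_layers = 6
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
    for _ in range(num_layers)
)

for i, layer in enumerate(layers):
    if i % 2 == 1:                   # every other layer, to space the frozen ones apart
        for p in layer.parameters():
            p.requires_grad_(False)  # never updated
```
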
7. Proposed method: Reservoir Transformer

   • Reservoir Transformer
     • A Transformer that contains layers ❄ frozen at their initial values
   • The spirit of the method
     • Skipping parameter updates gives a speedup (see the optimizer sketch below)
     • Hope that the randomly initialized layers still do useful feature extraction
     • Put unkindly, a "we just tried it" kind of paper

   [Diagram: alternating trainable 🔥 and frozen ❄ Transformer layers.]

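Where the saving comes from, as a minimal sketch with a hypothetical stand-in model: parameters frozen at initialization need no gradients, no optimizer state, and no update step.

```python
import torch
import torch.nn as nn

# Tiny stand-in model (hypothetical): three layers, the middle one frozen at init.
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512), nn.Linear(512, 512))
for p in model[1].parameters():
    p.requires_grad_(False)

# Only trainable parameters are handed to the optimizer: the frozen layer gets no
# gradient buffers, no optimizer state, and no update step, which is where the
# per-step savings come from.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=5e-4
)
```
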
8. The intuition behind reservoirs

   • Neural networks are fairly loose anyway, so maybe they will still work fine under this kind of noise…?
     • Is it OK thanks to the residual connections?
   • We hope the frozen layers ❄ produce useful features
     • If they don't: as long as the norm of the trained layers 🔥 exceeds the norm of the frozen layers ❄, the frozen layers can effectively be ignored?
   • A similar(?) method: adapters
     • Add small modules to a pretrained Transformer and fine-tune only them
     • The randomly initialized part starts from an identity mapping, so the pretrained information is not destroyed
     • The position of the residual connection is apparently very important [Pfeiffer+2020]

   Is it really safe to keep layers frozen at their initial values?

   Excerpt from [Pfeiffer+2020], §2.2.1 "Single-task Adapters": For each of the N tasks, the model is initialized with parameters Θ₀. In addition, a set of new and randomly initialized parameters Φₙ are introduced (the adapter parameters). To share the same set of parameters Θ₀ across all otherwise independent tasks, the parameters in Θ₀ are fixed and only the parameters Φₙ are trained. This makes it possible to efficiently parallelize the training of adapters for all N tasks. The objective for each task n ∈ {1, ..., N} is of the form:

      Φₙ ← argmin_Φ Lₙ(Dₙ; Θ₀, Φ)

   For common adapter architectures, Φ contains considerably fewer parameters than Θ, e.g., only 3.6% of the parameters of the pre-trained model in (Houlsby et al., 2019).

   [Figure 2 of [Pfeiffer+2020] (caption cropped): different architectural components of the adapter (FF Down / FF Up modules inserted around the Add & Norm and feed-forward blocks).]

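For concreteness, a minimal sketch of a bottleneck adapter in the spirit of the excerpt above; the dimensions (d_model=768, bottleneck=64) and the near-zero initialization of the up-projection are illustrative assumptions, and the exact placement relative to Add & Norm is what [Pfeiffer+2020] studies:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable module added inside a frozen pretrained Transformer layer
    (a sketch in the spirit of Houlsby et al. 2019 / Pfeiffer et al. 2020)."""
    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # "FF Down"
        self.up = nn.Linear(bottleneck, d_model)    # "FF Up"
        nn.init.zeros_(self.up.weight)              # start as (approximately) the
        nn.init.zeros_(self.up.bias)                # identity map via the residual path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual connection
```
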
9. Prior work on frozen layers ❄

   • A randomly initialized, frozen LSTM is a strong sentence encoder [Wieting+2019]
     • Showed that the gap to Skip-thought and InferSent is small
   • Dimensionality reduction with random projections [Sahlgren 2005]
     • The projection onto a subspace preserves distances in the original space

   Table 1 of [Wieting+2019] (ICLR 2019): Performance (accuracy for all tasks except SICK-R and STSB, for which Pearson's r is reported) on ten downstream tasks; all models have 4096 dimensions except BOE (300) and ST-LN (4800). Standard deviations are shown in parentheses. (Caption truncated in the slide.)

   Model                       Dim   MR        CR        MPQA      SUBJ      SST2      TREC       SICK-R    SICK-E    MRPC      STSB
   BOE                         300   77.3(.2)  78.6(.3)  87.6(.1)  91.3(.1)  80.0(.5)  81.5(.8)   80.2(.1)  78.7(.1)  72.9(.3)  70.5(.1)
   BOREP                       4096  77.4(.4)  79.5(.2)  88.3(.2)  91.9(.2)  81.8(.4)  88.8(.3)   85.5(.1)  82.7(.7)  73.9(.4)  68.5(.6)
   RandLSTM                    4096  77.2(.3)  78.7(.5)  87.9(.1)  91.9(.2)  81.5(.3)  86.5(1.1)  85.5(.1)  81.8(.5)  74.1(.5)  72.4(.5)
   ESN                         4096  78.1(.3)  80.0(.6)  88.5(.2)  92.6(.1)  83.0(.5)  87.9(1.0)  86.1(.1)  83.1(.4)  73.4(.4)  74.4(.3)
   InferSent-1 = paper, G      4096  81.1      86.3      90.2      92.4      84.6      88.2       88.3      86.3      76.2      75.6
   InferSent-2 = fixed pad, F  4096  79.7      84.2      89.4      92.7      84.3      90.8       88.8      86.3      76.0      78.4
   InferSent-3 = fixed pad, G  4096  79.7      83.4      88.9      92.6      83.5      90.8       88.5      84.1      76.4      77.3
   InferSent-3, BestRand       -     1.6       3.4       0.4       0.0       0.5       2.0        2.4       1.0       2.3       2.9
   ST-LN                       4800  79.4      83.1      89.3      93.7      82.9      88.4       85.8      79.5      73.2      68.9
   ST-LN, BestRand             -     1.3       3.1       0.8       1.1       -0.1      0.5        -0.3      -3.6      -0.9      -5.5

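A small numerical sketch of the random-projection claim (Johnson–Lindenstrauss flavor); the sizes are arbitrary and nothing here is specific to [Sahlgren 2005]:

```python
import numpy as np

# Pairwise distances are approximately preserved by a random Gaussian projection
# from a high-dimensional space to a lower-dimensional subspace.
rng = np.random.default_rng(0)
n, d, k = 100, 4096, 512

X = rng.normal(size=(n, d))
R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection matrix
Y = X @ R

def pairwise_distances(A):
    sq = np.sum(A ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0.0))

iu = np.triu_indices(n, k=1)
ratio = pairwise_distances(Y)[iu] / pairwise_distances(X)[iu]
print(ratio.mean(), ratio.std())           # ratios concentrate around 1.0
```
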
10. Experimental setup

   • Goal
     • Verify the effect of the Reservoir Transformer
   • Baseline
     • Transformer
   • Tasks
     • Machine translation (WMT, IWSLT) ← due to time constraints, only this part is covered in this talk
     • Language modeling (enwik8)
     • Bidirectional language modeling (MNLI, SST-2, etc.)
   • Evaluation metrics
     • BLEU (n-gram overlap with the reference translation)
     • AUCC (a "training speed" analogue of AUC-ROC; described on the next slide)

11. Evaluation metric: AUCC

   We want to measure how reservoir-izing affects training speed.

   • AUCC: area under the convergence curve
     • Computed by integrating the learning curve (see the sketch below)
     • Larger value → faster convergence (and better performance)

   [Plot annotations: the region under the curve is integrated; x-axis: training time; y-axis: BLEU; the curve is truncated at a suitable cutoff.]

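A toy sketch of how such an area-under-the-convergence-curve value could be computed from (training time, BLEU) checkpoints; the cutoff and any normalization used in the paper are not specified here, so the numbers are purely illustrative:

```python
# Trapezoidal area under a BLEU-vs-training-time curve, truncated at a cutoff.
def aucc(times, bleu, cutoff):
    pts = [(t, b) for t, b in zip(times, bleu) if t <= cutoff]
    area = 0.0
    for (t0, b0), (t1, b1) in zip(pts, pts[1:]):
        area += 0.5 * (b0 + b1) * (t1 - t0)  # trapezoid between adjacent checkpoints
    return area

fast = aucc([0, 1, 2, 3, 4], [0, 20, 27, 30, 31], cutoff=4)   # converges quickly
slow = aucc([0, 1, 2, 3, 4], [0, 10, 20, 26, 29], cutoff=4)   # converges slowly
print(fast, slow)  # the faster-converging curve yields the larger AUCC
```
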
12. Results: speed

   [Figure 8 of the paper (caption cropped): validation BLEU AUCC on IWSLT (higher is better); comparison of a regular transformer and a reservoir transformer with transformer reservoir layers added.]

   • Depending on the number of encoder layers…
     • When it is small: the baseline seems faster?
     • When it is large: the proposed method seems faster?
   • No method is consistently better
   • At best, training time might be cut by about 10%

   (Annotations on the figure: how many layers are frozen is unclear, and it is also unclear where the data points are plotted.)

13. Results: BLEU

   • Reservoir layers seem to help when the number of layers is large?
     • Acting as regularization?
     • Possibly effective when the model is over-parameterized
   • The honest summary is that performance is almost unchanged

   [Figure captions from the paper (cropped): validation BLEU AUCC and test BLEU (higher is better); comparison of a regular transformer and a reservoir transformer with FFN or Transformer reservoir layers added; a second cropped caption refers to enwik8.]

   (Annotation: how many layers are frozen is unclear.)

14. A concern: the baseline

   • The shape of the baseline Transformer is odd
     • N encoder layers paired with a single decoder layer [Kasai et al. 2020]
     • The usual setup is 6 encoder layers : 6 decoder layers
     • The authors: "we adopted it because it is a fast configuration"
   • This does not feel like a reasonable setting…
     • It is far removed from the encoder-decoders actually used in practice
   • Are two different notions of "speed" being conflated?
     • [Kasai et al. 2020]: about speeding up decoding
     • This paper: about speeding up training
     • Mixing these up in the argument is already quite suspicious…

   (A concrete comparison of the two configurations follows below.)

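To make the two shapes concrete, a small sketch contrasting the deep-encoder/shallow-decoder baseline with a conventional 6:6 model; d_model=512, 8 heads, and N=12 encoder layers are assumptions for illustration, not the paper's exact values:

```python
import torch.nn as nn

def count_params(model):
    return sum(p.numel() for p in model.parameters())

# Deep encoder + single-layer decoder: the shape of the paper's baseline
# ("Encoder N layers : Decoder 1 layer", with N = 12 as a hypothetical example).
deep_shallow = nn.Transformer(d_model=512, nhead=8,
                              num_encoder_layers=12, num_decoder_layers=1)

# The conventional 6:6 encoder-decoder configuration used by most MT systems.
six_six = nn.Transformer(d_model=512, nhead=8,
                         num_encoder_layers=6, num_decoder_layers=6)

print(count_params(deep_shallow), count_params(six_six))
```
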
15. Why does the baseline's shape matter?

   • The experimental findings cannot be trusted
     • Is the baseline's convergence fast enough?
       • Could the 6:6 setting actually converge faster?
     • Is the baseline's performance high enough?
       • Could the 6:6 setting actually perform better?
   • Necessary findings are missing
     • Is it possible to apply the reservoir idea on the decoder side?

16. Investigating the baseline

   Cross-checked against the replication study of [Xu+2021].

   • Is the baseline's convergence fast enough? → probably OK
     • 😃 Training is faster with a shallower decoder
     • 😰 However, whether it converges faster is unknown
       • i.e., whether its AUCC is higher is unknown
   • Is the baseline's performance high enough? → no
     • 😨 The 6:6 setting appears to perform better [Xu+2021]

17. Summary (recap)

   • Background / problem
     • Want to reduce the time it takes to train a Transformer
   • Idea
     • Freeze some of the Transformer's layers at their initial values
     • Skipping the parameter-update computation gives a speedup
   • Contributions
     • Model convergence becomes up to ~30% faster (in some cases)
     • Model performance improves (in some cases)