Slide 1

Slide 1 text

Presenter (読む人): Shun Kiyono (RIKEN AIP / Inui Lab, Tohoku University)
Presented at the Cutting-Edge NLP Study Group (最先端NLP勉強会), September 17, 2021

Reservoir Transformers
Sheng Shen†, Alexei Baevski‡, Ari S. Morcos‡, Kurt Keutzer†, Michael Auli‡, Douwe Kiela‡
†UC Berkeley; ‡Facebook AI Research
[email protected], [email protected]

Abstract (excerpt): "We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. [...] What is more, we find that freezing layers may actually improve performance. Beyond desirable efficiency gains, random layers are interesting for several additional reasons."

Note: figures and tables without annotations are quoted from the paper.

Slide 2

Slide 2 text

What is this paper about?
• Background / problem
  • We want to reduce the time it takes to train a Transformer.
• Idea
  • Freeze some of the Transformer's layers at their random initial values.
  • Skipping their parameter updates saves computation, hence a speedup.
• Contributions
  • Model convergence becomes up to ~30% faster (in some cases).
  • Model performance improves (in some cases).

Slide 3

Slide 3 text

We are truly in the age of the large language model
• All you need is Transformer?
  • High performance
  • Hugely popular across language, vision, and speech
• The vanilla configuration is just too strong
  • Variants barely differ from the vanilla Transformer [Narang+2021]
  • About the only thing that has changed since its debut is the placement of layer normalization?
• Scaling laws [Kaplan+2020]
  • More data → better performance
  • Bigger models → better performance
https://ruder.io/research-highlights-2020/
[Figure quoted from Kaplan+2020, Figure 1: "Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two." Axes: test loss vs. compute (PF-days, non-embedding), dataset size (tokens), and parameters (non-embedding).]

Slide 4

Slide 4 text

Training giant neural networks is painful
• There is not enough hardware (GPUs) in the first place
  • A 32GB card tops out at about 1.4B parameters [Ren+2021]
  • Research groups with limited compute simply cannot train these models
• Training takes far too long
  • GPT-3: reportedly 355 years on a single GPU
  • Transformer (big): about a day on a DGX-2
  • A shared-task submission may require 16 models → 16 days???
• The environmental cost (greenhouse-gas emissions) cannot be ignored either
→ This paper targets the training-time problem.

Slide 5

Slide 5 text

Recap: Transformer
• The mainstream recipe is to stack identical layers and scale up for better performance.
[Diagram: a stack of five Transformer layers, all trainable 🔥; each layer consists of Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm, with a residual connection from the input.]

Slide 6

Slide 6 text

Idea: don't update some of the layers
• Some layers are frozen at their random initial values ❄
• The other layers are updated as usual 🔥
• Updated and frozen layers are interleaved in the stack
  • so that the frozen layers are spaced as far apart as possible
[Diagram: a five-layer stack alternating trainable 🔥 and frozen ❄ Transformer layers.]
Note: this code does not run (a runnable sketch follows below).
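Since the slide's own pseudocode is marked as non-runnable, here is a minimal runnable PyTorch sketch of the idea (my own illustration, not the authors' code): build a stack of `nn.TransformerEncoderLayer`s and freeze every other layer by disabling gradients on its parameters, so the optimizer never touches them.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6

# A plain stack of Transformer encoder layers.
layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048, batch_first=True)
    for _ in range(n_layers)
])

# Freeze every other layer at its random initial values (interleaved placement,
# keeping the frozen layers as far apart as possible).
for i, layer in enumerate(layers):
    if i % 2 == 1:  # hypothetical choice: freeze the odd-indexed layers
        for p in layer.parameters():
            p.requires_grad_(False)

# Only trainable parameters are handed to the optimizer; the frozen layers are
# skipped during the parameter update (the source of the claimed speedup).
optimizer = torch.optim.Adam(
    [p for p in layers.parameters() if p.requires_grad], lr=1e-4
)

x = torch.randn(2, 10, d_model)  # (batch, sequence, hidden)
for layer in layers:
    x = layer(x)  # the forward pass still goes through every layer, frozen or not
```

Note that gradients still flow *through* the frozen layers, since they are needed by the trainable layers below them; what is skipped is computing gradients with respect to the frozen parameters and updating them.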

Slide 7

Slide 7 text

Proposed method: Reservoir Transformer
• Reservoir Transformer
  • a Transformer that contains layers frozen at their random initial values ❄
• The spirit of the method
  • Skipping parameter updates gives a speedup.
  • Pray that the randomly initialized layers still do useful feature extraction.
  • Put uncharitably, it is a "we just tried it" kind of paper.
[Diagram: the same stack with trainable 🔥 and frozen ❄ layers interleaved.]

Slide 8

Slide 8 text

The reservoir intuition
Is it really OK to keep layers frozen at their initial values?
• Neural networks are fairly forgiving, so maybe random layers will still behave like useful noise…?
  • Perhaps the residual connections make this safe?
• The hope is that the frozen layers ❄ produce useful features.
  • If they don't: as long as the norm of the trained layers 🔥 grows larger than that of the frozen layers ❄, the frozen layers can effectively be ignored?
• A (possibly) related technique: adapters
  • Small modules are added to a pretrained Transformer and finetuned.
  • The randomly initialized part starts out as an identity mapping, so the pretrained information is not destroyed.
  • The placement of the residual connections is reportedly crucial.
  • [Pfeiffer+2020]

Excerpt quoted from [Pfeiffer+2020], Section 2.2.1 "Single-Task Adapters": For each of the N tasks, the model is initialized with parameters Θ₀. In addition, a set of new and randomly initialized parameters Φₙ are introduced (the adapter parameters). To share the same set of parameters Θ₀ across all otherwise independent tasks, the parameters in Θ₀ are fixed and only the parameters Φₙ are trained. This makes it possible to efficiently parallelize the training of adapters for all N tasks. The objective for each task n ∈ {1, …, N} is of the form:

    Φₙ ← argmin_Φ Lₙ(Dₙ; Θ₀, Φ)

For common adapter architectures, Φ contains considerably fewer parameters than Θ, e.g., only 3.6% of the parameters of the pre-trained model in (Houlsby et al., 2019).
[Figure 2 of Pfeiffer+2020 is also quoted: different architectural components of the adapter (Feed Forward, Add & Norm, Adapter with FF Down / FF Up, Multi-Head Attention).]
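A minimal sketch of the adapter idea described above (my own illustration, not code from [Pfeiffer+2020]): a small bottleneck module wrapped in a residual connection. Initializing the up-projection to zero makes the module start out as an exact identity mapping, which is one concrete way to realize the "pretrained information is not destroyed" property the slide mentions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        # Start as an identity mapping: with a zero up-projection the adapter's
        # output equals its input, so the pretrained representation is preserved.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(torch.relu(self.down(h)))

# During finetuning only the adapter parameters (Φ) are trained; the pretrained
# Transformer parameters (Θ₀ in the excerpt above) stay fixed.
adapter = Adapter(d_model=768)
h = torch.randn(2, 10, 768)
print(torch.allclose(adapter(h), h))  # True at initialization: identity mapping
```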

Slide 9

Slide 9 text

Prior work on frozen layers ❄
• A randomly initialized, frozen LSTM is a strong sentence encoder [Wieting+2019]
  • Showed that the gap to Skip-thought and InferSent is small.
• Dimensionality reduction via random projections [Sahlgren 2005]
  • The projection onto a subspace approximately preserves distances from the original space (a small numeric sketch of this property follows below).

Table 1 quoted from [Wieting+2019]: "Performance (accuracy for all tasks except SICK-R and STSB, for which we report Pearson's r) on all ten downstream tasks, where all models have 4096 dimensions with the exception of BOE (300) and ST-LN (4800). Standard deviations are shown in parentheses."

Model                       | Dim  | MR       | CR       | MPQA     | SUBJ     | SST2     | TREC      | SICK-R   | SICK-E   | MRPC     | STSB
BOE                         | 300  | 77.3(.2) | 78.6(.3) | 87.6(.1) | 91.3(.1) | 80.0(.5) | 81.5(.8)  | 80.2(.1) | 78.7(.1) | 72.9(.3) | 70.5(.1)
BOREP                       | 4096 | 77.4(.4) | 79.5(.2) | 88.3(.2) | 91.9(.2) | 81.8(.4) | 88.8(.3)  | 85.5(.1) | 82.7(.7) | 73.9(.4) | 68.5(.6)
RandLSTM                    | 4096 | 77.2(.3) | 78.7(.5) | 87.9(.1) | 91.9(.2) | 81.5(.3) | 86.5(1.1) | 85.5(.1) | 81.8(.5) | 74.1(.5) | 72.4(.5)
ESN                         | 4096 | 78.1(.3) | 80.0(.6) | 88.5(.2) | 92.6(.1) | 83.0(.5) | 87.9(1.0) | 86.1(.1) | 83.1(.4) | 73.4(.4) | 74.4(.3)
InferSent-1 (paper, G)      | 4096 | 81.1     | 86.3     | 90.2     | 92.4     | 84.6     | 88.2      | 88.3     | 86.3     | 76.2     | 75.6
InferSent-2 (fixed pad, F)  | 4096 | 79.7     | 84.2     | 89.4     | 92.7     | 84.3     | 90.8      | 88.8     | 86.3     | 76.0     | 78.4
InferSent-3 (fixed pad, G)  | 4096 | 79.7     | 83.4     | 88.9     | 92.6     | 83.5     | 90.8      | 88.5     | 84.1     | 76.4     | 77.3
Δ(InferSent-3, BestRand)    | –    | 1.6      | 3.4      | 0.4      | 0.0      | 0.5      | 2.0       | 2.4      | 1.0      | 2.3      | 2.9
ST-LN                       | 4800 | 79.4     | 83.1     | 89.3     | 93.7     | 82.9     | 88.4      | 85.8     | 79.5     | 73.2     | 68.9
Δ(ST-LN, BestRand)          | –    | 1.3      | 3.1      | 0.8      | 1.1      | -0.1     | 0.5       | -0.3     | -3.6     | -0.9     | -5.5
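A small numpy/scipy sketch (my own illustration) of the distance-preservation claim attributed to random projections above: project high-dimensional points through a random Gaussian matrix and compare pairwise Euclidean distances before and after.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 4096, 256

X = rng.normal(size=(n, d_in))
# Random Gaussian projection, scaled so that squared distances are preserved in expectation.
R = rng.normal(size=(d_in, d_out)) / np.sqrt(d_out)
Y = X @ R

ratio = pdist(Y) / pdist(X)        # ratio of pairwise distances after vs. before projection
print(ratio.mean(), ratio.std())   # mean close to 1.0 with a small spread
```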

Slide 10

Slide 10 text

Experimental setup
• Goal
  • Verify the effect of the Reservoir Transformer.
• Baseline
  • (regular) Transformer
• Tasks
  • Machine translation (WMT, IWSLT) ← due to time constraints, only this part is covered here
  • Language modeling (enwik8)
  • Bidirectional (masked) language modeling, evaluated on MNLI, SST-2, etc.
• Evaluation metrics
  • BLEU (n-gram overlap with reference translations)
  • AUCC (a training-speed counterpart of AUC-ROC; described later)
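As a side note on the metrics, here is a minimal example of computing corpus-level BLEU with the sacrebleu package (my addition; the slide does not specify which scoring tool the paper uses):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["the cat sat on the mat", "he read the book"]
references = [["the cat sat on a mat", "he read the book"]]  # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU, reported as a percentage
```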

Slide 11

Slide 11 text

Evaluation metric: AUCC
We want to measure how reservoir-izing the model affects training speed.
• AUCC: area under the convergence curve
  • computed by integrating the learning curve (validation BLEU vs. training time), truncated at a suitable cutoff
  • larger value → faster convergence (and better performance)
[Plot: validation BLEU vs. training time; the area under the curve, up to the cutoff, is what gets integrated.]
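A minimal sketch of how AUCC could be computed from a logged learning curve (my reading of the slide's definition, not the authors' exact code): truncate the curve at a fixed training-time budget and integrate validation BLEU over wall-clock time with the trapezoidal rule. The checkpoints below are hypothetical numbers for illustration.

```python
import numpy as np

# Hypothetical learning curve: (wall-clock hours, validation BLEU) checkpoints.
time_h = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
bleu   = np.array([5.0, 12.0, 21.0, 27.0, 30.0, 31.0])

def aucc(t, metric, cutoff):
    """Area under the convergence curve up to a fixed training-time cutoff."""
    mask = t <= cutoff
    return np.trapz(metric[mask], t[mask])  # trapezoidal integration of the curve

# A larger AUCC means the model reaches high BLEU earlier
# (faster convergence and/or better performance within the budget).
print(aucc(time_h, bleu, cutoff=8.0))
```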

Slide 12

Slide 12 text

Experimental results: speed
[Figure 8 quoted from the paper: "Validation BLEU AUCC and test BLEU for IWSLT (high is good). Comparison of regular transformer and reservoir transformer with FFN or Transformer reservoir layers added."]
• Depending on the number of encoder layers…
  • when it is small: the baseline appears faster?
  • when it is large: the proposed method appears faster?
• No setting is consistently better.
  • At best, something like a ~10% reduction in training time.
• How many layers are frozen in each curve is a mystery, and so is where the data points are placed.

Slide 13

Slide 13 text

Experimental results: BLEU
• Reservoir layers help when the total number of layers is large?
  • Acting as a form of regularization?
  • May be effective when the model is over-parametrized.
• The honest summary is that performance is essentially unchanged.
[Figure quoted from the paper: validation BLEU AUCC and test BLEU for IWSLT, comparing the regular transformer and the reservoir transformer with FFN or Transformer reservoir layers added; a second, partially visible panel shows the corresponding enwik8 comparison.]
• Again, how many layers are frozen is a mystery.

Slide 14

Slide 14 text

A concern: the baseline
• The shape of the baseline Transformer is odd.
  • N encoder layers paired with a single decoder layer [Kasai et al. 2020]
  • Experiments normally use a 6-layer encoder and a 6-layer decoder.
  • The authors' justification: "we adopted it because it is a fast configuration."
  • This does not feel like a reasonable setting…
  • It is far removed from the encoder-decoder models used in practice.
• Two different notions of "speed" seem to be conflated.
  • [Kasai et al. 2020] is about speeding up decoding (inference).
  • This paper is about speeding up training.
  • Conflating the two in the argument is already rather suspicious…

Slide 15

Slide 15 text

Why does the shape of the baseline matter?
• The experimental findings become hard to trust.
  • Is the baseline's convergence fast enough?
    • Could the 6:6 configuration actually converge faster?
  • Is the baseline's performance high enough?
    • Could the 6:6 configuration actually perform better?
• The findings we actually need are not provided.
  • Is it even possible to apply reservoir layers on the decoder side?

Slide 16

Slide 16 text

Investigating the baseline
Cross-checked against the replication study of [Xu+2021]:
• Is the baseline's convergence fast enough? → probably OK
  • 😃 Making the decoder shallower does make training faster.
  • 😰 But whether it converges faster is unknown.
    • Whether its AUCC value is higher is unknown.
• Is the baseline's performance high enough? → NG
  • 😨 The 6:6 configuration looks like it performs better.
[Results quoted from Xu+2021]

Slide 17

Slide 17 text

Summary (recap)
• Background / problem
  • We want to reduce the time it takes to train a Transformer.
• Idea
  • Freeze some of the Transformer's layers at their random initial values.
  • Skipping their parameter updates saves computation, hence a speedup.
• Contributions
  • Model convergence becomes up to ~30% faster (in some cases).
  • Model performance improves (in some cases).