Sheng Shen†, Alexei Baevski‡, Ari S. Morcos‡, Kurt Keutzer†, Michael Auli‡, Douwe Kiela‡
†UC Berkeley; ‡Facebook AI Research
sheng.s@berkeley.edu, dkiela@fb.com

Abstract: "We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. What is more, we find that freezing layers may actually improve performance. Beyond desirable efficiency gains, random layers are interesting for several additional reasons."

※ Unless otherwise noted, figures and tables are quoted from the paper.
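To make the setup above concrete, here is a minimal PyTorch sketch of keeping some transformer layers at their random initialization and excluding them from training. This is an illustrative sketch, not the authors' implementation; the encoder size and the choice of which layers to freeze are arbitrary.

```python
import torch.nn as nn

# Small Transformer encoder; sizes are arbitrary for illustration.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8),
    num_layers=4,
)

# "Reservoir" layers: leave e.g. layers 1 and 2 at their random
# initialization and never update them during training.
frozen_ids = {1, 2}
for i, layer in enumerate(encoder.layers):
    if i in frozen_ids:
        for p in layer.parameters():
            p.requires_grad_(False)

# The optimizer only receives the remaining trainable parameters.
trainable_params = [p for p in encoder.parameters() if p.requires_grad]
```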
Transformers are hugely popular in language, vision, and speech.
• The vanilla setup is hard to beat
  • There is almost no difference between the vanilla Transformer and its many variants [Narang+2021]
  • About the only thing that has changed since the original is the position of layer normalization?
• Scaling laws [Kaplan+2020] (the power-law form is spelled out below the figure note)
  • More data → better performance
  • Bigger models → better performance
https://ruder.io/research-highlights-2020/

[Figure from Kaplan+2020: "Language modeling performance improves smoothly as we increase the model size, dataset size, and amount of compute used for training. For optimal performance all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two." Axes: test loss vs. compute (PF-days, non-embedding), dataset size (tokens), and parameters (non-embedding).]
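As a reminder of what "power-law relationship" means here, the scaling laws in Kaplan+2020 take roughly the following shape; the critical constants and exponents are empirical fits (the reported exponents are on the order of 0.05–0.1).

```latex
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}
```

Here L is the test loss, N the number of non-embedding parameters, D the dataset size in tokens, and C the training compute, each taken in the regime where it is the only bottleneck.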
Is it really okay to just freeze layers at their (random) initial values?
• If the norm of […] is larger than the norm of the frozen layers ❄, can the frozen layers effectively be ignored?
• A similar(?) technique: Adapters (a minimal sketch follows the figure note below)
  • Small modules are added to a pretrained Transformer and then finetuned
  • The randomly initialized part starts out as an identity mapping, so the information in the pretrained model is not destroyed
  • The position of the residual connection is said to be very important
  • [Pfeiffer+2020]

From Pfeiffer+2020, Section 2.2.1 (Single-Task Adapters): For each of the N tasks, the model is initialized with parameters $\Theta_0$. In addition, a set of new and randomly initialized parameters $\Phi_n$ are introduced (the adapter parameters). To share the same set of parameters $\Theta_0$ across all otherwise independent tasks, the parameters in $\Theta_0$ are fixed and only the parameters $\Phi_n$ are trained. This makes it possible to efficiently parallelize the training of adapters for all N tasks. The objective for each task $n \in \{1, \dots, N\}$ is of the form:

$$\Phi_n \leftarrow \operatorname*{argmin}_{\Phi} L_n(D_n; \Theta_0, \Phi)$$

For common adapter architectures, $\Phi$ contains considerably fewer parameters than $\Theta$, e.g., only 3.6% of the parameters of the pre-trained model in (Houlsby et al., 2019).

[Figure 2 from Pfeiffer+2020: different architectural components of the adapter (Multi-Head Attention, Add & Norm, Feed Forward, FF Down / FF Up bottleneck, LayerNorm) and the possible placements of the adapter within the Transformer layer.]
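To make the bullet points above concrete, here is a minimal bottleneck-adapter sketch in PyTorch. The module name, bottleneck width, activation, and near-zero initialization are illustrative assumptions, not the exact Houlsby et al. or Pfeiffer et al. configuration; the point is the down/up projection with a residual connection that starts out close to an identity mapping.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added residually."""

    def __init__(self, d_model: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)
        # Start the up-projection at zero so the whole adapter acts as an
        # identity mapping at the beginning of finetuning and does not
        # disturb the pretrained representations.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual connection: output = input + small learned correction.
        return h + self.up(self.act(self.down(h)))

# During finetuning, the pretrained parameters (Theta_0) stay frozen and
# only the adapter parameters (Phi_n) are trained.
```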
In practice, performance does indeed seem to be almost the same.

[Clipped figure captions from the paper: comparison of a regular transformer and a reservoir transformer with FFN or Transformer (reservoir) layers added, reporting validation BLEU AUCC and test BLEU; a corresponding comparison on enwik8.]

(It is unclear from the figure which layers are actually being frozen.)