Reformer: The Efficient Transformer
Scatter Lab Inc. · February 06, 2020
Transcript
Reformer: The Efficient Transformer (ML Research Scientist, Pingpong)
Reformer: The Efficient Transformer — Contents
1. Overview
2. Background
  1. Locality-Sensitive Hashing
  2. Reversible Layer
3. Method
4. Experimental Results and Analysis
1. Overview — Reformer: The Efficient Transformer
Reformer: Why Is It Needed? (1. Overview)
• Why the original Transformer architecture exists: to solve the task of translating from language A to language B
• Input unit: a sequence (512 tokens — roughly a paragraph or a short document)
Scaled Dot-Product Attention (1. Overview)
• The Scaled Dot-Product Attention used in the Transformer
• For each word pair (A, B), it describes how much influence B has on A
  • Query (Q): a vector derived from word A, the one receiving influence
  • Key (K): a vector derived from word B, the one exerting influence
  • Value (V): a vector carrying the magnitude of that influence
• In this case, attention is computed as
  Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
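As a concrete reference for the formula above, here is a minimal NumPy sketch of scaled dot-product attention (single-head, unbatched — a simplification, not the paper's exact implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (length, d_k), V: (length, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (length, length) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

# Example: 4 tokens with 8-dim queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
out = scaled_dot_product_attention(Q, K, V)       # shape (4, 8)
```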
Reformer: Why Is It Needed? (1. Overview)
• A question naturally arises: can we apply the Transformer to larger problems?
  • What if the input unit is a whole document? A book? Some other form of input?
  • In such cases, input sequence lengths are measured in the tens of thousands (e.g., 64K)
• With sequence length 64K, embedding size 1K, and batch size 8, the input alone is 512M values = 2 GB
• Can't we train with 2 GB? A Titan-X has 12 GB → in fact, it does not work
• Why it does not work:
  • Attention grows quadratically with sequence length = the attention matrix alone does not fit in memory
  • If the model has N layers, the activations of every layer must be kept in memory
  • Besides attention, the memory used by the feed-forward network must also be counted
• Can these problems be solved?
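The 2 GB figure on this slide can be checked directly; a quick sketch, assuming 4-byte float32 activations:

```python
# Activation size for the slide's example: 64K tokens, 1K-dim embeddings, batch 8.
seq_len, d_model, batch = 64 * 1024, 1024, 8
n_floats = seq_len * d_model * batch
print(n_floats / 2**20, "M floats")        # 512.0 M values
print(n_floats * 4 / 2**30, "GB")          # 2.0 GB at 4 bytes each

# The attention score matrix alone is seq_len x seq_len per head:
print(seq_len * seq_len * 4 / 2**30, "GB per head")  # 16.0 GB -- does not fit
```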
Reformer's Contributions (1. Overview)
• Solving the problems above:
  • Attention grows quadratically with sequence length = attention alone does not fit in memory
    → Attention need not consider every word pair! It suffices to consider only the related pairs
  • With N layers, the activations of every layer must be kept in memory
    → With a Reversible Layer structure, memory for only one layer is needed
  • Besides attention, the memory used by the feed-forward network must also be counted
    → Applying the feed-forward network only per attention chunk saves memory
2. Background — Reformer: The Efficient Transformer
Locality-Sensitive Hashing — Problem and Concept (2.1 Background)
• Problem: the nearest-neighbor search problem
  • Given a query point Q, we want the closest point X in a dataset (the nearest neighbor)
  • But comparing Q point-wise against every point is too expensive (cost proportional to the dataset size)
• Concept: Locality-Sensitive Hashing
  • Informally: assign each data point (X1, X2, X3, …) a hash value (H(X1), H(X2), H(X3), …) such that
    • nearby points (X1, X2) should collide: H(X1) = H(X2)
    • distant points (X1, X3) should not: H(X1) ≠ H(X3)
  • If hash values can be assigned this way, an X with H(Q) = H(X) can be found quickly
Locality-Sensitive Hashing — Example (2.1 Background)
• Using LSH: postal-code lookup
  • Nearby areas are assigned the same postal code in advance
  • (data point: KD Tower #902, Seongdong-gu, Seoul — hash value: 04766)
  • (data point: Ttukseom Park, Seongdong-gu, Seoul — hash value: 04766)
  • (data point: 99 Olympic-ro, Songpa-gu, Seoul — hash value: 05501)
• To find the restaurant nearest to Ttukseom in Seongdong-gu:
  • collect the addresses sharing Ttukseom's postal code (hash value: 04766)
  • then search for the nearest restaurant among only those addresses
Locality-Sensitive Hashing — Implementations (2.1 Background)
• How LSH is implemented (principle: nearby points should give similar results under a linear transformation)
• Discrete LSH
  • Bit Sampling (1998): use sampled bit indices as the hash value
  • MinHash (1997): under random orderings of the elements, check which element comes first
• Continuous LSH
  • Random Projection (2002): use the sign of the projection onto a random hyperplane as the hash value
  • Angular Distance (2015): project vectors onto a sphere and apply rotations; the angular region each rotated vector lands in is the hash value (??)
Angular LSH (2.1 Background)
• Working through an Angular-Distance-based LSH by hand
• Suppose the dataset contains 2-D embedding vectors X1 = (3, 4) and X2 = (-12, 5)
• Projecting them onto the circle of radius 1: X1' = (3/5, 4/5), X2' = (-12/13, 5/13)
• Rotate the circle and record which quadrant each point lands in: H(X1') = (1, 4, 2), H(X2') = (2, 2, 3)
(diagram: three random rotations of the unit circle, quadrants numbered 1–4)
Angular LSH (2.1 Background)
• Now suppose a 2-D embedding vector for a query, Q = (4, 3); projected, Q' = (4/5, 3/5)
• Applying the same rotations: H(Q') = (1, 4, 2) = H(X1')
• Therefore, in this case, X1 can be retrieved for Q
(diagram: the same three rotations applied to Q, quadrants numbered 1–4)
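A minimal sketch of the rotated-quadrant scheme worked through above, assuming three random rotations in 2-D (the helper name angular_lsh is illustrative):

```python
import numpy as np

def angular_lsh(x, rotations):
    """Hash a 2-D point by the quadrant it lands in under each rotation."""
    x = x / np.linalg.norm(x)                 # project onto the unit circle
    code = []
    for R in rotations:
        rx, ry = R @ x
        quadrant = {(True, True): 1, (False, True): 2,
                    (False, False): 3, (True, False): 4}[(bool(rx >= 0), bool(ry >= 0))]
        code.append(quadrant)
    return tuple(code)

rng = np.random.default_rng(0)
thetas = rng.uniform(0, 2 * np.pi, size=3)
rotations = [np.array([[np.cos(t), -np.sin(t)],
                       [np.sin(t),  np.cos(t)]]) for t in thetas]

X1, X2, Q = np.array([3.0, 4.0]), np.array([-12.0, 5.0]), np.array([4.0, 3.0])
print(angular_lsh(X1, rotations))  # nearby points tend to share codes ...
print(angular_lsh(Q, rotations))   # ... so H(Q) will usually equal H(X1)
print(angular_lsh(X2, rotations))  # while the distant X2 will usually differ
```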
Reversible Residual Network — Problem and Concept (2.2 Background)
• Problem: memory shortage when training a Residual Network
• Residual Network (ResNet, He et al. 2015)
  • A network built from Residual Blocks whose activations have the form y = x + F(x)
  • ResNet, too, must store its intermediate activations for gradient-based computation
• Concept: Reversible Residual Network (Gomez et al. 2017)
  • If activations are written in a paired form, the inputs can be recomputed in the backward pass from the residual block's outputs alone
Reversible Residual Network (2.2 Background)
• Rewrite Y = X + F(X) in paired form, with X = (X1, X2):
  • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• Written this way, X1 and X2 can be recovered from Y1 and Y2:
  • X2 = Y2 − G(Y1), X1 = Y1 − F(X2)
• That is, gradients can be computed from the output values alone → intermediate results are unnecessary
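A minimal NumPy sketch of the paired form above, with toy stand-ins for F and G, showing that the inputs are recoverable from the outputs alone:

```python
import numpy as np

# F and G stand for arbitrary sub-networks; simple fixed functions here.
def F(x): return np.tanh(x)
def G(x): return 0.5 * x

def rev_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # Recover the inputs from the outputs alone -- no stored activations.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = np.random.default_rng(0).normal(size=(2, 4))
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```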
3. Method — Reformer: The Efficient Transformer
Contribution — Revisited (3. Method)
• Recap of the problems and solutions from the overview; first up:
• Attention grows quadratically with sequence length = attention alone does not fit in memory
  → Attention need not consider every word pair! It suffices to consider only the related pairs
Contribution — Revisited (3. Method)
• Attention need not consider every word pair — only the related pairs
• For example, consider a biography of Napoleon:
  • words like "France", "emperor", "Napoleon", and "army" will be related to one another
  • most other word pairs (say, "handshake" and "bread price") will have little relation
Contribution — Revisited (3. Method)
• Computing attention only between similar words should be sufficient
• The question is: how can attention be restricted to similar words only?
  → Apply Locality-Sensitive Hashing to the queries and keys, and attend only to high-similarity pairs
Scaled Dot-Product Attention (3. Method)
• Recap from the overview: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, where Q, K, and V are the query, key, and value vectors derived from each word pair
Scaled Dot-Product Attention — cont. (3. Method)
• Decomposition of Q
  • Q and K have shape (batch_size, length, hidden_dim)
  • Their product has shape (batch_size, length, length) → does not fit in memory
  • Splitting each batch's Q into (q1, q2, …, q_length) makes it fit
  • We give up parallelism, but memory use can drop from O(L²) to O(L):
    Attention(qᵢ, K, V) = softmax(qᵢKᵀ / √d_k) V
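A sketch of this per-query decomposition in NumPy; the loop trades parallelism for O(L) peak memory, as the slide notes:

```python
import numpy as np

def attention_per_query(Q, K, V):
    """Compute attention one query at a time: the full (L, L) score
    matrix is never materialized, so peak memory is O(L), not O(L^2)."""
    d_k = Q.shape[-1]
    out = np.empty((Q.shape[0], V.shape[-1]))
    for i, q in enumerate(Q):                 # sequential -> no parallelism
        s = q @ K.T / np.sqrt(d_k)            # (L,) scores for one query
        s = np.exp(s - s.max())
        out[i] = (s / s.sum()) @ V
    return out
```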
Scaled Dot-Product Attention — cont. (3. Method)
• Applying the Q = K hypothesis (Shared-QK Transformer)
  • The vector describing the influence each word exerts on other words is taken to be the same as the vector describing the influence it receives from them
  • For every word, the projection producing Q and the projection producing K share the same matrix
  • This may sound odd, but experiments show it does not hurt performance
  • Through this, Q and K can be treated as data living in the same space
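In code, sharing Q and K amounts to using one projection matrix for both; a tiny sketch with assumed shapes:

```python
import numpy as np

# Hypothetical shapes: x is (length, d_model); W_qk, W_v are (d_model, d_k).
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 64))
W_qk = rng.normal(size=(64, 32))   # one projection shared by Q and K
W_v  = rng.normal(size=(64, 32))

Q = K = x @ W_qk   # shared-QK: queries and keys live in the same space
V = x @ W_v
```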
LSH Attention (3. Method)
• With Query = Key, the attention pattern over the sequence can be laid out along a single axis
• LSH hash bucketing (mark the queries that share the same hash)
• Sorting by bucket
(diagram: positions q1–q12, first in sequence order, then colored by bucket, then sorted by bucket)
LSH Attention — cont. (3. Method)
• Sorting by bucket
  • Bucket sizes are not uniform, so the sorted sequence is split into chunks of equal size
  • Each position attends to the positions that share its bucket within its own chunk and the immediately preceding chunk
(diagram: sorted positions split into equal chunks, with attention arrows within and across adjacent chunks)
LSH Attention — cont. (3. Method)
• Caveats
  • The standard Transformer lets a position attend to itself; this structure does not
  • When decoding, the Transformer must not look at future indices (mask positions with i > j)
  • A single hash-bucket scheme can fail to group similar items together, so multiple hashes must be used
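Putting the bucketing, sorting, chunking, and masking rules together, here is a simplified single-round LSH attention sketch in NumPy (no batching, no causal masking, and a naive per-position loop — all simplifications relative to the paper; the bucketing follows the paper's random-projection hash argmax([xR; −xR])):

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Angular LSH bucketing via random projection, as in the paper."""
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    h = x @ R
    return np.argmax(np.concatenate([h, -h], axis=-1), axis=-1)

def lsh_attention(qk, v, n_buckets=4, chunk=4, rng=None):
    """Simplified single-round LSH attention. qk: (L, d) shared
    query/key vectors, v: (L, d_v) values."""
    L, d = qk.shape
    rng = rng or np.random.default_rng(0)
    buckets = lsh_buckets(qk, n_buckets, rng)
    order = np.argsort(buckets, kind="stable")    # sort positions by bucket
    out = np.zeros_like(v)
    for s in range(0, L, chunk):
        idx = order[max(0, s - chunk): s + chunk] # own chunk + previous chunk
        for i in order[s: s + chunk]:
            mask = (buckets[idx] == buckets[i]) & (idx != i)  # same bucket, not self
            cand = idx[mask]
            if cand.size == 0:                    # fall back to attending self
                cand = np.array([i])
            sc = qk[i] @ qk[cand].T / np.sqrt(d)
            w = np.exp(sc - sc.max())
            out[i] = (w / w.sum()) @ v[cand]
    return out

rng = np.random.default_rng(1)
qk, v = rng.normal(size=(2, 12, 8))
out = lsh_attention(qk, v)   # positions q1..q12, as in the diagrams above
```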
Memory Complexity Problem (3. Method)
• Memory complexity compared with prior methods (solved!) (n_r: number of hash repetitions, l: sequence length, n_c: number of hash chunks)
• Raising the number of hash chunks aggressively reduces the complexity: the original paper uses 16384
Memory Complexity Problem — cont. (3. Method)
• Memory complexity compared with prior methods (solved???)
• A problem still remains: the complexity contributed by the feed-forward layer
  • Attention activations cost b · n_h · l · d_k · n_l, while feed-forward activations cost b · n_h · l · d_ff · n_l
  • Worse: in the original Transformer this was never the dominant problem, because the l² term dwarfed it…
  • So let's first deal with the d_ff term
Contribution — Revisited (3. Method)
• Next problem: with N layers, the activations of every layer must be kept in memory
  → With a Reversible Layer structure, memory for only one layer is needed
Reversible Transformer (3. Method)
• Reversible block, revisited: Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• The Transformer block structure becomes:
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
• With this structure, only one layer's worth of activation computation needs to be held at a time
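Instantiated in code, this is the reversible pair from section 2.2 with the Transformer's sub-layers plugged in (layer normalization omitted for brevity; function names are illustrative):

```python
# Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
def reversible_transformer_block(x1, x2, attention, feed_forward):
    y1 = x1 + attention(x2)
    y2 = x2 + feed_forward(y1)
    return y1, y2

# In the backward pass, X2 = Y2 - FeedForward(Y1) and X1 = Y1 - Attention(X2)
# are recomputed from the outputs instead of being stored.
```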
Contribution — Revisited (3. Method)
• Last problem: the memory used by the feed-forward network must also be counted
  → Splitting each feed-forward network into chunks saves memory
Chunked Reversible Transformer (3. Method)
• Chunked block computation
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
  • Y2 = [Y2⁽¹⁾; Y2⁽²⁾; …; Y2⁽ᶜ⁾] = [X2⁽¹⁾ + FeedForward(Y1⁽¹⁾); …]
• Done this way, the memory created by d_ff can also be reduced
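Because the feed-forward network is position-wise, splitting along the sequence axis changes peak memory but not the result; a small sketch (the toy FFN weights are illustrative):

```python
import numpy as np

def chunked_feed_forward(y1, ff, n_chunks):
    """Apply a position-wise feed-forward network chunk by chunk: only one
    chunk's d_ff-sized activations exist at a time, and the output is
    identical to applying ff to the whole sequence."""
    return np.concatenate([ff(c) for c in np.array_split(y1, n_chunks, axis=0)])

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 256)), rng.normal(size=(256, 64))
ff = lambda x: np.maximum(x @ W1, 0) @ W2    # toy position-wise FFN
y1 = rng.normal(size=(1024, 64))
assert np.allclose(chunked_feed_forward(y1, ff, 8), ff(y1))
```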
Reformer Time Complexity (3. Method)
• Time complexity of Reformer compared with prior methods (comparison table in the original slide)
4. Experiments and Analysis — Reformer: The Efficient Transformer
Duplication Experiment (4. Experiments and Analysis)
• Setup: for a length-511 string w, generate strings with the pattern 0w0w
• Results for a 1-layer, 4-head, 256-dim model are as follows
• Even a model trained with 1 hash works well when tested with 8 hashes! (the hash count at inference is what matters)
(table: an example sequence 0w0w and the symbols it duplicates)
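A sketch of how such 0w0w data might be generated (vocabulary size and symbol conventions are assumptions):

```python
import numpy as np

def make_duplication_example(length=511, vocab=256, rng=None):
    """One 0w0w sequence for the duplication task: a random string w of
    the given length, prefixed by the 0 symbol and repeated twice."""
    rng = rng or np.random.default_rng(0)
    w = rng.integers(1, vocab, size=length)   # symbols 1..vocab-1, 0 reserved
    return np.concatenate([[0], w, [0], w])   # total length 2 * (511 + 1) = 1024

seq = make_duplication_example()
assert len(seq) == 1024
```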
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode → decode the data and measure bits-per-dim
• Validating the Q=K and Reversible hypotheses → both converge just as well
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode → decode the data and measure bits-per-dim
• Checking the number of hashes → with around 8 or 16 hashes, performance is close to full attention
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode → decode the data and measure bits-per-dim
• Checking performance by number of layers → beyond 6 layers the difference is negligible
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode → decode the data and measure seconds per step
• Speed by number of hashes → Reformer's step time is unaffected by sequence length
Conclusion (4. Experiments and Analysis)
• This paper
  • Reformer applies LSH to attention so that attention is computed only between similar words
  • The experiments validate the hypotheses behind LSH attention, cutting time dramatically while maintaining performance
• What it suggests for us
  • Since it works on the enwik8 data as well, it looks applicable to NLP
  • Even setting Reformer aside, LSH itself is worth trying in the near future
Thank you — Reformer: The Efficient Transformer