Reformer: The Efficient Transformer
Scatter Lab Inc.
February 06, 2020
Transcript
Reformer: The Efficient Transformer (ML Research Scientist, Pingpong)
Reformer: The Efficient Transformer
Contents
1. Overview
2. Background
  1. Locality Sensitive Hashing
  2. Reversible Layer
3. Method
4. Experiments and Analysis
1. Overview (Reformer: The Efficient Transformer)
Why do we need Reformer? (1. Overview)
• The original Transformer architecture exists to solve tasks that map language A to language B (e.g., translation).
• Input unit: a sequence of 512 tokens, on the order of a paragraph or a short document.
Scaled Dot-Product Attention (1. Overview)
• The Transformer uses scaled dot-product attention.
• For each word pair (A, B), the weight that B carries for A can be written in terms of:
  • Query (Q): a vector derived from word A, the word receiving the influence
  • Key (K): a vector derived from word B, the word exerting the influence
  • Value (V): the value carrying the magnitude of that influence
• In this case the attention is computed as
  Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
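As a concrete reference, here is a minimal NumPy sketch of this formula (shapes and variable names are illustrative, not from the paper):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q, K: (length, d_k), V: (length, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # (length, length) pairwise scores
        weights = softmax(scores, axis=-1)   # each query attends over all keys
        return weights @ V                   # weighted sum of the values

    L, d_k, d_v = 8, 16, 16
    Q, K, V = np.random.randn(L, d_k), np.random.randn(L, d_k), np.random.randn(L, d_v)
    out = scaled_dot_product_attention(Q, K, V)   # (L, d_v)

Note the (length, length) score matrix: this is the term that becomes problematic for long sequences.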
Why do we need Reformer? (1. Overview)
• A natural follow-up question: can the same architecture be applied to longer problems?
• What if the input unit is a whole document? A book? Some other long-form input?
• In those cases the input sequence length is measured in units of K (thousands of tokens).
• Example: with sequence length 64K, embedding size 1K, and batch size 8, the input alone is 64K x 1K x 8 = 512M values, about 2GB at 4 bytes each.
• Shouldn't 2GB be trainable? A Titan-X has 12GB. -> In practice it does not work at all.
• Why it fails:
  • Attention grows with the square of the sequence length, so the attention matrix alone does not fit in memory.
  • If the model has N layers, the activations of every layer must be kept in memory for backpropagation.
  • On top of attention, the memory used by the feed-forward networks must also be accounted for.
• Can this be solved?
Reformer Contributions (1. Overview)
• Problems and how Reformer solves them:
  • Attention grows quadratically with the sequence length, so it alone does not fit in memory.
    • Attention does not need to consider every word pair! It is enough to attend only over relevant pairs.
  • With an N-layer model, every layer's activations must be stored in memory.
    • With a reversible layer structure, memory for only a single layer is required.
  • Beyond attention, the feed-forward networks also consume a lot of memory.
    • Running the feed-forward network over each attention chunk separately keeps the memory bounded.
2. Background (Reformer: The Efficient Transformer)
Locality-Sensitive Hashing: problem and idea (2.1 Background: Locality-Sensitive Hashing)
• Problem: the nearest neighbor search problem.
  • For a query data point Q, we want to find the closest point X in a set of data points (the nearest neighbor).
  • But comparing Q point-wise against every point is expensive (cost proportional to the size of the set).
• Idea: Locality-Sensitive Hashing.
  • Roughly, assign a hash value (H(X1), H(X2), H(X3), ...) to each data point (X1, X2, X3, ...).
  • Nearby data points (X1, X2) should get the same hash: H(X1) = H(X2).
  • Distant data points (X1, X3) should get different hashes: H(X1) ≠ H(X3).
  • If hash values can be assigned this way, the X with H(Q) = H(X) can be found quickly.
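A minimal sketch of the lookup side of this idea, with a toy stand-in hash function (real LSH functions are discussed below; the bucket width and point values are illustrative):

    from collections import defaultdict

    # toy stand-in for an LSH function: bucket 1-D points by intervals of width 10
    lsh_hash = lambda x: int(x // 10)

    def build_buckets(points):
        buckets = defaultdict(list)
        for i, x in enumerate(points):
            buckets[lsh_hash(x)].append(i)   # group point indices by hash value
        return buckets

    def candidates(query, buckets):
        # only points sharing the query's bucket need an exact comparison
        return buckets.get(lsh_hash(query), [])

    pts = [3.0, 7.5, 42.0, 45.5, 81.0]
    print(candidates(44.0, build_buckets(pts)))   # -> [2, 3], the two points near 44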
Locality-Sensitive Hashing: an example (2.1 Background: Locality-Sensitive Hashing)
• An everyday analogue of LSH: postal codes.
  • Nearby addresses receive the same postal code.
  • (data point: KD Tower #902, Seongdong-gu, Seoul; hash value: 04766)
  • (data point: Ttukseom Park, Seongdong-gu, Seoul; hash value: 04766)
  • (data point: 99 Olympic-ro, Songpa-gu, Seoul; hash value: 05501)
• To find the restaurant closest to Ttukseom in Seongdong-gu,
  • narrow the search to places with the same postal code (hash value: 04766),
  • then look for the nearest restaurant only among those places.
Locality-Sensitive Hashing: implementations (2.1 Background: Locality-Sensitive Hashing)
• How LSH is implemented (principle: nearby neighbors should still look similar after a linear transformation):
• Discrete LSH
  • Bit Sampling (1998): use a subset of bit positions as the hash value.
  • MinHash (1997): assign random orderings to the items and check which item appears first.
• Continuous LSH
  • Random Projection (2002): use the sign of the projection onto a random hyperplane as the hash value.
  • Angular Distance (2015): project vectors onto a sphere, apply random rotations, and use the angular section each rotated vector lands in as the hash value.
Angular LSH (2.1 Background: Locality-Sensitive Hashing)
• A worked example of the angular-distance-based LSH that Reformer builds on.
• Suppose the data set contains two 2D embedding vectors, X1 = (3, 4) and X2 = (-12, 5).
  • Projecting them onto the unit circle gives X1' = (3/5, 4/5) and X2' = (-12/13, 5/13).
  • Rotating the circle randomly a few times and recording which quadrant each vector lands in gives, e.g., H(X1') = (1, 4, 2) and H(X2') = (2, 2, 3).
• Now take a 2D query embedding Q = (4, 3). Projected onto the circle, Q' = (4/5, 3/5).
  • Applying the same rotations gives H(Q') = (1, 4, 2) = H(X1').
  • Therefore X1 can be retrieved as the (approximate) nearest neighbor of Q.
(Figure: unit circle with quadrants numbered 1-4 under three random rotations.)
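A small sketch in this spirit: normalize each vector onto the unit circle, apply a few random rotations, and record which quadrant each rotated vector lands in. The number of rotations and the bucket encoding are illustrative choices, not the paper's exact scheme:

    import numpy as np

    def quadrant(v):
        # quadrants of the 2D plane, numbered 1..4 counter-clockwise
        x, y = v
        if x >= 0 and y >= 0: return 1
        if x < 0 and y >= 0: return 2
        if x < 0 and y < 0: return 3
        return 4

    def random_rotation_2d(rng):
        theta = rng.uniform(0, 2 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s], [s, c]])

    def angular_lsh(v, rotations):
        v = v / np.linalg.norm(v)                  # project onto the unit circle
        return tuple(quadrant(R @ v) for R in rotations)

    rng = np.random.default_rng(0)
    rotations = [random_rotation_2d(rng) for _ in range(3)]
    x1, x2, q = np.array([3.0, 4.0]), np.array([-12.0, 5.0]), np.array([4.0, 3.0])
    # across many rotations, nearby vectors (x1 and q) agree on far more hash
    # components than distant ones (x2)
    print(angular_lsh(x1, rotations), angular_lsh(x2, rotations), angular_lsh(q, rotations))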
Reversible Residual Network: problem and idea (2.2 Background: Reversible Residual Network)
• Problem: residual networks run out of memory during training.
  • Residual Network (ResNet, He et al. 2015): a network built from residual blocks whose activations are written as y = x + F(x).
  • For gradient computation, ResNet must also store the intermediate activations of every block.
• Idea: the Reversible Residual Network (Gomez et al. 2017).
  • If activations are written in a paired form, the inputs can be recomputed in the backward pass from the block outputs alone.
• Write Y = X + F(X) in paired form, with X = (X1, X2):
  • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• Written this way, X1 and X2 can be recovered from Y1 and Y2:
  • X2 = Y2 - G(Y1), X1 = Y1 - F(X2)
• Therefore the gradient can be computed while holding only the output values -> the intermediate results are not needed.
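A minimal sketch of the forward and inverse computations, with F and G as arbitrary example functions:

    import numpy as np

    def rev_forward(x1, x2, F, G):
        y1 = x1 + F(x2)
        y2 = x2 + G(y1)
        return y1, y2

    def rev_inverse(y1, y2, F, G):
        # recover the inputs from the outputs alone; no stored activations needed
        x2 = y2 - G(y1)
        x1 = y1 - F(x2)
        return x1, x2

    F = lambda x: np.tanh(x)
    G = lambda x: 0.5 * x
    x1, x2 = np.random.randn(4), np.random.randn(4)
    y1, y2 = rev_forward(x1, x2, F, G)
    r1, r2 = rev_inverse(y1, y2, F, G)
    assert np.allclose(x1, r1) and np.allclose(x2, r2)   # exact reconstruction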
3. Method (Reformer: The Efficient Transformer)
Contribution, revisited (3. Method)
• Problem: attention grows quadratically with the sequence length, so it alone does not fit in memory.
• Solution: attention does not need to consider every word pair; it only has to attend over relevant pairs.
  • For example, in a biography of Napoleon, attention between words such as Russia, emperor, and Napoleon would be large, while attention involving unrelated everyday words would be small.
• So it should be enough to attend only between similar words.
• The question is then: how can attention be restricted to similar words in the first place?
  • Answer: hash the queries and keys with Locality-Sensitive Hashing and attend only over pairs with high similarity.
Scaled Dot-Product Attention, recap (3. Method)
• Recall the Transformer's scaled dot-product attention over word pairs (A, B):
  • Query (Q): from the word receiving the influence; Key (K): from the word exerting it; Value (V): the value whose magnitude is weighted.
  • Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Scaled Dot-Product Attention, cont. (3. Method)
• Decomposition of Q
  • Q and K have shape (batch_size, length, hidden_dim).
  • Their product has shape (batch_size, length, length) -> it does not fit in memory.
  • If, within each batch element, Q is split into (q1, q2, ..., q_length), the computation fits in memory.
  • Parallelism is given up, but memory usage drops from O(L^2) to O(L):
  Attention(q_i, K, V) = softmax(q_i K^T / sqrt(d_k)) V
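A sketch of this decomposition: computing attention one query row at a time never materializes the (length, length) matrix, at the cost of a sequential loop (batch dimension omitted; names are illustrative):

    import numpy as np

    def attention_per_query(Q, K, V):
        d_k = Q.shape[-1]
        out = np.empty((Q.shape[0], V.shape[-1]))
        for i, q in enumerate(Q):                # one row of scores at a time: O(L) memory
            scores = q @ K.T / np.sqrt(d_k)      # (length,)
            scores -= scores.max()               # numerical stability
            w = np.exp(scores)
            w /= w.sum()
            out[i] = w @ V
        return out

    L, d = 6, 8
    Q, K, V = np.random.randn(L, d), np.random.randn(L, d), np.random.randn(L, d)
    out = attention_per_query(Q, K, V)   # matches the full (L, L) formulation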
Scaled Dot-Product Attention, cont. (3. Method)
• Adopting the Q = K assumption (Shared-QK Transformer)
  • The vector through which a word influences other words is taken to be the same as the vector through which it receives influence from them.
  • Concretely, the projection that produces Q and the projection that produces K share the same weight matrix for every word.
  • This may seem a bit strange, but in the experiments it does not hurt performance.
  • With this change, Q and K can be treated as data points in the same space, so the same hash buckets apply to both.
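A tiny sketch of the shared-QK idea: one projection matrix produces both the queries and the keys, so hashing the queries automatically buckets the keys as well (weights here are random placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    L, d_model, d_k = 8, 32, 16
    X = rng.standard_normal((L, d_model))        # token representations
    W_qk = rng.standard_normal((d_model, d_k))   # single matrix shared by Q and K
    W_v = rng.standard_normal((d_model, d_k))

    Q = X @ W_qk
    K = X @ W_qk      # identical to Q, so one LSH pass covers both
    V = X @ W_v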
LSH Attention (3. Method)
• With Query = Key, the attention sequence can be laid out along a single ordering.
• LSH hash bucketing: group queries that share the same hash value.
• Sorting by bucket: reorder the sequence so that queries in the same bucket become adjacent.
(Figure: positions q1..q12 colored by bucket, before and after sorting into q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8.)
LSH Attention, cont. (3. Method)
• Sorting by bucket, then chunking
  • Bucket sizes are not uniform, so the sorted sequence is split into chunks of equal size.
  • Each position attends to positions with the same bucket in its own chunk and in the immediately preceding chunk.
(Figure: the sorted sequence q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8 split into equal chunks, with same-bucket attention within and across adjacent chunks.)
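A schematic sketch of the bucketing and chunking steps, operating on hash values only (the attention weights themselves, and the masking of self and future positions discussed next, are omitted; the chunk size and hash values are illustrative):

    import numpy as np

    def lsh_chunks(hashes, chunk_size):
        # sort positions by bucket so that same-bucket queries become adjacent,
        # then split the sorted order into equal-sized chunks
        order = np.argsort(hashes, kind="stable")
        return [order[i:i + chunk_size] for i in range(0, len(order), chunk_size)]

    def attend_targets(chunks, hashes):
        # each position may attend to same-bucket positions in its own chunk
        # and in the immediately preceding chunk
        targets = {}
        for c, chunk in enumerate(chunks):
            prev = chunks[c - 1] if c > 0 else np.array([], dtype=int)
            pool = np.concatenate([prev, chunk])
            for i in chunk:
                targets[int(i)] = [int(j) for j in pool if hashes[j] == hashes[i]]
        return targets

    hashes = np.array([0, 1, 2, 0, 1, 0, 1, 2, 0, 0, 1, 1])   # bucket ids for q1..q12
    print(attend_targets(lsh_chunks(hashes, chunk_size=4), hashes))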
LSH Attention, cont. (3. Method)
• Caveats
  • In a standard Transformer a position attends to itself, but in this structure it does not.
  • For Transformer decoding, future positions must not be visible (a query at position i may only attend to positions j with i > j).
  • A single hash scheme can miss pairs that should fall into the same bucket, so multiple rounds of hashing (multi-hash) must be used.
Memory Complexity Problem (3. Method)
• Approximate memory cost compared with the baseline (solved!)
  • (n_r: number of hash rounds, l: sequence length, n_c: number of hash chunks)
  • If the number of hash chunks is made reasonably large, the cost drops; the original paper uses 16384 chunks.
Memory Complexity Problem, cont. (3. Method)
• Approximate memory cost compared with the baseline (solved???)
• One problem remains: the memory used by the feed-forward layers.
  • The attention activations cost roughly b * n_h * l * d_k * n_l, while the feed-forward layers add b * n_h * l * d_ff * n_l, and d_ff is much larger than d_k.
  • To be fair, this was not the dominant problem in the original Transformer, because the l^2 attention term was.
• Let's deal with the n_l factor (the layer count) first.
Contribution, revisited (3. Method)
• Problem: with an N-layer model, every layer's activations must be stored in memory.
• Solution: with a reversible layer structure, memory for only a single layer is required.
Reversible Transformer (3. Method)
• Reversible residual coupling, revisited: Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• Applied to a Transformer block:
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
• With this structure, activations only ever need to be held for one layer at a time.
Contribution, revisited (3. Method)
• Problem: beyond attention, the feed-forward networks also consume a lot of memory.
• Solution: splitting each feed-forward computation into chunks keeps the memory bounded.
Chunked Reversible Transformer (3. Method)
• Chunked block computation
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
  • Y2 = [Y2(1); Y2(2); ...; Y2(c)] = [X2(1) + FeedForward(Y1(1)); ...]
  • Computing Y2 chunk by chunk in this way also reduces the memory used by the d_ff-sized hidden layer.
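A sketch of the chunked feed-forward computation: the FFN is position-wise, so applying it to slices of the sequence and concatenating gives the exact same result while only materializing one chunk's d_ff-sized hidden activations at a time (layer weights are random placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    L, d_model, d_ff, chunk = 12, 16, 64, 4
    W1 = rng.standard_normal((d_model, d_ff))
    W2 = rng.standard_normal((d_ff, d_model))

    def feed_forward(x):
        return np.maximum(x @ W1, 0) @ W2        # simple position-wise ReLU FFN

    Y1 = rng.standard_normal((L, d_model))
    chunked = np.concatenate([feed_forward(Y1[i:i + chunk]) for i in range(0, L, chunk)])
    assert np.allclose(chunked, feed_forward(Y1))   # chunking is exact, only cheaper in memory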
Reformer time complexity (3. Method)
• Approximate time complexity of Reformer (comparison table in the original slide).
4. Experiments and Analysis (Reformer: The Efficient Transformer)
Duplication Experiment (4. Experiments and Analysis)
• Setup: for a string w of length 511, generate sequences with the pattern 0w0w.
• A 1-layer, 4-head, 256-dim model gives the results shown in the slide.
• A model trained with a single hash round still works well when evaluated with 8 hash rounds! (The number of hash rounds at inference time is what matters.)
(Figure: the 0w0w copy task, where the second copy of w must be predicted from the first.)
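A sketch of how such 0w0w sequences could be generated (the vocabulary range and symbol choices are illustrative; the paper's exact tokenization may differ):

    import numpy as np

    def make_duplication_example(rng, w_len=511, vocab=256):
        # symbols 1..vocab-1 form w; 0 acts as the separator, giving the 0w0w pattern
        w = rng.integers(1, vocab, size=w_len)
        return np.concatenate([[0], w, [0], w])   # total length 2 * (w_len + 1) = 1024

    rng = np.random.default_rng(0)
    x = make_duplication_example(rng)
    # the model is trained to predict the second copy of w from the first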
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode and decode the data and measure bits-per-dim.
• Verifying the Q=K and reversible-layer assumptions -> both converge comparably to the baseline.
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode and decode the data and measure bits-per-dim.
• Effect of the number of hash rounds -> with 8 or 16 hash rounds, performance is comparable to full attention.
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode and decode the data and measure bits-per-dim.
• Performance by number of layers -> beyond 6 layers the differences are small.
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode and decode the data and measure seconds per step.
• Speed as a function of the number of hash rounds -> Reformer's step time is not affected by the sequence length.
Conclusion (4. Experiments and Analysis)
• About the paper
  • Reformer applies LSH to attention so that attention is computed only between similar words.
  • The experiments validate the assumptions behind LSH, showing that performance is maintained while time is reduced dramatically.
• What it suggests to us
  • Since it works on the Wiki8 (enwik8) data, it should also be applicable to NLP tasks.
  • Even setting Reformer itself aside, LSH is a technique worth trying in the near future.
Thank you. (Reformer: The Efficient Transformer)