$30 off During Our Annual Pro Sale. View Details »
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Reformer: The Efficient Transformer
Search
Scatter Lab Inc.
February 06, 2020
Research
1
2.4k
Reformer: The Efficient Transformer
Scatter Lab Inc.
February 06, 2020
Tweet
Share
More Decks by Scatter Lab Inc.
See All by Scatter Lab Inc.
zeta introduction
scatterlab
0
1.8k
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
scatterlab
0
4.2k
Adversarial Filters of Dataset Biases
scatterlab
0
2.3k
Sparse, Dense, and Attentional Representations for Text Retrieval
scatterlab
0
2.3k
Weight Poisoning Attacks on Pre-trained Models
scatterlab
0
2.2k
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval
scatterlab
0
2.5k
Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
scatterlab
0
2.3k
Open-Retrieval Conversational Question Answering
scatterlab
0
2.3k
What Can Neural Networks Reason About?
scatterlab
0
2.3k
Other Decks in Research
See All in Research
湯村研究室の紹介2025 / yumulab2025
yumulab
0
180
SREのためのテレメトリー技術の探究 / Telemetry for SRE
yuukit
12
2k
若手研究者が国際会議(例えばIROS)でワークショップを企画するメリットと成功法!
tanichu
0
110
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
satai
3
340
AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data
satai
3
470
音声感情認識技術の進展と展望
nagase
0
360
Satellites Reveal Mobility: A Commuting Origin-destination Flow Generator for Global Cities
satai
3
120
AIスパコン「さくらONE」の オブザーバビリティ / Observability for AI Supercomputer SAKURAONE
yuukit
2
890
J-RAGBench: 日本語RAGにおける Generator評価ベンチマークの構築
koki_itai
0
980
さまざまなAgent FrameworkとAIエージェントの評価
ymd65536
1
300
生成AI による論文執筆サポート・ワークショップ ─ サーベイ/リサーチクエスチョン編 / Workshop on AI-Assisted Paper Writing Support: Survey/Research Question Edition
ks91
PRO
0
110
Nullspace MPC
mizuhoaoki
1
390
Featured
See All Featured
Rails Girls Zürich Keynote
gr2m
95
14k
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
253
22k
Leading Effective Engineering Teams in the AI Era
addyosmani
8
1.2k
Speed Design
sergeychernyshev
33
1.3k
KATA
mclloyd
PRO
32
15k
Reflections from 52 weeks, 52 projects
jeffersonlam
355
21k
How STYLIGHT went responsive
nonsquared
100
5.9k
The Pragmatic Product Professional
lauravandoore
36
7k
For a Future-Friendly Web
brad_frost
180
10k
Six Lessons from altMBA
skipperchong
29
4.1k
Building a Modern Day E-commerce SEO Strategy
aleyda
45
8.1k
GraphQLとの向き合い方2022年版
quramy
49
14k
Transcript
Reformer: The Efficient Transformer ҳ࢚ળ (ML Research Scientist, Pingpong)
Reformer : The Efficient Transformer ݾର 1. ѐਃ 2. ߓ҃
ध 1. Locality Sensitive Hashing 2. Reversible Layer 3. ߑߨۿ 4. प Ѿҗ ࠙ࢳ
1. ѐਃ Reformer : The Efficient Transformer
Reformer: ৵ ਃೠо? 1. ѐਃ • ਗې Transformer ҳઑ ઓ
ਬ: য Aীࢲ য B۽ ߣೞח Taskܳ ಽӝ ਤ೧ࢲ • ੑ۱ ױਤ: द௫झ (512ѐ ష, ޙױ ղח ޙࢲ ױਤ)
Scaled Dot-Product Attention 1. ѐਃ • Transformerীࢲ ࢎਊغח Scaled Dot-Product
Attention • п ױযह A, Bী ೧ࢲ Aী ೧ Bо ח оח җ э ӝࣿؼ ࣻ • Query (Q) : ೱਸ ߉ח ױয A۽ࠗఠ աৡ ߸ࣻ • Key (K) : ೱਸ ח ױয B۽ࠗఠ աৡ ߸ࣻ • Value (V): ೱ۱ ӝܳ աఋղח о • ҃ Attention җ э ҅ؽ Attention(Q, K, V) = softmax( QKT dk ) )V
Reformer: ৵ ਃೠо? 1. ѐਃ • ਗې Transformer ҳઑ ઓ
ਬ: য Aীࢲ য B۽ ߣೞח Taskܳ ಽӝ ਤ೧ࢲ • ੑ۱ ױਤ: द௫झ (512ѐ ష, ޙױ ղח ޙࢲ ױਤ) • োझۣѱ ࢤӡ ࣻ ח ࢤӡ ࣻ ח ޙ: ؊ ޙઁীب ਊೡ ࣻ ঋਸө? • ੑ۱ ױਤо ޙࢲ ױਤۄݶ? ӂ ױਤۄݶ? ܲ ഋక ੑ۱ۄݶ? • ҃, ੑ۱ द௫झ ӡо K ױਤীࢲ ਊؽ
Reformer: ৵ ਃೠо? 1. ѐਃ • ੑ۱ द௫झ ӡо 64K,
߬٬ ӝо 1K, ߓࢎૉо 8ݶ ੑ۱ ӝח 512M = 2GB • 2GBݶ ള۲दఆ ࣻ ঋա? Titan-X ҃ 12GB
Reformer: ৵ ਃೠо? 1. ѐਃ • ੑ۱ द௫झ ӡо 64K,
߬٬ ӝо 1K, ߓࢎૉо 8ݶ Ӓ ۽ب 512M = 2GB • 2GBݶ ള۲दఆ ࣻ ঋա? Titan-X ҃ 12GB —> ࢎप ো উ ؽ • উغח ਬ • Attention Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ
Reformer: ৵ ਃೠо? 1. ѐਃ • ੑ۱ द௫झ ӡо 64K,
߬٬ ӝо 1K, ߓࢎૉо 8ݶ Ӓ ۽ب 512M = 2GB • 2GBݶ ള۲दఆ ࣻ ঋա? Titan-X ҃ 12GB —> ࢎप ো উ ؽ • উغח ਬ • Attention Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ
Reformer: ৵ ਃೠо? 1. ѐਃ • ੑ۱ द௫झ ӡо 64K,
߬٬ ӝо 1K, ߓࢎૉо 8ݶ Ӓ ۽ب 512M = 2GB • 2GBݶ ള۲दఆ ࣻ ঋա? Titan-X ҃ 12GB —> ࢎप ো উ ؽ • উغח ਬ • Attention Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ • Attention ࡺ݅ ইפۄ Feed-Forward Networkо ࢎਊೞח ݫݽܻب ٮઉঠೣ
Reformer: ৵ ਃೠо? 1. ѐਃ • ੑ۱ द௫झ ӡо 64K,
߬٬ ӝо 1K, ߓࢎૉо 8ݶ Ӓ ۽ب 512M = 2GB • 2GBݶ ള۲दఆ ࣻ ঋա? Titan-X ҃ 12GB —> ࢎप ো উ ؽ • উغח ਬ • Attention Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ • Attention ࡺ݅ ইפۄ Feed-Forward Networkо ࢎਊೞח ݫݽܻب ٮઉঠೣ • ೧Ѿೡ ࣻ ਸө?
Reformer Contribution 1. ѐਃ • ޙઁ ೧Ѿ • Attention Sequence
ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ
Reformer Contribution 1. ѐਃ • ޙઁ ೧Ѿ • Attention Sequence
ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ • Reversible Layer ҳઑܳ ࢎਊೞݶ ೠ கী ೠ ݫݽܻ݅ ਃೣ
Reformer Contribution 1. ѐਃ • ޙઁ ೧Ѿ • Attention Sequence
ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ • Reversible Layer ҳઑܳ ࢎਊೞݶ ೠ கী ೠ ݫݽܻ݅ ਃೣ • Attention ࡺ݅ ইפۄ Feed-Forward Networkо ࢎਊೞח ݫݽܻب ٮઉঠೣ • п Attention Chunkী ೧ࢲ݅ Feed-Foward Networkܳ ఋݶ ݫݽܻܳ ডೡ ࣻ
2. ߓ҃ ध Reformer : The Efficient Transformer
Locality-Sensitive Hashing - ޙઁ ߂ ѐ֛ 2.1. ߓ҃ ध -
Locality-Sensitive Hashing • ޙઁ : Nearest Neighbor Search Problem • যڃ ؘఠನੋ Qী ೧ࢲ ؘఠನੋ ࣇীࢲ о оө Xܳ Ҋ र (Nearest) • Ӓۧ݅ Point-wiseೞѱ п ನੋٜਸ ࠺Үೞח Ѫ ࠺ਊ ఀ ( ӝী ࠺۹)
Locality-Sensitive Hashing - ޙઁ ߂ ѐ֛ 2.1. ߓ҃ ध -
Locality-Sensitive Hashing • ޙઁ : Nearest Neighbor Search Problem • যڃ ؘఠನੋ Qী ೧ࢲ ؘఠನੋ ࣇীࢲ о оө Xܳ Ҋ र (Nearest) • Ӓۧ݅ Point-wiseೞѱ п ನੋٜਸ ࠺Үೞח Ѫ ࠺ਊ ఀ ( ӝী ࠺۹) • ѐ֛ ࢸݺ: Locality-Sensitive Hashing • ъ ࢸݺ: п ؘఠನੋ(X1, X2, X3, …)ٜী Hash(H(X1), H(X2), H(X3), …)чਸ ࠗৈೞҊ ೣ • оө ؘఠ ನੋٜ(X1, X2)ՙܻח ੌ೮ਵݶ જѷ (H(X1) = H(X2)) • ݢ ؘఠ ನੋٜ (X1, X3)ՙܻח ੌೞ ঋওਵݶ જѷ (H(X1) ≠ H(X3)) • ݅ড Hashчਸ ۧѱ ࠗৈೡ ࣻ ਵݶ H(Q) = H(X)ੋ Xܳ ࡅܰѱ ਸ ࣻ
Locality-Sensitive Hashing द 2.1. ߓ҃ ध - Locality-Sensitive Hashing •
Locality-Sensitive Hashing ࢎਊ द: ಞߣഐ Ѩ࢝ • оө ী ಞߣഐܳ ݢ ݫӣ • (ؘఠ ನੋ: ࢲद ࢿزҳ KDఋਕ 902ഐ, Hash ч: 04766) • (ؘఠ ನੋ: ࢲद ࢿزҳ ڣࢻ ਭҕਗ, Hash ч: 04766) • (ؘఠ ನੋ: ࢲद ࣠ҳ ৢܿ۽ 99, Hash ч: 05501) • ࢿزҳ ڣࢻীࢲ о оө ݍਸ Ҋ रਵݶ, • ڣࢻҗ э ಞߣഐܳ о ٜࣗਸ ୶ܿ (Hash ч: 04766) • Ӓ ٜࣗ ীࢲ оө ݍਸ Ѩ࢝ೞݶ ؽ
Locality-Sensitive Hashing ҳഅ ߑߨ 2.1. ߓ҃ ध - Locality-Sensitive Hashing
• LSH ҳഅ ߑߨ (ਗ: оө গٜՙܻח ࢶഋ߸ജ Ѿҗޛب ࠺तೡ Ѫ) • Discrete LSH • Bit Sampling (1998): ࠺ ੋؙझܳ Hash чਵ۽ ਊ • MinHash (1997): ױয ࣽࢲٜਸ ਵ۽ ࠗৈ೮ਸ ٸ, о ࡅܲ ױযо ח ഛੋ • Continuous LSH • Random Projection (2002): ಣݶী ೠ ࢎ࢚ ࠗഐ ਸ Hash чਵ۽ ਊ • Angular Distance (2015): • ҳഋਵ۽ ࢎ࢚ೠ ߭ఠী ೧ࢲ ഥ ߸ജਸ ೮ਸ ٸ, э пبҵী חоо Hashч (??)
Angular LSH 2.1. ߓ҃ ध - Locality-Sensitive Hashing • ઁ۽
ಽযࠁח Angular Distance ӝ߈ LSH • ؘఠࣇী 2ରਗ ߬٬ ߭ఠ X1 = (3, 4), X2 = (-12, 5) о Ҋ о • ܳ ߈ܴ 1ܻ ҳী ࢎ࢚ೞݶ X1’ = (3/5, 4/5), X2’ = (-12/13, 5/13) • ਗਸ ج۰ࠁݶࢲ ݻ ࢎ࠙ݶী ਤೞח ӝ۾: H(X1’) = (1, 4, 2), H(X2’) = (2, 2, 3) 1 2 3 4 1 2 3 4 1 2 3 4
Angular LSH 2.1. ߓ҃ ध - Locality-Sensitive Hashing • ઁ۽
ಽযࠁח Angular Distance ӝ߈ LSH • ઁ ௪ܻী ೠ 2ରਗ ߬٬ ߭ఠ Q = (4, 3) Ҋ о. ࢎ࢚ೞݶ, Q’ = (4/5, 3/5) • ܳ ڙэ ج۰ࠁݶ H(Q’) = (1, 4, 2) = H(X1) • ٮۄࢲ ҃, Qী ೧ࢲ X1ਸ ਸ ࣻ 1 2 3 4 1 2 3 4 1 2 3 4
Reversible Residual Network - ޙઁ ߂ ѐ֛ 2.2. ߓ҃ ध
- Reversible Residual Network • ޙઁ : Residual Networkীࢲ ള۲द ݫݽܻ ग • Residual Network (ResNet, He et al. 2015) • Activation ഋకо y = x + F(x) ۽ ӝࣿغח Residual Block۽ ܖয Network • ResNet ژೠ gradient ӝ҅ੋ ҅ਸ ਤ೧ࢲח р activation ٜਸ ೧ঠೣ • ѐ֛ ࢸݺ: Reversible Residual Network (Gomez et al. 2017) • Activation Ѿҗܳ ह ഋక۽ ӝࣿೞݶ Residual Block Ѿҗޛ݅ਵ۽ Backward pass۽ ੑ ۱ਸ ҅ೡ ࣻ
Reversible Residual Network 2.2. ߓ҃ ध - Reversible Residual Network
• Y = X + F(X)ী ೧ࢲ ह ഋక۽ ӝࣿ (X = (X1, X2)) • Y1= X1+F(X2), Y2 = X2 + G(Y1) • ۠ धਵ۽ ӝࣿೞח ҃, Y2৬ Y1ਵ۽ࠗఠ X1җ X2ܳ ࠂਗೡ ࣻ • X2 = Y2 - G(Y1), X1 = Y - F(X2) • , Gradient ҅ਸ ۱ч݅ਸ оҊ ೡ ࣻ -> р Ѿҗ ࠛਃೣ
3. ߑߨۿ Reformer : The Efficient Transformer
Contribution - Revisited. 3. ߑߨۿ • ޙઁ ೧Ѿ • Attention
Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ • Reversible Layer ҳઑܳ ࢎਊೞݶ ೠ கী ೠ ݫݽܻ݅ ਃೣ • Attention ࡺ݅ ইפۄ Feed-Forward Networkо ࢎਊೞח ݫݽܻب ٮઉঠೣ • п Feed-Foward Networkܳ Chunk۽ ଂѐݶ ݫݽܻܳ ডೡ ࣻ
Contribution - Revisited. 3. ߑߨۿ • ޙઁ ೧Ѿ • Attention
Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ • ܳ ٜয աಫۨ৪ ਤੋ Ҋ о೧ࠁݶ • یझ, ടઁ, աಫۨ৪, ҵ э ױযח оо ѪҊ • प೯೮, ঈࣻ, ࡈр, যܽ э ױযח оо ਸ Ѫ
Contribution - Revisited. 3. ߑߨۿ • ޙઁ ೧Ѿ • Attention
Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ • ࠺तೠ ױযٜী ೧ࢲ݅ Attentionਸ ߈ೞݶ ࠙ೡ Ѫ • ޙઁח যڌѱ ࠺तೠ ױযٜী ೧ࢲ݅ Attentionਸ ߈ೡ ࣻ ਸ Ѫੋо? • Query৬ Keyٜਸ Locality-Sensitive Hashingೞৈ ਬࢎبо ֫ हਸ ٮ
Scaled Dot-Product Attention 3. ߑߨۿ • Transformerীࢲ ࢎਊغח Scaled Dot-Product
Attention • п ױযह A, Bী ೧ࢲ Aী ೧ Bо ח оח җ э ӝࣿؼ ࣻ • Query (Q) : ೱਸ ߉ח ױয A۽ࠗఠ աৡ ߸ࣻ • Key (K) : ೱਸ ח ױয B۽ࠗఠ աৡ ߸ࣻ • Value (V): ೱ۱ ӝܳ աఋղח о • ҃ Attention җ э ҅ؽ Attention(Q, K, V) = softmax( QKT dk ) )V
Scaled Dot-Product Attention - cont. 3. ߑߨۿ • Decomposition of
Q • Q৬ V Shape: (batch_size, length, hidden_dim) • ف ߸ࣻ ғ shape: (batch_size, length, length) —> ݫݽܻী ٜযо ঋ • п ߓ Qܳ (q1, q2, …. q_length) ۽ ଂѐݶ ݫݽܻী ٜযт ࣻ • ߽۳ࢿਸ ನӝೞ݅, ݫݽܻ ࢎਊ O(L^2) ীࢲ O(L)۽ ੌ ࣻо Attention(qi , K, V) = softmax( qi KT dk ) )V
Scaled Dot-Product Attention - cont. 3. ߑߨۿ • Q =
K оࢸ ਊ (Shared-QK Transformer) • п ױযо ܲ ױযী ח ೱ۱ ߸ࣻח Ӓ ױযо ܲ ױয۽ࠗఠ ߉ח ೱ۱ ߸ࣻ৬ э • п ױযী ೧ࢲ Qܳ ݅٘ח Projectionҗ Kܳ ݅٘ח Projection э ೯۳ਸ ҕਬ • ઑӘ ࢚ೞѱ ٜܾ ࣻ ݅ पઁ प೧ࠄ Ѿҗ ࢿמী ೱਸ ঋ
Scaled Dot-Product Attention - cont. 3. ߑߨۿ • Q =
K оࢸ ਊ (Shared-QK Transformer) • п ױযо ܲ ױযী ח ೱ۱ ߸ࣻח Ӓ ױযо ܲ ױয۽ࠗఠ ߉ח ೱ۱ ߸ࣻ৬ э • п ױযী ೧ࢲ Qܳ ݅٘ח Projectionҗ Kܳ ݅٘ח Projection э ೯۳ਸ ҕਬ • ઑӘ ࢚ೞѱ ٜܾ ࣻ ݅ पઁ प೧ࠄ Ѿҗ ࢿמী ೱਸ ঋ • ߑߨਸ ా೧ࢲ Q৬ Kܳ زੌೠ ҕр ؘఠ۽ рೡ ࣻ
LSH Attention 3. ߑߨۿ • Query = Key۽ Attention Sequenceܳ
ೠ ۽ աఋյ ࣻ • LSH Hash Bucketing (э Hashܳ о Queryՙܻ द) • Sorting by Bucketing q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8
LSH Attention - cont. 3. ߑߨۿ • Sorting by Bucketing
• Bucket ӝо Ӑ١ೞ ঋਵ۽ ੌೠ ӝ۽ Chunking • ߄۽ Chunk৬ ӝ न ࣘೠ Chunkীࢲ नҗ э Bucketਸ о গٜՙܻ Attend q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8 q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8 q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8
LSH Attention - cont. 3. ߑߨۿ • ਬ ࢎ೦ •
ੌ߈ੋ Transformerীࢲח ӝ नਸ Attendೞ݅, ҳઑীࢲח Attend ೞ ঋ • Transformer Decoding दীח ې ੋؙझܳ ࠁ ঋইঠ ೣ (i > j) • ೠ Hash Bucket Schemeਵ۽ Ҁ ঋ ҃о ਵ۽ Multi Hashܳ ॄঠೣ
Memory Complexity Problem 3. ߑߨۿ • ӝઓ ߑߨۿҗ Ӕ ࠂب
࠺Ү (೧Ѿ!) (n_r: Hash ߈ࠂࣻ, l: ӡ, n_c: Hash chunk ࣻ) • Hash chunk ࣻܳ ষաѱ ఃݶ ࠂبܳ ੌ ࣻ : ਗ ֤ޙীࢲח 16384ѐ
Memory Complexity Problem - cont. 3. ߑߨۿ • ӝઓ ߑߨۿҗ
Ӕ ࠂب ࠺Ү (೧Ѿ???) • ৈ ޙઁо : FeedForward Layer ী ೠ ࠂب • बয, • ਗې Transformerীࢲ о ޙઁо উغחؘ l ٸޙী… • ੌױ ࠗఠ ܻܳ ೧ࠁب۾ ೧ࠁ b ⋅ nh ⋅ l ⋅ dk ⋅ nl b ⋅ nh ⋅ l ⋅ df f ⋅ nl df f nl
Contribution - Revisited. 3. ߑߨۿ • ޙઁ ೧Ѿ • Attention
Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ • Reversible Layer ҳઑܳ ࢎਊೞݶ ೠ கী ೠ ݫݽܻ݅ ਃೣ • Attention ࡺ݅ ইפۄ Feed-Forward Networkо ࢎਊೞח ݫݽܻب ٮઉঠೣ • п Feed-Foward Networkܳ Chunk۽ ଂѐݶ ݫݽܻܳ ডೡ ࣻ
Reversible Transformer 3. ߑߨۿ • Reversible Transformer Revisited • Y1=
X1+F(X2), Y2 = X2 + G(Y1) • Transformer Block ҳઑ • Y1 = X1+ Attention (X2), Y2 = X2 + FeedForward(Y1) • ҳઑ۽ۄݶ ೠ ߣী ೠ கঀ Activation ҅ਸ ೞݶ ؽ
Contribution - Revisited. 3. ߑߨۿ • ޙઁ ೧Ѿ • Attention
Sequence ઁғਵ۽ ழ = Attention ݅ਵ۽ب ݫݽܻী ٜযо ঋ • Attention ݽٚ ױয हਸ Ҋ۰ೡ ਃо হ! ҙ۲ ח ह݅ ఋݶ ؽ • ݽ؛ Nகਵ۽ ҳࢿغݶ ೧ க Activationਸ ݫݽܻী ೧ঠೣ • Reversible Layer ҳઑܳ ࢎਊೞݶ ೠ கী ೠ ݫݽܻ݅ ਃೣ • Attention ࡺ݅ ইפۄ Feed-Forward Networkо ࢎਊೞח ݫݽܻب ٮઉঠೣ • п Feed-Foward Networkܳ Chunk۽ ଂѐݶ ݫݽܻܳ ডೡ ࣻ
Chunked Reversible Transformer 3. ߑߨۿ • Chunked Block ো •
Y1 = X1+ Attention (X2), Y2 = X2 + FeedForward(Y1) • Y2 = [Y2(1); Y2(2); … Y2(c)] = [X2(1)+FeedForward(Y1(1)); … ] • ۧѱ ೞݶ ۽ ٜ݅যח ݫݽܻ ࢎਊب ੌ ࣻ df f q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8
Reformer दр ࠂب 3. ߑߨۿ • Reformer Ӕ दр ࠂب
4. प ࠙ࢳ Reformer : The Efficient Transformer
Duplication Experiment 4. प ࠙ࢳ • प ߑߨ: 511ӡ string
w ী ೧ࢲ 0w0w pattern stringਸ generation • 1-layer, 4-head, 256 dim ী ೧ࢲ җ э • Hash 1ѐ۽ ള۲दఅ ݽ؛ب 8ѐ Hash۽ పझೞݶ ੜ ؽ! (Inference Hash іࣻо ਃ) W1 W2 W3 W4 W5 W6 W7 W8 S 0 91 7 48 0 91 7 48 W1 W2 W3 S 91 7 48
Image64 & enwik8 4. प ࠙ࢳ • प ߑߨ:
ؘఠܳ ੋ٬ -> ٣٬ೞҊ bit-per-dimਸ ஏ • Q=K оࢸҗ Reversible оࢸਸ Ѩૐ -> ੜ ࣻ۴ؽ
Image64 & enwik8 4. प ࠙ࢳ • प ߑߨ:
ؘఠܳ ੋ٬ -> ٣٬ೞҊ bit-per-dimਸ ஏ • Hash іࣻܳ ഛੋ -> 8 Hash, 16 Hash ب غݶ Full-Attentionҗ ࢿמ ࠺त
Image64 & enwik8 4. प ࠙ࢳ • प ߑߨ:
ؘఠܳ ੋ٬ -> ٣٬ೞҊ bit-per-dimਸ ஏ • Layer கࣻী ٮܲ ࢿמ ഛੋ -> 6க ࢚ غݶ ࢿמ ରо ঋ
Image64 & enwik8 4. प ࠙ࢳ • प ߑߨ:
ؘఠܳ ੋ٬ -> ٣٬ೞҊ seconds per stepਸ ஏ • Hash іࣻী ٮܲ ࣘب ࢿמ -> Reformerח Sequence ӡী ೱਸ ߉ ঋ
Ѿۿ 4. प ࠙ࢳ • ֤ޙ • Reformerח LSHܳ
Attentionী ਊೞৈ ࠺तೠ ױযٜр Attentionਸ ೡ ࣻ ب۾ ೣ • प Ѿҗ LSHܳ ਤೠ оࢸٜ ૐݺغਵݴ ࢿמਸ ਬೞݶࢲ ࠺ডਵ۽ दрਸ ੌ ࣻ • ܻীѱ दࢎೞח • Wiki8 ؘఠীࢲب ࢎਊೡ ࣻ ח Ѫਸ ࠁওਸ ٸ, NLPীࢲب ഝਊ оמೡ Ѫਵ۽ ݎ • Reformerܳ ߓઁೞ؊ۄب LSHח Ӕदੌী दب೧ࠅ ݅ೣ
хࢎפ Reformer : The Efficient Transformer