Reformer: The Efficient Transformer
Scatter Lab Inc.
February 06, 2020
Transcript
Reformer: The Efficient Transformer
Sangjun Koo (구상준), ML Research Scientist, Pingpong
Reformer: The Efficient Transformer - Contents
1. Overview
2. Background
   2.1. Locality-Sensitive Hashing
   2.2. Reversible Layer
3. Method
4. Experimental Results and Analysis
1. Overview
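Reformer: Why is it needed? (1. Overview)
• Why the original Transformer architecture exists: to solve the task of translating language A into language B
• Input unit: a sequence (512 tokens, a paragraph or a document)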
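Scaled Dot-Product Attention (1. Overview)
• The Scaled Dot-Product Attention used in the Transformer
• For each word pair A, B, the weight that B contributes to A can be described as follows
  • Query (Q): the variable derived from word A, which receives the influence
  • Key (K): the variable derived from word B, which gives the influence
  • Value (V): the weight that expresses the magnitude of the influence
• In this case attention is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$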
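As a minimal sketch of the formula on this slide, the function below computes scaled dot-product attention for a single sequence and a single head using plain Python lists. The helper names (`softmax`, `attention`) and the toy matrices are assumptions made for illustration, not code from the talk or the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Q: L_q x d_k, K: L_k x d_k, V: L_k x d_v (one sequence, one head).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy inputs (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`, weighted toward the value whose key best matches the query.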
Reformer: Why is it needed? (1. Overview, cont.)
• A question that arises naturally: could this be applied to larger problems as well?
  • What if the input unit is a whole document? A whole book? A different kind of input?
  • In that case, the input sequence length is on the order of tens of thousands (K) of tokens
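Reformer: Why is it needed? (1. Overview, cont.)
• With an input sequence length of 64K, an embedding size of 1K and a batch size of 8, the input alone is 512M floats = 2GB
• Can't 2GB be trained? A Titan-X has 12GB -> in practice it does not work at all
• Why it does not work
  • Attention grows with the square of the sequence length, so attention alone does not fit in memory
  • If the model has N layers, the activations of every layer must be kept in memory
  • Besides attention, the memory used by the Feed-Forward Network must also be counted
• Can this be solved?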
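The arithmetic behind these numbers can be checked with a short sketch. It assumes 32-bit floats and binary units (1K = 1024), which the slide does not state explicitly; the variable names are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumption: activations stored as 32-bit floats

seq_len = 64 * 1024    # 64K tokens
embed_dim = 1024       # 1K embedding size
batch_size = 8

# Input activations: one float per (batch, token, embedding) entry.
input_floats = seq_len * embed_dim * batch_size
input_gib = input_floats * BYTES_PER_FLOAT32 / 2**30

# One attention matrix for a single sequence and a single head: L x L floats.
attn_floats = seq_len * seq_len
attn_gib = attn_floats * BYTES_PER_FLOAT32 / 2**30

print(f"input: {input_floats // 2**20}M floats = {input_gib:.0f} GiB")
print(f"one L x L attention matrix: {attn_gib:.0f} GiB")
```

Under these assumptions a single L x L attention matrix already exceeds the 12GB of the Titan-X mentioned on the slide, before counting multiple heads, layers or the batch dimension.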
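Reformer Contribution (1. Overview)
• Solving the problems
  • Attention grows with the square of the sequence length, so attention alone does not fit in memory
    • Attention does not need to consider every word pair; attending only to related pairs is enough
  • If the model has N layers, the activations of every layer must be kept in memory
    • With a Reversible Layer structure, only the memory for a single layer is needed
  • Besides attention, the memory used by the Feed-Forward Network must also be counted
    • Running the Feed-Forward Network one attention chunk at a time saves memory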
2. Background
Locality-Sensitive Hashing - Problem and Concept (2.1. Background - Locality-Sensitive Hashing)
• Problem: Nearest Neighbor Search
  • Given a data point Q, we want to pick the closest point X from a set of data points (Nearest)
  • Comparing Q against every point one by one is expensive (cost proportional to the size of the set)
• Concept: Locality-Sensitive Hashing
  • Rough idea: assign hash values (H(X1), H(X2), H(X3), …) to the data points (X1, X2, X3, …)
  • Nearby data points (X1, X2) should collide (H(X1) = H(X2))
  • Distant data points (X1, X3) should not collide (H(X1) ≠ H(X3))
  • If hash values can be assigned this way, an X with H(Q) = H(X) can be found quickly
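Locality-Sensitive Hashing Example (2.1. Background - Locality-Sensitive Hashing)
• An everyday use of Locality-Sensitive Hashing: postal code search
• Postal codes are assigned in advance so that nearby areas share a code
  • (Data point: Room 902, KD Tower, Seongdong-gu, Seoul; hash value: 04766)
  • (Data point: a park in Ttukseom, Seongdong-gu, Seoul; hash value: 04766)
  • (Data point: 99 Olympic-ro, Songpa-gu, Seoul; hash value: 05501)
• To pick the closest good restaurant from Ttukseom, Seongdong-gu:
  • Narrow down to the addresses that share Ttukseom's postal code (hash value: 04766)
  • Search for the closest restaurant among those addresses only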
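Locality-Sensitive Hashing Implementations (2.1. Background - Locality-Sensitive Hashing)
• How LSH is implemented (principle: points that are close should also give similar results after a linear transformation)
• Discrete LSH
  • Bit Sampling (1998): use sampled bit indices as the hash value
  • MinHash (1997): assign a random ordering to the words and check which word comes first
• Continuous LSH
  • Random Projection (2002): use the sign of the projection onto a plane as the hash value
  • Angular Distance (2015):
    • project the vector onto a sphere, apply a random rotation, and use the angular region it falls into as the hash value (??)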
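The Random Projection entry above can be sketched in a few lines: each random hyperplane contributes one bit, namely the sign of the projection onto its normal vector. This is an illustrative sketch; the function names, the number of planes and the seed are assumptions, not details from the talk.

```python
import random

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random normal vectors, one per hyperplane (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def sign_hash(x, planes):
    """One bit per hyperplane: 1 if x lies on the non-negative side, else 0."""
    return tuple(int(sum(a * b for a, b in zip(x, p)) >= 0) for p in planes)

planes = make_hyperplanes(n_planes=8, dim=2)
h_x = sign_hash((3.0, 4.0), planes)
h_scaled = sign_hash((6.0, 8.0), planes)     # same direction, different length
h_opposite = sign_hash((-3.0, -4.0), planes)  # opposite direction
```

Because only the sign is kept, vectors pointing in the same direction always share a hash, while a vector pointing the opposite way differs in every bit.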
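Angular LSH (2.1. Background - Locality-Sensitive Hashing)
• Angular-distance-based LSH, worked through an example
• Assume the dataset contains the 2-dimensional embedding vectors X1 = (3, 4) and X2 = (-12, 5)
• Projecting them onto the sphere of radius 1 gives X1' = (3/5, 4/5) and X2' = (-12/13, 5/13)
• Rotate the circle at random and record which quadrant each point lands in: H(X1') = (1, 4, 2), H(X2') = (2, 2, 3)
[Figure: three randomly rotated unit circles, each with quadrants labelled 1 to 4]
Angular LSH (2.1. Background - Locality-Sensitive Hashing, cont.)
• Assume the example query has the 2-dimensional embedding vector Q = (4, 3). After projection, Q' = (4/5, 3/5)
• Rotating it in the same way gives H(Q') = (1, 4, 2) = H(X1')
• So in this case X1 is found for the query Q
[Figure: the same three rotated unit circles, each with quadrants labelled 1 to 4]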
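The worked example can be reproduced with a small sketch. The slides do not give the rotation angles, so the angles below (0, 300 and 90 degrees, counter-clockwise) are an assumption, picked because they yield the hash values shown on the slides; all function names are illustrative.

```python
import math

def normalize(v):
    """Project a 2-D vector onto the unit circle."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def rotate(v, degrees):
    """Rotate a 2-D vector counter-clockwise by the given angle."""
    t = math.radians(degrees)
    x, y = v
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

def quadrant(v):
    """Quadrant number 1..4 of a 2-D point."""
    x, y = v
    if x >= 0 and y >= 0:
        return 1
    if x < 0 and y >= 0:
        return 2
    if x < 0 and y < 0:
        return 3
    return 4

# Hypothetical rotations, chosen to match the hash values on the slides.
ANGLES = (0, 300, 90)

def angular_hash(v, angles=ANGLES):
    """One quadrant number per rotation of the normalized vector."""
    u = normalize(v)
    return tuple(quadrant(rotate(u, a)) for a in angles)

h_x1 = angular_hash((3, 4))
h_x2 = angular_hash((-12, 5))
h_q = angular_hash((4, 3))
```

With these angles the query collides with X1 on all three rotations and with X2 on none, which is the behaviour the slides describe.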
Reversible Residual Network - Problem and Concept (2.2. Background - Reversible Residual Network)
• Problem: memory consumption when training a Residual Network
  • Residual Network (ResNet, He et al. 2015)
    • A network built from Residual Blocks whose activation has the form y = x + F(x)
    • A ResNet, too, has to store the intermediate activations in order to compute gradients
• Concept: Reversible Residual Network (Gomez et al. 2017)
  • If the activations are written as a pair, the inputs can be recomputed in the backward pass from the output of the Residual Block alone
Reversible Residual Network (2.2. Background - Reversible Residual Network)
• Rewrite Y = X + F(X) in pair form (X = (X1, X2))
  • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• Written this way, X1 and X2 can be restored from Y2 and Y1
  • X2 = Y2 - G(Y1), X1 = Y1 - F(X2)
• That is, the gradient can be computed from the output values alone -> storing intermediate results is unnecessary
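A minimal sketch of the pair form and its inverse follows. `F` and `G` here are arbitrary stand-in functions chosen for illustration (in a real network they would be learned sub-layers), and the vectors are plain Python lists.

```python
def F(x):
    """Stand-in for the first residual function (hypothetical choice)."""
    return [0.5 * v + 1.0 for v in x]

def G(x):
    """Stand-in for the second residual function (hypothetical choice)."""
    return [v * v for v in x]

def forward(x1, x2):
    """Y1 = X1 + F(X2), Y2 = X2 + G(Y1)."""
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2

def inverse(y1, y2):
    """X2 = Y2 - G(Y1), X1 = Y1 - F(X2): recover the inputs from the outputs."""
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2

x1, x2 = [1.0, 2.0], [3.0, -1.0]
y1, y2 = forward(x1, x2)
rx1, rx2 = inverse(y1, y2)
```

Note that `F` and `G` do not need to be invertible themselves; the additive coupling is what makes the block reversible.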
3. Method
Contribution - Revisited (3. Method)
• Solving the problems
  • Attention grows with the square of the sequence length, so attention alone does not fit in memory
    • Attention does not need to consider every word pair; attending only to related pairs is enough
    • For example, take the word "Napoleon":
      • related words such as "France", "emperor" and "Napoleon" would receive large weights
      • unrelated words such as "executed", "handshake", "red" and "young" would receive small weights
    • Applying attention only among similar words should therefore be sufficient
    • The question is how to apply attention only among similar words
    • Apply Locality-Sensitive Hashing to the queries and keys and look only at pairs with high similarity
  • If the model has N layers, the activations of every layer must be kept in memory
    • With a Reversible Layer structure, only the memory for a single layer is needed
  • Besides attention, the memory used by the Feed-Forward Network must also be counted
    • Splitting each Feed-Forward Network into chunks saves memory
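Scaled Dot-Product Attention - cont. (3. Method)
• Decomposition of Q
  • Shape of Q, K and V: (batch_size, length, hidden_dim)
  • Shape of the product QK^T: (batch_size, length, length) -> does not fit in memory
  • Splitting the Q of each batch element into (q_1, q_2, …, q_length) makes it fit in memory
  • This gives up parallelism, but memory use drops from O(L^2) to O(L)
$$\mathrm{Attention}(q_i, K, V) = \mathrm{softmax}\left(\frac{q_i K^T}{\sqrt{d_k}}\right)V$$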
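The decomposition can be illustrated with a small sketch that compares a reference version, which builds the whole L x L weight matrix, with a per-query version that only ever holds one row of L scores. Function names and toy inputs are assumptions for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_one(q, K, V):
    """Attention output for a single query q_i: one row of L scores is alive."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in K])
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

def attention_full(Q, K, V):
    """Reference version that materialises the whole L x L weight matrix first."""
    scale = math.sqrt(len(K[0]))
    weights = [softmax([dot(q, k) / scale for k in K]) for q in Q]  # L x L
    return [[sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
            for row in weights]

def attention_per_query(Q, K, V):
    """Same result, computed one query at a time."""
    return [attend_one(q, K, V) for q in Q]

# Toy inputs (hypothetical values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [0.5, -1.0], [0.0, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
full = attention_full(Q, K, V)
per_query = attention_per_query(Q, K, V)
```

Both versions return the same numbers; the difference is only in how much of the score matrix exists at any one time.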
Scaled Dot-Product Attention - cont. (3. Method)
• Apply the Q = K hypothesis (Shared-QK Transformer)
  • The variable describing the influence a word exerts on other words is the same as the variable describing the influence it receives from other words
  • For each word, the projection that produces Q and the projection that produces K share the same matrix
  • This may sound a little odd, but experiments show that it does not affect performance
  • With this approach Q and K can be treated as data in the same space
LSH Attention (3. Method)
• Because Query = Key, the attention sequence can be drawn as a single row
• LSH Hash Bucketing (queries with the same hash are marked together)
• Sorting by Bucketing
[Figure: q1 … q12 in sequence order, the same row marked by hash bucket, and the row sorted by bucket: q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8]
LSH Attention - cont. (3. Method)
• Sorting by Bucketing
• Bucket sizes are not uniform, so the sorted row is cut into chunks of a fixed size
• Each item attends to items with the same bucket, within its own chunk and the chunk immediately before it
[Figure: the sorted row q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8, split into fixed-size chunks with the attend ranges marked]
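LSH Attention - cont. (3. Method)
• Points to note
  • An ordinary Transformer attends to the token itself, but this structure does not
  • During Transformer decoding, future indices must not be visible (i > j)
  • A single hash bucket scheme can miss some pairs, so multiple hashes must be used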
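The bucketing, sorting, chunking and attend rules on the LSH Attention slides can be sketched as follows. The bucket ids are hypothetical, chosen so that the sorted order matches the one shown on the slides, and the chunk size of 4 is likewise an assumption. The sketch excludes the token itself, as noted above, and leaves out the causal mask and multi-round hashing.

```python
# Hypothetical bucket ids for q1..q12, chosen to reproduce the slide's sorted order.
bucket = {1: 0, 4: 0, 6: 0, 9: 0, 10: 0,
          2: 1, 11: 1,
          5: 2, 7: 2, 12: 2,
          3: 3, 8: 3}

# Sort positions by bucket id, keeping sequence order inside each bucket.
order = sorted(bucket, key=lambda i: (bucket[i], i))

CHUNK_SIZE = 4  # assumed fixed chunk length
chunks = [order[s:s + CHUNK_SIZE] for s in range(0, len(order), CHUNK_SIZE)]

def attend_set(i):
    """Positions that q_i attends to: same bucket, in its own or the previous chunk."""
    c = next(n for n, chunk in enumerate(chunks) if i in chunk)
    candidates = chunks[c] + (chunks[c - 1] if c > 0 else [])
    return sorted(j for j in candidates if bucket[j] == bucket[i] and j != i)

print(order)
print(chunks)
print(attend_set(10))
```

In this toy setup q10 sits in the second chunk but still reaches the rest of its bucket (q1, q4, q6, q9) through the look-back to the previous chunk.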
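Memory Complexity Problem (3. Method)
• Asymptotic complexity compared with existing methods (solved!) (n_r: number of hash repetitions, l: length, n_c: number of hash chunks)
[Table: asymptotic complexity comparison, not preserved in the transcript]
• Making the number of hash chunks very large reduces the complexity: the original paper uses 16384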
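Memory Complexity Problem - cont. (3. Method)
• Asymptotic complexity compared with existing methods (solved???)
• A problem still remains: the complexity of the FeedForward layer
• Memory terms shown on the slide: $b \cdot n_h \cdot l \cdot d_k \cdot n_l$ and $b \cdot n_h \cdot l \cdot d_{ff} \cdot n_l$
• In the original Transformer $d_{ff}$ was not a problem, but it becomes one because of $l$…
• For now, let us deal with $n_l$ first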
Reversible Transformer (3. Method)
• Reversible Transformer Revisited
  • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• Structure of the Transformer block
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
• With this structure, activations only need to be computed for one layer at a time
Chunked Reversible Transformer (3. Method)
• Applying chunked blocks
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
  • Y2 = [Y2(1); Y2(2); … ; Y2(c)] = [X2(1) + FeedForward(Y1(1)); … ]
• Done this way, the memory consumed because of $d_{ff}$ can also be reduced
[Figure: the bucket-sorted row q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8, processed chunk by chunk]
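Because the feed-forward layer acts on each position independently, it can be applied chunk by chunk with identical results. The sketch below illustrates this with plain Python lists; the two-layer ReLU network, the weights and the chunk size are assumptions made for illustration.

```python
def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def feed_forward(X, W1, W2):
    """Position-wise FFN: ReLU(X W1) W2. The hidden buffer is len(X) x d_ff."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(X, W1)]
    return matmul(hidden, W2)

def chunked_feed_forward(X, W1, W2, chunk_size):
    """Same FFN, but only chunk_size x d_ff hidden values exist at any time."""
    out = []
    for start in range(0, len(X), chunk_size):
        out.extend(feed_forward(X[start:start + chunk_size], W1, W2))
    return out

# Toy inputs (hypothetical values): 4 positions, model dim 2, d_ff = 3.
X = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0], [2.0, 0.0]]
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
W2 = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]

full = feed_forward(X, W1, W2)
chunked = chunked_feed_forward(X, W1, W2, chunk_size=2)
```

The trade-off is the same as for per-query attention: less parallel work per step in exchange for a smaller peak buffer.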
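Reformer Time Complexity (3. Method)
• Asymptotic time complexity of the Reformer
[Table: time complexity comparison, not preserved in the transcript]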
4. Experiments and Analysis
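Duplication Experiment (4. Experiments and Analysis)
• Setup: for a string w of length 511, generate a string with the pattern 0w0w
• Results for a 1-layer, 4-head, 256-dim model are as follows
[Table: accuracy by number of training and evaluation hashes, not preserved in the transcript]
• Even a model trained with 1 hash works well when tested with 8 hashes! (the number of hashes at inference is what matters)
[Figure: example input "0 91 7 48 0 91 7 48" (S, W1 … W8) and its first half "91 7 48" (S, W1 … W3)]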
Image64 & enwik8 (4. Experiments and Analysis)
• Setup: encode and then decode the data, and measure bits per dim
  • Verifying the Q = K hypothesis and the Reversible hypothesis -> training converges well
  • Checking the number of hashes -> with about 8 or 16 hashes, performance is similar to full attention
  • Checking performance against the number of layers -> beyond 6 layers the performance difference is small
• Setup: encode and then decode the data, and measure seconds per step
  • Speed by number of hashes -> the Reformer is not affected by sequence length
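Conclusion (4. Experiments and Analysis)
• Significance of the paper
  • The Reformer applies LSH to attention so that attention is computed among similar words
  • The experiments support the hypotheses made for LSH, and time can be reduced dramatically while performance is maintained
• What this suggests for us
  • Since it also works on the enwik8 data, it is expected to be usable in NLP as well
  • Even setting the Reformer aside, LSH is worth trying in the near future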
Thank you