Reformer: The Efficient Transformer
Scatter Lab Inc.
February 06, 2020
Transcript
Reformer: The Efficient Transformer
구상준 (ML Research Scientist, Pingpong)
Contents
1. Overview
2. Background
  2.1. Locality-Sensitive Hashing
  2.2. Reversible Layer
3. Method
4. Experimental Results and Analysis
1. Overview
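Scaled Dot-Product Attention
• The scaled dot-product attention used in the Transformer
• For each word pair (A, B), how much B influences A can be described as follows
  • Query (Q): a variable derived from word A, the word being influenced
  • Key (K): a variable derived from word B, the word exerting the influence
  • Value (V): a weight representing the magnitude of the influence
• Attention is then computed as
  $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$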
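Reformer: Why Is It Needed?
• Why the original Transformer architecture exists: to solve the task of translating from language A to language B
• Input unit: a sequence (512 tokens; a paragraph or a document)
• A question that arises naturally: could it be applied to larger problems as well?
  • What if the input unit is a whole document? A whole book? A different kind of input?
  • In that case the input sequence length is on the order of K (thousands of) tokens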
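Reformer: Why Is It Needed?
• With an input sequence length of 64K, an embedding size of 1K, and a batch size of 8, the input alone is 512M values = 2GB
• Can't we train with 2GB? A Titan-X has 12GB -> in practice it does not work at all
• Why it does not work
  • Attention grows with the square of the sequence length = the attention alone does not fit in memory
  • If the model has N layers, the activations of every layer must be kept in memory
  • The memory used by the Feed-Forward Network must be counted too, not just the attention
• Can this be solved?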
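The arithmetic behind these figures can be checked with a short sketch. It assumes 32-bit floats (4 bytes per value), which the slide does not state, and it sizes a single attention matrix for one example and one head only.

```python
# Back-of-the-envelope memory estimate for the numbers on the slide.
# Assumption (not stated on the slide): 32-bit floats, i.e. 4 bytes per value.
BYTES_PER_FLOAT = 4
K = 1024
GIB = 1024 ** 3

seq_len = 64 * K     # input sequence length
embed_dim = 1 * K    # embedding size
batch_size = 8

# One activation tensor of shape (batch, length, embedding).
activation_values = batch_size * seq_len * embed_dim
activation_bytes = activation_values * BYTES_PER_FLOAT

# One attention matrix of shape (length, length): a single example, a single head.
attention_bytes = seq_len * seq_len * BYTES_PER_FLOAT

print(f"activations: {activation_values // K**2}M values, {activation_bytes // GIB} GiB")
print(f"one attention matrix: {attention_bytes // GIB} GiB")
```

Under these assumptions a single length-by-length attention matrix is already larger than the 12GB of a Titan-X, before counting layers, heads, or the batch.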
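Reformer Contribution
• Problems and solutions
• Attention grows with the square of the sequence length = the attention alone does not fit in memory
  • Attention does not need to consider every word pair! Only the relevant pairs need to be considered
• If the model has N layers, the activations of every layer must be kept in memory
  • With a Reversible Layer structure, only the memory for a single layer is needed
• The memory used by the Feed-Forward Network must be counted too, not just the attention
  • Running the Feed-Forward Network on one attention chunk at a time saves memory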
2. Background
Locality-Sensitive Hashing - Problem and Concept
• Problem: the Nearest Neighbor Search Problem
  • Given a data point Q, we want to find the closest point X in a set of data points (Nearest)
  • However, comparing against every point one by one is expensive (proportional to the size of the data)
• Concept: Locality-Sensitive Hashing
  • Rough idea: assign hash values (H(X1), H(X2), H(X3), …) to the data points (X1, X2, X3, …)
  • Points that are close to each other (X1, X2) should get matching hashes (H(X1) = H(X2))
  • Points that are far apart (X1, X3) should get different hashes (H(X1) ≠ H(X3))
  • If hash values can be assigned this way, an X with H(Q) = H(X) can be found quickly
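Locality-Sensitive Hashing Example
• An example of Locality-Sensitive Hashing in use: postal code lookup
• Postal codes are assigned in advance so that nearby addresses share them
  • (Data point: KD Tower unit 902, Seongdong-gu, Seoul; hash value: 04766)
  • (Data point: Ttukseom sports park, Seongdong-gu, Seoul; hash value: 04766)
  • (Data point: 99 Olympic-ro, Songpa-gu, Seoul; hash value: 05501)
• To find the nearest good restaurant to Ttukseom in Seongdong-gu,
  • first select the addresses that share Ttukseom's postal code (hash value: 04766)
  • then search for a nearby restaurant among those addresses only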
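The postal code lookup can be sketched as a dictionary of buckets. This is only an illustration: the addresses are the three data points from the slide (rendered in English), the postal code stands in for the hash H(X), and the helper name `candidates` is made up for this sketch.

```python
from collections import defaultdict

# (address, postal code) pairs from the slide; the postal code plays the role of H(X).
points = [
    ("KD Tower unit 902, Seongdong-gu, Seoul", "04766"),
    ("Ttukseom sports park, Seongdong-gu, Seoul", "04766"),
    ("99 Olympic-ro, Songpa-gu, Seoul", "05501"),
]

# Group the data points by hash value once, ahead of any query.
buckets = defaultdict(list)
for address, code in points:
    buckets[code].append(address)

def candidates(query_code):
    """Return only the points whose hash equals the query's hash."""
    return buckets.get(query_code, [])

# A query near Ttukseom only has to look inside bucket 04766.
near_ttukseom = candidates("04766")
print(near_ttukseom)
```

The exact distance comparison then runs over the small candidate list instead of the whole data set.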
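Locality-Sensitive Hashing Implementations
• How LSH is implemented (principle: points that are close to each other should also look similar after a linear transformation)
• Discrete LSH
  • Bit Sampling (1998): use the bit at a sampled index as the hash value
  • MinHash (1997): assign a random ordering to the words and check which word comes first
• Continuous LSH
  • Random Projection (2002): use the sign of the projection onto a plane as the hash value
  • Angular Distance (2015): project the vectors onto a sphere, apply a rotation, and use the angular region each vector falls into as the hash value (??)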
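Angular LSH
• Angular-distance-based LSH, worked through an example
• Suppose the data set contains the 2-D embedding vectors X1 = (3, 4) and X2 = (-12, 5)
• Projecting them onto a sphere of radius 1 gives X1' = (3/5, 4/5), X2' = (-12/13, 5/13)
• Rotate the circle several times and record which quadrant each point lands in: H(X1') = (1, 4, 2), H(X2') = (2, 2, 3)
[Figure: a circle divided into quadrants 1 to 4, shown for three rotations]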
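Angular LSH
• Angular-distance-based LSH, worked through an example
• Now suppose the query has the 2-D embedding vector Q = (4, 3). Projected onto the sphere, Q' = (4/5, 3/5)
• Applying the same rotations gives H(Q') = (1, 4, 2) = H(X1')
• So in this case X1 is found for the query Q
[Figure: a circle divided into quadrants 1 to 4, shown for three rotations]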
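The worked example can be reproduced with a small sketch. The slides do not say which rotations were used, so the three angles below are hypothetical values picked so that the hashes come out the same as on the slides; in practice the rotations would be random. Normalizing to the unit circle is implicit, because only the direction of each vector is used.

```python
import math

def quadrant_hash(vec, rotations_deg):
    """Hash a 2-D vector by the quadrant (1-4) it lands in after each rotation."""
    x, y = vec
    base = math.atan2(y, x)  # direction of the vector; its length does not matter
    codes = []
    for deg in rotations_deg:
        angle = (base + math.radians(deg)) % (2 * math.pi)
        codes.append(int(angle // (math.pi / 2)) + 1)
    return tuple(codes)

# Hypothetical rotation angles (degrees), chosen to match the slide's hash values.
ROTATIONS = (0, -60, 90)

x1, x2, q = (3, 4), (-12, 5), (4, 3)
h_x1 = quadrant_hash(x1, ROTATIONS)
h_x2 = quadrant_hash(x2, ROTATIONS)
h_q = quadrant_hash(q, ROTATIONS)
print(h_x1, h_x2, h_q)
```

Q and X1 point in similar directions, so they land in the same quadrant under all three rotations, while X2 differs.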
Reversible Residual Network - Problem and Concept
• Problem: memory consumption when training a Residual Network
• Residual Network (ResNet, He et al. 2015)
  • A network built from Residual Blocks, whose activations have the form y = x + F(x)
  • To compute gradients, a ResNet also has to store the intermediate activations
• Concept: Reversible Residual Network (Gomez et al. 2017)
  • If the activations are written as a pair, the inputs can be recomputed during the backward pass from the Residual Block's outputs alone
Reversible Residual Network
• Rewrite Y = X + F(X) in paired form (X = (X1, X2))
  • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• Written this way, X1 and X2 can be recovered from Y2 and Y1
  • X2 = Y2 - G(Y1), X1 = Y1 - F(X2)
• In other words, gradients can be computed from the output values alone -> storing intermediate results is unnecessary
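The inversion can be checked numerically with a minimal sketch. `F` and `G` here are arbitrary stand-in functions chosen for illustration, not the layers of any real model; the inversion works for any F and G because neither has to be invertible itself.

```python
import math

def F(v):
    """Stand-in for the first residual function."""
    return [math.tanh(a) for a in v]

def G(v):
    """Stand-in for the second residual function."""
    return [0.5 * a * a for a in v]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def forward(x1, x2):
    y1 = add(x1, F(x2))   # Y1 = X1 + F(X2)
    y2 = add(x2, G(y1))   # Y2 = X2 + G(Y1)
    return y1, y2

def inverse(y1, y2):
    x2 = sub(y2, G(y1))   # X2 = Y2 - G(Y1)
    x1 = sub(y1, F(x2))   # X1 = Y1 - F(X2)
    return x1, x2

x1, x2 = [1.0, -2.0, 0.5], [0.25, 3.0, -1.5]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(r1, r2)  # matches x1, x2 up to floating-point rounding
```

Because the inputs can be recomputed from the outputs, the backward pass does not need the stored activations of earlier layers.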
3. Method
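Contribution - Revisited.
• Problems and solutions
• Attention grows with the square of the sequence length = the attention alone does not fit in memory
  • Attention does not need to consider every word pair! Only the relevant pairs need to be considered
• If the model has N layers, the activations of every layer must be kept in memory
  • With a Reversible Layer structure, only the memory for a single layer is needed
• The memory used by the Feed-Forward Network must be counted too, not just the attention
  • Splitting each Feed-Forward Network computation into chunks saves memory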
Contribution - Revisited.
• Problems and solutions
• Attention grows with the square of the sequence length = the attention alone does not fit in memory
  • Attention does not need to consider every word pair! Only the relevant pairs need to be considered
• For example, suppose the text is a biography of Napoleon
  • Words such as France, emperor, Napoleon, … are likely to be strongly related
  • Words such as "carried out", handshake, red, and child are likely not
• Reflecting attention only for similar words should be sufficient
• The question is how to restrict attention to similar words only
  • Apply Locality-Sensitive Hashing to the Queries and Keys to pick out the highly similar pairs
Scaled Dot-Product Attention - cont.
• Decomposition of Q
• Shape of Q and K: (batch_size, length, hidden_dim)
• Shape of their product: (batch_size, length, length) -> does not fit in memory
• Splitting each batch's Q into (q1, q2, …, q_length) makes it fit in memory
• Parallelism is given up, but memory use drops from O(L^2) to O(L)
  $\mathrm{Attention}(q_i, K, V) = \mathrm{softmax}\left(\frac{q_i K^\top}{\sqrt{d_k}}\right)V$
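A pure-Python sketch of the per-query decomposition follows. The function names and the toy inputs are illustrative only; the point is that each step holds one row of `length` attention weights instead of the full length-by-length matrix.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_one(q, keys, values):
    """Attention output for a single query vector q_i."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)  # only `length` weights exist at a time
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

def attention_per_query(queries, keys, values):
    """Process the queries one at a time; the full score matrix is never built."""
    return [attend_one(q, keys, values) for q in queries]

# Toy inputs: each query lines up strongly with exactly one key.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = attention_per_query(queries, keys, values)
print(out)
```

Each output row is a weighted average of the value rows, dominated here by the value whose key matches the query.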
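Scaled Dot-Product Attention - cont.
• Apply the Q = K hypothesis (Shared-QK Transformer)
  • The variable describing the influence a word exerts on other words is the same as the variable describing the influence it receives from other words
  • For each word, the projection that produces Q and the projection that produces K share the same matrix
  • This may sound a little odd, but in actual experiments it does not affect performance
• With this approach, Q and K can be treated as data in the same space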
LSH Attention
• Because Query = Key, the attention sequence can be drawn as a single row
• LSH hash bucketing (Queries with the same hash are marked alike)
• Sorting by bucket
  Original order: q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12
  Sorted by bucket: q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8
LSH Attention - cont.
• Sorting by bucket
• Bucket sizes are uneven, so the sorted sequence is cut into chunks of a fixed size
• Each item attends to the items with the same bucket as itself, within its own chunk and the chunk immediately before it
  Sorted by bucket: q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8
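The sort, chunk, and attend steps can be sketched as follows. The bucket ids and the chunk size of 4 are assumptions made for this illustration (the slide's colour coding is not available); the ids were chosen so that sorting reproduces the order shown above. Causal masking and the attention computation itself are left out.

```python
# Hypothetical bucket ids for q1..q12, chosen so that sorting reproduces
# the order shown on the slide.
bucket_of = {1: 0, 2: 1, 3: 3, 4: 0, 5: 2, 6: 0,
             7: 2, 8: 3, 9: 0, 10: 0, 11: 1, 12: 2}
CHUNK_SIZE = 4  # assumed chunk size

# Sort positions by (bucket, position) so equal buckets become contiguous.
order = sorted(bucket_of, key=lambda i: (bucket_of[i], i))

# Cut the sorted sequence into equal-sized chunks.
chunks = [order[s:s + CHUNK_SIZE] for s in range(0, len(order), CHUNK_SIZE)]

def attend_targets(chunk_index, i):
    """Positions q_i may attend to: same bucket, own or previous chunk, not itself."""
    window = list(chunks[chunk_index])
    if chunk_index > 0:
        window += chunks[chunk_index - 1]
    return sorted(j for j in window if j != i and bucket_of[j] == bucket_of[i])

targets = {i: attend_targets(c, i) for c, chunk in enumerate(chunks) for i in chunk}
print(order)
print(targets)
```

Looking back one chunk is what lets q10 reach the rest of its bucket even though the chunk boundary splits that bucket in two.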
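LSH Attention - cont.
• Caveats
  • In a standard Transformer each position attends to itself, but in this structure it does not
  • During Transformer decoding, future indices must not be visible (i > j)
  • A single hash bucketing scheme can miss some similar pairs, so multiple hashes must be used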
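Why a single hashing round can miss similar pairs, and how several rounds help, can be seen in a small sketch. It reuses the quadrant-style hash from the angular LSH example; the directions (85 and 95 degrees) and the rotation angles are made-up values for illustration.

```python
def quadrant(angle_deg, rotation_deg):
    """Quadrant (1-4) of a direction after rotating it by rotation_deg."""
    return int(((angle_deg + rotation_deg) % 360) // 90) + 1

def share_bucket(a_deg, b_deg, rotations):
    """True if the two directions fall in the same bucket in at least one round."""
    return any(quadrant(a_deg, r) == quadrant(b_deg, r) for r in rotations)

# Two nearly identical directions that straddle the 90-degree boundary.
a, b = 85, 95

one_round = share_bucket(a, b, [0])       # split across quadrants 1 and 2
two_rounds = share_bucket(a, b, [0, 45])  # second round puts both in quadrant 2
print(one_round, two_rounds)
```

Taking the union over rounds lowers the chance that a close pair is separated by an unlucky bucket boundary, at the cost of more hashing work.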
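Memory Complexity Problem
• Asymptotic complexity compared with existing methods (solved!) (n_r: number of hash rounds, l: length, n_c: number of hash chunks)
• Making the number of hash chunks very large reduces the complexity: the original paper uses 16384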
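Memory Complexity Problem - cont.
• Asymptotic complexity compared with existing methods (solved???)
• Problems remain: the complexity of the FeedForward layer
  • $b \cdot n_h \cdot l \cdot d_k \cdot n_l$
• Worse still, $b \cdot n_h \cdot l \cdot d_{ff} \cdot n_l$
• In the original Transformer $d_{ff}$ is not a problem, but it becomes one because of $l$…
• For now, let's deal with $n_l$ first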
Reversible Transformer
• Reversible Transformer Revisited
  • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
• Transformer block structure
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
• With this structure, activations only need to be computed one layer at a time
Chunked Reversible Transformer
• Chunked block computation
  • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
  • Y2 = [Y2(1); Y2(2); …; Y2(c)] = [X2(1) + FeedForward(Y1(1)); …]
• Done this way, the memory taken up by $d_{ff}$ can be reduced as well
  Sorted by bucket: q1 q4 q6 q9 q10 q2 q11 q5 q7 q12 q3 q8
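Chunking works because the feed-forward layer acts on each position independently. The sketch below shows this with toy data: the weight matrices `W1` and `W2`, the sizes, and the function names are all made up for illustration. Pure Python already processes one row at a time, so the memory saving is only conceptual here; the check is that chunked and unchunked results agree.

```python
# Toy weights: d_model = 2, d_ff = 4 (hypothetical values for illustration).
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 1.0]]   # d_ff x d_model
W2 = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 0.5, -0.5]]       # d_model x d_ff

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def feed_forward(row):
    """Position-wise feed-forward: widen to d_ff, apply ReLU, project back."""
    hidden = [max(0.0, h) for h in matvec(W1, row)]  # the wide d_ff activation
    return matvec(W2, hidden)

def ff_full(x2, y1):
    """Y2 = X2 + FeedForward(Y1) over all positions at once."""
    return [[a + b for a, b in zip(x, feed_forward(y))] for x, y in zip(x2, y1)]

def ff_chunked(x2, y1, chunk_size):
    """Same result, computed chunk by chunk along the sequence axis."""
    out = []
    for s in range(0, len(y1), chunk_size):
        # Only this chunk's d_ff-wide activations need to exist at once.
        out.extend(ff_full(x2[s:s + chunk_size], y1[s:s + chunk_size]))
    return out

y1 = [[float(i), float(i) - 2.0] for i in range(6)]
x2 = [[0.1 * i, -0.1 * i] for i in range(6)]
full = ff_full(x2, y1)
chunked = ff_chunked(x2, y1, chunk_size=2)
print(full == chunked)
```

A smaller chunk size trades some parallelism for a smaller peak in the d_ff-wide intermediate activations.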
Reformer Time Complexity
• Asymptotic time complexity of Reformer
4. Experiments and Analysis
Duplication Experiment
• Setup: for a string w of length 511, generate a string following the pattern 0w0w
  Example: for w = 91 7 48, the sequence is 0 91 7 48 0 91 7 48
• Results for a 1-layer, 4-head, 256-dim model
• Even a model trained with 1 hash works well when tested with 8 hashes! (The number of hashes at inference matters)
Image64 & enwik8
• Setup: encode and then decode the data, and measure bits per dim
  • Testing the Q = K hypothesis and the Reversible hypothesis -> both converge well
  • Checking the number of hashes -> at around 8 or 16 hashes, performance is comparable to full attention
  • Checking performance against the number of layers -> beyond 6 layers the difference in performance is small
• Setup: encode and then decode the data, and measure seconds per step
  • Speed against the number of hashes -> Reformer is not affected by the sequence length
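Conclusion
• Summary of the paper
  • Reformer applies LSH to attention so that attention is computed only between similar words
  • The experiments support the hypotheses behind LSH attention, and computation time drops dramatically while performance is maintained
• Takeaways for us
  • Since it also works on the enwik8 data, it is likely to be usable in NLP as well
  • Even setting Reformer aside, LSH itself is worth trying in the near future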
Thank you