
Reformer: The Efficient Transformer

Scatter Lab Inc.
February 06, 2020


Transcript

  1. Reformer: The Efficient Transformer
     Contents:
     • 1. Overview
     • 2. Background: 2.1 Locality-Sensitive Hashing, 2.2 Reversible Layer
     • 3. Method
     • 4. Experiments and Analysis
  2. Reformer: Why Is It Needed? (1. Overview)
     • Why the original Transformer architecture exists: to solve the task of translating from language A to language B
     • Unit of input: a sequence (512 tokens; a paragraph up to a document)
  3. Scaled Dot-Product Attention (1. Overview)
     • The Scaled Dot-Product Attention used in the Transformer
     • For each word pair A, B, the weight that B gives to A can be described as follows:
     • Query (Q): a variable derived from the word A that receives the influence
     • Key (K): a variable derived from the word B that gives the influence
     • Value (V): a weight representing the magnitude of the influence
     • Attention is then computed as (see the sketch below):
       Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
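A minimal NumPy sketch of the formula above, with illustrative shapes (a single head over 4 tokens); this is for intuition, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (length, length)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, d_k = 8
out = attention(Q, K, V)                               # (4, 8)
```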
  4. Reformer: Why Is It Needed? (1. Overview)
     • Why the original Transformer architecture exists: to solve the task of translating from language A to language B
     • Unit of input: a sequence (512 tokens; a paragraph up to a document)
     • A question that naturally arises: couldn't it be applied to larger problems too?
     • What if the unit of input were a whole document? A whole book? A different kind of input altogether?
     • For images, input sequence lengths are already measured in the thousands (K)
  5. Reformer: Why Is It Needed? (1. Overview)
     • With an input sequence length of 64K, an embedding size of 1K, and a batch size of 8, the input is 512M floats = 2GB (arithmetic checked below)
     • Can't we train with 2GB? A Titan-X has 12GB
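A quick check of the slide's arithmetic, assuming float32 (4-byte) activations:

```python
seq_len, emb_dim, batch = 64 * 1024, 1024, 8
elements = seq_len * emb_dim * batch        # 512M elements
gibibytes = elements * 4 / 1024 ** 3        # float32 = 4 bytes each
print(elements // 2 ** 20, gibibytes)       # 512 (Mi elements), 2.0 (GiB)
```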
  6. Reformer: Why Is It Needed? (1. Overview)
     • With an input sequence length of 64K, an embedding size of 1K, and a batch size of 8, the input alone is 512M = 2GB
     • Can't we train with 2GB? A Titan-X has 12GB -> in fact, of course it doesn't work
     • Why it doesn't work:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
  7. Reformer: Why Is It Needed? (1. Overview)
     • With an input sequence length of 64K, an embedding size of 1K, and a batch size of 8, the input alone is 512M = 2GB
     • Can't we train with 2GB? A Titan-X has 12GB -> in fact, of course it doesn't work
     • Why it doesn't work:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • If the model has N layers, the activations of every one of those layers must be kept in memory
  8. Reformer: Why Is It Needed? (1. Overview)
     • With an input sequence length of 64K, an embedding size of 1K, and a batch size of 8, the input alone is 512M = 2GB
     • Can't we train with 2GB? A Titan-X has 12GB -> in fact, of course it doesn't work
     • Why it doesn't work:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • If the model has N layers, the activations of every one of those layers must be kept in memory
     • Beyond attention, the memory used by the feed-forward network must also be accounted for
  9. Reformer: Why Is It Needed? (1. Overview)
     • With an input sequence length of 64K, an embedding size of 1K, and a batch size of 8, the input alone is 512M = 2GB
     • Can't we train with 2GB? A Titan-X has 12GB -> in fact, of course it doesn't work
     • Why it doesn't work:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • If the model has N layers, the activations of every one of those layers must be kept in memory
     • Beyond attention, the memory used by the feed-forward network must also be accounted for
     • Can this be solved?
  10. Reformer's Contributions (1. Overview)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
  11. Reformer's Contributions (1. Overview)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
     • If the model has N layers, the activations of every layer must be kept in memory
     • With a Reversible Layer structure, memory is needed for only one layer
  12. Reformer's Contributions (1. Overview)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
     • If the model has N layers, the activations of every layer must be kept in memory
     • With a Reversible Layer structure, memory is needed for only one layer
     • Beyond attention, the memory used by the feed-forward network must also be accounted for
     • Running the feed-forward network over one attention chunk at a time saves memory
  13. Locality-Sensitive Hashing - Problem and Concept (2.1 Background - Locality-Sensitive Hashing)
     • Problem definition: the Nearest Neighbor Search problem
     • Given some data point Q, we want to find the closest X in the entire set of data points (Nearest)
     • But comparing Q point-wise against every point is costly (proportional to the size of the set)
  14. Locality-Sensitive Hashing - Problem and Concept (2.1 Background - Locality-Sensitive Hashing)
     • Problem definition: the Nearest Neighbor Search problem
     • Given some data point Q, we want to find the closest X in the entire set of data points (Nearest)
     • But comparing Q point-wise against every point is costly (proportional to the size of the set)
     • Concept: Locality-Sensitive Hashing
     • Rough idea: assign each data point (X1, X2, X3, …) a hash value (H(X1), H(X2), H(X3), …)
     • Nearby data points (X1, X2) should collide: H(X1) = H(X2)
     • Distant data points (X1, X3) should not: H(X1) ≠ H(X3)
     • If hash values can be assigned this way, an X with H(Q) = H(X) can be found quickly (see the toy demo below)
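A toy demonstration of this property, using a sign-of-random-projection hash (an assumed setup for illustration, not the paper's scheme): nearby points collide with high probability, distant points rarely do.

```python
import numpy as np

rng = np.random.default_rng(0)
planes = rng.normal(size=(8, 2))          # 8 random hyperplanes in 2-D

def H(x):
    # Hash = the sign pattern of x against each random hyperplane.
    return tuple((planes @ x > 0).astype(int))

x1 = np.array([3.0, 4.0])
x2 = np.array([3.1, 3.9])                 # close to x1
x3 = np.array([-12.0, 5.0])               # far from x1
print(H(x1) == H(x2))                     # very likely True
print(H(x1) == H(x3))                     # almost surely False
```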
  15. Locality-Sensitive Hashing Example (2.1 Background - Locality-Sensitive Hashing)
     • An everyday use of Locality-Sensitive Hashing: postal-code lookup
     • Postal codes are assigned so that nearby areas share them
     • (data point: KD Tower #902, Seongdong-gu, Seoul; hash value: 04766)
     • (data point: Ttukseom Sports Park, Seongdong-gu, Seoul; hash value: 04766)
     • (data point: 99 Olympic-ro, Songpa-gu, Seoul; hash value: 05501)
     • To find the best restaurant nearest Ttukseom Station in Seongdong-gu:
     • Gather the addresses sharing Ttukseom Station's postal code (hash value: 04766)
     • Then search for the nearest restaurant among just those addresses
  16. How Locality-Sensitive Hashing Is Implemented (2.1 Background - Locality-Sensitive Hashing)
     • Implementing LSH (principle: nearby items should also come out similar under an arbitrary linear transform)
     • Discrete LSH
     • Bit Sampling (1998): use arbitrary bit indices as the hash value
     • MinHash (1997): assign the words a random order and check which word comes first
     • Continuous LSH
     • Random Projection (2002): use the set of signs of projections onto arbitrary hyperplanes as the hash value
     • Angular Distance (2015): project vectors onto a sphere, apply arbitrary rotations, and use the angular sector they land in as the hash value
  17. Angular LSH (2.1 Background - Locality-Sensitive Hashing)
     • Angular-distance LSH, worked through an example
     • Suppose the dataset holds 2-D embedding vectors X1 = (3, 4) and X2 = (-12, 5)
     • Projecting onto the circle of radius 1 gives X1' = (3/5, 4/5) and X2' = (-12/13, 5/13)
     • Rotate the circle and record which quadrant each point falls in: H(X1') = (1, 4, 2), H(X2') = (2, 2, 3)
     [Figure: the unit circle with quadrants labeled 1-4, shown under three rotations]
  18. Angular LSH (2.1 Background - Locality-Sensitive Hashing)
     • Now suppose a query with 2-D embedding vector Q = (4, 3); projecting gives Q' = (4/5, 3/5)
     • Rotating it in exactly the same way gives H(Q') = (1, 4, 2) = H(X1')
     • So in this case X1 can be found for Q (see the sketch below)
     [Figure: the same quadrant diagrams with Q' added]
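A sketch of the slide's worked example, assuming the rotation angles are drawn at random (the slide's specific hash values depend on which rotations are chosen): project to the unit circle, rotate a few times, record the quadrant each time.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=3)     # 3 random rotations

def quadrant(v):
    # Quadrants numbered 1-4 counter-clockwise from the (+, +) region.
    a = np.arctan2(v[1], v[0]) % (2 * np.pi)
    return int(a // (np.pi / 2)) + 1

def H(x):
    x = x / np.linalg.norm(x)                  # project onto the unit circle
    R = lambda t: np.array([[np.cos(t), -np.sin(t)],
                            [np.sin(t),  np.cos(t)]])
    return tuple(quadrant(R(t) @ x) for t in angles)

X1, X2, Q = np.array([3., 4.]), np.array([-12., 5.]), np.array([4., 3.])
# Q is close to X1 and far from X2, so a collision with X1 is far more likely.
print(H(Q), H(X1), H(X2))
```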
  19. Reversible Residual Network - Problem and Concept (2.2 Background - Reversible Residual Network)
     • Problem definition: the memory issue when training a Residual Network
     • Residual Network (ResNet, He et al. 2015)
     • A network made of residual blocks whose activations take the form y = x + F(x)
     • ResNet, too, must store its intermediate activations for the mechanical computation of gradients
     • Concept: Reversible Residual Network (Gomez et al. 2017)
     • If the activations are written as pairs, the inputs can be computed in the backward pass from the residual block's outputs alone
  20. Reversible Residual Network (2.2 Background - Reversible Residual Network)
     • Write Y = X + F(X) in paired form, with X = (X1, X2):
     • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
     • Written this way, X1 and X2 can be recovered from Y1 and Y2:
     • X2 = Y2 - G(Y1), X1 = Y1 - F(X2)
     • That is, gradients can be computed from the outputs alone -> storing intermediate results is unnecessary (verified in the sketch below)
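A minimal sketch of the reversible block with two toy linear functions F and G (any deterministic functions work); it checks that the inputs are recoverable from the outputs alone.

```python
import numpy as np

def rev_forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2, F, G):
    # Undo the block using only its outputs: no stored activations needed.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
F, G = (lambda v: v @ A), (lambda v: v @ B)
x1, x2 = rng.normal(size=4), rng.normal(size=4)
y1, y2 = rev_forward(x1, x2, F, G)
r1, r2 = rev_inverse(y1, y2, F, G)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```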
  21. Contribution - Revisited (3. Method)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
     • If the model has N layers, the activations of every layer must be kept in memory
     • With a Reversible Layer structure, memory is needed for only one layer
     • Beyond attention, the memory used by the feed-forward network must also be accounted for
     • Splitting each feed-forward network into chunks saves memory
  22. Contribution - Revisited (3. Method)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
     • For example, suppose we have a biography of Napoleon:
     • Words like France, emperor, Napoleon, and general would carry large weights,
     • while words like carried out, handshake, red, and young would carry small weights
  23. Contribution - Revisited (3. Method)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
     • Reflecting attention only between similar words should be sufficient
     • The problem: how can attention be restricted to similar words only?
     • Apply Locality-Sensitive Hashing to the Queries and Keys and keep the high-similarity pairs
  24. Scaled Dot-Product Attention (3. Method)
     • The Scaled Dot-Product Attention used in the Transformer
     • For each word pair A, B, the weight that B gives to A can be described as follows:
     • Query (Q): a variable derived from the word A that receives the influence
     • Key (K): a variable derived from the word B that gives the influence
     • Value (V): a weight representing the magnitude of the influence
     • Attention is then computed as:
       Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
  25. Scaled Dot-Product Attention - cont. (3. Method)
     • Decomposition of Q
     • Q, K, V each have shape (batch_size, length, hidden_dim)
     • The product QK^T has shape (batch_size, length, length) -> doesn't fit in memory
     • If each batch's Q is split into (q_1, q_2, …, q_length), it does fit in memory (sketched below)
     • Parallelism is given up, but memory use drops from O(L^2) to O(L):
       Attention(q_i, K, V) = softmax(q_i K^T / sqrt(d_k)) V
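A sketch of the row-wise decomposition, assuming a single batch element: only one (length,)-sized score row is alive at a time instead of the full (length, length) matrix.

```python
import numpy as np

def attention_rowwise(Q, K, V):
    d_k = Q.shape[-1]
    out = np.empty_like(Q)
    for i, q_i in enumerate(Q):            # one query at a time: O(L) memory
        scores = q_i @ K.T / np.sqrt(d_k)  # (length,) not (length, length)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ V         # softmax row times V
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = attention_rowwise(Q, K, V)           # equals the full-matrix result
```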
  26. Scaled Dot-Product Attention - cont. (3. Method)
     • Adopting the Q = K hypothesis (Shared-QK Transformer)
     • The variable for the influence a word gives to other words is the same as the variable for the influence it receives from them
     • For each word, the projection producing Q and the projection producing K share the same matrix
     • It may sound a little strange, but in actual experiments it does not affect performance
  27. Scaled Dot-Product Attention - cont. (3. Method)
     • Adopting the Q = K hypothesis (Shared-QK Transformer)
     • The variable for the influence a word gives to other words is the same as the variable for the influence it receives from them
     • For each word, the projection producing Q and the projection producing K share the same matrix
     • It may sound a little strange, but in actual experiments it does not affect performance
     • Through this, Q and K can be treated as data living in the same space (sketched below)
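A minimal sketch of the shared projection, with illustrative dimensions: a single matrix W_qk plays both roles, so the hashing step can treat queries and keys as points in one space.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, length = 16, 8, 10
W_qk = rng.normal(size=(d_model, d_k))   # one shared Q/K projection
W_v  = rng.normal(size=(d_model, d_k))   # V keeps its own projection

X  = rng.normal(size=(length, d_model))  # one token embedding per row
QK = X @ W_qk                            # queries and keys coincide
V  = X @ W_v
# (The paper additionally normalizes the keys; that detail is omitted here.)
```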
  28. LSH Attention (3. Method)
     • Since Query = Key, the attention sequence can be laid out as a single line
     • LSH hash bucketing (queries with the same hash are marked together)
     • Sorting by bucket
     [Figure: q1-q12 shown in sequence order, then colored by bucket, then sorted by bucket]
  29. LSH Attention - cont. (3. Method)
     • Sorting by bucket
     • Bucket sizes are uneven, so the sorted sequence is cut into chunks of a fixed size
     • Each position attends to the positions sharing its bucket within its own chunk and the chunk immediately before it (sketched below)
     [Figure: sorted q1-q12 split into fixed-size chunks, with same-bucket attention links]
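A toy sketch of the sort-and-chunk step, with made-up bucket ids standing in for real LSH hashes; each position gathers same-bucket candidates from its own chunk and the previous one.

```python
import numpy as np

buckets = np.array([0, 2, 1, 0, 2, 0, 2, 1, 0, 0, 1, 2])  # toy per-position hash
order = np.argsort(buckets, kind="stable")                 # sort positions by bucket
chunks = order.reshape(-1, 4)                              # fixed chunk size of 4

for c, chunk in enumerate(chunks):
    prev = chunks[c - 1] if c > 0 else np.array([], dtype=int)
    candidates = np.concatenate([prev, chunk])
    for i in chunk:
        # Attend only to same-bucket positions in this chunk and the previous one.
        attendable = candidates[buckets[candidates] == buckets[i]]
        print(i, attendable)
# (The real scheme additionally excludes each position itself and, when
# decoding, masks positions that come later in the original order.)
```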
  30. LSH Attention - cont. (3. Method)
     • Caveats:
     • In an ordinary Transformer a token attends to itself, but in this structure it does not
     • When decoding with a Transformer, future indices must not be visible (attend only where i > j)
     • Some pairs fail to be grouped under a single hash-bucket scheme, so multiple hashes (multi-hash) must be used
  31. Memory Complexity Problem (3. Method)
     • Asymptotic memory comparison against the existing method (solved!) (n_r: number of hash repetitions, l: length, n_c: number of hash chunks)
     • Making the number of hash chunks very large shrinks the memory cost: 16384 chunks in the original paper
  32. Memory Complexity Problem - cont. (3. Method)
     • Asymptotic memory comparison against the existing method (solved???)
     • A problem remains: the memory cost due to the feed-forward layers
     • The attention activations cost b · n_h · l · d_k · n_l, while the feed-forward activations cost b · n_h · l · d_ff · n_l
     • Worse, in the original Transformer d_ff was not the problem, but with large l it now is
     • Let's start by dealing with the n_l factor
  33. Contribution - Revisited (3. Method)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
     • If the model has N layers, the activations of every layer must be kept in memory
     • With a Reversible Layer structure, memory is needed for only one layer
     • Beyond attention, the memory used by the feed-forward network must also be accounted for
     • Splitting each feed-forward network into chunks saves memory
  34. Reversible Transformer (3. Method)
     • Reversible layers revisited:
     • Y1 = X1 + F(X2), Y2 = X2 + G(Y1)
     • Structure of the Transformer block:
     • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
     • Under this structure, activations only ever need to be computed (and held) one layer at a time
  35. Contribution - Revisited (3. Method)
     • Solving the problems:
     • Attention grows with the square of the sequence length = attention alone doesn't fit in memory
     • Attention doesn't need to consider every word pair! Attending only to the relevant pairs is enough
     • If the model has N layers, the activations of every layer must be kept in memory
     • With a Reversible Layer structure, memory is needed for only one layer
     • Beyond attention, the memory used by the feed-forward network must also be accounted for
     • Splitting each feed-forward network into chunks saves memory
  36. Chunked Reversible Transformer (3. Method)
     • Computation of the chunked block (sketched below):
     • Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
     • Y2 = [Y2(1); Y2(2); …; Y2(c)] = [X2(1) + FeedForward(Y1(1)); …]
     • Doing it this way also reduces the memory created by the d_ff-wide activations
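A sketch of the chunked feed-forward computation, assuming a simple ReLU FFN; only one chunk's d_ff-wide hidden activation exists at any moment.

```python
import numpy as np

def feed_forward(x, W1, W2):
    return np.maximum(x @ W1, 0) @ W2              # position-wise ReLU FFN

def chunked_ffn(Y1, X2, W1, W2, chunk=4):
    # Y2 = [X2(1) + FFN(Y1(1)); X2(2) + FFN(Y1(2)); ...], chunk by chunk.
    Y2 = np.empty_like(X2)
    for s in range(0, len(Y1), chunk):
        Y2[s:s + chunk] = X2[s:s + chunk] + feed_forward(Y1[s:s + chunk], W1, W2)
    return Y2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))  # d_ff = 32
Y1, X2 = rng.normal(size=(12, 8)), rng.normal(size=(12, 8))
Y2 = chunked_ffn(Y1, X2, W1, W2)   # equals X2 + feed_forward(Y1, W1, W2)
```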
  37. Duplication Experiment (4. Experiments and Analysis)
     • Setup: for strings w of length 511, generate strings with the pattern 0w0w (data sketched below)
     • Results for a 1-layer, 4-head, 256-dim model are as follows
     • Even a model trained with 1 hash does well when tested with 8 hashes! (the number of hashes at inference is what matters)
     [Figure: an example sequence such as 0 91 7 48 0 91 7 48, whose second half duplicates the first]
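A sketch of how such training examples could be generated (the symbol vocabulary here is an assumption; the setup fixes |w| = 511 with 0 as the separator):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.integers(1, 128, size=511)           # random symbols; 0 is reserved
example = np.concatenate([[0], w, [0], w])   # the 0w0w pattern, length 1024
```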
  38. Image64 & enwik8 (4. Experiments and Analysis)
     • Setup: encode -> decode a large amount of data and measure bits-per-dim
     • Verifying the Q=K hypothesis and the Reversible hypothesis -> both converge well
  39. Image64 & enwik8 (4. Experiments and Analysis)
     • Setup: encode -> decode a large amount of data and measure bits-per-dim
     • Checking the number of hashes -> at around 8 or 16 hashes, performance is comparable to full attention
  40. Image64 & enwik8 (4. Experiments and Analysis)
     • Setup: encode -> decode a large amount of data and measure bits-per-dim
     • Checking performance against the number of layers -> beyond 6 layers the performance differences are not large
  41. Image64 & enwik8 (4. Experiments and Analysis)
     • Setup: encode -> decode a large amount of data and measure seconds per step
     • Speed against the number of hashes -> the Reformer's speed is barely affected by sequence length
  42. Conclusion (4. Experiments and Analysis)
     • The paper's claims:
     • The Reformer applies LSH to attention so that attention runs only between similar words
     • The experiments confirm the hypotheses behind LSH, and running time drops dramatically while performance is maintained
     • What this suggests for us:
     • Seeing that it works even on enwik8 data, it looks usable for NLP as well
     • Even setting the Reformer aside, LSH itself is worth trying in the near future