
論文紹介 "Informer: Beyond Efficient Transformer for Long SequenceTime-Series Forecasting"

論文紹介 "Informer: Beyond Efficient Transformer for Long SequenceTime-Series Forecasting"

AAAI2021"Informer: Beyond Efficient Transformer for Long SequenceTime-Series Forecasting"の論文紹介用のスライドです。

taichi_murayama

August 14, 2021


Transcript

  1. Bibliographic Information
     Title: Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
     Authors: Haoyi Zhou (1), Shanghang Zhang (2), Jieqi Peng (1), Shuai Zhang (1), Jianxin Li (1), Hui Xiong (3), Wancai Zhang (4)
     (1: Beihang University, 2: UC Berkeley, 3: Rutgers University, 4: SEDD Company)
     Conference: AAAI 2021 (Thirty-Fifth AAAI Conference on Artificial Intelligence), an AAAI-21 Outstanding Paper!
  2. Purpose: Why I chose this paper
     • It is an Outstanding Paper at AAAI, a top international conference on artificial intelligence (an outstanding paper at an outstanding conference)
     • It tackles long-term sequence forecasting
     • To understand what the Transformer is
     • To understand where the Transformer (and other deep learning models) struggles in sequence forecasting tasks, and what solutions exist
     In one sentence: "For long-term sequence forecasting, the paper proposes a Transformer variant that cuts computation and memory cost while improving forecasting accuracy."
  3. Motivation
     The problem: Long Sequence Time-series Forecasting (LSTF)
     • Predict a long-range future sequence (e.g., many points or weeks ahead) from the past sequence
     • Abbreviated LSTF from its initials
     • Difficult for existing models (example: sequence forecasting with an LSTM)
       MSE score: accuracy drops for long-horizon predictions; Inference speed: inference slows down
  4. Motivation
     The problem: Long Sequence Time-series Forecasting (LSTF)
     • Task: a sequence forecasting problem that predicts a long-range future sequence from the past sequence
     • Input: $X^t = \{x_1^t, x_2^t, \dots, x_{L_x}^t \mid x_i^t \in \mathbb{R}^{d_x}\}$
     • Output: $Y^t = \{y_1^t, y_2^t, \dots, y_{L_y}^t \mid y_i^t \in \mathbb{R}^{d_y}\}$
     (A small shape example follows below.)
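To make the input/output definition concrete, here is a minimal NumPy sketch; the lengths and dimensions (96 past steps with 7 variables, 336 future steps with 1 variable) are illustrative values of my own, not settings taken from the paper.

```python
import numpy as np

# Illustrative sizes (not the paper's settings)
L_x, d_x = 96, 7     # input length and input dimension
L_y, d_y = 336, 1    # forecast length and output dimension

X_t = np.random.randn(L_x, d_x)   # past sequence X^t
Y_t = np.zeros((L_y, d_y))        # future sequence Y^t to be predicted

print(X_t.shape, Y_t.shape)       # (96, 7) (336, 1)
```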
  5. Motivation
     Issues in applying the Transformer to LSTF
     • Strengths of the Transformer
       • Achieves high accuracy not only in NLP and image processing but also in sequence forecasting [Wu, 2020], [Yu, 2020]
       • Can capture relationships and alignments within a long input sequence
     • Weaknesses of the Transformer
       • The computation and memory cost of long inputs and outputs is high, so it does not scale well
       • Inference on long sequences takes a long time
  6. Motivation
     Issues in applying the Transformer to LSTF (same points as the previous slide)
     → So what is the Transformer in the first place?
  7. What's Transformer?
     • Proposed by Google in "Attention is all you need" [Vaswani, 2017]
     • As the name suggests, attention is the main building block
     • Used not only in natural language processing (NLP) but, more recently, in many studies in other fields such as computer vision (CV); e.g., BERT, GPT, ViT
     • Before the Transformer, CNNs and RNNs were the mainstream
  8. What's Transformer? Inside the Transformer
     • Basically an encoder-decoder model
     • Positional encoding represents the position of each input as a vector
     • Both the encoder and the decoder are stacks of Transformer blocks built from multi-head attention and feed-forward layers
  9. What's Transformer? Transformer Block
     • Both the encoder and the decoder are composed of stacked Transformer blocks
     • A Transformer block consists of multi-head self-attention, residual connections, layer normalization, position-wise feed-forward layers, and dropout
  10. What's Transformer? Self-attention
      (Diagram) The latent representation of the input sequence (sequence length × dimension) is projected by $W_Q$, $W_K$, $W_V$ into Query $Q$, Key $K$, and Value $V$; the attention map $M = \mathrm{softmax}(QK^{\top}/\sqrt{d})$ (sequence length × sequence length) is applied to $V$, and $W_{out}$ produces the output.
  11. What's Transformer? Self-attention (continued)
      (Diagram) The attention map $M = \mathrm{softmax}(QK^{\top}/\sqrt{d})$ (sequence length × sequence length) weights the Value $V$, and $W_{out}$ gives the output. (A minimal code sketch follows below.)
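To make the diagram concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention as drawn above; the sizes and the projection matrices `W_q`, `W_k`, `W_v`, `W_out` are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, W_out):
    """X: (L, D) latent representation of the input sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # each (L, D)
    d = Q.shape[-1]
    M = softmax(Q @ K.T / np.sqrt(d))         # (L, L) attention map
    return (M @ V) @ W_out                    # (L, D) output

L, D = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(L, D))
W_q, W_k, W_v, W_out = (rng.normal(size=(D, D)) for _ in range(4))
print(self_attention(X, W_q, W_k, W_v, W_out).shape)  # (8, 16)
```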
  12. Motivation
      Issues in applying the Transformer to LSTF (recap)
      • Strengths: high accuracy in sequence forecasting as well as in NLP and vision [Wu, 2020], [Yu, 2020]; can capture relationships and alignments within a long input sequence
      • Weaknesses: high computation and memory cost for long inputs and outputs; slow inference on long sequences
      → The proposed method, Informer, addresses these weaknesses.
  13. Method: Computational complexity of the Transformer
      Complexity per layer:
      • Convolutional: $O(K \cdot D^2 \cdot L)$
      • Recurrent: $O(L \cdot D^2)$
      • Self-attention (Transformer): $O(L^2 \cdot D)$
      (K: filter length, D: dimensionality of the representation, L: input length, N: number of layers)
      • If the sequence is not long (D > L), computation time is not much of a problem, but with a long input sequence (L > D) computation time and memory become an issue
      • Moreover, stacking layers multiplies the cost by N, so the overall computational complexity is $O(N \times L^2 \cdot D)$
  14. Method: Complexity reduction in Informer
      Two complexity-reduction techniques are proposed:
      • ProbSparse attention: use only the high-importance part of the attention map, reducing $O(L^2 \cdot D)$ to $O(L \log L \cdot D)$
      • Self-attention distilling: each time the sequence leaves a self-attention layer, distill it so that its length is halved, reducing $O(N \cdot \dots)$ to $O((2 - \epsilon) \cdot \dots)$
      (A worked-numbers illustration of the first reduction follows below.)
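As a rough sense of scale for the first reduction (the length $L = 1024$ is an arbitrary illustrative value, not one of the paper's settings): full attention scales with $L^2 = 1024^2 \approx 1.05 \times 10^6$ score computations per head, while $L \log_2 L = 1024 \times 10 = 10{,}240$, i.e., roughly a hundredfold reduction in the quadratic term (the shared factor $D$ cancels in the comparison).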
  15. Method: ProbSparse
      Self-attention (recap): the latent representation of the input (sequence length L × dimension D) is projected by $W_Q$, $W_K$, $W_V$ into Query $Q$, Key $K$, and Value $V$; the attention map $M = \mathrm{softmax}(QK^{\top}/\sqrt{D})$ (L × L) is applied to $V$, and $W_{out}$ gives the output.
  16. Method: ProbSparse
      Same structure as full self-attention, except that only the top-u queries are used: the reduced query matrix $\bar{Q}$ gives the attention map $\mathrm{softmax}(\bar{Q}K^{\top}/\sqrt{D})$.
  17. Method: ProbSparse
      Only the top-u queries are used in $\mathrm{softmax}(\bar{Q}K^{\top}/\sqrt{D})$. Two questions:
      1. How are those top queries selected?
      2. How much does this improve computational efficiency?
  18. Method: ProbSparse: selecting the queries with high importance
      The importance of a query $q_i$ is measured by how far its attention distribution is from uniform, i.e., by the KL divergence between the uniform distribution $q = 1/L$ and the attention map for query $q_i$, $p(k_j \mid q_i) = \exp(q_i k_j^{\top}/\sqrt{D}) / \sum_l \exp(q_i k_l^{\top}/\sqrt{D}) = \exp(q_i k_j^{\top}/\sqrt{D}) / Z_i$:
      $KL(q \,\|\, p) = \frac{1}{L}\sum_{j=1}^{L}\left(\log\frac{1}{L} + \log Z_i - \frac{q_i k_j^{\top}}{\sqrt{D}}\right) = \log\frac{1}{L} + \log\sum_{j=1}^{L}\exp\!\left(\frac{q_i k_j^{\top}}{\sqrt{D}}\right) - \frac{1}{L}\sum_{j=1}^{L}\frac{q_i k_j^{\top}}{\sqrt{D}}$
      The last two terms are the Sparsity Measurement $M(q_i, K) = \ln\sum_{j=1}^{L_K} e^{q_i k_j^{\top}/\sqrt{D}} - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{D}}$, so the KL divergence is a constant term plus the Sparsity Measurement.
      ref: https://cookie-box.hatenablog.com/entry/2021/02/11/195
  19. Method: ProbSparse: the computational-efficiency issue
      Computing the Sparsity Measurement for every query is itself $O(L^2)$. Approximating it by sampling from the key vectors, in the max-mean form below, achieves $O(L \log L)$ (a code sketch of the selection step follows below):
      $M(q_i, K) = \ln\sum_{j=1}^{L_K} e^{q_i k_j^{\top}/\sqrt{D}} - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{D}} \;\rightarrow\; \bar{M}(q_i, K) = \max_j \frac{q_i k_j^{\top}}{\sqrt{D}} - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{D}}$
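A minimal NumPy sketch of this selection step, under my reading of the method: each query is scored with the max-mean measurement computed on a random sample of the keys, and only the top-u queries are kept. The sampling constant `c` and the choice `u = c * ln(L_Q)` are illustrative assumptions, not the paper's exact hyperparameters.

```python
import numpy as np

def probsparse_topu_queries(Q, K, c=5):
    """Return indices of the top-u queries ranked by the approximate
    sparsity measurement M_bar(q_i, K) = max_j s_ij - mean_j s_ij,
    where s_ij = q_i k_j^T / sqrt(D) is computed only on a random
    sample of the keys."""
    L_Q, D = Q.shape
    L_K = K.shape[0]
    rng = np.random.default_rng(0)

    n_sample = min(L_K, int(np.ceil(c * np.log(L_K))))   # sampled keys per query
    u = min(L_Q, int(np.ceil(c * np.log(L_Q))))          # number of queries to keep

    idx = rng.choice(L_K, size=n_sample, replace=False)
    S = Q @ K[idx].T / np.sqrt(D)              # (L_Q, n_sample) sampled scores
    M_bar = S.max(axis=1) - S.mean(axis=1)     # max-mean measurement per query
    return np.argsort(M_bar)[-u:]              # indices of the top-u queries

# Toy usage: the non-selected queries would fall back to a trivial output
# (e.g., the mean of V) instead of a full attention row.
rng = np.random.default_rng(1)
Q = rng.normal(size=(128, 32))
K = rng.normal(size=(128, 32))
print(probsparse_topu_queries(Q, K))
```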
  20. Method: ProbSparse: the computational-efficiency issue (why the approximation is justified)
      Since the KL divergence is non-negative and $KL(q \,\|\, p) = M(q_i, K) - \log L_K$, we get $\log L_K \le M(q_i, K)$.
      Also, bounding the log-sum-exp term from above,
      $M(q_i, K) = \ln\sum_{j=1}^{L_K} e^{q_i k_j^{\top}/\sqrt{D}} - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{D}} \le \ln\!\left(L_K \cdot \max_j e^{q_i k_j^{\top}/\sqrt{D}}\right) - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{D}} = \ln L_K + \max_j\frac{q_i k_j^{\top}}{\sqrt{D}} - \frac{1}{L_K}\sum_{j=1}^{L_K}\frac{q_i k_j^{\top}}{\sqrt{D}}$
      Hence $\log L_K \le M(q_i, K) \le \log L_K + \bar{M}(q_i, K)$, where $\bar{M}(q_i, K)$ is the approximation of the Sparsity Measurement.
  21. Method: Complexity reduction in Informer (recap)
      Two complexity-reduction techniques are proposed:
      • ProbSparse attention: use only the high-importance part of the attention map, reducing $O(L^2 \cdot D)$ to $O(L \log L \cdot D)$
      • Self-attention distilling: each time the sequence leaves a self-attention layer, distill it so that its length is halved, reducing $O(N \cdot \dots)$ to $O((2 - \epsilon) \cdot \dots)$
  22. Method: Self-attention Distilling
      When the output of the j-th layer is fed into the (j+1)-th layer, it passes through a Conv1d (kernel width 3) and max pooling, which halves the sequence length:
      $X_{j+1}^{t} = \mathrm{MaxPool}\!\left(\mathrm{ELU}\!\left(\mathrm{Conv1d}(X_j^{t})\right)\right)$
      This compresses the total cost from N layers' worth down to less than twice a single layer's: $O(N \cdot \dots) \rightarrow O((2 - \epsilon) \cdot \dots)$. (A minimal code sketch follows below.)
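A minimal PyTorch sketch of one such distilling step between attention layers; the channel count and padding are my own illustrative choices based on the formula above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Halve the sequence length between two self-attention layers:
    X_{j+1} = MaxPool(ELU(Conv1d(X_j)))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):            # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)        # Conv1d expects (batch, d_model, seq_len)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)     # (batch, ~seq_len/2, d_model)

x = torch.randn(4, 96, 512)
print(DistillingLayer(512)(x).shape)  # torch.Size([4, 48, 512])
```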
  23. Method: Decoder: outputs through one forward pass
      • Autoregressive inference [Chen, 2019]: outputs are generated one step at a time
      • Non-autoregressive inference (no step-by-step recursion): Informer adopts this, with the merit of much faster inference. (A minimal sketch of the decoder input follows below.)
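A minimal sketch of how such a one-forward decoder input can be assembled, under my reading of the idea: a known "start token" segment of the series is concatenated with zero placeholders for the horizon to be predicted, and a non-autoregressive decoder fills in the whole horizon at once. The segment lengths are illustrative, and the decoder itself is left abstract.

```python
import numpy as np

def build_decoder_input(history, label_len, pred_len):
    """Concatenate the last `label_len` observed steps (start-token segment)
    with `pred_len` zero placeholders; a non-autoregressive decoder then
    fills in all `pred_len` outputs in a single forward pass."""
    token = history[-label_len:]                          # (label_len, d)
    placeholder = np.zeros((pred_len, history.shape[1]))  # (pred_len, d)
    return np.concatenate([token, placeholder], axis=0)   # (label_len + pred_len, d)

history = np.random.randn(96, 7)        # observed past sequence
dec_in = build_decoder_input(history, label_len=48, pred_len=24)
print(dec_in.shape)                     # (72, 7): one forward pass yields all 24 predictions
```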
  24. Experiment: Dataset
      • ETT: electricity data from two regions in China (collected by the authors)
        • Two series, recorded at 1-hour and 15-minute granularity
        • Train/val/test: 12/4/4 months
      • ECL: electricity consumption of 321 clients [Li, 2019]
        • Recorded at 1-hour granularity
        • Train/val/test: 15/3/4 months
      • Weather: climate data from about 1,600 locations in the United States
        • Recorded at 1-hour granularity
        • Train/val/test: 28/10/10 months
  25. Experiment: Baseline and Evaluation Metric
      • Baselines: ARIMA, Prophet [Taylor, 2018], LSTMa [Bahdanau, 2014], LSTnet [Lai, 2018], DeepAR [Salinas, 2020], LogTrans [Li, 2019], Reformer [Kitaev, 2019]
      • Evaluation metrics (a tiny code example follows below):
        • Mean Squared Error (MSE): $\frac{1}{n}\sum_{i=1}^{n}(y - \hat{y})^2$
        • Mean Absolute Error (MAE): $\frac{1}{n}\sum_{i=1}^{n}|y - \hat{y}|$
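For completeness, the two metrics in code (a trivial NumPy sketch with made-up values):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)   # Mean Squared Error

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))  # Mean Absolute Error

y, y_hat = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.5, 3.0])
print(mse(y, y_hat), mae(y, y_hat))    # 0.1666..., 0.333...
```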
  26. Result: Univariate Time-series Forecasting
      • Informer achieves the best accuracy on most datasets and prediction horizons (especially for long-horizon forecasts)
      • Informer also outperforms the Informer variant without query sparsity, which demonstrates the effect of restricting where attention focuses
  27. References
      [Wu, 2020] Wu, Neo, et al. "Deep transformer models for time series forecasting: The influenza prevalence case." arXiv preprint arXiv:2001.08317 (2020).
      [Yu, 2020] Yu, Cunjun, Xiao Ma, Jiawei Ren, Haiyu Zhao, and Shuai Yi. "Spatio-temporal graph transformer networks for pedestrian trajectory prediction." European Conference on Computer Vision, pp. 507-523. Springer, 2020.
      [Vaswani, 2017] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017): 5998-6008.
      [Yu, 2017] Yu, Fisher, Vladlen Koltun, and Thomas Funkhouser. "Dilated residual networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
      [Chen, 2019] Chen, Nanxin, et al. "Listen and fill in the missing letters: Non-autoregressive transformer for speech recognition." arXiv preprint.
      [Li, 2019] Li, Shiyang, et al. "Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting." Advances in Neural Information Processing Systems 32 (2019): 5243-5253.
      [Taylor, 2018] Taylor, Sean J., and Benjamin Letham. "Forecasting at scale." The American Statistician 72.1 (2018): 37-45.
      [Bahdanau, 2014] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
      [Lai, 2018] Lai, Guokun, et al. "Modeling long- and short-term temporal patterns with deep neural networks." The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018.
      [Salinas, 2020] Salinas, David, et al. "DeepAR: Probabilistic forecasting with autoregressive recurrent networks." International Journal of Forecasting 36.3 (2020): 1181-1191.
      [Kitaev, 2019] Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. "Reformer: The efficient Transformer." In ICLR.