Attention is all you need (2017)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I.
https://arxiv.org/pdf/1706.03762.pdf
2021/03/25

・Currently applied mainly to text only, but improvements are planned so that large inputs and outputs such as images, audio, and video can be handled efficiently
・Aims to remove the sequential nature of generation
・Evaluated on the WMT 2014 English-to-German translation task
・Outperforms the previous best models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4
・Achieved at a fraction of the training cost of competing models
・The Transformer builds its overall architecture from stacked self-attention and point-wise, fully connected layers in both the encoder and the decoder (see the sketches after this list)
・"This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2." (quoted from the paper; "this" refers to the number of operations needed to relate two distant positions, which grows with distance in convolutional models such as ConvS2S and ByteNet)
・The Transformer is a new, simple architecture model based solely on an attention mechanism, using no recurrence or convolution at all
・It can be parallelized, and the time required for training is greatly reduced
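
For reference, here is a minimal NumPy sketch (my own illustration, not code from the paper or these notes) of scaled dot-product attention and multi-head self-attention, following section 3.2 of the paper; d_model = 512 and h = 8 match the paper's base model, and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V   (paper, eq. 1)
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, h=8):
    # Project the inputs, split d_model into h heads, attend per head in parallel,
    # concatenate the heads, and apply the output projection Wo.
    n, d_model = x.shape
    d_k = d_model // h
    Q = (x @ Wq).reshape(n, h, d_k).transpose(1, 0, 2)   # (h, n, d_k)
    K = (x @ Wk).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (x @ Wv).reshape(n, h, d_k).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(Q, K, V)        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo

# Toy usage: 10 positions, d_model = 512, h = 8, random stand-ins for learned weights
rng = np.random.default_rng(0)
n, d_model = 10, 512
x = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo).shape)     # -> (10, 512)
```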
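
Likewise, a sketch of how one encoder layer combines multi-head self-attention with the position-wise (point-wise) feed-forward network, each wrapped in a residual connection and layer normalization, and how N = 6 such layers are stacked (section 3.1 of the paper). This is again my own illustration; it reuses multi_head_attention and rng from the sketch above, and layer_norm omits the learned gain and bias for brevity.

```python
import numpy as np
# Assumes multi_head_attention(...) and rng from the previous sketch are already defined.

def layer_norm(x, eps=1e-6):
    # Simplified layer normalization (no learned gain/bias)
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently (paper, eq. 2)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attn_params, ffn_params):
    # Sub-layer 1: multi-head self-attention, residual connection, layer norm
    x = layer_norm(x + multi_head_attention(x, *attn_params))
    # Sub-layer 2: position-wise feed-forward, residual connection, layer norm
    x = layer_norm(x + position_wise_ffn(x, *ffn_params))
    return x

def encoder(x, layers):
    # "Stacked self-attention": apply N identical layers in sequence
    for attn_params, ffn_params in layers:
        x = encoder_layer(x, attn_params, ffn_params)
    return x

# Toy usage with the base-model sizes d_model = 512, d_ff = 2048, N = 6
d_model, d_ff, N, n = 512, 2048, 6, 10
rand = lambda *shape: rng.normal(size=shape) * 0.02
x = rng.normal(size=(n, d_model))
layers = [
    ((rand(d_model, d_model), rand(d_model, d_model), rand(d_model, d_model), rand(d_model, d_model)),
     (rand(d_model, d_ff), np.zeros(d_ff), rand(d_ff, d_model), np.zeros(d_model)))
    for _ in range(N)
]
print(encoder(x, layers).shape)  # -> (10, 512)
```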