
Report on Participating in a Machine Translation Competition

Shun Kiyono
February 26, 2021


Presentation slides for the 6th Patent Information Symposium.



Transcript

  1. Self-introduction
• Career: 2013–2019 Tohoku University (bachelor's and master's degrees) • 2019– RIKEN Center for Advanced Intelligence Project (AIP) • 2020–: Tohoku University (doctoral program)
• Research so far: abstractive summarization [BlackboxNLP 2018], [PACLIC 2018] • fast & large-scale semi-supervised learning [AAAI 2019] • grammatical error correction [EMNLP 2019], [TASLP 2020] • machine translation?
• The core technologies of machine translation are used across many tasks ⇛ experience from other tasks carries over. In other words, all you really need is enthusiasm!

  2. The competition we entered: WMT
• WMT was originally a workshop on machine translation • It became its own conference a few years ago
• A variety of shared tasks run alongside it: news article translation, translation of sentences from unseen domains, unsupervised machine translation, machine translation of biomedical documents, machine translation of chat text
• The task we entered (news translation) is the oldest and the most fiercely competitive one.

  3. How the competition works (Japanese–English case)
① System building: prepare and preprocess the datasets (a Japanese–English parallel corpus, Japanese and English monolingual corpora, test data from previous years), then trial and error on a very large number of GPUs (100 or more) until the system is complete.
② System evaluation: translate the test data (Japanese sentences → translated sentences), then automatic evaluation (BLEU) and human evaluation.

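To make the automatic-evaluation step concrete, here is a minimal BLEU computation using the sacrebleu package; the hypothesis/reference sentences are toy examples invented for illustration, not the official WMT test data.

```python
# Minimal sketch of corpus-level BLEU evaluation with sacrebleu (toy data).
import sacrebleu

hypotheses = ["The cat sat on the mat .", "I have been extremely lucky ."]
references = [["The cat is sitting on the mat .", "I was extremely lucky ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # a single corpus-level score, comparable across systems
```
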
  4. The participating teams
• Us: the Tohoku-AIP-NTT team (Tohoku University, RIKEN AIP, NTT Communication Science Laboratories)
• Members: Ryuto Konno, Shun Kiyono, Takumi Ito, Makoto Morishita, Jun Suzuki — regulars near the top of shared tasks, with first-place finishes at WAT and WMT
• Other participants included Kyoto University, NICT, DeepMind, Facebook, the University of Edinburgh, NAVER, OPPO, Tencent, WeChat, DiDi, and more.

  5. Results: top ranks on the automatic metric (BLEU)
German→English: Tohoku-AIP-NTT 43.8, Huoshan_Translate 43.5, OPPO 43.2, UEDIN 42.3, Online-B 41.9
English→German: Tohoku-AIP-NTT 38.8, Tencent_Translation 38.6, OPPO 38.6, Huoshan_Translate 38.2, eTranslation 37.9
Japanese→English: NiuTrans 26.7, Tohoku-AIP-NTT 25.5, OPPO 24.8, NICT_Kyoto 22.8, eTranslation 22.2
English→Japanese: NiuTrans 28.4, OPPO 27.3, ENMT 25.9, Tohoku-AIP-NTT 25.8, NICT_Kyoto 23.9

  6. Human evaluation: first place (or tied for first) in every language pair we entered
(The slide reproduces the official WMT20 News Translation Task human-evaluation tables, Tables 12 and 13 of the findings paper: systems are ordered by average z-score, and systems within a cluster, determined by the Wilcoxon rank-sum test, are considered tied.)
(When there is no statistically significant difference in the human evaluation, systems are treated as tied for first place.)

  7. Also: strong results on the linguistic-phenomena test suite
A test suite probing how systems handle linguistic phenomena such as multi-word expressions, named entities, function words, and verb tense [Avramidis+2020]. Accuracies (%) of successful translations per category:

category                    items   Tohoku  Huoshan  UEdin  Onl-B  Onl-G  Onl-A  PROMT
Ambiguity                      81     82.7     77.8   72.8   79.0   84.0   76.5   64.2
Composition                    49     98.0     98.0   93.9   93.9   95.9   93.9   89.8
Coordination & ellipsis        78     89.7     91.0   89.7   91.0   85.9   87.2   87.2
False friends                  36     72.2     80.6   72.2   80.6   77.8   69.4   72.2
Function word                  72     86.1     80.6   86.1   90.3   90.3   83.3   88.9
LDD & interrogatives          174     89.1     86.2   85.1   83.3   86.8   77.6   81.0
MWE                            80     80.0     75.0   71.3   77.5   77.5   71.3   70.0
Named entity & terminology     89     92.1     84.3   87.6   82.0   82.0   88.8   87.6
Negation                       20    100.0    100.0  100.0  100.0  100.0   95.0  100.0
Non-verbal agreement           61     91.8     88.5   88.5   86.9   90.2   83.6   82.0
Punctuation                    60     96.7     98.3   98.3   71.7   61.7  100.0   98.3
Subordination                 180     90.6     88.3   91.1   91.1   92.2   88.9   90.0
Verb tense/aspect/mood       4447     84.6     85.3   80.3   75.9   79.6   77.5   75.1
Verb valency                   87     79.3     81.6   77.0   81.6   77.0   77.0   71.3
micro-average                5514     85.3     85.4   81.2   77.7   80.6   78.7   76.5
macro-average                5514     88.1     86.8   85.3   84.6   84.3   83.6   82.7
BLEU                                  43.8     43.5   42.3   41.9   41.4   40.4   39.6

(Table adapted from [Avramidis+2020].) Our system achieved the best macro-average score.

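To make the micro- vs. macro-average distinction in the last two rows concrete, this small sketch recomputes both averages from the rounded per-category accuracies of our system, so the results match the table only up to rounding.

```python
# Recompute micro- and macro-averages for the Tohoku column of the test-suite table.
items = [81, 49, 78, 36, 72, 174, 80, 89, 20, 61, 60, 180, 4447, 87]
acc   = [82.7, 98.0, 89.7, 72.2, 86.1, 89.1, 80.0, 92.1, 100.0, 91.8, 96.7, 90.6, 84.6, 79.3]

# Macro-average: every category counts equally, regardless of its size.
macro = sum(acc) / len(acc)                                   # ~88.1

# Micro-average: each category is weighted by its item count, so the huge
# "Verb tense/aspect/mood" category (4447 items) dominates the result.
micro = sum(a * n for a, n in zip(acc, items)) / sum(items)   # ~85.3

print(f"macro = {macro:.1f}, micro = {micro:.1f}")
```
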
  8. The common view of SMT vs. NMT
SMT (statistical machine translation): a pipeline of many separate modules — word alignment (GIZA++, MGIZA, FastAlign, Nile), language models (SRILM, KenLM, RNNLM), decoders (Moses, Joshua, Travatar, KyotoEBMT), tuning (MERT, MIRA, PRO), n-best translation candidates, and so on.
😩 A cumbersome system made of many modules
😩 Error propagation between modules hurts translation quality
NMT (neural machine translation): a single translation model trained end-to-end on a parallel corpus; at test time the source sentence goes in and the translation is decoded.
😀 End-to-end training with a single model
😀 No error propagation → higher translation quality
※ Diagram adapted from "Neural Network Machine Translation from Scratch" (https://www.slideshare.net/ToshiakiNakazawa/nlp2017-nmt-tutorial)

  9. The common view of SMT vs. NMT (continued)
(Same comparison as the previous slide, with one added point:) In many situations this view is correct, but with state-of-the-art NMT the picture is different.

  10. From the usual NMT to state-of-the-art NMT
(System diagram.) Components: monolingual corpus → back-translation model → pseudo-parallel corpus; parallel corpus; target-domain corpus; translation model; test data → N candidate outputs → final translation result. Reranking modules: left-to-right and right-to-left variants of both the forward (source→target) and reverse (target→source) translation models, a masked language model, and a unidirectional language model.

  11. State-of-the-art NMT is a collection of "unglamorous" techniques
• The translation model (Transformer) is very sensitive to hyperparameters, and knowledge about good hyperparameters advances day by day ⇛ hyperparameter tuning
• The parallel corpus alone is not enough data; we want to create more from monolingual corpora ⇛ data augmentation by back-translation
• The test data is in the news domain, so we want to adapt the model to news data ⇛ fine-tuning
• Two heads are better than one: we want to take other models' opinions into account when choosing the output ⇛ reranking
• Training runs vary in quality, so train several models independently and let them vote ⇛ ensembling
(Overlaid on the system diagram from the previous slide.)

  12. State-of-the-art NMT is a collection of "unglamorous" techniques: hyperparameter tuning
The translation model (Transformer) is very sensitive to hyperparameters, and knowledge about good hyperparameters advances day by day ⇛ hyperparameter tuning. (Highlighted on the system diagram.)

  13. Hyperparameter tuning
• Model: Transformer [Vaswani+2017] • The de facto standard model in recent years; not using it is not really an option • We widen the feed-forward layers and increase the depth from 6 to 9 layers, aiming to learn from more data
• Very large batch size [Ott+2018] • From the usual 4,000 tokens to 512,000 tokens • Faster convergence and better generalization • Empirically, training also becomes more stable • Realized via update delay (a.k.a. ghost batches, i.e., gradient accumulation)
• Large learning rate [Ott+2018] • Adam step size 0.0005 → 0.001 • Faster convergence • The combination with the very large batch size is crucial
• Checkpoint averaging • Save the model at regular intervals (e.g., every epoch or every 2k updates) • After training, average the saved checkpoints and use the averaged model for inference • Improves BLEU by roughly 0.1–0.2 points [Popel+2018] (see the sketch below)
• Pre-layer-normalization • Compute LayerNorm before the feed-forward and attention sub-layers • Reported to stabilize training of deep Transformers [Xiong+2020]
(Figure excerpts from [Vaswani+2017] and [Xiong+2020]: the Transformer architecture and Post-LN vs. Pre-LN layers.)

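As an illustration of the checkpoint-averaging step, here is a minimal PyTorch sketch; the checkpoint file names are hypothetical, and it assumes each file is a plain state_dict of the same architecture.

```python
# Minimal sketch of checkpoint averaging (hypothetical file names).
import torch

ckpt_paths = ["ckpt_10k.pt", "ckpt_12k.pt", "ckpt_14k.pt"]  # e.g. saved every 2k updates

avg_state = None
for path in ckpt_paths:
    state = torch.load(path, map_location="cpu")
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

for k in avg_state:
    avg_state[k] /= len(ckpt_paths)  # element-wise average of the parameters

torch.save(avg_state, "ckpt_averaged.pt")  # load this single averaged model for inference
```
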
  14. State-of-the-art NMT is a collection of "unglamorous" techniques: back-translation
The parallel corpus alone is not enough data; we want to use monolingual corpora to create more ⇛ data augmentation by back-translation. (Highlighted on the system diagram.)

  15. What is back-translation?
• Back-translation (BT) [Sennrich+2016] • A method for generating a pseudo-parallel corpus from a monolingual corpus • The de facto standard for data augmentation in NMT • A reverse translation model translates target-language sentences "back" into the source language
• When building an English→Japanese model: a Ja→En translation model turns a Japanese monolingual corpus into a translated English corpus, which together form an English–Japanese pseudo-parallel corpus.

  16. The back-translation process
① Train the back-translation model: train a Ja→En model on the Japanese–English parallel corpus.
② Generate pseudo data: translate the Japanese monolingual corpus with the Ja→En model; the translated English paired with the original Japanese forms an English–Japanese pseudo-parallel corpus (see the sketch below).
③ Train on the pseudo data: train the En→Ja translation model on the English–Japanese parallel corpus together with the pseudo-parallel corpus.

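Here is a minimal sketch of step ②, using a public pretrained Ja→En model from Hugging Face as a stand-in for the team's own back-translation model; the checkpoint name and the toy sentences are assumptions for illustration only.

```python
# Sketch of back-translating Japanese monolingual data into English to build a
# pseudo-parallel (En, Ja) corpus. The pretrained checkpoint is an assumed stand-in.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-ja-en"          # assumed public Ja->En checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

mono_ja = ["私は非常に幸運でした。", "機械翻訳の研究をしています。"]  # toy monolingual corpus

batch = tokenizer(mono_ja, return_tensors="pt", padding=True)
generated = model.generate(**batch, num_beams=5, max_length=64)
synthetic_en = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Pair the synthetic English (source side) with the original Japanese (target side).
pseudo_parallel = list(zip(synthetic_en, mono_ja))
```
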
  17. State-of-the-art NMT is a collection of "unglamorous" techniques: reranking
Two heads are better than one: we want to take other models' opinions into account when choosing the output ⇛ reranking. (Highlighted on the system diagram.)

  18. Reranking picks a good translation out of the candidates
• Without reranking: 1. generate N candidate sentences by beam search; 2. output the sentence with the highest score
• But the highest score does not mean the best translation • One of the other candidates may well be a better translation
• Reranking: a post-processing step for digging out the good translation
• Example: for "I have been extremely lucky." the translation model produces candidates such as とても幸運でした / 非常に運が良かった。/ 極めて幸運であった / 私は本当に幸運でした / 私は、非常に幸運だった with model scores 9.5, 8.2, 4.2, 2.9, 1.1; without reranking, the top-scoring candidate is output even though one of the others is the sentence we really want.

  19. Aiming for better translations with the collective knowledge of models
① Generate N candidate sentences with beam search from the translation model
② Score the N candidates with each module and sort by the total score
Scoring modules: a unidirectional language model, a bidirectional language model, the reverse (back-)translation model, a right-to-left translation model, and so on (see the sketch below).

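The following is a minimal sketch of n-best reranking as described above: each candidate receives a weighted sum of scores from several scoring modules, and the best-scoring candidate is output. The scorer functions and weights named in the usage comment are hypothetical placeholders.

```python
# Minimal sketch of n-best reranking with a weighted sum of module scores.
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]   # (source sentence, candidate) -> score

def rerank(source: str,
           candidates: List[str],
           scorers: List[Tuple[Scorer, float]]) -> str:
    """Return the candidate with the highest weighted total score."""
    def total(cand: str) -> float:
        return sum(weight * score(source, cand) for score, weight in scorers)
    return max(candidates, key=total)

# Usage (hypothetical scorers): combine e.g. the forward translation score,
# a reverse-translation score, and a language-model score.
# best = rerank(src, nbest, [(forward_score, 1.0), (reverse_score, 0.8), (lm_score, 0.5)])
```
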
  20. Experimental results (BLEU)

ID   Setting                          En→De  De→En  En→Ja  Ja→En
(a)  Baseline                          42.4   42.0   19.7   21.6
(b)  Baseline + back-translation       42.7   42.5   22.0   23.9
(c)  (b) + fine-tuning                 44.9   42.3   23.1   24.4
(d)  (c) × 4 (ensemble)                45.5   42.8   23.9   25.4
(e)  (d) + reranking                   45.7   43.8   24.9   26.2
 –   Previous year's winning system    44.9   42.8    –      –

• Combining the techniques yields steady performance gains
• A complex system is needed to reach state-of-the-art performance

  21. The system we built: everything grows
• The amount of training data grows • Usually: at most around ?M sentence pairs • This time: ?M for En–De
• The number of model parameters grows • Usually: 6 layers each for the encoder and the decoder • This time: 9 layers each
• The number of models grows • Ensembling and reranking require multiple models • 8 models per language direction → 32 models in total
⇒ A large increase in the resources needed to build the system

  22. What you need, in the end, is money
• A machine comparable to a DGX-2 costs about 60 USD/hour on AWS
• So building one model costs about 1,440 USD
• So building 32 models costs about 46,080 USD, i.e., a bit under 5 million yen (see the sketch below)
• Adding the trial and error along the way, the estimate comes to tens of millions of yen
• This is only a rough estimate for the case of renting GPUs on AWS • We used our organizations' own machines, so the actual figure is different
• So how much did we actually spend? That is confidential.

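A quick reproduction of the arithmetic above. The ~24 hours of DGX-2 time per model is inferred from 1,440 USD at 60 USD/hour (the slide only states the dollar figures), and the exchange rate is an assumed early-2021 value.

```python
# Back-of-the-envelope reproduction of the cost estimate on this slide.
usd_per_hour = 60
hours_per_model = 1440 / usd_per_hour            # -> 24.0 hours (inferred, not stated)
cost_per_model = usd_per_hour * hours_per_model  # -> 1440 USD
n_models = 32
total_usd = n_models * cost_per_model            # -> 46080 USD

usd_to_jpy = 105  # assumed rough exchange rate for early 2021
print(total_usd, total_usd * usd_to_jpy)         # ~4.8 million yen ("a bit under 5M yen")
```
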
  23. (Aside) There is always something bigger
• What it takes to build GPT-3 (figures quoted from https://lambdalabs.com/blog/demystifying-gpt-3/)
• GPU: V100 32GB (about 1 million yen per card) • GPU time: 355 years
• (Joke) "The GPT-3 our ancestors started training will finally be finished next week." "Are you sure that isn't a mix-up with the GPT-371 that came out the other day?"

  24. The experimental results, once more

ID   Setting                          En→De  De→En  En→Ja  Ja→En
(a)  Baseline                          42.4   42.0   19.7   21.6
(b)  Baseline + back-translation       42.7   42.5   22.0   23.9
(c)  (b) + fine-tuning                 44.9   42.3   23.1   24.4
(d)  (c) × 4 (ensemble)                45.5   42.8   23.9   25.4
(e)  (d) + reranking                   45.7   43.8   24.9   26.2
 –   Previous year's winning system    44.9   42.8    –      –

  25. Observation: does back-translation have no effect?
• Back-translation multiplies the training data by roughly 10×
• Yet for En–De the BLEU score barely improves, e.g. 42.4 → 42.7
• The performance gain does not seem worth the effort…
• Is it a case of "more data, same performance" rather than "more data, better performance"?
(Results table as above; compare rows (a) and (b).)

  26. The effect of back-translation cannot be measured with BLEU
• On the effect of back-translation in state-of-the-art NMT: [Edunov+2020] [Bogoychev+2019]
• BLEU: with back-translation ≒ without back-translation
• Human evaluation: with back-translation > without back-translation • Training on the pseudo-parallel corpus makes the output more fluent [Edunov+2020]
😀 Back-translation was not a waste after all
😩 But we now live in a world where human evaluation and BLEU do not correlate…

  27. Observation: the effect of reranking is thin
• Reranking requires several times as many models as a single system, and therefore several times as much money
• Yet the BLEU score does not improve as much as hoped, e.g. 45.5 → 45.7
• Once again, the performance gain does not seem worth the effort…
(Results table as above; compare rows (d) and (e).)

  28. Are we trying to solve an unsolvable problem?
• Perhaps a good translation cannot be identified from the source sentence and the candidates alone
• Even humans cannot tell which candidate is the "good translation" • There is simply not enough information to judge
• What information would help? ⇛ Context?
• Example: for one source sentence the translation system produces many near-equivalent candidates (all roughly "I do not want to affect the peace process"):
和平プロセスに影響を及ぼしたくはない / 和平プロセスに影響を与えたくありません。/ 和平プロセスに影響を及ぼして欲しくない / 和平プロセスに影響を与えたくないのです。/ 和平プロセスに影響が出ないようにしたい。/ 和平プロセスに影響を及ぼしたくありません / 和平プロセスに影響を与えることは望まない。/ 和平プロセスに影響を与えることを望まない。

  29. Summary
• Three things we learned from building a state-of-the-art NMT system:
• ① State-of-the-art NMT systems are "unglamorous" collections of techniques
• ② They require enormous resources
• ③ Beyond "unglamorous" NMT, a new world is starting to come into view
• Organizations in Japan, too, can take on the world, given the resources!

  30. References
• [Avramidis+2020]: Avramidis, E., Macketanz, V., Lommel, A., & Uszkoreit, H. (2018). Fine-grained Evaluation of Quality Estimation for Machine Translation Based on a Linguistically Motivated Test Suite. In Proceedings of the AMTA 2018 Workshop on Translation Quality Estimation and Automatic Post-Editing (pp. 243–248). Association for Machine Translation in the Americas.
• [Vaswani+2017]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017) (pp. 5998–6008).
• [Ott+2018]: Ott, M., Edunov, S., Grangier, D., & Auli, M. (2018). Scaling Neural Machine Translation. In Proceedings of the Third Conference on Machine Translation: Research Papers (pp. 1–9). Association for Computational Linguistics.
• [Xiong+2020]: Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T.-Y. (2020). On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020) (pp. 10524–10533). PMLR.
• [Sennrich+2016]: Sennrich, R., Haddow, B., & Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 86–96). Association for Computational Linguistics.
• [Edunov+2020]: Edunov, S., Ott, M., Ranzato, M., & Auli, M. (2020). On The Evaluation of Machine Translation Systems Trained With Back-Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2836–2846). Association for Computational Linguistics.
• [Bogoychev+2019]: Bogoychev, N., & Sennrich, R. (2019). Domain, Translationese and Noise in Synthetic Data for Neural Machine Translation. CoRR, abs/1911.03362.
• [Freitag+2020]: Freitag, M., Grangier, D., & Caswell, I. (2020). BLEU might be Guilty but References are not Innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 61–71). Association for Computational Linguistics.