Adopted by most teams in the task.

ID  System                                             BLEU (100k words)   BLEU (3.2M words, full)
5   4 + reduce batch size (4k → 1k tokens)             12.40 ± 0.08        31.97
6   5 + lexical model                                  13.03 ± 0.49        31.80
7   5 + aggressive (word) dropout                      15.87 ± 0.09        33.60
8   7 + other hyperparameter tuning (learning rate,    16.57 ± 0.26        32.80
      model depth, label smoothing rate)
9   8 + lexical model                                  16.10 ± 0.29        33.30

Table 2: German→English IWSLT results for training corpus sizes of 100k words and 3.2M words (full corpus). Mean and standard deviation of three training runs reported.

Figure 2: German→English learning curve, showing BLEU as a function of the amount of parallel training data (corpus size in English words), for PBSMT and NMT. The plotted curves are optimized neural MT, phrase-based SMT, and the neural MT baseline, ranging from 16.6, 16, and 0 BLEU at the low end of the corpus-size axis to 32.8, 26.6, and 25.7 BLEU at the full corpus.

4.3 NMT Systems

We train neural systems with Nematus (Sennrich et al., 2017b). Our baseline mostly follows [...] batch size, model depth, regularization parameters, and learning rate. Detailed hyperparameters are reported in Appendix A.

5 Results

Table 2 shows the effect of adding different methods to the baseline NMT system, on the ultra-low data condition (100k words of training data) and the full IWSLT 14 training corpus (3.2M words). Our "mainstream improvements" add around [...] BLEU in both data conditions.

In the ultra-low data condition, reducing the BPE vocabulary size is very effective (+[...] BLEU). Reducing the batch size to 1000 tokens results in a BLEU gain of 0.3, and the lexical model yields an additional +0.6 BLEU. However, aggressive (word) dropout (+3.4 BLEU) and tuning other hyperparameters (+0.7 BLEU) have a stronger effect than the lexical model, and adding the lexical model on top of the tuned system brings no further gain (row 9: 16.10 vs. row 8: 16.57).

Figure from [Sennrich and Zhang 2019]
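Aggressive (word) dropout, the largest single gain in the ultra-low condition (row 7, +3.4 BLEU), drops entire word embeddings rather than individual units. Below is a minimal PyTorch-style sketch of that idea, assuming embeddings of shape (batch, seq_len, dim); the module name WordDropout and the rate of 0.2 are illustrative and not taken from the paper's Nematus configuration.

import torch
import torch.nn as nn


class WordDropout(nn.Module):
    """Randomly zero out whole word embeddings (entire rows), rescaling the
    surviving ones so the expected value is unchanged (inverted dropout)."""

    def __init__(self, p: float = 0.2):
        super().__init__()
        self.p = p

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, dim); one keep/drop decision per token position.
        if not self.training or self.p == 0.0:
            return emb
        keep = (torch.rand(emb.shape[:2], device=emb.device) >= self.p)
        return emb * keep.unsqueeze(-1).to(emb.dtype) / (1.0 - self.p)


if __name__ == "__main__":
    drop = WordDropout(p=0.2)
    drop.train()
    x = torch.randn(8, 15, 512)   # (batch, seq_len, embedding_dim)
    print(drop(x).shape)          # torch.Size([8, 15, 512])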
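Batch size in Table 2 is counted in tokens rather than sentences (reduced from 4k to 1k). A small sketch of batching by token count, here summing source and target lengths per pair; the helper name batch_by_tokens and the 1000-token cap are illustrative, not the exact Nematus behaviour.

from typing import Iterable, List, Sequence, Tuple


def batch_by_tokens(
    pairs: Sequence[Tuple[List[str], List[str]]],
    max_tokens: int = 1000,
) -> Iterable[List[Tuple[List[str], List[str]]]]:
    """Group sentence pairs into batches whose total token count
    (source + target) stays at or below max_tokens."""
    batch, n_tokens = [], 0
    for src, tgt in pairs:
        size = len(src) + len(tgt)
        if batch and n_tokens + size > max_tokens:
            yield batch
            batch, n_tokens = [], 0
        batch.append((src, tgt))
        n_tokens += size
    if batch:
        yield batch


if __name__ == "__main__":
    data = [("ein kleiner Test .".split(), "a small test .".split())] * 300
    sizes = [sum(len(s) + len(t) for s, t in b) for b in batch_by_tokens(data, 1000)]
    print(max(sizes) <= 1000)  # True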
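The "other hyperparameter tuning" in row 8 includes the label smoothing rate. A label-smoothed cross-entropy sketch in PyTorch; the smoothing value of 0.1 is illustrative, not the tuned rate from the paper.

import torch
import torch.nn.functional as F


def label_smoothed_nll(logits: torch.Tensor,
                       targets: torch.Tensor,
                       smoothing: float = 0.1) -> torch.Tensor:
    """Cross entropy against a smoothed target distribution:
    weight (1 - smoothing) on the gold token, the rest spread uniformly."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - smoothing) * nll + smoothing * uniform).mean()


if __name__ == "__main__":
    logits = torch.randn(4, 7, 2000)           # (batch, seq_len, vocab)
    targets = torch.randint(0, 2000, (4, 7))   # gold token ids
    print(label_smoothed_nll(logits, targets).item())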