Slide 3
Slide 3 text
The problem in GEC: not enough data
• The largest available corpus (Lang-8) contains 2M sentence pairs
• Increasing the amount of training data is important
• Recently, "pseudo-data generation" for GEC has become an active research topic (a minimal sketch follows this list)
• Adopted by most teams in the BEA-2019 Shared Task
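The slide does not spell out how pseudo data is generated, so the following is a minimal sketch of one common noising-based approach, stated as an assumption rather than the method of any specific system: corrupt clean monolingual sentences with random word-level edits (deletion, substitution, swap) and pair each noisy sentence (source) with its clean original (target) as synthetic GEC training data. The function names, edit operations, and the probability p=0.1 are illustrative choices.

import random

def noise_sentence(tokens, p=0.1, vocab=None, rng=random):
    """Corrupt a token list with random deletions, substitutions, and swaps."""
    out, i = [], 0
    while i < len(tokens):
        r = rng.random()
        if r < p / 3:                          # delete this token
            i += 1
            continue
        if r < 2 * p / 3 and vocab:            # substitute a random vocabulary word
            out.append(rng.choice(vocab))
        elif r < p and i + 1 < len(tokens):    # swap with the following token
            out.extend([tokens[i + 1], tokens[i]])
            i += 2
            continue
        else:                                  # keep the token unchanged
            out.append(tokens[i])
        i += 1
    return out

def make_pseudo_pairs(clean_sentences, vocab):
    """Pair each noised sentence (source) with its clean original (target)."""
    return [(" ".join(noise_sentence(s.split(), vocab=vocab)), s)
            for s in clean_sentences]

corpus = ["he went to the store yesterday", "she is reading a book"]
vocab = sorted({w for s in corpus for w in s.split()})
for src, tgt in make_pseudo_pairs(corpus, vocab):
    print("SRC:", src, "| TGT:", tgt)

Note the direction of the pairs: the noised sentence plays the source role and the clean sentence the target, so a sequence-to-sequence GEC model trained on these pairs learns to map errorful input back to well-formed text.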
[Embedded figure from Sennrich and Zhang (2019): Table 2 reports German→English IWSLT results for training corpus sizes of 100k and 3.2M words (full corpus), with mean and standard deviation over three training runs. Figure 2 shows the German→English learning curve, i.e. BLEU as a function of the amount of parallel training data (corpus size in English words), for phrase-based SMT and NMT (baseline vs. optimized).]
Figure from [Sennrich and Zhang 2019]