Abstract & Introduction
n Neural machine translation (NMT) has recently gained widespread attention because of its high
translation accuracy.
n It shows poor performance in the translation of long sentences, which is a major issue in low-
resource languages.
n It is assumed that this issue is caused by insufficient number of long sentences in the training data.
Ø We proposes a simple data augmentation method to handle long sentences.
Ø We combined this method with back-translation.
Sentence Concatenation Approach to Data Augmentation for Neural Machine Translation
Seiichiro Kondo, Kengo Hotate, Tosho Hirasawa, Masahiro Kaneko, and Mamoru Komachi
Tokyo Metropolitan University
[email protected]
Method
Figure 1. Over view
new train data
…
He lived a hard life. I can't swim at all.
I can't swim at all. She is my friend.
𝑥!
𝑥"
𝑥!#"
𝑥!
彼は⾟い⼈⽣を送った。 私は全く泳げない。
私は全く泳げない。 彼⼥は私の親友だ。
𝑦!#"
𝑦!
𝑦!
𝑦"
𝑥": He lived a hard life.
𝑥$: I can't swim at all.
𝑥! 𝑦!
…
original data
𝑦$: 私は全く泳げない。
𝑥%: She is my friend. 𝑦%: 彼⼥は私の親友だ。
𝑦": 彼は⾟い⼈⽣を送った。
𝑥′"
𝑥′$
𝑥′" 𝑦"
𝑥′! 𝑦!
𝑥′% 𝑦%
𝑥′$ 𝑦$
…
𝑥′!#"
𝑥′! 𝑦!#"
𝑦!
𝑥′!
𝑥′" 𝑦!
𝑦"
…
𝑥′$
𝑥′% 𝑦$
𝑦%
𝑦"
𝑦$
concatenation
back translation
concatenation
Experiments Setting
n We used Transformer model (Vaswani et al., 2017).
n Dataset
*400,000 sentences were randomly extracted
from 2,000,000 sentences.
n We also conducted oversampling as baseline.
n A single sentence is fed as the input during the test process.
n We computed the average of the BLEU scores of three runs
with different seeds.
n Human evaluation was also conducted.
corpus train sent. dev sent. test sent.
ASPEC 400,000* 1,790 1,812
The number of samples
Sentence length in train data
vanilla: original data OS: over sampling
concat: concatenation (proposed method) BT: back translation
Results
n The BLEU score of “vanilla + concat” is more stable when
applied for translation with sentence lengths of longer than 51
words.
n In particular, the score of the sentence lengths of longer than 41
is significantly improved, which indicates that the proposed
method is more effective for long sentence translation.
n It is conceivable that back-translation and concatenation has
independent factors that improve the accuracy of the translation.
n We observed that the outputof the proposed method improved or
were compara-ble under almost all conditions.
n The translation quality degrades when there are more than
1,000,000 sentences in the training data.
length all 1〜10 11〜20 21〜30 31〜40 41〜50 51〜60 61〜70 71〜
sent. 1,812 73 529 600 341 164 74 18 13
vanilla
(400K)
26.5 22.9 23.0 26.2 27.1 29.6 28.5 28.8 23.6
+ OS
(+400K)
25.0 21.8 23.0 24.9 25.7 27.0 27.0 22.5 18.1
+ concat
(+400K)
26.6 21.4 23.3 25.7 27.5 28.7 29.5 28.2 29.0
+ BT
(+400K)
28.8 24.3 25.5 28.3 29.5 30.6 31.6 28.7 28.7
+BT
+concat
(+1.2M)
29.4 25.4 25.6 28.6 30.1 31.5 33.1 29.9 30.1
Figure 2. Distribution of each data set.
Example
src
Results of the analysis shows high accuracy properties, such as the reproducibility of relative standard deviation 0.3~0.9% varified by
repetitive analyses of ten times, the clibration curves with correlation coefficient of 1 verified by tests of standard materials in using six
kinds of acetonitrile dilute solutions, and the formaldehyde detection limit of 0.0018μg/mL.
tgt
結果は,相対標準偏差0.3〜0.9%の再現性(10回の繰返し分析),相関係数1の検量線(6種類のアセトニトリル希釈溶液による標
準資料の検定),0.0018μg/mLのホルムアルデヒド検出限界,など⾼い精度を得た。
vanilla
+ BT
6種のアセトニトリル希薄溶液を⽤いた標準物質の試験及びホルムアルデヒド検出限界は0.0018μg/mLであった。
vanilla
+ BT
+ concat
分析の結果は10回の繰り返し解析で相対標準偏差0.3〜0.9%の再現性,6種のアセトニトリル希薄溶液を⽤いた標準物質の試験によ
り検証された1の相関係数を持つクライテリア曲線,及び0.0018μg/mLのホルムアルデヒド検出限界など⾼い精度を⽰した。
adequacy fluency
length win tie lose win tie lose
1 ‒ 10 4 5 3 3 7 2
11 ‒ 20 20 39 29 21 47 20
21 ‒ 30 34 35 31 33 42 25
31 ‒ 40 23 21 13 17 24 16
41 ‒ 50 10 6 11 6 10 11
51 ‒ 6 5 6 7 6 4
overall 97 111 93 87 136 78
Table 1. BLEU score Table 2. Human evaluation
Figure 3. Effectiveness of the
proposed method for each data size.
all 51‒60 61‒70 71‒
Length
Difference of BLEU score
200K 400K 600K 800K 1M 2M (orig)