
WAT2021_Machine Translation with Pre-specified Target-side Words Using a Semi-autoregressive Model

maskcott
August 06, 2021


Transcript

  1. Machine Translation with Pre-specified Target-side Words Using a Semi-autoregressive Model

    Seiichiro Kondo, Aomi Koyama, Tomoshige Kiyuna, Tosho Hirasawa, Mamoru Komachi Tokyo Metropolitan University WAT2021, Restricted Translation task 1
  2. Restricted translation task • This task requires the output sentence

    to contain all the pre-specified restricted target vocabularies (RTVs). • We are given a source sentence and a set of RTVs, and we are supposed to generate an output sentence that contains all the RTVs in the set. 2
  3. Our approach to this task • Corpus refinement • RecoverSAT

    (Ran et al., 2020) • Sorting RTVs Using Source Alignment 3
  4. Corpus Refinement

    • The ASPEC (Nakazawa et al., 2016) training sentences are ordered by sentence alignment scores. → The sentences with lower scores are considered relatively noisy data. • We used forward-translation to refine the latter half of the ASPEC training data, following Morishita et al. (2019). (Figure: a Ja-En model is trained on the cleaner first half (x_1, y_1), …, (x_{n/2}, y_{n/2}); it forward-translates the source sides of the latter half, and the resulting y'_{n/2+1}, …, y'_n replace the original references, giving the refined training data (x_1, y_1), …, (x_{n/2}, y_{n/2}), (x_{n/2+1}, y'_{n/2+1}), …, (x_n, y'_n).) 4
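A minimal sketch of this forward-translation step, assuming the training pairs are already sorted by alignment score and that a Ja-En model trained on the cleaner half is available as a plain callable. The names `pairs`, `ja_en_model`, and `refine_corpus` are placeholders for illustration, not the authors' code.

```python
from typing import Callable, List, Tuple

def refine_corpus(
    pairs: List[Tuple[str, str]],          # (ja, en) pairs sorted by alignment score, best first
    ja_en_model: Callable[[str], str],     # Ja->En translator trained on the first (cleaner) half
) -> List[Tuple[str, str]]:
    half = len(pairs) // 2
    clean_half = pairs[:half]              # keep the original references
    noisy_half = pairs[half:]              # replace references by forward translations
    refined = [(ja, ja_en_model(ja)) for ja, _ in noisy_half]
    return clean_half + refined
```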
  5. RecoverSAT

    • The decoder splits the target sentence into segments that are generated in parallel, while each segment itself is generated token by token (semi-autoregressively), starting from BOS and ending with EOS. • A segment that ends with the special DEL token is discarded, which lets the model recover from redundant segments. (Figure: an encoder-decoder example translating a Japanese sentence into "I will look up a word"; the segments "I will", "look up a", and "word" are kept, while one redundant segment is terminated with DEL and deleted.) 5
  6. Forced translation

    • We force RecoverSAT to generate the RTV at the beginning of an arbitrary segment. • Once the RTV has been generated, the model predicts the remainder of the segment in a semi-autoregressive manner. 6
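A rough sketch of how prefix-forced, segment-parallel decoding could look. This is not the authors' RecoverSAT implementation: `model_step`, the token budget, and the EOS/DEL strings are placeholders, and the real model conditions each segment on all tokens generated so far.

```python
from typing import Callable, Dict, List

EOS, DEL = "<eos>", "<del>"

def forced_decode(
    model_step: Callable[[List[List[str]], int], str],  # (all segments so far, segment id) -> next token
    num_segments: int,
    forced_prefix: Dict[int, List[str]],                 # segment id -> RTV tokens forced at its start
    max_len: int = 50,
) -> List[str]:
    # each segment starts from its forced RTV prefix (empty list if nothing is forced)
    segments: List[List[str]] = [forced_prefix.get(i, [])[:] for i in range(num_segments)]
    finished = [False] * num_segments
    for _ in range(max_len):
        for i in range(num_segments):
            if finished[i]:
                continue
            token = model_step(segments, i)               # free prediction after the forced prefix
            segments[i].append(token)
            if token in (EOS, DEL):
                finished[i] = True
        if all(finished):
            break
    # concatenate kept segments; segments terminated by DEL are dropped
    out: List[str] = []
    for seg in segments:
        if seg and seg[-1] == DEL:
            continue
        out.extend(t for t in seg if t != EOS)
    return out
```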
  7. Where did we input RTVs?

    • We place the i-th RTV at the P_i-th segment, where P_i is given by a formula (shown on the original slide) in terms of N_S, the number of segments, and N_V, the number of RTVs. 7
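The placement formula itself appears only as an image on the original slide, so the rule below is an assumed even-spacing instantiation, P_i = ceil(i * N_S / N_V), shown purely for illustration; the paper's actual formula may differ. It produces the `forced_prefix` mapping used in the decoding sketch above.

```python
import math
from typing import Dict, List

def assign_segments(rtvs: List[List[str]], num_segments: int) -> Dict[int, List[str]]:
    """Assumed even-spacing rule (not the slide's formula): place the i-th RTV
    (1-indexed) at segment P_i = ceil(i * N_S / N_V), returned 0-indexed.
    Assumes the number of RTVs does not exceed the number of segments."""
    n_v = len(rtvs)
    placement: Dict[int, List[str]] = {}
    for i, rtv_tokens in enumerate(rtvs, start=1):
        p_i = math.ceil(i * num_segments / n_v) - 1
        placement[p_i] = rtv_tokens
    return placement
```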
  8. Sorting RTVs Using Source Alignment

    • RecoverSAT outputs the RTVs in the order in which they are inserted. → The order of inserting RTVs is therefore important for accurate translation. • We used GIZA++ to align each RTV with a word in the source sentence and sorted the RTVs in the order of their corresponding source words. (Figure: given the RTV list "A, B, C" and the source sentence "~b~a~c~.", the GIZA++ alignment reorders the list to "B, A, C", which is then passed to RecoverSAT.) 8
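A minimal sketch of the sorting step. Parsing GIZA++'s alignment output is omitted; the `align` dictionary (RTV string -> index of its aligned source word) is assumed to have been built from it.

```python
from typing import Dict, List

def sort_rtvs(rtvs: List[str], align: Dict[str, int]) -> List[str]:
    # RTVs with no known alignment are kept at the end in their original order
    unknown = len(align) + len(rtvs)
    return sorted(rtvs, key=lambda rtv: align.get(rtv, unknown))

# Toy example matching the slide's figure:
# sort_rtvs(["A", "B", "C"], {"A": 3, "B": 1, "C": 5}) -> ["B", "A", "C"]
```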
  9. Experimental Setup

    • We used SentencePiece to tokenize the training data, with the vocabulary size set to 4,000. • When determining the insertion order of RTVs with GIZA++, we used MeCab with IPADIC to tokenize the Japanese sentences. Model • Transformer (base) model as in Vaswani et al. (2017). • RecoverSAT as in Ran et al. (2020); we examined four models with different numbers of segments: 10, 14, 21, and 29. Data set • ASPEC: train 3,000,000 / validation 1,790 / test 1,812 sentences. 9
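A hedged sketch of the tokenizer setup. Only the vocabulary size of 4,000 comes from the slide; the file names and the use of a single joint model are assumptions for illustration.

```python
import sentencepiece as spm

# Placeholder paths: train.ja-en.txt and aspec_sp are not the authors' actual files.
spm.SentencePieceTrainer.train(
    input="train.ja-en.txt",   # concatenated training text (assumption: joint Ja/En model)
    model_prefix="aspec_sp",
    vocab_size=4000,           # vocabulary size stated on the slide
)

sp = spm.SentencePieceProcessor(model_file="aspec_sp.model")
pieces = sp.encode("ソリトン解について説明した。", out_type=str)  # illustrative sentence
```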
  10. Evaluation

    • BLEU score. • Consistency score. → The ratio of translations that contain exact matches of all the given constraints, computed over the entire test corpus. • Final score. → The BLEU score using only the translations that exactly matched their RTVs. 10
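A sketch of the consistency and final scores following the slide's definitions; the official WAT evaluation scripts may differ in detail, and blanking out constraint-violating outputs before corpus BLEU is an assumption about how "using only the matched translations" is scored.

```python
from typing import List
import sacrebleu

def satisfies(hyp: str, rtvs: List[str]) -> bool:
    return all(rtv in hyp for rtv in rtvs)

def consistency(hyps: List[str], rtv_sets: List[List[str]]) -> float:
    # fraction of outputs containing every given constraint
    return sum(satisfies(h, r) for h, r in zip(hyps, rtv_sets)) / len(hyps)

def final_score(hyps: List[str], refs: List[str], rtv_sets: List[List[str]]) -> float:
    # Assumption: hypotheses that violate a constraint are replaced by the empty
    # string before corpus BLEU is computed.
    kept = [h if satisfies(h, r) else "" for h, r in zip(hyps, rtv_sets)]
    return sacrebleu.corpus_bleu(kept, [refs]).score
```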
  11. Overall results

    Model                                      BLEU   Consistency  Final
    Transformer                                27.78  0.220         0.27
      + Append RTVs                            25.57  1.000        26.75
    RecoverSAT                                 25.76  0.197         0.16
      + Forced translation with random order   26.93  0.962        26.98
      + Forced translation with sorted order   27.16  0.961        27.10
      + Forced translation with oracle order   31.14  0.966        31.02

    "Append RTVs": we insert RTVs at the tail of the output sentence without sorting. "random order": we insert RTVs without sorting. "sorted order": we insert RTVs in the order of the corresponding source words. "oracle order": we insert RTVs in the same order as that in the reference. 11
  12. Ablation of segments

    (Figure: results for each number of segments, comparing RecoverSAT without RTVs, forced translation without sorting RTVs, forced translation with sorted RTVs, and forced translation with RTVs in the reference order.) 12
  13. Example

    RTVs: sine-Gordon equation / 2 kink solution / soliton solution of the KdV equation / nonlinear Schroedinger equation / breather solution / envelope soliton / were described.

    reference: The soliton solution of the KdV equation was explained, and next, sine-Gordon equation and breather solution of the nonlinear Schroedinger equation and 2 kink solution and envelope soliton were described.
    without sorting (BLEU 60.19, τ 0.3): sine-Gordon equation and 2 kink solution and soliton solution of the KdV equation were explained , and nonlinear Schroedinger equation were described , and next , the breather solution and envelope soliton were described.
    with sorting (BLEU 76.56, τ 0.6): soliton solution of the KdV equation was explained , and next , sine-Gordon equation and 2 kink solution and nonlinear Schroedinger equation breather solution , and envelope soliton were described.
    oracle (BLEU 89.44, τ 1.0): soliton solution of the KdV equation was explained , and next , sine-Gordon equation and breather solution of the nonlinear Schroedinger equation , 2 kink solution and envelope soliton were described.

    τ denotes the Kendall rank correlation coefficient of the RTVs. 13
  14. Conclusions • We introduced a semi-autoregressive approach to tackle the

    restricted translation task. • RecoverSAT could output almost all the RTVs. • The importance of the order of the RTVs was confirmed. • As future work, we need to investigate how to determine the best order in which to insert the RTVs. 14
  15. Ablation of segments

    (Figure: BLEU score and Consistency score for each number of segments, comparing RecoverSAT without RTVs, forced translation without sorting RTVs, forced translation with sorted order, and forced translation with oracle order.) 15