
WAT2021_Machine Translation with Pre-specified Target-side Words Using a Semi-autoregressive Model

maskcott
August 06, 2021


Transcript

  1. Machine Translation with Pre-specified Target-side Words Using a Semi-autoregressive Model

    Seiichiro Kondo, Aomi Koyama, Tomoshige Kiyuna, Tosho Hirasawa, Mamoru Komachi Tokyo Metropolitan University WAT2021, Restricted Translation task 1
  2. Restricted translation task • This task requires the output sentence

    to contain all the pre-specified restricted target vocabularies (RTVs). • We are given a source sentence and a set of RTVs, and we are supposed to generate an output sentence that contains all the RTVs in the set. 2
  3. Our approach to this task • Corpus refinement • RecoverSAT

    (Ran et al., 2020) • Sorting RTVs Using Source Alignment 3
  4. Corpus Refinement

    • The ASPEC (Nakazawa et al., 2016) training sentences are ordered by sentence alignment scores. → The sentences with lower scores are considered relatively noisy data. • We used forward-translation to refine the latter half of the ASPEC training data, following Morishita et al. (2019). (Figure: a Ja-En model is trained on the cleaner first half (x_1, y_1), …, (x_{n/2}, y_{n/2}); it forward-translates the source sides of the latter half, and the resulting y'_{n/2+1}, …, y'_n replace the original references, giving the refined training data (x_1, y_1), …, (x_{n/2}, y_{n/2}), (x_{n/2+1}, y'_{n/2+1}), …, (x_n, y'_n).) 4
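A minimal sketch of this forward-translation step, assuming the training pairs are already sorted by alignment score and that a Ja-En model trained on the cleaner half is available as a plain callable. The names `pairs`, `ja_en_model`, and `refine_corpus` are placeholders for illustration, not the authors' code.

```python
from typing import Callable, List, Tuple

def refine_corpus(
    pairs: List[Tuple[str, str]],          # (ja, en) pairs sorted by alignment score, best first
    ja_en_model: Callable[[str], str],     # Ja->En translator trained on the first (cleaner) half
) -> List[Tuple[str, str]]:
    half = len(pairs) // 2
    clean_half = pairs[:half]              # keep the original references
    noisy_half = pairs[half:]              # replace references by forward translations
    refined = [(ja, ja_en_model(ja)) for ja, _ in noisy_half]
    return clean_half + refined
```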
  5. RecoverSAT

    • The decoder splits the target sentence into segments that are generated in parallel, while each segment itself is generated token by token (semi-autoregressively), starting from BOS and ending with EOS. • A segment that ends with the special DEL token is discarded, which lets the model recover from redundant segments. (Figure: an encoder-decoder example translating a Japanese sentence into "I will look up a word"; the segments "I will", "look up a", and "word" are kept, while one redundant segment is terminated with DEL and deleted.) 5
  6. Forced translation

    • We force RecoverSAT to generate the RTV at the beginning of an arbitrary segment. • Once the RTV has been generated, the model predicts the remainder of the segment in a semi-autoregressive manner. 6
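A rough sketch of how prefix-forced, segment-parallel decoding could look. This is not the authors' RecoverSAT implementation: `model_step`, the token budget, and the EOS/DEL strings are placeholders, and the real model conditions each segment on all tokens generated so far.

```python
from typing import Callable, Dict, List

EOS, DEL = "<eos>", "<del>"

def forced_decode(
    model_step: Callable[[List[List[str]], int], str],  # (all segments so far, segment id) -> next token
    num_segments: int,
    forced_prefix: Dict[int, List[str]],                 # segment id -> RTV tokens forced at its start
    max_len: int = 50,
) -> List[str]:
    # each segment starts from its forced RTV prefix (empty list if nothing is forced)
    segments: List[List[str]] = [forced_prefix.get(i, [])[:] for i in range(num_segments)]
    finished = [False] * num_segments
    for _ in range(max_len):
        for i in range(num_segments):
            if finished[i]:
                continue
            token = model_step(segments, i)               # free prediction after the forced prefix
            segments[i].append(token)
            if token in (EOS, DEL):
                finished[i] = True
        if all(finished):
            break
    # concatenate kept segments; segments terminated by DEL are dropped
    out: List[str] = []
    for seg in segments:
        if seg and seg[-1] == DEL:
            continue
        out.extend(t for t in seg if t != EOS)
    return out
```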
  7. Where did we input RTVs?

    • We place the i-th RTV at the P_i-th segment, where P_i is given by a formula (shown on the original slide) in terms of N_S, the number of segments, and N_V, the number of RTVs. 7
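The placement formula itself appears only as an image on the original slide, so the rule below is an assumed even-spacing instantiation, P_i = ceil(i * N_S / N_V), shown purely for illustration; the paper's actual formula may differ. It produces the `forced_prefix` mapping used in the decoding sketch above.

```python
import math
from typing import Dict, List

def assign_segments(rtvs: List[List[str]], num_segments: int) -> Dict[int, List[str]]:
    """Assumed even-spacing rule (not the slide's formula): place the i-th RTV
    (1-indexed) at segment P_i = ceil(i * N_S / N_V), returned 0-indexed.
    Assumes the number of RTVs does not exceed the number of segments."""
    n_v = len(rtvs)
    placement: Dict[int, List[str]] = {}
    for i, rtv_tokens in enumerate(rtvs, start=1):
        p_i = math.ceil(i * num_segments / n_v) - 1
        placement[p_i] = rtv_tokens
    return placement
```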
  8. Sorting RTVs Using Source Alignment

    • RecoverSAT outputs the RTVs in the order in which they are inserted. → The order of inserting RTVs is therefore important for accurate translation. • We used GIZA++ to align each RTV with a word in the source sentence and sorted the RTVs in the order of their corresponding source words. (Figure: given the RTV list "A, B, C" and the source sentence "~b~a~c~.", the GIZA++ alignment reorders the list to "B, A, C", which is then passed to RecoverSAT.) 8
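A minimal sketch of the sorting step. Parsing GIZA++'s alignment output is omitted; the `align` dictionary (RTV string -> index of its aligned source word) is assumed to have been built from it.

```python
from typing import Dict, List

def sort_rtvs(rtvs: List[str], align: Dict[str, int]) -> List[str]:
    # RTVs with no known alignment are kept at the end in their original order
    unknown = len(align) + len(rtvs)
    return sorted(rtvs, key=lambda rtv: align.get(rtv, unknown))

# Toy example matching the slide's figure:
# sort_rtvs(["A", "B", "C"], {"A": 3, "B": 1, "C": 5}) -> ["B", "A", "C"]
```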
  9. Experimental Setup

    • We used SentencePiece to tokenize the training data, with the vocabulary size set to 4,000. • When determining the insertion order of RTVs with GIZA++, we used MeCab with IPADIC to tokenize the Japanese sentences. Model • Transformer (base) model as in Vaswani et al. (2017). • RecoverSAT as in Ran et al. (2020); we examined four models with different numbers of segments: 10, 14, 21, and 29. Data set • ASPEC: train 3,000,000 / validation 1,790 / test 1,812 sentences. 9
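A hedged sketch of the tokenizer setup. Only the vocabulary size of 4,000 comes from the slide; the file names and the use of a single joint model are assumptions for illustration.

```python
import sentencepiece as spm

# Placeholder paths: train.ja-en.txt and aspec_sp are not the authors' actual files.
spm.SentencePieceTrainer.train(
    input="train.ja-en.txt",   # concatenated training text (assumption: joint Ja/En model)
    model_prefix="aspec_sp",
    vocab_size=4000,           # vocabulary size stated on the slide
)

sp = spm.SentencePieceProcessor(model_file="aspec_sp.model")
pieces = sp.encode("ソリトン解について説明した。", out_type=str)  # illustrative sentence
```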
  10. Evaluation

    • BLEU score. • Consistency score. → The ratio of translations that contain exact matches of all the given constraints, computed over the entire test corpus. • Final score. → The BLEU score using only the translations that exactly matched their RTVs. 10
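A sketch of the consistency and final scores following the slide's definitions; the official WAT evaluation scripts may differ in detail, and blanking out constraint-violating outputs before corpus BLEU is an assumption about how "using only the matched translations" is scored.

```python
from typing import List
import sacrebleu

def satisfies(hyp: str, rtvs: List[str]) -> bool:
    return all(rtv in hyp for rtv in rtvs)

def consistency(hyps: List[str], rtv_sets: List[List[str]]) -> float:
    # fraction of outputs containing every given constraint
    return sum(satisfies(h, r) for h, r in zip(hyps, rtv_sets)) / len(hyps)

def final_score(hyps: List[str], refs: List[str], rtv_sets: List[List[str]]) -> float:
    # Assumption: hypotheses that violate a constraint are replaced by the empty
    # string before corpus BLEU is computed.
    kept = [h if satisfies(h, r) else "" for h, r in zip(hyps, rtv_sets)]
    return sacrebleu.corpus_bleu(kept, [refs]).score
```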
  11. Overall results

    Model                                      BLEU   Consistency  Final
    Transformer                                27.78  0.220         0.27
      + Append RTVs                            25.57  1.000        26.75
    RecoverSAT                                 25.76  0.197         0.16
      + Forced translation with random order   26.93  0.962        26.98
      + Forced translation with sorted order   27.16  0.961        27.10
      + Forced translation with oracle order   31.14  0.966        31.02

    "Append RTVs": we insert RTVs at the tail of the output sentence without sorting. "random order": we insert RTVs without sorting. "sorted order": we insert RTVs in the order of the corresponding source words. "oracle order": we insert RTVs in the same order as that in the reference. 11
  12. Ablation of segments

    (Figure: results for each number of segments, comparing RecoverSAT without RTVs, forced translation without sorting RTVs, forced translation with sorted RTVs, and forced translation with RTVs in the reference order.) 12
  13. Example

    RTVs: sine-Gordon equation / 2 kink solution / soliton solution of the KdV equation / nonlinear Schroedinger equation / breather solution / envelope soliton / were described.

    reference: The soliton solution of the KdV equation was explained, and next, sine-Gordon equation and breather solution of the nonlinear Schroedinger equation and 2 kink solution and envelope soliton were described.
    without sorting (BLEU 60.19, τ 0.3): sine-Gordon equation and 2 kink solution and soliton solution of the KdV equation were explained , and nonlinear Schroedinger equation were described , and next , the breather solution and envelope soliton were described.
    with sorting (BLEU 76.56, τ 0.6): soliton solution of the KdV equation was explained , and next , sine-Gordon equation and 2 kink solution and nonlinear Schroedinger equation breather solution , and envelope soliton were described.
    oracle (BLEU 89.44, τ 1.0): soliton solution of the KdV equation was explained , and next , sine-Gordon equation and breather solution of the nonlinear Schroedinger equation , 2 kink solution and envelope soliton were described.

    τ denotes the Kendall rank correlation coefficient of the RTVs. 13
  14. Conclusions • We introduced a semi-autoregressive approach to tackle the

    restricted translation task. • RecoverSAT could output almost all the RTVs. • The importance of the order of the RTVs was confirmed. • As future work, we need to investigate how to determine the best order in which to insert the RTVs. 14
  15. Ablation of segments

    (Figure: BLEU score and Consistency score for each number of segments, comparing RecoverSAT without RTVs, forced translation without sorting RTVs, forced translation with sorted order, and forced translation with oracle order.) 15