Slide 1

Slide 1 text

Augmented Human Communication Laboratory
An Effective Approach to Unsupervised Machine Translation
Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre (University of the Basque Country)
ACL 2019 - https://www.aclweb.org/anthology/P19-1019/
Also in arXiv:1902.01313 - https://arxiv.org/abs/1902.01313
Presented by Katsuhito Sudoh (AHC Lab., NAIST)
Study group of Unsupervised Machine Translation (2019/12/04)

Slide 2

Slide 2 text

Quick Summary
• Unsupervised SMT
  + Subwords
  + Theoretically well-founded unsupervised tuning method
  + Joint refinement procedure
  + Unsupervised NMT
    • Initialization and on-the-fly back-translation
• Outperforms UMT baselines by 5-7 BLEU points
  • WMT 2014/2016 Fr-En, De-En
  • Better than the WMT 2014 winner (supervised SMT) in En-De

Slide 3

Slide 3 text

Brief History of Unsupervised MT
• Early attempt: Statistical Decipherment (2008-)
  • ISI (USC): Sujith Ravi, Qing Dou, Kevin Knight
• Unsupervised NMT / Unsupervised SMT (2018-)
  • UBC: Mikel Artetxe
  • FB & Sorbonne: Guillaume Lample, Alexis Conneau
• Unsupervised NMT+SMT hybrid
  • Lample+ (EMNLP 2018)
  • Marie and Fujita (arXiv preprint 2018)
  • Ren+ (AAAI 2019)

Slide 4

Slide 4 text

Quick Review of Phrase-based SMT
• Log-Linear Model
  ê = argmax_e P(e|f) = argmax_e Σ_i λ_i h_i(f, e)
• h_i: feature functions such as
  • Phrase translation probability p_φ(e|f) = Π_j p(ē_j|f̄_j), and p_φ(f|e)
  • In-phrase lexical translation probabilities p_lex(e|f), p_lex(f|e)
  • Phrasal reordering probability p_d(e|f)
    • Distance-based / Orientation-based
  • Language model p_LM(e)
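The log-linear decoding rule above can be sketched as follows; this is a minimal illustration of scoring candidates with weighted feature functions, and the feature values and weights are made-up toy numbers, not from the paper.

```python
import math

def loglinear_score(weights, features):
    """Score one candidate: sum_i lambda_i * h_i(f, e)."""
    return sum(weights[name] * value for name, value in features.items())

def decode(weights, candidates):
    """Pick the candidate translation with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(weights, c["features"]))

# Toy example with log-probability features (hypothetical values).
weights = {"phrase_tm": 1.0, "lex_tm": 0.5, "lm": 1.0, "word_penalty": -0.3}
candidates = [
    {"text": "the house",
     "features": {"phrase_tm": math.log(0.6), "lex_tm": math.log(0.5),
                  "lm": math.log(0.4), "word_penalty": 2}},
    {"text": "a home",
     "features": {"phrase_tm": math.log(0.2), "lex_tm": math.log(0.3),
                  "lm": math.log(0.5), "word_penalty": 2}},
]
best = decode(weights, candidates)
```

In a real decoder the argmax runs over exponentially many segmentations and reorderings via beam search; here the candidate list stands in for that search space.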

Slide 5

Slide 5 text

SMT Components
• Language model: monolingual
• Word/phrase penalty: parameterless
• Reordering model: parameterless (simple distortion)
• Translation model (phrase table): bilingual

Slide 6

Slide 6 text

Proposed Method
1. Phrase table induction using cross-lingual embedding mappings [Artetxe+ 2018b] (§3.1)
2. Subword-based scores on phrases (§3.2)
  • Subword-level lexical weights
3. Unsupervised tuning (§3.3)
  • Cyclic reconstruction loss
4. Joint (bidirectional) refinement (§3.4)

Slide 7

Slide 7 text

1. Phrase Table
• Using cosine similarity over n-gram embeddings (phrase2vec)
• Mapping embeddings into a cross-lingual space (VecMap)
• Initialize phrase translation probabilities over the 100 nearest neighbors with a softmax over cosine similarities:
  φ(t̄ | s̄) = exp(cos(s̄, t̄) / τ) / Σ_{t̄'} exp(cos(s̄, t̄') / τ)
• Lexical probabilities p_lex(t_i | s_j) are estimated at the word level
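The softmax-over-cosines induction can be sketched as below; the embeddings and the temperature value are hypothetical toy inputs (the slide does not give the actual τ), standing in for phrase2vec vectors already mapped into a shared space with VecMap.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def induce_phrase_probs(src_vec, tgt_vecs, tau=0.1, k=100):
    """phi(t | s): temperature softmax over the k nearest target phrases."""
    sims = {t: cosine(src_vec, v) for t, v in tgt_vecs.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    z = sum(math.exp(sims[t] / tau) for t in nearest)
    return {t: math.exp(sims[t] / tau) / z for t in nearest}

# Toy cross-lingual embeddings (hypothetical values).
src = [1.0, 0.0]
tgt = {"maison": [0.9, 0.1], "chat": [0.1, 0.9], "demeure": [0.7, 0.3]}
probs = induce_phrase_probs(src, tgt, tau=0.1)
```

Restricting the normalization to the nearest neighbors keeps the induced table sparse, which is what makes a Moses-style phrase table tractable.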

Slide 8

Slide 8 text

2. Subword-based Scores
• Introduce subword-level translation scores
• Analogous to the lexical translation probabilities p_lex:
  score_sw(t̄ | s̄) = Π_i max(δ, max_j sim_sw(t_i, s_j))
• Levenshtein (edit) distance-based simple similarity function:
  sim_sw(t̄, s̄) = 1 − distance_edit(t̄, s̄) / max(|t̄|, |s̄|)
• Of course, not applicable to language pairs with different scripts
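The edit-distance similarity on the slide is straightforward to sketch; this is a plain dynamic-programming Levenshtein distance, not the paper's implementation.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming (two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sim_sw(s, t):
    """1 - edit_distance / max length, as in the slide's formula."""
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))
```

For cognates in languages sharing a script (e.g. "Nacht" / "nacht") the similarity is high, which is exactly the signal the subword score exploits; across scripts it degenerates, as the slide notes.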

Slide 9

Slide 9 text

3. Unsupervised Tuning: Motivation
• How can we optimize the log-linear weights?
  • MERT [Och 2003] is not available in unsupervised settings
  • Previous work [Artetxe+ 2018b] used a synthetic corpus
  • Lample et al. (2018b) don't perform any tuning(!)
• A more principled approach: a novel unsupervised optimization objective
  • Cyclic consistency loss
  • Language model loss
• Inspired by CycleGANs [Zhu+ 2017] and dual learning [He+ 2016]

Slide 10

Slide 10 text

3. Unsupervised Tuning: Objective
• Bidirectional loss
  L = L_cycle(E) + L_cycle(F) + L_lm(E) + L_lm(F)
• BLEU-based cycle consistency loss
  L_cycle(E) = 1 − BLEU(T_{F→E}(T_{E→F}(E)), E)
• Entropy-based language model loss
  L_lm(E) = LP(E) × LP(F) × max(0, H(F) − H(T_{E→F}(E)))²
• Length penalty (LP):
  LP(E) = max(1, |T_{F→E}(T_{E→F}(E))| / |E|)
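The individual loss terms can be sketched as below; BLEU scores, entropies, and lengths are assumed to be supplied by external components (an MT evaluation tool and a language model), so the functions only combine them as in the formulas above.

```python
def cycle_loss(bleu):
    """1 - BLEU of the round-trip reconstruction against the original text."""
    return 1.0 - bleu

def length_penalty(orig_len, roundtrip_len):
    """LP = max(1, round-trip length / original length): penalizes length inflation."""
    return max(1.0, roundtrip_len / orig_len)

def lm_loss(lp_e, lp_f, h_real, h_translated):
    """Squared hinge on the entropy gap between real target-language text
    and the system's translations, scaled by both length penalties."""
    return lp_e * lp_f * max(0.0, h_real - h_translated) ** 2
```

The cycle term alone could be gamed by degenerate translations, which is why the language model term and the length penalties are combined with it in the bidirectional objective.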

Slide 11

Slide 11 text

3. Unsupervised Tuning: Optimization
• Line search & greedy update using n-best lists (as in MERT)
• Naively requires n²-best lists: T_{E→F}(E) and then T_{F→E}(T_{E→F}(E))
• Alternating optimization instead, for efficiency
  • Optimize one model while keeping the parameters of the other fixed
• Note: MERT-like parameter updates only work with a relatively small number of parameters (~20).

Slide 12

Slide 12 text

4. Joint Refinement
• Use two synthetic parallel corpora (10M sentences each, for efficiency)
  • The grammatical (original monolingual) side should be used as the target
  • (E, T_{E→F}(E)): induces the F → E phrase table and reordering table
  • (F, T_{F→E}(F)): induces the E → F phrase table and reordering table
• 3 iterations of synthesis, table induction, and tuning
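The refinement loop might be sketched schematically as follows; `translate`, `induce_tables`, and `tune` are hypothetical placeholders for the actual Moses-based synthesis, table-induction, and tuning steps, so this only shows the control flow, not the real pipeline.

```python
def joint_refinement(mono_e, mono_f, translate, induce_tables, tune, iterations=3):
    """Alternate synthesis -> table induction -> tuning, in both directions."""
    tables_ef = tables_fe = None
    for _ in range(iterations):
        synth_f = translate("e2f", mono_e)           # (E, T_e2f(E)) synthetic corpus
        synth_e = translate("f2e", mono_f)           # (F, T_f2e(F)) synthetic corpus
        tables_fe = induce_tables(synth_f, mono_e)   # retrain F -> E from synthetic F
        tables_ef = induce_tables(synth_e, mono_f)   # retrain E -> F from synthetic E
        tune(tables_ef, tables_fe)                   # unsupervised tuning (Footnote 8:
                                                     # skipped in the last iteration)
    return tables_ef, tables_fe
```

Each iteration regenerates both synthetic corpora with the current models, so the two directions improve jointly rather than one direction drifting ahead of the other.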

Slide 13

Slide 13 text

Footnote 8
• For the last iteration, we do not perform any tuning and use default Moses weights instead, which we found to be more robust during development. Note, however, that using unsupervised tuning during the previous steps was still strongly beneficial.

Slide 14

Slide 14 text

Hybridization with NMT
• Use the unsupervised SMT to initialize dual NMT
  • Synthetic pairs (E, T_{E→F}^SMT(E)) train NMT for F → E
  • Synthetic pairs (F, T_{F→E}^SMT(F)) train NMT for E → F
• Progressively switch to the NMT-based synthetic corpus:
  N_SMT = N × max(0, 1 − k/a)
• Mixing greedy search and sampling, inspired by Edunov et al. (2018)
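The linear decay schedule can be sketched as below; the formula is reconstructed from the garbled slide, with k the NMT training iteration, N the total synthetic corpus size, and a the decay horizon (the slide's hyperparameters).

```python
def smt_synthetic_size(n, k, a):
    """Number of SMT-generated synthetic sentences at NMT iteration k.

    Starts at n and decays linearly to 0 by iteration a, after which
    training relies purely on NMT back-translation.
    """
    return n * max(0.0, 1.0 - k / a)
```

With N = 1M and a = 30 (Slide 15), the SMT contribution vanishes halfway through the 60 training iterations, handing synthesis over entirely to NMT back-translation.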

Slide 15

Slide 15 text

Final NMT System
• Ensemble of all checkpoints saved every 10 iterations
• Beam search decoding
• Hyperparameters: N = 1M, a = 30
• 60 iterations

Slide 16

Slide 16 text

Experiments
• Training: WMT 2014 Fr-En / De-En
  • Monolingual News Crawl (2007-2013)
  • #tokens: 749M (Fr), 1,606M (De), 2,109M (En)
  • 2,000 sentences chosen for tuning
  • Moses-based normalization, tokenization, truecasing
• Test: in-house (Fr-En), newstest2014/2016 (De-En)
• SMT: Moses, KenLM (5-gram), Z-MERT, FastAlign
• NMT: fairseq-based Transformer-Big

Slide 17

Slide 17 text

Evaluation Metric
• Case-sensitive BLEU
  • multi-bleu.perl with pre-tokenization
  • SacreBLEU with detokenization
    BLEU+case.mixed+lang.LANG+numrefs.1+smooth.exp+test.TEST+tok.13a+version.1.2.11
    • LANG = {fr-en, en-fr, de-en, en-de}
    • TEST = {wmt14/full, wmt16}

Slide 18

Slide 18 text

Results (Overall)

Slide 19

Slide 19 text

Results (SMT-NMT hybrid)
• The proposed system has the best initial SMT and shows the largest gain from hybridization

Slide 20

Slide 20 text

Results (unsupervised NMT vs. SMT)
• Is SMT's modular architecture better suited to the initial stage?

Slide 21

Slide 21 text

Results (supervised vs. unsupervised)
• Outperformed the 2014 winner in En-De!

Slide 22

Slide 22 text

Translation Examples (1/2)
• The proposed system produces much more fluent output

Slide 23

Slide 23 text

Translation Examples (2/2)

Slide 24

Slide 24 text

Conclusions
• Extends their previous work with:
  • Subword information (only applicable to languages sharing a script)
  • Principled unsupervised tuning
  • Joint bidirectional refinement
  • SMT-initialized NMT with iterative training
• Performed best on the WMT De-En newstest, even better than the supervised SMT-based winner of WMT 2014
• Code available: https://github.com/artetxem/monoses