Reading: An Effective Approach to Unsupervised Machine Translation (Artetxe et al. ACL 2019)

Katsuhito Sudoh
December 04, 2019

Slides used in the internal reading group on unsupervised machine translation.

Transcript

  1. An Effective Approach to Unsupervised Machine Translation
    • Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre (University of the Basque Country)
    • ACL 2019 - https://www.aclweb.org/anthology/P19-1019/
    • Also in arXiv:1902.01313 - https://arxiv.org/abs/1902.01313
    • Presented by Katsuhito Sudoh (AHC Lab., NAIST)
    • Study group on Unsupervised Machine Translation (2019/12/04)
  2. Quick Summary
    • Unsupervised SMT
      + Subwords
      + Theoretically well-founded unsupervised tuning method
      + Joint refinement procedure
      + Unsupervised NMT
    • Initialization and on-the-fly back-translation
    • Outperforms UMT baselines by 5-7 BLEU points
      • WMT 2014/2016 Fr-En, De-En
      • Better than the WMT 2014 winner (supervised SMT) in En-De
  3. Brief History of Unsupervised MT
    • Early attempt: statistical decipherment (2008-)
      • ISI (USC): Sujith Ravi, Qing Dou, Kevin Knight
    • Unsupervised NMT / Unsupervised SMT (2018-)
      • University of the Basque Country: Mikel Artetxe
      • FB & Sorbonne: Guillaume Lample, Alexis Conneau
    • Unsupervised NMT+SMT hybrids
      • Lample+ (EMNLP 2018)
      • Marie and Fujita (arXiv preprint 2018)
      • Ren+ (AAAI 2019)
  4. Quick Review of Phrase-based SMT
    • Log-linear model (a minimal scoring sketch follows below):
      $\hat{e} = \operatorname*{argmax}_e P(e \mid f) = \operatorname*{argmax}_e \sum_k \lambda_k h_k(f, e)$
    • $h_k$: feature functions such as
      • Phrase translation probabilities $p_\phi(e \mid f) = \prod_i p(\bar{e}_i \mid \bar{f}_i)$ and $p_\phi(f \mid e)$
      • In-phrase lexical translation probabilities $p_{\mathrm{lex}}(e \mid f)$, $p_{\mathrm{lex}}(f \mid e)$
      • Phrasal reordering probability $p_d(e \mid f)$
        • Distance-based / orientation-based
      • Language model $p_{\mathrm{LM}}(e)$
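
A minimal sketch of how such a log-linear model scores and picks a candidate translation. The feature values are assumed to already be log-probabilities (as is conventional in Moses-style systems), and all names here are illustrative rather than taken from the paper:

```python
def loglinear_score(weights, features):
    """sum_k lambda_k * h_k(f, e); the h_k are assumed to be log feature values
    (phrase translation, lexical weights, reordering, language model, ...)."""
    return sum(w * h for w, h in zip(weights, features))

def decode(weights, candidates):
    """argmax over candidate translations e; `candidates` maps each candidate
    string to its feature-value vector for the given source sentence f."""
    return max(candidates, key=lambda e: loglinear_score(weights, candidates[e]))
```
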
  5. SMT Components
    • Language model: monolingual
    • Word/phrase penalty: parameterless
    • Reordering model: parameterless (simple distortion)
    • Translation model (phrase table): bilingual
  6. Proposed Method
    1. Phrase table induction using cross-lingual embedding mappings [Artetxe+ 2018b] (§3.1)
    2. Subword-based scores on phrases (§3.2)
      • Subword-level lexical weights
    3. Unsupervised tuning (§3.3)
      • Cyclic reconstruction loss
    4. Joint (bidirectional) refinement (§3.4)
  7. Phrase Table
    • Using cosine similarity over n-gram embeddings (phrase2vec)
    • Mapping embeddings into a cross-lingual space (VecMap)
    • Initialize phrase translation probabilities over the 100 nearest neighbors ($\tau$: temperature; sketch below):
      $\phi(\bar{f} \mid \bar{e}) = \dfrac{\exp\!\big(\cos(\bar{e}, \bar{f}) / \tau\big)}{\sum_{\bar{f}'} \exp\!\big(\cos(\bar{e}, \bar{f}') / \tau\big)}$
    • Lexical probabilities $p_{\mathrm{lex}}(e_i \mid f_j)$ are estimated at the word level
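
A minimal sketch of this induction step, assuming the phrase embeddings are already L2-normalized and mapped into a shared cross-lingual space (e.g. with VecMap); the function name, variable names, and temperature value are illustrative placeholders, not the paper's implementation:

```python
import numpy as np

def induce_phrase_translations(src_vecs, tgt_vecs, k=100, temperature=0.1):
    """For each source phrase, keep its k nearest target phrases and
    softmax-normalize their cosine similarities into translation probabilities."""
    # Cosine similarity reduces to a dot product for normalized vectors.
    sims = src_vecs @ tgt_vecs.T                      # (n_src, n_tgt)
    table = []
    for row in sims:
        nn = np.argsort(-row)[:k]                     # 100 nearest neighbors
        scores = np.exp(row[nn] / temperature)        # exp(cos(., .) / tau)
        probs = scores / scores.sum()                 # normalize over neighbors
        table.append(list(zip(nn.tolist(), probs.tolist())))
    return table
```
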
  8. Subword-based Scores
    • Introduce subword-level translation scores
    • Analogous to the lexical translation probabilities $p_{\mathrm{lex}}$ (sketch below):
      $\mathrm{score}_{\mathrm{sw}}(\bar{f} \mid \bar{e}) = \frac{1}{|\bar{e}|} \sum_i \max\big(\epsilon,\, \max_j \mathrm{sim}_{\mathrm{sw}}(\bar{e}_i, \bar{f}_j)\big)$
    • Levenshtein (edit) distance-based simple similarity function:
      $\mathrm{sim}_{\mathrm{sw}}(\bar{f}, \bar{e}) = 1 - \dfrac{\mathrm{dist}_{\mathrm{edit}}(\bar{f}, \bar{e})}{\max(|\bar{f}|, |\bar{e}|)}$
    • Of course, not applicable to language pairs with different scripts
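
A small sketch of these two functions over subword token strings; the floor constant `eps` is an illustrative guess, not the value used in the paper:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def sim_sw(f, e):
    """1 - edit distance / max length, as on the slide."""
    return 1.0 - edit_distance(f, e) / max(len(f), len(e))

def score_sw(f_phrase, e_phrase, eps=0.05):
    """Average, over target-side tokens, of the best-matching source token
    similarity, floored at eps."""
    return sum(max(eps, max(sim_sw(e_tok, f_tok) for f_tok in f_phrase))
               for e_tok in e_phrase) / len(e_phrase)
```
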
  9. Unsupervised Tuning: Motivation
    • How can we optimize the log-linear weights?
      • MERT [Och 2003] is not available in unsupervised settings
      • Previous work [Artetxe+ 2018b] used a synthetic corpus
      • Lample et al. (2018b) do not perform any tuning(!)
    • A more principled approach
      • A novel unsupervised optimization objective
        • Cyclic consistency loss
        • Language model loss
      • Inspired by CycleGAN [Zhu+ 2017] and dual learning [He+ 2016]
  10. Unsupervised Tuning: Objective
    • Bidirectional loss (a worked sketch follows below):
      $\mathcal{L} = \mathcal{L}_{\mathrm{cycle}}(E) + \mathcal{L}_{\mathrm{cycle}}(F) + \mathcal{L}_{\mathrm{LM}}(E) + \mathcal{L}_{\mathrm{LM}}(F)$
    • BLEU-based cycle consistency loss:
      $\mathcal{L}_{\mathrm{cycle}}(E) = 1 - \mathrm{BLEU}\big(T_{F \to E}(T_{E \to F}(E)),\, E\big)$
    • Entropy-based language model loss:
      $\mathcal{L}_{\mathrm{LM}}(E) = \mathrm{LP}(E) \times \mathrm{LP}(F) \times \max\big(0,\, \mathcal{H}(F) - \mathcal{H}(T_{E \to F}(E))\big)^2$
    • Length penalty (LP):
      $\mathrm{LP}(E) = \max\left(1,\, \frac{|T_{F \to E}(T_{E \to F}(E))|}{|E|}\right)$
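
A rough sketch of how this objective could be evaluated on a monolingual tuning sample, using sacrebleu for BLEU. The `translate_*` and `lm_entropy_*` callables are placeholders for the two SMT directions being tuned and for a per-word entropy estimate under each language model, and the corpus-level length penalty is an assumption about granularity, not the paper's exact definition:

```python
import sacrebleu

def tuning_loss(mono_e, mono_f, translate_ef, translate_fe, lm_entropy_e, lm_entropy_f):
    """Unsupervised tuning objective sketch: cycle-consistency + LM losses."""
    # Round-trip translations E -> F -> E and F -> E -> F.
    e2f = translate_ef(mono_e)
    round_e = translate_fe(e2f)
    f2e = translate_fe(mono_f)
    round_f = translate_ef(f2e)

    # Cycle-consistency loss: 1 - BLEU (scaled to [0, 1]) of the round trip.
    cycle_e = 1 - sacrebleu.corpus_bleu(round_e, [mono_e]).score / 100
    cycle_f = 1 - sacrebleu.corpus_bleu(round_f, [mono_f]).score / 100

    # Length penalty: ratio of round-trip length to the original length.
    def length(corpus):
        return sum(len(s.split()) for s in corpus)
    lp_e = max(1.0, length(round_e) / length(mono_e))
    lp_f = max(1.0, length(round_f) / length(mono_f))

    # LM loss: penalize translations whose entropy is lower than that of
    # real target-side text (degenerate, over-copied output).
    lm_e = lp_e * lp_f * max(0.0, lm_entropy_f(mono_f) - lm_entropy_f(e2f)) ** 2
    lm_f = lp_e * lp_f * max(0.0, lm_entropy_e(mono_e) - lm_entropy_e(f2e)) ** 2

    return cycle_e + cycle_f + lm_e + lm_f
```
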
  11. Unsupervised Tuning: Optimization
    • Line search & greedy update using n-best lists (as in MERT)
    • Naively requires n²-best lists: $T_{E \to F}(E)$ and $T_{F \to E}(T_{E \to F}(E))$
    • Alternating optimization instead, for efficiency (sketch below)
      • Optimize one model while keeping the other's parameters fixed
    • Note: MERT-like parameter updates only work with a relatively small number of parameters (~20)
  12. Joint Refinement
    • Use two synthetic parallel corpora (10M sentences, for efficiency)
      • The grammatical (natural) side should be the target side
    • From $(E, T_{E \to F}(E))$: $F \to E$ phrase table $\phi(\bar{e} \mid \bar{f})$, $E \to F$ reordering table
    • From $(T_{F \to E}(F), F)$: $E \to F$ phrase table $\phi(\bar{f} \mid \bar{e})$, $F \to E$ reordering table
    • 3 iterations of synthesis, table induction, and tuning (sketch below)
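
A coarse sketch of this loop, under the assumption that each direction's tables are re-estimated from the synthetic corpus whose target side is natural text; every callable and argument name is a placeholder for the corresponding Moses-based step, not an actual API:

```python
def joint_refinement(mono_e, mono_f, system_ef, system_fe, train_tables, tune, iterations=3):
    """Back-translate in both directions, re-induce the tables from the
    synthetic corpora, then run unsupervised tuning; repeat a few times."""
    for _ in range(iterations):
        # Synthetic corpora: the natural (grammatical) side is kept as target.
        synth_f = system_ef.translate(mono_e)   # pair (mono_e, synth_f)
        synth_e = system_fe.translate(mono_f)   # pair (synth_e, mono_f)
        # Re-estimate each direction from the corpus with a natural target side.
        system_fe = train_tables(src=synth_f, tgt=mono_e)
        system_ef = train_tables(src=synth_e, tgt=mono_f)
        # Unsupervised tuning of both directions after each re-training.
        system_ef, system_fe = tune(system_ef, system_fe)
    return system_ef, system_fe
```
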
  13. Footnote 8
    • "For the last iteration, we do not perform any tuning and use default Moses weights instead, which we found to be more robust during development. Note, however, that using unsupervised tuning during the previous steps was still strongly beneficial."
  14. Hybridization with NMT
    • Use the unsupervised SMT system to initialize dual NMT training
      • $(E, T^{\mathrm{smt}}_{E \to F}(E))$ for NMT $F \to E$
      • $(T^{\mathrm{smt}}_{F \to E}(F), F)$ for NMT $E \to F$
    • Progressively switch to the NMT-based synthetic corpus (schedule sketched below):
      $N_{\mathrm{smt}} = N \times \max(0,\, 1 - k/a)$
    • Mixing greedy search and sampling, inspired by Edunov et al. (2018)
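
A tiny sketch of that linear switching schedule, plugging in the hyperparameter values listed on the next slide (N = 1M, a = 30); the function name is illustrative:

```python
def smt_backtranslation_size(iteration, total=1_000_000, a=30):
    """Number of SMT back-translated sentences used at NMT training iteration k,
    following N_smt = N * max(0, 1 - k/a); the remainder of the synthetic
    corpus comes from the NMT system's own back-translations."""
    return int(total * max(0.0, 1.0 - iteration / a))

# The SMT-generated share decays linearly and vanishes after a = 30 iterations,
# so the last 30 of the 60 iterations use NMT back-translation only.
sizes = [smt_backtranslation_size(k) for k in (0, 15, 30, 45)]
# -> [1000000, 500000, 0, 0]
```
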
  15. Final NMT System
    • Ensemble of the checkpoints saved every 10 iterations
    • Beam search decoding
    • Hyperparameters: N = 1M, a = 30
    • 60 iterations
  16. Experiments
    • Training: WMT 2014 Fr-En / De-En
      • Monolingual News Crawl (2007-2013)
      • #tokens: 749M (Fr), 1,606M (De), 2,109M (En)
      • 2,000 sentences chosen for tuning
      • Moses-based normalization, tokenization, truecasing
    • Test: in-house (Fr-En), newstest2014/2016 (De-En)
    • SMT: Moses, KenLM (5-gram), Z-MERT, FastAlign
    • NMT: fairseq-based Transformer-Big
  17. Evaluation Metric
    • Case-sensitive BLEU
      • multi-bleu.perl with pre-tokenization
      • SacreBLEU with detokenization (example below)
        • BLEU+case.mixed+lang.LANG+numrefs.1+smooth.exp+test.TEST+tok.13a+version.1.2.11
        • LANG = {fr-en, en-fr, de-en, en-de}
        • TEST = {wmt14/full, wmt16}
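
As an illustration only (not from the slides), a detokenized, case-sensitive BLEU computation roughly matching that signature could look like this with the sacrebleu Python API; the output and reference lists are hypothetical:

```python
import sacrebleu

# Hypothetical detokenized system outputs and reference translations (de-en).
system_outputs = ["The committee approved the proposal yesterday."]
references = ["The committee approved the proposal yesterday."]

# Case-sensitive BLEU with the 13a tokenizer, as in the reported signature.
bleu = sacrebleu.corpus_bleu(system_outputs, [references], tokenize="13a", lowercase=False)
print(bleu.score)
```
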
  18. Results (unsupervised NMT vs. SMT)
    • Is SMT's modular architecture more suitable in the early stages?
  19. Conclusions
    • Extends their previous work with:
      • Subword information (only applicable to language pairs sharing a script)
      • Principled unsupervised tuning
      • Joint bidirectional refinement
      • SMT-initialized NMT with iterative training
    • Performed best on the WMT De-En newstest sets, even better than the supervised SMT-based winner of WMT 2014
    • Code available: https://github.com/artetxem/monoses