
Reading: An Effective Approach to Unsupervised Machine Translation (Artetxe et al. ACL 2019)

Slides used in the internal reading group on unsupervised machine translation.

Katsuhito Sudoh

December 04, 2019

Transcript

  1. An Effective Approach to Unsupervised Machine Translation
    Authors: Mikel Artetxe, Gorka Labaka, Eneko Agirre (University of the Basque Country)
    ACL 2019 - https://www.aclweb.org/anthology/P19-1019/
    Also in arXiv:1902.01313 - https://arxiv.org/abs/1902.01313
    Presented by Katsuhito Sudoh (AHC Lab., NAIST)
    Augmented Human Communication Laboratory
    Study group of Unsupervised Machine Translation (2019/12/04)
  2. Quick Summary
    • Unsupervised SMT
      + Subwords
      + Theoretically well-founded unsupervised tuning method
      + Joint refinement procedure
      + Unsupervised NMT
        • Initialization and on-the-fly back-translation
    • Outperforms UMT baselines by 5-7 BLEU points
      • WMT 2014/2016 Fr-En, De-En
    • Better than the WMT 2014 winner (supervised SMT) in En-De
  3. Brief History of Unsupervised MT
    • Early attempt: Statistical Decipherment (2008-)
      • ISI (USC): Sujith Ravi, Qing Dou, Kevin Knight
    • Unsupervised NMT / Unsupervised SMT (2018-)
      • UBC: Mikel Artetxe
      • FB & Sorbonne: Guillaume Lample, Alexis Conneau
    • Unsupervised NMT+SMT hybrid
      • Lample+ (EMNLP 2018)
      • Marie and Fujita (arXiv preprint 2018)
      • Ren+ (AAAI 2019)
  4. Quick Review of Phrase-based SMT
    • Log-linear model (a scoring sketch follows below):
      $\hat{e} = \operatorname{argmax}_e P(e \mid f) = \operatorname{argmax}_e \sum_i \lambda_i h_i(f, e)$
    • $h_i$: feature functions such as
      • Phrase translation probabilities $P_\phi(e \mid f) = \prod_i P(\bar{e}_i \mid \bar{f}_i)$ and $P_\phi(f \mid e)$
      • In-phrase lexical translation probabilities $P_{lex}(e \mid f)$, $P_{lex}(f \mid e)$
      • Phrasal reordering probability $P_r(e \mid f)$
        • Distance-based / orientation-based
      • Language model $P_{LM}(e)$
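To make the log-linear decision rule on this slide concrete, here is a minimal Python sketch of scoring candidates with weighted feature functions and taking the argmax. The feature names, weights, and candidate representation are hypothetical placeholders, not the Moses implementation.

```python
# Minimal sketch of log-linear scoring: score(e) = sum_i lambda_i * h_i(f, e).
# Feature names, weights, and the candidate dictionaries are made up for illustration.

def features(cand):
    return {
        "log_p_phrase": cand["log_p_phrase"],  # log P_phi(e|f), from the phrase table
        "log_p_lex": cand["log_p_lex"],        # log P_lex(e|f), lexical weighting
        "log_p_lm": cand["log_p_lm"],          # log P_LM(e), language model score
        "word_penalty": -cand["length"],       # word penalty feature
    }

# Log-linear weights lambda_i -- exactly what MERT (or the unsupervised tuning
# in slides 9-11) is supposed to optimize.
weights = {"log_p_phrase": 1.0, "log_p_lex": 0.5, "log_p_lm": 0.8, "word_penalty": 0.2}

def score(cand):
    return sum(weights[k] * v for k, v in features(cand).items())

def decode(candidates):
    # argmax over an n-best list of candidate translations
    return max(candidates, key=score)
```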
  5. SMT Components
    • Language model: monolingual
    • Word/phrase penalty: parameterless
    • Reordering model: parameterless (simple distortion)
    • Translation model (phrase table): bilingual
  6. Proposed Method
    1. Phrase table induction using cross-lingual embedding mappings [Artetxe+ 2018b] (§3.1)
    2. Subword-based scores on phrases (§3.2)
      • Subword-level lexical weights
    3. Unsupervised tuning (§3.3)
      • Cyclic reconstruction loss
    4. Joint (bidirectional) refinement (§3.4)
  7. 1. Phrase Table
    • Using cosine similarity over n-gram embeddings (phrase2vec)
    • Mapping embeddings into a cross-lingual space (VecMap)
    • Initialize phrase translation probabilities over the 100 nearest neighbors with a softmax over cosine similarities (see the sketch below):
      $\phi(\bar{f} \mid \bar{e}) = \frac{\exp(\cos(\bar{e}, \bar{f}) / \tau)}{\sum_{\bar{f}'} \exp(\cos(\bar{e}, \bar{f}') / \tau)}$
    • Lexical probabilities $p_{lex}(f_i \mid e_j)$ are estimated at the word level
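A minimal sketch, assuming NumPy and phrase embeddings already mapped into a shared cross-lingual space, of how such a softmax over cosine similarities could be computed for the k nearest neighbors. The temperature value and function names are my own placeholders.

```python
import numpy as np

def induce_translation_probs(src_vec, tgt_matrix, tau=0.1, k=100):
    """Distribute translation probability over the k nearest target phrases of one
    source phrase via a temperature softmax over cosine similarities.
    `tau` is a placeholder value here, not the one used in the paper."""
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    cos = tgt @ src                        # cosine similarity to every target phrase
    nn = np.argsort(-cos)[:k]              # indices of the k nearest neighbours
    logits = cos[nn] / tau
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return dict(zip(nn.tolist(), probs.tolist()))  # {target phrase index: probability}
```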
  8. 2. Subword-based Scores
    • Introduce subword-level translation scores
    • Analogous to the lexical translation probabilities $P_{lex}$ (see the sketch below):
      $\mathrm{score}_{sw}(\bar{f} \mid \bar{e}) = \prod_i \max\left(\epsilon, \max_j \mathrm{sim}_{sw}(f_i, e_j)\right)$
    • Levenshtein (edit) distance-based simple similarity function:
      $\mathrm{sim}_{sw}(f, e) = 1 - \frac{\mathrm{distance}_{edit}(f, e)}{\max(|f|, |e|)}$
    • Of course, not applicable to language pairs with different scripts
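A small Python sketch of the edit-distance-based similarity and the resulting subword-level score; the floor value `eps` is a placeholder I introduce, not a constant from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = cur
    return prev[-1]

def sim_sw(f: str, e: str) -> float:
    # 1 - edit distance normalized by the length of the longer token
    return 1.0 - edit_distance(f, e) / max(len(f), len(e))

def score_sw(f_tokens, e_tokens, eps=1e-4):
    """score_sw(f|e) = prod_i max(eps, max_j sim_sw(f_i, e_j)); eps is a placeholder floor."""
    score = 1.0
    for f in f_tokens:
        score *= max(eps, max(sim_sw(f, e) for e in e_tokens))
    return score

# Example: similar surface forms in same-script language pairs get a high score
print(score_sw(["nationale"], ["national"]))
```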
  9. 3. Unsupervised Tuning: Motivation
    • How can we optimize log-linear weights?
      • MERT [Och 2003] is not available in unsupervised settings
      • Previous work [Artetxe+ 2018b] used a synthetic corpus
      • Lample et al. (2018b) do not perform any tuning(!)
    • More principled approach
      • A novel unsupervised optimization objective
        • Cyclic consistency loss
        • Language model loss
      • Inspired by CycleGAN [Zhu+ 2017] and dual learning [He+ 2016]
  10. 3. Unsupervised Tuning: Objective
    • Bidirectional loss (a code sketch follows below):
      $\mathcal{L} = \mathcal{L}_{cycle}(E) + \mathcal{L}_{cycle}(F) + \mathcal{L}_{lm}(E) + \mathcal{L}_{lm}(F)$
    • BLEU-based cycle consistency loss:
      $\mathcal{L}_{cycle}(E) = 1 - \mathrm{BLEU}\left(T_{F \to E}(T_{E \to F}(E)), E\right)$
    • Entropy-based language model loss:
      $\mathcal{L}_{lm}(E) = \mathrm{LP}(E) \times \mathrm{LP}(F) \times \max\left(0, \mathcal{H}(F) - \mathcal{H}(T_{E \to F}(E))\right)^2$
    • Length penalty (LP):
      $\mathrm{LP}(E) = \max\left(1, \frac{|T_{F \to E}(T_{E \to F}(E))|}{|E|}\right)$
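A minimal sketch of this objective, assuming helper callables `t_ef`/`t_fe` (translate a list of sentences with the current log-linear weights), `bleu` (corpus BLEU in [0, 1]) and `entropy` (per-word entropy under a language model). These helpers and the word-count length measure are my assumptions, not the released implementation.

```python
def unsupervised_tuning_loss(E, F, t_ef, t_fe, bleu, entropy):
    """Bidirectional unsupervised tuning objective as written on the slide."""
    E_rt = t_fe(t_ef(E))          # round trip E -> F -> E
    F_rt = t_ef(t_fe(F))          # round trip F -> E -> F

    def length(corpus):
        return sum(len(s.split()) for s in corpus)

    # Length penalties based on the round-trip length ratios
    lp_e = max(1.0, length(E_rt) / length(E))
    lp_f = max(1.0, length(F_rt) / length(F))

    # Cycle consistency losses
    cycle_e = 1.0 - bleu(E_rt, E)
    cycle_f = 1.0 - bleu(F_rt, F)

    # Language model losses: penalize translations whose per-word entropy is
    # suspiciously low compared to real target-language text
    lm_e = lp_e * lp_f * max(0.0, entropy(F) - entropy(t_ef(E))) ** 2
    lm_f = lp_e * lp_f * max(0.0, entropy(E) - entropy(t_fe(F))) ** 2

    return cycle_e + cycle_f + lm_e + lm_f
```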
  11. 3. Unsupervised Tuning: Optimization
    • Line search & greedy updates over n-best lists (as in MERT)
      • Naively requires $n^2$-best lists: $T_{E \to F}(E)$ and $T_{F \to E}(T_{E \to F}(E))$
    • Alternating optimization instead, for efficiency (see the sketch below)
      • Optimize one model while keeping the parameters of the other fixed
    • Note: MERT-like parameter updates only work with a relatively small number of parameters (~20)
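As a toy illustration of the alternating idea (a coordinate-wise greedy stand-in, not the MERT-style line search actually used), the following sketch tunes one direction's weights at a time while the other direction stays fixed; `loss_for_weights` stands for the unsupervised objective above.

```python
def alternating_tuning(loss_for_weights, w_ef, w_fe, rounds=10, step=0.1):
    """Toy stand-in for the paper's optimization: greedily perturb the log-linear
    weights of one translation direction while the other is frozen, and keep a
    change only if the unsupervised loss improves."""
    best = loss_for_weights(w_ef, w_fe)
    for r in range(rounds):
        target = w_ef if r % 2 == 0 else w_fe   # alternate between the two directions
        for name in target:
            for delta in (step, -step):
                target[name] += delta
                trial = loss_for_weights(w_ef, w_fe)
                if trial < best:
                    best = trial                # keep the improvement
                else:
                    target[name] -= delta       # revert the perturbation
    return w_ef, w_fe, best
```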
  12. 4. Joint Refinement
    • Use two synthetic parallel corpora (10M sentences for efficiency)
    • The grammatical (authentic) side should be used where it matters:
      • $(E, T_{E \to F}(E))$: phrase table $p_\phi(\bar{e} \mid \bar{f})$ for F → E, reordering table $p_r(\bar{f} \mid \bar{e})$ for E → F
      • $(T_{F \to E}(F), F)$: phrase table $p_\phi(\bar{f} \mid \bar{e})$ for E → F, reordering table $p_r(\bar{e} \mid \bar{f})$ for F → E
    • 3 iterations of synthesis, table induction, and tuning (see the sketch below)
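A sketch of the refinement loop, starting from the systems induced on slides 7-9. The helpers `train_smt`, `translate` and `tune` are assumed stand-ins for Moses training, decoding, and the unsupervised tuning above; the table-sharing details of the slide are folded into `train_smt` here.

```python
def joint_refinement(system_ef, system_fe, E_mono, F_mono,
                     train_smt, translate, tune,
                     iterations=3, n_synth=10_000_000):
    """Iterate back-translation, table induction, and unsupervised tuning."""
    E = E_mono[:n_synth]
    F = F_mono[:n_synth]
    for _ in range(iterations):
        # Back-translate each monolingual corpus with the current systems
        synth_F = translate(system_ef, E)   # synthetic F paired with authentic E
        synth_E = translate(system_fe, F)   # synthetic E paired with authentic F
        # Re-train both directions so the authentic (grammatical) side is used
        system_fe = train_smt(src=synth_F, tgt=E)   # F -> E from (synthetic F, real E)
        system_ef = train_smt(src=synth_E, tgt=F)   # E -> F from (synthetic E, real F)
        # Re-run the unsupervised tuning of the log-linear weights
        system_ef, system_fe = tune(system_ef, system_fe, E_mono, F_mono)
    return system_ef, system_fe
```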
  13. Footnote 8
    • "For the last iteration, we do not perform any tuning and use default Moses weights instead, which we found to be more robust during development. Note, however, that using unsupervised tuning during the previous steps was still strongly beneficial."
  14. Hybridization with NMT
    • Use the unsupervised SMT system to initialize dual NMT
      • $(E, T^{SMT}_{E \to F}(E))$ for NMT (F → E)
      • $(T^{SMT}_{F \to E}(F), F)$ for NMT (E → F)
    • Progressively switch to an NMT-based synthetic corpus (see the sketch below):
      $N_{smt} = N \times \max(0, 1 - k / a)$
    • Mixing greedy search and sampling, inspired by Edunov et al. (2018)
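A tiny sketch of the linear decay schedule above; the concrete values of N and a passed in the example are placeholders (the next slide lists the hyperparameters reported in the talk).

```python
def smt_backtranslations(k, N, a):
    """Number of SMT back-translated sentences to mix in at NMT iteration k,
    following N_smt = N * max(0, 1 - k/a); the rest comes from NMT back-translation."""
    return int(N * max(0.0, 1.0 - k / a))

# Example with placeholder values: the SMT share decays linearly and reaches zero at k = a.
for k in (0, 10, 20, 30, 40):
    print(k, smt_backtranslations(k, N=1_000_000, a=30))
```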
  15. Final NMT System
    • Ensemble of all checkpoints saved every 10 iterations
    • Beam search decoding
    • Hyperparameters: N = 1M, a = 30
    • 60 iterations
  16. Experiments
    • Training: WMT 2014 Fr-En / De-En
      • Monolingual News Crawl (2007-2013)
      • #tokens: 749M (Fr), 1,606M (De), 2,109M (En)
      • 2,000 sentences chosen for tuning
      • Moses-based normalization, tokenization, truecasing
    • Test: in-house (Fr-En), newstest2014/2016 (De-En)
    • SMT: Moses, KenLM (5-gram), Z-MERT, FastAlign
    • NMT: fairseq-based Transformer-Big
  17. Evaluation Metric
    • Case-sensitive BLEU
      • multi-bleu.perl with pre-tokenization
      • SacreBLEU with detokenization (usage example below)
        • BLEU+case.mixed+lang.LANG+numrefs.1+smooth.exp+test.TEST+tok.13a+version.1.2.11
        • LANG = {fr-en, en-fr, de-en, en-de}
        • TEST = {wmt14/full, wmt16}
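For reference, a minimal example of computing detokenized, case-sensitive BLEU with the sacrebleu Python package (the signature string above comes from its CLI; the sentences below are made-up placeholders).

```python
import sacrebleu

# Hypotheses and references must be detokenized strings; these are toy examples.
hyps = ["The cat sat on the mat."]
refs = ["The cat is sitting on the mat."]

# Defaults match the signature above: mixed case, 13a tokenization, exp smoothing.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)
```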
  18. Results (unsupervised NMT vs. SMT)
    • SMT's modular architecture is suitable at first?
  19. Conclusions
    • Extends their previous work with:
      • Subword information (only applicable to language pairs with the same script)
      • Principled unsupervised tuning
      • Joint bidirectional refinement
      • SMT-initialized NMT with iterative training
    • Performed best on the WMT De-En newstest sets, even better than the supervised SMT-based winner of WMT 2014
    • Code available: https://github.com/artetxem/monoses