
hayakawa_N18-1032.pdf

Presentation slides from a NAACL reading group held in our laboratory.


onizuka laboratory

July 25, 2018


Transcript

  1. Universal Neural Machine Translation for Extremely Low Resource Languages. Jiatao Gu, Hany Hassan, Jacob Devlin, Victor O.K. Li. Presented by Takeshi Hayakawa (Osaka University) at a journal club for NAACL 2018.

  2. High-resource languages

    Language use on the Internet (share of web content): English 52.7%, German 6.4%, Russian 6.1%, Spanish 5.1%, French 4.2%, Japanese 3.8%, Portuguese 2.9%, Italian 2.5%, Persian 1.9%, Chinese 1.8%, Polish 1.8%. https://w3techs.com/technologies/overview/content_language/

  3. Low-resource languages (each <= 0.1% of web content)

    Lithuanian, Bosnian, Northern Sami, Sinhalese, Lao, Croatian, Malay, Uzbek, Tuvalu, Amharic, Norwegian, Icelandic, Armenian, Burmese, Abkhazian, Catalan, Basque, Mongolian, Breton, Marathi, Serbian, Macedonian, Urdu, Filipino, Letzeburgesch, Slovenian, Georgian, Kanuri, Nepali, Chamorro, Latvian, Albanian, Norwegian Nynorsk, Khmer, Malayalam, Estonian, Bengali, Tamil, Faroese, Irish, Hindi, Galician, Belarusian, Swahili, Twi, Azerbaijani, Kazakh, Afrikaans, Pushto, Kurdish. https://w3techs.com/technologies/overview/content_language/

  4. Domains in industry

    http://h-bank.nict.go.jp/about.html

  5. Transfer learning

    A model of the mapping P(Y|X) learned on a high-resource pair can transfer to a related low-resource pair. Example English-French pairs: I / Je, Please / S'il vous plaît, Thanks in advance / Merci d'avance, Oh yes / Oui, oui, understood / Je comprends, Good night / Bonne nuit, Delicious / délicieux, Brother / frère, You / Vous, Good day / Bonne journée, Thank you / Merci, lovers / amoureux.

  6. Multi-source NMT

    • Multi-source encoder-decoder model: Multi-Source Neural Translation (Zoph, 2016)
    • Multilingual zero-shot translation: Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation (Johnson, 2017)
    • Training a model on many-to-one translation: Fully Character-Level Neural Machine Translation without Explicit Segmentation (Lee, 2017)
    • Mixture-of-experts model: Ensemble Learning for Multi-Source Neural Machine Translation (Garmash, 2016)

  7. Solutions for low-resource languages

    • Challenge
      – A low-resource language does not have enough training examples to train a reliable model with its own representation
    • Lexical-level sharing
      – Share semantic representations across all languages
    • Sentence-level sharing
      – Share syntactic order with other languages, using monolingual representations

  8. Universal Lexical Representation (ULR)

    • Lexicon mapping to the universal token space (see the sketch after this slide)
      – Define universal tokens onto which all source languages are projected
      – Represent each source word as a mixture of universal tokens
      – Train the projection matrices, which embed each language in a similar semantic space, using a seed (off-the-shelf) dictionary
      – Interpolate the embeddings of the low-resource language using function words, which appear frequently

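To make the mixture concrete, here is a minimal NumPy sketch of a ULR lookup under my own assumptions: a word's monolingual embedding is projected into the universal space by a per-language matrix, scored against the universal token keys, and the resulting softmax weights interpolate the universal value embeddings. The names, shapes, and temperature are illustrative, not the authors' code.

```python
import numpy as np

def sharpened_softmax(scores, temperature=0.05):
    # A low temperature keeps the mixture peaked on a few universal tokens.
    z = scores / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ulr_embedding(query_vec, A, universal_keys, universal_values):
    """Represent one source word as a mixture of universal tokens.

    query_vec        -- monolingual embedding of the source word, shape (d,)
    A                -- per-language projection into the universal space, (d, d)
    universal_keys   -- key embeddings of the universal tokens, (M, d)
    universal_values -- value embeddings fed to the NMT encoder, (M, d)
    """
    scores = universal_keys @ (A @ query_vec)   # similarity in universal space
    weights = sharpened_softmax(scores)         # mixture weights over tokens
    return weights @ universal_values           # interpolated ULR embedding
```
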
  9. Mixture of Language Experts (MoLE)

    • A language-sensitive module that models language-specific structure
    • Experts for the individual languages are mixed under the control of a gate, which outputs a weight for each language (a sketch follows)

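A rough PyTorch sketch of that gating, assuming feed-forward experts applied to shared encoder states; the module name, the form of the experts, and the sizes are my illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class MoLE(nn.Module):
    """Mixture of Language Experts: one small expert per high-resource
    language, combined token-wise by a learned gate (illustrative sketch)."""

    def __init__(self, dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, h):
        # h: (batch, seq, dim) hidden states from the shared encoder
        g = torch.softmax(self.gate(h), dim=-1)            # (batch, seq, E)
        outs = torch.stack([torch.tanh(e(h)) for e in self.experts], dim=-1)
        return (outs * g.unsqueeze(-2)).sum(dim=-1)        # gated combination
```
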
  10. Experimental settings

    • Languages
      – Low-resource: Romanian (Ro) / Latvian (Lv) / Korean (Ko)
      – High-resource: Czech (Cs), German (De), Greek (El), Spanish (Es), Finnish (Fi), French (Fr), Italian (It), Portuguese (Pt), and Russian (Ru)
      – Target: English (En)
    • Data
      – Parallel: WMT16, KPD, Europarl v8 + back translation (BT; see the sketch after this slide)
      – Monolingual: Wikipedia dumps

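Back translation here means augmenting the parallel data with synthetic pairs produced by a reverse-direction model. A toolkit-agnostic sketch, where train and translate are hypothetical stand-ins for whatever NMT toolkit is used:

```python
def augment_with_back_translation(parallel, target_monolingual, train, translate):
    """Add synthetic pairs: translate target-language monolingual text back
    into the source language with a reverse-direction (En -> X) model.

    parallel           -- list of (src_sentence, tgt_sentence) pairs
    target_monolingual -- list of target-language sentences
    train, translate   -- hypothetical toolkit hooks, not a real API
    """
    reverse_model = train([(tgt, src) for src, tgt in parallel])
    synthetic = [(translate(reverse_model, tgt), tgt)
                 for tgt in target_monolingual]
    return parallel + synthetic
```
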
  11. Experimental settings (cont.)

    • Architecture (a rough sketch follows)
      – One-layer bidirectional RNN encoder
      – Two-layer attention-based RNN decoder
      – 512 LSTM units
    • Ablation study
      – Reference: vanilla and conventional multilingual NMT
      – Ablated components: ULR, MoLE, and BT

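Assuming standard PyTorch modules, the configuration on this slide might look roughly as follows; the attention variant, input feeding, and layer wiring are my guesses where the slide is silent.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Sketch of the reported setup: one-layer bidirectional LSTM encoder,
    two-layer attention-based LSTM decoder, 512 units (my reconstruction,
    not the authors' code)."""

    def __init__(self, vocab, dim=512):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(dim + 2 * dim, dim, num_layers=2, batch_first=True)
        self.attn = nn.Linear(dim, 2 * dim)   # project decoder state to key space
        self.out = nn.Linear(dim + 2 * dim, vocab)

    def forward(self, src, tgt):
        enc, _ = self.encoder(self.emb(src))                   # (B, S, 2*dim)
        dec_in = self.emb(tgt)                                 # (B, T, dim)
        context = enc.new_zeros(enc.size(0), enc.size(-1))     # (B, 2*dim)
        state, logits = None, []
        for t in range(dec_in.size(1)):                        # input feeding
            x = torch.cat([dec_in[:, t], context], -1).unsqueeze(1)
            h, state = self.decoder(x, state)                  # (B, 1, dim)
            score = torch.bmm(enc, self.attn(h).transpose(1, 2))  # (B, S, 1)
            context = (torch.softmax(score, 1) * enc).sum(1)   # (B, 2*dim)
            logits.append(self.out(torch.cat([h.squeeze(1), context], -1)))
        return torch.stack(logits, 1)                          # (B, T, vocab)
```
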
  12. Results (cont.)

    • Magnitude of influence: Es ≈ Pt > Fr ≈ It > Cs ≈ El > De > Fi
    • This ordering reflects the grammatical relatedness of these languages to Ro (low-resource)

  13. Conclusion

    • The authors proposed a new universal machine translation approach that enables sharing resources between high-resource languages and extremely low-resource languages
    • It achieved 23 BLEU on Romanian-English WMT2016 with a parallel corpus of only 6k sentences, compared with 18 BLEU for a strong multilingual baseline system

  14. Future challenges

    • Scores vary widely across languages: Ro 20.51, Lv 13.16, Ko 6.14
    • Extension to other NMT models and algorithms
    • A gap of about 6 BLEU remains relative to rich-resource NMT