Character-Level Machine Translation with Bi-scale RNNs by Joe Bullard - TMLS #4

Machine translation systems typically rely on pre-segmented words as inputs. However, word segmentation is not always a trivial task (e.g. for Japanese), and some arguments made in its favor may be less relevant for modern Neural Machine Translation (NMT). A recent paper presented interesting results for *character-level* translation using a novel "Bi-scale RNN" decoder, which processes inputs at two time-scales. This presentation covers the basics of NMT, explains the structure of the Bi-scale RNN, and discusses some language-specific considerations for Japanese translation.

Transcript

  1. Overview
     1. Words vs. Characters for Machine Translation
     2. Review of Statistical and Neural Machine Translation
     3. Bi-scale RNN: Decoding at Two Time Scales
     4. Mini-Experiment: English-Japanese Translation

  2. Words > Characters?
     • Words (and morphemes) have meaning - intuitively useful
     • Subwords* (character n-grams in a vocabulary) have also been used successfully
     • Character sequences are longer (a toy length comparison follows this slide), creating:
       ◦ Sparsity - bad for count-based probabilities in Statistical MT
       ◦ Potential long-term dependencies - bad for RNNs in Neural MT (vanishing gradient)
     • Thus machine translation typically uses words, subwords, or phrases
     * Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

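To make the length issue concrete, here is a toy comparison (not from the talk) of how many tokens one English sentence becomes under word, subword, and character segmentation. The subword split is hand-picked for illustration rather than produced by a trained BPE model as in Sennrich et al. (2015).

```python
# Toy illustration of how sequence length grows as the modeling unit shrinks.
# The subword segmentation below is hand-picked, not learned from data.
sentence = "the runners were running quickly"

word_tokens = sentence.split()
subword_tokens = ["the", "runn@@", "ers", "were", "runn@@", "ing", "quick@@", "ly"]
char_tokens = list(sentence.replace(" ", "_"))  # "_" marks word boundaries

for name, tokens in [("words", word_tokens),
                     ("subwords", subword_tokens),
                     ("characters", char_tokens)]:
    print(f"{name:10s} {len(tokens):2d} tokens: {tokens}")
```
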
  3. Weakness of Words
     Word segmentation
     • Requires a separately developed, less-than-perfect system
     • Not always trivial - some languages do not use spaces (e.g. Japanese: スペースを使わない言語がある)
     Morphological variations
     • e.g. ran, running, runs are each "words" but share most of their meaning
     • Rare variations are not learned well, or may not appear in training at all
     Out-of-vocabulary (OOV) words
     • Mapped to a special token - ignoring the word's meaning during translation (a toy example follows this slide)

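A minimal sketch (not from the talk) of the OOV point: with a fixed word vocabulary, any unseen word collapses to one special token, so its meaning is unavailable at translation time.

```python
# Everything outside the (tiny, made-up) vocabulary becomes "<unk>".
vocab = {"the", "cat", "sat", "on", "a", "mat"}

def map_oov(tokens, vocab):
    return [t if t in vocab else "<unk>" for t in tokens]

print(map_oov("the kitten sat on a doormat".split(), vocab))
# ['the', '<unk>', 'sat', 'on', 'a', '<unk>']
```
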
  4. Strength of Characters
     • Morphological variants often share character sequences
       ◦ e.g. manger, mange, manges, mangeons, mangent (French verb)
       ◦ e.g. sköldpadda, sköldpaddan, sköldpaddor, sköldpaddorna (Swedish noun)
       ◦ e.g. 食べる 食べます 食べた 食べました (Japanese verb)
     • In such cases, a sort of fractional meaning might be learnable, as was shown with subwords* (roughly character n-grams)
     • Rare and out-of-vocabulary variants may be treated like their related words
     * Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

  5. Character-level Translation Challenges
     1. Learn a highly nonlinear mapping from the spelling of a sentence to its meaning
     2. Generate long and coherent character sequences
     Scope
     • Describe a model which addresses these challenges on the target side only
     • Examine and compare sample character-level machine translations

  6. Statistical MT (SMT)
     • Estimate conditional probabilities of pairs of source and target phrases
     • Often consists of separate components optimized independently (see the factorization after this slide)
     • Not tailored to specific languages, but...
     • May not generalize well to significantly different languages (e.g. En-Ja)
     Image: https://nlp.fi.muni.cz/web3/en/MachineTranslation

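For context, the "separate components" of classical SMT usually refers to the noisy-channel factorization below (standard textbook form, not shown on the slide), where the translation model and the language model are estimated independently and only combined at decoding time.

```latex
% Noisy-channel decomposition used in classical SMT (standard form, added for context)
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}}
          \, \underbrace{P(e)}_{\text{language model}}
```
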
  7. Neural MT (NMT)
     • One network, tuned end-to-end to maximize translation performance
     Encode:
     • source language ⇒ vector
     Decode:
     • vector ⇒ target language
     Image: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

  8. Components of NMT
     Encoder
     • Inputs: 1-of-k vector for word, char, ...
     • Embedding Layer: map inputs to a continuous space
     • Encoder RNN: bidirectional RNN to generate variable-length summary vectors
     Decoder
     • Attention Mechanism: allow the decoder to focus on certain parts of the source sentence when translating
     • Decoder RNN: generate a probability distribution over the output vocabulary
     • Word Sampling: choose the current output, feed it to the next step

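As a rough companion to the component list above, here is a minimal PyTorch sketch (my own, not the speaker's code or the paper's exact architecture) wiring those components together: embedded 1-of-k inputs, a bidirectional GRU encoder, a simple attention mechanism, a GRU decoder, and a softmax over the output vocabulary. Sizes and the attention form are arbitrary choices for illustration.

```python
# Minimal encoder-decoder with attention; layer sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
        super().__init__()
        # Embedding layer: map 1-of-k ids (words or characters) to a continuous space
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        # Encoder RNN: bidirectional GRU produces a variable-length set of summary vectors
        self.encoder = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
        # Decoder RNN: conditioned on the previous output symbol and the attention context
        self.decoder = nn.GRUCell(emb + 2 * hid, hid)
        # Attention: score each encoder state against the current decoder state
        self.att = nn.Linear(hid + 2 * hid, 1)
        # Output layer: probability distribution over the target vocabulary (via softmax on logits)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        enc_states, _ = self.encoder(self.src_emb(src_ids))        # (B, S, 2*hid)
        B, S, _ = enc_states.shape
        h = enc_states.new_zeros(B, self.decoder.hidden_size)      # decoder state
        logits = []
        for t in range(tgt_ids.size(1)):
            # Attention mechanism: focus on relevant source positions
            scores = self.att(torch.cat(
                [enc_states, h.unsqueeze(1).expand(B, S, -1)], dim=-1)).squeeze(-1)
            context = (F.softmax(scores, dim=-1).unsqueeze(-1) * enc_states).sum(1)
            # Decoder step: previous target symbol (teacher forcing here) + context
            h = self.decoder(torch.cat([self.tgt_emb(tgt_ids[:, t]), context], dim=-1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                          # (B, T, tgt_vocab)

# Smoke test with random ids
model = TinyNMT(src_vocab=100, tgt_vocab=90)
src = torch.randint(0, 100, (2, 7))
tgt = torch.randint(0, 90, (2, 5))
print(model(src, tgt).shape)  # torch.Size([2, 5, 90])
```

The "word sampling" component corresponds to replacing the teacher-forced `tgt_ids[:, t]` above with the previously sampled (or argmax) output at inference time.
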
  14. Training
      • Trained end-to-end
      • Maximize the log-likelihood with stochastic gradient descent (SGD)
      • At decoding time, the most likely target sequence is selected through beam search (see the sketch after this slide)
      Image (edited): https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

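Beam search itself is independent of the network. The sketch below (my own, with a hypothetical `step_fn` standing in for a decoder forward pass that returns log-probabilities for the next symbol) shows the pruning idea: keep only the `beam_size` best partial hypotheses at each step and return the highest-scoring finished one.

```python
# Compact beam search over per-step log-probabilities.
# Note: no length normalization, which real systems often add.
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    beams = [([bos], 0.0)]                # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:         # hypothesis is complete
                finished.append((prefix, score))
                continue
            for tok, logp in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        if not candidates:                # every hypothesis has finished
            beams = []
            break
        # keep only the beam_size highest-scoring partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)                # unfinished leftovers, if any
    return max(finished, key=lambda c: c[1])

# Toy step function: prefers "a", then ends the sequence after a few steps.
def toy_step(prefix):
    if len(prefix) < 3:
        return {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}
    return {"<eos>": 0.0}

print(beam_search(toy_step, bos="<bos>", eos="<eos>"))
```
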
  15. Bi-scale RNN Decoder
      Intuition
      • Written words are composed of characters
      • Can one network process characters "faster" and words "slower"?
      Model
      • Not literally time, but frequency of updates
      • The faster layer (FL) will update its state like a typical RNN
      • The slower layer (SL) will only update when the faster layer resets
      • SL will retain its state until FL finishes processing the current chunk (presumably the characters of a word), making it "slower"

  16. Bi-scale RNN Decoder
      • h1: the "faster" layer (FL)
      • h2: the "slower" layer (SL)
      • Both layers have the same size
      • Each has an associated gating unit (g1, g2) and gated activations
      • The slower layer uses the faster layer's gate to determine its own activation

  17. Faster Layer (FL)
      (Equations shown on the slide: FL normal output, FL gated output, FL activation, SL activation.)
      Summary:
      • The FL output is an adaptive combination of the previous FL and SL activations
      • When the FL "resets" (g1 ≈ 1), the SL will have greater influence in the next step

  18. Slower Layer (SL) is Controlled by Faster Layer (FL)
      (Equations shown on the slide: SL normal output, SL candidate activation, SL gated output, SL reset activation.)
      Summary:
      • The slower layer only updates its activation when the faster layer resets
      • The reset activation is similar to that of the FL on the previous slide
      • When the FL gets rid of something, the SL takes it

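To make the gating story concrete, here is a simplified NumPy sketch of the two-time-scale update described above. It follows the prose (the faster layer updates at every step; the slower layer is overwritten only to the extent that the faster layer's gate fires) rather than the exact equations of arXiv:1603.06147, and all weights, sizes, and the gate form are arbitrary.

```python
# Simplified bi-scale step: FL updates every step, SL only when the FL gate fires.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # hidden size of each layer (arbitrary)

def rand_weights(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_out, n_in))

W_h1 = rand_weights(3 * d, d)            # FL sees: input, previous FL, previous SL
W_g1 = rand_weights(3 * d, d)            # FL reset gate
W_h2 = rand_weights(2 * d, d)            # SL sees: gated FL output, previous SL

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bi_scale_step(x, h1, h2):
    z = np.concatenate([x, h1, h2])
    g1 = sigmoid(W_g1 @ z)               # ~1 when the FL "resets" (e.g. end of a chunk)
    h1_new = np.tanh(W_h1 @ z)           # faster layer: updated at every character
    # Slower layer: overwritten only to the extent that the FL resets;
    # otherwise its previous state is carried over unchanged.
    h2_cand = np.tanh(W_h2 @ np.concatenate([g1 * h1_new, h2]))
    h2_new = (1.0 - g1) * h2 + g1 * h2_cand
    return h1_new, h2_new

h1, h2 = np.zeros(d), np.zeros(d)
for _ in range(5):                       # feed a few random "character" embeddings
    h1, h2 = bi_scale_step(rng.normal(size=d), h1, h2)
print(h1.shape, h2.shape)                # (8,) (8,)
```
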
  19. Paper Experiments (summary)
      Training
      • Trained with stochastic gradient descent (SGD) using Adam
      • English to Czech, German, Finnish, and Russian; data from Euro-Parl
      • Source inputs are subwords* (character n-grams in a vocabulary)
      Results
      • Character-level decoding generally outperformed subword decoding
      • The Bi-scale decoder was not always better than a GRU decoder
      * Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

  20. Considerations for Japanese
      • Most research uses European languages: alphabet ⇒ alphabet
      • Sometimes different morphology (e.g. Fi-En) or scripts (e.g. Ru-En)
      • The Bi-scale paper performed well in such cases
      • Japanese uses Kanji, two syllabaries, and sometimes an alphabet
      • Large number of characters compared to alphabetic scripts
      • Words often consist of only a few characters
      Does the Bi-scale decoder make sense for Japanese?

  21. Mini-Experiment
      Dataset
      • ~200,000 human-translated sentence pairs collected from Tatoeba.org
      • Slightly modified preprocessing (higher character limit, no punctuation removal)
      Model
      • Smaller network size compared to the paper (because of the small dataset and limited time)
      • A more realistic model is still training right now
      Results
      • Under these conditions, severe overfitting, but the outputs are still interesting

  22. Observations - Alternative wordings
      Source: The baby is sleeping on the bed .
      Truth : 赤ちゃんはベッドで寝ています 。
      Output: 赤ん坊 はベッドに寝ていた 。

      Source: You seem to have gained some weight .
      Truth : 少し体重が増えたようですね 。
      Output: 君は少し体重が増えたようだ 。

      Source: You only imagine you’ve heard it .
      Truth : それは君の想像だ 。
      Output: それは___想像であることだけだ 。

  23. References
      Papers:
      • Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. (2016). A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. arXiv:1603.06147
      • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473v7
      • Rico Sennrich, Barry Haddow, and Alexandra Birch. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909
      NMT links (some images were used here, with permission):
      • http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
      • https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2
      • https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3