Character-Level Machine Translation with Bi-scale RNNs by Joe Bullard - TMLS #4

Machine translation systems typically rely on pre-segmented words as inputs. However, word segmentation is not always a trivial task (e.g. for Japanese), and some arguments made in its favor may be less relevant for modern Neural Machine Translation (NMT). A recent paper presented interesting results for *character-level* translation using a novel "Bi-scale RNN" decoder, which processes inputs at two time-scales. This presentation covers the basics of NMT, explains the structure of the Bi-scale RNN, and discusses some language-specific considerations for Japanese translation.

Transcript

  1. Overview
     1. Words vs. Characters for Machine Translation
     2. Review of Statistical and Neural Machine Translation
     3. Bi-scale RNN: Decoding at Two Time Scales
     4. Mini-Experiment: English-Japanese Translation

  2. Words > Characters?
     • Words (and morphemes) have meaning - intuitively useful
     • Subwords* (character n-grams in a vocabulary) have also been used successfully
     • Character sequences are longer (a toy length comparison follows this slide), creating:
       ◦ Sparsity - bad for count-based probabilities in Statistical MT
       ◦ Potential long-term dependencies - bad for RNNs in Neural MT (vanishing gradient)
     • Thus machine translation typically uses words, subwords, or phrases
     * Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

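To make the length issue concrete, here is a toy comparison (not from the talk) of how many tokens one English sentence becomes under word, subword, and character segmentation. The subword split is hand-picked for illustration rather than produced by a trained BPE model as in Sennrich et al. (2015).

```python
# Toy illustration of how sequence length grows as the modeling unit shrinks.
# The subword segmentation below is hand-picked, not learned from data.
sentence = "the runners were running quickly"

word_tokens = sentence.split()
subword_tokens = ["the", "runn@@", "ers", "were", "runn@@", "ing", "quick@@", "ly"]
char_tokens = list(sentence.replace(" ", "_"))  # "_" marks word boundaries

for name, tokens in [("words", word_tokens),
                     ("subwords", subword_tokens),
                     ("characters", char_tokens)]:
    print(f"{name:10s} {len(tokens):2d} tokens: {tokens}")
```
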
  3. Weakness of Words
     Word segmentation
     • Requires a separately developed, less-than-perfect system
     • Not always trivial - some languages do not use spaces (e.g. Japanese: スペースを使わない言語がある)
     Morphological variations
     • e.g. ran, running, runs are each "words" but share most of their meaning
     • Rare variations are not learned well, or may not appear in training at all
     Out-of-vocabulary (OOV) words
     • Mapped to a special token - ignoring the word's meaning during translation (a toy example follows this slide)

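A minimal sketch (not from the talk) of the OOV point: with a fixed word vocabulary, any unseen word collapses to one special token, so its meaning is unavailable at translation time.

```python
# Everything outside the (tiny, made-up) vocabulary becomes "<unk>".
vocab = {"the", "cat", "sat", "on", "a", "mat"}

def map_oov(tokens, vocab):
    return [t if t in vocab else "<unk>" for t in tokens]

print(map_oov("the kitten sat on a doormat".split(), vocab))
# ['the', '<unk>', 'sat', 'on', 'a', '<unk>']
```
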
  4. Strength of Characters
     • Morphological variants often share character sequences
       ◦ e.g. manger, mange, manges, mangeons, mangent (French verb)
       ◦ e.g. sköldpadda, sköldpaddan, sköldpaddor, sköldpaddorna (Swedish noun)
       ◦ e.g. 食べる 食べます 食べた 食べました (Japanese verb)
     • In such cases, a sort of fractional meaning might be learnable, as was shown with subwords* (roughly character n-grams)
     • Rare and out-of-vocabulary variants may be treated like their related words
     * Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

  5. Character-level Translation Challenges
     1. Learn a highly nonlinear mapping from the spelling of a sentence to its meaning
     2. Generate long and coherent character sequences
     Scope
     • Describe a model which addresses these challenges on the target side only
     • Examine and compare sample character-level machine translations

  6. Statistical MT (SMT)
     • Estimate conditional probabilities of pairs of source and target phrases
     • Often consists of separate components optimized independently (see the factorization after this slide)
     • Not tailored to specific languages, but...
     • May not generalize well to significantly different languages (e.g. En-Ja)
     Image: https://nlp.fi.muni.cz/web3/en/MachineTranslation

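For context, the "separate components" of classical SMT usually refers to the noisy-channel factorization below (standard textbook form, not shown on the slide), where the translation model and the language model are estimated independently and only combined at decoding time.

```latex
% Noisy-channel decomposition used in classical SMT (standard form, added for context)
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model}}
          \, \underbrace{P(e)}_{\text{language model}}
```
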
  7. Neural MT (NMT)
     • One network, tuned end-to-end to maximize translation performance
     Encode:
     • source language ⇒ vector
     Decode:
     • vector ⇒ target language
     Image: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

  8. Components of NMT
     Encoder
     • Inputs: 1-of-k vector for word, char, ...
     • Embedding Layer: map inputs to a continuous space
     • Encoder RNN: bidirectional RNN to generate variable-length summary vectors
     Decoder
     • Attention Mechanism: allow the decoder to focus on certain parts of the source sentence when translating
     • Decoder RNN: generate a probability distribution over the output vocabulary
     • Word Sampling: choose the current output, feed it to the next step

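As a rough companion to the component list above, here is a minimal PyTorch sketch (my own, not the speaker's code or the paper's exact architecture) wiring those components together: embedded 1-of-k inputs, a bidirectional GRU encoder, a simple attention mechanism, a GRU decoder, and a softmax over the output vocabulary. Sizes and the attention form are arbitrary choices for illustration.

```python
# Minimal encoder-decoder with attention; layer sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNMT(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hid=128):
        super().__init__()
        # Embedding layer: map 1-of-k ids (words or characters) to a continuous space
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        # Encoder RNN: bidirectional GRU produces a variable-length set of summary vectors
        self.encoder = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
        # Decoder RNN: conditioned on the previous output symbol and the attention context
        self.decoder = nn.GRUCell(emb + 2 * hid, hid)
        # Attention: score each encoder state against the current decoder state
        self.att = nn.Linear(hid + 2 * hid, 1)
        # Output layer: probability distribution over the target vocabulary (via softmax on logits)
        self.out = nn.Linear(hid, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        enc_states, _ = self.encoder(self.src_emb(src_ids))        # (B, S, 2*hid)
        B, S, _ = enc_states.shape
        h = enc_states.new_zeros(B, self.decoder.hidden_size)      # decoder state
        logits = []
        for t in range(tgt_ids.size(1)):
            # Attention mechanism: focus on relevant source positions
            scores = self.att(torch.cat(
                [enc_states, h.unsqueeze(1).expand(B, S, -1)], dim=-1)).squeeze(-1)
            context = (F.softmax(scores, dim=-1).unsqueeze(-1) * enc_states).sum(1)
            # Decoder step: previous target symbol (teacher forcing here) + context
            h = self.decoder(torch.cat([self.tgt_emb(tgt_ids[:, t]), context], dim=-1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                          # (B, T, tgt_vocab)

# Smoke test with random ids
model = TinyNMT(src_vocab=100, tgt_vocab=90)
src = torch.randint(0, 100, (2, 7))
tgt = torch.randint(0, 90, (2, 5))
print(model(src, tgt).shape)  # torch.Size([2, 5, 90])
```

The "word sampling" component corresponds to replacing the teacher-forced `tgt_ids[:, t]` above with the previously sampled (or argmax) output at inference time.
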
  14. Training
      • Trained end-to-end
      • Maximize the log-likelihood with stochastic gradient descent (SGD)
      • At decoding time, the most likely target sequence is selected through beam search (see the sketch after this slide)
      Image (edited): https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

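Beam search itself is independent of the network. The sketch below (my own, with a hypothetical `step_fn` standing in for a decoder forward pass that returns log-probabilities for the next symbol) shows the pruning idea: keep only the `beam_size` best partial hypotheses at each step and return the highest-scoring finished one.

```python
# Compact beam search over per-step log-probabilities.
# Note: no length normalization, which real systems often add.
import math

def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    beams = [([bos], 0.0)]                # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:         # hypothesis is complete
                finished.append((prefix, score))
                continue
            for tok, logp in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        if not candidates:                # every hypothesis has finished
            beams = []
            break
        # keep only the beam_size highest-scoring partial hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)                # unfinished leftovers, if any
    return max(finished, key=lambda c: c[1])

# Toy step function: prefers "a", then ends the sequence after a few steps.
def toy_step(prefix):
    if len(prefix) < 3:
        return {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}
    return {"<eos>": 0.0}

print(beam_search(toy_step, bos="<bos>", eos="<eos>"))
```
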
  15. Bi-scale RNN Decoder
      Intuition
      • Written words are composed of characters
      • Can one network process characters "faster" and words "slower"?
      Model
      • Not literally time, but frequency of updates
      • The faster layer (FL) will update its state like a typical RNN
      • The slower layer (SL) will only update when the faster layer resets
      • SL will retain its state until FL finishes processing the current chunk (presumably the characters of a word), making it "slower"

  16. Bi-scale RNN Decoder
      • h1: the "faster" layer (FL)
      • h2: the "slower" layer (SL)
      • Both layers have the same size
      • Each has an associated gating unit (g1, g2) and gated activations
      • The slower layer uses the faster layer's gate to determine its own activation

  17. Faster Layer (FL)
      (Equations shown on the slide: FL normal output, FL gated output, FL activation, SL activation.)
      Summary:
      • The FL output is an adaptive combination of the previous FL and SL activations
      • When the FL "resets" (g1 ≈ 1), the SL will have greater influence in the next step

  18. Slower Layer (SL) is Controlled by Faster Layer (FL)
      (Equations shown on the slide: SL normal output, SL candidate activation, SL gated output, SL reset activation.)
      Summary:
      • The slower layer only updates its activation when the faster layer resets
      • The reset activation is similar to that of the FL on the previous slide
      • When the FL gets rid of something, the SL takes it

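To make the gating story concrete, here is a simplified NumPy sketch of the two-time-scale update described above. It follows the prose (the faster layer updates at every step; the slower layer is overwritten only to the extent that the faster layer's gate fires) rather than the exact equations of arXiv:1603.06147, and all weights, sizes, and the gate form are arbitrary.

```python
# Simplified bi-scale step: FL updates every step, SL only when the FL gate fires.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # hidden size of each layer (arbitrary)

def rand_weights(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_out, n_in))

W_h1 = rand_weights(3 * d, d)            # FL sees: input, previous FL, previous SL
W_g1 = rand_weights(3 * d, d)            # FL reset gate
W_h2 = rand_weights(2 * d, d)            # SL sees: gated FL output, previous SL

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bi_scale_step(x, h1, h2):
    z = np.concatenate([x, h1, h2])
    g1 = sigmoid(W_g1 @ z)               # ~1 when the FL "resets" (e.g. end of a chunk)
    h1_new = np.tanh(W_h1 @ z)           # faster layer: updated at every character
    # Slower layer: overwritten only to the extent that the FL resets;
    # otherwise its previous state is carried over unchanged.
    h2_cand = np.tanh(W_h2 @ np.concatenate([g1 * h1_new, h2]))
    h2_new = (1.0 - g1) * h2 + g1 * h2_cand
    return h1_new, h2_new

h1, h2 = np.zeros(d), np.zeros(d)
for _ in range(5):                       # feed a few random "character" embeddings
    h1, h2 = bi_scale_step(rng.normal(size=d), h1, h2)
print(h1.shape, h2.shape)                # (8,) (8,)
```
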
  19. Paper Experiments (summary)
      Training
      • Trained with stochastic gradient descent (SGD) using Adam
      • English to Czech, German, Finnish, and Russian; data from Euro-Parl
      • Source inputs are subwords* (character n-grams in a vocabulary)
      Results
      • Character-level decoding generally outperformed subword decoding
      • The Bi-scale decoder was not always better than a GRU decoder
      * Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

  20. Considerations for Japanese
      • Most research uses European languages: alphabet ⇒ alphabet
      • Sometimes different morphology (e.g. Fi-En) or scripts (e.g. Ru-En)
      • The Bi-scale paper performed well in such cases
      • Japanese uses Kanji, two syllabaries, and sometimes an alphabet
      • Large number of characters compared to alphabetic scripts
      • Words often consist of only a few characters
      Does the Bi-scale decoder make sense for Japanese?

  21. Mini-Experiment
      Dataset
      • ~200,000 human-translated sentence pairs collected from Tatoeba.org
      • Slightly modified preprocessing (higher character limit, no punctuation removal)
      Model
      • Smaller network size compared to the paper (because of the small dataset and limited time)
      • A more realistic model is still training right now
      Results
      • Under these conditions, severe overfitting, but the outputs are still interesting

  22. Observations - Alternative wordings
      Source: The baby is sleeping on the bed .
      Truth : 赤ちゃんはベッドで寝ています 。
      Output: 赤ん坊 はベッドに寝ていた 。

      Source: You seem to have gained some weight .
      Truth : 少し体重が増えたようですね 。
      Output: 君は少し体重が増えたようだ 。

      Source: You only imagine you’ve heard it .
      Truth : それは君の想像だ 。
      Output: それは___想像であることだけだ 。

  23. References
      Papers:
      • Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. (2016). A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. arXiv:1603.06147
      • Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473v7
      • Rico Sennrich, Barry Haddow, and Alexandra Birch. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909
      NMT links (some images were used here, with permission):
      • http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
      • https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2
      • https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3