Slide 1

Character-level Neural Machine Translation with Bi-scale RNN Decoder
Joe Bullard
Reactive TMLS #4

Slide 2

Overview
1. Words vs. Characters for Machine Translation
2. Review of Statistical and Neural Machine Translation
3. Bi-scale RNN: Decoding at Two Time Scales
4. Mini-Experiment: English-Japanese Translation

Slide 3

Words vs. Characters For Machine Translation

Slide 4

Words > Characters?
● Words (and morphemes) have meaning - intuitively useful
● Subwords* (character n-grams in a vocabulary) have also been used successfully
● Character sequences are longer, creating:
  ○ Sparsity - bad for count-based probabilities in Statistical MT
  ○ Potential long-term dependencies - bad for RNNs in Neural MT (vanishing gradients)
● Thus machine translation typically uses words, subwords, or phrases
* Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909
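To make the length difference concrete, here is a tiny illustrative snippet (not from the talk); the sentence is arbitrary, and real subword systems such as BPE learn their segmentations from data rather than using plain character lists.

```python
# Toy comparison of sequence lengths at word vs. character granularity.
sentence = "the turtles are eating"

word_tokens = sentence.split()             # word-level tokens
char_tokens = list(sentence)               # character-level tokens (spaces included)

print(len(word_tokens), word_tokens)       # 4 tokens
print(len(char_tokens), char_tokens[:10])  # 22 tokens: a much longer sequence for the RNN
```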

Slide 5

So words are better? Case closed?

Slide 6

Weakness of Words
Word segmentation
● Requires a separately developed, less-than-perfect system
● Not always trivial - スペースを使わない言語がある ("some languages do not use spaces")
Morphological variations
● e.g. ran, running, runs are each “words” but share most of their meaning
● Rare variations are not learned well, or may not appear in training at all
Out-of-vocabulary (OOV) words
● Mapped to a special token - ignoring the word’s meaning during translation

Slide 7

Yeah, but how can characters help?

Slide 8

Strength of Characters
● Morphological variants often share character sequences
  ○ e.g. manger, mange, manges, mangeons, mangent (French verb "to eat")
  ○ e.g. sköldpadda, sköldpaddan, sköldpaddor, sköldpaddorna (Swedish noun "turtle")
  ○ e.g. 食べる 食べます 食べた 食べました (Japanese verb "to eat")
● In such cases, a sort of fractional meaning might be learnable, as was shown with subwords* (roughly character n-grams)
● Rare and out-of-vocabulary variants may be treated like their related words
* Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

Slide 9

Character-level Translation Challenges
1. Learn a highly nonlinear mapping from spelling ⇒ meaning of a sentence
2. Generate long and coherent character sequences
Scope
● Describe a model which addresses these challenges on the target side only
● Examine and compare sample character-level machine translations

Slide 10

Review: Statistical and Neural Machine Translation

Slide 11

Machine Translation (MT)
Source Language ⇒ Target Language

Slide 12

Statistical MT (SMT)
● Estimate conditional probabilities of pairs of source and target phrases
● Often consists of separate components optimized independently
● Not tailored to specific languages, but...
● May not generalize well to significantly different languages (e.g. En-Ja)
Image: https://nlp.fi.muni.cz/web3/en/MachineTranslation

Slide 13

How is Neural Machine Translation different?

Slide 14

Neural MT (NMT)
● One network, tuned end-to-end to maximize translation performance
Encode:
● source language ⇒ vector
Decode:
● vector ⇒ target language
Image: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

Slide 15

Components of NMT
Encoder
● Inputs: 1-of-k vector for word, char, ...
● Embedding Layer: map inputs to continuous space
● Encoder RNN: bidirectional RNN to generate variable-length summary vectors
Decoder
● Attention Mechanism: allow the decoder to focus on certain parts of the source sentence when translating
● Decoder RNN: generate a probability distribution over the output vocabulary
● Word Sampling: choose the current output, feed it to the next step

Slide 17

Encoder
(Diagram: source sequence → embedding layer → bidirectional RNN)
Image: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
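As a rough sketch of the encoder just described (not the paper's code): an embedding layer feeding a bidirectional GRU that produces one annotation vector per source position. Class names and sizes here are invented for illustration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Embedding layer + bidirectional GRU, as on the slide (sizes are made up)."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, src_ids):
        # src_ids: (batch, src_len) integer token ids (words, subwords, or characters)
        emb = self.embed(src_ids)          # (batch, src_len, emb_dim)
        annotations, _ = self.rnn(emb)     # (batch, src_len, 2 * hidden_dim)
        return annotations                 # one summary vector per source position

# Example: encode a batch of two 7-token source sentences
enc = Encoder(vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))
print(enc(src).shape)  # torch.Size([2, 7, 1024])
```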

Slide 20

Decoder
(Diagram labels: decoder RNN, initial outputs, final outputs)
Image (edited): https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
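A comparable sketch of a single decoder step, assuming a GRU cell and an attention context vector computed elsewhere; all names and dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """One GRU decoder step: previous output symbol + context -> distribution over vocab."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512, ctx_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(emb_dim + ctx_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_token, context, hidden):
        # prev_token: (batch,) id of the previously sampled output symbol
        # context:    (batch, ctx_dim) attention-weighted summary of encoder outputs
        x = torch.cat([self.embed(prev_token), context], dim=-1)
        hidden = self.cell(x, hidden)
        logits = self.out(hidden)          # scores over the output vocabulary
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, 1).squeeze(-1)
        return next_token, hidden          # next_token is fed back in at the next step

# Example usage with made-up start token id 0 and zero states
dec = Decoder(vocab_size=80)               # e.g. a small character vocabulary
tok = torch.zeros(2, dtype=torch.long)     # assumed start-of-sentence id
tok, hid = dec.step(tok, torch.zeros(2, 1024), torch.zeros(2, 512))
```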

Slide 23

Attention Mechanism
● At each decoder step, use a different weighted sum of all encoder outputs
(Diagram layers: encoder RNN, attention weighting, decoder RNN)
Image: https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
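A hedged sketch of the idea: score every encoder position against the current decoder state, softmax the scores, and take the weighted sum. The concat-plus-linear scoring below is a simplification of Bahdanau-style attention, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def attention_context(dec_hidden, enc_outputs, score_layer):
    """Weighted sum of encoder outputs, recomputed at every decoder step."""
    # dec_hidden:  (batch, hidden_dim)       current decoder state
    # enc_outputs: (batch, src_len, ctx_dim) one annotation per source position
    src_len = enc_outputs.size(1)
    expanded = dec_hidden.unsqueeze(1).expand(-1, src_len, -1)
    scores = score_layer(torch.cat([expanded, enc_outputs], dim=-1)).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)   # (batch, src_len), sums to 1
    context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
    return context, weights                   # context feeds the next decoder step

# Example with made-up sizes: hidden_dim=512, ctx_dim=1024, src_len=7
score_layer = nn.Linear(512 + 1024, 1)
ctx, w = attention_context(torch.zeros(2, 512), torch.zeros(2, 7, 1024), score_layer)
print(ctx.shape, w.shape)  # torch.Size([2, 1024]) torch.Size([2, 7])
```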

Slide 24

Training
● Trained end-to-end
● Maximize log-likelihood with stochastic gradient descent (SGD)
● At test time, the most likely target sequence is selected through beam search
Image (edited): https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/
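As a sketch of the objective: minimizing cross-entropy over the reference target tokens is the same as maximizing their log-likelihood. The `model` below is a hypothetical seq2seq module trained with teacher forcing, not the authors' code.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, src_ids, tgt_ids):
    """One gradient step (e.g. with torch.optim.Adam) maximizing log-likelihood."""
    # src_ids: (batch, src_len) source token ids
    # tgt_ids: (batch, tgt_len) reference target token ids
    logits = model(src_ids, tgt_ids)     # (batch, tgt_len, vocab), teacher forcing assumed
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_ids.reshape(-1))   # mean negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```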

Slide 25

How can we use characters with this?

Slide 26

Bi-scale RNN: Decoding at Two Time Scales

Slide 27

Bi-scale RNN Decoder
Intuition
● Written words are composed of characters
● Can one network process characters “faster” and words “slower”?
Model
● Not literally time, but frequency of updates
● The faster layer (FL) updates its state like a typical RNN
● The slower layer (SL) only updates when the faster layer resets
● The SL retains its state until the FL finishes processing the current chunk (presumably the characters of a word), making it “slower”

Slide 28

Bi-scale RNN Decoder
● h1: the “faster” layer (FL)
● h2: the “slower” layer (SL)
● Both layers have the same size
● Each has an associated gating unit (g1 and g2) and gated activations
● The slower layer uses the faster layer’s gate to determine its own activation

Slide 29

Faster Layer (FL)
(Equations shown on the slide: FL activation, FL normal output, FL gated output, SL activation)
Summary
● The FL output is an adaptive combination of the previous FL and SL activations
● When the FL “resets” (g1 ≈ 1), the SL will have greater influence at the next step

Slide 30

Slower Layer (SL) is Controlled by Faster Layer (FL)
(Equations shown on the slide: SL normal output, SL candidate activation, SL gated output, SL reset activation)
Summary
● The slower layer only updates its activation when the faster layer resets
● The reset activation is similar to that of the FL on the previous slide
● When the FL gets rid of something, the SL takes it

Slide 31

Outputs
● The outputs h1 and h2 are simply concatenated
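To make the mechanism concrete, here is a rough, hypothetical sketch of the faster/slower update pattern. It follows the prose on the preceding slides (the FL updates every step, the SL only moves where the FL's gate fires, and the outputs are concatenated), not the exact equations of Chung et al. (2016); the GRU cells and sigmoid gate are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class BiScaleDecoderSketch(nn.Module):
    """Two-timescale recurrent update, sketched from the slides' description."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.fast_cell = nn.GRUCell(input_dim + hidden_dim, hidden_dim)  # faster layer h1
        self.slow_cell = nn.GRUCell(hidden_dim, hidden_dim)              # slower layer h2
        self.gate = nn.Linear(hidden_dim, hidden_dim)                    # reset gate g1

    def step(self, x, h1, h2):
        # Faster layer: ordinary recurrent update at every character step,
        # conditioned on the slower layer's current state.
        h1 = self.fast_cell(torch.cat([x, h2], dim=-1), h1)
        g1 = torch.sigmoid(self.gate(h1))        # g1 ~ 1 means the FL "resets"
        # Slower layer: blend in a candidate update only where g1 is high, so h2
        # effectively holds still while the FL spells out the current chunk.
        h2_candidate = self.slow_cell(h1, h2)
        h2 = g1 * h2_candidate + (1.0 - g1) * h2
        output = torch.cat([h1, h2], dim=-1)     # outputs are simply concatenated
        return output, h1, h2
```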

Slide 32

Paper Experiments (summary)
Training
● Trained with stochastic gradient descent (SGD) using Adam
● English to Czech, German, Finnish, and Russian; data from Euro-Parl
● Source inputs are subwords* (character n-grams in a vocabulary)
Results
● Character-level decoding generally outperformed subword-level decoding
● The Bi-scale decoder was not always better than a GRU decoder
* Sennrich et al. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909

Slide 33

Mini-Experiment: English-Japanese Translation

Slide 34

Considerations for Japanese
● Most research uses European languages: alphabet ⇒ alphabet
● Sometimes different morphology (e.g. Fi-En) or scripts (e.g. Ru-En)
● The Bi-scale paper performed well in such cases
● Japanese uses Kanji, two syllabaries, and sometimes an alphabet
● Large number of characters compared to alphabetic scripts
● Words often consist of only a few characters
Does the Bi-scale decoder make sense for Japanese?

Slide 35

Mini-Experiment
Dataset
● ~200,000 human-translated sentences collected from Tatoeba.org
● Slightly modified preprocessing (higher character limit, no punctuation removal)
Model
● Smaller network size compared to the paper (because of the small dataset and limited time)
● Still training a more realistic model right now
Results
● Under these conditions, bad overfitting problems, but still interesting

Slide 36

Observations - Alternative wordings

Source: The baby is sleeping on the bed .
Truth : 赤ちゃんはベッドで寝ています 。
Output: 赤ん坊 はベッドに寝ていた 。

Source: You seem to have gained some weight .
Truth : 少し体重が増えたようですね 。
Output: 君は少し体重が増えたようだ 。

Source: You only imagine you’ve heard it .
Truth : それは君の想像だ 。
Output: それは___想像であることだけだ 。

Slide 37

References
Papers:
● Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. (2016). A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. arXiv:1603.06147
● Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473v7
● Rico Sennrich, Barry Haddow, and Alexandra Birch. (2015). Neural Machine Translation of Rare Words with Subword Units. arXiv:1508.07909
NMT links (some images were used here, with permission):
● http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
● https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2
● https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3

Slide 38

Questions?