Unsupervised Neural Machine Translation
with Weight Sharing (ACL 2018)
Zhen Yang, Wei Chen, Feng Wang, Bo Xu
NLP
2018/08/03
Okazaki Lab: B4 S. Shimadu
Summary
• Objective
  • Machine translation trained without using any labeled (parallel) data
  • Improve unsupervised NMT
    • Existing systems are weak at keeping the unique and internal
      characteristics of each language
• Background
  • Monolingual corpora are easy to collect
  • Shared-latent-space assumption
    • A pair of sentences from two different languages can be mapped
      to the same latent representation
Summary
• Related research
  1. Source language → pivot language → target language
     [Saha et al., 2016; Cheng et al., 2017]
  2. A single encoder and a single decoder for both languages
     [Lample et al., 2017]
  3. A single encoder and two independent decoders
     [Artetxe et al., 2017b]
  • Approaches 2 and 3 both use a single shared encoder to guarantee
    the shared latent space.
Summary
• Proposed ideas
  • The weight-sharing constraint
  • The embedding-reinforced encoders
  • Two different GANs (local and global)
  • The Transformer as the encoder and decoder
• Experimental results
  • Compared with several baseline systems
  • Achieves significant improvements
  • Reveals that the order information within self-attention deserves
    further investigation
Model Architecture
• Based on autoencoders (AEs) and GANs
• Local discriminator: a multi-layer perceptron
• Global discriminator: based on a CNN
Model Architecture
• Weight-sharing constraint
  • Based on the shared-latent-space assumption
  • Share the weights of the last few layers of the two encoders,
    which extract high-level representations of the input sentences
  • Share the first few layers of the two decoders, which decode the
    high-level representations (see the sketch below)
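Below is a minimal PyTorch-style sketch of this weight-sharing idea, under assumptions not stated on the slide: plain nn.TransformerEncoderLayer blocks (the paper actually modifies the Transformer with directional self-attention), four encoder layers per language with only the top one shared, and illustrative model sizes.

```python
import torch.nn as nn

D_MODEL, N_HEAD = 512, 8        # illustrative sizes, not the paper's exact setting
N_PRIVATE, N_SHARED = 3, 1      # hypothetical private/shared split of encoder layers

def make_layers(n):
    return nn.ModuleList(
        [nn.TransformerEncoderLayer(D_MODEL, N_HEAD) for _ in range(n)]
    )

# Each language has its own lower layers; the top layer(s) are one shared module,
# so both languages are encoded into the same high-level (latent) space.
private_src, private_tgt = make_layers(N_PRIVATE), make_layers(N_PRIVATE)
shared_top = make_layers(N_SHARED)   # the very same object is used for both languages

def encode(x, private_layers):
    for layer in private_layers:     # language-specific, low-level features
        x = layer(x)
    for layer in shared_top:         # shared layers: high-level representation
        x = layer(x)
    return x

# The two decoders would mirror this: their first few layers are shared,
# the remaining layers are language-specific.
```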
Model Architecture
• Embedding-reinforced encoder
  • Pre-trained cross-lingual embeddings that are kept fixed during
    training
  • The final output sequence of the encoder is computed as
    H_r = g ⊙ H + (1 − g) ⊙ E
    • E: input sequence embedding vectors
    • H: initial output sequence of the encoder stack
    • g: gate unit, computed as g = σ(W_1 E + W_2 H + b)
    • W_1, W_2 and b are trainable parameters, shared by the two
      encoders
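A hedged PyTorch sketch of this gate follows; the slide does not give the shapes of W_1 and W_2, so square d_model × d_model matrices are assumed, with the bias b folded into one of the linear layers.

```python
import torch
import torch.nn as nn

class EmbeddingReinforcedOutput(nn.Module):
    """Gated mix of the fixed cross-lingual embeddings E and the encoder output H:
    H_r = g ⊙ H + (1 − g) ⊙ E,  with  g = σ(W_1 E + W_2 H + b)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model, bias=False)  # W_1, applied to E
        self.w2 = nn.Linear(d_model, d_model, bias=True)   # W_2 plus bias b, applied to H

    def forward(self, E: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.w1(E) + self.w2(H))  # element-wise gate in (0, 1)
        return g * H + (1.0 - g) * E                # H_r

# One such module instance would be reused by both encoders, since
# W_1, W_2 and b are shared across the two languages.
```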
Methodology
• Back-translation
  • Used for the cross-language training
  • How to get the pseudo-parallel corpus:
    • source / target sentence → target / source sentence
  • The pseudo-parallel corpus is used to reconstruct the original
    sentence from its translation (see the sketch below)
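A rough sketch of one back-translation round as described above; translate_s2t, translate_t2s and train_step are hypothetical stand-ins for the model's inference and update routines, not functions from the paper's code.

```python
def back_translation_round(src_mono, tgt_mono,
                           translate_s2t, translate_t2s, train_step):
    # Build the pseudo-parallel corpus by translating monolingual sentences
    # with the current model (the translation step itself is not trained here).
    pseudo_tgt = [translate_s2t(s) for s in src_mono]   # source -> pseudo target
    pseudo_src = [translate_t2s(t) for t in tgt_mono]   # target -> pseudo source

    # Train the model to reconstruct the original sentence from its translation.
    for s, t in zip(src_mono, pseudo_tgt):
        train_step(source=t, target=s, direction="tgt->src")
    for t, s in zip(tgt_mono, pseudo_src):
        train_step(source=s, target=t, direction="src->tgt")
```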
Methodology
• Local GAN
  • To further enforce the shared latent space, train a discriminative
    neural network (the local discriminator)
  • It takes the output of the encoder and produces a binary prediction
    of the input sentence's language
  • The local discriminator is trained to predict the exact language
  • The encoders are trained to fool the local discriminator
    (see the sketch below)
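A hedged sketch of the local adversarial game; the slide only says the local discriminator is a multi-layer perceptron over encoder outputs, so the hidden size, mean-pooling, and binary cross-entropy losses below are assumptions.

```python
import torch
import torch.nn as nn

class LocalDiscriminator(nn.Module):
    """MLP that guesses which language an encoded sentence came from."""
    def __init__(self, d_model: int, d_hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.LeakyReLU(),
            nn.Linear(d_hidden, 1),       # one logit: source vs. target language
        )

    def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
        return self.net(enc_states.mean(dim=1))   # mean-pool over time (assumed)

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, enc_src, enc_tgt):
    # The discriminator is trained to predict the exact language.
    return (bce(disc(enc_src), torch.ones(enc_src.size(0), 1))
            + bce(disc(enc_tgt), torch.zeros(enc_tgt.size(0), 1)))

def encoder_adversarial_loss(disc, enc_src, enc_tgt):
    # The encoders are trained with flipped labels, i.e. to fool the discriminator.
    return (bce(disc(enc_src), torch.zeros(enc_src.size(0), 1))
            + bce(disc(enc_tgt), torch.ones(enc_tgt.size(0), 1)))
```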
Methodology
• Global GAN
  • Fine-tunes the whole model
  • Used to update all of the parameters of the proposed model
Methodology
• Training (schedule sketched below)
  1. Train with the AEs, back-translation and the local GANs
  2. Continue until no improvement is achieved on the development set
  3. Fine-tune the proposed model with the global GANs
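A sketch of this two-phase schedule; train_phase1_epoch, train_global_gan and dev_bleu are hypothetical helpers, and the patience-based stopping rule is an assumption (the slide only says training continues until the development set stops improving).

```python
def train(model, data, dev_data,
          train_phase1_epoch, train_global_gan, dev_bleu, patience: int = 3):
    best, stalled = float("-inf"), 0

    # Phase 1: autoencoding + back-translation + local GANs,
    # stopped when the development score no longer improves.
    while stalled < patience:
        train_phase1_epoch(model, data)
        score = dev_bleu(model, dev_data)
        best, stalled = (score, 0) if score > best else (best, stalled + 1)

    # Phase 2: fine-tune all parameters of the model with the global GANs.
    train_global_gan(model, data)
    return model
```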
Evaluation
• Evaluated by computing the BLEU score
• Two-step (round-trip) translation process
  • Translate the source sentences into the target language
  • Translate the resulting sentences back into the source language
• Performance is finally averaged over the two directions
  (see the sketch below)
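A sketch of this round-trip evaluation, assuming the sacrebleu package for corpus-level BLEU and hypothetical translate_s2t / translate_t2s functions; the paper may use a different BLEU script.

```python
import sacrebleu  # assumed tooling; any corpus-level BLEU implementation would do

def round_trip_bleu(src_sents, tgt_sents, translate_s2t, translate_t2s):
    # source -> target -> source, scored against the original source sentences
    src_round = [translate_t2s(translate_s2t(s)) for s in src_sents]
    bleu_src = sacrebleu.corpus_bleu(src_round, [src_sents]).score

    # target -> source -> target, scored against the original target sentences
    tgt_round = [translate_s2t(translate_t2s(t)) for t in tgt_sents]
    bleu_tgt = sacrebleu.corpus_bleu(tgt_round, [tgt_sents]).score

    return (bleu_src + bleu_tgt) / 2.0   # averaged over the two directions
```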
Experiment Result (1)
• Vary the number of weight-sharing layers in the AEs
• Verifies that the shared encoder is detrimental to performance,
  especially for distant language pairs
Experiment Result (2)
• Although trained only with monolingual data, the model effectively
  learns to use the context information and the internal structure of
  each language
Experiment Result (3)
• The most critical component is the weight-sharing
constraint
• The embedding-reinforced encoder brings some
improvement on all of the translation tasks
Experiment Result (4)
• Removing the directional self-attention → −0.3 BLEU
• It deserves more effort to investigate the temporal order
  information
• The GANs significantly improve the performance
Conclusion
• They proposed
• The weight-sharing constraint in unsupervised NMT
• The embedding-reinforced encoders
• Local GAN and global GAN
• Achieves significant improvement
• Reveals that the shared encoder is really a bottleneck
• Future work
• Investigate how to utilize the monolingual data more
effectively
• Explore how to reinforce the temporal order information