Slide 1

Unsupervised Neural Machine Translation with Weight Sharing (ACL 2018)
Zhen Yang, Wei Chen, Feng Wang, Bo Xu
NLP 2018/08/03  Okazaki Lab: B4 S. Shimadu

Slide 2

Summary
• Objective
  • Machine translation that can be trained without any labeled data
  • Improve unsupervised NMT, which is weak at keeping the unique, internal characteristics of each language
• Background
  • Monolingual corpora are easy to collect
  • Shared-latent space assumption: a pair of sentences from two different languages can be mapped to the same latent representation

Slide 3

Summary
• Related research
  1. Source language → pivot language → target language [Saha et al., 2016; Cheng et al., 2017]
  2. A single encoder and a single decoder for both languages [Lample et al., 2017]
  3. A single encoder and two independent decoders [Artetxe et al., 2017b]
• Approaches 2 and 3 both use a single shared encoder to guarantee the shared latent space.

Slide 4

Summary
• Proposed ideas
  • The weight-sharing constraint
  • The embedding-reinforced encoders
  • Two different GANs (local and global)
  • Transformer for the encoder and decoder
• Experimental results
  • Compared with several baseline systems
  • Achieves significant improvements
  • Reveals that the order information within self-attention deserves further investigation

Slide 5

Model Architecture
• Based on the AE (auto-encoder) and GANs
• Local discriminator: multi-layer perceptron
• Global discriminator: based on a CNN
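
As a rough illustration of the two discriminator types named above, here is a minimal PyTorch-style sketch; the layer widths, depth, kernel size, and pooling are illustrative assumptions, not details taken from the paper.

```python
# Minimal PyTorch sketch of the two discriminators (sizes and pooling are assumptions).
import torch
import torch.nn as nn

class LocalDiscriminator(nn.Module):
    """MLP that predicts, from an encoder output vector, which language it came from."""
    def __init__(self, d_model=512, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, d_hidden), nn.LeakyReLU(0.2),
            nn.Linear(d_hidden, 1),               # single binary logit
        )

    def forward(self, enc_state):                 # enc_state: (batch, d_model)
        return self.net(enc_state)

class GlobalDiscriminator(nn.Module):
    """CNN-based discriminator over a whole sentence representation."""
    def __init__(self, d_model=512, n_filters=64, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, n_filters, kernel_size=kernel, padding=1)
        self.out = nn.Linear(n_filters, 1)

    def forward(self, sent_repr):                 # sent_repr: (batch, seq_len, d_model)
        h = torch.relu(self.conv(sent_repr.transpose(1, 2)))  # (batch, n_filters, seq_len)
        h = h.max(dim=2).values                               # max-pool over time
        return self.out(h)                                    # single binary logit
```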

Slide 6

Model Architecture
• Weight-sharing constraint
  • Based on the shared-latent space assumption
  • Share the weights of the last few layers of the two encoders, which extract high-level representations of the input sentences
  • Share the weights of the first few layers of the two decoders, which decode those high-level representations
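
A minimal sketch of how the encoder-side weight sharing could be wired up, assuming Transformer encoder layers and illustrative sizes (D_MODEL, layer counts); this is not the authors' code, just one way to tie the top layers of two encoders.

```python
# Minimal sketch of encoder-side weight sharing (hyperparameters are illustrative).
import torch.nn as nn

D_MODEL, N_HEADS, N_LAYERS, N_SHARED = 512, 8, 4, 1

def make_layers(n):
    return nn.ModuleList(
        nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True) for _ in range(n)
    )

shared_top = make_layers(N_SHARED)        # built once, reused by both encoders

class SharedTopEncoder(nn.Module):
    def __init__(self, shared_layers):
        super().__init__()
        self.private = make_layers(N_LAYERS - N_SHARED)  # language-specific lower layers
        self.shared = shared_layers                      # last layers, tied across languages

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        for layer in self.private:
            x = layer(x)
        for layer in self.shared:                        # high-level representation layers
            x = layer(x)
        return x

enc_src = SharedTopEncoder(shared_top)    # source-language encoder
enc_tgt = SharedTopEncoder(shared_top)    # target-language encoder shares the same top weights
```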

Slide 7

Model Architecture
• Embedding-reinforced encoder
  • Pre-trained cross-lingual embeddings that are kept fixed during training
  • The final output sequence of the encoder is computed as
    H_r = g ⊙ H + (1 − g) ⊙ E
    • E: input sequence embedding vectors
    • H: initial output sequence of the encoder stack
    • g: gate unit, computed as g = σ(W_1 E + W_2 H + b)
  • W_1, W_2 and b are trainable parameters, and they are shared by the two encoders.
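
A minimal PyTorch sketch of this gated combination; the module name and the d_model size are my own assumptions, only the formula itself comes from the slide.

```python
# Minimal sketch of the gated combination H_r = g * H + (1 - g) * E.
import torch
import torch.nn as nn

class EmbeddingReinforcedOutput(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.W1 = nn.Linear(d_model, d_model, bias=False)   # applied to E
        self.W2 = nn.Linear(d_model, d_model, bias=False)   # applied to H
        self.b = nn.Parameter(torch.zeros(d_model))

    def forward(self, E, H):                  # E, H: (batch, seq_len, d_model)
        g = torch.sigmoid(self.W1(E) + self.W2(H) + self.b)  # gate g in (0, 1)
        return g * H + (1.0 - g) * E                          # H_r

# A single instance would be reused by both encoders, since W_1, W_2 and b are shared.
```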

Slide 8

Methodology
• Back-translation
  • Used for cross-language training
  • How to get a pseudo-parallel corpus: translate a source/target sentence into a target/source sentence
  • The pseudo-parallel corpus is then used to reconstruct the original sentence from its translation
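
A minimal sketch of one back-translation round as described above; translate_s2t, translate_t2s and train_step are hypothetical callables standing in for the current model's translation and update functions.

```python
# Minimal sketch of a back-translation round (helper callables are hypothetical).
def back_translation_round(src_sentences, tgt_sentences,
                           translate_s2t, translate_t2s, train_step):
    # Build pseudo-parallel pairs by translating monolingual sentences with the current model.
    pseudo_for_s2t = [(translate_t2s(t), t) for t in tgt_sentences]  # (pseudo source, real target)
    pseudo_for_t2s = [(translate_s2t(s), s) for s in src_sentences]  # (pseudo target, real source)

    # Train each direction to reconstruct the original sentence from its translation.
    for pseudo_input, original in pseudo_for_s2t:
        train_step(direction="s2t", source=pseudo_input, target=original)
    for pseudo_input, original in pseudo_for_t2s:
        train_step(direction="t2s", source=pseudo_input, target=original)
```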

Slide 9

Methodology
• Local GAN
  • To further enforce the shared latent space, a discriminative neural network is trained
  • It takes the output of the encoder and produces a binary prediction about the language of the input sentence
  • The local discriminator is trained to predict the exact language
  • The encoders are trained to fool the local discriminator
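
A minimal sketch of these two opposing objectives, written as a binary cross-entropy GAN loss (the paper's exact formulation may differ); local_d is a discriminator such as the LocalDiscriminator sketched earlier, and h_src / h_tgt are (pooled) encoder outputs of shape (batch, d_model) for source- and target-language sentences.

```python
# Minimal sketch of the local-GAN losses (binary cross-entropy formulation assumed).
import torch
import torch.nn.functional as F

def local_gan_losses(local_d, h_src, h_tgt):
    src_label = torch.zeros(h_src.size(0), 1)   # label 0: source language
    tgt_label = torch.ones(h_tgt.size(0), 1)    # label 1: target language

    # Discriminator: predict the exact language of each encoder output.
    d_loss = (F.binary_cross_entropy_with_logits(local_d(h_src.detach()), src_label)
              + F.binary_cross_entropy_with_logits(local_d(h_tgt.detach()), tgt_label))

    # Encoders: fool the discriminator by optimizing toward the flipped labels.
    g_loss = (F.binary_cross_entropy_with_logits(local_d(h_src), tgt_label)
              + F.binary_cross_entropy_with_logits(local_d(h_tgt), src_label))
    return d_loss, g_loss
```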

Slide 10

Methodology
• Global GAN
  • Used to fine-tune the whole model
  • Updates all of the parameters of the proposed model
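
A tiny sketch of the point that this stage touches every parameter: unlike the local GAN, whose adversarial signal is aimed at the encoders, the global-GAN generator loss is propagated to all parameters of the translation model. `model` and `global_d` are hypothetical placeholders for the full NMT model and the CNN-based global discriminator.

```python
# Minimal sketch of the fine-tuning optimizers (module names are placeholders).
import torch

def build_finetune_optimizers(model, global_d, lr=1e-4):
    g_optimizer = torch.optim.Adam(model.parameters(), lr=lr)      # every model parameter
    d_optimizer = torch.optim.Adam(global_d.parameters(), lr=lr)   # discriminator parameters
    return g_optimizer, d_optimizer
```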

Slide 11

Methodology
• Training
  1. Train with the AEs, back-translation and the local GANs
  2. Continue until no improvement is achieved on the development set
  3. Fine-tune the proposed model with the global GANs
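
A minimal sketch of this two-stage schedule; train_stage1_epoch, dev_bleu and finetune_global_gan are hypothetical callables, and the patience-based early stopping is an assumption about how "no improvement on the development set" might be detected.

```python
# Minimal sketch of the two-stage training schedule (callables are hypothetical).
def train(model, train_stage1_epoch, dev_bleu, finetune_global_gan, patience=3):
    best, epochs_without_gain = float("-inf"), 0

    # Stage 1: auto-encoding + back-translation + local GANs,
    # until the development-set score stops improving.
    while epochs_without_gain < patience:
        train_stage1_epoch(model)
        score = dev_bleu(model)
        if score > best:
            best, epochs_without_gain = score, 0
        else:
            epochs_without_gain += 1

    # Stage 2: fine-tune the whole model with the global GANs.
    finetune_global_gan(model)
    return model
```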

Slide 12

Evaluation
• Evaluated by computing the BLEU score
• Two-step translation process
  • Translate the source sentences into the target language
  • Translate the resulting sentences back into the source language
• Performance is finally averaged over the two directions
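
One possible reading of this round-trip evaluation as code, using sacrebleu for BLEU (the slide does not name a scorer) and hypothetical translate_s2t / translate_t2s callables; a sketch of the procedure, not the authors' evaluation script.

```python
# Minimal sketch of round-trip BLEU averaged over the two directions.
import sacrebleu

def round_trip_bleu(src_sentences, tgt_sentences, translate_s2t, translate_t2s):
    # source -> target -> source, scored against the original source sentences
    back_to_src = [translate_t2s(translate_s2t(s)) for s in src_sentences]
    bleu_src = sacrebleu.corpus_bleu(back_to_src, [src_sentences]).score

    # target -> source -> target, scored against the original target sentences
    back_to_tgt = [translate_s2t(translate_t2s(t)) for t in tgt_sentences]
    bleu_tgt = sacrebleu.corpus_bleu(back_to_tgt, [tgt_sentences]).score

    return (bleu_src + bleu_tgt) / 2.0   # averaged over the two directions
```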

Slide 13

Experiment Result (1)
• Vary the number of weight-sharing layers in the AEs
• Verifies that the shared encoder is detrimental to the performance, especially for distant language pairs

Slide 14

Experiment Result (2)
• The model, trained only with monolingual data, effectively learns to use the context information and the internal structure of each language

Slide 15

Experiment Result (3)
• The most critical component is the weight-sharing constraint
• The embedding-reinforced encoder brings some improvement on all of the translation tasks

Slide 16

Experiment Result (4)
• Removing the directional self-attention → −0.3 BLEU
• It deserves more effort to investigate the temporal order information
• The GANs significantly improve the performance

Slide 17

Conclusion
• They proposed
  • The weight-sharing constraint in unsupervised NMT
  • The embedding-reinforced encoders
  • A local GAN and a global GAN
• Achieves significant improvements
• Reveals that the shared encoder is really a bottleneck
• Future work
  • Investigate how to utilize the monolingual data more effectively
  • Explore how to reinforce the temporal order information