
2019 EMNLP Reading Group: Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks


Ikumi Yamashita

January 20, 2020



Transcript

  1. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
     Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Ming Zhou (EMNLP 2019)
     Presenter: Ikumi Yamashita (TMU B4, Komachi Lab) 2020/01/20 @ EMNLP 2019 Reading Group
  2. Overview (contributions)
     • Universal Language Encoder (Unicoder)
       Ø Based on multilingual BERT and XLM.
       Ø Three new cross-lingual pre-training tasks are proposed.
       Ø New SOTA results are achieved on the XNLI dataset.
     • A cross-lingual question answering (XQA) dataset is built.
       Ø It can be used as a new cross-lingual benchmark dataset.
     • They verify that fine-tuning on multiple languages together yields significant improvements.
  3. Related work
     • Monolingual pre-training
       Ø Encoders pre-trained with language modeling and machine translation
     • Cross-lingual pre-training
       Ø Multilingual BERT
       Ø XLM (MLM, TLM)
  4. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery
     • Cross-lingual paraphrase classification
     • Cross-lingual masked language modeling
     (A minimal input-construction sketch for MLM and TLM follows below.)
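The first two tasks, MLM and TLM, follow BERT and XLM. Below is a minimal, assumption-laden Python sketch of how their inputs can be built: the whitespace tokenization and the [CLS]/[SEP]/[MASK] symbols are stand-ins for the actual XLM vocabulary and special tokens, and the 15% masking rate is the usual BERT default rather than a number from these slides.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK] (BERT-style MLM).
    Returns the corrupted input and the recovery targets."""
    rng = random.Random(seed)
    inp, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inp.append(MASK)
            targets.append(tok)       # the model must recover this token
        else:
            inp.append(tok)
            targets.append(None)      # no loss on unmasked positions
    return inp, targets

def tlm_input(src_tokens, tgt_tokens):
    """TLM (from XLM): concatenate a translation pair so that masked words in
    one language can be predicted from context in the other language."""
    return ["[CLS]"] + src_tokens + ["[SEP]"] + tgt_tokens + ["[SEP]"]

if __name__ == "__main__":
    en = "the cat sat on the mat".split()        # toy whitespace tokenization
    fr = "le chat est assis sur le tapis".split()
    print(mask_tokens(en))
    print(mask_tokens(tlm_input(en, fr)))
```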
  5. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery (see the sketch below)
     • Cross-lingual paraphrase classification
     • Cross-lingual masked language modeling
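Slides 5 to 7 each highlight one of the new tasks. For cross-lingual word recovery, my reading of the paper is that each source word is re-expressed as an attention-weighted mixture of the other language's word embeddings, and the encoder is then trained to recover the original source words from that mixture. The PyTorch sketch below shows only the attention step; all tensor and function names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def word_recovery_inputs(x_emb, y_emb, w):
    """Cross-lingual word recovery (a sketch of my reading of the task):
    represent each source word as an attention-weighted mixture of the
    target-language word embeddings; the encoder would then be trained to
    recover the original source words from these mixed representations.

    x_emb: (len_x, d) source-sentence word embeddings
    y_emb: (len_y, d) target-sentence word embeddings
    w:     (3 * d,)   scoring vector for the pairwise attention
    """
    len_x, d = x_emb.shape
    len_y = y_emb.shape[0]
    # Pairwise features [x_i ; y_j ; x_i * y_j] -> one scalar score per (i, j).
    xi = x_emb.unsqueeze(1).expand(len_x, len_y, d)
    yj = y_emb.unsqueeze(0).expand(len_x, len_y, d)
    scores = torch.cat([xi, yj, xi * yj], dim=-1) @ w   # (len_x, len_y)
    attn = F.softmax(scores, dim=-1)                    # align each x_i to Y
    return attn @ y_emb                                 # (len_x, d) "mixed" inputs

if __name__ == "__main__":
    d = 8
    x, y = torch.randn(5, d), torch.randn(7, d)
    w = torch.randn(3 * d)
    print(word_recovery_inputs(x, y, w).shape)  # torch.Size([5, 8])
```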
  6. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery
     • Cross-lingual paraphrase classification (see the sketch below)
     • Cross-lingual masked language modeling
     Cross-lingual paraphrase classification takes two sentences from different languages as input and classifies whether they have the same meaning.
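A minimal PyTorch sketch of the paraphrase-classification setup described above: a shared encoder (a stand-in module here) reads the concatenated bilingual pair, and a linear head over the first token decides whether the two sentences mean the same thing. The class name, the toy encoder, and the first-token pooling are my assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CrossLingualParaphraseClassifier(nn.Module):
    """Binary classifier over a sentence pair from two different languages.
    `encoder` stands in for the shared Unicoder transformer; any module that
    maps token ids (batch, seq) to hidden states (batch, seq, d) works here."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 2)    # paraphrase vs. not

    def forward(self, pair_ids):
        hidden = self.encoder(pair_ids)          # (batch, seq, d)
        cls_vec = hidden[:, 0]                   # first-token ("[CLS]"-style) pooling
        return self.head(cls_vec)                # (batch, 2) logits

if __name__ == "__main__":
    vocab, d = 100, 16
    toy_encoder = nn.Sequential(nn.Embedding(vocab, d))   # stand-in, not a real transformer
    model = CrossLingualParaphraseClassifier(toy_encoder, d)
    pair = torch.randint(0, vocab, (2, 12))   # e.g. "[CLS] en ... [SEP] zh ... [SEP]"
    print(model(pair).shape)                  # torch.Size([2, 2])
```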
  7. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery
     • Cross-lingual paraphrase classification
     • Cross-lingual masked language modeling (see the sketch below)
     For cross-lingual masked language modeling, the input comes from a cross-lingual document, which is truncated to a sequence length of 256.
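A rough plain-Python sketch of what a cross-lingual MLM training example could look like under the description above: a document is made cross-lingual (here simply by alternating each sentence with its translation, which is only an approximation of the paper's construction), truncated to 256 tokens, and then masked as in ordinary MLM.

```python
import random

def cross_lingual_document(sentences, translations, max_len=256):
    """Build a code-switched document by replacing every other sentence with
    its translation, then truncate to `max_len` tokens (the slide says 256).
    The paper's actual interleaving is more involved; this is only a sketch."""
    doc = []
    for i, (sent, trans) in enumerate(zip(sentences, translations)):
        doc.extend(trans if i % 2 else sent)      # alternate languages
        if len(doc) >= max_len:
            break
    return doc[:max_len]

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Standard MLM masking applied to the cross-lingual document."""
    rng = random.Random(seed)
    return [("[MASK]" if rng.random() < mask_prob else t) for t in tokens]

if __name__ == "__main__":
    en = [s.split() for s in ["the cat sleeps .", "it is raining ."]]
    fr = [s.split() for s in ["le chat dort .", "il pleut ."]]
    doc = cross_lingual_document(en, fr)
    print(mask_for_mlm(doc))
```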
  8. Multi-language fine-tuning
     Only one language has training data, but testing is conducted on the other languages.
     • TRANSLATE-TRAIN
     • TRANSLATE-TEST
     • Multi-language fine-tuning
     (A data-side sketch of the three strategies follows below.)
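The three fine-tuning strategies differ mainly in how the data is arranged. The sketch below (function names and the stand-in translate function are mine) shows that difference: TRANSLATE-TRAIN machine-translates the English training set into the target language, TRANSLATE-TEST translates the target-language test set back into English at evaluation time, and multi-language fine-tuning pools the English data with its translations into all languages and trains one model on the mix.

```python
def translate_train(en_train, translate, target_lang):
    """TRANSLATE-TRAIN: translate the English training set into the target
    language and fine-tune there. `translate` is a stand-in MT function."""
    return [(translate(x, target_lang), y) for x, y in en_train]

def translate_test(target_test, translate):
    """TRANSLATE-TEST: fine-tune on English only and translate the
    target-language test set back into English at evaluation time."""
    return [(translate(x, "en"), y) for x, y in target_test]

def multi_language_finetune_data(en_train, translate, languages):
    """Multi-language fine-tuning: pool the English data with its machine
    translations into every language and fine-tune a single model on the mix."""
    pooled = list(en_train)
    for lang in languages:
        pooled += translate_train(en_train, translate, lang)
    return pooled

if __name__ == "__main__":
    fake_translate = lambda text, lang: f"<{lang}> {text}"   # stand-in for an MT system
    en_train = [("a man is sleeping", "neutral")]
    print(multi_language_finetune_data(en_train, fake_translate, ["fr", "de"]))
```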
  9. Experiment settings
     Datasets
     • Wikipedia (15 languages, for MLM)
     • MT datasets are collected from MultiUN, the IIT Bombay corpus, OpenSubtitles2018, the EUbookshop corpus and Global Voices
     Model
     • 12-layer Transformer with 1024 hidden units and 16 heads (sketch below)
     Tasks
     • Cross-lingual Natural Language Inference (XNLI)
     • Cross-lingual Question Answering (XQA)
       Ø They propose a new dataset, XQA.
       Ø XQA contains three languages: English, French and German.
       Ø Only English has training data.
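A stand-in PyTorch encoder with the dimensions quoted on the slide (12 layers, 1024 hidden units, 16 attention heads). The real Unicoder follows the XLM architecture, vocabulary and embeddings; the feed-forward size of 4096 below is my assumption (4x hidden, the usual Transformer default), not a number from the slides.

```python
import torch
import torch.nn as nn

# Stand-in encoder mirroring only the sizes on the slide, not Unicoder itself.
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16,
                                   dim_feedforward=4096,  # assumed 4x hidden
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

tokens = torch.randn(2, 256, 1024)   # (batch, seq_len=256, hidden)
print(encoder(tokens).shape)         # torch.Size([2, 256, 1024])
```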
  10. Analysis
      • The relation between the number of languages and fine-tuning performance
        Ø The more languages they use, the better the performance.
        Ø Even English is improved by multi-language fine-tuning.
  11. Analysis
      • The relation between English and other languages
        Ø Most of the average results are improved by jointly fine-tuning two languages; only Vietnamese and Urdu lead to a performance drop.
        Ø The improvement on English is not stable: French and Spanish can improve English performance, but Vietnamese and Thai lead to a big drop.
  12. Summary
      • Unicoder with three new cross-lingual pre-training tasks
        Ø Cross-lingual word recovery
        Ø Cross-lingual paraphrase classification
        Ø Cross-lingual masked language model
      • The more languages used in fine-tuning, the better the results.
        Ø Even a rich-resource language (English) can be improved.