
2019 EMNLP Reading Group: Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks


Ikumi Yamashita

January 20, 2020



Transcript

  1. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks
     Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Ming Zhou (EMNLP 2019)
     Presenter: Ikumi Yamashita (TMU B4, Komachi Lab) 2020/01/20 @ EMNLP 2019 Reading Group
  2. Overview (contributions)
     • Universal Language Encoder (Unicoder)
       Ø Based on multilingual BERT and XLM.
       Ø Three new cross-lingual pre-training tasks are proposed.
       Ø New SOTA results are achieved on the XNLI dataset.
     • A cross-lingual question answering (XQA) dataset is built.
       Ø It can be used as a new cross-lingual benchmark dataset.
     • They verify that fine-tuning on multiple languages together yields significant improvements.
  3. Related work
     • Monolingual pre-training
       Ø Encoders pre-trained with language modeling and machine translation
     • Cross-lingual pre-training
       Ø Multilingual BERT
       Ø XLM (MLM, TLM)
  4. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery
     • Cross-lingual paraphrase classification
     • Cross-lingual masked language modeling
     (A minimal input-construction sketch for MLM and TLM follows below.)
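The first two tasks, MLM and TLM, follow BERT and XLM. Below is a minimal, assumption-laden Python sketch of how their inputs can be built: the whitespace tokenization and the [CLS]/[SEP]/[MASK] symbols are stand-ins for the actual XLM vocabulary and special tokens, and the 15% masking rate is the usual BERT default rather than a number from these slides.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of tokens with [MASK] (BERT-style MLM).
    Returns the corrupted input and the recovery targets."""
    rng = random.Random(seed)
    inp, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inp.append(MASK)
            targets.append(tok)       # the model must recover this token
        else:
            inp.append(tok)
            targets.append(None)      # no loss on unmasked positions
    return inp, targets

def tlm_input(src_tokens, tgt_tokens):
    """TLM (from XLM): concatenate a translation pair so that masked words in
    one language can be predicted from context in the other language."""
    return ["[CLS]"] + src_tokens + ["[SEP]"] + tgt_tokens + ["[SEP]"]

if __name__ == "__main__":
    en = "the cat sat on the mat".split()        # toy whitespace tokenization
    fr = "le chat est assis sur le tapis".split()
    print(mask_tokens(en))
    print(mask_tokens(tlm_input(en, fr)))
```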
  5. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery (see the sketch below)
     • Cross-lingual paraphrase classification
     • Cross-lingual masked language modeling
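Slides 5 to 7 each highlight one of the new tasks. For cross-lingual word recovery, my reading of the paper is that each source word is re-expressed as an attention-weighted mixture of the other language's word embeddings, and the encoder is then trained to recover the original source words from that mixture. The PyTorch sketch below shows only the attention step; all tensor and function names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def word_recovery_inputs(x_emb, y_emb, w):
    """Cross-lingual word recovery (a sketch of my reading of the task):
    represent each source word as an attention-weighted mixture of the
    target-language word embeddings; the encoder would then be trained to
    recover the original source words from these mixed representations.

    x_emb: (len_x, d) source-sentence word embeddings
    y_emb: (len_y, d) target-sentence word embeddings
    w:     (3 * d,)   scoring vector for the pairwise attention
    """
    len_x, d = x_emb.shape
    len_y = y_emb.shape[0]
    # Pairwise features [x_i ; y_j ; x_i * y_j] -> one scalar score per (i, j).
    xi = x_emb.unsqueeze(1).expand(len_x, len_y, d)
    yj = y_emb.unsqueeze(0).expand(len_x, len_y, d)
    scores = torch.cat([xi, yj, xi * yj], dim=-1) @ w   # (len_x, len_y)
    attn = F.softmax(scores, dim=-1)                    # align each x_i to Y
    return attn @ y_emb                                 # (len_x, d) "mixed" inputs

if __name__ == "__main__":
    d = 8
    x, y = torch.randn(5, d), torch.randn(7, d)
    w = torch.randn(3 * d)
    print(word_recovery_inputs(x, y, w).shape)  # torch.Size([5, 8])
```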
  6. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery
     • Cross-lingual paraphrase classification (see the sketch below)
     • Cross-lingual masked language modeling
     Cross-lingual paraphrase classification takes two sentences from different languages as input and classifies whether they have the same meaning.
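A minimal PyTorch sketch of the paraphrase-classification setup described above: a shared encoder (a stand-in module here) reads the concatenated bilingual pair, and a linear head over the first token decides whether the two sentences mean the same thing. The class name, the toy encoder, and the first-token pooling are my assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class CrossLingualParaphraseClassifier(nn.Module):
    """Binary classifier over a sentence pair from two different languages.
    `encoder` stands in for the shared Unicoder transformer; any module that
    maps token ids (batch, seq) to hidden states (batch, seq, d) works here."""
    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, 2)    # paraphrase vs. not

    def forward(self, pair_ids):
        hidden = self.encoder(pair_ids)          # (batch, seq, d)
        cls_vec = hidden[:, 0]                   # first-token ("[CLS]"-style) pooling
        return self.head(cls_vec)                # (batch, 2) logits

if __name__ == "__main__":
    vocab, d = 100, 16
    toy_encoder = nn.Sequential(nn.Embedding(vocab, d))   # stand-in, not a real transformer
    model = CrossLingualParaphraseClassifier(toy_encoder, d)
    pair = torch.randint(0, vocab, (2, 12))   # e.g. "[CLS] en ... [SEP] zh ... [SEP]"
    print(model(pair).shape)                  # torch.Size([2, 2])
```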
  7. Pre-training tasks in Unicoder
     • Masked language model (MLM)
     • Translation language model (TLM)
     • Cross-lingual word recovery
     • Cross-lingual paraphrase classification
     • Cross-lingual masked language modeling (see the sketch below)
     For cross-lingual masked language modeling, the input comes from a cross-lingual document, which is truncated to a sequence length of 256.
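A rough plain-Python sketch of what a cross-lingual MLM training example could look like under the description above: a document is made cross-lingual (here simply by alternating each sentence with its translation, which is only an approximation of the paper's construction), truncated to 256 tokens, and then masked as in ordinary MLM.

```python
import random

def cross_lingual_document(sentences, translations, max_len=256):
    """Build a code-switched document by replacing every other sentence with
    its translation, then truncate to `max_len` tokens (the slide says 256).
    The paper's actual interleaving is more involved; this is only a sketch."""
    doc = []
    for i, (sent, trans) in enumerate(zip(sentences, translations)):
        doc.extend(trans if i % 2 else sent)      # alternate languages
        if len(doc) >= max_len:
            break
    return doc[:max_len]

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """Standard MLM masking applied to the cross-lingual document."""
    rng = random.Random(seed)
    return [("[MASK]" if rng.random() < mask_prob else t) for t in tokens]

if __name__ == "__main__":
    en = [s.split() for s in ["the cat sleeps .", "it is raining ."]]
    fr = [s.split() for s in ["le chat dort .", "il pleut ."]]
    doc = cross_lingual_document(en, fr)
    print(mask_for_mlm(doc))
```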
  8. Multi-language fine-tuning
     Only one language has training data, but testing is conducted on the other languages.
     • TRANSLATE-TRAIN
     • TRANSLATE-TEST
     • Multi-language fine-tuning
     (A data-side sketch of the three strategies follows below.)
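The three fine-tuning strategies differ mainly in how the data is arranged. The sketch below (function names and the stand-in translate function are mine) shows that difference: TRANSLATE-TRAIN machine-translates the English training set into the target language, TRANSLATE-TEST translates the target-language test set back into English at evaluation time, and multi-language fine-tuning pools the English data with its translations into all languages and trains one model on the mix.

```python
def translate_train(en_train, translate, target_lang):
    """TRANSLATE-TRAIN: translate the English training set into the target
    language and fine-tune there. `translate` is a stand-in MT function."""
    return [(translate(x, target_lang), y) for x, y in en_train]

def translate_test(target_test, translate):
    """TRANSLATE-TEST: fine-tune on English only and translate the
    target-language test set back into English at evaluation time."""
    return [(translate(x, "en"), y) for x, y in target_test]

def multi_language_finetune_data(en_train, translate, languages):
    """Multi-language fine-tuning: pool the English data with its machine
    translations into every language and fine-tune a single model on the mix."""
    pooled = list(en_train)
    for lang in languages:
        pooled += translate_train(en_train, translate, lang)
    return pooled

if __name__ == "__main__":
    fake_translate = lambda text, lang: f"<{lang}> {text}"   # stand-in for an MT system
    en_train = [("a man is sleeping", "neutral")]
    print(multi_language_finetune_data(en_train, fake_translate, ["fr", "de"]))
```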
  9. Experiment settings
     Datasets
     • Wikipedia (15 languages, for MLM)
     • MT datasets are collected from MultiUN, the IIT Bombay corpus, OpenSubtitles2018, the EUbookshop corpus and Global Voices
     Model
     • 12-layer Transformer with 1024 hidden units and 16 heads (sketch below)
     Tasks
     • Cross-lingual Natural Language Inference (XNLI)
     • Cross-lingual Question Answering (XQA)
       Ø They propose a new dataset, XQA.
       Ø XQA contains three languages: English, French and German.
       Ø Only English has training data.
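A stand-in PyTorch encoder with the dimensions quoted on the slide (12 layers, 1024 hidden units, 16 attention heads). The real Unicoder follows the XLM architecture, vocabulary and embeddings; the feed-forward size of 4096 below is my assumption (4x hidden, the usual Transformer default), not a number from the slides.

```python
import torch
import torch.nn as nn

# Stand-in encoder mirroring only the sizes on the slide, not Unicoder itself.
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16,
                                   dim_feedforward=4096,  # assumed 4x hidden
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)

tokens = torch.randn(2, 256, 1024)   # (batch, seq_len=256, hidden)
print(encoder(tokens).shape)         # torch.Size([2, 256, 1024])
```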
  10. Analysis
      • The relation between the number of languages and fine-tuning performance
        Ø The more languages they use, the better the performance.
        Ø Even English is improved by multi-language fine-tuning.
  11. Analysis
      • The relation between English and other languages
        Ø Most of the average results are improved by jointly fine-tuning two languages; only Vietnamese and Urdu lead to a performance drop.
        Ø The improvement on English is not stable: French and Spanish can improve English performance, but Vietnamese and Thai lead to a big drop.
  12. Summary
      • Unicoder with three new cross-lingual pre-training tasks
        Ø Cross-lingual word recovery
        Ø Cross-lingual paraphrase classification
        Ø Cross-lingual masked language model
      • The more languages used in fine-tuning, the better the results.
        Ø Even a rich-resource language (English) can be improved.