Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity

Shotaro Ishihara and Hono Shirai. 2022. Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1208–1214, Seattle, United States. Association for Computational Linguistics.
https://aclanthology.org/2022.semeval-1.171/

Shotaro Ishihara

July 15, 2022

Transcript

  1. Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity
    Shotaro Ishihara, Hono Shirai (Nikkei, Inc.), shotaro.ishihara@nex.nikkei.com
    The 16th International Workshop on Semantic Evaluation
  2. Overview
    • This paper presents our exploration of a BERT-based Bi-Encoder approach for predicting the similarity of two multilingual news articles.
    • We report findings on pretrained models, pooling methods, translation, data splitting, and the number of tokens.
    • The weighted average ensemble of the four models achieved a competitive result and ranked in the top 12.
  3. Outline
    • Introduction
    • System Overview & Experimental Results
      ◦ RQ0: Cross-Encoder vs Bi-Encoder?
      ◦ RQ1: Which pretrained model works well?
      ◦ RQ2: What kind of pooling method is proper?
      ◦ RQ3: Is it useful for translating?
      ◦ RQ4: Is there some effect of data splitting and max length?
    • Conclusion
  4. Task 8: Multilingual news article similarity
    • Given two news articles, predict the similarity of their topics (Chen et al., 2022).
      ◦ Input: headline and body
      ◦ Output: score from 1 to 4
      ◦ Eight language pairs appear in the training dataset.
      ◦ Ten additional language pairs appear in the evaluation dataset.
  5. Cross-Encoder vs Bi-Encoder
    • Cross-Encoder: feeds a pair of texts into a single encoder.
    • Bi-Encoder: encodes each input independently.
    • Cross-Encoder is the standard approach for supervised learning (Lin et al., 2021; Reimers and Gurevych, 2019).
    • Still, it is important to try both types of architecture in the search for high performance.
  6. Research question (RQ)
    • RQ0: Cross-Encoder vs Bi-Encoder?
    • RQ1: Which pretrained model works well?
    • RQ2: What kind of pooling method is proper?
    • RQ3: Is it useful for translating the other language into English?
    • RQ4: Is there some effect of data splitting and max length?
  7. Outline
    • Introduction
    • System Overview & Experimental Results
      ◦ RQ0: Cross-Encoder vs Bi-Encoder?
      ◦ RQ1: Which pretrained model works well?
      ◦ RQ2: What kind of pooling method is proper?
      ◦ RQ3: Is it useful for translating into English?
      ◦ RQ4: Is there some effect of data splitting and max length?
    • Conclusion
  8. System overview
    • Final model overview: a weighted average ensemble of four models
      ◦ bert-base-multilingual-uncased, 5 folds
      ◦ bert-base-multilingual-cased, 5 folds
      ◦ translation & bert-base-cased, 5 folds
      ◦ bert-base-multilingual-cased, 20 folds
      ◦ Ensemble weights shown on the slide: × 0.3, × 0.2, × 0.2, × 0.3
    • Base architecture (diagram): title A + body A → BERT → last 4 [CLS] → u; title B + body B → BERT → last 4 [CLS] → v; features (u, v, |u - v|, u * v) → fully connected → score
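The feature step of the base architecture can be sketched in plain Python. This is a minimal illustration, assuming the four terms on the slide (u, v, |u - v|, u * v) are concatenated into a single vector and that the product is element-wise; the function name `pair_features` and the toy vectors are hypothetical and not taken from the authors' code.

```python
def pair_features(u, v):
    """Concatenate u, v, |u - v| and the element-wise product u * v.

    Assumption: the four terms are joined in this order; the slide only
    lists them, it does not state the order.
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


# Toy 3-dimensional embeddings standing in for the two article encodings.
u = [1.0, -2.0, 0.5]
v = [0.5, 1.0, 0.5]
features = pair_features(u, v)
print(len(features))  # 4 * 3 = 12 values go to the fully connected layer
```

With real encoders the same step would operate on the concatenated [CLS] vectors of each article, so the feature vector is four times the embedding size.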
  9. RQ0: Cross-Encoder vs Bi-Encoder?
    • We compared two architectures.
    • Cross-Encoder (diagram): title A [SEP] title B → BERT → pooling; body A [SEP] body B → BERT → pooling; features → fully connected → score
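The difference between the two architectures is mostly a difference in how the encoder inputs are assembled. The sketch below illustrates that with plain strings. It is a simplification under stated assumptions: the article dictionaries, the function names, and joining title and body with a space are all hypothetical, and a real tokenizer would insert the separator token itself rather than take a literal "[SEP]" string.

```python
SEP = "[SEP]"  # written literally here only for illustration


def bi_encoder_inputs(article_a, article_b):
    """One text per article; each text is encoded independently."""
    return [
        f"{article_a['title']} {article_a['body']}",
        f"{article_b['title']} {article_b['body']}",
    ]


def cross_encoder_inputs(article_a, article_b):
    """One text per field; both articles pass through the encoder together."""
    return [
        f"{article_a['title']} {SEP} {article_b['title']}",
        f"{article_a['body']} {SEP} {article_b['body']}",
    ]


a = {"title": "Title A", "body": "Body A"}
b = {"title": "Title B", "body": "Body B"}
print(bi_encoder_inputs(a, b))
print(cross_encoder_inputs(a, b))
```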
  10. RQ0: Experimental results
    • Bi-Encoder worked better.
  11. RQ1: Which pretrained model works well?
    • We considered three BERT models.
      ◦ bert-base-multilingual-uncased
      ◦ bert-base-multilingual-cased
      ◦ xlm-roberta-base
    • Diagram: the base architecture with the pretrained model left open (?): title A + body A → ? → last 4 [CLS] → u; title B + body B → ? → last 4 [CLS] → v; features (u, v, |u - v|, u * v) → fully connected → score
  12. RQ1: Experimental results
    • We used all of them for the final submission.
  13. RQ2: What kind of pooling method is proper?
    • CLS: Concatenate the last four representations of the CLS token.
    • CNN: Use a convolutional neural network to extract sentence vectors.
    • LSTM: Use a long short-term memory network to extract sentence vectors.
    • MAX: Use max-pooling to extract sentence vectors.
    • Diagram: the base architecture with the pooling method left open (?): title A + body A → BERT → ? → u; title B + body B → BERT → ? → v; features (u, v, |u - v|, u * v) → fully connected → score
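Two of the pooling options, CLS and MAX, are simple enough to sketch without a deep learning library. The example below is a hedged illustration on nested Python lists. It assumes that the [CLS] token sits at position 0 of every layer, that the four [CLS] vectors are concatenated from the earliest to the latest of the last four layers, and that max-pooling runs over the tokens of the final layer; none of these details are stated on the slide, and the function names are hypothetical.

```python
def cls_last4(hidden_states):
    """Concatenate the [CLS] vectors (token 0) of the last four layers.

    hidden_states: list of layers, each a list of per-token vectors.
    """
    pooled = []
    for layer in hidden_states[-4:]:
        pooled.extend(layer[0])
    return pooled


def max_pool(token_vectors):
    """Element-wise maximum over all token vectors of one layer."""
    return [max(values) for values in zip(*token_vectors)]


# Toy encoder output: 5 layers, 3 tokens, 2 dimensions.
# The vector of token t in layer l is [l, 10 * l + t].
hidden_states = [
    [[float(l), float(10 * l + t)] for t in range(3)] for l in range(5)
]
u_cls = cls_last4(hidden_states)      # 4 layers * 2 dims = 8 values
u_max = max_pool(hidden_states[-1])   # 2 values
print(u_cls)
print(u_max)
```

CLS pooling quadruples the vector size relative to a single layer, whereas MAX keeps the hidden size unchanged.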
  14. RQ2: Experimental results
    • CLS outperformed the other three methods.
  15. RQ3: Is it useful for translating into English?
    • We examined translating all datasets into English and using an English pretrained model.
      ◦ Googletrans for the translation
      ◦ bert-base-cased as the pretrained model
    • Diagram: title A + body A → BERT → last 4 [CLS] → u; title B + body B → BERT → last 4 [CLS] → v; features (u, v, |u - v|, u * v) → fully connected → score
  16. RQ3: Experimental results
    • The translation approach did not improve the performance of the multilingual models.
  17. RQ4: Is there some effect of data splitting and max length?
    • Data splitting
      ◦ The number of data partitions in cross-validation affects the number of available training samples.
    • Max length
      ◦ News articles tend to put important information early in the article, so a smaller max length may work well.
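The link between the number of folds and the amount of training data is plain arithmetic: in k-fold cross-validation each model holds out one fold and trains on the remaining k - 1. The sketch below works this out for the 5-fold and 20-fold settings used in the deck. The dataset size of 1,000 pairs is a hypothetical round number chosen for illustration, not the size of the task dataset.

```python
def train_size(n_samples, n_folds):
    """Samples left for training when one of n_folds folds is held out.

    Assumes n_samples divides evenly into n_folds for simplicity.
    """
    return n_samples - n_samples // n_folds


N_SAMPLES = 1000  # hypothetical dataset size, for illustration only

for n_folds in (5, 20):
    size = train_size(N_SAMPLES, n_folds)
    print(f"{n_folds} folds: {size} training samples per model")
```

Going from 5 to 20 folds raises the share of data each model trains on from 80% to 95%, at the cost of training four times as many models.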
  18. RQ4: Experimental results
    • A larger number of data splits worked better.
  19. RQ4: Experimental results
    • Performance got worse as the max length was decreased.
  20. Weighted Average Ensemble
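A weighted average ensemble combines the per-pair scores of several models into one score. The sketch below is a minimal version that uses the weights shown on the system overview slide (0.3, 0.2, 0.2, 0.3), assuming they apply to the four models in the order listed there. The prediction values are made up for illustration and the function name is hypothetical.

```python
def weighted_average(predictions, weights):
    """Combine per-model prediction lists into one list of scores.

    predictions: one list of scores per model, all of the same length.
    weights: one weight per model; normalised by their sum.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_items = len(predictions[0])
    return [
        sum(w * p[i] for p, w in zip(predictions, weights)) / total
        for i in range(n_items)
    ]


weights = [0.3, 0.2, 0.2, 0.3]  # from the system overview slide
# Made-up scores from four models for two article pairs.
predictions = [
    [1.0, 4.0],
    [2.0, 3.0],
    [2.0, 3.0],
    [1.0, 4.0],
]
ensemble = weighted_average(predictions, weights)
print(ensemble)  # roughly 1.4 for the first pair and 3.6 for the second
```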
  21. Outline
    • Introduction
    • System Overview & Experimental Results
      ◦ RQ0: Cross-Encoder vs Bi-Encoder?
      ◦ RQ1: Which pretrained model works well?
      ◦ RQ2: What kind of pooling method is proper?
      ◦ RQ3: Is it useful for translating?
      ◦ RQ4: Is there some effect of data splitting and max length?
    • Conclusion
  22. Conclusion
    • This paper presents our exploration of a BERT-based Bi-Encoder approach for predicting the similarity of two multilingual news articles.
    • We report findings on pretrained models, pooling methods, translation, data splitting, and the number of tokens.
    • The weighted average ensemble of the four models achieved a competitive result and ranked in the top 12.
    • https://github.com/upura/semeval2022-task8-multilingual-news-article-similarity