Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity

by Shotaro Ishihara

Slide 1

Slide 1 text

Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity
Shotaro Ishihara, Hono Shirai (Nikkei, Inc.)
The 16th International Workshop on Semantic Evaluation

Slide 2

Slide 2 text

Overview
● This paper presents our exploration of a BERT-based Bi-Encoder approach for predicting the similarity of two multilingual news articles.
● We report findings on pretrained models, pooling methods, translation, data splitting, and the number of tokens.
● A weighted average ensemble of four models achieved a competitive result and ranked in the top 12.

Slide 3

Slide 3 text

Outline
● Introduction
● System Overview & Experimental Results
○ RQ0: Cross-Encoder vs Bi-Encoder?
○ RQ1: Which pretrained model works well?
○ RQ2: What kind of pooling method is appropriate?
○ RQ3: Is it useful to translate into English?
○ RQ4: Do data splitting and max length have an effect?
● Conclusion

Slide 4

Slide 4 text

Task 8: Multilingual news article similarity
● Given two news articles, predict their topical similarity (Chen et al., 2022).
○ Input: headline and body
○ Output: score from 1 to 4
○ Eight language pairs appear in the training dataset.
○ Ten additional language pairs appear in the evaluation dataset.

Slide 5

Slide 5 text

Cross-Encoder vs Bi-Encoder
● Cross-Encoder: feeds a pair of texts into a single encoder.
● Bi-Encoder: encodes each input independently.
● The Cross-Encoder is the standard choice for supervised learning (Lin et al., 2021; Reimers and Gurevych, 2019).
● Still, it is important to try both architectures in the search for high performance.
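To make the distinction concrete, here is a minimal sketch of how the encoder inputs differ. It assumes the text is already tokenized and uses the usual BERT-style [CLS] and [SEP] markers. The function names are hypothetical and are not taken from the authors' repository.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def cross_encoder_input(tokens_a: List[str], tokens_b: List[str]) -> List[str]:
    """Build ONE sequence holding both texts, so a single encoder sees the pair jointly."""
    return [CLS] + tokens_a + [SEP] + tokens_b + [SEP]


def bi_encoder_inputs(tokens_a: List[str], tokens_b: List[str]) -> Tuple[List[str], List[str]]:
    """Build TWO sequences, one per text, so each text is encoded independently."""
    return [CLS] + tokens_a + [SEP], [CLS] + tokens_b + [SEP]


if __name__ == "__main__":
    a = ["stocks", "rise"]
    b = ["markets", "gain"]
    print(cross_encoder_input(a, b))
    print(bi_encoder_inputs(a, b))
```

In the Bi-Encoder case the two resulting vectors are only compared after encoding, which is why a separate feature-combination step is needed.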

Slide 6

Slide 6 text

Research questions (RQ)
● RQ0: Cross-Encoder vs Bi-Encoder?
● RQ1: Which pretrained model works well?
● RQ2: What kind of pooling method is appropriate?
● RQ3: Is it useful to translate the other languages into English?
● RQ4: Do data splitting and max length have an effect?

Slide 7

Slide 7 text

Outline
● Introduction
● System Overview & Experimental Results
○ RQ0: Cross-Encoder vs Bi-Encoder?
○ RQ1: Which pretrained model works well?
○ RQ2: What kind of pooling method is appropriate?
○ RQ3: Is it useful to translate into English?
○ RQ4: Do data splitting and max length have an effect?
● Conclusion

Slide 8

Slide 8 text

System overview
Final model overview: weighted average ensemble of four models (weights in the order shown on the slide: × 0.3, × 0.2, × 0.2, × 0.3)
● bert-base-multilingual-uncased, 5 folds
● bert-base-multilingual-cased, 5 folds
● translation & bert-base-cased, 5 folds
● bert-base-multilingual-cased, 20 folds
Base architecture
● title A + body A → BERT → last 4 [CLS] → u
● title B + body B → BERT → last 4 [CLS] → v
● features (u, v, |u - v|, u * v) → fully connected → score
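The feature step of the base architecture can be sketched in plain Python as follows. This is an illustration only. It assumes the four parts are concatenated in the order u, v, |u - v|, u * v, and the function name pair_features is hypothetical.

```python
from typing import List


def pair_features(u: List[float], v: List[float]) -> List[float]:
    """Combine two sentence vectors into one feature vector.

    Concatenates u, v, the element-wise absolute difference |u - v|,
    and the element-wise product u * v (order assumed for this sketch).
    """
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


if __name__ == "__main__":
    u = [1.0, 2.0]
    v = [3.0, -1.0]
    # The result has four times the dimension of u and would be fed
    # to the fully connected layer that outputs the score.
    print(pair_features(u, v))
```

The absolute difference and the product are both symmetric in u and v, so those two parts of the feature vector do not depend on which article is labelled A.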

Slide 9

Slide 9 text

RQ0: Cross-Encoder vs Bi-Encoder?
● We compared two architectures.
Cross-Encoder (diagram): title A and title B joined with [SEP] → BERT → pooling; body A and body B joined with [SEP] → BERT → pooling; features → fully connected → score

Slide 10

Slide 10 text

RQ0: Experimental results
The Bi-Encoder worked better.

Slide 11

Slide 11 text

RQ1: Which pretrained model works well?
● We considered three pretrained models.
○ bert-base-multilingual-uncased
○ bert-base-multilingual-cased
○ xlm-roberta-base
Diagram: title A + body A → ? → last 4 [CLS] → u; title B + body B → ? → last 4 [CLS] → v; features (u, v, |u - v|, u * v) → fully connected → score, where ? marks the pretrained model being compared

Slide 12

Slide 12 text

RQ1: Experimental results
We used all of them for the final submission.

Slide 13

Slide 13 text

RQ2: What kind of pooling method is appropriate?
● CLS: Concatenate the [CLS] token representations from the last four layers.
● CNN: Use a convolutional neural network to extract sentence vectors.
● LSTM: Use a long short-term memory network to extract sentence vectors.
● MAX: Use max-pooling to extract sentence vectors.
Diagram: title A + body A → BERT → ? → u; title B + body B → BERT → ? → v; features (u, v, |u - v|, u * v) → fully connected → score, where ? marks the pooling method being compared
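Two of the pooling methods above, CLS and MAX, are simple enough to sketch without a deep learning library. The example below is a toy illustration. It assumes hidden states are stored as nested lists indexed as [layer][token][dimension] with [CLS] at token position 0. The function names are hypothetical, and padding is ignored for brevity.

```python
from typing import List

Layer = List[List[float]]  # one layer: [token][dimension]


def cls_last4_pool(hidden_states: List[Layer]) -> List[float]:
    """Concatenate the [CLS] vector (token 0) from each of the last four layers."""
    if len(hidden_states) < 4:
        raise ValueError("need at least four layers")
    pooled: List[float] = []
    for layer in hidden_states[-4:]:
        pooled.extend(layer[0])
    return pooled


def max_pool(last_layer: Layer) -> List[float]:
    """Take the element-wise maximum over all tokens of one layer."""
    return [max(column) for column in zip(*last_layer)]


if __name__ == "__main__":
    # Toy input: 5 layers, 2 tokens, 2 dimensions.
    hidden = [[[float(l), l + 0.5], [l + 10.0, l - 10.0]] for l in range(5)]
    print(cls_last4_pool(hidden))  # dimension is 4 x hidden size
    print(max_pool(hidden[-1]))
```

Note that CLS pooling yields a vector four times the hidden size, while max-pooling keeps the hidden size unchanged.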

Slide 14

Slide 14 text

RQ2: Experimental results
CLS outperformed the other three methods.

Slide 15

Slide 15 text

RQ3: Is it useful to translate into English?
● We examined translating all datasets into English and using an English pretrained model.
○ Googletrans for the translation
○ bert-base-cased as the pretrained model
Diagram: title A + body A → BERT → last 4 [CLS] → u; title B + body B → BERT → last 4 [CLS] → v; features (u, v, |u - v|, u * v) → fully connected → score

Slide 16

Slide 16 text

RQ3: Experimental results
The translation approach did not improve on the performance of the multilingual models.

Slide 17

Slide 17 text

RQ4: Do data splitting and max length have an effect?
● Data splitting
○ The number of folds in cross-validation determines how many training samples are available to each model.
● Max length
○ News articles tend to put important information early, so a smaller max length may work well.
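The arithmetic behind both points can be sketched as follows. The dataset size of 1000 is a made-up number for illustration, not the size of the task dataset, and both function names are hypothetical. Truncation here keeps the head of the article, which is the assumption behind the max length argument.

```python
from typing import List


def min_train_size(n_samples: int, n_folds: int) -> int:
    """Smallest training set size in k-fold cross-validation.

    Each model trains on all folds except one held-out fold; the largest
    held-out fold has ceil(n_samples / n_folds) samples.
    """
    if not 2 <= n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and n_samples")
    largest_fold = -(-n_samples // n_folds)  # ceiling division
    return n_samples - largest_fold


def truncate_head(tokens: List[str], max_length: int) -> List[str]:
    """Keep only the first max_length tokens of an article."""
    return tokens[:max_length]


if __name__ == "__main__":
    n = 1000  # hypothetical dataset size
    for k in (5, 20):
        print(k, "folds ->", min_train_size(n, k), "training samples per model")
    print(truncate_head(["breaking", "news", ":", "markets", "rally"], 3))
```

With the hypothetical 1000 samples, 5 folds leave 800 samples for training each model and 20 folds leave 950, at the cost of training four times as many models.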

Slide 18

Slide 18 text

RQ4: Experimental results
A larger number of data splits worked better.

Slide 19

Slide 19 text

RQ4: Experimental results
Performance got worse as the max length was decreased.

Slide 20

Slide 20 text

Weighted Average Ensemble
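A weighted average ensemble can be sketched in a few lines. The weights 0.3, 0.2, 0.2, 0.3 are the ones listed on the system overview slide. Pairing each weight with a model by listing order is an assumption of this sketch, the prediction values are made up, and the function name is hypothetical.

```python
from typing import Sequence

# Weights as listed on the system overview slide. Which weight belongs to
# which model is assumed to follow the listing order (not confirmed here).
WEIGHTS = (0.3, 0.2, 0.2, 0.3)


def weighted_average(predictions: Sequence[float],
                     weights: Sequence[float] = WEIGHTS) -> float:
    """Combine one prediction per model into a single score."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one prediction per weight")
    total_weight = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total_weight


if __name__ == "__main__":
    # Made-up similarity scores from four models for one article pair.
    model_scores = [1.8, 2.1, 2.4, 2.0]
    print(weighted_average(model_scores))
```

Dividing by the sum of the weights keeps the result on the original score scale even if the weights do not sum to exactly one.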

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Conclusion
● This paper presents our exploration of a BERT-based Bi-Encoder approach for predicting the similarity of two multilingual news articles.
● We report findings on pretrained models, pooling methods, translation, data splitting, and the number of tokens.
● A weighted average ensemble of four models achieved a competitive result and ranked in the top 12.
● https://github.com/upura/semeval2022-task8-multilingual-news-article-similarity