Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity

Shotaro Ishihara, Hono Shirai (Nikkei, Inc.)
The 16th International Workshop on Semantic Evaluation

Overview
• This paper presents our exploration of a BERT-based Bi-Encoder approach for predicting the similarity of two multilingual news articles.
• Our findings cover pretrained models, pooling methods, translation, data splitting, and the number of tokens.
• The weighted average ensemble of four models achieved a competitive result and ranked in the top 12.

Outline
• Introduction
• System Overview & Experimental Results
  ◦ RQ0: Cross-Encoder vs Bi-Encoder?
  ◦ RQ1: Which pretrained model works well?
  ◦ RQ2: What kind of pooling method is proper?
  ◦ RQ3: Is translating into English useful?
  ◦ RQ4: Is there some effect of data splitting and max length?
• Conclusion

Task 8: Multilingual news article similarity
• Given two news articles, predict their topical similarity (Chen et al., 2022).
  ◦ Input: headline and body
  ◦ Output: score from 1 to 4
  ◦ Eight language pairs appear in the training dataset.
  ◦ Ten additional language pairs appear in the evaluation dataset.

Cross-Encoder vs Bi-Encoder
• Cross-Encoder: feeds a pair of texts into a single encoder.
• Bi-Encoder: encodes each input independently.
• Cross-Encoder is the standard approach for supervised learning (Lin et al., 2021; Reimers and Gurevych, 2019).
• Still, it is important to try both architectures in the search for high performance.
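The two architectures differ mainly in how the text pair reaches the encoder. A minimal sketch of that difference, assuming a hypothetical `toy_encode` function that stands in for BERT (the function names and the toy encoder are illustrative and are not taken from the authors' code):

```python
from typing import Callable, List

Vector = List[float]


def toy_encode(text: str) -> Vector:
    """Hypothetical stand-in for BERT: maps text to a tiny 2-d vector."""
    tokens = text.split()
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]


def cross_encoder_features(a: str, b: str, encode: Callable[[str], Vector]) -> Vector:
    # Cross-Encoder: both texts pass through the encoder together as one sequence.
    return encode(f"{a} [SEP] {b}")


def bi_encoder_features(a: str, b: str, encode: Callable[[str], Vector]) -> List[Vector]:
    # Bi-Encoder: each text is encoded independently, giving two vectors u and v.
    return [encode(a), encode(b)]


pair = ("markets rally", "stocks rise sharply")
joint = cross_encoder_features(pair[0], pair[1], toy_encode)
u, v = bi_encoder_features(pair[0], pair[1], toy_encode)
```

The Cross-Encoder yields one joint representation per pair, while the Bi-Encoder yields one vector per article that must be combined afterwards.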

Research questions (RQ)
• RQ0: Cross-Encoder vs Bi-Encoder?
• RQ1: Which pretrained model works well?
• RQ2: What kind of pooling method is proper?
• RQ3: Is translating the other languages into English useful?
• RQ4: Is there some effect of data splitting and max length?


System overview
Final model overview: weighted average ensemble of four models
• bert-base-multilingual-uncased, 5 folds
• bert-base-multilingual-cased, 5 folds
• translation & bert-base-cased, 5 folds
• bert-base-multilingual-cased, 20 folds
• Ensemble weights shown on the slide: × 0.3, × 0.2, × 0.2, × 0.3
Base architecture
[Figure: title A + body A → BERT → last 4 [CLS] → u; title B + body B → BERT → last 4 [CLS] → v; features built from u and v (| u - v |, u * v) → fully connected → score]
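The base architecture combines the two article vectors u and v into pair features before the fully connected layer. A minimal sketch in plain Python, assuming the features are the element-wise terms | u - v | and u * v named on the slide; whether u and v themselves are also concatenated is not clear from the transcript, and the function names, weights, and bias below are hypothetical:

```python
from typing import List


def pair_features(u: List[float], v: List[float]) -> List[float]:
    """Concatenate the element-wise |u - v| and u * v terms."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return abs_diff + product


def linear_head(features: List[float], weights: List[float], bias: float) -> float:
    """A single fully connected output unit (assumed here to be a regression head)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


# Toy vectors standing in for the pooled BERT outputs of articles A and B.
u = [1.0, 2.0]
v = [3.0, 0.5]
features = pair_features(u, v)
score = linear_head(features, weights=[0.5, 0.5, 0.5, 0.5], bias=1.0)
```

Both terms are symmetric in u and v, so swapping articles A and B leaves the features unchanged.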

RQ0: Cross-Encoder vs Bi-Encoder?
• We compared two architectures.
[Figure: Cross-Encoder. title A and title B joined with [SEP] → BERT → pooling; body A and body B joined with [SEP] → BERT → pooling; pooled features → fully connected → score]

RQ0: Experimental results
Bi-Encoder worked better.

RQ1: Which pretrained model works well?
• We considered three pretrained models.
  ◦ bert-base-multilingual-uncased
  ◦ bert-base-multilingual-cased
  ◦ xlm-roberta-base
[Figure: base architecture with the encoder left open (？): title A + body A → ？ → last 4 [CLS] → u; title B + body B → ？ → last 4 [CLS] → v; features built from u and v (| u - v |, u * v) → fully connected → score]

RQ1: Experimental results
We used all of them for the final submission.

RQ2: What kind of pooling method is proper?
• CLS: Concatenate the last four representations of the [CLS] token.
• CNN: Use a convolutional neural network to extract sentence vectors.
• LSTM: Use a long short-term memory network to extract sentence vectors.
• MAX: Use max-pooling to extract sentence vectors.
[Figure: base architecture with the pooling step left open (？): title A + body A → BERT → ？ → u; title B + body B → BERT → ？ → v; features built from u and v (| u - v |, u * v) → fully connected → score]
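Two of the pooling methods above can be sketched without any deep learning library. The example assumes hidden states stored as nested lists shaped [layer][token][hidden_dim], with [CLS] at token position 0. The concatenation order, and max-pooling over all tokens of the last layer with no padding mask, are assumptions for illustration rather than the authors' exact settings:

```python
from typing import List

Layer = List[List[float]]  # one layer: [num_tokens][hidden_dim]


def cls_last_four(hidden_states: List[Layer]) -> List[float]:
    """CLS pooling: concatenate the position-0 vectors of the last four layers."""
    if len(hidden_states) < 4:
        raise ValueError("need at least four layers")
    pooled: List[float] = []
    for layer in hidden_states[-4:]:
        pooled.extend(layer[0])
    return pooled


def max_pool(last_layer: Layer) -> List[float]:
    """MAX pooling: element-wise maximum over all token vectors of one layer."""
    return [max(column) for column in zip(*last_layer)]


# Toy hidden states: 5 layers, 3 tokens, hidden_dim 2.
# The vector for layer l and token t is [l, 10 * l + t].
hidden_states = [
    [[float(l), float(10 * l + t)] for t in range(3)]
    for l in range(5)
]
cls_vector = cls_last_four(hidden_states)
max_vector = max_pool(hidden_states[-1])
```

CLS pooling produces a vector four times the hidden size, whereas max-pooling keeps the hidden size unchanged.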

RQ2: Experimental results
CLS outperformed the other three methods.

RQ3: Is translating into English useful?
• We examined translating all datasets into English and using an English pretrained model.
  ◦ Googletrans for the translation
  ◦ bert-base-cased as the pretrained model
[Figure: base architecture applied to translated text: title A + body A → BERT → last 4 [CLS] → u; title B + body B → BERT → last 4 [CLS] → v; features built from u and v (| u - v |, u * v) → fully connected → score]

RQ3: Experimental results
The translation approach did not improve on the performance of the multilingual models.

RQ4: Is there some effect of data splitting and max length?
• Data splitting
  ◦ The number of folds in cross-validation affects the number of training samples available to each model.
• Max length
  ◦ News articles tend to put important information early in the article, so a smaller max length may work well.
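Both factors in RQ4 come down to simple arithmetic. With k-fold cross-validation, each model trains on (k - 1) / k of the data, so 5 folds leave 80% for training and 20 folds leave 95%. The max length keeps only the leading tokens of an article. A minimal sketch with hypothetical helper names and a made-up token list:

```python
from typing import List


def train_fraction(n_folds: int) -> float:
    """Fraction of the data each model trains on in k-fold cross-validation."""
    if n_folds < 2:
        raise ValueError("cross-validation needs at least two folds")
    return (n_folds - 1) / n_folds


def truncate(tokens: List[str], max_length: int) -> List[str]:
    """Keep only the first max_length tokens, i.e. the start of the article."""
    return tokens[:max_length]


fraction_5 = train_fraction(5)    # 5 folds: 4/5 of the data per model
fraction_20 = train_fraction(20)  # 20 folds: 19/20 of the data per model

# Hypothetical tokenized article; a small max length keeps only the lead.
article = ["tokyo", "stocks", "closed", "higher", "on", "friday"]
lead = truncate(article, 3)
```

More folds give each model more training data, at the cost of training more models.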

RQ4: Experimental results
A larger number of data splits worked better.

RQ4: Experimental results
Performance got worse as the max length was decreased.

Weighted Average Ensemble
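A weighted average ensemble multiplies each model's prediction by its weight and sums the results. A minimal sketch using the four weights shown on the system overview slide (0.3, 0.2, 0.2, 0.3); the prediction values and the function name are hypothetical, and dividing by the weight total is an added safeguard for weights that do not sum to 1:

```python
from typing import List


def weighted_average(predictions: List[List[float]], weights: List[float]) -> List[float]:
    """Combine per-model predictions into one score per article pair."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per model")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Weights from the slide; predictions are made-up scores for two article pairs.
weights = [0.3, 0.2, 0.2, 0.3]
predictions = [
    [1.0, 4.0],  # model 1
    [2.0, 4.0],  # model 2
    [2.0, 3.0],  # model 3
    [1.0, 3.0],  # model 4
]
ensemble = weighted_average(predictions, weights)
```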


Conclusion
• This paper presents our exploration of a BERT-based Bi-Encoder approach for predicting the similarity of two multilingual news articles.
• Our findings cover pretrained models, pooling methods, translation, data splitting, and the number of tokens.
• The weighted average ensemble of four models achieved a competitive result and ranked in the top 12.
• https://github.com/upura/semeval2022-task8-multilingual-news-article-similarity
