Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity

Shotaro Ishihara and Hono Shirai. 2022. Nikkei at SemEval-2022 Task 8: Exploring BERT-based Bi-Encoder Approach for Pairwise Multilingual News Article Similarity. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1208–1214, Seattle, United States. Association for Computational Linguistics.
https://aclanthology.org/2022.semeval-1.171/

Shotaro Ishihara

July 15, 2022

Transcript

  1. Shotaro Ishihara, Hono Shirai (Nikkei, Inc.)
    [email protected]
    The 16th International Workshop on Semantic Evaluation
    Nikkei at SemEval-2022 Task 8:
    Exploring BERT-based Bi-Encoder
    Approach for Pairwise Multilingual
    News Article Similarity

  2. Overview
    ● This paper presents our exploration of a BERT-based Bi-Encoder
    approach for predicting the similarity of two multilingual news articles.
    ● We report findings on pretrained models, pooling methods,
    translation, data splitting, and the maximum number of tokens.
    ● A weighted average ensemble of four models achieved a
    competitive result and ranked in the top 12.

  3. Outline
    ● Introduction
    ● System Overview & Experimental Results
    ○ RQ0: Cross-Encoder vs Bi-Encoder?
    ○ RQ1: Which pretrained model works well?
    ○ RQ2: Which pooling method is appropriate?
    ○ RQ3: Is translating into English useful?
    ○ RQ4: Do data splitting and max length affect performance?
    ● Conclusion

  4. Task 8: Multilingual news article similarity
    ● Given two news articles, predict their
    topical similarity (Chen et al., 2022).
    ○ input: headline and body
    ○ output: score from 1 to 4
    ○ Eight language pairs in the
    training dataset.
    ○ Ten additional language pairs
    appear in the evaluation dataset.

  5. Cross-Encoder vs Bi-Encoder
    ● Cross-Encoder: feeds a pair of texts into a single encoder.
    ● Bi-Encoder: encodes each input independently.
    ● The Cross-Encoder is the standard choice for
    supervised learning approaches
    (Lin et al., 2021; Reimers and Gurevych, 2019).
    ● Still, it is important to try both
    types of architecture in the search
    for high performance.
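
The difference between the two architectures can be sketched in terms of how the encoder inputs are built. The snippet below is only an illustration, not the authors' code: the whitespace tokenizer and the helper names `cross_encoder_input` and `bi_encoder_inputs` are hypothetical, and the `[CLS]`/`[SEP]` layout is assumed to follow the usual BERT convention.

```python
from typing import List, Tuple


def cross_encoder_input(text_a: str, text_b: str) -> List[str]:
    """One joint sequence: the encoder sees both texts in a single pass."""
    return ["[CLS]", *text_a.split(), "[SEP]", *text_b.split(), "[SEP]"]


def bi_encoder_inputs(text_a: str, text_b: str) -> Tuple[List[str], List[str]]:
    """Two separate sequences: each text is encoded independently."""
    seq_a = ["[CLS]", *text_a.split(), "[SEP]"]
    seq_b = ["[CLS]", *text_b.split(), "[SEP]"]
    return seq_a, seq_b


# Toy article texts (hypothetical).
joint = cross_encoder_input("rates rise", "bank lifts rates")
left, right = bi_encoder_inputs("rates rise", "bank lifts rates")
```

With a Cross-Encoder the pair is scored from one encoder output, while a Bi-Encoder produces one vector per article and compares them afterwards.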

  6. Research questions (RQs)
    ● RQ0: Cross-Encoder vs Bi-Encoder?
    ● RQ1: Which pretrained model works well?
    ● RQ2: Which pooling method is appropriate?
    ● RQ3: Is it useful to translate the other languages into English?
    ● RQ4: Do data splitting and max length affect performance?

  7. Outline
    ● Introduction
    ● System Overview & Experimental Results
    ○ RQ0: Cross-Encoder vs Bi-Encoder?
    ○ RQ1: Which pretrained model works well?
    ○ RQ2: Which pooling method is appropriate?
    ○ RQ3: Is translating into English useful?
    ○ RQ4: Do data splitting and max length affect performance?
    ● Conclusion

  8. System overview
    Final model overview: weighted average ensemble of four models
    ○ bert-base-multilingual-uncased, 5 folds
    ○ bert-base-multilingual-cased, 5 folds
    ○ translation & bert-base-cased, 5 folds
    ○ bert-base-multilingual-cased, 20 folds
    Ensemble weights as listed on the slide: × 0.3, × 0.2, × 0.2, × 0.3
    Base architecture (diagram):
    title A + body A → BERT → last 4 [CLS] → u
    title B + body B → BERT → last 4 [CLS] → v
    features (| u - v |, u * v) → fully connected → score
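
The feature step of the base architecture can be sketched in plain Python. This is a minimal illustration under stated assumptions: the slide shows only `| u - v |` and `u * v` as features, so the sketch concatenates exactly those two (whether `u` and `v` themselves are also included is not stated), and `linear_head` is a hypothetical single-output stand-in for the fully connected layer.

```python
from typing import List


def pair_features(u: List[float], v: List[float]) -> List[float]:
    """Concatenate the element-wise |u - v| and u * v of two article vectors."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return abs_diff + product


def linear_head(features: List[float], weights: List[float], bias: float) -> float:
    """Hypothetical one-output fully connected layer: dot product plus bias."""
    return sum(f * w for f, w in zip(features, weights)) + bias


# Toy 2-dimensional article vectors (real BERT vectors are much larger).
u = [1.0, 2.0]
v = [3.0, -1.0]
features = pair_features(u, v)
score = linear_head(features, [0.5, 0.5, 0.5, 0.5], 1.0)
```

Because `u` and `v` come from separate encoder passes, the pair is only compared at this feature step.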

  9. RQ0: Cross-Encoder vs Bi-Encoder?
    ● We compared two architectures.
    Cross-Encoder (diagram):
    title A [SEP] title B → BERT → pooling
    body A [SEP] body B → BERT → pooling
    pooled features → fully connected → score

  10. RQ0: Experimental results
    Bi-Encoder worked better.

  11. RQ1: Which pretrained model works well?
    ● We considered three BERT models.
    ○ bert-base-multilingual-uncased
    ○ bert-base-multilingual-cased
    ○ xlm-roberta-base
    Architecture (diagram):
    title A + body A → [pretrained model] → last 4 [CLS] → u
    title B + body B → [pretrained model] → last 4 [CLS] → v
    features (| u - v |, u * v) → fully connected → score

  12. RQ1: Experimental results
    We used all of them for the final submission.

  13. RQ2: Which pooling method is appropriate?
    ● CLS: Concatenate the last four
    representations of the [CLS] token.
    ● CNN: Use a convolutional neural
    network to extract sentence vectors.
    ● LSTM: Use a long short-term
    memory network to extract sentence
    vectors.
    ● MAX: Use max-pooling to extract
    sentence vectors.
    Architecture (diagram):
    title A + body A → BERT → [pooling] → u
    title B + body B → BERT → [pooling] → v
    features (| u - v |, u * v) → fully connected → score
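
Two of the pooling options, CLS and MAX, can be sketched on toy data. The sketch assumes that hidden states are stored as `hidden_states[layer][token]`, that token 0 is `[CLS]`, that the four `[CLS]` vectors are concatenated from the fourth-to-last layer to the last, and that max-pooling is taken over the tokens of the last layer. None of these details is stated on the slide, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]
Layer = List[Vector]  # one vector per token; token 0 is assumed to be [CLS]


def cls_last4(hidden_states: List[Layer]) -> Vector:
    """Concatenate the [CLS] vector from each of the last four layers."""
    pooled: Vector = []
    for layer in hidden_states[-4:]:
        pooled.extend(layer[0])
    return pooled


def max_pool(last_layer: Layer) -> Vector:
    """Element-wise maximum over all token vectors of one layer."""
    return [max(column) for column in zip(*last_layer)]


# Toy hidden states: 5 layers, 3 tokens, dimension 2.
hidden_states = [
    [[float(10 * layer + tok), float(-(10 * layer + tok))] for tok in range(3)]
    for layer in range(5)
]

cls_vector = cls_last4(hidden_states)      # dimension 4 * 2 = 8
max_vector = max_pool(hidden_states[-1])   # dimension 2
```

Note that CLS pooling yields a vector four times the hidden size, so the fully connected layer after it has to be sized accordingly.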

  14. RQ2: Experimental results
    CLS outperformed the other three methods.

  15. RQ3: Is translating into English useful?
    ● We examined translating all datasets
    into English and using
    an English pretrained model.
    ○ Googletrans for the translation
    ○ bert-base-cased as the pretrained
    model
    Architecture (diagram):
    title A + body A → BERT → last 4 [CLS] → u
    title B + body B → BERT → last 4 [CLS] → v
    features (| u - v |, u * v) → fully connected → score

  16. RQ3: Experimental results
    The translation approach did not improve the
    performance of the multilingual models.

  17. RQ4: Do data splitting and max length affect performance?
    ● Data splitting
    ○ The number of folds in cross-validation determines the
    number of training samples available to each model.
    ● Max length
    ○ News articles tend to put important information early in the
    article, so a smaller max length may work well.
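
Both effects are easy to illustrate with small helpers. This is a sketch under assumptions: the dataset size of 1000 pairs is hypothetical (the real size is not given on the slide), folds are assumed to be near-equal in size, and the helper names are invented for illustration.

```python
from typing import List


def kfold_train_size(n_samples: int, n_splits: int) -> int:
    """Smallest training-set size when one of n_splits folds is held out."""
    largest_val_fold = -(-n_samples // n_splits)  # ceiling division
    return n_samples - largest_val_fold


def truncate(tokens: List[str], max_length: int) -> List[str]:
    """Keep only the first max_length tokens, i.e. the start of the article."""
    return tokens[:max_length]


N_PAIRS = 1000  # hypothetical dataset size, not the task's real size
for k in (5, 20):
    print(k, kfold_train_size(N_PAIRS, k))  # prints "5 800" then "20 950"

head = truncate("the central bank raised rates on friday".split(), 4)
```

Going from 5 to 20 folds raises the share of data each model trains on from 80% to 95%, at the cost of training four times as many models.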

  18. RQ4: Experimental results
    A larger number of data splits worked better.

  19. RQ4: Experimental results
    Performance got worse as the max length was decreased.

  20. Weighted Average Ensemble
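
A weighted average ensemble can be sketched as follows. The weights 0.3, 0.2, 0.2, 0.3 are the ones listed on the System overview slide; pairing them with the four models in listed order is an assumption of this sketch, and the per-model scores are made-up toy values.

```python
from typing import List, Sequence

# Weights as listed on the System overview slide. Pairing them with the
# four models in listed order is an assumption of this sketch.
WEIGHTS = [0.3, 0.2, 0.2, 0.3]


def weighted_average(predictions: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Blend per-model score lists into one list, normalizing by the weight sum."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_pairs = len(predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions)) / total
        for i in range(n_pairs)
    ]


# Toy scores (1 to 4) from four hypothetical models for two article pairs.
model_scores = [
    [1.0, 4.0],
    [2.0, 4.0],
    [2.0, 4.0],
    [3.0, 4.0],
]
blended = weighted_average(model_scores, WEIGHTS)
```

Dividing by the weight sum keeps the blended score on the original 1 to 4 scale even if the weights do not add up to exactly one.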

  21. Outline
    ● Introduction
    ● System Overview & Experimental Results
    ○ RQ0: Cross-Encoder vs Bi-Encoder?
    ○ RQ1: Which pretrained model works well?
    ○ RQ2: Which pooling method is appropriate?
    ○ RQ3: Is translating into English useful?
    ○ RQ4: Do data splitting and max length affect performance?
    ● Conclusion

  22. Conclusion
    ● This paper presents our exploration of a BERT-based Bi-Encoder
    approach for predicting the similarity of two multilingual news articles.
    ● We report findings on pretrained models, pooling methods,
    translation, data splitting, and the maximum number of tokens.
    ● A weighted average ensemble of four models achieved a
    competitive result and ranked in the top 12.
    ● https://github.com/upura/semeval2022-task8-multilingual-news-article-similarity
