
KASYS at the NTCIR-15 WWW-3 Task

Kohei Shinden
December 10, 2020

Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants in the WWW-3 Task
paper: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/pdf/ntcir/02-NTCIR15-WWW-ShindenK.pdf


Transcript

  1. Kohei Shinden, Atsuki Maruta, Makoto P. Kato
    University of Tsukuba
    KASYS at the NTCIR-15 WWW-3 Task


  2. Background
    • NTCIR-15 WWW-3 Task
    ‒ An ad-hoc document retrieval task over web documents
    • We adopt Birch, a BERT-based retrieval model
    ‒ Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for
    Document Retrieval, EMNLP 2019
    ‒ BERT has been successfully applied to a broad range of NLP tasks,
    including document ranking.


  3. Birch (Yilmaz et al., 2019)
    • Applies a sentence-level relevance estimator, learned from QA and
    microblog search datasets, to ad-hoc document retrieval
    1. The sentence-level relevance estimator is obtained by fine-tuning a
    pre-trained BERT model on QA and microblog search data.
    2. Compute the BM25 score of each document and a BERT score for each
    sentence in the document with respect to the query.
    3. Score each document by a weighted sum of its BM25 score and the scores
    of its highest BERT-scoring sentences.
    [Figure: a pre-trained BERT model is fine-tuned on sentence-level relevance
    judgement datasets; for the query "Halloween Pictures", each sentence in a
    document receives a BERT score (Trick or Treat... 0.7, Children get candy...
    0.3, Pumpkin sweets... 0.1), which is combined with the document's BM25
    score (0.4) into the final score (BERT + BM25 = 0.6).]
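    Steps 2-3 can be sketched in a few lines of Python. This is a minimal,
    illustrative sketch: bert_sentence_score, the sentence weights w, and the
    interpolation weight alpha are assumed placeholders, not the exact KASYS
    implementation.

        def birch_score(query, doc_sentences, bm25_score, bert_sentence_score,
                        w=(1.0, 0.5, 0.3), alpha=0.5):
            """Combine a document's BM25 score with its top-k BERT sentence scores."""
            # Score every sentence in the document against the query with BERT.
            sentence_scores = sorted(
                (bert_sentence_score(query, s) for s in doc_sentences),
                reverse=True,
            )
            # Weighted sum over the k highest-scoring sentences (k = len(w)).
            bert_part = sum(w_i * s_i for w_i, s_i in zip(w, sentence_scores))
            # Interpolate document-level (BM25) and sentence-level (BERT) evidence.
            return alpha * bm25_score + (1 - alpha) * bert_part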


  4. Details of Birch
    • The document score is a weighted sum of the BM25 score and the scores of
    the highest BERT-scoring sentences in the document
    ‒ Assuming that the most relevant sentences in a document are good
    indicators of the document-level relevance [1]
    • f_BM25(d): the BM25 score of document d
    • f_BERT(p_i): the sentence-level relevance of the top i-th sentence,
    obtained by BERT
    • w_i: a hyper-parameter to be tuned with a validation set
    [1] Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
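    Written out with these definitions, and assuming the interpolation form used
    in the Birch paper [1] (the weight \alpha below comes from that paper, not
    from this slide), the score of document d using its top k sentences is:

        S(d) = \alpha \, f_{\mathrm{BM25}}(d)
             + (1 - \alpha) \sum_{i=1}^{k} w_i \, f_{\mathrm{BERT}}(p_i)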


  5. Preliminary Experiment Details
    • Preliminary experiments to select datasets and hyper-parameters suitable
    for ranking web documents

    Model         | Train: Robust04 | Train: MS MARCO | Train: TREC CAR | Train: TREC MB | Validation: NTCIR-14 WWW-2 test collection (with its original qrels)
    MB            |                 |                 |                 | ✓              | ✓
    CAR           |                 |                 | ✓               |                | ✓
    MS MARCO      |                 | ✓               |                 |                | ✓
    CAR → MB      |                 |                 | ✓               | ✓              | ✓
    MS MARCO → MB |                 | ✓               |                 | ✓              | ✓

    Checkmarks indicate the datasets used to train and validate each model;
    the arrow notation is illustrated in the sketch below.
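    The arrow denotes sequential fine-tuning: for example, MS MARCO → MB
    fine-tunes BERT on MS MARCO first and then continues fine-tuning on the TREC
    Microblog (MB) data. A minimal sketch, in which load_pretrained_bert,
    fine_tune, and load_pairs are hypothetical helpers standing in for the real
    training code:

        # Sequential fine-tuning ("MS MARCO -> MB"); the helpers below are
        # hypothetical placeholders, not the actual training code.
        model = load_pretrained_bert("bert-large-uncased")

        # First fine-tune on MS MARCO relevance pairs, then continue
        # fine-tuning the same weights on TREC Microblog data.
        model = fine_tune(model, load_pairs("ms-marco"))
        model = fine_tune(model, load_pairs("trec-microblog"))

        # The sentence weights w_i are then tuned on the validation set
        # (the NTCIR-14 WWW-2 test collection with its original qrels).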


  6. Preliminary Experiment Results & Discussion
    • Evaluated the predictions of the Birch models
    ‒ Top k sentences: uses the k sentences with the highest BERT scores for ranking
    • MS MARCO → MB is the best; thus, we submitted runs based on
    MS MARCO → MB and CAR → MB.
    [Bar chart: nDCG on the WWW-2 test collection (y-axis 0 to 0.5), with
    top-1/2/3-sentence variants per model; labelled scores: BM25 baseline 0.3098,
    MB 0.3112, CAR 0.3103, MS MARCO 0.3266, CAR → MB 0.3312, MS MARCO → MB 0.3318]


  7. Official Evaluation Results & Discussion
    • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
    • MS MARCO → MB is the best; the CAR → MB model also achieved similar scores.
    • MS MARCO and TREC CAR probably give better results because they are
    web-document retrieval datasets with a large amount of data.
    • BERT is also effective for web document retrieval.
    Submitted runs:
    KASYS-E-CO-NEW-1: MS MARCO → MB, top 3 sentences
    KASYS-E-CO-NEW-4: MS MARCO → MB, top 2 sentences
    KASYS-E-CO-NEW-5: CAR → MB, top 3 sentences
    [Bar chart: WWW-3 official results (nDCG, Q, ERR, iRBU; y-axis 0 to 1) for
    the Baseline and the KASYS-E-CO-NEW-1/4/5 runs; labelled values: nDCG 0.6935,
    Q 0.7123, ERR 0.7959, iRBU 0.9389]


  8. Summary of NEW Runs
    • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
    • The effectiveness of BERT in ad-hoc web document retrieval tasks was verified.
    • MS MARCO → MB is the best; the CAR → MB model also achieved similar scores.
    • BERT is also effective for web document retrieval.
    KASYS-E-CO-NEW-1: MS MARCO → MB, top 3 sentences
    KASYS-E-CO-NEW-4: MS MARCO → MB, top 2 sentences
    KASYS-E-CO-NEW-5: CAR → MB, top 3 sentences
    [Bar chart: WWW-3 official results (nDCG, Q, ERR, iRBU; y-axis 0 to 1) for
    the Baseline and the KASYS-E-CO-NEW-1/4/5 runs; labelled values: nDCG 0.6935,
    Q 0.7123, ERR 0.7959, iRBU 0.9389]


  9. REP Runs


  10. Abstract of REP Runs
    • Replicating and reproducing the THUIR runs at the NTCIR-14 WWW-2 Task
    • Checking whether the relative results between the models are consistent
    with the original results:
    THUIR:        BM25 < LambdaMART (learning-to-rank model)
    KASYS (ours): BM25 < LambdaMART (learning-to-rank model)


  11. Replication Procedure 1
    • Input: WWW-2 and WWW-3 topics (disney, switch, Canon, honda, Pokemon,
    ice age, ...) and the ClueWeb collection
    • Documents are ranked by the BM25 algorithm, producing ranked web documents
    per topic (e.g., 1st: Disney shop, 2nd: Tokyo Disney resort, 3rd: Disney
    official, ...); this part constitutes the BM25 run
    • The ranked documents are then fed into a feature-extracting program that
    extracts eight features: TF, IDF, document length, BM25, and LMIR scores
    (see the feature-extraction sketch below)
    • The LambdaMART run continues from these features (next slide)
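    A minimal sketch of the per query-document feature extraction on the
    whole-document field. The exact term weighting, smoothing, and tokenization
    used by THUIR and in our replication are not given on the slides, so the
    formulas and parameters below are illustrative assumptions:

        import math
        from collections import Counter

        def extract_features(query_terms, doc_terms, df, n_docs, avg_dl,
                             coll_tf, coll_len, k1=0.9, b=0.4, mu=2500.0):
            """Illustrative learning-to-rank features for one query-document pair."""
            tf = Counter(doc_terms)
            dl = len(doc_terms)
            idf = {t: math.log((n_docs + 1) / (df.get(t, 0) + 1)) for t in query_terms}
            return {
                "tf": sum(tf[t] for t in query_terms),
                "idf": sum(idf.values()),
                "tf_idf": sum(tf[t] * idf[t] for t in query_terms),
                "doc_len": dl,
                # BM25 with a standard Robertson/Sparck Jones IDF (assumed parameters).
                "bm25": sum(
                    math.log(1 + (n_docs - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5))
                    * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * dl / avg_dl))
                    for t in query_terms),
                # One language-model (LMIR) score with Dirichlet smoothing; the
                # slides list three LMIR variants, only one is sketched here.
                "lmir_dir": sum(
                    math.log((tf[t] + mu * (coll_tf.get(t, 0) + 1) / coll_len)
                             / (dl + mu))
                    for t in query_terms),
            }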


  12. Replication Procedure 2
    • MQ Track: a dataset of topic-document relevance judgements
    • The feature-extracting program outputs one feature vector per
    query-document pair in LETOR format (e.g., qid:001 1:0.2 ...)
    • LambdaMART is trained on the MQ Track data and validated on the WWW-1
    test collection (a training sketch follows below)
    • The trained model outputs the re-ranked web documents per topic
    (e.g., 1st: Disney official, 2nd: Disney shop, 3rd: Tokyo Disney resort, ...)
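    A minimal sketch of this training step using LightGBM's LambdaMART
    implementation. LightGBM is only one plausible toolkit, and the loaders and
    hyper-parameters below are assumptions, not the settings used in the runs:

        import lightgbm as lgb
        import numpy as np

        # Hypothetical loaders returning LETOR-style features, relevance labels,
        # and query ids (rows assumed to be sorted by query id).
        X_train, y_train, qid_train = load_letor("mq-track")
        X_valid, y_valid, qid_valid = load_letor("www1-test-collection")
        X_test, _, qid_test = load_letor("www2-www3-topics")

        def group_sizes(qids):
            """Number of documents per query (rows assumed sorted by query id)."""
            _, counts = np.unique(qids, return_counts=True)
            return counts

        ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=500,
                                learning_rate=0.05)
        ranker.fit(X_train, y_train, group=group_sizes(qid_train),
                   eval_set=[(X_valid, y_valid)],
                   eval_group=[group_sizes(qid_valid)], eval_at=[10])

        # Re-rank the BM25 candidates: sort the documents of each query
        # by the predicted LambdaMART score.
        scores = ranker.predict(X_test)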


  13. Implementation Details
    • Features for learning to rank
    ‒ TF, IDF, TF-IDF, document length, BM25 score, and three
    language-model-based IR scores
    • Differences from the original paper
    ‒ While THUIR extracted the features from four fields (whole document,
    anchor text, title, and URL), we extracted them only from the whole document
    ‒ Features are normalized with their maximum and minimum values (min-max
    normalization), because feature normalization was not described in the
    original paper (see the sketch below)
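    A minimal sketch of the min-max normalization described above. Whether it
    was applied per query or over the whole feature matrix is not stated on the
    slide; per feature column over the whole matrix is assumed here:

        import numpy as np

        def min_max_normalize(X):
            """Scale each feature column to [0, 1] (assumed granularity: whole matrix)."""
            x_min = X.min(axis=0)
            x_max = X.max(axis=0)
            span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid divide-by-zero
            return (X - x_min) / span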


  14. Preliminary Evaluation Results with Original WWW-2 qrels
    • Our results are lower than the original results
    • LambdaMART results were above BM25 for all evaluation metrics
    • Succeeded in reproducing the run
    [Three bar charts comparing Ours vs. Original for LambdaMART and BM25 on the
    original WWW-2 qrels, one per evaluation metric (y-axis ranges roughly
    0.43-0.51, 0.30-0.36, and 0.28-0.34)]


  15. Official Evaluation Results
    • BM25 results were above LambdaMART for all evaluation metrics
    • Failed to reproduce the run
    [Two bar charts of the WWW-3 and WWW-2 official results (nDCG, Q, ERR, iRBU)
    for LambdaMART vs. BM25 (y-axis roughly 0.6-0.95 and 0.5-0.95, respectively)]


  16. Conclusion
    • In the original paper, LambdaMART gave better results than BM25; in our
    runs, on the contrary, BM25 was better than LambdaMART
    • We failed to replicate and reproduce the original paper
    Suggestions
    • In web search tasks, it is more effective to extract features from all fields
    • The feature normalization method should be clarified in the paper


  17. Summary of All Runs
    NEW runs
    • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants
    • The effectiveness of BERT in ad-hoc web document retrieval tasks was verified.
    • MS MARCO → MB is the best; the CAR → MB model also achieved similar scores.
    • BERT is also effective for web document retrieval.
    REP runs
    • In the original paper, LambdaMART gave better results than BM25; in our
    runs, on the contrary, BM25 was better than LambdaMART
    • We failed to replicate and reproduce the original paper
