Slide 1

Slide 1 text

KASYS at the NTCIR-15 WWW-3 Task
Kohei Shinden, Atsuki Maruta, Makoto P. Kato
University of Tsukuba

Slide 2

Slide 2 text

Background
• NTCIR-15 WWW-3 Task
  ‒ Ad-hoc document retrieval task for web documents
• Proposed search model using BERT (Birch)
  ‒ Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
  ‒ BERT has been successfully applied to a broad range of NLP tasks, including document ranking

Slide 3

Slide 3 text

Birch (Yilmaz et al., 2019)
• Applies a sentence-level relevance estimator, learned from QA and microblog search datasets, to ad-hoc document retrieval (see the sketch below):
  1. The sentence-level relevance estimator is obtained by fine-tuning the pre-trained BERT model with QA and microblog search data.
  2. BM25 scores for the document and BERT scores for each query-sentence pair are calculated.
  3. The final score is a weighted sum of the BM25 score and the scores of the highest BERT-scoring sentences in the document.
[Figure: the pre-trained BERT model is fine-tuned on sentence-level relevance judgements; for an example query "Halloween Pictures", each sentence in a document receives a BERT score, and the top sentence score is combined with the document's BM25 score.]
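A minimal sketch of the Birch-style scoring step, assuming a fine-tuned sentence-level model exposed through a hypothetical bert_sentence_score(query, sentence) helper and a precomputed document-level BM25 score; the parameter names (weights, alpha) are illustrative and not the authors' actual code.

```python
# Sketch of Birch-style document scoring: combine the document's BM25 score
# with the top-k sentence scores from a fine-tuned BERT model.

from typing import Callable, List


def birch_score(
    query: str,
    sentences: List[str],          # sentences of one candidate document
    bm25_score: float,             # document-level BM25 score
    bert_sentence_score: Callable[[str, str], float],  # hypothetical helper
    weights: List[float],          # w_1..w_k for the top-k sentences (tuned on validation data)
    alpha: float = 0.5,            # interpolation weight between BM25 and BERT evidence
) -> float:
    # Score every sentence with the fine-tuned BERT model and keep the top-k.
    sent_scores = sorted(
        (bert_sentence_score(query, s) for s in sentences), reverse=True
    )
    top_k = sent_scores[: len(weights)]
    bert_evidence = sum(w * s for w, s in zip(weights, top_k))
    # Weighted sum of document-level BM25 and sentence-level BERT evidence.
    return alpha * bm25_score + (1.0 - alpha) * bert_evidence
```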

Slide 4

Slide 4 text

Details of Birch
• The document score is a weighted sum of the BM25 score and the scores of the highest BERT-scoring sentences in the document (reconstructed below)
  ‒ Assuming that the most relevant sentences in a document are good indicators of document-level relevance [1]
• f_BM25(d): the BM25 score of document d
• f_BERT(p_i): the sentence-level relevance of the i-th highest-scoring sentence, obtained by BERT
• w_i: a hyper-parameter tuned on a validation set
[1] Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
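A hedged reconstruction of the document scoring formula from the variable definitions above, following the interpolation form used in the original Birch paper (the interpolation weight alpha comes from that paper and is not stated on this slide):

```latex
\mathrm{score}(q, d) = \alpha \, f_{\mathrm{BM25}}(d) + (1 - \alpha) \sum_{i=1}^{k} w_i \, f_{\mathrm{BERT}}(p_i)
```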

Slide 5

Slide 5 text

Preliminary Experiment Details
• Preliminary experiments to select datasets and hyper-parameters suitable for ranking web documents
• Candidate datasets: Robust04, MS MARCO, TREC CAR, TREC MB; validation on the NTCIR-14 WWW-2 test collection (with its original qrels)
• Models and the datasets used for their training (checkmarks in the original table):
  ‒ Model MB: TREC MB
  ‒ Model CAR: TREC CAR
  ‒ Model MS MARCO: MS MARCO
  ‒ Model CAR → MB: TREC CAR, then TREC MB
  ‒ Model MS MARCO → MB: MS MARCO, then TREC MB

Slide 6

Slide 6 text

Preliminary Experiment Results & Discussion
• Evaluated the prediction results of the Birch models
  ‒ Top k sentences: uses the k sentences with the highest BERT scores for ranking
• MS MARCO → MB is the best; thus, we submitted runs based on MS MARCO → MB and CAR → MB
[Figure: nDCG@10 on the WWW-2 test collection for the BM25 baseline and the MB, CAR, MS MARCO, CAR → MB, and MS MARCO → MB models under the top-1/2/3-sentence settings; the values printed on the chart are 0.3098, 0.3112, 0.3103, 0.3266, 0.3312, and 0.3318.]

Slide 7

Slide 7 text

Official Evaluation Results & Discussion
• Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants
• MS MARCO → MB is the best; the CAR → MB model also achieved similar scores
• MS MARCO and TREC CAR probably give better results because they are web document retrieval datasets and contain a large amount of data
• BERT is also effective for web document retrieval
• Submitted runs:
  ‒ KASYS-E-CO-NEW-1: MS MARCO → MB, top 3 sentences
  ‒ KASYS-E-CO-NEW-4: MS MARCO → MB, top 2 sentences
  ‒ KASYS-E-CO-NEW-5: CAR → MB, top 3 sentences
[Figure: nDCG, Q, ERR, and iRBU for the baseline and the three KASYS runs; the values printed on the chart are 0.6935, 0.7123, 0.7959, and 0.9389.]

Slide 8

Slide 8 text

Summary of NEW Runs
• Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants
• The effectiveness of BERT in ad-hoc web document retrieval tasks was verified
• MS MARCO → MB is the best; the CAR → MB model also achieved similar scores
• BERT is also effective for web document retrieval
• Submitted runs:
  ‒ KASYS-E-CO-NEW-1: MS MARCO → MB, top 3 sentences
  ‒ KASYS-E-CO-NEW-4: MS MARCO → MB, top 2 sentences
  ‒ KASYS-E-CO-NEW-5: CAR → MB, top 3 sentences
[Figure: same nDCG/Q/ERR/iRBU chart as on the previous slide.]

Slide 9

Slide 9 text

REP Runs

Slide 10

Slide 10 text

Abstract of REP Runs
• Replicating and reproducing the THUIR runs at the NTCIR-14 WWW-2 Task
• Checking whether the relative ordering of the models in our results is consistent with THUIR's results
[Figure: THUIR reported BM25 < LambdaMART (learning-to-rank model); do we (KASYS) observe BM25 < LambdaMART as well?]

Slide 11

Slide 11 text

Replication Procedure 1
• For each WWW-2 and WWW-3 topic (e.g., "disney", "switch", "Canon", "honda", "Pokemon", "ice age"), documents in the ClueWeb collection are ranked by the BM25 algorithm; this is the BM25 run.
• The ranked web documents are then passed to a feature extraction program that extracts eight features (TF, IDF, document length, BM25, and LMIR scores); a feature-extraction sketch follows below.
• The LambdaMART run continues from here (see Procedure 2).
[Figure: pipeline from the topics and the ClueWeb collection, through BM25 ranking, to the feature extraction program.]
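A minimal sketch of per-(query, document) feature extraction, not THUIR's or the authors' actual code: corpus statistics (document frequencies, average document length) are assumed to be precomputed, the LMIR features are omitted for brevity, and the function name and parameters are illustrative.

```python
# Sketch of extracting learning-to-rank features (TF, IDF, TF-IDF, document
# length, BM25) for one query-document pair from precomputed corpus statistics.

import math
from collections import Counter
from typing import Dict, List


def extract_features(
    query_terms: List[str],
    doc_terms: List[str],
    doc_freq: Dict[str, int],   # term -> number of documents containing it
    num_docs: int,              # total number of documents in the collection
    avg_doc_len: float,         # average document length in the collection
    k1: float = 0.9,
    b: float = 0.4,
) -> List[float]:
    tf_counts = Counter(doc_terms)
    doc_len = len(doc_terms)

    tf = float(sum(tf_counts[t] for t in query_terms))
    idf = sum(
        math.log((num_docs - doc_freq.get(t, 0) + 0.5) / (doc_freq.get(t, 0) + 0.5) + 1.0)
        for t in query_terms
    )
    tf_idf = sum(
        tf_counts[t] * math.log(num_docs / (doc_freq.get(t, 0) + 1.0))
        for t in query_terms
    )
    bm25 = sum(
        math.log((num_docs - doc_freq.get(t, 0) + 0.5) / (doc_freq.get(t, 0) + 0.5) + 1.0)
        * tf_counts[t] * (k1 + 1.0)
        / (tf_counts[t] + k1 * (1.0 - b + b * doc_len / avg_doc_len))
        for t in query_terms
    )
    # [TF, IDF, TF-IDF, document length, BM25]; LMIR scores would be appended here.
    return [tf, idf, tf_idf, float(doc_len), bm25]
```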

Slide 12

Slide 12 text

Replication Procedure 2
• The feature extraction program outputs the extracted features in a qid-prefixed feature format (e.g., "qid:001 1:0.2 ...").
• LambdaMART is trained on the MQ Track data and validated on the WWW-1 test collection; it takes the extracted features as input and outputs the re-ranked web documents (a training sketch follows below).
• MQ Track: a dataset of relevance judgements between topics and documents.
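A minimal sketch of training a LambdaMART-style ranker, not THUIR's or the authors' actual code: it uses LightGBM's LGBMRanker with the lambdarank objective as a stand-in implementation, and the feature matrices, labels, and per-query group sizes (X_train, y_train, group_train) are hypothetical placeholders for data parsed from the extracted feature files.

```python
# Sketch of training a LambdaMART-style ranker with LightGBM and re-ranking
# the documents of one query by the predicted scores.

import numpy as np
import lightgbm as lgb

# Hypothetical training data: X_train (n_samples x n_features), y_train
# (graded relevance labels), group_train (documents per query, in order).
X_train = np.random.rand(100, 8)
y_train = np.random.randint(0, 3, size=100)
group_train = [20, 30, 25, 25]

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    n_estimators=100,
    learning_rate=0.1,
)
ranker.fit(X_train, y_train, group=group_train)

# Re-rank the candidate documents of one query by descending predicted score.
X_query = np.random.rand(10, 8)
order = np.argsort(-ranker.predict(X_query))
```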

Slide 13

Slide 13 text

Implementation Details
• Features for learning to rank
  ‒ TF, IDF, TF-IDF, document length, BM25 score, and three language-model-based IR scores
• Differences from the original paper
  ‒ Although THUIR extracted the features from four fields (whole document, anchor text, title, and URL), we extracted the features only from the whole document
  ‒ Features are normalized by their minimum and maximum values, because the normalization method was not described in the original paper (a small sketch follows below)
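A minimal sketch of min-max feature normalization, under the assumption that each feature column is scaled independently to [0, 1]; since the original paper does not describe its normalization, this is only an illustration of the scheme named on the slide.

```python
# Sketch of min-max normalization applied per feature column.

import numpy as np


def min_max_normalize(features: np.ndarray) -> np.ndarray:
    """Scale each feature column to [0, 1] using its minimum and maximum."""
    col_min = features.min(axis=0)
    col_max = features.max(axis=0)
    # Avoid division by zero for constant columns.
    denom = np.where(col_max - col_min == 0.0, 1.0, col_max - col_min)
    return (features - col_min) / denom
```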

Slide 14

Slide 14 text

Preliminary Evaluation Results with Original WWW-2 qrels
• Our results are lower than the original results
• LambdaMART results were above BM25 for all evaluation metrics
• Succeeded in reproducing the run
[Figure: nDCG@10, Q@10, and nERR@10 of BM25 and LambdaMART, comparing our runs with the original (THUIR) results.]

Slide 15

Slide 15 text

Official Evaluation Results
• BM25 results were above LambdaMART for all evaluation metrics
• Failed to reproduce the run
[Figures: nDCG, Q, ERR, and iRBU of BM25 and LambdaMART under the WWW-2 official results and the WWW-3 official results.]

Slide 16

Slide 16 text

Conclusion
• In the original paper, LambdaMART gave better results than BM25; on the contrary, our BM25 results were better than those of LambdaMART
• We failed to replicate and reproduce the original paper
Suggestions
• In web search tasks, it is more effective to extract features from all fields
• The normalization method should be clarified in the paper

Slide 17

Slide 17 text

Summary of All Runs
NEW runs
• Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants
• The effectiveness of BERT in ad-hoc web document retrieval tasks was verified
• MS MARCO → MB is the best; the CAR → MB model also achieved similar scores
• BERT is also effective for web document retrieval
REP runs
• In the original paper, LambdaMART gave better results than BM25; on the contrary, our BM25 results were better than those of LambdaMART
• We failed to replicate and reproduce the original paper