
KASYS at the NTCIR-15 WWW-3 Task
Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants in the WWW-3 Task
paper: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/pdf/ntcir/02-NTCIR15-WWW-ShindenK.pdf

Kohei Shinden

December 10, 2020

Transcript

  1. Background
     • NTCIR-15 WWW-3 Task ‒ an ad-hoc document retrieval task for web documents
     • Proposed search model using BERT (Birch) ‒ Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019 ‒ BERT has been successfully applied to a broad range of NLP tasks, including document ranking.
  2. Birch (Yilmaz et al., 2019)
     • Applies a sentence-level relevance estimator learned from QA and microblog search datasets to ad-hoc document retrieval:
       1. The sentence-level relevance estimator is obtained by fine-tuning a pre-trained BERT model on QA and microblog search data.
       2. BM25 scores are calculated for documents, and BERT scores for query–sentence pairs.
       3. The document score is a weighted sum of the BM25 score and the score of the highest BERT-scoring sentence in the document (a scoring sketch follows below).
     [Diagram: a pre-trained BERT model is fine-tuned into a sentence-level relevance model; the sentences of an example document about Halloween pictures ("Trick or Treat..." 0.7, "Children get candy..." 0.3, "Pumpkin sweets..." 0.1) are scored by BERT and combined with the document's BM25 score (BERT + BM25 = 0.6 in the example).]
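     As an illustration of step 2 above, the following is a minimal sketch (not the exact KASYS/Birch code) of scoring each sentence of a candidate document against the query with a fine-tuned BERT relevance classifier. The checkpoint path is a placeholder, and the assumption that label index 1 means "relevant" is hypothetical.

       # Sketch only: score (query, sentence) pairs with a fine-tuned BERT classifier.
       import torch
       from transformers import AutoModelForSequenceClassification, AutoTokenizer

       MODEL_PATH = "path/to/bert-finetuned-on-msmarco-then-mb"  # hypothetical checkpoint
       tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
       model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
       model.eval()

       def bert_sentence_scores(query: str, sentences: list[str]) -> list[float]:
           """Return a relevance probability for each (query, sentence) pair."""
           scores = []
           for sentence in sentences:
               inputs = tokenizer(query, sentence, truncation=True,
                                  max_length=512, return_tensors="pt")
               with torch.no_grad():
                   logits = model(**inputs).logits
               # Assumption: label 1 = "relevant"; take its softmax probability.
               scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
           return scores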
  3. Details of Birch
     • The document score is a weighted sum of the BM25 score and the scores of the highest BERT-scoring sentences in the document ‒ assuming that the most relevant sentences in a document are good indicators of document-level relevance [1]
     • f_BM25(d): the BM25 score of document d
     • f_BERT(p_i): the sentence-level relevance of the i-th highest-scoring sentence, obtained by BERT
     • w_i: a hyper-parameter to be tuned on a validation set (a sketch of the combination follows below)
     [1] Yilmaz et al.: Cross-Domain Modeling of Sentence-level Evidence for Document Retrieval, EMNLP 2019
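     A minimal sketch of the score combination, assuming the interpolation form used in the Birch paper (BM25 score interpolated with the top-k BERT sentence scores); the interpolation weight alpha and the per-sentence weights w_i correspond to the hyper-parameters tuned on the validation set, and the example values are illustrative only.

       def birch_score(bm25_score: float, sentence_scores: list[float],
                       weights: list[float], alpha: float = 0.5) -> float:
           """Combine a document's BM25 score with its top-k BERT sentence scores.

           Sketch of f(d) = alpha * f_BM25(d) + (1 - alpha) * sum_i w_i * f_BERT(p_i),
           where p_i is the i-th highest-scoring sentence; alpha and w_i are tuned
           on a validation set (form assumed from the Birch paper).
           """
           top_k = sorted(sentence_scores, reverse=True)[:len(weights)]
           bert_part = sum(w * s for w, s in zip(weights, top_k))
           return alpha * bm25_score + (1 - alpha) * bert_part

       # Illustrative "Top 3 sentences" example with made-up weights.
       print(birch_score(bm25_score=0.4, sentence_scores=[0.7, 0.3, 0.1],
                         weights=[1.0, 0.5, 0.25]))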
  4. Preliminary Experiment Details
     • Preliminary experiments to select datasets and hyper-parameters suitable for ranking web documents
     • Validation: the NTCIR-14 WWW-2 test collection (with its original qrels)
     • Candidate training data: Robust04, MS MARCO, TREC CAR, TREC MB
     • Models compared (the checkmarks in the original table indicate the datasets used for training):
       ‒ Model MB: TREC MB
       ‒ Model CAR: TREC CAR
       ‒ Model MS MARCO: MS MARCO
       ‒ Model CAR → MB: TREC CAR, then TREC MB
       ‒ Model MS MARCO → MB: MS MARCO, then TREC MB
  5. Preliminary Experiment Results & Discussion
     • Evaluated the prediction results of the Birch models ‒ "Top k sentences": uses the k sentences with the highest BERT scores for ranking
     • MS MARCO → MB is the best. Thus, we submitted runs based on MS MARCO → MB and CAR → MB.
     [Bar chart: nDCG@10 for the BM25 baseline and the MB, CAR, MS MARCO, CAR → MB, and MS MARCO → MB models with Top 1, Top 2, and Top 3 sentence variants; labelled values include 0.3098, 0.3112, 0.3103, 0.3266, 0.3312, and 0.3318.]
  6. Official Evaluation Results & Discussion
     • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
     • MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
     • MS MARCO and TREC CAR probably give better results because they are web-document retrieval datasets and contain a large amount of training data.
     • BERT is also effective for web document retrieval.
     • Submitted runs:
       ‒ KASYS-E-CO-NEW-1: MS MARCO → MB, Top 3 sentences
       ‒ KASYS-E-CO-NEW-4: MS MARCO → MB, Top 2 sentences
       ‒ KASYS-E-CO-NEW-5: CAR → MB, Top 3 sentences
     [Bar chart: nDCG, Q, ERR, and iRBU for the baseline and the three KASYS-E-CO-NEW runs; labelled values include 0.6935 (nDCG), 0.7123 (Q), 0.7959 (ERR), and 0.9389 (iRBU).]
  7. Summary of NEW Runs
     • Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants.
     • The effectiveness of BERT in ad-hoc web document retrieval tasks was verified.
     • MS MARCO → MB is the best. The CAR → MB model also achieved similar scores.
     • BERT is also effective for web document retrieval.
     • Runs: KASYS-E-CO-NEW-1 (MS MARCO → MB, Top 3 sentences), KASYS-E-CO-NEW-4 (MS MARCO → MB, Top 2 sentences), KASYS-E-CO-NEW-5 (CAR → MB, Top 3 sentences)
     [Bar chart: same nDCG, Q, ERR, and iRBU results as on the previous slide.]
  8. Abstract of REP runs
     • Replicating and reproducing the THUIR runs at the NTCIR-14 WWW-2 Task
     • Question: is the relationship between the models consistent across implementations? THUIR reported BM25 < LambdaMART (a learning-to-rank model); does the same hold for our (KASYS) BM25 and LambdaMART runs?
  9. Replication Procedure 1
     • WWW-2 and WWW-3 topics (e.g. "disney", "switch", "Canon", "honda", "Pokemon", "ice age") are run against the ClueWeb collection, and the web documents are ranked by the BM25 algorithm (e.g. 1st "Disney shop", 2nd "Tokyo Disney resort", 3rd "Disney official").
     • The ranked documents are fed into a feature-extraction program that extracts eight features, including tf, idf, document length, BM25, and LMIR scores (a sketch follows below).
     • The BM25 run ends here; the LambdaMART run continues from this point.
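     A minimal sketch of the kind of per query–document features listed above (summed tf, summed idf, tf-idf, document length, and a BM25 score). The exact eight features and the BM25 parameters used in the run are not specified here, so the feature set and parameter values below are illustrative.

       import math
       from collections import Counter

       def extract_features(query_terms, doc_terms, df, num_docs, avg_doc_len,
                            k1=0.9, b=0.4):
           """Sketch of simple learning-to-rank features for one query-document pair.
           (Illustrative only; the actual run used eight features, including LMIR scores.)"""
           tf = Counter(doc_terms)
           doc_len = len(doc_terms)
           idf = {t: math.log((num_docs - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5) + 1)
                  for t in query_terms}

           sum_tf = sum(tf[t] for t in query_terms)
           sum_idf = sum(idf[t] for t in query_terms)
           sum_tfidf = sum(tf[t] * idf[t] for t in query_terms)
           bm25 = sum(idf[t] * tf[t] * (k1 + 1) /
                      (tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
                      for t in query_terms)
           return [sum_tf, sum_idf, sum_tfidf, doc_len, bm25]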
  10. Replication Procedure 2
      • MQ Track: a dataset of topic–document relevance judgements.
      • The features extracted from each document (in LETOR format, e.g. "qid:001 1:0.2 ‥") are input to LambdaMART, which is trained on the MQ Track data and validated on the WWW-1 test collection (a training sketch follows below).
      • LambdaMART outputs the re-ranked web documents (e.g. 1st "Disney official", 2nd "Disney shop", 3rd "Tokyo Disney resort").
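      The slides do not name a specific LambdaMART implementation, so the following is a hedged sketch using LightGBM's lambdarank objective on LETOR-format feature files, training on MQ Track data and validating on WWW-1 as described above; the file names and hyper-parameters are illustrative assumptions.

        # Sketch only: train a LambdaMART ranker on MQ Track features, validate on WWW-1.
        import lightgbm as lgb
        from sklearn.datasets import load_svmlight_file  # reads LETOR/qid-formatted files

        # Hypothetical LETOR-format files produced by the feature-extraction program.
        X_train, y_train, qid_train = load_svmlight_file("mq_track.letor", query_id=True)
        X_valid, y_valid, qid_valid = load_svmlight_file("www1.letor", query_id=True)

        def group_sizes(qids):
            """Number of documents per query, in the order the queries appear."""
            sizes, last, count = [], None, 0
            for q in qids:
                if last is not None and q != last:
                    sizes.append(count)
                    count = 0
                last, count = q, count + 1
            sizes.append(count)
            return sizes

        ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=500,
                                learning_rate=0.05)
        ranker.fit(X_train, y_train, group=group_sizes(qid_train),
                   eval_set=[(X_valid, y_valid)], eval_group=[group_sizes(qid_valid)],
                   eval_at=[10])

        # Re-rank candidate documents by predicted score.
        scores = ranker.predict(X_valid)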
  11. Implementation Details
      • Features for learning to rank ‒ TF, IDF, TF-IDF, document length, BM25 score, and three language-model-based IR scores
      • Differences from the original paper:
        ‒ Although THUIR extracted the features from four fields (whole document, anchor text, title, and URL), we extracted the features from the whole document only
        ‒ Features are normalized by their maximum and minimum values, because feature normalization was not described in the original paper (a sketch follows below)
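      Since the original paper did not describe feature normalization, the min-max choice on this slide can be sketched as follows; whether the scaling is applied per query or over the whole collection is an assumption left open here.

        import numpy as np

        def min_max_normalize(features: np.ndarray) -> np.ndarray:
            """Scale each feature column to [0, 1] using its minimum and maximum values.
            (Per-query vs. collection-wide scaling is an implementation choice that the
            original paper did not specify.)"""
            mins = features.min(axis=0)
            maxs = features.max(axis=0)
            span = np.where(maxs - mins == 0, 1.0, maxs - mins)  # avoid division by zero
            return (features - mins) / span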
  12. Preliminary Evaluation Results with Original WWW-2 qrels
      • Our results are lower than the original results
      • LambdaMART results were above BM25 for all evaluation metrics
      • Succeeded in reproducing the run
      [Bar charts: nDCG@10, Q@10, and nERR@10 for LambdaMART and BM25, comparing our runs ("Ours") with the original THUIR runs ("Original").]
  13. Official Evaluation Results
      • BM25 results were above LambdaMART for all evaluation metrics
      • Failed to reproduce the run
      [Bar charts: nDCG, Q, ERR, and iRBU for LambdaMART and BM25 on the WWW-3 and WWW-2 official results.]
  14. Conclusion
      • In the original paper, LambdaMART gave better results than BM25; in our runs, on the contrary, BM25 performed better than LambdaMART
      • We failed to replicate and reproduce the original paper
      • Suggestions:
        ‒ In web search tasks, it is more effective to extract features from all fields
        ‒ It is better to clarify the feature-normalization method in a paper
  15. Summary of All Runs
      • NEW runs
        ‒ Achieved the best performance in terms of nDCG, Q, and iRBU among all the participants
        ‒ The effectiveness of BERT in ad-hoc web document retrieval tasks was verified
        ‒ MS MARCO → MB is the best; the CAR → MB model also achieved similar scores
        ‒ BERT is also effective for web document retrieval
      • REP runs
        ‒ In the original paper, LambdaMART gave better results than BM25; in our runs, on the contrary, BM25 performed better than LambdaMART
        ‒ We failed to replicate and reproduce the original paper