Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models

Tao Chen
April 12, 2022

Presented at the 44th European Conference on Information Retrieval (ECIR'22)

Transcript

  1. Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models
     Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, Marc Najork
     Paper | Slides
  2. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?"
     Example doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical exact match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
  3. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?"
     Example doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical exact match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
     Semantic match: deep retrieval model (e.g., dual-encoder BERT)
     - Winner in many passage retrieval tasks
     - Training is expensive (computation, labeled data)
  4. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?"
     Example doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical exact match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
     Semantic match: deep retrieval model (e.g., dual-encoder BERT)
     - Winner in many passage retrieval tasks
     - Training is expensive (computation, labeled data)
     Can we apply the model to a new domain without fine-tuning? (Both paradigms are sketched in code below.)
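To make the two paradigms concrete, here is a minimal sketch using off-the-shelf stand-ins (rank_bm25 for the lexical model and a small sentence-transformers dual encoder for the deep model); this is illustrative only, not the models evaluated in the talk.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["A fasting blood sugar of 99 mg/dL or less is considered within the normal range."]
query = "What is blood sugar range?"

# Lexical exact match: BM25 scores documents by overlapping query terms,
# so it is cheap and domain-agnostic but suffers from vocabulary mismatch.
bm25 = BM25Okapi([d.lower().split() for d in docs])
lexical_scores = bm25.get_scores(query.lower().split())

# Semantic match: a dual encoder embeds query and documents independently
# and scores them by vector similarity, so paraphrases can still match.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
semantic_scores = util.cos_sim(encoder.encode(query), encoder.encode(docs))
```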
  5. Three Research Questions
     - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
     - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
     - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  6. Existing Hybrid Retrieval Models
     1. Interpolation of lexical and deep retrieval models (the most common approach, e.g., [Karpukhin et al., 2020; Luan et al., 2021]); needs raw score normalization and weight tuning (sketched below)
     2. Use RM3 (top lexical results are treated as relevance feedback) to select deep retriever results and combine them with the top lexical results [Kuzi et al., 2020]
     3. Alternately combine the top results of the two models [Zhan et al., 2020]
     Existing models were not evaluated in a zero-shot setup
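As a rough illustration of the interpolation approach in item 1 (our own sketch, not any of the cited implementations), raw scores from the two models are first normalized and then mixed with a weight ɑ that has to be tuned per dataset:

```python
def minmax_normalize(scores):
    """Map raw retrieval scores to [0, 1] so lexical and deep scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def interpolate(lexical_scores, deep_scores, alpha=0.5):
    """Weighted sum of normalized scores. The weight alpha must be tuned,
    which is exactly the step a non-parametric fusion method avoids."""
    lex, deep = minmax_normalize(lexical_scores), minmax_normalize(deep_scores)
    fused = {d: alpha * lex.get(d, 0.0) + (1 - alpha) * deep.get(d, 0.0)
             for d in set(lex) | set(deep)}
    return sorted(fused, key=fused.get, reverse=True)
```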
  7. Three Research Questions
     - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
     - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
     - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
     We are the first to propose and evaluate a hybrid retrieval model in a zero-shot setup
  8. Our proposal: Non-parametric Hybrid Model
     - Goal: a hybrid model that can be easily applied to a new domain in a zero-shot setting
     - A simple fusion framework based on Reciprocal Rank Fusion (RRF):
       RRF(q, d) = Σ_{m ∈ Models} 1 / (k + rank_m(q, d))
       where rank_m(q, d) is the rank assigned to doc d for query q by model m, and k = 60 (following the original paper [Cormack et al., 2009])
     - Non-parametric
     - Easy to fuse more than two models
     - Flexible to plug/unplug a model
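A minimal sketch of the RRF fusion step (our own illustrative code, assuming each model contributes a ranked list of doc ids for the query):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion [Cormack et al., 2009].

    ranked_lists: one ranking per model, each a list of doc ids ordered from
    best to worst. Only ranks are used, so no score normalization or weight
    tuning is needed, and any number of models can be plugged in or out.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., RRF(2, 3, 4): fuse the BM25+Bo1, BM25+docT5query, and NPR runs
# fused = rrf_fuse([bm25_bo1_run, bm25_doct5query_run, npr_run])
```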
  9. Our proposal: Using RRF to Fuse Lexical and Deep Retrieval Models
     Lexical retrieval models:
     - BM25
     - BM25+RM3 (query expansion)
     - BM25+Bo1 (query expansion)
     - BM25+docT5query (doc expansion) [Nogueira and Lin, 2019]: uses T5 (trained on MS-MARCO passage) to generate queries (i.e., expansions) for a given passage
     Deep retrieval model: NPR [Lu et al., 2021]
     - A dual-encoder BERT improved with synthetic data pre-training and hard negative sampling
     - Trained on the MS-MARCO passage dataset, with competitive performance: Recall@1K (dev) 97.5
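For concreteness, docT5query-style document expansion can be sketched with Hugging Face Transformers as below; the checkpoint name is the publicly released doc2query model and is assumed here for illustration, and may differ from the exact model used in this work.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Publicly released docT5query checkpoint (assumed here for illustration).
name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

passage = ("A fasting blood sugar of 99 mg/dL or less is considered "
           "within the normal range.")
inputs = tokenizer(passage, return_tensors="pt", truncation=True)

# Sample a few likely queries and append them to the passage before indexing,
# so BM25 can match query terms that never appear in the original text.
outputs = model.generate(inputs.input_ids, max_length=64, do_sample=True,
                         top_k=10, num_return_sequences=3)
expansions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
expanded_passage = passage + " " + " ".join(expansions)
```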
  10. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  11. Five Datasets with Varied Domain Shift
      Dataset           | Domain      | Task               | # Query | # Corpus | Avg. relevant docs / query
      MS-MARCO passage  | Misc.       | Passage retrieval  | 6980    | 8.8M     | 1.1    (in-domain)
      MS-MARCO doc      | Misc.       | Document retrieval | 5193    | 3.2M     | 1.1    (in-domain but different task)
      ORCAS             | Misc.       | Document retrieval | 9670    | 1.4M     | 1.8    (different queries but same docs)
      Robust04          | News        | Document retrieval | 250     | 528K     | 69.9   (news domain)
      TREC-COVID        | Bio-medical | Document retrieval | 50      | 191K     | 493.5  (very different domain)
      Domain shift increases from top to bottom
  12. Query and document expansion are useful for BM25
      Model               | MS-MARCO passage      | MS-MARCO doc          | ORCAS                 | Robust04              | TREC-COVID
      1. BM25             | 87.18 / 19.32         | 90.91 / 26.50         | 77.52 / 27.10         | 72.84 / 26.91         | 49.29 / 27.86
      2. BM25+Bo1         | 88.27 (+1.3%) / 17.95 | 91.64 (+0.8%) / 22.69 | 78.85 (+1.7%) / 24.53 | 79.02 (+8.5%) / 30.83 | 52.58 (+6.7%) / 30.98
      3. BM25+docT5query  | 94.07 (+7.9%) / 26.09 | 93.18 (+2.5%) / 30.28 | 79.62 (+2.7%) / 30.28 | 74.64 (+2.5%) / 28.01 | 50.66 (+2.8%) / 28.77
      (each cell: R@1K / MAP; percentages are the relative R@1K change vs. row 1, BM25)
  13. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  14. Deep retrieval model generalizes poorly on out-of-domain datasets
      Model               | MS-MARCO passage       | MS-MARCO doc          | ORCAS                 | Robust04              | TREC-COVID
      1. BM25             | 87.18 / 19.32          | 90.91 / 26.50         | 77.52 / 27.10         | 72.84 / 26.91         | 49.29 / 27.86
      2. BM25+Bo1         | 88.27 (+1.3%) / 17.95  | 91.64 (+0.8%) / 22.69 | 78.85 (+1.7%) / 24.53 | 79.02 (+8.5%) / 30.83 | 52.58 (+6.7%) / 30.98
      3. BM25+docT5query  | 94.07 (+7.9%) / 26.09  | 93.18 (+2.5%) / 30.28 | 79.62 (+2.7%) / 30.28 | 74.64 (+2.5%) / 28.01 | 50.66 (+2.8%) / 28.77
      4. NPR              | 97.95 (+12.4%) / 35.47 | 95.46 (+5.0%) / 30.36 | 81.18 (+4.7%) / 28.29 | 70.28 (-3.5%) / 28.39 | 37.58 (-23.8%) / 17.14
      (each cell: R@1K / MAP; percentages are the relative R@1K change vs. row 1, BM25)
      - Good in-domain task generalization (passage -> doc retrieval)
      - Poor performance on out-of-domain datasets that are very different from the training domain
  15. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  16. Lexical and deep retrieval models are complementary
      [Figure: overlap of relevant results retrieved by each method on Robust04 and TREC-COVID]
      - The three methods retrieve different relevant results
  17. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  18. Our proposed zero-shot hybrid model is effective
      Model               | MS-MARCO passage | MS-MARCO doc     | ORCAS            | Robust04          | TREC-COVID
      1. BM25             | 87.18 / 19.32    | 90.91 / 26.50    | 77.52 / 27.10    | 72.84 / 26.91     | 49.29 / 27.86
      2. BM25+Bo1         | 88.27 / 17.95    | 91.64 / 22.69    | 78.85 / 24.53    | 79.02 / 30.83     | 52.58 / 30.98
      3. BM25+docT5query  | 94.07 / 26.09    | 93.18 / 30.28    | 79.62 / 30.28    | 74.64 / 28.01     | 50.66 / 28.77
      4. NPR              | 97.95 / 35.47    | 95.46 / 30.36    | 81.18 / 28.29    | 70.28 / 28.39     | 37.58 / 17.14
      5. RRF(1, 4)        | 98.31 / 29.46    | 96.80 / 32.09    | 85.95 / 30.33    | 79.62 / 33.19     | 52.42 / 30.38
      6. RRF(2, 4)        | 98.36 / 28.62    | 96.90 / 31.48    | 86.18 / 28.36    | 82.82 / 34.60     | 54.63 / 32.21
      7. RRF(3, 4)        | 98.65 / 32.89    | 96.86 / 33.50    | 86.44 / 31.39    | 79.81 / 33.34     | 53.01 / 30.64
      8. RRF(2, 3, 4)     | 98.48 / 29.58    | 96.96 / 32.48    | 86.49 / 29.74    | 82.65 / 34.51     | 55.66 / 34.22
         vs. 1 / vs. 4    | +13.0%* / +0.5%* | +6.7%* / +1.6%*  | +11.6%* / +6.5%* | +13.5%* / +17.6%* | +12.9%* / +48.1%*
      (each cell: R@1K / MAP; the "vs." row gives row 8's relative R@1K change over BM25 and over NPR; *: p < 0.05)
  19. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  20. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
  21. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
  22. Our proposal is simpler and more effective than interpolation
      [Figure: RRF vs. interpolation with varying weight ɑ on Robust04 and TREC-COVID]
      - Interpolation is sensitive to the weight ɑ, and the best weight is dataset-dependent
  23. Our proposal is simpler and more effective than interpolation
      [Figure: RRF vs. interpolation with varying weight ɑ on Robust04 and TREC-COVID; annotated gains: +5.5 (7.1%), +1.5 (3.0%)]
      - Interpolation is sensitive to the weight ɑ, and the best weight is dataset-dependent
  24. Our proposal is simpler and more effective than interpolation
      [Figure: RRF vs. interpolation with varying weight ɑ on Robust04 and TREC-COVID; annotated gains: +2.4 (3.1%), +1.5 (3.0%), +4.9 (9.6%), +5.5 (7.1%)]
      - Interpolation is sensitive to the weight ɑ, and the best weight is dataset-dependent
  25. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
  26. A great potential for an even better fusion method
      [Figure: oracle fusion vs. RRF on Robust04 and TREC-COVID; annotated gains: +5.8 (7.1%), +12.4 (22.3%)]
      - Oracle fusion: merge all relevant results from each method regardless of their ranking positions (a small sketch follows below)
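For clarity, the oracle upper bound can be computed per query roughly as follows (our own illustrative sketch, assuming each run is a list of retrieved doc ids and the relevant doc ids are known from the qrels):

```python
def oracle_fusion_recall(runs, relevant_ids):
    """Upper bound for any fusion method: a relevant doc counts as retrieved
    if *any* of the fused runs returned it, regardless of its rank.

    runs: list of retrieved doc-id lists (one per method, already cut at K).
    relevant_ids: set of relevant doc ids for the query.
    """
    retrieved = set().union(*(set(run) for run in runs))
    return len(retrieved & relevant_ids) / len(relevant_ids) if relevant_ids else 0.0
```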
  27. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
        - We hypothesize that the performance is correlated with query length/structure
  28. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - Lexical models perform badly on single-token queries that are out-of-vocabulary (OOV): a misspelling (e.g., "ihpone6") or a compound word (e.g., "tvbythenumbers")
      - Deep models can capture the semantics of these words from their sub-units (see the tokenization sketch below)
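To make the sub-unit point concrete, here is a small illustrative check with a standard BERT WordPiece tokenizer (our own example; the exact sub-tokens depend on the vocabulary, and NPR's tokenizer may differ):

```python
from transformers import AutoTokenizer

# A standard BERT WordPiece vocabulary; NPR is a BERT dual encoder, so its
# input is split into similar sub-word units.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for query in ["ihpone6", "tvbythenumbers"]:
    # An OOV string is broken into several known sub-tokens rather than one
    # unmatched term, so a deep model can still relate it to "iphone 6" or
    # "tv by the numbers", while BM25 sees a term that matches nothing.
    print(query, "->", tokenizer.tokenize(query))
```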
  29. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - With two or more query tokens, both the lexical models and the deep model improve by a large margin
  30. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - As queries get longer, the gap between the lexical models and the deep model shrinks
  31. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - Lexical models beat the deep model on very long queries
      - The deep model performs poorly on long queries that employ complex logic and seek very specific information, e.g., "According to piaget, which of the following abilities do children gain during middle childhood?"
        - Lexical models retrieve a doc containing the identical query, but the deep model fails
      - The deep model is worse at capturing exact matches (consistent with prior work)
  32. Conclusion
      - A deep retrieval model performs poorly on out-of-domain datasets in a zero-shot setting
      - A deep retrieval model and lexical models (with query/doc expansion) are complementary to each other
      - We propose a simple non-parametric hybrid model based on RRF to combine lexical and deep model results
        - Good performance on five datasets in the zero-shot setting
      - Future work
        - Parameterize the hybrid model using query structure, query length, and the degree of domain shift
        - Improve out-of-domain deep retrieval models via domain adaptation