Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models

Tao Chen
April 12, 2022

Presented at the 44th European Conference on Information Retrieval (ECIR'22)

Transcript

  1. Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models
     Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, Marc Najork
     Paper | Slides
  2. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?"
     Example doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical exact match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
  3. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?"
     Example doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical exact match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
     Semantic match: deep retrieval model (e.g., dual-encoder BERT)
     - Winner in many passage retrieval tasks
     - Training is expensive (computation, labeled data)
  4. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?"
     Example doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical exact match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
     Semantic match: deep retrieval model (e.g., dual-encoder BERT)
     - Winner in many passage retrieval tasks
     - Training is expensive (computation, labeled data)
     Can we apply the model to a new domain without fine-tuning? (Both paradigms are sketched in code below.)
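To make the two paradigms concrete, here is a minimal sketch using off-the-shelf stand-ins (rank_bm25 for the lexical model and a small sentence-transformers dual encoder for the deep model); this is illustrative only, not the models evaluated in the talk.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = ["A fasting blood sugar of 99 mg/dL or less is considered within the normal range."]
query = "What is blood sugar range?"

# Lexical exact match: BM25 scores documents by overlapping query terms,
# so it is cheap and domain-agnostic but suffers from vocabulary mismatch.
bm25 = BM25Okapi([d.lower().split() for d in docs])
lexical_scores = bm25.get_scores(query.lower().split())

# Semantic match: a dual encoder embeds query and documents independently
# and scores them by vector similarity, so paraphrases can still match.
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
semantic_scores = util.cos_sim(encoder.encode(query), encoder.encode(docs))
```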
  5. Three Research Questions
     - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
     - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
     - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  6. Existing Hybrid Retrieval Models
     1. Interpolation of lexical and deep retrieval models (the most common approach, e.g., [Karpukhin et al., 2020; Luan et al., 2021]); needs raw score normalization and weight tuning (sketched below)
     2. Use RM3 (top lexical results are treated as relevance feedback) to select deep retriever results and combine them with the top lexical results [Kuzi et al., 2020]
     3. Alternately combine the top results of the two models [Zhan et al., 2020]
     Existing models were not evaluated in a zero-shot setup
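As a rough illustration of the interpolation approach in item 1 (our own sketch, not any of the cited implementations), raw scores from the two models are first normalized and then mixed with a weight ɑ that has to be tuned per dataset:

```python
def minmax_normalize(scores):
    """Map raw retrieval scores to [0, 1] so lexical and deep scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def interpolate(lexical_scores, deep_scores, alpha=0.5):
    """Weighted sum of normalized scores. The weight alpha must be tuned,
    which is exactly the step a non-parametric fusion method avoids."""
    lex, deep = minmax_normalize(lexical_scores), minmax_normalize(deep_scores)
    fused = {d: alpha * lex.get(d, 0.0) + (1 - alpha) * deep.get(d, 0.0)
             for d in set(lex) | set(deep)}
    return sorted(fused, key=fused.get, reverse=True)
```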
  7. Three Research Questions
     - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
     - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
     - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
     We are the first to propose and evaluate a hybrid retrieval model in a zero-shot setup
  8. Our proposal: Non-parametric Hybrid Model
     - Goal: a hybrid model that can be easily applied to a new domain in a zero-shot setting
     - A simple fusion framework based on Reciprocal Rank Fusion (RRF):
       RRF(q, d) = Σ_{m ∈ Models} 1 / (k + rank_m(q, d))
       where rank_m(q, d) is the rank assigned to doc d for query q by model m, and k = 60 (following the original paper [Cormack et al., 2009])
     - Non-parametric
     - Easy to fuse more than two models
     - Flexible to plug/unplug a model
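A minimal sketch of the RRF fusion step (our own illustrative code, assuming each model contributes a ranked list of doc ids for the query):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion [Cormack et al., 2009].

    ranked_lists: one ranking per model, each a list of doc ids ordered from
    best to worst. Only ranks are used, so no score normalization or weight
    tuning is needed, and any number of models can be plugged in or out.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g., RRF(2, 3, 4): fuse the BM25+Bo1, BM25+docT5query, and NPR runs
# fused = rrf_fuse([bm25_bo1_run, bm25_doct5query_run, npr_run])
```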
  9. Our proposal: Using RRF to Fuse Lexical and Deep Retrieval Models
     Lexical retrieval models:
     - BM25
     - BM25+RM3 (query expansion)
     - BM25+Bo1 (query expansion)
     - BM25+docT5query (doc expansion) [Nogueira and Lin, 2019]: uses T5 (trained on MS-MARCO passage) to generate queries (i.e., expansions) for a given passage
     Deep retrieval model: NPR [Lu et al., 2021]
     - A dual-encoder BERT improved with synthetic data pre-training and hard negative sampling
     - Trained on the MS-MARCO passage dataset, with competitive performance: Recall@1K (dev) 97.5
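For concreteness, docT5query-style document expansion can be sketched with Hugging Face Transformers as below; the checkpoint name is the publicly released doc2query model and is assumed here for illustration, and may differ from the exact model used in this work.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Publicly released docT5query checkpoint (assumed here for illustration).
name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

passage = ("A fasting blood sugar of 99 mg/dL or less is considered "
           "within the normal range.")
inputs = tokenizer(passage, return_tensors="pt", truncation=True)

# Sample a few likely queries and append them to the passage before indexing,
# so BM25 can match query terms that never appear in the original text.
outputs = model.generate(inputs.input_ids, max_length=64, do_sample=True,
                         top_k=10, num_return_sequences=3)
expansions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
expanded_passage = passage + " " + " ".join(expansions)
```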
  10. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  11. Five Datasets with Varied Domain Shift
      Dataset           | Domain      | Task               | # Query | # Corpus | Avg. relevant docs / query
      MS-MARCO passage  | Misc.       | Passage retrieval  | 6980    | 8.8M     | 1.1    (in-domain)
      MS-MARCO doc      | Misc.       | Document retrieval | 5193    | 3.2M     | 1.1    (in-domain but different task)
      ORCAS             | Misc.       | Document retrieval | 9670    | 1.4M     | 1.8    (different queries but same docs)
      Robust04          | News        | Document retrieval | 250     | 528K     | 69.9   (news domain)
      TREC-COVID        | Bio-medical | Document retrieval | 50      | 191K     | 493.5  (very different domain)
      Domain shift increases from top to bottom
  12. Query and document expansion are useful for BM25
      Model               | MS-MARCO passage      | MS-MARCO doc          | ORCAS                 | Robust04              | TREC-COVID
      1. BM25             | 87.18 / 19.32         | 90.91 / 26.50         | 77.52 / 27.10         | 72.84 / 26.91         | 49.29 / 27.86
      2. BM25+Bo1         | 88.27 (+1.3%) / 17.95 | 91.64 (+0.8%) / 22.69 | 78.85 (+1.7%) / 24.53 | 79.02 (+8.5%) / 30.83 | 52.58 (+6.7%) / 30.98
      3. BM25+docT5query  | 94.07 (+7.9%) / 26.09 | 93.18 (+2.5%) / 30.28 | 79.62 (+2.7%) / 30.28 | 74.64 (+2.5%) / 28.01 | 50.66 (+2.8%) / 28.77
      (each cell: R@1K / MAP; percentages are the relative R@1K change vs. row 1, BM25)
  13. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  14. Deep retrieval model generalizes poorly on out-of-domain datasets
      Model               | MS-MARCO passage       | MS-MARCO doc          | ORCAS                 | Robust04              | TREC-COVID
      1. BM25             | 87.18 / 19.32          | 90.91 / 26.50         | 77.52 / 27.10         | 72.84 / 26.91         | 49.29 / 27.86
      2. BM25+Bo1         | 88.27 (+1.3%) / 17.95  | 91.64 (+0.8%) / 22.69 | 78.85 (+1.7%) / 24.53 | 79.02 (+8.5%) / 30.83 | 52.58 (+6.7%) / 30.98
      3. BM25+docT5query  | 94.07 (+7.9%) / 26.09  | 93.18 (+2.5%) / 30.28 | 79.62 (+2.7%) / 30.28 | 74.64 (+2.5%) / 28.01 | 50.66 (+2.8%) / 28.77
      4. NPR              | 97.95 (+12.4%) / 35.47 | 95.46 (+5.0%) / 30.36 | 81.18 (+4.7%) / 28.29 | 70.28 (-3.5%) / 28.39 | 37.58 (-23.8%) / 17.14
      (each cell: R@1K / MAP; percentages are the relative R@1K change vs. row 1, BM25)
      - Good in-domain task generalization (passage -> doc retrieval)
      - Poor performance on out-of-domain datasets that are very different from the training domain
  15. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  16. Lexical and deep retrieval models are complementary
      [Figure: overlap of relevant results retrieved by each method on Robust04 and TREC-COVID]
      - The three methods retrieve different relevant results
  17. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  18. Our proposed zero-shot hybrid model is effective
      Model               | MS-MARCO passage | MS-MARCO doc     | ORCAS            | Robust04          | TREC-COVID
      1. BM25             | 87.18 / 19.32    | 90.91 / 26.50    | 77.52 / 27.10    | 72.84 / 26.91     | 49.29 / 27.86
      2. BM25+Bo1         | 88.27 / 17.95    | 91.64 / 22.69    | 78.85 / 24.53    | 79.02 / 30.83     | 52.58 / 30.98
      3. BM25+docT5query  | 94.07 / 26.09    | 93.18 / 30.28    | 79.62 / 30.28    | 74.64 / 28.01     | 50.66 / 28.77
      4. NPR              | 97.95 / 35.47    | 95.46 / 30.36    | 81.18 / 28.29    | 70.28 / 28.39     | 37.58 / 17.14
      5. RRF(1, 4)        | 98.31 / 29.46    | 96.80 / 32.09    | 85.95 / 30.33    | 79.62 / 33.19     | 52.42 / 30.38
      6. RRF(2, 4)        | 98.36 / 28.62    | 96.90 / 31.48    | 86.18 / 28.36    | 82.82 / 34.60     | 54.63 / 32.21
      7. RRF(3, 4)        | 98.65 / 32.89    | 96.86 / 33.50    | 86.44 / 31.39    | 79.81 / 33.34     | 53.01 / 30.64
      8. RRF(2, 3, 4)     | 98.48 / 29.58    | 96.96 / 32.48    | 86.49 / 29.74    | 82.65 / 34.51     | 55.66 / 34.22
         vs. 1 / vs. 4    | +13.0%* / +0.5%* | +6.7%* / +1.6%*  | +11.6%* / +6.5%* | +13.5%* / +17.6%* | +12.9%* / +48.1%*
      (each cell: R@1K / MAP; the "vs." row gives row 8's relative R@1K change over BM25 and over NPR; *: p < 0.05)
  19. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query/document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined in a non-parametric hybrid retrieval model?
  20. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
  21. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
  22. Our proposal is simpler and more effective than interpolation
      [Figure: RRF vs. interpolation with varying weight ɑ on Robust04 and TREC-COVID]
      - Interpolation is sensitive to the weight ɑ, and the best weight is dataset-dependent
  23. Our proposal is simpler and more effective than interpolation
      [Figure: RRF vs. interpolation with varying weight ɑ on Robust04 and TREC-COVID; annotated gains: +5.5 (7.1%), +1.5 (3.0%)]
      - Interpolation is sensitive to the weight ɑ, and the best weight is dataset-dependent
  24. Our proposal is simpler and more effective than interpolation
      [Figure: RRF vs. interpolation with varying weight ɑ on Robust04 and TREC-COVID; annotated gains: +2.4 (3.1%), +1.5 (3.0%), +4.9 (9.6%), +5.5 (7.1%)]
      - Interpolation is sensitive to the weight ɑ, and the best weight is dataset-dependent
  25. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
  26. A great potential for an even better fusion method
      [Figure: oracle fusion vs. RRF on Robust04 and TREC-COVID; annotated gains: +5.8 (7.1%), +12.4 (22.3%)]
      - Oracle fusion: merge all relevant results from each method regardless of their ranking positions (a small sketch follows below)
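For clarity, the oracle upper bound can be computed per query roughly as follows (our own illustrative sketch, assuming each run is a list of retrieved doc ids and the relevant doc ids are known from the qrels):

```python
def oracle_fusion_recall(runs, relevant_ids):
    """Upper bound for any fusion method: a relevant doc counts as retrieved
    if *any* of the fused runs returned it, regardless of its rank.

    runs: list of retrieved doc-id lists (one per method, already cut at K).
    relevant_ids: set of relevant doc ids for the query.
    """
    retrieved = set().union(*(set(run) for run in runs))
    return len(retrieved & relevant_ids) / len(relevant_ids) if relevant_ids else 0.0
```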
  27. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
        - We hypothesize that the performance is correlated with query length/structure
  28. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - Lexical models perform badly on single-token queries that are out-of-vocabulary (OOV): a misspelling (e.g., "ihpone6") or a compound word (e.g., "tvbythenumbers")
      - Deep models can capture the semantics of these words from their sub-units (see the tokenization sketch below)
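To make the sub-unit point concrete, here is a small illustrative check with a standard BERT WordPiece tokenizer (our own example; the exact sub-tokens depend on the vocabulary, and NPR's tokenizer may differ):

```python
from transformers import AutoTokenizer

# A standard BERT WordPiece vocabulary; NPR is a BERT dual encoder, so its
# input is split into similar sub-word units.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for query in ["ihpone6", "tvbythenumbers"]:
    # An OOV string is broken into several known sub-tokens rather than one
    # unmatched term, so a deep model can still relate it to "iphone 6" or
    # "tv by the numbers", while BM25 sees a term that matches nothing.
    print(query, "->", tokenizer.tokenize(query))
```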
  29. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - With two or more query tokens, both the lexical models and the deep model improve by a large margin
  30. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - As queries get longer, the gap between the lexical models and the deep model shrinks
  31. ORCAS recall@1K result by query token number (w/o stopwords)
      Query length        | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | All
      1. BM25             | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52
      2. BM25+Bo1         | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85
      3. BM25+docT5query  | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62
      4. NPR              | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18
      - Lexical models beat the deep model on very long queries
      - The deep model performs poorly on long queries that employ complex logic and seek very specific information, e.g., "According to piaget, which of the following abilities do children gain during middle childhood?"
        - Lexical models retrieve a doc containing the identical query, but the deep model fails
      - The deep model is worse at capturing exact matches (consistent with prior work)
  32. Conclusion
      - A deep retrieval model performs poorly on out-of-domain datasets in a zero-shot setting
      - A deep retrieval model and lexical models (with query/doc expansion) are complementary to each other
      - We propose a simple non-parametric hybrid model based on RRF to combine lexical and deep model results
        - Good performance on five datasets in the zero-shot setting
      - Future work
        - Parameterize the hybrid model using query structure, query length, and the degree of domain shift
        - Improve out-of-domain deep retrieval models via domain adaptation