
Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models

Tao Chen
April 12, 2022

Presented at the 44th European Conference on Information Retrieval (ECIR'22)

Transcript

  1. Out-of-Domain Semantics to the Rescue! Zero-Shot Hybrid Retrieval Models
     Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, Marc Najork

  2. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?" -> doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical Exact Match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)

  3. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?" -> doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical Exact Match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
     Semantic Match: deep retrieval model (e.g., dual-encoder BERT)
     - Winner in many passage retrieval tasks
     - Training is expensive (computation, labeled dataset)

  4. Two First-stage Retrieval Paradigms
     Example query: "What is blood sugar range?" -> doc: "A fasting blood sugar of 99 mg/dL or less is considered within the normal range…"
     Lexical Exact Match: lexical retrieval model (e.g., BM25)
     - Efficient and simple (can be applied to different domains)
     - Vulnerable to vocabulary mismatch (query/doc expansion can alleviate this)
     Semantic Match: deep retrieval model (e.g., dual-encoder BERT)
     - Winner in many passage retrieval tasks
     - Training is expensive (computation, labeled dataset)
     Can we apply the model to a new domain without fine-tuning?
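To make the contrast concrete, here is a minimal, self-contained sketch (our illustration, not the authors' code) of the lexical side: plain BM25 exact matching on the slide's example, with a second on-topic document that shares no query terms. The toy corpus and parameter defaults are our own assumptions.

```python
import math
from collections import Counter

# Toy corpus echoing the slide's example (hypothetical second document).
docs = [
    "a fasting blood sugar of 99 mg/dl or less is considered within the normal range",
    "normal glucose levels depend on when you last ate",
]
query = "what is blood sugar range"

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Plain BM25 over whitespace tokens (no stemming or stopword removal)."""
    toks = [d.split() for d in docs]
    n = len(toks)
    avgdl = sum(len(d) for d in toks) / n
    df = Counter(t for d in toks for t in set(d))  # document frequency per term
    scores = []
    for d in toks:
        tf = Counter(d)
        s = 0.0
        for t in query.split():
            if t not in tf:
                continue  # exact match only: "glucose" never matches "sugar"
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

print(bm25_scores(query, docs))  # second doc scores 0.0 despite being on-topic
```

A dual-encoder deep retriever would instead embed the query and each document and score them by dot product, so a document that says "glucose" can still be retrieved for a "blood sugar" query.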

  5. Three Research Questions
     - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
     - RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
     - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined into a non-parametric hybrid retrieval model?

  6. Existing Hybrid Retrieval Models
     1. Interpolation of lexical and deep retrieval model scores (the most common approach, e.g., [Karpukhin et al., 2020; Luan et al., 2021])
        - Needs raw score normalization and weight tuning
     2. Use RM3 (treating top lexical results as relevance feedback) to select deep retriever results and combine them with top lexical results [Kuzi et al., 2020]
     3. Interleave the top results of the two models in alternating fashion [Zhan et al., 2020]
     Existing models were not evaluated in a zero-shot setup.

  7. Three Research Questions
     - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
     - RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
     - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined into a non-parametric hybrid retrieval model?
     We are the first to propose and evaluate a hybrid retrieval model in a zero-shot setup.

  8. Our proposal: Non-parametric Hybrid Model
     - Goal: a hybrid model that can be easily applied to a new domain in a zero-shot setting
     - A simple fusion framework based on Reciprocal Rank Fusion (RRF), scoring each document d over the set of models M:
       $\mathrm{RRFscore}(d) = \sum_{m \in M} \frac{1}{k + \mathrm{rank}_m(d)}$
       where rank_m(d) is the rank assigned to document d by model m, and k = 60 (following the original paper [Cormack et al., 2009])
     - Non-parametric
     - Easy to fuse more than two models
     - Flexible to plug/unplug a model
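A minimal sketch of the RRF fusion described above (our illustration, not the authors' code); the doc ids in the usage line are hypothetical.

```python
def reciprocal_rank_fusion(runs, k=60):
    """Fuse ranked lists via RRF: score(d) = sum over runs of 1 / (k + rank_m(d)).

    `runs` is a list of rankings, each an ordered list of doc ids (best first).
    k = 60 follows Cormack et al. (2009). No score normalization or weight
    tuning is needed, and fusing another model is just appending its run.
    """
    scores = {}
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical runs; e.g. RRF(2, 3, 4) fuses BM25+Bo1, BM25+docT5query and NPR.
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d1"], ["d3", "d2"]]))
```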

  9. Our proposal: Using RRF to Fuse Lexical and Deep Retrieval Models
     Lexical retrieval models:
     - BM25
     - BM25+RM3 (query expansion)
     - BM25+Bo1 (query expansion)
     - BM25+docT5query (document expansion) [Nogueira and Lin, 2019]: use T5 (trained on MS-MARCO passage) to generate queries (i.e., expansions) for a given passage
     Deep retrieval model: NPR [Lu et al., 2021]
     - A dual-encoder BERT improved with synthetic-data pre-training and hard negative sampling
     - Trained on the MS-MARCO passage dataset with competitive performance: Recall@1K (dev) = 97.5
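For the document-expansion step, a hedged sketch of docT5query-style generation using the public docTTTTTquery checkpoint from the Hugging Face Hub; the checkpoint name and generation settings are our assumptions, not necessarily what the paper used.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# castorini/doc2query-t5-base-msmarco is the public docTTTTTquery model;
# it may differ from the exact model used in the paper.
name = "castorini/doc2query-t5-base-msmarco"
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

passage = ("A fasting blood sugar of 99 mg/dL or less is considered "
           "within the normal range.")
input_ids = tokenizer(passage, return_tensors="pt").input_ids
outputs = model.generate(
    input_ids,
    max_length=64,
    do_sample=True,            # top-k sampling yields diverse synthetic queries
    top_k=10,
    num_return_sequences=3,
)
# The generated queries are appended to the passage before BM25 indexing.
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```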

  10. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined into a non-parametric hybrid retrieval model?

  11. Five Datasets with Varied Domain Shift

      | Dataset | Domain | Task | # Queries | Corpus size | Avg. relevant docs/query | Domain shift |
      |---|---|---|---|---|---|---|
      | MS-MARCO passage | Misc. | Passage retrieval | 6,980 | 8.8M | 1.1 | In-domain |
      | MS-MARCO doc | Misc. | Document retrieval | 5,193 | 3.2M | 1.1 | In-domain but different task |
      | ORCAS | Misc. | Document retrieval | 9,670 | 1.4M | 1.8 | Different queries but same docs |
      | Robust04 | News | Document retrieval | 250 | 528K | 69.9 | News domain |
      | TREC-COVID | Bio-medical | Document retrieval | 50 | 191K | 493.5 | Very different domain |

      Domain shift increases down the table.

  12. Query and document expansion are useful for BM25

      | Model | MS-MARCO passage | MS-MARCO doc | ORCAS | Robust04 | TREC-COVID |
      |---|---|---|---|---|---|
      | 1. BM25 | 87.18 / 19.32 | 90.91 / 26.50 | 77.52 / 27.10 | 72.84 / 26.91 | 49.29 / 27.86 |
      | 2. BM25+Bo1 | 88.27 (+1.3%) / 17.95 | 91.64 (+0.8%) / 22.69 | 78.85 (+1.7%) / 24.53 | 79.02 (+8.5%) / 30.83 | 52.58 (+6.7%) / 30.98 |
      | 3. BM25+docT5query | 94.07 (+7.9%) / 26.09 | 93.18 (+2.5%) / 30.28 | 79.62 (+2.7%) / 30.28 | 74.64 (+2.5%) / 28.01 | 50.66 (+2.8%) / 28.77 |

      Each cell: R@1K (relative change vs. BM25) / MAP.

  13. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined into a non-parametric hybrid retrieval model?

  14. Deep retrieval model generalizes poorly on out-of-domain datasets

      | Model | MS-MARCO passage | MS-MARCO doc | ORCAS | Robust04 | TREC-COVID |
      |---|---|---|---|---|---|
      | 1. BM25 | 87.18 / 19.32 | 90.91 / 26.50 | 77.52 / 27.10 | 72.84 / 26.91 | 49.29 / 27.86 |
      | 2. BM25+Bo1 | 88.27 (+1.3%) / 17.95 | 91.64 (+0.8%) / 22.69 | 78.85 (+1.7%) / 24.53 | 79.02 (+8.5%) / 30.83 | 52.58 (+6.7%) / 30.98 |
      | 3. BM25+docT5query | 94.07 (+7.9%) / 26.09 | 93.18 (+2.5%) / 30.28 | 79.62 (+2.7%) / 30.28 | 74.64 (+2.5%) / 28.01 | 50.66 (+2.8%) / 28.77 |
      | 4. NPR | 97.95 (+12.4%) / 35.47 | 95.46 (+5.0%) / 30.36 | 81.18 (+4.7%) / 28.29 | 70.28 (-3.5%) / 28.39 | 37.58 (-23.8%) / 17.14 |

      Each cell: R@1K (relative change vs. BM25) / MAP.
      - NPR shows good in-domain task generalization (passage -> doc retrieval)
      - But poor performance on out-of-domain datasets that are very different from the training domain (Robust04, TREC-COVID)

  15. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined into a non-parametric hybrid retrieval model?

  16. Lexical and deep retrieval models are complementary
      [Figure: overlap of relevant results retrieved by the three methods on Robust04 and TREC-COVID]
      - The three methods retrieve different relevant results

  17. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined into a non-parametric hybrid retrieval model?

  18. Our proposed zero-shot hybrid model is effective

      | Model | MS-MARCO passage | MS-MARCO doc | ORCAS | Robust04 | TREC-COVID |
      |---|---|---|---|---|---|
      | 1. BM25 | 87.18 / 19.32 | 90.91 / 26.50 | 77.52 / 27.10 | 72.84 / 26.91 | 49.29 / 27.86 |
      | 2. BM25+Bo1 | 88.27 / 17.95 | 91.64 / 22.69 | 78.85 / 24.53 | 79.02 / 30.83 | 52.58 / 30.98 |
      | 3. BM25+docT5query | 94.07 / 26.09 | 93.18 / 30.28 | 79.62 / 30.28 | 74.64 / 28.01 | 50.66 / 28.77 |
      | 4. NPR | 97.95 / 35.47 | 95.46 / 30.36 | 81.18 / 28.29 | 70.28 / 28.39 | 37.58 / 17.14 |
      | 5. RRF(1, 4) | 98.31 / 29.46 | 96.80 / 32.09 | 85.95 / 30.33 | 79.62 / 33.19 | 52.42 / 30.38 |
      | 6. RRF(2, 4) | 98.36 / 28.62 | 96.90 / 31.48 | 86.18 / 28.36 | 82.82 / 34.60 | 54.63 / 32.21 |
      | 7. RRF(3, 4) | 98.65 / 32.89 | 96.86 / 33.50 | 86.44 / 31.39 | 79.81 / 33.34 | 53.01 / 30.64 |
      | 8. RRF(2, 3, 4) | 98.48 (+13.0%*, +0.5%*) / 29.58 | 96.96 (+6.7%*, +1.6%*) / 32.48 | 86.49 (+11.6%*, +6.5%*) / 29.74 | 82.65 (+13.5%*, +17.6%*) / 34.51 | 55.66 (+12.9%*, +48.1%*) / 34.22 |

      Each cell: R@1K / MAP; for row 8, parentheses give the R@1K change vs. row 1 (BM25) and vs. row 4 (NPR). *: p < 0.05.

  19. Three Research Questions
      - RQ 1: Can a deep retrieval model generalize to a new domain in a zero-shot setting?
      - RQ 2: Is deep retrieval complementary to lexical matching and query and document expansion?
      - RQ 3: Can lexical matching, expansion, and deep retrieval models be combined into a non-parametric hybrid retrieval model?

  20. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?

  21. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?

  22. Our proposal is simpler and more effective than interpolation
      [Figure: R@1K vs. interpolation weight α on Robust04 and TREC-COVID]
      - Interpolation is sensitive to the weight α, and the best weight is dataset dependent
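For contrast with RRF, a sketch of the interpolation baseline; the min-max normalization and the toy scores are our assumptions (published variants differ).

```python
import numpy as np

def interpolate(lex_scores, deep_scores, alpha):
    """Interpolation baseline: alpha * lexical + (1 - alpha) * deep.

    Unlike RRF, raw scores live on different scales, so each run must be
    normalized (min-max here) and alpha tuned per dataset. Assumes the two
    score arrays are aligned over the same candidate documents.
    """
    def minmax(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-9)
    return alpha * minmax(lex_scores) + (1 - alpha) * minmax(deep_scores)

# alpha = 0.0 is pure deep retrieval, 1.0 pure lexical; the best value
# differs between, e.g., Robust04 and TREC-COVID.
print(interpolate([12.1, 9.3, 4.0], [0.71, 0.80, 0.55], alpha=0.5))
```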

  23. Our proposal is simpler and more effective than interpolation
      [Figure: same plot, annotated with RRF gains over interpolation of +5.5 (7.1%) on Robust04 and +1.5 (3.0%) on TREC-COVID]
      - Interpolation is sensitive to the weight α, and the best weight is dataset dependent

  24. Our proposal is simpler and more effective than interpolation
      [Figure: same plot, annotated with RRF gains over interpolation of +2.4 (3.1%) and +5.5 (7.1%) on Robust04, and +1.5 (3.0%) and +4.9 (9.6%) on TREC-COVID]
      - Interpolation is sensitive to the weight α, and the best weight is dataset dependent

  25. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?

  26. A great potential for an even better fusion method
      [Figure: R@1K of RRF vs. oracle fusion, with headroom of +5.8 (7.1%) on Robust04 and +12.4 (22.3%) on TREC-COVID]
      - Oracle fusion: merge all relevant results from each method, regardless of their ranking positions
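A small sketch of how such an oracle upper bound can be computed, following the slide's definition; the run names in the comment are hypothetical.

```python
def oracle_recall_at_k(runs, relevant, k=1000):
    """Oracle fusion upper bound: a relevant doc counts if ANY method
    retrieved it within its top-k, regardless of ranking position."""
    retrieved = set()
    for run in runs:               # each run: ordered list of doc ids
        retrieved.update(run[:k])
    return len(retrieved & relevant) / len(relevant)

# e.g. oracle_recall_at_k([bm25_bo1_run, doct5query_run, npr_run], relevant_ids)
```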

  27. Discussions
      - How does the zero-shot hybrid model compare to the interpolation method?
      - What is the upper bound of a fusion-based hybrid model?
      - When does the deep/lexical model perform better than the other?
        - We hypothesize that performance is correlated with query length/structure

  28. ORCAS Recall@1K by number of query tokens (excluding stopwords)

      | Model | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | All |
      |---|---|---|---|---|---|---|---|---|---|---|---|
      | 1. BM25 | 34.99 | 74.21 | 80.17 | 81.98 | 82.31 | 82.85 | 81.13 | 87.84 | 85.64 | 87.47 | 77.52 |
      | 2. BM25+Bo1 | 39.50 | 76.57 | 81.33 | 83.08 | 83.28 | 82.97 | 82.57 | 88.14 | 86.27 | 87.85 | 78.85 |
      | 3. BM25+docT5query | 36.92 | 77.78 | 83.02 | 84.14 | 86.15 | 84.94 | 83.76 | 88.61 | 85.83 | 87.83 | 79.62 |
      | 4. NPR | 59.81 | 82.47 | 85.79 | 86.59 | 87.56 | 85.55 | 82.79 | 83.32 | 80.14 | 75.87 | 81.18 |

      - Lexical models perform badly on single-token queries that are out of vocabulary
        - e.g., a misspelled word ("ihpone6") or a compound word ("tvbythenumbers")
      - Deep models can capture the semantics of such words from their sub-units
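An easy way to see the sub-unit effect (our aside, not from the slides): a BERT-style WordPiece tokenizer splits OOV tokens into known pieces, which a deep encoder can still embed. The exact splits depend on the vocabulary and may differ from what the paper observed.

```python
from transformers import AutoTokenizer

# bert-base-uncased is used here only as a representative WordPiece vocabulary.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("ihpone6"))         # OOV token split into sub-units
print(tok.tokenize("tvbythenumbers"))  # compound word split into sub-units
```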

  29. ORCAS Recall@1K by number of query tokens (excluding stopwords)
      (same table as slide 28)
      - Both the lexical models and the deep model see a large performance improvement

  30. ORCAS Recall@1K by number of query tokens (excluding stopwords)
      (same table as slide 28)
      - The gap between the lexical models and the deep model is smaller

  31. ORCAS Recall@1K by number of query tokens (excluding stopwords)
      (same table as slide 28)
      - Lexical models beat the deep model on very long queries
      - The deep model performs poorly on long queries that employ complex logic and seek very specific information, e.g.,
        "According to Piaget, which of the following abilities do children gain during middle childhood?"
        - Lexical models retrieve a doc containing the query verbatim, but the deep model fails to
        - The deep model is worse at capturing exact matches (consistent with prior work)

  32. Conclusion
      - A deep retrieval model performs poorly on out-of-domain datasets in a zero-shot setting
      - A deep retrieval model and lexical models (with query/doc expansion) are complementary to each other
      - We propose a simple non-parametric hybrid model based on RRF to combine lexical and deep model results
        - Good performance on five datasets in the zero-shot setting
      - Future work:
        - Parameterize the hybrid model using query structure, query length, and the degree of domain shift
        - Improve out-of-domain deep retrieval via domain adaptation