Modeling Queries as Bags of Documents

This Search Solutions 2024 presentation introduces the bag-of-documents model as a way to align query and document representations. It specifically addresses the gap between the broad variability of query intents and the inherent specificity of individual documents or products. It describes how to compute bag-of-documents representations of frequent queries by aggregating the vectors of the documents clicked for each query, and then how to use those query vectors as training data for a sentence transformer model that handles infrequent queries. It then shows how the bag-of-documents model helps recognize query similarity and compute query specificity, both of which improve quality, experience, and analytics for search applications.

Daniel Tunkelang

May 25, 2026

Transcript

  1. Overview
    • Motivation
      ◦ Why model queries as bags of documents?
    • Approach
      ◦ How do we compute the query vectors?
    • Applications
      ◦ What can we do with these query vectors?
  2. Motivation
    • Search queries don’t always look like product titles.
    • Example queries: iphone, laptops, headphones
    • Example product titles:
      ◦ Straight Talk Apple iPhone 13, 128GB, Midnight - Prepaid Smartphone [Locked to Straight Talk]
      ◦ HP 14 inch Laptop Intel Core i3-N305 8GB RAM 256GB SSD Moonlight Blue
      ◦ Beats Solo3 Wireless On-Ear Headphones - Gold
  3. Fundamental Misalignment
    • Not just a matter of translation or rewording.
    • Query intents vary in specificity and can be broad, while products are inherently singular.
    • So we cannot simply translate the query into a hypothetical document. [Gao et al, 2023]
  4. Query Understanding
    • If a search application uses embedding-based retrieval, then it is essential to have robust query vectors!
    • [−0.9704, 0.2045, 0.1281 … ] Daniel
  5. Note: Density is not Destiny!
    • Most queries are short sequences of entities.
    • For such queries, focus on precision and desirability.
    • Embedding-based retrieval helps more for recall.
    • Query similarity addresses tail queries for head intents.
  6. Approach
    • Model query as a bag of documents. [Mandal et al, 2023]
    • Figure: the query "mens black tshirts", its bag of document vectors [0.13, 0.81, …], [0.09, 0.75, …], [0.98, 0.77, …], …, and the vector [0.11, 0.79, … ]
  7. Frequent Queries
    • Associate queries with products based on engagement.
    • Can use clicks, purchases, or other engagement signals.
    • Aggregate product vectors to obtain query representation.
    • Specifically compute the mean and a kind of “variance”.
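
The aggregation step for frequent queries can be illustrated with a minimal sketch. It assumes engagement is logged as (query, document id) pairs and that document embeddings are already available; the function names, the toy 2-dimensional vectors, and the unweighted mean are illustrative assumptions, not details from the talk.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Return the element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("bag of documents is empty")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def build_query_vectors(engagements, doc_vectors):
    """Group engaged documents by query and average their vectors.

    engagements: iterable of (query, doc_id) pairs, e.g. one pair per click.
    doc_vectors: dict mapping doc_id -> embedding (list of floats).
    A document engaged with several times appears several times in the bag,
    so it contributes proportionally more to the mean.
    """
    bags = defaultdict(list)
    for query, doc_id in engagements:
        bags[query].append(doc_vectors[doc_id])
    return {query: mean_vector(bag) for query, bag in bags.items()}


# Hypothetical toy data: 2-dimensional vectors for readability.
doc_vectors = {"p1": [0.25, 0.75], "p2": [0.75, 0.25]}
engagements = [("mens black tshirts", "p1"), ("mens black tshirts", "p2")]
query_vectors = build_query_vectors(engagements, doc_vectors)
```

Repeating a document once per engagement event is one simple way to weight by popularity; counting each document once per query is an equally plausible choice.
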
  8. Infrequent Queries
    • Could obtain products from retrieval at query time.
    • More efficient to train a sentence transformer model.
  9. Query Similarity
    • Figure: two queries, "black tshirts for men" and "mens black t-shirt", each with a bag of document vectors mapped (►) to a query vector:
      ◦ [0.13, 0.81, … ] [0.09, 0.75, … ] … ► [0.11, 0.79, … ]
      ◦ [0.13, 0.81, … ] [0.09, 0.77, … ] … ► [0.12, 0.78, … ]
    • cos = 0.98
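
Comparing two query vectors by cosine similarity can be sketched as follows. The `are_similar` helper and its 0.95 threshold are hypothetical; the slide gives only one example cosine value and no cutoff. The vectors on the slide are truncated, so the 2-dimensional vectors in the usage lines are toy inputs and are not expected to reproduce the slide's value.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def are_similar(u, v, threshold=0.95):
    """Treat two queries as near-equivalent above a (hypothetical) threshold."""
    return cosine(u, v) >= threshold


# Toy query vectors (assumed, 2-dimensional for readability).
q1 = [0.11, 0.79]
q2 = [0.12, 0.78]
similar = are_similar(q1, q2)
```

In practice the threshold would need tuning against labeled pairs of equivalent and non-equivalent queries.
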
  10. Improving Recall and More
    • Query similarity can replace token-level query expansion and query relaxation with a holistic approach.
    • Identifying equivalent queries defragments search intents that are spread across multiple queries in autocomplete, search suggestions, analytics, etc.
    • Can even relate keyword queries to browse nodes.
  11. Specificity = Variance
    • Query vector is mean of the document vectors in the bag.
    • Query specificity measures how tightly the document vectors in the bag cluster around the mean query vector.
    • Specificity is a continuous measure that captures the intuition of a spectrum between broad and narrow intent.
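  12. Computing Specificity for Frequent Queries
    • Specificity is the mean of the cosines between the query vector and the document vectors.
    • Figure: the query "mens black tshirts", its bag of document vectors [0.13, 0.81, …], [0.09, 0.75, …], [0.98, 0.77, …], …, the query vector [0.11, 0.79, … ], and the cosine values 0.82, 0.75, 0.81, 0.79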
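
The specificity computation for a frequent query can be sketched directly from the two definitions above: the query vector is the mean of the bag, and specificity is the mean cosine between that query vector and each document vector. The function names and the toy bags are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("bag of documents is empty")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def specificity(bag):
    """Mean cosine between the bag's mean vector and each document vector."""
    query_vector = mean_vector(bag)
    return sum(cosine(query_vector, doc) for doc in bag) / len(bag)


# Toy bags (assumed): a narrow intent whose documents point the same way,
# and a broad intent whose documents are orthogonal to each other.
narrow_bag = [[1.0, 0.0], [1.0, 0.0]]
broad_bag = [[1.0, 0.0], [0.0, 1.0]]
narrow_score = specificity(narrow_bag)
broad_score = specificity(broad_bag)
```

A tightly clustered bag scores close to 1, while a bag whose documents point in different directions scores lower, which matches the broad-to-narrow spectrum described on the previous slide.
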
  13. Computing Specificity for Infrequent Queries
    • Can compute based on retrieval at query time.
    • Better to train a transformer-based regression model.
  14. Informing Search Experience and Tradeoffs
    • Low query specificity can trigger interface elements that elicit more signal from the searcher, e.g., refinements.
    • Autocomplete can favor high-specificity queries, which are more likely to lead to a conversion.
    • High specificity means that relevance is critical, while lower specificity suggests there is more room to trade off relevance for desirability or other factors.
  15. Summary
    • Bag-of-documents model aligns query and product vectors.
    • Aggregate document vectors to obtain query vectors for frequent queries. Train a model for infrequent queries.
    • Apply bag-of-documents to compute query similarity and specificity, improving retrieval and search experience.