A Practical Approach To Semantic Search

rejasupotaro
June 11, 2024

Transcript

  1. A Practical Approach To Semantic Search How can embeddings improve

    your product? Kentaro Takiguchi @Berlin Buzzwords 2024
  2. About Me and Company • Search Relevance Engineer (Linguistics and

    Evaluation) • Mercari is a Japanese two-sided marketplace platform ◦ Users buy and sell their items ≈ Product Search
  3. • Question Answering ◦ Query: “What are some examples of

    unsecured loans?” ◦ Doc: “Auto loans are secured against the car. "Signature" loans, from a bank that knows and trusts you, are typically unsecured. unsecured loans other than informal ones or these are fairly rare. Most lenders don't want to take the additional risk, or balance that risk with a high enough interest rate to make the unsecured loan unattractive.” • Product Search ◦ Query: “headphones noise-cancelling” ◦ Doc: “bluetooth noise-cancelling headphones 60H playtime High Res Audio” How (Lexical) (Product) Search Works System design varies depending on the domain Each query term acts as a filter condition Retrieve the doc that satisfies the central aspect Users expect the product to satisfy all the requirements
  4. • Question Answering ◦ The extent to which the answer

    pertains to the topic of the question ◦ Relies merely on term overlap; not all query terms need to appear in the doc • Product Search ◦ How well the item satisfies the given specifications ◦ Query terms are used to narrow down the search space Relevance Definition
  5. e.g. Query: “nike shoes” nike shoes Exact (A ∧ B)

    Substitute (A ∨ B) Irrelevant ¬(A ∨ B) Note: Each filter condition is not necessarily a single term • “nike shoes” 👉 Exact • “adidas shoes” 👉 Substitute • “hair dryer” 👉 Irrelevant
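The Exact / Substitute / Irrelevant split above can be sketched as a small function. This is a minimal illustration, assuming each filter condition is a single whitespace-separated term (the slide notes that this is not always the case); the function name `judge` is hypothetical.

```python
def judge(query_terms, doc_terms):
    """Label a doc for a query whose terms each act as a filter condition."""
    matched = [term in doc_terms for term in query_terms]
    if all(matched):
        return "Exact"        # A ∧ B: every condition is satisfied
    if any(matched):
        return "Substitute"   # A ∨ B: only some conditions are satisfied
    return "Irrelevant"       # ¬(A ∨ B): no condition is satisfied


query = ["nike", "shoes"]
labels = {
    title: judge(query, set(title.split()))
    for title in ["nike shoes", "adidas shoes", "hair dryer"]
}
print(labels)
```
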
  6. What’s wrong with “Traditional Search”? Query Challenges “women’s running shoe”

    woman vs. women, shoe vs. shoes “headphones not wierless” negation + spelling error “christmas pjs for toddlers” pjs = pajamas, toddlers = kids, children “affordable laptops with good battery life” subjective terms Fragility to spelling variants Linguistic Ambiguity Lack of understanding of semantic relationships well-known typical issues, solutions do exist
  7. “Cat tower for cats with low-mobility” → Accessible and comfortable

    for cats who may have difficulty climbing or jumping → • Low Platforms • Gradual Steps or Ramps • Soft Surface Hidden requirements don’t appear in the query text
  8. Modern Lexical Search is complex Tokenizer Lexical Search (Subcomponents +

    Heuristics) Scoring Function (Field Weights, Boosting Functions) … Search Infrastructure Spelling Corrector Ontology Classifier Each component requires expertise
  9. • The system can’t be optimized directly ◦ The index

    is not differentiable • Add more subcomponents ◦ How to measure the effectiveness of the subcomponents? ◦ Are their offline metrics correlated well with online metrics? Why does Lexical Search become complex? Objective Synonym Dictionary Lexical Search ? The number of synonyms ∝ Revenue?
  10. Semantic Search is much simpler Query Encoder Expected Input (Query)

    Doc Encoder ⨀ Ideal Output (Doc) Addressing subproblems → Directly optimizing the system Simpler, cheaper, and more effective (???)
  11. • Q. Is Semantic Search a Hype? ◦ 👉 Lexical

    Search vs. Semantic Search? • Q. How can we incorporate semantic matching signals? ◦ 👉 Integration of Semantic Search • Q. Is Semantic Search cost-effective? ◦ 👉 Development of Semantic Search Questions to explore
  12. • Background • Lexical Search vs. Semantic Search • Integration

    of Semantic Search • Development of Semantic Search Agenda
  13. Is it a fair comparison? Task setting? Evaluation metrics? Dataset?

    • Public benchmark results may not be entirely reliable • Their experiments are not designed for your domain • You need to verify the effectiveness yourself Implementation?
  14. Publicly Available Datasets

    Dataset | Task | Median Query Length | Median Doc Length | Number of test docs | Number of test queries
    Amazon ESCI | Product Search | 4 | 137 | 482105 | 8956
    FiQA | Question Answering | 10 | 90 | 57638 | 648
    DBPedia | Entity Retrieval | 5 | 47 | 4635922 | 400
    Quora | Duplicate Question Retrieval | 9 | 10 | 522931 | 10000

    https://opensearch.org/blog/semantic-science-benchmarks/
  15. • Product Catalog ◦ title, brand, color, description, … •

    Relevance Judgements ◦ Exact > Substitute > Complement > Irrelevant ESCI Dataset https://github.com/amazon-science/esci-data Exact Substitute
  16. • Choose appropriate metrics • Rank-based Metrics ◦ NDCG vs.

    MRR vs. MAP • Rank-based Metrics vs. Set-based Metrics ◦ NDCG vs. Precision & Recall “Fair” Evaluation ? Appropriate metrics = Metrics correlated well with online metrics
  17. • Gain decays from top to bottom ◦ The value

    of relevant item at 1st >>> at 50th Rank-based = Top-heaviness How gain decays How attention decays (click position distribution)
  18. • MAP (Binary) ◦ Relevant or Irrelevant • NDCG (Graded)

    ◦ Purchase > Like > Click > Not clicked MAP (Binary) vs. NDCG (Graded)
  19. MRR vs. NDCG • MRR: 0.5 • NDCG: 0.6309 •

    MRR: 0.25 • NDCG: 0.885 MRR doesn’t account for the presence of multiple relevant items
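A short sketch of why MRR and NDCG can disagree. The first list (one relevant item at rank 2) reproduces the 0.5 / 0.6309 pair on the slide; the second list is a hypothetical ranking, since the transcript does not show the actual one behind the 0.25 / 0.885 pair. NDCG here uses linear gain and takes the ideal ordering from the retrieved list itself, which is an assumption.

```python
import math


def reciprocal_rank(rels):
    """1 / rank of the first relevant item, or 0.0 if there is none."""
    for rank, rel in enumerate(rels, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg(rels):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))


def ndcg(rels):
    """DCG divided by the DCG of the same gains in ideal (descending) order."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0


one_hit = [0, 1]                            # single relevant item at rank 2
many_hits = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # hypothetical: first hit at rank 4

print(reciprocal_rank(one_hit), round(ndcg(one_hit), 4))  # 0.5 0.6309
print(reciprocal_rank(many_hits), ndcg(many_hits))
```

The second list has a lower reciprocal rank but a higher NDCG, because MRR stops at the first relevant item while NDCG credits every relevant item.
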
  20. Is measuring NDCG alone sufficient? • NDCG: 0.8585 • Precision:

    0.1167 • NDCG: 0.8564 • Precision: 0.8833
  21. What if we have nothing in our inventory? • NDCG:

    1.0 • Precision: 0.0167 • NDCG: 0.6309 • Precision: 0.5
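The numbers on this slide can be reproduced with two tiny result lists. This is an inferred configuration, not one shown in the transcript: one relevant item followed by 59 irrelevant ones, versus a two-item list with the relevant item second. NDCG is computed over the retrieved list only, which is also an assumption.

```python
import math


def ndcg(rels):
    """NDCG over the retrieved list; the ideal ordering is the same list sorted by gain."""
    def dcg(gains):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0


def precision(rels):
    """Fraction of retrieved items that are relevant."""
    return sum(1 for r in rels if r > 0) / len(rels) if rels else 0.0


one_hit_in_sixty = [1] + [0] * 59  # perfect ordering, but almost everything is noise
one_hit_in_two = [0, 1]            # imperfect ordering, half of the list is relevant

print(round(ndcg(one_hit_in_sixty), 4), round(precision(one_hit_in_sixty), 4))  # 1.0 0.0167
print(round(ndcg(one_hit_in_two), 4), round(precision(one_hit_in_two), 4))      # 0.6309 0.5
```
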
  22. Interpretation

    Rating | NDCG | Precision | Recall
    1. Ideal | High | High | High
    2. Great | High | High | Low
    3. Acceptable | Low | High | High
    4. Unacceptable | High | Low | High
    - | Low | Low | Low

    NDCG: High, Prec: Low → Unacceptable

    We found that NDCG has a positive correlation with revenue while Precision has a negative correlation with the number of complaints from users
  23. • Users examine many items to find a better product

    at the lowest price • Users complain when they see irrelevant items Evaluating Product Search • MRR is not appropriate • Small k is insufficient (such as metric@10) • We can’t rely solely on NDCG; Precision matters a lot
  24. “Fair” Implementation • In the literature ◦ Fine-Tuned Semantic Search

    vs. Barely optimized poor BM25 • In reality ◦ Newly built Semantic Search vs. An established search system
  25. Lexical Search Results

    Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
    Lexical (Combined + Relaxation) | 0.5394 | 0.1952 | 0.4368 | 1564
    Lexical (Combined) | 0.5107 | 0.2362 | 0.3798 | 325
    Lexical (Field Weights) | 0.4896 | 0.2557 | 0.3409 | 259
    Lexical (Synonyms) | 0.4723 | 0.2333 | 0.3632 | 325
    Lexical (Naive) | 0.4557 | 0.2535 | 0.3285 | 259

    • Combined = Synonym expansion with different weights, field weights tuned by Bayesian Optimization, phrase match boost, …
    • Relaxation = Query Relaxation is applied for low-hit queries
  26. Lexical Search - Query Relaxation A B A’ “black and

    brown tan steering wheel cover” “black and brown tan steering wheel cover” • NDCG 📈 • Precision 📉
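Query relaxation for low-hit queries can be sketched as follows. The transcript does not say which relaxation strategy was used, so this assumes one common approach: start with all terms required and lower the number of required terms until enough hits come back. The toy documents, `min_hits`, and both function names are hypothetical.

```python
def search(docs, terms, required):
    """Return docs that contain at least `required` of the query terms."""
    return [d for d in docs if sum(t in d.split() for t in terms) >= required]


def search_with_relaxation(docs, query, min_hits=2):
    """Start strict (all terms must match) and relax one term at a time."""
    terms = query.split()
    hits, required = [], len(terms)
    for required in range(len(terms), 0, -1):
        hits = search(docs, terms, required)
        if len(hits) >= min_hits:
            break
    return hits, required


docs = [
    "black steering wheel cover",
    "brown tan steering wheel cover",
    "black and brown tan seat cover",
    "tan leather wallet",
]
hits, required = search_with_relaxation(docs, "black and brown tan steering wheel cover")
print(required, hits)
```

Relaxing the match condition brings in more items, some of them only partially matching, which is consistent with the NDCG gain and Precision loss noted on the slide.
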
  27. Semantic Search to compare Query Encoder Query Product Encoder ⨀

    Text Encoder (LM) Text Encoder (LM) How to sample pairs? Which field to encode? Which loss to use? Thresholding? Features Features Product Which pre-trained model to use? • Incorporating Additional Features: https://arxiv.org/abs/2306.04833 • Multi-Stage Training: https://www.amazon.science/publications/web-scale-semantic-product-search-with-large-language-models Heuristic post-filters?
  28. Semantic Search Results

    Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
    Semantic (CL) | 0.5328 | 0.0651 | 0.5456 | 100
    Semantic (CL + Thresholding) | 0.5221 | 0.1284 | 0.4814 | 67
    Semantic (MLM + CL) | 0.5046 | 0.0613 | 0.5121 | 100
    Semantic (Naive) | 0.4984 | 0.0561 | 0.4771 | 100

    • MLM: Fine-tuning the LM through Masked Language Modeling
    • CL: Fine-tuning the bi-encoder through Contrastive Learning
    • The optimal or reasonable loss, features, and pre-trained model are used
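Thresholding trades Recall for Precision by cutting off low-similarity results instead of always returning the top 100. A minimal sketch with hypothetical similarity scores and relevance labels (the actual threshold is not given in the transcript):

```python
def apply_threshold(results, threshold):
    """Keep only (doc_id, similarity) pairs whose similarity meets the threshold."""
    return [(doc_id, sim) for doc_id, sim in results if sim >= threshold]


def precision_recall(doc_ids, relevant):
    """Set-based precision and recall for a list of retrieved doc ids."""
    found = sum(1 for d in doc_ids if d in relevant)
    precision = found / len(doc_ids) if doc_ids else 0.0
    recall = found / len(relevant) if relevant else 0.0
    return precision, recall


results = [("a", 0.92), ("b", 0.85), ("c", 0.61), ("d", 0.40), ("e", 0.35)]
relevant = {"a", "b", "d"}

before = precision_recall([d for d, _ in results], relevant)
kept = apply_threshold(results, threshold=0.6)
after = precision_recall([d for d, _ in kept], relevant)
print(before, after)
```
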
  29. Lexical Search vs. Semantic Search

    Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
    Lexical (Combined + Relaxation) | 0.5394 | 0.1952 | 0.4368 | 1564
    Semantic (CL) | 0.5328 | 0.0651 | 0.5456 | 100
    Lexical (Combined) | 0.5107 | 0.2362 | 0.3798 | 325
    Semantic (Naive) | 0.4984 | 0.0561 | 0.4771 | 100
    Lexical (Naive) | 0.4557 | 0.2535 | 0.3285 | 259

    NDCG is almost the same
    Precision of Semantic Search is low = Unacceptable
  30. • Queries are often given as a set of keywords

    ◦ Users copy & paste product names • User Survey: ◦ Q. When do you use the platform? ◦ A. When searching for a specific item to buy (66.4%) This result is not surprising because Query: “product attribute attribute …”
  31. NDCG is almost the same, but what is retrieved?

    [Figure: Lexical Search@k and Semantic Search@k result lists over k items, with items marked as unseen, relevant, or irrelevant]
    • Found in lexical search: {1,2,3,5,8} → 5 items
    • Found in semantic search: {3,6,1,9,2,4,7} → 7 items
    • Found in both: {1,3} → 2 items

    Method | NDCG@100
    Lexical (Combined + Relaxation) | 0.5394
    Semantic (CL) | 0.5328
  32. How similar are the two systems?

    [Venn diagram: Products Retrieved from Lexical Search@100, Products Retrieved from Semantic Search@100, Found in Both]
    • In terms of NDCG, they are “the same” but they retrieve different products
    • How do they differ?
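The overlap between two result lists can be measured with plain set operations. A minimal sketch with hypothetical product IDs; the Jaccard index is added here as one convenient summary and is not mentioned in the talk.

```python
def overlap_stats(lexical_ids, semantic_ids):
    """Count shared and system-specific products in two top-k result lists."""
    lex, sem = set(lexical_ids), set(semantic_ids)
    both, union = lex & sem, lex | sem
    return {
        "both": len(both),
        "lexical_only": len(lex - sem),
        "semantic_only": len(sem - lex),
        "jaccard": len(both) / len(union) if union else 0.0,
    }


stats = overlap_stats(["p1", "p2", "p3", "p4"], ["p3", "p4", "p5", "p6", "p7"])
print(stats)
```
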
  33. Lexical < Semantic

    Method | NDCG (Lexical) | NDCG (Semantic)
    harley quinn costume for toddler girls | 0.0 | 0.89
    accidental love by gary soto | 0.0 | 1.0
    baby boy first birthday cookie montser decorations | 0.0 | 0.8936
    I want a long jacket that comes to mid leg in a dark colour and very warm | 0.0 | 0.1673

    Lexical > Semantic

    Method | NDCG (Lexical) | NDCG (Semantic)
    “the first 90 days” (book title) | 0.6166 | 0.1934
    “summit 470” (model name) | 0.8981 | 0.0
    dell 40wh standard charger type m5y1k 14.8v (attributes) | 1.0 | 0.2105
  34. Results - NDCG@100 by query type

    Query Type | Lexical Search | Semantic Search | Number of test queries
    Short Query | 0.5403 | 0.4707 | 5878
    Long Query | 0.3262 | 0.5580 | 452
    Contains Non-Alphabet | 0.4267 | 0.4961 | 4096
    Negation | 0.4322 | 0.5195 | 1624
    Parse Pattern | 0.6184 | 0.6285 | 2250

    Parse Pattern: Queries with some linguistic complexity, extracted using regular expressions
  35. Results are greatly affected by the proportion of the query

    type Manipulating the proportion to win (?) Real Manipulated
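A small worked example of how the query mix decides the winner, using the Short Query and Long Query NDCG@100 values reported in the talk. It assumes, for illustration only, a test set made up of just these two query types and a per-query average; `long_share` is a hypothetical parameter.

```python
LEXICAL = {"short": 0.5403, "long": 0.3262}
SEMANTIC = {"short": 0.4707, "long": 0.5580}


def blended_ndcg(per_type, long_share):
    """Average NDCG for a test set where `long_share` of the queries are long."""
    return (1 - long_share) * per_type["short"] + long_share * per_type["long"]


def winner(long_share):
    lexical = blended_ndcg(LEXICAL, long_share)
    semantic = blended_ndcg(SEMANTIC, long_share)
    return "lexical" if lexical > semantic else "semantic"


for share in (0.1, 0.5):
    print(share, round(blended_ndcg(LEXICAL, share), 4),
          round(blended_ndcg(SEMANTIC, share), 4), winner(share))
```

With 10% long queries Lexical Search comes out ahead; with 50% long queries Semantic Search does, even though neither system changed.
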
  36. • The potential of dense retrieval models is not realized

    ◦ Multi-Modality ◦ Session information ◦ Buyer preferences ◦ ☝ There is no dataset with these features, so we can’t compare • Two systems have different pros and cons ◦ Use them together? Is Semantic Search useless in product search?
  37. • Background • Lexical Search vs. Semantic Search • Integration

    of Semantic Search • Development of Semantic Search Agenda
  38. Lexical Inverted Index Semantic HNSW Rank Fusion Separated UI Phase

    1 Re-Ranking 3 ways to utilize embeddings score = lexical_score * semantic_score
  39. Phase 1 Re-Ranking

    [Diagram: Lexical Inverted Index split into Shard 1 and Shard 2]
    { "title": "...", "item_vector": [...] }
    { "query": { "function_score": { "query": { ... } } }, "rescore": { "window_size": 1000, "query": { "score_mode": "multiply", "rescore_query": { "function_score": { "script_score": { "script": { "params": { "query_vector": [-0.33154297, 0.03274536, 2.1914062, ...] }, "source": "cosineSimilarity(params.query_vector, 'item_vector')" } } } } } } }
    • Compute semantic score on each shard
    • Overcome the limitation “Lexical Search just counts word occurrences”
    • Items that don’t lexically match can’t be retrieved (Recall remains unchanged)
    ANN index is not required
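The effect of the rescore step can be sketched in memory: the top `window_size` lexical hits are re-ordered by `lexical_score * semantic_score`, and everything outside the window keeps its lexical order. This is an illustrative sketch only, not the search engine's implementation; the hit dictionaries and vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rescore(hits, query_vector, window_size):
    """Re-rank the top `window_size` lexical hits by lexical * semantic score."""
    window, rest = hits[:window_size], hits[window_size:]
    rescored = sorted(
        window,
        key=lambda h: h["lexical_score"] * cosine(query_vector, h["item_vector"]),
        reverse=True,
    )
    return rescored + rest


# Hits are assumed to arrive sorted by lexical score, highest first.
hits = [
    {"id": "a", "lexical_score": 10.0, "item_vector": [1.0, 0.0]},
    {"id": "b", "lexical_score": 8.0, "item_vector": [0.6, 0.8]},
    {"id": "c", "lexical_score": 2.0, "item_vector": [0.0, 1.0]},
]
query_vector = [0.0, 1.0]
print([h["id"] for h in rescore(hits, query_vector, window_size=3)])
```

The set of returned items is the same before and after rescoring; only the order changes, which is why Recall stays unchanged.
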
  40. • Make two requests and display results in different UI

    sections like YouTube • Add diversity to SERPs but the mainstream search results don’t change • Low risk, limited gain • Having many options increases cognitive load Separated UI Components
  41. Rank Fusion is the way to go? Source 1 Relevant

    Relevant Relevant Less Relevant Source 1 Source 2 • Cherry-picking the best results from multiple sources (Skimming Effect) • It requires low cognitive load • Recall issue is addressed
  42. Rank Fusion How results from different systems are fused? Lexical

    Inverted Index Semantic HNSW Search Engine ? Rank Fusion is executed before sending results back to the client Coordinating Node
  43. • Product A: 100 from lexical search • Product B:

    0.9 from semantic search Which product should be ranked higher? ? Lexical Inverted Index Semantic HNSW
  44. Rank Fusion Formulation

    score = f(α * norm(score_lex), (1 - α) * norm(score_sem))
    1. Normalize scores (Reciprocal Rank, Min-Max, Borda count, …): raw scores are unbounded (0.0 <= score), normalized scores satisfy 0.0 <= score <= 1
    2. Weight results by α (convex combination): score *= α for one source, score *= (1 - α) for the other
    3. Combine results (SUM or MAX)
  45. 1. Normalize (Transform) scores

    We want the scores to fall within a specific range
    • RR (Reciprocal Rank):
      ◦ 1 / (k + rank), where
        ▪ rank is the 1-based position of the result in its list
        ▪ k is a constant (often 60) that determines the degree of top-heaviness
    • TMM (Theoretical Min-Max):
      ◦ (score - score_min) / (score_max - score_min), where
        ▪ score_max = the score of the top-ranked result
        ▪ score_min = 0
    • Borda Count:
      ◦ (N + 1 - rank), where
        ▪ N = the number of results in the list
    TMM is said to be better in theory, but in practice, there is no significant difference (IMO)
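The three transforms can be sketched as small functions. This sketch reads the input of RR and Borda Count as the 1-based rank of a result in its list, and the input of TMM as the raw score; the sample scores are hypothetical. Note that Borda Count as written yields values from 1 to N rather than values in [0, 1].

```python
def reciprocal_rank(rank, k=60):
    """RR: depends only on the rank; k controls how top-heavy the curve is."""
    return 1.0 / (k + rank)


def theoretical_min_max(score, score_max, score_min=0.0):
    """TMM: min-max scaling with the minimum fixed at a theoretical value (0 here)."""
    if score_max == score_min:
        return 0.0
    return (score - score_min) / (score_max - score_min)


def borda_count(rank, n):
    """Borda Count: the top result gets N points, the last one gets 1."""
    return n + 1 - rank


raw_scores = [12.0, 9.0, 3.0]  # hypothetical lexical scores, best first
ranks = range(1, len(raw_scores) + 1)

rr = [reciprocal_rank(r) for r in ranks]
tmm = [theoretical_min_max(s, score_max=raw_scores[0]) for s in raw_scores]
borda = [borda_count(r, n=len(raw_scores)) for r in ranks]
print(rr, tmm, borda)
```
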
  46. 2. Weight scores

    [Plots: how NDCG changes with different α, shown for α = 0.52, α = 0.5, and α = 0.48]
    When RRF is used and there is no overlap between systems ↓ RR depends only on the rank, so results at the same rank in each list get the same score
  47. • The choice of the merge function ◦ Sum: If

    everyone says “Yes”, it should be more relevant (Chorus Effect) ◦ Max: If an expert says “Yes”, it should be more relevant (Dark Horse Effect) • References ◦ An Analysis of Fusion Functions for Hybrid Retrieval (2023) ◦ Who's #1?: The Science of Rating and Ranking 3. Combine results
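Putting the three steps together, a minimal sketch of convex-combination fusion with a pluggable merge function. It assumes min-max normalization with the minimum fixed at 0 and scores 0.0 for a product missing from one source; the product IDs and scores are hypothetical.

```python
def min_max(scores):
    """Scale {doc_id: score} into [0, 1], with the minimum fixed at 0."""
    top = max(scores.values(), default=0.0)
    return {d: (s / top if top > 0 else 0.0) for d, s in scores.items()}


def fuse(lexical, semantic, alpha=0.5, combine=sum):
    """Normalize, weight by alpha, and merge two result sets into one ranking."""
    lex, sem = min_max(lexical), min_max(semantic)
    fused = {}
    for doc_id in lex.keys() | sem.keys():
        parts = [alpha * lex.get(doc_id, 0.0), (1 - alpha) * sem.get(doc_id, 0.0)]
        fused[doc_id] = combine(parts)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(fused, key=lambda d: (-fused[d], d))


lexical = {"A": 100.0, "B": 60.0, "C": 55.0}  # unbounded lexical scores
semantic = {"D": 0.9, "B": 0.63, "C": 0.45}   # cosine-like scores

print(fuse(lexical, semantic, alpha=0.5, combine=sum))  # ['B', 'C', 'A', 'D']
print(fuse(lexical, semantic, alpha=0.5, combine=max))  # ['A', 'D', 'B', 'C']
```

SUM promotes B and C, which both systems agree on (Chorus Effect), while MAX promotes A and D, each the top pick of a single system (Dark Horse Effect).
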
  48. Rank Fusion

    Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
    Rank Fusion 1 (Min-Max) | 0.5945 | 0.0669 | 0.5667 | 392
    Rank Fusion 2 (RRF) | 0.5944 | 0.0719 | 0.6079 | 392
    Rank Fusion 3 (Borda) | 0.5935 | 0.0718 | 0.6074 | 392
    Fuse top 5 semantic results | 0.5840 | 0.1963 | 0.4400 | 380
    Lexical Search | 0.5394 | 0.1952 | 0.4368 | 1564
    Semantic Search | 0.5328 | 0.0651 | 0.5456 | 100

    • α is fine-tuned in advance
    • Small k from Semantic Search appears to be more practical
      ◦ High NDCG, decent Precision
    Rank Fusion 1: score = 0.56 * MM(score_lex) + (1 - 0.56) * MM(score_sem)
  49. Limitation of Linear Rank Fusion Items for “I want a

    long jacket that comes to …” may not be found in Lexical Search SUM selects items from the top right lexical score: 0.51 + semantic score: 0.51 semantic score: 1.0 >
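The arithmetic behind this limitation, as a tiny sketch using the scores from the slide. MAX is shown only for contrast; the talk does not claim it as the fix.

```python
def linear_sum(lexical_score, semantic_score):
    """Unweighted SUM fusion of two normalized scores."""
    return lexical_score + semantic_score


both_mediocre = (0.51, 0.51)  # weakly matched by both systems
semantic_only = (0.0, 1.0)    # perfect semantic match, not found lexically

print(linear_sum(*both_mediocre) > linear_sum(*semantic_only))  # True
print(max(both_mediocre) > max(semantic_only))                  # False
```

Under SUM, an item that both systems score just above average outranks an item that only Semantic Search finds, however strong that semantic match is.
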
  50. • Background • Lexical Search vs. Semantic Search • Integration

    of Semantic Search • Development of Semantic Search Agenda
  51. System Overview (simplified) Search Engine PubSub Search Engine Reranker Embedder

    Indexer Retriever Clients GPUs LLMs BigQuery Datasets
  52. Relevant Irrelevant Retrieved TP FP Not Retrieved FN TN Implicit/Explicit

    Relevance Judgements Random Items Search Logs Evaluation using search logs makes the comparison unfair for Semantic Search Emphasizes lexical matching signals Rerankers can be trained using search logs but…
  53. Obtaining Labels (False Negatives) • Hire annotators? ◦ Explicitly ask

    whether an item is relevant or not ▪ Relevance judgement is subjective ◦ Google has 170 pages of guidelines for annotators • LLM as a Judge ◦ “LLM labellers can do better on this task than human labellers for a fraction of the cost” (link) • Debiasing dataset is still important ◦ Recall the importance of the proportion
  54. Obtaining Labels (All) Unseen Item Set ナイキ → Not engaged

    → Not engaged → Engaged → Not engaged → Engaged ナイキ Semantic Search Results adidas FP TP FN TN nike shoes
  55. Encoding items at index time • With the model, the

    job becomes 7x slower, which is unacceptable ↓ • Developed a new service (Triton) ◦ Dozens of nodes with GPUs
  56. Retrieving similar items

    Engine | Average Response Time (ms) | Throughput (query/sec)
    Elasticsearch 8.12 (1 shard, int8_hnsw) | 235 | 42.01
    Vespa 8.287.20 | 10 | 812.73
    Qdrant 1.3 (Scalar Quantization) | 18 | 507.94

    on a benchmark dataset containing 15,000,000 docs
  57. Offline Evaluation is costly

    • Evaluating retrieval = E2E testing is costly
    • Especially, larger models incur higher costs
      ◦ It slows down the iteration speed and prevents us from getting rapid feedback
      ◦ 4 hours (384D) → 10 hours (768D)

    Method | Dimension | Num Params | Ranking NDCG@100
    sentence-transformers/multi-qa-mpnet-base-dot-v1 | 768 | 109M | 0.9197
    sentence-transformers/all-MiniLM-L6-v2 | 384 | 22.7M | 0.9164
    sentence-transformers/msmarco-distilbert-base-v4 | 768 | 66.4M | 0.9123
  58. • The maintenance and execution costs (GPUs, LLMs, …) are

    significantly high • Semantic Search may not be effective for all queries Potential Gains vs. Costs (ROI) Lexical Inverted Index Semantic ANN Index Can we simplify Lexical Search? Is it possible to reduce the costs?
  59. • Opportunity size ◦ Phase 1 Reranking ◦ Separated UI

    ◦ Rank Fusion Potential Gains vs. Costs (ROI) • Costs ◦ Latency ◦ Throughput ◦ Cost / 1M queries ◦ GPU cost / month ◦ Cost for dataset creation ◦ Engineering cost
  60. • Opportunity size ◦ Phase 1 Reranking ◦ Separated UI

    ◦ Rank Fusion Potential Gains vs. Costs (ROI) • Costs ◦ Latency ◦ Throughput ◦ Cost / 1M queries ◦ GPU cost / month ◦ Cost for dataset creation ◦ Engineering cost Optimal model might vary depending on the use case Which search engine should we adopt? What is the best way to show semantically matching items? The cost of obtaining labels remains high Latency is increasing Do we need another search team dedicated to semantic search? GPU shortage Utilizing visual features is great but significantly slows down iteration What’s the point if users don’t want semantic search? How can we measure the opportunity size? Bias in dataset
  61. User interviews revealed that when users can’t find appropriate items

    on our platform, they switch to Google. It highlights the importance of investing in new technology to stay competitive. “Keyword search is something that old people do” Query: “product attribute attribute…” → “What I want is …”
  62. Summary • Semantic Search is not a silver bullet •

    Practical Approach ◦ Understand the user behavior ◦ Choose the right metrics ◦ Optimize the system while considering costs • Basics are still important in the era of AI ◦ Just as Lexical Search can’t be replaced by Semantic Search, search engineers can’t be replaced by AI engineers (for now) rejasupotaro rejasupotaro Any thoughts? 👉
  63. • Thanks to Sho Yokoi at Tohoku University for his

    invaluable advice on experimental setup and modeling Acknowledgement