A Practical Approach To Semantic Search

How can embeddings improve your product?
Kentaro Takiguchi @ Berlin Buzzwords 2024

About Me and Company
• Search Relevance Engineer (Linguistics and Evaluation)
• Mercari is a Japanese two-sided marketplace platform
  ◦ Users buy and sell their items ≈ Product Search

How (Lexical) (Product) Search Works
System design varies depending on the domain
• Question Answering
  ◦ Query: “What are some examples of unsecured loans?”
  ◦ Doc: “Auto loans are secured against the car. "Signature" loans, from a bank that knows and trusts you, are typically unsecured. unsecured loans other than informal ones or these are fairly rare. Most lenders don't want to take the additional risk, or balance that risk with a high enough interest rate to make the unsecured loan unattractive.”
  ◦ Retrieve the doc that satisfies the central aspect
• Product Search
  ◦ Query: “headphones noise-cancelling”
  ◦ Doc: “bluetooth noise-cancelling headphones 60H playtime High Res Audio”
  ◦ Each query term acts as a filter condition
  ◦ Users expect the product to satisfy all the requirements

Relevance Definition
• Question Answering
  ◦ The extent to which the answer pertains to the topic of the question
  ◦ Relies only loosely on term overlap; not all query terms need to appear in the doc
• Product Search
  ◦ How well the item satisfies the given specifications
  ◦ Query terms are used to narrow down the search space

e.g. Query: “nike shoes” (A = nike, B = shoes)
• Exact: A ∧ B
• Substitute: A ∨ B
• Irrelevant: ¬(A ∨ B)
Examples:
• “nike shoes” 👉 Exact
• “adidas shoes” 👉 Substitute
• “hair dryer” 👉 Irrelevant
Note: Each filter condition is not necessarily a single term
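The filter-condition view of relevance can be sketched in a few lines. This is only an illustration: it assumes each condition is a plain substring check on the product title, and the function name `label_relevance` and the condition list are hypothetical, not something from the talk.

```python
def label_relevance(title: str, conditions: list[str]) -> str:
    """Label a product by how many query conditions its title satisfies."""
    text = title.lower()
    satisfied = [c for c in conditions if c.lower() in text]
    if len(satisfied) == len(conditions):
        return "Exact"       # A ∧ B: every condition holds
    if satisfied:
        return "Substitute"  # A ∨ B: only some conditions hold
    return "Irrelevant"      # ¬(A ∨ B): no condition holds


# Hypothetical conditions derived from the query "nike shoes".
conditions = ["nike", "shoes"]
for title in ["Nike running shoes", "adidas shoes", "hair dryer"]:
    print(title, "->", label_relevance(title, conditions))
```

A real system would match on structured attributes (brand, category) rather than substrings, since a condition is not necessarily a single term.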

Is Lexical Search dying?
https://arxiv.org/abs/2004.04906
https://arxiv.org/abs/2010.10469

What’s wrong with “Traditional Search”?
Query                                        Challenges
“women’s running shoe”                       woman vs. women, shoe vs. shoes
“headphones not wierless”                    negation + spelling error
“christmas pjs for toddlers”                 pjs = pajamas, toddlers = kids, children
“affordable laptops with good battery life”  subjective terms
• Fragility to spelling variants
• Linguistic ambiguity
• Lack of understanding of semantic relationships
These are well-known, typical issues, and solutions do exist

“Cat tower for cats with low-mobility”
→ Accessible and comfortable for cats who may have difficulty climbing or jumping
→
• Low Platforms
• Gradual Steps or Ramps
• Soft Surface
Hidden requirements don’t appear in the query text

Modern Lexical Search is complex
[Diagram: Lexical Search (Subcomponents + Heuristics): Tokenizer, Spelling Corrector, Ontology, Classifier, Scoring Function (Field Weights, Boosting Functions), …, Search Infrastructure]
Each component requires expertise

Why does Lexical Search become complex?
• The system can’t be optimized directly
  ◦ The index is not differentiable
• Add more subcomponents
  ◦ How to measure the effectiveness of the subcomponents?
  ◦ Are their offline metrics correlated well with online metrics?
[Diagram: Synonym Dictionary, Lexical Search, ?, Objective]
The number of synonyms ∝ Revenue?

Semantic Search is much simpler
[Diagram: Expected Input (Query) → Query Encoder ⨀ Doc Encoder ← Ideal Output (Doc)]
Addressing subproblems → Directly optimizing the system
Simpler, cheaper, and more effective (???)

Questions to explore
• Q. Is Semantic Search a Hype?
  ◦ 👉 Lexical Search vs. Semantic Search?
• Q. How can we incorporate semantic matching signals?
  ◦ 👉 Integration of Semantic Search
• Q. Is Semantic Search cost-effective?
  ◦ 👉 Development of Semantic Search

Agenda
• Background
• Lexical Search vs. Semantic Search
• Integration of Semantic Search
• Development of Semantic Search

Is it a fair comparison?
Task setting? Evaluation metrics? Dataset? Implementation?
• Public benchmark results may not be entirely reliable
• Their experiments are not designed for your domain
• You need to verify the effectiveness yourself

Publicly Available Datasets
Dataset      Task                          Median Query Length  Median Doc Length  Number of test docs  Number of test queries
Amazon ESCI  Product Search                4                    137                482105               8956
FiQA         Question Answering            10                   90                 57638                648
DBPedia      Entity Retrieval              5                    47                 4635922              400
Quora        Duplicate Question Retrieval  9                    10                 522931               10000
https://opensearch.org/blog/semantic-science-benchmarks/

ESCI Dataset
https://github.com/amazon-science/esci-data
• Product Catalog
  ◦ title, brand, color, description, …
• Relevance Judgements
  ◦ Exact > Substitute > Complement > Irrelevant
[Figure: example products labelled Exact and Substitute]

“Fair” Evaluation
• Choose appropriate metrics
• Rank-based Metrics
  ◦ NDCG vs. MRR vs. MAP
• Rank-based Metrics vs. Set-based Metrics
  ◦ NDCG vs. Precision & Recall
Appropriate metrics = Metrics correlated well with online metrics

Rank-based = Top-heaviness
• Gain decays from top to bottom
  ◦ The value of a relevant item at 1st >>> at 50th
[Figures: How gain decays; How attention decays (click position distribution)]

MAP (Binary) vs. NDCG (Graded)
• MAP (Binary)
  ◦ Relevant or Irrelevant
• NDCG (Graded)
  ◦ Purchase > Like > Click > Not clicked

MRR (First) vs. NDCG (Cumulative)
[Figure panels: MRR (First); NDCG (Cumulative)]

MRR vs. NDCG
• Result list 1: MRR: 0.5, NDCG: 0.6309
• Result list 2: MRR: 0.25, NDCG: 0.885
MRR doesn’t account for the presence of multiple relevant items
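A minimal sketch of both metrics makes the difference concrete. It assumes linear gain and an ideal ranking computed from the retrieved list only; the example lists are assumptions chosen for illustration, since the slide's figures are not part of this transcript.

```python
import math


def reciprocal_rank(relevance: list[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def dcg(relevance: list[int]) -> float:
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))


def ndcg(relevance: list[int]) -> float:
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


# One relevant item at rank 2.
single_hit = [0, 1, 0, 0]
print(reciprocal_rank(single_hit), round(ndcg(single_hit), 4))  # 0.5 0.6309

# The reciprocal rank ignores everything after the first relevant item.
print(reciprocal_rank([0, 0, 0, 1]) == reciprocal_rank([0, 0, 0, 1, 1, 1]))  # True
```

MRR is the mean of `reciprocal_rank` over queries, so adding relevant items below the first hit never changes it, while NDCG accumulates their gain.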

Is measuring NDCG alone sufficient?
• Result list 1: NDCG: 0.8585, Precision: 0.1167
• Result list 2: NDCG: 0.8564, Precision: 0.8833

What if we have nothing in our inventory?
• Result list 1: NDCG: 1.0, Precision: 0.0167
• Result list 2: NDCG: 0.6309, Precision: 0.5
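The numbers on this slide can be reproduced with two toy result lists. The lists are assumptions (one relevant item followed by 59 irrelevant ones, and a two-item list with the relevant item second); the original figures are not in the transcript. NDCG here uses linear gain and an ideal ranking built from the retrieved list.

```python
import math


def precision(relevance: list[int]) -> float:
    """Fraction of retrieved items that are relevant."""
    if not relevance:
        return 0.0
    return sum(1 for rel in relevance if rel > 0) / len(relevance)


def ndcg(relevance: list[int]) -> float:
    """NDCG with linear gain; the ideal ranking reorders the same list."""
    def dcg(rels: list[int]) -> float:
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))

    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


padded = [1] + [0] * 59  # one relevant item on top of 59 irrelevant ones
short = [0, 1]           # two items, the relevant one ranked second

print(round(ndcg(padded), 4), round(precision(padded), 4))  # 1.0 0.0167
print(round(ndcg(short), 4), round(precision(short), 4))    # 0.6309 0.5
```

The padded list scores a perfect NDCG even though almost everything shown to the user is irrelevant, which is why Precision has to be tracked alongside it.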

Interpretation
                 NDCG  Precision  Recall
1. Ideal         High  High       High
2. Great         High  High       Low
3. Acceptable    Low   High       High
4. Unacceptable  High  Low        High
-                Low   Low        Low
NDCG: High, Prec: Low → Unacceptable
We found that NDCG has a positive correlation with revenue while Precision has a negative correlation with the number of complaints from users

Evaluating Product Search
• Users examine many items to find a better product at the lowest price
• Users complain when they see irrelevant items
Therefore:
• MRR is not appropriate
• Small k is insufficient (such as metric@10)
• We can’t rely solely on NDCG; Precision matters a lot

“Fair” Implementation
• In the literature
  ◦ Fine-tuned Semantic Search vs. a barely optimized, poor BM25
• In reality
  ◦ Newly built Semantic Search vs. an established search system

Lexical Search Results
Method                           NDCG@100  Precision@100  Recall@100  Total Hits
Lexical (Combined + Relaxation)  0.5394    0.1952         0.4368      1564
Lexical (Combined)               0.5107    0.2362         0.3798      325
Lexical (Field Weights)          0.4896    0.2557         0.3409      259
Lexical (Synonyms)               0.4723    0.2333         0.3632      325
Lexical (Naive)                  0.4557    0.2535         0.3285      259
• Combined = Synonym expansion with different weights, field weights tuned by Bayesian Optimization, phrase match boost, …
• Relaxation = Query Relaxation is applied for low-hit queries

Lexical Search - Query Relaxation
[Diagram: result sets A, B, and the relaxed set A’ for the query “black and brown tan steering wheel cover”]
• NDCG 📈
• Precision 📉

Semantic Search to compare
[Diagram: Query → Query Encoder (Text Encoder (LM) + Features) ⨀ Product Encoder (Text Encoder (LM) + Features) ← Product]
Design questions:
• Which pre-trained model to use?
• Which field to encode?
• How to sample pairs?
• Which loss to use?
• Thresholding?
• Heuristic post-filters?
References:
• Incorporating Additional Features: https://arxiv.org/abs/2306.04833
• Multi-Stage Training: https://www.amazon.science/publications/web-scale-semantic-product-search-with-large-language-models

Semantic Search Results
Method                        NDCG@100  Precision@100  Recall@100  Total Hits
Semantic (CL)                 0.5328    0.0651         0.5456      100
Semantic (CL + Thresholding)  0.5221    0.1284         0.4814      67
Semantic (MLM + CL)           0.5046    0.0613         0.5121      100
Semantic (Naive)              0.4984    0.0561         0.4771      100
• MLM: Fine-tuning the LM through Masked Language Modeling
• CL: Fine-tuning the bi-encoder through Contrastive Learning
• The optimal or a reasonable loss, feature set, and pre-trained model are used

Lexical Search vs. Semantic Search
Method                           NDCG@100  Precision@100  Recall@100  Total Hits
Lexical (Combined + Relaxation)  0.5394    0.1952         0.4368      1564
Semantic (CL)                    0.5328    0.0651         0.5456      100
Lexical (Combined)               0.5107    0.2362         0.3798      325
Semantic (Naive)                 0.4984    0.0561         0.4771      100
Lexical (Naive)                  0.4557    0.2535         0.3285      259
• NDCG is almost the same
• Precision of Semantic Search is low = Unacceptable

This result is not surprising because
• Queries are often given as a set of keywords
  ◦ Users copy & paste product names
• User Survey:
  ◦ Q. When do you use the platform?
  ◦ A. When searching for a specific item to buy (66.4%)
Query: “product attribute attribute …”

NDCG is almost the same but what are retrieved?
Method                           NDCG@100
Lexical (Combined + Relaxation)  0.5394
Semantic (CL)                    0.5328
[Figure: numbered items among k items retrieved by Lexical Search@k and Semantic Search@k, marked relevant, irrelevant, or unseen]
• Found in lexical search: {1,2,3,5,8} → 5 items
• Found in semantic search: {3,6,1,9,2,4,7} → 7 items
• Found in both: {1,3} → 2 items

How similar are the two systems?
[Venn diagram: Products Retrieved from Lexical Search@100; Products Retrieved from Semantic Search@100; Found in Both]
• In terms of NDCG, they are “the same” but they retrieve different products
• How do they differ?

Lexical < Semantic
NDCG (Lexical)  NDCG (Semantic)  Query
0.0             0.89             harley quinn costume for toddler girls
0.0             1.0              accidental love by gary soto
0.0             0.8936           baby boy first birthday cookie montser decorations
0.0             0.1673           I want a long jacket that comes to mid leg in a dark colour and very warm
Lexical > Semantic
NDCG (Lexical)  NDCG (Semantic)  Query
0.6166          0.1934           “the first 90 days” (book title)
0.8981          0.0              “summit 470” (model name)
1.0             0.2105           dell 40wh standard charger type m5y1k 14.8v (attributes)

Results - NDCG@100 by query type
Query Type             Lexical Search  Semantic Search  Number of test queries
Short Query            0.5403          0.4707           5878
Long Query             0.3262          0.5580           452
Contains Non-Alphabet  0.4267          0.4961           4096
Negation               0.4322          0.5195           1624
Parse Pattern          0.6184          0.6285           2250
Parse Pattern: Queries with some linguistic complexity, extracted using regular expressions

Results are greatly affected by the proportion of the query type
Manipulating the proportion to win (?)
[Figure panels: Real; Manipulated]
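The effect of the query mix can be checked with simple arithmetic. The sketch below takes the Short Query and Long Query rows from the NDCG@100 by query type table and averages them under two mixes. Treating the two types as disjoint, and ignoring the other rows, is an assumption made only to keep the example small; the helper name `mixed_ndcg` is hypothetical.

```python
# NDCG@100 and query counts copied from the "by query type" table.
ndcg_by_type = {
    "short": {"lexical": 0.5403, "semantic": 0.4707, "count": 5878},
    "long": {"lexical": 0.3262, "semantic": 0.5580, "count": 452},
}


def mixed_ndcg(system: str, weights: dict[str, float]) -> float:
    """Weighted average of per-type NDCG for one system under a query mix."""
    total = sum(weights.values())
    return sum(ndcg_by_type[t][system] * w for t, w in weights.items()) / total


real_mix = {t: float(row["count"]) for t, row in ndcg_by_type.items()}
even_mix = {"short": 1.0, "long": 1.0}  # manipulated: long queries over-represented

for name, mix in [("real", real_mix), ("even", even_mix)]:
    lexical = mixed_ndcg("lexical", mix)
    semantic = mixed_ndcg("semantic", mix)
    winner = "lexical" if lexical > semantic else "semantic"
    print(f"{name}: lexical={lexical:.4f} semantic={semantic:.4f} winner={winner}")
```

With the real counts the short queries dominate and lexical search wins; with an even mix the ranking flips, without either system changing.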

Is Semantic Search useless in product search?
• The potential of dense retrieval models is not realized
  ◦ Multi-Modality
  ◦ Session information
  ◦ Buyer preferences
  ◦ ☝ There is no dataset with these features, so we can’t compare
• Two systems have different pros and cons
  ◦ Use them together?

3 ways to utilize embeddings
• Phase 1 Re-Ranking: score = lexical_score * semantic_score
• Separated UI
• Rank Fusion
[Diagram: Lexical Inverted Index; Semantic HNSW]

Phase 1 Re-Ranking
[Diagram: Lexical Inverted Index on Shard 1 and Shard 2]
Indexed document:
  { "title": "...", "item_vector": [...] }
Search request:
  {
    "query": { "function_score": { "query": { ... } } },
    "rescore": {
      "window_size": 1000,
      "query": {
        "score_mode": "multiply",
        "rescore_query": {
          "function_score": {
            "script_score": {
              "script": {
                "params": { "query_vector": [-0.33154297, 0.03274536, 2.1914062, ...] },
                "source": "cosineSimilarity(params.query_vector, 'item_vector')"
              }
            }
          }
        }
      }
    }
  }
• Compute semantic score on each shard
• Overcome the limitation “Lexical Search just counts word occurrences”
• Items that don’t lexically match can’t be retrieved (Recall remains unchanged)
• ANN index is not required

Separated UI Components
• Make two requests and display results in different UI sections like YouTube
• Add diversity to SERPs but the mainstream search results don’t change
• Low risk, limited gain
• Having many options increases cognitive load

Rank Fusion is the way to go?
[Diagram: results from Source 1 and Source 2, marked Relevant or Less Relevant, merged into one list]
• Cherry-picking the best results from multiple sources (Skimming Effect)
• It requires low cognitive load
• Recall issue is addressed

Rank Fusion
How are results from different systems fused?
[Diagram: Search Engine with Lexical Inverted Index and Semantic HNSW; Coordinating Node; ?]
Rank Fusion is executed before sending results back to the client

Which product should be ranked higher?
• Product A: 100 from lexical search (Lexical Inverted Index)
• Product B: 0.9 from semantic search (Semantic HNSW)

Rank Fusion Formulation
score = f(α * norm(score_lex), (1 - α) * norm(score_sem))
1. Normalize scores (Reciprocal Rank, Min-Max, Borda count, …)
   ◦ 0.0 <= score (unbounded) → 0.0 <= score <= 1
2. Weight results by α (convex combination)
   ◦ Lexical: score *= α
   ◦ Semantic: score *= (1 - α)
3. Combine results (SUM or MAX)

1. Normalize (Transform) scores
We want the scores to fall within a specific range
• RR (Reciprocal Rank):
  ◦ 1 / (k + rank), where
    ▪ k is a constant (often 60) that determines the degree of top-heaviness
• TMM (Theoretical Min-Max):
  ◦ (score - score_min) / (score_max - score_min), where
    ▪ score_max = the score of the top-ranked result
    ▪ score_min = 0
• Borda Count:
  ◦ (N + 1 - rank), where
    ▪ N = the number of results in the list
TMM is said to be better in theory, but in practice, there is no significant difference (IMO)
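The three transforms can be sketched as small functions. This is an illustration under stated assumptions: RR and Borda Count are computed from the 1-based rank position, TMM uses a theoretical minimum of 0, and the function names and example scores are hypothetical.

```python
def reciprocal_rank_scores(ranked_ids: list[str], k: int = 60) -> dict[str, float]:
    """RR: 1 / (k + rank); a larger k flattens the top-heaviness."""
    return {doc_id: 1.0 / (k + rank) for rank, doc_id in enumerate(ranked_ids, start=1)}


def theoretical_min_max(scores: dict[str, float], score_min: float = 0.0) -> dict[str, float]:
    """TMM: min-max scaling with a fixed theoretical minimum instead of the observed one."""
    if not scores:
        return {}
    span = max(scores.values()) - score_min
    return {d: (s - score_min) / span if span > 0 else 0.0 for d, s in scores.items()}


def borda_count(ranked_ids: list[str]) -> dict[str, float]:
    """Borda Count: N + 1 - rank, so the top item gets N points and the last gets 1."""
    n = len(ranked_ids)
    return {doc_id: float(n + 1 - rank) for rank, doc_id in enumerate(ranked_ids, start=1)}


# Hypothetical raw lexical scores for three products.
lexical_scores = {"a": 100.0, "b": 40.0, "c": 10.0}
ranked = sorted(lexical_scores, key=lexical_scores.get, reverse=True)

print(reciprocal_rank_scores(ranked))
print(theoretical_min_max(lexical_scores))
print(borda_count(ranked))
```

RR and Borda Count discard the raw score magnitudes and keep only the order, while TMM preserves the relative gaps between scores.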

2. Weight scores
How NDCG changes with different α
[Figure panels: α = 0.52; α = 0.5; α = 0.48]
When RRF is used and there is no overlap between systems
↓
RR depends only on the rank = items at the same rank in both lists get the same score

3. Combine results
• The choice of the merge function
  ◦ Sum: If everyone says “Yes”, it should be more relevant (Chorus Effect)
  ◦ Max: If an expert says “Yes”, it should be more relevant (Dark Horse Effect)
• References
  ◦ An Analysis of Fusion Functions for Hybrid Retrieval (2023)
  ◦ Who's #1?: The Science of Rating and Ranking
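The weighting and merging steps can be sketched together. This is a minimal illustration assuming both inputs are already normalized to [0, 1] and that a product missing from one system contributes 0 there; the function name `fuse` and the example scores are hypothetical.

```python
from typing import Callable, Iterable


def fuse(
    lexical: dict[str, float],
    semantic: dict[str, float],
    alpha: float = 0.5,
    combine: Callable[[Iterable[float]], float] = sum,
) -> list[tuple[str, float]]:
    """Weight two normalized score maps by alpha, merge with `combine`, sort best-first."""
    fused = {}
    for doc_id in lexical.keys() | semantic.keys():
        weighted = [
            alpha * lexical.get(doc_id, 0.0),         # missing items contribute 0
            (1 - alpha) * semantic.get(doc_id, 0.0),
        ]
        fused[doc_id] = combine(weighted)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical normalized scores: "both" is moderately good in each system.
lexical = {"both": 0.6, "lex_only": 0.9}
semantic = {"both": 0.6, "sem_only": 1.0}

print(fuse(lexical, semantic, combine=sum)[0][0])  # both      (Chorus Effect)
print(fuse(lexical, semantic, combine=max)[0][0])  # sem_only  (Dark Horse Effect)
```

With SUM, agreement between the systems outranks a single strong vote; with MAX, the strongest single vote wins.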

Rank Fusion
Method                       NDCG@100  Precision@100  Recall@100  Total Hits
Rank Fusion 1 (Min-Max)      0.5945    0.0669         0.5667      392
Rank Fusion 2 (RRF)          0.5944    0.0719         0.6079      392
Rank Fusion 3 (Borda)        0.5935    0.0718         0.6074      392
Fuse top 5 semantic results  0.5840    0.1963         0.4400      380
Lexical Search               0.5394    0.1952         0.4368      1564
Semantic Search              0.5328    0.0651         0.5456      100
• α is fine-tuned in advance
• Small k from Semantic Search appears to be more practical
  ◦ High NDCG, decent Precision
Rank Fusion 1: score = 0.56 * MM(score_lex) + (1 - 0.56) * MM(score_sem)

Limitation of Linear Rank Fusion
• Items for “I want a long jacket that comes to …” may not be found in Lexical Search
• SUM selects items from the top right of the score plot
  ◦ lexical score: 0.51 + semantic score: 0.51 > semantic score: 1.0

Two Requests → Global Reranking
[Diagram: Lexical Inverted Index + Semantic HNSW → Candidate Set → Global Reranker]

System Overview (simplified)
[Diagram components: Search Engine, PubSub, Search Engine, Reranker, Embedder, Indexer, Retriever, Clients, GPUs, LLMs, BigQuery, Datasets]

               Relevant  Irrelevant
Retrieved      TP        FP
Not Retrieved  FN        TN
Label sources: Implicit/Explicit Relevance Judgements, Random Items, Search Logs
• Evaluation using search logs makes the comparison unfair for Semantic Search
  ◦ Emphasizes lexical matching signals
• Rerankers can be trained using search logs but…

Obtaining Labels (False Negatives)
• Hire annotators?
  ◦ Explicitly ask whether an item is relevant or not
    ▪ Relevance judgement is subjective
  ◦ Google has 170 pages of guidelines for annotators
• LLM as a Judge
  ◦ “LLM labellers can do better on this task than human labellers for a fraction of the cost” (link)
• Debiasing dataset is still important
  ◦ Recall the importance of the proportion

Obtaining Labels (All)
[Figure: for the query “ナイキ” (Nike), items from the Unseen Item Set and the Semantic Search Results (e.g. “adidas”, “nike shoes”) are marked Engaged or Not engaged and mapped to TP, FP, FN, TN]

[Example query: “amazon fire stick tv”]
LLM’s output has a high correlation with human labels

Encoding items at index time
• With the model, the job becomes 7x slower, which is unacceptable
↓
• Developed a new service (Triton)
  ◦ Dozens of nodes with GPUs

Retrieving similar items
Engine                                   Average Response Time (ms)  Throughput (query/sec)
Elasticsearch 8.12 (1 shard, int8_hnsw)  235                         42.01
Vespa 8.287.20                           10                          812.73
Qdrant 1.3 (Scalar Quantization)         18                          507.94
Measured on a benchmark dataset containing 15,000,000 docs

Offline Evaluation is costly
• Evaluating retrieval = E2E testing is costly
• Especially, larger models incur higher costs
  ◦ It slows down the iteration speed and prevents us from getting rapid feedback
  ◦ 4 hours (384D) → 10 hours (768D)
Method                                            Dimension  Num Params  Ranking NDCG@100
sentence-transformers/multi-qa-mpnet-base-dot-v1  768        109M        0.9197
sentence-transformers/all-MiniLM-L6-v2            384        22.7M       0.9164
sentence-transformers/msmarco-distilbert-base-v4  768        66.4M       0.9123

Potential Gains vs. Costs (ROI)
• The maintenance and execution costs (GPUs, LLMs, …) are significantly high
• Semantic Search may not be effective for all queries
[Diagram: Lexical Inverted Index; Semantic ANN Index]
• Can we simplify Lexical Search?
• Is it possible to reduce the costs?

Potential Gains vs. Costs (ROI)
• Opportunity size
  ◦ Phase 1 Reranking
  ◦ Separated UI
  ◦ Rank Fusion
• Costs
  ◦ Latency
  ◦ Throughput
  ◦ Cost / 1M queries
  ◦ GPU cost / month
  ◦ Cost for dataset creation
  ◦ Engineering cost

Potential Gains vs. Costs (ROI)
• Optimal model might vary depending on the use case
• Which search engine should we adopt?
• What is the best way to show semantically matching items?
• The cost of obtaining labels remains high
• Latency is increasing
• Do we need another search team dedicated to semantic search?
• GPU shortage
• Utilizing visual features is great but significantly slows down iteration
• What’s the point if users don’t want semantic search?
• How can we measure the opportunity size?
• Bias in dataset

User interviews revealed that when users can’t find appropriate items on our platform, they switch to Google. It highlights the importance of investing in new technology to stay competitive.
“Keyword search is something that old people do”
Query: “product attribute attribute…” → “What I want is …”

Summary

Summary
• Semantic Search is not a silver bullet
• Practical Approach
  ◦ Understand the user behavior
  ◦ Choose the right metrics
  ◦ Optimize the system while considering costs
• Basics are still important in the era of AI
  ◦ Just as Lexical Search can’t be replaced by Semantic Search, search engineers can’t be replaced by AI engineers (for now)
Any thoughts? 👉 rejasupotaro

Acknowledgement
• Thanks to Sho Yokoi at Tohoku University for his invaluable advice on experimental setup and modeling
