Slide 1

Slide 1 text

A Practical Approach To Semantic Search How can embeddings improve your product? Kentaro Takiguchi @Berlin Buzzwords 2024

Slide 2

Slide 2 text

About Me and Company ● Search Relevance Engineer (Linguistics and Evaluation) ● Mercari is a Japanese two-sided marketplace platform ○ Users buy and sell their items ≈ Product Search

Slide 3

Slide 3 text

How (Lexical) (Product) Search Works: system design varies depending on the domain ● Question Answering ○ Query: “What are some examples of unsecured loans?” ○ Doc: “Auto loans are secured against the car. "Signature" loans, from a bank that knows and trusts you, are typically unsecured. unsecured loans other than informal ones or these are fairly rare. Most lenders don't want to take the additional risk, or balance that risk with a high enough interest rate to make the unsecured loan unattractive.” ○ Retrieve the doc that addresses the central aspect of the question ● Product Search ○ Query: “headphones noise-cancelling” ○ Doc: “bluetooth noise-cancelling headphones 60H playtime High Res Audio” ○ Each query term acts as a filter condition; users expect the product to satisfy all the requirements

Slide 4

Slide 4 text

Relevance Definition ● Question Answering ○ The extent to which the answer pertains to the topic of the question ○ Since it merely relies on term overlap, it is not necessary for all the query terms to appear in the doc ● Product Search ○ How well the item satisfies the given specifications ○ Query terms are used to narrow down the search space

Slide 5

Slide 5 text

e.g. Query: “nike shoes” → filter conditions A = “nike”, B = “shoes”: Exact (A ∧ B), Substitute (A ∨ B), Irrelevant ¬(A ∨ B) Note: Each filter condition is not necessarily a single term ● “nike shoes” 👉 Exact ● “adidas shoes” 👉 Substitute ● “hair dryer” 👉 Irrelevant
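As a rough illustration of this filter-condition view, here is a toy Python sketch (hypothetical helper, naive whitespace matching; as the note says, real filter conditions are not necessarily single terms):

    # Toy sketch: treat each query term as a filter condition and label a
    # product title as Exact / Substitute / Irrelevant.
    def judge(query: str, title: str) -> str:
        query_terms = set(query.lower().split())
        title_terms = set(title.lower().split())
        matched = query_terms & title_terms
        if matched == query_terms:   # every condition satisfied (A ∧ B)
            return "Exact"
        if matched:                  # some condition satisfied (A ∨ B)
            return "Substitute"
        return "Irrelevant"          # no condition satisfied ¬(A ∨ B)

    print(judge("nike shoes", "nike shoes"))    # Exact
    print(judge("nike shoes", "adidas shoes"))  # Substitute
    print(judge("nike shoes", "hair dryer"))    # Irrelevant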

Slide 6

Slide 6 text

Lexical Search is dying? https://arxiv.org/abs/2004.04906 https://arxiv.org/abs/2010.10469

Slide 7

Slide 7 text

What’s wrong with “Traditional Search”?
Query | Challenges
“women’s running shoe” | woman vs. women, shoe vs. shoes
“headphones not wierless” | negation + spelling error
“christmas pjs for toddlers” | pjs = pajamas, toddlers = kids, children
“affordable laptops with good battery life” | subjective terms
Fragility to spelling variants, linguistic ambiguity, and lack of understanding of semantic relationships: these are well-known, typical issues, and solutions do exist.

Slide 8

Slide 8 text

“Cat tower for cats with low-mobility” → Accessible and comfortable for cats who may have difficulty climbing or jumping → ● Low Platforms ● Gradual Steps or Ramps ● Soft Surface Hidden requirements don’t appear in the query text

Slide 9

Slide 9 text

Modern Lexical Search is complex Tokenizer Lexical Search (Subcomponents + Heuristics) Scoring Function (Field Weights, Boosting Functions) … Search Infrastructure Spelling Corrector Ontology Classifier Each component requires expertise

Slide 10

Slide 10 text

Why does Lexical Search become complex? ● The system can’t be optimized directly ○ The index is not differentiable ● Add more subcomponents ○ How to measure the effectiveness of the subcomponents? ○ Do their offline metrics correlate well with online metrics? [Diagram: Objective, Synonym Dictionary, Lexical Search; the number of synonyms ∝ Revenue?]

Slide 11

Slide 11 text

Semantic Search is much simpler Query Encoder Expected Input (Query) Doc Encoder ⨀ Ideal Output (Doc) Addressing subproblems → Directly optimizing the system Simpler, cheaper, and more effective (???)
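A minimal sketch of this bi-encoder setup, assuming the sentence-transformers library and a public checkpoint (both illustrative choices, not the system described in this talk):

    # Sketch: encode query and doc separately, score by similarity (the ⨀ above).
    from sentence_transformers import SentenceTransformer, util

    # In practice the query encoder and doc encoder may share weights or differ.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    query_emb = model.encode("headphones noise-cancelling", normalize_embeddings=True)
    doc_emb = model.encode(
        "bluetooth noise-cancelling headphones 60H playtime High Res Audio",
        normalize_embeddings=True,
    )

    # Dot product of normalized vectors = cosine similarity.
    score = util.dot_score(query_emb, doc_emb)
    print(float(score))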

Slide 12

Slide 12 text

Questions to explore ● Q. Is Semantic Search just hype? ○ 👉 Lexical Search vs. Semantic Search ● Q. How can we incorporate semantic matching signals? ○ 👉 Integration of Semantic Search ● Q. Is Semantic Search cost-effective? ○ 👉 Development of Semantic Search

Slide 13

Slide 13 text

● Background ● Lexical Search vs. Semantic Search ● Integration of Semantic Search ● Development of Semantic Search Agenda

Slide 14

Slide 14 text

Is it a fair comparison? Task setting? Evaluation metrics? Dataset? ● Public benchmark results may not be entirely reliable ● Their experiments are not designed for your domain ● You need to verify the effectiveness yourself Implementation?

Slide 15

Slide 15 text

Publicly Available Datasets
Dataset | Task | Median Query Length | Median Doc Length | Number of test docs | Number of test queries
Amazon ESCI | Product Search | 4 | 137 | 482105 | 8956
FiQA | Question Answering | 10 | 90 | 57638 | 648
DBPedia | Entity Retrieval | 5 | 47 | 4635922 | 400
Quora | Duplicate Question Retrieval | 9 | 10 | 522931 | 10000
https://opensearch.org/blog/semantic-science-benchmarks/

Slide 16

Slide 16 text

ESCI Dataset https://github.com/amazon-science/esci-data ● Product Catalog ○ title, brand, color, description, … ● Relevance Judgements ○ Exact > Substitute > Complement > Irrelevant [Figure: example products labeled Exact / Substitute]
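A sketch of loading the ESCI data with pandas; the parquet file and column names follow the public repo layout but should be treated as assumptions:

    # Sketch: load the ESCI judgements and products, keep the graded labels.
    # File/column names are assumptions based on amazon-science/esci-data.
    import pandas as pd

    examples = pd.read_parquet("shopping_queries_dataset_examples.parquet")
    products = pd.read_parquet("shopping_queries_dataset_products.parquet")

    judged = examples.merge(products, on=["product_id", "product_locale"])

    # esci_label: E (Exact) > S (Substitute) > C (Complement) > I (Irrelevant)
    gain = {"E": 3, "S": 2, "C": 1, "I": 0}
    judged["gain"] = judged["esci_label"].map(gain)
    print(judged[["query", "product_title", "esci_label", "gain"]].head())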

Slide 17

Slide 17 text

“Fair” Evaluation? ● Choose appropriate metrics ● Rank-based Metrics ○ NDCG vs. MRR vs. MAP ● Rank-based Metrics vs. Set-based Metrics ○ NDCG vs. Precision & Recall Appropriate metrics = metrics that correlate well with online metrics

Slide 18

Slide 18 text

Rank-based = Top-heaviness ● Gain decays from top to bottom ○ The value of a relevant item at rank 1 >>> at rank 50 [Charts: how gain decays vs. how attention decays (click position distribution)]

Slide 19

Slide 19 text

MAP (Binary) vs. NDCG (Graded) ● MAP (Binary) ○ Relevant or Irrelevant ● NDCG (Graded) ○ Purchase > Like > Click > Not clicked

Slide 20

Slide 20 text

MRR (First) vs. NDCG (Cumulative)

Slide 21

Slide 21 text

MRR vs. NDCG ● MRR: 0.5 ● NDCG: 0.6309 ● MRR: 0.25 ● NDCG: 0.885 MRR doesn’t account for the presence of multiple relevant items
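A small sketch of both metrics for binary relevance. The rankings are hypothetical; the first reproduces the 0.5 / 0.6309 pair above, and the second shows that MRR ignores every relevant item after the first while NDCG accumulates their gain:

    import math

    def mrr(relevance):  # relevance: list of 0/1 in ranked order
        for rank, rel in enumerate(relevance, start=1):
            if rel:
                return 1.0 / rank
        return 0.0

    def dcg(relevance):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))

    def ndcg(relevance):
        ideal = dcg(sorted(relevance, reverse=True))
        return dcg(relevance) / ideal if ideal > 0 else 0.0

    # One relevant item at rank 2:
    print(mrr([0, 1, 0, 0]), ndcg([0, 1, 0, 0]))  # 0.5, ~0.6309
    # First relevant item at rank 4, but several more relevant items below it;
    # MRR stays at 0.25 while NDCG credits all of them.
    print(mrr([0, 0, 0, 1, 1, 1, 1]), ndcg([0, 0, 0, 1, 1, 1, 1]))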

Slide 22

Slide 22 text

Is measuring NDCG alone sufficient? ● Example A: NDCG 0.8585, Precision 0.1167 ● Example B: NDCG 0.8564, Precision 0.8833

Slide 23

Slide 23 text

What if we have nothing in our inventory? ● NDCG: 1.0 ● Precision: 0.0167 ● NDCG: 0.6309 ● Precision: 0.5
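A hedged reconstruction of this case using the same definitions as before; the result-list sizes are assumptions chosen so the numbers line up with the ones above:

    import math

    def dcg(relevance):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevance, start=1))

    def ndcg(relevance):
        ideal = dcg(sorted(relevance, reverse=True))
        return dcg(relevance) / ideal if ideal > 0 else 0.0

    # Assumed scenario A: the only relevant item in the inventory is ranked first
    # among 60 retrieved items -> the ranking is "ideal" even though almost
    # everything shown is irrelevant.
    rel_a = [1] + [0] * 59
    print(ndcg(rel_a), sum(rel_a) / len(rel_a))  # 1.0, Precision ~0.0167

    # Assumed scenario B: only two items retrieved, the relevant one at rank 2.
    rel_b = [0, 1]
    print(ndcg(rel_b), sum(rel_b) / len(rel_b))  # ~0.6309, Precision 0.5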

Slide 24

Slide 24 text

Interpretation
 | NDCG | Precision | Recall
1. Ideal | High | High | High
2. Great | High | High | Low
3. Acceptable | Low | High | High
4. Unacceptable | High | Low | High
- | Low | Low | Low
NDCG: High, Precision: Low → Unacceptable
We found that NDCG has a positive correlation with revenue while Precision has a negative correlation with the number of complaints from users

Slide 25

Slide 25 text

Evaluating Product Search ● Users examine many items to find a better product at the lowest price ● Users complain when they see irrelevant items ● MRR is not appropriate ● A small k (such as metric@10) is insufficient ● We can’t rely solely on NDCG; Precision matters a lot

Slide 26

Slide 26 text

“Fair” Implementation ● In the literature ○ Fine-tuned Semantic Search vs. a barely optimized BM25 baseline ● In reality ○ Newly built Semantic Search vs. an established search system

Slide 27

Slide 27 text

Lexical Search Results
Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
Lexical (Combined + Relaxation) | 0.5394 | 0.1952 | 0.4368 | 1564
Lexical (Combined) | 0.5107 | 0.2362 | 0.3798 | 325
Lexical (Field Weights) | 0.4896 | 0.2557 | 0.3409 | 259
Lexical (Synonyms) | 0.4723 | 0.2333 | 0.3632 | 325
Lexical (Naive) | 0.4557 | 0.2535 | 0.3285 | 259
● Combined = Synonym expansion with different weights, field weights tuned by Bayesian Optimization, phrase match boost, …
● Relaxation = Query Relaxation is applied for low-hit queries

Slide 28

Slide 28 text

Lexical Search - Query Relaxation [Figure: the query “black and brown tan steering wheel cover” before and after relaxation (A, B, A’)] ● NDCG 📈 ● Precision 📉
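A minimal sketch of one way to implement query relaxation, written as Python that builds Elasticsearch-style query bodies; the field name, minimum_should_match values, and the idea of a low-hit threshold are illustrative assumptions:

    import json

    def build_query(terms, minimum_should_match):
        # Elasticsearch-style bool query; the "title" field is an assumption.
        return {
            "query": {
                "bool": {
                    "should": [{"match": {"title": term}} for term in terms],
                    "minimum_should_match": minimum_should_match,
                }
            }
        }

    terms = "black and brown tan steering wheel cover".split()
    strict = build_query(terms, "100%")    # every term acts as a filter condition
    relaxed = build_query(terms, "3<75%")  # some terms are allowed to be dropped
    print(json.dumps(relaxed, indent=2))
    # In production, the relaxed query would be issued only when the strict one
    # returns fewer hits than some threshold; NDCG tends to rise, Precision to drop.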

Slide 29

Slide 29 text

Semantic Search to compare [Diagram: Query → Query Encoder (Text Encoder (LM) + Features) ⨀ Product → Product Encoder (Text Encoder (LM) + Features)] Design questions: Which pre-trained model to use? Which field to encode? How to sample pairs? Which loss to use? Thresholding? Heuristic post-filters? ● Incorporating Additional Features: https://arxiv.org/abs/2306.04833 ● Multi-Stage Training: https://www.amazon.science/publications/web-scale-semantic-product-search-with-large-language-models
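For the “which loss / how to sample pairs” questions, one common recipe is contrastive learning over (query, product title) pairs with in-batch negatives. The sketch below uses sentence-transformers and is illustrative, not the exact setup behind the numbers that follow:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Pre-trained LM to start from (illustrative choice).
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    # (query, relevant product title) pairs; other products in the same batch
    # serve as negatives.
    train_examples = [
        InputExample(texts=["nike shoes", "Nike Air Zoom Pegasus running shoes"]),
        InputExample(texts=["headphones noise-cancelling",
                            "bluetooth noise-cancelling headphones 60H playtime"]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    # Contrastive loss with in-batch negatives (InfoNCE-style).
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              epochs=1, warmup_steps=10)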

Slide 30

Slide 30 text

Semantic Search Results
Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
Semantic (CL) | 0.5328 | 0.0651 | 0.5456 | 100
Semantic (CL + Thresholding) | 0.5221 | 0.1284 | 0.4814 | 67
Semantic (MLM + CL) | 0.5046 | 0.0613 | 0.5121 | 100
Semantic (Naive) | 0.4984 | 0.0561 | 0.4771 | 100
● MLM: Fine-tuning the LM through Masked Language Modeling
● CL: Fine-tuning the bi-encoder through Contrastive Learning
● A reasonable choice of loss, features, and pre-trained model is used

Slide 31

Slide 31 text

Lexical Search vs. Semantic Search
Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
Lexical (Combined + Relaxation) | 0.5394 | 0.1952 | 0.4368 | 1564
Semantic (CL) | 0.5328 | 0.0651 | 0.5456 | 100
Lexical (Combined) | 0.5107 | 0.2362 | 0.3798 | 325
Semantic (Naive) | 0.4984 | 0.0561 | 0.4771 | 100
Lexical (Naive) | 0.4557 | 0.2535 | 0.3285 | 259
NDCG is almost the same. Precision of Semantic Search is low = Unacceptable

Slide 32

Slide 32 text

This result is not surprising because ● Queries are often given as a set of keywords (Query: “product attribute attribute …”) ○ Users copy & paste product names ● User Survey: ○ Q. When do you use the platform? ○ A. When searching for a specific item to buy (66.4%)

Slide 33

Slide 33 text

NDCG is almost the same, but what is retrieved? [Figure: Venn diagram of Lexical Search@k vs. Semantic Search@k results; each system retrieves relevant items the other misses, along with some irrelevant items, and only a small set is found by both, while other relevant items remain unseen]
Method | NDCG@100
Lexical (Combined + Relaxation) | 0.5394
Semantic (CL) | 0.5328

Slide 34

Slide 34 text

How similar are the two systems? Products Retrieved from Lexical Search@100 vs. Products Retrieved from Semantic Search@100 vs. Found in Both ● In terms of NDCG, they are “the same”, but they retrieve different products ● How do they differ?

Slide 35

Slide 35 text

Lexical < Semantic
Query | NDCG (Lexical) | NDCG (Semantic)
harley quinn costume for toddler girls | 0.0 | 0.89
accidental love by gary soto | 0.0 | 1.0
baby boy first birthday cookie montser decorations | 0.0 | 0.8936
I want a long jacket that comes to mid leg in a dark colour and very warm | 0.0 | 0.1673
Lexical > Semantic
Query | NDCG (Lexical) | NDCG (Semantic)
“the first 90 days” (book title) | 0.6166 | 0.1934
“summit 470” (model name) | 0.8981 | 0.0
dell 40wh standard charger type m5y1k 14.8v (attributes) | 1.0 | 0.2105

Slide 36

Slide 36 text

Results - NDCG@100 by query type
Query Type | Lexical Search | Semantic Search | Number of test queries
Short Query | 0.5403 | 0.4707 | 5878
Long Query | 0.3262 | 0.5580 | 452
Contains Non-Alphabet | 0.4267 | 0.4961 | 4096
Negation | 0.4322 | 0.5195 | 1624
Parse Pattern | 0.6184 | 0.6285 | 2250
Parse Pattern: Queries with some linguistic complexity, extracted using regular expressions
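A sketch of how query-type buckets like these can be carved out with simple heuristics before aggregating NDCG per bucket; the patterns and thresholds are illustrative assumptions, not the exact definitions used in this experiment:

    import re

    NEGATION = re.compile(r"\b(no|not|without)\b", re.IGNORECASE)
    NON_ALPHABET = re.compile(r"[^a-zA-Z\s]")

    def query_types(query: str) -> set:
        types = set()
        # The 4-term cutoff for "short" is an assumption.
        types.add("short" if len(query.split()) <= 4 else "long")
        if NEGATION.search(query):
            types.add("negation")
        if NON_ALPHABET.search(query):
            types.add("contains_non_alphabet")
        return types

    print(query_types("headphones not wierless"))                     # short + negation
    print(query_types("dell 40wh standard charger type m5y1k 14.8v")) # long + non-alphabet
    # NDCG is then aggregated per bucket, so the headline number depends heavily
    # on the proportion of each query type in the evaluation set.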

Slide 37

Slide 37 text

Results are greatly affected by the proportion of the query types. Manipulating the proportion to win (?) [Charts: real vs. manipulated query-type proportions]

Slide 38

Slide 38 text

Is Semantic Search useless in product search? ● The potential of dense retrieval models is not realized ○ Multi-Modality ○ Session information ○ Buyer preferences ○ ☝ There is no dataset with these features, so we can’t compare ● The two systems have different pros and cons ○ Use them together?

Slide 39

Slide 39 text

● Background ● Lexical Search vs. Semantic Search ● Integration of Semantic Search ● Development of Semantic Search Agenda

Slide 40

Slide 40 text

3 ways to utilize embeddings: ● Phase 1 Re-Ranking (score = lexical_score * semantic_score) ● Separated UI ● Rank Fusion [Diagram: Lexical Inverted Index + Semantic HNSW]

Slide 41

Slide 41 text

Phase 1 Re-Ranking
Indexed doc (per shard of the Lexical Inverted Index): { "title": "...", "item_vector": [...] }
{
  "query": {
    "function_score": {
      "query": { ... }
    }
  },
  "rescore": {
    "window_size": 1000,
    "query": {
      "score_mode": "multiply",
      "rescore_query": {
        "function_score": {
          "script_score": {
            "script": {
              "params": {
                "query_vector": [-0.33154297, 0.03274536, 2.1914062, ...]
              },
              "source": "cosineSimilarity(params.query_vector, 'item_vector')"
            }
          }
        }
      }
    }
  }
}
● Compute the semantic score on each shard
● Overcome the limitation “Lexical Search just counts word occurrences”
● Items that don’t lexically match can’t be retrieved (Recall remains unchanged)
● ANN index is not required

Slide 42

Slide 42 text

● Make two requests and display results in different UI sections like YouTube ● Add diversity to SERPs but the mainstream search results don’t change ● Low risk, limited gain ● Having many options increases cognitive load Separated UI Components

Slide 43

Slide 43 text

Rank Fusion is the way to go? [Figure: the fused list cherry-picks Relevant results from Source 1 and Source 2, pushing Less Relevant ones down] ● Cherry-picking the best results from multiple sources (Skimming Effect) ● It requires low cognitive load ● The recall issue is addressed

Slide 44

Slide 44 text

Rank Fusion: how are results from different systems fused? [Diagram: Lexical Inverted Index + Semantic HNSW → Search Engine (Coordinating Node)] Rank Fusion is executed on the coordinating node before sending results back to the client

Slide 45

Slide 45 text

● Product A: score 100 from lexical search ● Product B: score 0.9 from semantic search Which product should be ranked higher? [Diagram: Lexical Inverted Index vs. Semantic HNSW]

Slide 46

Slide 46 text

Rank Fusion Formulation
score = f(α * norm(score_lex), (1 - α) * norm(score_sem))
1. Normalize scores (Reciprocal Rank, Min-Max, Borda count, …): raw scores are unbounded (0.0 <= score); after normalization, 0.0 <= score <= 1
2. Weight results by α (convex combination): score *= α and score *= (1 - α)
3. Combine results (SUM or MAX)

Slide 47

Slide 47 text

We want the scores to fall within a specific range ● RR (Reciprocal Rank): ○ 1 / (k + rank), where ■ k is a constant (often 60) that determines the degree of top-heaviness ● TMM (Theoretical Min-Max): ○ (score - score_min) / (score_max - score_min), where ■ score_max = the score of the top-ranked result ■ score_min = 0 ● Borda Count: ○ (N + 1 - rank), where ■ N = the number of results in the list 1. Normalize (Transform) scores TMM is said to be better in theory, but in practice, there is no significant difference (IMO). (See the sketch below.)
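A minimal sketch of the whole pipeline (normalize, weight by α, combine), showing Min-Max plus a weighted sum on one side and rank-only RRF on the other; the document IDs and scores are made up:

    def min_max(scores):
        # Theoretical Min-Max: score_min = 0, score_max = top score in the list.
        top = max(scores.values())
        return {doc: (s / top if top > 0 else 0.0) for doc, s in scores.items()}

    def convex_fusion(lexical_scores, semantic_scores, alpha=0.56):
        # SUM as the merge function; replacing the sum with max gives the MAX variant.
        lex, sem = min_max(lexical_scores), min_max(semantic_scores)
        docs = set(lex) | set(sem)
        fused = {d: alpha * lex.get(d, 0.0) + (1 - alpha) * sem.get(d, 0.0) for d in docs}
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

    def rrf(lexical_ranking, semantic_ranking, k=60):
        # Reciprocal Rank Fusion: the score depends only on the rank.
        fused = {}
        for ranking in (lexical_ranking, semantic_ranking):
            for rank, doc in enumerate(ranking, start=1):
                fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

    lexical = {"A": 100.0, "B": 40.0}  # unbounded BM25-style scores
    semantic = {"B": 0.9, "C": 0.8}    # cosine similarities
    print(convex_fusion(lexical, semantic))
    print(rrf(["A", "B"], ["B", "C"]))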

Slide 48

Slide 48 text

2. Weight scores [Chart: how NDCG changes with different α (α = 0.48, 0.5, 0.52)] When RRF is used and there is no overlap between the systems: RR depends only on the rank, so scores at the same rank will be the same

Slide 49

Slide 49 text

● The choice of the merge function ○ Sum: If everyone says “Yes”, it should be more relevant (Chorus Effect) ○ Max: If an expert says “Yes”, it should be more relevant (Dark Horse Effect) ● References ○ An Analysis of Fusion Functions for Hybrid Retrieval (2023) ○ Who's #1?: The Science of Rating and Ranking 3. Combine results

Slide 50

Slide 50 text

Rank Fusion
Method | NDCG@100 | Precision@100 | Recall@100 | Total Hits
Rank Fusion 1 (Min-Max) | 0.5945 | 0.0669 | 0.5667 | 392
Rank Fusion 2 (RRF) | 0.5944 | 0.0719 | 0.6079 | 392
Rank Fusion 3 (Borda) | 0.5935 | 0.0718 | 0.6074 | 392
Fuse top 5 semantic results | 0.5840 | 0.1963 | 0.4400 | 380
Lexical Search | 0.5394 | 0.1952 | 0.4368 | 1564
Semantic Search | 0.5328 | 0.0651 | 0.5456 | 100
● α is fine-tuned in advance
● Fusing only a small k from Semantic Search appears to be more practical ○ High NDCG, decent Precision
Rank Fusion 1: score = 0.56 * MM(score_lex) + (1 - 0.56) * MM(score_sem)

Slide 51

Slide 51 text

Limitation of Linear Rank Fusion: items for “I want a long jacket that comes to …” may not be found in Lexical Search at all. [Figure: SUM selects items from the top right of the (lexical, semantic) score plane: lexical score 0.51 + semantic score 0.51 > semantic score 1.0 alone]

Slide 52

Slide 52 text

Two Requests → Global Reranking [Diagram: Lexical Inverted Index + Semantic HNSW → Candidate Set → Global Reranker]
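A sketch of the two-request pattern: union the lexical and semantic candidate sets and let a single reranker order them. The cross-encoder checkpoint is an illustrative public model, not necessarily the reranker used in this system:

    from sentence_transformers import CrossEncoder

    # Candidates from the two retrievers (normally two concurrent requests).
    lexical_hits = ["long wool coat dark grey", "steering wheel cover black"]
    semantic_hits = ["mid-length down jacket navy, very warm", "long wool coat dark grey"]
    candidates = list(dict.fromkeys(lexical_hits + semantic_hits))  # union, keep order

    query = "I want a long jacket that comes to mid leg in a dark colour and very warm"

    # The global reranker scores every (query, candidate) pair jointly, so items
    # found by only one retriever compete on equal footing.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc) for doc in candidates])

    for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
        print(f"{score:.3f}  {doc}")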

Slide 53

Slide 53 text

● Background ● Lexical Search vs. Semantic Search ● Integration of Semantic Search ● Development of Semantic Search Agenda

Slide 54

Slide 54 text

System Overview (simplified) [Diagram components: Clients, Retriever, Reranker, Search Engine, Indexer, Embedder, PubSub, GPUs, LLMs, BigQuery, Datasets]

Slide 55

Slide 55 text

Implicit/Explicit Relevance Judgements (label sources: Search Logs, Random Items)
 | Relevant | Irrelevant
Retrieved | TP | FP
Not Retrieved | FN | TN
Evaluation using search logs makes the comparison unfair for Semantic Search because it emphasizes lexical matching signals. Rerankers can be trained using search logs, but…

Slide 56

Slide 56 text

Obtaining Labels (False Negatives) ● Hire annotators? ○ Explicitly ask whether an item is relevant or not ■ Relevance judgement is subjective ○ Google has 170 pages of guidelines for annotators ● LLM as a Judge ○ “LLM labellers can do better on this task than human labellers for a fraction of the cost” (link) ● Debiasing the dataset is still important ○ Recall the importance of the proportion
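A hedged sketch of LLM-as-a-judge labelling with the OpenAI client; the model name, prompt, and label set are illustrative assumptions rather than the setup used here:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = """You are a relevance annotator for a product search engine.
    Given a query and a product title, answer with exactly one label:
    Exact, Substitute, Complement, or Irrelevant.

    Query: {query}
    Product: {title}
    Label:"""

    def llm_judge(query: str, title: str, model: str = "gpt-4o-mini") -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": PROMPT.format(query=query, title=title)}],
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    print(llm_judge("nike shoes", "adidas running shoes"))  # expected: Substitute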

Slide 57

Slide 57 text

Obtaining Labels (All) [Figure: for the query “ナイキ” (Nike), items from the Semantic Search Results and from an Unseen Item Set are labeled by engagement (Engaged / Not engaged), mapping them to TP, FP, FN, and TN; e.g. nike shoes vs. adidas]

Slide 58

Slide 58 text

LLM’s output has a high correlation with human labels (example query: “amazon fire stick tv”)

Slide 59

Slide 59 text

Encoding items at index time ● With the model, the indexing job becomes 7x slower, which is unacceptable ↓ ● Developed a new service (Triton) ○ Dozens of nodes with GPUs

Slide 60

Slide 60 text

Retrieving similar items (on a benchmark dataset containing 15,000,000 docs)
Engine | Average Response Time (ms) | Throughput (query/sec)
Elasticsearch 8.12 (1 shard, int8_hnsw) | 235 | 42.01
Vespa 8.287.20 | 10 | 812.73
Qdrant 1.3 (Scalar Quantization) | 18 | 507.94

Slide 61

Slide 61 text

Offline Evaluation is costly
● Evaluating retrieval = E2E testing is costly
● In particular, larger models incur higher costs
○ It slows down the iteration speed and prevents us from getting rapid feedback
○ 4 hours (384D) → 10 hours (768D)
Method | Dimension | Num Params | Ranking NDCG@100
sentence-transformers/multi-qa-mpnet-base-dot-v1 | 768 | 109M | 0.9197
sentence-transformers/all-MiniLM-L6-v2 | 384 | 22.7M | 0.9164
sentence-transformers/msmarco-distilbert-base-v4 | 768 | 66.4M | 0.9123

Slide 62

Slide 62 text

Potential Gains vs. Costs (ROI) ● The maintenance and execution costs (GPUs, LLMs, …) are significant ● Semantic Search may not be effective for all queries [Diagram: Lexical Inverted Index + Semantic ANN Index] Can we simplify Lexical Search? Is it possible to reduce the costs?

Slide 63

Slide 63 text

● Opportunity size ○ Phase 1 Reranking ○ Separated UI ○ Rank Fusion Potential Gains vs. Costs (ROI) ● Costs ○ Latency ○ Throughput ○ Cost / 1M queries ○ GPU cost / month ○ Cost for dataset creation ○ Engineering cost

Slide 64

Slide 64 text

● Opportunity size ○ Phase 1 Reranking ○ Separated UI ○ Rank Fusion Potential Gains vs. Costs (ROI) ● Costs ○ Latency ○ Throughput ○ Cost / 1M queries ○ GPU cost / month ○ Cost for dataset creation ○ Engineering cost The optimal model might vary depending on the use case Which search engine should we adopt? What is the best way to show semantically matching items? The cost of obtaining labels remains high Latency is increasing Do we need another search team dedicated to semantic search? GPU shortage Utilizing visual features is great but significantly slows down iteration What’s the point if users don’t want semantic search? How can we measure the opportunity size? Bias in dataset

Slide 65

Slide 65 text

User interviews revealed that when users can’t find appropriate items on our platform, they switch to Google. It highlights the importance of investing in new technology to stay competitive. “Keyword search is something that old people do” Query: “product attribute attribute…” → “What I want is …”

Slide 66

Slide 66 text

Summary

Slide 67

Slide 67 text

Summary ● Semantic Search is not a silver bullet ● Practical Approach ○ Understand the user behavior ○ Choose the right metrics ○ Optimize the system while considering costs ● Basics are still important in the era of AI ○ Just as Lexical Search can’t be replaced by Semantic Search, search engineers can’t be replaced by AI engineers (for now) rejasupotaro rejasupotaro Any thoughts? 👉

Slide 68

Slide 68 text

Acknowledgement ● Thanks to Sho Yokoi at Tohoku University for his invaluable advice on the experimental setup and modeling