
An Introduction to Semantic Search


This presentation was given @VJAI on Mar 3rd, 2025.


Dat Nguyen

March 27, 2025


Transcript

  1. AGENDA

     - Information Retrieval systems
     - Semantic Search systems
     - Framework of a Semantic Search system
     - DPR as a Dense Search Model
     - SPLADE as a Sparse Search Model
     - ColBERT as a Late-Interaction Search Model
     - Vector Search
  2. Overview of Information Retrieval systems

     - Information Retrieval (IR) system: a system that returns a set of documents relevant to a given query
     - IR systems usually deal with unstructured documents
     - Although modern IR systems can search images, sound, and other media, traditional IR systems usually search textual documents
     - This talk focuses on Text Information Retrieval systems only
  3. Traditional approach to Text Information Retrieval

     - Return text documents relevant to a given text query
     - Estimate the relevance score of each document w.r.t. the given query, e.g., a similarity score between the query and the document
     - Return the relevant documents using the score: documents with higher similarity scores are more relevant to the query, so sort documents by similarity score and return the TOP-K documents
     - We will examine several traditional methods to estimate the relevance score
  4. TF-IDF

     - Given a corpus $D$ of $N$ documents, pre-process the corpus $D$ and build a vocabulary $V$
     - For any term $t \in V$ and document $d \in D$, compute the Term Frequency $\mathrm{tf}(t, d)$ (the number of occurrences of $t$ in $d$) and the Inverse Document Frequency $\mathrm{idf}(t) = \log \frac{N}{n_t}$, where $n_t$ is the number of documents that contain the term $t$
     - The tf-idf score for the term $t$ with respect to the document $d$ is $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$
     - The simplest way to compute the query score is to sum up the tf-idf scores of the query terms: $\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tfidf}(t, d)$ (see the sketch below)
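A minimal Python sketch of this scoring scheme, assuming raw term counts for tf and the log-ratio idf above (real systems typically add smoothing and normalization):

```python
import math
from collections import Counter

def tfidf_score(query_terms, doc_terms, doc_freq, num_docs):
    # doc_freq[t] = n_t, the number of documents containing term t
    tf = Counter(doc_terms)  # raw term counts in the document
    score = 0.0
    for t in query_terms:
        if tf[t] > 0 and doc_freq.get(t, 0) > 0:
            idf = math.log(num_docs / doc_freq[t])
            score += tf[t] * idf  # tf-idf(t, d) = tf(t, d) * idf(t)
    return score
```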
  5. BM25

     $\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, d) \cdot (k_1 + 1)}{\mathrm{tf}(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$

     where
     - $k_1$: parameter, usually in [1.2, 2.0]
     - $b$: parameter, usually 0.75
     - avgdl: average length of documents in the collection $D$
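A matching sketch for BM25, reusing the doc_freq structure from the TF-IDF example above; the idf variant shown is the Lucene-style smoothed form, which differs slightly across implementations:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avgdl, k1=1.5, b=0.75):
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        n_t = doc_freq.get(t, 0)
        if tf[t] == 0 or n_t == 0:
            continue
        idf = math.log(1 + (num_docs - n_t + 0.5) / (n_t + 0.5))
        # term-frequency saturation and document-length normalization
        norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score
```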
  6. Problems of traditional IR systems

     Although traditional IR systems are simple, they have the following key problems:
     - Short queries: short queries contain few terms, which leads the search system to return few or no results
     - Vocabulary mismatch: queries and documents may use different vocabularies to express the same concept, and queries may contain new terms
     - Lack of semantic understanding: they use only the counts of query terms in the document, ignoring their positions and interactions with other terms
  7. Semantic search

     - Search documents by understanding the document content and the user's intent behind the search query
     - Neural search systems, especially Transformer-based models, are dominating semantic search
     - Transformer-based models analyze the query/document and represent it as a vector in a high-dimensional space
     - Search for document vectors close to the query vector, then return the corresponding documents as the search result (see the sketch below)
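A minimal end-to-end sketch of this encode-then-search pipeline, assuming the sentence-transformers library; the model name is just an example choice:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, an assumption

docs = ["ColBERT uses late interaction over BERT.",
        "BM25 is a classic lexical ranking function."]
doc_vecs = model.encode(docs, normalize_embeddings=True)      # (N, dim), offline
query_vec = model.encode(["how does late interaction work?"],
                         normalize_embeddings=True)           # (1, dim)

scores = doc_vecs @ query_vec.T       # cosine similarity (vectors are normalized)
top_k = np.argsort(-scores[:, 0])     # document indices, most relevant first
```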
  8. Performance of Neural Network LM retrieval

     Passage retrieval task (2020). Credit: Craswell et al. TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime. SIGIR 2021. (link)
  9. Applications of Semantic search

     - Question Answering: may use a Semantic Search model to identify documents relevant to the given question, then apply an answering model to extract the answer from the relevant documents
     - RAG (Retrieval-Augmented Generation): RAG became more popular after ChatGPT was released. RAG is a special case of Question Answering in which the answering model is an LLM: embed the question along with the relevant documents into the prompt, and the LLM will automatically generate the answer to the question
  10. Dense vs Sparse

     DENSE
     - Low-dimensional vector; dimension ranges from several hundred to several thousand
     - Almost all of the elements are non-zero
     - The inner product between two d-dimensional dense vectors requires d multiplications and d - 1 additions

     SPARSE
     - High-dimensional vector; dimension ranges from tens of thousands to hundreds of thousands
     - Almost all of the elements are zero
     - The number of operations required to compute the inner product between two d-dimensional sparse vectors depends on the sparsity of the vectors (see the sketch below)
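A small sketch of why sparsity matters for the inner product, storing only non-zero entries in a dict (a common, if simplistic, sparse representation):

```python
def sparse_dot(a, b):
    # a, b: dicts mapping dimension index -> non-zero weight
    # cost scales with the number of shared non-zero entries, not the dimension d
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b[i] for i, w in a.items() if i in b)

u = {0: 0.3, 5: 1.2, 7: 0.4}
v = {5: 0.9, 100_000: 0.1}
print(sparse_dot(u, v))  # only index 5 overlaps -> 1.08
```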
  11. Types of search models

                     Sparse                  Dense
     Unsupervised    BM25, TF-IDF            LDA
     Supervised      SPLADE, DeepCT, TILDE   DPR, ANCE, ColBERT
  12. Bi-Encoder vs Cross-Encoder

     - Bi-Encoder: document encoding can be performed offline; at inference time, the heavy-weight encoder is applied only to the (typically short) query. Good scalability.
     - Cross-Encoder: more accurate but less scalable than the Bi-Encoder; should be used in the reranking phase.
     - Late interaction: a quality-cost tradeoff between the two; represents each query (or document) as a bag of vectors. (A sketch contrasting the first two follows.)

     Credit: Khattab and Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. SIGIR'20. (link)
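A sketch of the contrast, again assuming sentence-transformers; both checkpoint names are example choices, not the only options:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder

query = "who wrote hamlet"
docs = ["Hamlet is a tragedy written by William Shakespeare.",
        "The Eiffel Tower is located in Paris."]

# Bi-Encoder: query and documents are encoded independently,
# so document vectors can be pre-computed offline
bi = SentenceTransformer("all-MiniLM-L6-v2")
q_vec = bi.encode([query], normalize_embeddings=True)
d_vecs = bi.encode(docs, normalize_embeddings=True)
retrieval_scores = d_vecs @ q_vec.T   # fast, scalable first stage

# Cross-Encoder: scores each (query, doc) pair jointly with full attention;
# more accurate, but one forward pass per pair -> use for reranking
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = ce.predict([(query, d) for d in docs])
```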
  13. DPR - Dense Passage Retrieval

     - One of the earliest methods using a BERT model to build dense representations for the retrieval phase of open-domain Question Answering (English)
     - Segments long documents into short passages and searches for passages relevant to a given question

     Karpukhin et al. Dense passage retrieval for open-domain question answering. EMNLP 2020. (link)
  14. DPR training - Overview

     - Find a vector space such that relevant pairs of questions and passages have smaller distance
     - Starting from a pre-trained BERT model, DPR performs fine-tuning on Question Answering datasets
     - Training dataset: a set of samples $\{(q_i, p_i^+)\}$, where $q_i$ is a question and $p_i^+$ is a document relevant to $q_i$, i.e., the positive document of $q_i$. In the Question Answering task, the document $p_i^+$ contains the answer to the question $q_i$
     - Passages other than $p_i^+$ can be considered irrelevant to the question $q_i$, i.e., its negative documents

     Some examples from the Natural Questions (NQ) dataset:
     - Question: where did they film diary of a wimpy kid
       Relevant passage: Diary of a Wimpy Kid (film) - Filming of Diary of a Wimpy Kid was in Vancouver and wrapped up on October 16, 2009.
     - Question: who does the voice of arby's commercial
       Relevant passage: Ving Rhames - Rhames's deep voice is the center of many recent (2015-present) Arby's commercials, with the catchline 'Arby's, we have the meats!'
     - Question: when did the us military start using hummers
       Relevant passage: Humvee - In 1979, the U.S. Army drafted final specifications for a High Mobility Multipurpose Wheeled Vehicle (HMMWV), which was to replace all the tactical vehicles in the 1/4 to 1 1/4-ton range...
  15. DPR training - Loss function

     - In-batch negative sampling: take a mini-batch of $B$ training samples $\{(q_i, p_i^+)\}_{i=1}^{B}$
     - For a query $q_i$, consider the positive documents of the other queries ($j \ne i$) as its negative documents $p_{i,1}^-, \ldots, p_{i,B-1}^-$
     - So, the sample becomes a (B+1)-tuple $(q_i, p_i^+, p_{i,1}^-, \ldots, p_{i,B-1}^-)$
     - The loss function of the sample in the batch is defined as:

     $L(q_i, p_i^+, p_{i,1}^-, \ldots, p_{i,B-1}^-) = -\log \frac{e^{\mathrm{sim}(q_i, p_i^+)}}{e^{\mathrm{sim}(q_i, p_i^+)} + \sum_{j=1}^{B-1} e^{\mathrm{sim}(q_i, p_{i,j}^-)}}$
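This loss is equivalent to cross-entropy over the batch similarity matrix, since each query's positive passage sits on the diagonal. A minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb, p_emb):
    # q_emb: (B, dim) question embeddings; p_emb: (B, dim) positive passage embeddings
    scores = q_emb @ p_emb.T                                # (B, B): sim(q_i, p_j)
    labels = torch.arange(q_emb.size(0), device=q_emb.device)  # positive of q_i is p_i
    return F.cross_entropy(scores, labels)                  # off-diagonals act as negatives
```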
  16. DPR result

     - DPR retrieval accuracy on various datasets; DPR retrieval accuracy w.r.t. the number of training samples
     - Except on SQuAD, DPR outperforms BM25 on all datasets
     - DPR gives higher accuracy than BM25 even when fine-tuned with a small number of samples, due to the effectiveness of the pre-trained language model
  17. SPLADE

     - SPLADE: Sparse Lexical AnD Expansion
     - Term expansion using BERT while retaining the strengths of sparse lexical matching
     - SPLADE returns a term expansion of the query, which helps it semantically match relevant documents using just sparse lexical representations

     Credit: Example is borrowed from https://www.pinecone.io/learn/splade
  18. SPLADE - Compute sparse representation from input text

     Transform the input text with BERT and project onto the BERT vocabulary:
     - Tokenize the input text into tokens $t_1, \ldots, t_L$
     - Pass the tokens to BERT and obtain hidden vectors $h_1, \ldots, h_L$
     - Compute the importance $w_{ij}$ of vocabulary token $j$ for token $i$ of the input (via the MLM head)
     - The score of each token $j$ in the vocabulary is $w_j = \sum_{i} \log\left(1 + \mathrm{ReLU}(w_{ij})\right)$; SPLADE v2 replaces the sum with a max over input positions (see the sketch below)

     Formal et al. SPLADE: Sparse lexical and expansion model for first stage ranking. SIGIR'21 short paper. (link)
     Formal et al. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. 2021. (link)
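A minimal PyTorch sketch of this pooling step, assuming the (L, |V|) MLM-head logits for one input text have already been computed:

```python
import torch

def splade_vector(mlm_logits, attention_mask, use_max=True):
    # mlm_logits: (L, |V|) MLM-head scores w_ij
    # attention_mask: (L,) with 1 for real tokens, 0 for padding
    weights = torch.log1p(torch.relu(mlm_logits))      # log(1 + ReLU(w_ij))
    weights = weights * attention_mask.unsqueeze(-1)   # zero out padding positions
    if use_max:                                        # SPLADE v2: max over positions
        return weights.max(dim=0).values               # (|V|,) sparse-ish vector
    return weights.sum(dim=0)                          # SPLADE v1: sum over positions
```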
  19. SPLADE - Training

     - Ranking loss: InfoNCE [1]; in SPLADE, sim is the dot product
     - Sparse regularization: the FLOPS regularizer [2] $\ell_{\mathrm{FLOPS}} = \sum_{j \in V} \bar{a}_j^2$ with $\bar{a}_j = \frac{1}{B} \sum_{i=1}^{B} w_j^{(i)}$, where $B$ is the batch size
     - Total loss: $L = L_{\mathrm{rank}} + \lambda_q \ell_{\mathrm{reg}}^{q} + \lambda_d \ell_{\mathrm{reg}}^{d}$, where $\ell_{\mathrm{reg}}^{q}$ is the regularizer on queries and $\ell_{\mathrm{reg}}^{d}$ is the regularizer on documents (a sketch of the regularizer follows)

     [1] Formal et al. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. 2021. (link)
     [2] Paria et al. Minimizing FLOPs to Learn Efficient Sparse Representations. ICLR 2020. (link)
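A sketch of the FLOPS regularizer over a batch of SPLADE vectors; penalizing the squared mean activation per vocabulary term pushes rarely useful terms toward zero:

```python
import torch

def flops_regularizer(batch_vectors):
    # batch_vectors: (B, |V|) SPLADE term weights for a batch of queries or documents
    mean_activation = batch_vectors.mean(dim=0)  # average weight a_j of each vocab term
    return (mean_activation ** 2).sum()          # sum_j a_j^2
```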
  20. SPLADE - Distillation

     - Train a Cross-Encoder (CE) with the training samples
     - Generate triplets $(q, d^+, d^-)$, along with the similarity scores for $(q, d^+)$ and $(q, d^-)$ predicted by the CE
     - Fine-tune SPLADE with the triplets using the Margin-MSE loss
       $L = \mathrm{MSE}\left(M_s(q, d^+) - M_s(q, d^-),\; M_t(q, d^+) - M_t(q, d^-)\right)$
       where $M_t$ is the similarity score returned by the teacher model and $M_s$ is the similarity score returned by the student model

     Formal et al. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. 2021. (link)
     Hofstätter et al. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. 2020. (link)
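A sketch of the Margin-MSE objective: the student learns to match the teacher's score margin between the positive and negative document, not the absolute scores:

```python
import torch.nn.functional as F

def margin_mse_loss(s_pos, s_neg, t_pos, t_neg):
    # s_*: student (e.g., SPLADE) scores; t_*: teacher (cross-encoder) scores
    return F.mse_loss(s_pos - s_neg, t_pos - t_neg)
```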
  21. SPLADE - Result

     - SPLADE outperforms BM25 and other sparse retrieval models
     - Competitive with SoTA dense models

     Formal et al. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. 2021. (link)
  22. ColBERT

     - Contextualized late interaction paradigm over BERT: a quality-cost tradeoff between Bi-Encoders and Cross-Encoders
     - Similarity score between query q and document d with the MaxSim operator: $\mathrm{score}(q, d) = \sum_{i \in q} \max_{j \in d} E_{q_i} \cdot E_{d_j}^{\top}$
     - Advantages: enables offline document indexing like a Bi-Encoder; analyzes the interaction between every query token and document token like a Cross-Encoder (see the sketch below)
     - Disadvantage: slower and consumes more storage than a Bi-Encoder

     Credit: Khattab and Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. SIGIR'20. (link)
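A minimal sketch of the MaxSim scoring step over pre-computed token embeddings:

```python
import torch

def maxsim_score(q_emb, d_emb):
    # q_emb: (Lq, dim), d_emb: (Ld, dim) contextualized token embeddings
    sim = q_emb @ d_emb.T               # (Lq, Ld) all token-token similarities
    # for each query token, keep its best-matching document token, then sum
    return sim.max(dim=1).values.sum()
```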
  23. ColBERT - Result (1)

     Santhanam et al. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL-HLT'22. (link)
  24. ColBERT - Result (2)

     - ColBERT gives higher accuracy than BM25 and other dense models like ANCE and TAS-B
     - ColBERT is competitive with SPLADEv2 on the BEIR benchmark (a), but outperforms SPLADEv2 on the Wikipedia OpenQA and LoTTE benchmarks (b)

     Santhanam et al. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL-HLT'22. (link)
  25. Dense vector search

     - With the emergence of transformer-based models, dense vector search has become more crucial; it is also necessary when searching multimodal data
     - ANN (Approximate Nearest Neighbor): search for data points that are very close to the given query point, but not necessarily the closest
     - Better scalability than exact methods (see the exact baseline sketched below)
     - Common approaches for ANN: tree, hash, graph
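For contrast, an exact brute-force baseline: always correct, but O(N · dim) per query, which is the cost ANN methods trade away for scalability:

```python
import numpy as np

def exact_search(query_vec, doc_vecs, k=5):
    # doc_vecs: (N, dim); scans every document vector -> exact but slow at scale
    scores = doc_vecs @ query_vec      # inner-product similarity
    return np.argsort(-scores)[:k]     # indices of the top-k documents
```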
  26. HNSW - Hierarchical Navigable Small World

     - One of the most popular graph-based ANN algorithms
     - Each data point is a node in a graph

     Credit: https://speakerdeck.com/matsui_528/jin-si-zui-jin-bang-tan-suo-falsezui-qian-xian?slide=59
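A usage sketch with the hnswlib library; the dimension, element count, and parameter values (M, ef_construction, ef) are placeholder choices to tune per workload:

```python
import hnswlib
import numpy as np

dim, num_docs = 384, 10_000
doc_vecs = np.random.rand(num_docs, dim).astype(np.float32)  # stand-in embeddings

index = hnswlib.Index(space="ip", dim=dim)   # "ip" = inner-product similarity
index.init_index(max_elements=num_docs, ef_construction=200, M=16)
index.add_items(doc_vecs, np.arange(num_docs))
index.set_ef(50)                             # query-time speed/recall knob

labels, distances = index.knn_query(doc_vecs[:1], k=5)  # approximate top-5 neighbors
```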
  27. Key takeaways

     - Semantic Search overcomes the limits of traditional exact-match search
     - Modern Semantic Search models heavily depend on transformer-based neural networks
     - Semantic search consists of 2 steps: encode the query and documents into vectors, then search for document vectors close to the given query vector
     - There are 2 types of Semantic Search models: dense vector-based models (DPR, ColBERT...) and sparse vector-based models (SPLADE)