Retrieval Augmented Generation
Azure Application Innovation Tech Lead
A type of language generation model that combines pre-trained parametric memory (the model's weights) with non-parametric memory (an external datastore).
[Diagram: RAG flow. Query time: user asks "Give me the gist!" → query my data → add results to prompt. Ingestion time: send chunk → save embedding.]
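The retrieval flow above (ingest chunks and save embeddings; embed the query, retrieve, add results to the prompt) can be sketched end to end. This is a toy: the bag-of-words "embedding" and the sample chunks are stand-ins for a real embedding model and vector store.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words Counter (a real system would call an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: "send chunk, save embedding".
chunks = [
    "Azure AI Search supports hybrid retrieval.",
    "Paris is the capital of France.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, retrieve the best chunk, add it to the prompt.
question = "What is the capital of France?"
q_vec = embed(question)
best_chunk = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]
prompt = f"Context: {best_chunk}\n\nQuestion: {question}"
```

The LLM then answers from the retrieved context instead of relying on its weights alone.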
Level 1 - Retrieval
- BM25 (Best Match 25) / Traditional Full-Text Search
- TF-IDF (Term Frequency — Inverse Document Frequency)
- Neural network embeddings: rank documents based on their
similarity in the vector space / HNSW algorithm
- Hybrid Search / RRF (Reciprocal Rank Fusion)
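Reciprocal Rank Fusion can be sketched in a few lines: each document's fused score is the sum of 1/(k + rank) over every result list it appears in. The document IDs and ranked lists below are made up for illustration; k = 60 is the constant commonly used in practice.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc2", "doc1", "doc5"]  # e.g., BM25 order
vector_results = ["doc1", "doc3", "doc2"]   # e.g., HNSW order
fused = rrf([keyword_results, vector_results])
```

Documents that rank well in both lists (here doc1 and doc2) float to the top, which is why RRF works well for hybrid keyword + vector queries.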
Level 2 - Ranking
- Semantic Ranking
Search Configuration         Customer
Keyword                        40.6    40.6    49.6
Vector (Ada-002)               43.8    45.0    58.3
Hybrid (Keyword + Vector)      48.4    48.4    58.8
Hybrid + Semantic ranker       60.1    50.0    72.0
[Diagram: exact keyword match (BM25), pure vector search (ANN), and hybrid search (BM25 + ANN)]
- Splitting documents to accommodate the LLM context window limit.
- Helps rank sections of documents.
- Each vector can embed a limited amount of data per model.
- Embedding a long passage with multiple topics into a single vector
can cause important nuance to get lost.
- Overlapping text between chunks might be helpful.
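A minimal sketch of chunking with overlap. It is character-based for simplicity; a real pipeline would count tokens, and the chunk and overlap sizes below are illustrative, not recommendations.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so context spanning a boundary survives in both."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "0123456789" * 50  # 500 characters of sample data
chunks = chunk_text(text, chunk_size=200, overlap=50)
```

Each chunk starts 150 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.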
Tokens and Tokenization
~50K vocab size
[464, 5044, 1422, 470, 3272, 262, 4675, 780, 340, 373, 1165, 10032, 13]
(76 chars, 17 tokens)
(55 chars, 24 tokens)
[0.653249, -0.211342, 0.000436 … -0.532995, 0.900358, 0.345422]
…a continuous space representation we can use as model input
Embeddings for similar concepts will be close to each other in N-dimensional space
(e.g., vectors for “dog” and “hound” will have a cosine similarity closer to 1 than “dog” and “chair”)
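The cosine-similarity comparison can be illustrated with toy vectors. The three-dimensional "embeddings" below are invented for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values): similar concepts point in similar directions.
dog = [0.9, 0.8, 0.1]
hound = [0.85, 0.75, 0.2]
chair = [0.1, 0.2, 0.9]
```

With these values, cosine_similarity(dog, hound) is close to 1 while cosine_similarity(dog, chair) is much lower, mirroring the "dog"/"hound" vs "dog"/"chair" example above.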
Less common words will tend to split into multiple tokens.
There’s a bias towards English in the BPE corpus.
Self-Attention (Transformer Model)
• Each self-attention “head” learns relationships between a token
and all other tokens in the context
• Multiple heads in a layer focus on learning different
relationships, including grammar and semantics
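A single head boils down to scaled dot-product attention. Below is a pure-Python sketch that uses the raw token vectors as queries, keys, and values; a real Transformer head would first project tokens through learned query/key/value matrices.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """One attention head: each token's output is a similarity-weighted
    mix of ALL token vectors in the context (no learned projections)."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # attention weights sum to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens))
                        for j in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d token vectors
mixed = self_attention(tokens)
```

Each output row is a convex combination of the input vectors, which is how a head blends information from the whole context into every position.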
Embedding and Vector Space
RAG in Action
Fine Tuning vs Embedding
GPT can learn knowledge in two ways:
• Via model weights (i.e., fine-tune the model on a training set):
good for teaching specialized tasks, less reliable for factual recall.
Fine-tuning is not base training; the new data ends up like salt
dissolved in the water.
• Via model inputs (i.e., insert the knowledge into the input prompt):
works like short-term memory, bound by token limits.
Fine tuning:
• A type of “transfer learning”.
• It’s about teaching a new task, not new information or knowledge.
• It is not a reliable way to store knowledge as part of the model.
• Fine-tuning does not overrule hallucination (confabulation).
• Slow, difficult and expensive.
• Fine tuning is 1000x more difficult compared to prompting.
Embedding (knowledge in the prompt):
• Fast, easy, cheap.
• Recalls exact information.
• Adding new content is quick, easy.
• Way more scalable.
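The “via model inputs” path is simple prompt assembly. The instruction wording and the character-based budget below are assumptions for illustration; real systems budget tokens rather than characters.

```python
def build_prompt(question, retrieved_chunks, max_chars=1000):
    """Insert retrieved knowledge into the model input ("short-term
    memory"), dropping chunks that would exceed the context budget."""
    context = ""
    for chunk in retrieved_chunks:
        if len(context) + len(chunk) > max_chars:
            break  # token/character limit reached: remaining chunks are dropped
        context += chunk + "\n"
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is RAG?",
    ["chunk one " * 10, "chunk two " * 10],  # each chunk is 100 characters
    max_chars=150,
)
```

With a 150-character budget only the first chunk fits, showing how the token limit bounds this "short-term memory".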
- Vector Search Is Not Enough
- Reciprocal Rank Fusion (RRF) for hybrid queries
http://daron.me | @daronyondem