Retrieval Augmented Generation
Azure Application Innovation Tech Lead
A type of language generation model that combines pre-trained parametric memory (the model's weights) with non-parametric memory (an external datastore).
[Diagram: RAG flow. Query time: user asks "Give me the gist!" → query my data → add results to prompt. Ingestion time: send chunk → save embedding.]
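The retrieval flow above (ingest chunks and save embeddings; embed the query, retrieve, add results to the prompt) can be sketched end to end. This is a toy: the bag-of-words "embedding" and the sample chunks are stand-ins for a real embedding model and vector store.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words Counter (a real system would call an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: "send chunk, save embedding".
chunks = [
    "Azure AI Search supports hybrid retrieval.",
    "Paris is the capital of France.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query time: embed the question, retrieve the best chunk, add it to the prompt.
question = "What is the capital of France?"
q_vec = embed(question)
best_chunk = max(index, key=lambda pair: cosine(q_vec, pair[1]))[0]
prompt = f"Context: {best_chunk}\n\nQuestion: {question}"
```

The LLM then answers from the retrieved context instead of relying on its weights alone.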
Level 1 - Retrieval
- BM25 (Best Match 25) / Traditional Full-Text Search
- TF-IDF (Term Frequency — Inverse Document Frequency)
- Neural network embeddings: rank documents based on their
similarity in the vector space / HNSW algorithm
- Hybrid Search / RRF (Reciprocal Rank Fusion)
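Reciprocal Rank Fusion can be sketched in a few lines: each document's fused score is the sum of 1/(k + rank) over every result list it appears in. The document IDs and ranked lists below are made up for illustration; k = 60 is the constant commonly used in practice.

```python
def rrf(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc2", "doc1", "doc5"]  # e.g., BM25 order
vector_results = ["doc1", "doc3", "doc2"]   # e.g., HNSW order
fused = rrf([keyword_results, vector_results])
```

Documents that rank well in both lists (here doc1 and doc2) float to the top, which is why RRF works well for hybrid keyword + vector queries.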
Level 2 - Ranking
- Semantic Ranking
Search Configuration         Customer
Keyword                        40.6    40.6    49.6
Vector (Ada-002)               43.8    45.0    58.3
Hybrid (Keyword + Vector)      48.4    48.4    58.8
Hybrid + Semantic ranker       60.1    50.0    72.0
[Diagram: exact keyword match (BM25), pure vector search (ANN), and hybrid search (BM25 + ANN)]
- Splitting documents to accommodate the LLM context window limit.
- Helps rank sections of documents.
- Each vector can embed a limited amount of data per model.
- Embedding a long passage with multiple topics into a single vector
can cause important nuance to get lost.
- Overlapping text between chunks might be helpful.
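A minimal sketch of chunking with overlap. It is character-based for simplicity; a real pipeline would count tokens, and the chunk and overlap sizes below are illustrative, not recommendations.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks; consecutive chunks share
    `overlap` characters so context spanning a boundary survives in both."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "0123456789" * 50  # 500 characters of sample data
chunks = chunk_text(text, chunk_size=200, overlap=50)
```

Each chunk starts 150 characters after the previous one, so the last 50 characters of one chunk reappear at the start of the next.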
Tokens and Tokenization
~50K vocab size
[464, 5044, 1422, 470, 3272, 262, 4675, 780, 340, 373, 1165, 10032, 13]
(76 chars, 17 tokens)
(55 chars, 24 tokens)
[0.653249, -0.211342, 0.000436 … -0.532995, 0.900358, 0.345422]
…a continuous space representation we can use as model input
Embeddings for similar concepts will be close to each other in N-dimensional space
(e.g., vectors for “dog” and “hound” will have a cosine similarity closer to 1 than “dog” and “chair”)
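The cosine-similarity comparison can be illustrated with toy vectors. The three-dimensional "embeddings" below are invented for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (made-up values): similar concepts point in similar directions.
dog = [0.9, 0.8, 0.1]
hound = [0.85, 0.75, 0.2]
chair = [0.1, 0.2, 0.9]
```

With these values, cosine_similarity(dog, hound) is close to 1 while cosine_similarity(dog, chair) is much lower, mirroring the "dog"/"hound" vs "dog"/"chair" example above.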
Less common words will tend to split into multiple tokens.
There’s a bias towards English in the BPE corpus.
Self-Attention (Transformer Model)
• Each self-attention “head” learns relationships between a token
and all other tokens in the context
• Multiple heads in a layer focus on learning different
relationships, including grammar and semantics
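A single head boils down to scaled dot-product attention. Below is a pure-Python sketch that uses the raw token vectors as queries, keys, and values; a real Transformer head would first project tokens through learned query/key/value matrices.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """One attention head: each token's output is a similarity-weighted
    mix of ALL token vectors in the context (no learned projections)."""
    d = len(tokens[0])
    outputs = []
    for q in tokens:
        # Scaled dot-product score of this token against every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)  # attention weights sum to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, tokens))
                        for j in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-d token vectors
mixed = self_attention(tokens)
```

Each output row is a convex combination of the input vectors, which is how a head blends information from the whole context into every position.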
Embedding and Vector Space
RAG in Action
Fine Tuning vs Embedding
GPT can learn knowledge in two ways:
• Via model weights (i.e., fine-tune the model on a training set):
good for teaching specialized tasks, less reliable for factual recall.
Fine-tuning is not base training; the new data ends up like salt
dissolved in the water.
• Via model inputs (i.e., insert the knowledge into the input prompt):
works like short-term memory, bound by token limits.
Fine tuning:
• A type of “transfer learning”.
• It’s about teaching a new task, not new information or knowledge.
• It is not a reliable way to store knowledge as part of the model.
• Fine-tuning does not overrule hallucination (confabulation).
• Slow, difficult and expensive.
• Fine tuning is 1000x more difficult compared to prompting.
Embedding (knowledge in the prompt):
• Fast, easy, cheap.
• Recalls exact information.
• Adding new content is quick, easy.
• Way more scalable.
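The “via model inputs” path is simple prompt assembly. The instruction wording and the character-based budget below are assumptions for illustration; real systems budget tokens rather than characters.

```python
def build_prompt(question, retrieved_chunks, max_chars=1000):
    """Insert retrieved knowledge into the model input ("short-term
    memory"), dropping chunks that would exceed the context budget."""
    context = ""
    for chunk in retrieved_chunks:
        if len(context) + len(chunk) > max_chars:
            break  # token/character limit reached: remaining chunks are dropped
        context += chunk + "\n"
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is RAG?",
    ["chunk one " * 10, "chunk two " * 10],  # each chunk is 100 characters
    max_chars=150,
)
```

With a 150-character budget only the first chunk fits, showing how the token limit bounds this "short-term memory".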
- Vector Search Is Not Enough
- Reciprocal Rank Fusion (RRF) for hybrid queries
http://daron.me | @daronyondem