Slide 1

Slide 1 text

Retrieval Augmented Generation
Daron Yöndem
Azure Application Innovation Tech Lead, Microsoft
https://daron.me

Slide 2

Slide 2 text

A type of language model that combines pre-trained parametric and non-parametric memory for language generation.

Slide 3

Slide 3 text

Give me the gist! (user question)
LLM Workflow: User Question → Query My Data → Add Results to Prompt → Query Model → Send Results
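
A minimal sketch of this request flow in Python, assuming two hypothetical helpers: `search_my_data` stands in for whatever call queries your own data store, and `call_llm` for a chat-completion call; neither is a real library API.

```python
# Minimal RAG request-flow sketch with hypothetical helpers.
def search_my_data(question: str, top_k: int = 3) -> list[str]:
    """Query your own data (keyword, vector, or hybrid search)."""
    raise NotImplementedError  # e.g., a call to your search index

def call_llm(prompt: str) -> str:
    """Send the augmented prompt to the language model."""
    raise NotImplementedError  # e.g., a chat-completion API call

def answer(question: str) -> str:
    # 1. Query my data with the user question.
    results = search_my_data(question)
    # 2. Add the results to the prompt.
    context = "\n\n".join(results)
    prompt = (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    # 3. Query the model and send the result back to the user.
    return call_llm(prompt)
```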

Slide 4

Slide 4 text

Give me the gist!
Embedding Workflow: Documents → Chunking → Send Chunk → Save Embedding → Hybrid Search Enablement
Example embedding vectors: [6, 7, 8, 9], [-2, -1, 0, 1], [2, 3, 4, 5]
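
A minimal sketch of the indexing side of this workflow, assuming hypothetical `embed` and `save_embedding` helpers in place of a real embedding-model call and a real search-index write.

```python
# Minimal embedding-workflow sketch with hypothetical helpers.
def embed(chunk: str) -> list[float]:
    """Send a chunk to the embedding model; returns its vector."""
    raise NotImplementedError

def save_embedding(chunk: str, vector: list[float]) -> None:
    """Save the vector alongside the chunk text to enable hybrid search."""
    raise NotImplementedError

def index_chunks(chunks: list[str]) -> None:
    for chunk in chunks:               # chunks come from the chunking step
        vector = embed(chunk)          # send chunk
        save_embedding(chunk, vector)  # save embedding
```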

Slide 5

Slide 5 text

Level 1 - Retrieval
- BM25 (Best Match 25) / Traditional Full-Text Search
- TF-IDF (Term Frequency - Inverse Document Frequency)
- Neural Network Embeddings: ranks documents based on their similarity in the vector space / HNSW algorithm
- Hybrid Search / RRF (Reciprocal Rank Fusion), see the sketch below
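
A minimal sketch of Reciprocal Rank Fusion, the step hybrid search uses to merge the keyword (BM25) ranking with the vector-similarity ranking. The k = 60 constant is the commonly cited default, and the document IDs are illustrative.

```python
# Reciprocal Rank Fusion: each document scores 1/(k + rank) in every
# ranking it appears in, and the scores are summed across rankings.
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_ranking = ["doc3", "doc1", "doc7"]     # full-text search results
vector_ranking = ["doc1", "doc9", "doc3"]   # nearest-neighbor results
print(rrf([bm25_ranking, vector_ranking]))  # doc1 and doc3, found by both, rank highest
```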

Slide 6

Slide 6 text

Level 2 - Ranking - Semantic Ranking

Search Configuration       | Customer datasets [NDCG@3] | Beir [NDCG@10] | Multilingual Academic (MIRACL) [NDCG@10]
Keyword                    | 40.6                       | 40.6           | 49.6
Vector (Ada-002)           | 43.8                       | 45.0           | 58.3
Hybrid (Keyword + Vector)  | 48.4                       | 48.4           | 58.8
Hybrid + Semantic ranker   | 60.1                       | 50.0           | 72.0
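
For reference, a minimal sketch of how NDCG@k, the metric reported above, can be computed. This uses the common linear-gain form of DCG; the relevance labels in the example are illustrative.

```python
# NDCG@k: DCG of the returned order divided by DCG of the ideal order.
import math

def dcg(relevances: list[float]) -> float:
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances: list[float], k: int) -> float:
    ideal = sorted(relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevances[:k]) / denom if denom > 0 else 0.0

# Relevance labels of the top results as returned by one search configuration:
print(round(ndcg_at_k([3, 2, 0, 1], k=3), 3))
```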

Slide 7

Slide 7 text

Retrieval Models: Full-text search (BM25) vs. Pure Vector search (ANN) vs. Hybrid search (BM25 + ANN), compared on: Exact keyword match, Proximity search, Term weighting, Semantic similarity search, Multi-modal search, Multi-lingual search

Slide 8

Slide 8 text

Document Chunking
- Splitting documents to accommodate the LLM context window limit.
- Helps with ranking sections of documents.
- Each vector can embed a limited amount of data per model.
- Packing a long passage with multiple topics into a single vector means important nuance can get lost.
- Overlapping text between chunks might be helpful (see the sketch below).
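
A minimal chunking sketch with overlap, splitting on whitespace tokens. The chunk size and overlap values are illustrative; real pipelines often split on sentence or section boundaries instead.

```python
# Fixed-size chunking with overlapping windows of words.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    words = text.split()
    step = chunk_size - overlap            # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```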

Slide 9

Slide 9 text

Tokens and Tokenization
- ~50K vocab size. Example: a 60-char sentence tokenizes to 13 tokens: [464, 5044, 1422, 470, 3272, 262, 4675, 780, 340, 373, 1165, 10032, 13]
- Less common words tend to split into multiple tokens, and there is a bias towards English in the BPE corpus (slide examples: 76 chars / 17 tokens vs. 55 chars / 24 tokens).
- Each token maps to an N-dimensional embedding vector, e.g. [0.653249, -0.211342, 0.000436 … -0.532995, 0.900358, 0.345422]: a continuous space representation we can use as model input.
- Embeddings for similar concepts will be close to each other in N-dimensional space (e.g., vectors for “dog” and “hound” will have a cosine similarity closer to 1 than “dog” and “chair”).
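
A minimal tokenization sketch using the tiktoken package (an assumption, since the slide does not name a tokenizer). The sentence is illustrative; the "gpt2" encoding matches the ~50K vocabulary size mentioned above.

```python
# Count characters vs. BPE tokens for an illustrative sentence.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # ~50K-entry BPE vocabulary

text = "The dog didn't cross the street because it was too tired."
token_ids = enc.encode(text)

print(len(text), "chars ->", len(token_ids), "tokens")
print(token_ids)
```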

Slide 10

Slide 10 text

Self-Attention (Transformer Model)
Intuition:
• Each self-attention “head” learns relationships between a token and all other tokens in the context.
• Multiple heads in a layer focus on learning different relationships, including grammar and semantics.
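
A minimal single-head scaled dot-product attention sketch in NumPy, showing how each token's query is scored against every other token's key. Shapes and weights are random placeholders, not values from a real model.

```python
# Single-head self-attention: every token mixes information from all tokens.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) token-to-token relationships
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```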

Slide 11

Slide 11 text

word2vec Embedding and Vector Space
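
A minimal word2vec sketch using gensim (an assumption, since the slide names word2vec but no library). The toy corpus is far too small to produce meaningful similarities; it only illustrates training and querying the vector space.

```python
# Train a tiny word2vec model and compare words in its vector space.
from gensim.models import Word2Vec

corpus = [
    ["the", "dog", "chased", "the", "ball"],
    ["the", "hound", "chased", "the", "cat"],
    ["i", "sat", "on", "the", "chair"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["dog"].shape)                # (50,) embedding vector
print(model.wv.similarity("dog", "hound"))  # cosine similarity in the vector space
print(model.wv.similarity("dog", "chair"))
```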

Slide 12

Slide 12 text

RAG in Action

Slide 13

Slide 13 text

Fine Tuning vs Embedding
GPT can learn knowledge in two ways:
• Via model weights (i.e., fine-tune the model on a training set): good for teaching specialized tasks, less reliable for factual recall. Fine-tuning is not base training; the knowledge dissolves into the weights like salt in water.
• Via model inputs (i.e., insert the knowledge into an input message): acts as short-term memory, bound by token limits.

Slide 14

Slide 14 text

Fine Tuning
• A type of “transfer learning”.
• It is about teaching a new task, not new information or knowledge.
• It is not a reliable way to store knowledge as part of the model.
• Fine-tuning does not eliminate hallucination (confabulation).
• Slow, difficult, and expensive.
• Fine-tuning is roughly 1000x more difficult than prompt engineering.

Slide 15

Slide 15 text

Embeddings
• Fast, easy, cheap.
• Recalls exact information.
• Adding new content is quick and easy.
• Way more scalable.

Slide 16

Slide 16 text

Resources
- Vector Search Is Not Enough: https://drn.fyi/45Wy8Tk
- Reciprocal Rank Fusion (RRF) for hybrid queries: https://drn.fyi/44Z5EqK

Slide 17

Slide 17 text

Thanks http://daron.me | @daronyondem