
Retrieval Augmented Generation

Daron Yondem
September 20, 2023


Transcript

  1. A type of language generation model that combines pre-trained parametric and non-parametric memory for language generation.
  2. Give me the gist! The query-time LLM workflow: User Question → Query My Data → Add Results to Prompt → Query Model → Send Results (a minimal sketch of this loop follows the transcript).
  3. Give me the gist! The embedding (indexing) workflow: Documents → Chunking → Send Chunk → Embedding → Save Embedding, which enables hybrid search.
  4. Level 1 - Retrieval
     - BM25 (Best Match 25) / traditional full-text search
     - TF-IDF (Term Frequency-Inverse Document Frequency)
     - Neural network embeddings: rank documents based on their similarity in the vector space (HNSW algorithm)
     - Hybrid search / RRF (Reciprocal Rank Fusion); a small RRF sketch follows the transcript
  5. Level 2 - Ranking - Semantic Ranking

     Search Configuration      | Customer datasets [NDCG@3] | BEIR [NDCG@10] | Multilingual Academic (MIRACL) [NDCG@10]
     Keyword                   | 40.6                       | 40.6           | 49.6
     Vector (Ada-002)          | 43.8                       | 45.0           | 58.3
     Hybrid (Keyword + Vector) | 48.4                       | 48.4           | 58.8
     Hybrid + Semantic ranker  | 60.1                       | 50.0           | 72.0
  6. Retrieval Models: a capability comparison of full-text search (BM25), pure vector search (ANN), and hybrid search (BM25 + ANN) across exact keyword match, proximity search, term weighting, semantic similarity search, multi-modal search, and multi-lingual search.
  7. Document Chunking
     - Splitting documents to fit the LLM context window limit.
     - Helps rank individual sections of documents.
     - Each vector can embed only a limited amount of data per model.
     - Packing a long passage with multiple topics into a single vector can cause important nuance to get lost.
     - Overlapping text between chunks might be helpful (see the chunking and embedding sketch after the transcript).
  8. Tokens and Tokenization
     - ~50K vocab size; an example 60-character sentence encodes to 13 tokens: [464, 5044, 1422, 470, 3272, 262, 4675, 780, 340, 373, 1165, 10032, 13]. Other examples: 76 chars / 17 tokens, 55 chars / 24 tokens.
     - Each token maps to an N-dimensional embedding vector, e.g., [0.653249, -0.211342, 0.000436 … -0.532995, 0.900358, 0.345422]: a continuous-space representation we can use as model input.
     - Embeddings for similar concepts will be close to each other in N-dimensional space (e.g., vectors for “dog” and “hound” will have a cosine similarity closer to 1 than “dog” and “chair”).
     - Less common words will tend to split into multiple tokens, and there is a bias towards English in the BPE corpus (a tokenization sketch follows the transcript).
  9. Self-Attention (Transformer Model). Intuition:
     • Each self-attention “head” learns relationships between a token and all other tokens in the context.
     • Multiple heads in a layer focus on learning different relationships, including grammar and semantics.
     (A minimal scaled dot-product attention sketch follows the transcript.)
  10. Fine Tuning vs Embedding. GPT can learn knowledge in two ways:
     • Via model weights (i.e., fine-tune the model on a training set): teaches specialized tasks, but is less reliable for factual recall; it is not base training ("the salt is already in the water").
     • Via model inputs (i.e., insert the knowledge into an input message): short-term memory, bound by token limits.
  11. Fine Tuning
     • A type of “transfer learning”.
     • It is about teaching a new task, not new information or knowledge.
     • It is not a reliable way to store knowledge as part of the model.
     • Fine-tuning does not eliminate hallucination (confabulation).
     • Slow, difficult, and expensive.
     • Fine-tuning is 1000x more difficult than prompt engineering.
  12. Embeddings
     • Fast, easy, cheap.
     • Recalls exact information.
     • Adding new content is quick and easy.
     • Way more scalable.
  13. Resources
     - Vector Search Is Not Enough: https://drn.fyi/45Wy8Tk
     - Reciprocal Rank Fusion (RRF) for hybrid queries: https://drn.fyi/44Z5EqK
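
The sketches below expand on the more technical slides; each one is an illustrative Python sketch of the idea, not code from the deck.

First, the query-time workflow from slide 2: take the user question, query your own data, add the results to the prompt, query the model, and send back the result. `search_my_data` and `call_llm` are hypothetical placeholders standing in for a real retriever and a real model call.

    # Minimal sketch of the query-time RAG loop (slide 2). Placeholders only.
    def search_my_data(question: str, documents: list[str], top_k: int = 3) -> list[str]:
        """Toy retriever: rank documents by word overlap with the question."""
        q_words = set(question.lower().split())
        ranked = sorted(documents,
                        key=lambda d: len(q_words & set(d.lower().split())),
                        reverse=True)
        return ranked[:top_k]

    def call_llm(prompt: str) -> str:
        """Placeholder for the actual model call (e.g., a chat-completions request)."""
        return "[model answer grounded in]\n" + prompt

    def answer(question: str, documents: list[str]) -> str:
        results = search_my_data(question, documents)       # 1. query my data
        context = "\n".join("- " + r for r in results)      # 2. add results to prompt
        prompt = ("Answer using only the context below.\n"
                  f"Context:\n{context}\n\nQuestion: {question}")
        return call_llm(prompt)                              # 3. query model, 4. send results

    docs = ["RAG combines retrieval with generation.",
            "BM25 is a traditional full-text ranking function.",
            "HNSW is an approximate nearest neighbour index for vectors."]
    print(answer("What does RAG combine?", docs))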
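
Next, the indexing workflow from slides 3 and 7: split documents into overlapping chunks, embed each chunk, and save the (chunk, vector) pairs. `fake_embed` is a deterministic stand-in for a real embedding model and carries no semantic meaning; the chunk sizes are arbitrary.

    # Sketch of chunking with overlap plus embedding (slides 3 and 7).
    import hashlib

    def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
        """Fixed-size character chunks; the overlap keeps context across chunk borders."""
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

    def fake_embed(chunk: str, dims: int = 8) -> list[float]:
        """Deterministic stand-in for an embedding model (NOT semantically meaningful)."""
        digest = hashlib.sha256(chunk.encode()).digest()
        return [b / 255.0 for b in digest[:dims]]

    def index_document(text: str) -> list[tuple[str, list[float]]]:
        return [(chunk, fake_embed(chunk)) for chunk in chunk_text(text)]

    doc = "Retrieval Augmented Generation splits documents into chunks. " * 20
    index = index_document(doc)
    print(len(index), "chunks; first vector:", index[0][1])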
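
Slide 4 names Reciprocal Rank Fusion (RRF) for hybrid search. In the standard formulation, each document's fused score is the sum of 1 / (k + rank) over the result lists it appears in, with k commonly set to 60; the sample rankings below are made up.

    # Reciprocal Rank Fusion over two ranked lists (e.g., BM25 and vector search).
    def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
        scores: dict[str, float] = {}
        for ranked_list in rankings:
            for rank, doc_id in enumerate(ranked_list, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)

    keyword_results = ["doc3", "doc1", "doc7"]   # e.g., from BM25
    vector_results = ["doc1", "doc5", "doc3"]    # e.g., from ANN vector search
    print(reciprocal_rank_fusion([keyword_results, vector_results]))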
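
Slide 8's token IDs come from a ~50K-vocabulary BPE tokenizer. The sketch below assumes the tiktoken library and its r50k_base encoding; the sentence is arbitrary, so the IDs and counts will differ from the slide's example.

    # Tokenization sketch (slide 8). Requires: pip install tiktoken
    import tiktoken

    enc = tiktoken.get_encoding("r50k_base")     # ~50K vocabulary BPE encoding
    print("vocab size:", enc.n_vocab)

    sentence = "The animal did not cross the street because it was too tired."
    tokens = enc.encode(sentence)
    print(len(sentence), "chars ->", len(tokens), "tokens:", tokens)

    # Encode single words to see whether they split into multiple BPE pieces.
    for word in ["dog", "hound", "chair"]:
        pieces = enc.encode(word)
        print(word, "->", pieces, [enc.decode([p]) for p in pieces])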
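
Finally, slide 9's intuition as a single head of scaled dot-product self-attention in NumPy: every token produces a softmax weight over all tokens in the context, and multi-head attention simply runs several of these in parallel with separate projection matrices. Shapes and random values are purely illustrative.

    # Single-head scaled dot-product self-attention (slide 9).
    import numpy as np

    def softmax(x: np.ndarray) -> np.ndarray:
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # (tokens, tokens) relationship matrix
        return weights @ V, weights

    rng = np.random.default_rng(0)
    tokens, d_model, d_head = 5, 16, 8              # 5 token embeddings of size 16
    X = rng.normal(size=(tokens, d_model))
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    output, weights = self_attention(X, Wq, Wk, Wv)
    print("attention weights shape:", weights.shape)  # each token attends to all 5 tokens
    print("rows sum to 1:", bool(np.allclose(weights.sum(axis=1), 1.0)))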