
Retrieval Augmented Generation

Daron Yondem
September 20, 2023

Transcript

  1. Retrieval Augmented Generation
    Daron Yöndem
    Azure Application Innovation Tech Lead
    Microsoft
    https://daron.me

  2. A type of language generation model that
    combines pre-trained parametric memory (the model's
    weights) with non-parametric memory (an external
    retrieval index) for language generation.

  3. Give me the gist!
    LLM workflow (sketched in code below):
    User Question → Query My Data → Add Results to Prompt → Query Model → Send Results
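
    A minimal sketch of that loop in Python, assuming the v1 OpenAI SDK and a
    hypothetical search_my_data function standing in for whatever index you query:

      # pip install openai
      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def search_my_data(question: str, top_k: int = 3) -> list[str]:
          """Hypothetical retrieval step; replace with your own index query."""
          raise NotImplementedError

      def answer(question: str) -> str:
          # Query my data, add the results to the prompt, then query the model.
          context = "\n\n".join(search_my_data(question))
          prompt = ("Answer using only the context below.\n\n"
                    f"Context:\n{context}\n\nQuestion: {question}")
          response = client.chat.completions.create(
              model="gpt-4",
              messages=[{"role": "user", "content": prompt}],
          )
          return response.choices[0].message.content  # send results to the user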

  4. Give me the gist!
    Embedding workflow: Documents → Chunking → Send Chunk → Save Embedding,
    enabling hybrid search.
    (The slide diagrams each chunk mapped to a numeric embedding vector,
    e.g. [-2, -1, 0, 1], [2, 3, 4, 5], [6, 7, 8, 9].)
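
    A sketch of that ingestion path under the same assumptions (v1 OpenAI SDK;
    a plain list stands in for a real vector index):

      from openai import OpenAI

      client = OpenAI()
      vector_store: list[dict] = []  # stand-in for a real vector database

      def ingest(chunks: list[str]) -> None:
          # Send each chunk to the embedding model and save the vector.
          for chunk in chunks:
              result = client.embeddings.create(
                  model="text-embedding-ada-002", input=chunk)
              vector_store.append({"text": chunk,
                                   "vector": result.data[0].embedding})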

  5. Level 1 - Retrieval
    - BM25 (Best Match 25) / traditional full-text search
    - TF-IDF (Term Frequency - Inverse Document Frequency)
    - Neural network embeddings: rank documents by their similarity
    in the vector space, typically searched with the HNSW algorithm
    - Hybrid search / RRF (Reciprocal Rank Fusion), sketched below
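
    Reciprocal Rank Fusion itself is only a few lines. A sketch of the common
    formula, score(d) = sum of 1 / (k + rank(d)) across rankers, with k
    conventionally set to 60:

      def reciprocal_rank_fusion(result_lists: list[list[str]],
                                 k: int = 60) -> list[str]:
          # Each inner list holds document IDs, best match first.
          scores: dict[str, float] = {}
          for results in result_lists:
              for rank, doc_id in enumerate(results, start=1):
                  scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
          return sorted(scores, key=scores.get, reverse=True)

      # Fuse a BM25 ranking with a vector ranking of the same corpus:
      fused = reciprocal_rank_fusion([["d3", "d1", "d2"], ["d1", "d4", "d3"]])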

  6. Level 2 - Ranking
    - Semantic Ranking

    Search Configuration       Customer datasets  Beir       Multilingual Academic (MIRACL)
                               [NDCG@3]           [NDCG@10]  [NDCG@10]
    Keyword                    40.6               40.6       49.6
    Vector (Ada-002)           43.8               45.0       58.3
    Hybrid (Keyword + Vector)  48.4               48.4       58.8
    Hybrid + Semantic ranker   60.1               50.0       72.0
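
    NDCG@k, the metric reported above, rewards placing relevant results near
    the top of the ranking. A minimal sketch of the linear-gain variant,
    assuming graded relevance labels for the returned documents:

      import math

      def dcg(relevances: list[float], k: int) -> float:
          # Gain discounted by log2 of the (1-based) rank.
          return sum(rel / math.log2(i + 2)
                     for i, rel in enumerate(relevances[:k]))

      def ndcg(relevances: list[float], k: int) -> float:
          ideal = dcg(sorted(relevances, reverse=True), k)
          return dcg(relevances, k) / ideal if ideal > 0 else 0.0

      # Relevance of returned documents, in returned order:
      print(round(ndcg([3, 2, 0, 1], k=3), 3))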

  7. Retrieval Models
    - Full-text search (BM25): exact keyword match, proximity search,
    term weighting
    - Pure vector search (ANN): semantic similarity search, multi-modal
    search, multi-lingual search
    - Hybrid search (BM25 + ANN): combines the capabilities of both

  8. Document Chunking
    - Splitting documents to fit within the LLM context window limit.
    - Helps with ranking individual sections of documents.
    - Each embedding model can encode only a limited amount of data per vector.
    - Packing a long passage with multiple topics into a single vector can
    cause important nuance to get lost.
    - Overlapping text between chunks can help preserve context; see the
    sketch below.
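
    A sketch of a simple character-based chunker with overlap; token-aware
    splitting (see the next slide) is usually preferable in practice:

      def chunk_with_overlap(text: str, chunk_size: int = 1000,
                             overlap: int = 200) -> list[str]:
          # Slide a window across the text; each chunk repeats the last
          # `overlap` characters of the previous one, so content that spans
          # a boundary appears intact in at least one chunk.
          step = chunk_size - overlap
          return [text[i:i + chunk_size]
                  for i in range(0, max(len(text) - overlap, 1), step)]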

  9. Tokens and Tokenization
    - GPT-style BPE tokenizers use a vocabulary of roughly 50K tokens.
    - Example from the slide: a 60-character sentence encodes to 13 token IDs:
    [464, 5044, 1422, 470, 3272, 262, 4675, 780, 340, 373, 1165, 10032, 13]
    - Each token maps to an N-dimensional embedding vector, e.g.
    [0.653249, -0.211342, 0.000436 … -0.532995, 0.900358, 0.345422],
    …a continuous space representation we can use as model input.
    - Embeddings for similar concepts will be close to each other in
    N-dimensional space (e.g., vectors for “dog” and “hound” will have a
    cosine similarity closer to 1 than “dog” and “chair”).
    - Less common words will tend to split into multiple tokens.
    - There’s a bias towards English in the BPE corpus; the slide contrasts
    two examples: (76 chars, 17 tokens) vs. (55 chars, 24 tokens).
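
    A sketch with the tiktoken library, using the GPT-2 BPE encoding (the ~50K
    vocabulary referenced above); exact IDs and counts vary by model:

      import tiktoken  # pip install tiktoken

      enc = tiktoken.get_encoding("gpt2")

      text = "The animal didn't cross the street because it was too tired."
      ids = enc.encode(text)
      print(f"{len(text)} chars -> {len(ids)} tokens: {ids}")

      # Less common words tend to split into multiple tokens:
      for word in ["dog", "hound"]:
          print(word, "->", enc.encode(word))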

  10. Self-Attention (Transformer Model)
    Intuition:
    • Each self-attention “head” learns relationships between a token
    and all other tokens in the context
    • Multiple heads in a layer focus on learning different
    relationships, including grammatical and semantic ones
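
    A sketch of one head as scaled dot-product attention in NumPy, with random
    matrices standing in for learned projections; a real layer runs several
    such heads in parallel:

      import numpy as np

      def softmax(x: np.ndarray) -> np.ndarray:
          e = np.exp(x - x.max(axis=-1, keepdims=True))
          return e / e.sum(axis=-1, keepdims=True)

      def attention_head(q, k, v):
          # Compare every token's query against every token's key,
          # then mix the value vectors with the resulting weights.
          weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (tokens, tokens)
          return weights @ v

      rng = np.random.default_rng(0)
      q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))  # 5 tokens
      out = attention_head(q, k, v)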

  11. word2vec
    Embedding and Vector Space
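
    A toy sketch with the gensim library; a real word2vec model needs far more
    text than this, but the co-occurrence intuition is the same:

      from gensim.models import Word2Vec  # pip install gensim

      sentences = [
          ["the", "dog", "chased", "the", "cat"],
          ["the", "hound", "chased", "the", "cat"],
          ["the", "chair", "stood", "in", "the", "room"],
      ]
      model = Word2Vec(sentences, vector_size=16, window=2,
                       min_count=1, epochs=200)

      # Words used in similar contexts end up with nearby vectors:
      print(model.wv.similarity("dog", "hound"))
      print(model.wv.similarity("dog", "chair"))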

  12. RAG in Action

  13. Fine Tuning vs Embedding
    GPT can learn knowledge in two ways:
    • Via model weights (i.e., fine-tune the model on a training set):
    good for teaching specialized tasks, less reliable for factual recall.
    It is not base training; like salt dissolved in water, the new
    knowledge blends into the existing weights.
    • Via model inputs (i.e., insert the knowledge into an input
    message): short-term memory, bound by token limits (sketched below).
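
    A sketch of the “via model inputs” path: counting tokens with tiktoken and
    stopping at a hypothetical budget, since anything past the context window
    never reaches the model:

      import tiktoken

      enc = tiktoken.get_encoding("cl100k_base")

      def build_prompt(question: str, chunks: list[str],
                       budget: int = 3000) -> str:
          # Insert retrieved knowledge until the token budget is spent.
          picked, used = [], 0
          for chunk in chunks:
              n = len(enc.encode(chunk))
              if used + n > budget:
                  break
              picked.append(chunk)
              used += n
          return ("Context:\n" + "\n\n".join(picked) +
                  f"\n\nQuestion: {question}")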

  14. Fine Tuning
    • A type of “transfer learning”.
    • It is about teaching a new task, not new information or knowledge.
    • It is not a reliable way to store knowledge as part of the model.
    • Fine-tuning does not prevent hallucination (confabulation).
    • Slow, difficult, and expensive.
    • Fine-tuning is roughly 1000x more difficult than prompt
    engineering.

  15. Embeddings
    • Fast, easy, and cheap.
    • Recall exact information.
    • Adding new content is quick and easy.
    • Far more scalable.

  16. Resources
    - Vector Search Is Not Enough
    https://drn.fyi/45Wy8Tk
    - Reciprocal Rank Fusion (RRF) for hybrid queries
    https://drn.fyi/44Z5EqK

  17. Thanks
    http://daron.me | @daronyondem
