
Developing and serving RAG-Based LLM applications in production

December 07, 2023


Developing and serving LLM applications involves many moving pieces. This talk provides a comprehensive guide to developing retrieval-augmented generation (RAG) based LLM applications, with a focus on scale (embed, index, serve, etc.), evaluation (component-wise and overall) and production workflows. We'll also explore more advanced topics such as hybrid routing to close the gap between OSS and closed LLMs.


1. Evaluating RAG-based LLM applications is crucial for identifying and productionizing the best configuration.
2. Developing your LLM application with scalable workloads involves minimal changes to existing code.
3. Mixture of Experts (MoE) routing allows you to close the gap between open source and closed LLMs.



More Decks by Anyscale



  1. Creating our Vector DB: Data sources → Load (sources) → Chunk (docs, TextSplitter → chunks) → Embed (embedding model) → Index (Vector DB storing text, source, embedding)
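The load, chunk, embed and index steps on this slide can be sketched roughly as follows. This is a minimal, standard-library-only illustration rather than the talk's implementation: `chunk_text`, `toy_embed` and `build_index` are hypothetical names, the hashed bag-of-words vector is a stand-in for a real embedding model, and a plain Python list stands in for the vector DB.

```python
import hashlib
import math


def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Split text into overlapping character windows (stand-in for a TextSplitter)."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def toy_embed(text, dim=64):
    """Hashed bag-of-words vector, L2-normalized (assumption: NOT a real embedding model)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(sections, chunk_size=300, chunk_overlap=50):
    """Turn {'source', 'text'} sections into (text, source, embedding) records."""
    index = []
    for section in sections:
        for chunk in chunk_text(section["text"], chunk_size, chunk_overlap):
            index.append({
                "text": chunk,
                "source": section["source"],
                "embedding": toy_embed(chunk),
            })
    return index


sections = [{"source": "doc#intro", "text": "x" * 250}]
index = build_index(sections, chunk_size=100, chunk_overlap=20)
print(len(index))
```

Each record keeps its `source`, which is what later makes it possible to score retrieval against a known source.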
  2. Sections
    {'source': 'https://docs.ray.io/en/master/rllib/rllib-env.html#environments', 'text': '\nEnvironments#\nRLlib works with several different types of environments, including Farama-Foundation Gymnasium, user-defined, multi-agent, and also batched environments.\nTip\nNot all environments work with all algorithms. Check out the algorithm overview for more information.\n'}
    {'source': 'https://docs.ray.io/en/master/rllib/rllib-env.html#configuring-environments', 'text': '\nConfiguring Environments#\nYou can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gym environment name.\nCustom env classes passed directly to the algorithm must take a single env_config parameter in their constructor:\nimport gymnasium as gym\...}
  3. Evaluation (component-wise): Query → Embedding model → Vector DB → Retrieved contexts → LLM → Response; scored with retrieval_score (retrieved contexts vs. source context) and quality_score (LLM)
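The slide names a `retrieval_score` but does not define it. One plausible reading, assumed here, is the fraction of queries whose known source appears among the sources of the retrieved contexts. The function below is a sketch under that assumption, not the talk's code.

```python
def retrieval_score(reference_sources, retrieved_sources_per_query):
    """Fraction of queries whose reference source is among the retrieved sources.

    Assumption: one reference source per query, and a list of retrieved
    sources for each query, in the same order.
    """
    matches = [
        reference in retrieved
        for reference, retrieved in zip(reference_sources, retrieved_sources_per_query)
    ]
    return sum(matches) / len(matches) if matches else 0.0


references = ["doc#a", "doc#b", "doc#c"]
retrieved = [["doc#a", "doc#x"], ["doc#y"], ["doc#c"]]
score = retrieval_score(references, retrieved)
print(round(score, 3))
```

Scoring retrieval separately from the LLM's answer helps tell apart a retrieval failure from a generation failure.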
  4. Evaluation (overall): Query → Embedding model → Vector DB → Retrieved contexts → LLM → Response; scored with quality_score (overall) against a reference response
  5. Experiments: Query → Embedding model → Vector DB → Retrieved contexts → LLM → Response
    llms = ["gpt-3.5-turbo", "gpt-4", "meta-llama/Llama-2-7b", "meta-llama/Llama-2-13b", "meta-llama/Llama-2-70b", "codellama/CodeLlama-34b-Instruct-hf", "mistralai/Mistral-7B-Instruct-v0.1"]
    embedding_model_names = ["thenlper/gte-base", "thenlper/gte-large", "BAAI/bge-large-en", "text-embedding-ada-002"]
    chunk_sizes = [100, 300, 500, 700, 900]
    num_chunks_list = [1, 3, 5, 7, 9]
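As an illustration of how the experiment space on this slide could be enumerated, the sketch below builds the full grid with `itertools.product` and, as a cheaper alternative, sweeps one axis at a time from a base configuration. The slide does not say which strategy the talk uses; the `sweep` helper, the config key names and the base configuration are assumptions.

```python
from itertools import product

llms = ["gpt-3.5-turbo", "gpt-4", "meta-llama/Llama-2-7b", "meta-llama/Llama-2-13b",
        "meta-llama/Llama-2-70b", "codellama/CodeLlama-34b-Instruct-hf",
        "mistralai/Mistral-7B-Instruct-v0.1"]
embedding_model_names = ["thenlper/gte-base", "thenlper/gte-large",
                         "BAAI/bge-large-en", "text-embedding-ada-002"]
chunk_sizes = [100, 300, 500, 700, 900]
num_chunks_list = [1, 3, 5, 7, 9]

# Full grid: every combination of the four axes.
grid = [
    {"llm": llm, "embedding_model_name": emb, "chunk_size": size, "num_chunks": n}
    for llm, emb, size, n in product(llms, embedding_model_names, chunk_sizes, num_chunks_list)
]
print(len(grid))  # 7 * 4 * 5 * 5 = 700


def sweep(base_config, axis, values):
    """Vary a single axis while holding the rest of the configuration fixed."""
    return [{**base_config, axis: value} for value in values]


# Hypothetical base configuration, chosen only for illustration.
base = {"llm": "gpt-3.5-turbo", "embedding_model_name": "thenlper/gte-base",
        "chunk_size": 300, "num_chunks": 5}
chunk_size_runs = sweep(base, "chunk_size", chunk_sizes)
```

The full grid grows multiplicatively, so sweeping one axis at a time and carrying the best value forward is a common way to keep the number of evaluation runs manageable.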
  6. Experiments: generate_responses() → Generated responses → evaluate_responses() with Evaluator (GPT-4), Reference responses and Source context
    Generation prompt: "Answer the query using the context provided. Be succinct."
    Evaluation prompt: "Your job is to rate the quality of our generated answer {generated answer} given a query {query} and a reference answer {reference answer}."
  7. Cold start: LLM
    [{'question': 'Does RLlib work with Farama-Foundation Gymnasium?', 'source': 'https://docs.ray.io/en/latest/rllib/rllib-env.html#environments', 'answer': 'Yes, RLlib works with several types of environments including Farama-Foundation Gymnasium.'},
  8. Fine-tuning: Query → Embedding model with resized tokenizer (fine-tuned: full parameter, frozen layers, linear adapter, etc.) → Embedding; Multiple Negatives Ranking Loss: high similarity with the positive context, low similarity with negative contexts
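To make the Multiple Negatives Ranking Loss on this slide concrete, here is a small pure-Python sketch of its usual in-batch formulation: each query's own context is the positive, every other context in the batch is a negative, and the loss is cross-entropy over scaled cosine similarities. The `scale` value and the function names are assumptions for illustration, and a real fine-tuning run would use a training library rather than this loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def multiple_negatives_ranking_loss(query_embs, context_embs, scale=20.0):
    """Mean cross-entropy where context i is the positive for query i.

    All other contexts in the batch act as negatives for query i.
    """
    losses = []
    for i, query in enumerate(query_embs):
        scores = [scale * cosine(query, context) for context in context_embs]
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = multiple_negatives_ranking_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when each query is closest to its own context and grows when a negative context scores higher than the positive one.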
  9. Prompt engineering: evaluate → compare → iterate (↺)
    prompt-ignore-irrelevant-contexts: retrieval score: 0.6892655367231638, quality score: 3.559322033898305
    Techniques: x-of-thought, multimodal, self-refine, query decomp, additional context, etc.
  10. Lexical search: Query → Embedding model → Vector DB (semantic search) → Retrieved contexts (top_k); Query → BM25 index (lexical search) → Retrieved contexts (top_i); both sets of retrieved contexts → LLM
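The lexical branch of this slide can be illustrated with a compact BM25 scorer, followed by a merge of the semantic and lexical results. This is a sketch under assumptions: whitespace tokenization, the common `k1`/`b` defaults, a non-empty document list, and a simple order-preserving deduplicating merge. `BM25Index` and `merge_contexts` are hypothetical names, and the semantic result list is taken as given.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class BM25Index:
    """Minimal Okapi BM25 over a non-empty list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(doc) for doc in docs]
        self.term_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avgdl = (sum(len(t) for t in self.doc_tokens) / len(self.doc_tokens)) or 1.0
        doc_freq = Counter(term for tokens in self.doc_tokens for term in set(tokens))
        n = len(docs)
        # "+ 1" inside the log keeps every IDF value positive.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1) for t, f in doc_freq.items()}

    def score(self, query, doc_id):
        tf = self.term_freqs[doc_id]
        length_norm = 1 - self.b + self.b * len(self.doc_tokens[doc_id]) / self.avgdl
        total = 0.0
        for term in tokenize(query):
            if term in tf:
                f = tf[term]
                total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def search(self, query, top_i):
        """Return ids of the top_i documents with a non-zero score."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [doc_id for _, doc_id in ranked[:top_i]]


def merge_contexts(semantic_ids, lexical_ids):
    """Semantic results first, then lexical ones not already present."""
    return list(dict.fromkeys(list(semantic_ids) + list(lexical_ids)))


docs = ["ray serve deploys models",
        "rllib supports gymnasium environments",
        "ray data loads datasets"]
bm25 = BM25Index(docs)
lexical = bm25.search("gymnasium environments", top_i=1)
combined = merge_contexts([0, 1], lexical + [2])
print(lexical, combined)
```

Lexical search tends to help with exact identifiers and rare terms that an embedding model may blur together.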
  11. Reranking: Query → Embedding model → Vector DB → Retrieved contexts (high num_chunks) → Reranker (rerank contexts with the predicted tag) → Reranked contexts (top_k) → LLM. The reranker is a supervised model trained using a (query, tag) dataset (created via self-supervision).
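One simple way to "rerank contexts with tag", assumed here for illustration, is to move contexts whose source contains the predicted tag to the front and then keep the first top_k. The supervised tag classifier itself is out of scope, so the predicted tag is passed in directly; `rerank_by_tag` and the example sources are hypothetical.

```python
def rerank_by_tag(contexts, predicted_tag, top_k):
    """Promote contexts whose source mentions the predicted tag, keeping relative order."""
    tagged = [c for c in contexts if predicted_tag in c["source"]]
    others = [c for c in contexts if predicted_tag not in c["source"]]
    return (tagged + others)[:top_k]


# Retrieve generously (high num_chunks), then rerank and trim to top_k.
contexts = [
    {"source": "docs/serve/index.html", "text": "..."},
    {"source": "docs/rllib/rllib-env.html", "text": "..."},
    {"source": "docs/data/overview.html", "text": "..."},
    {"source": "docs/rllib/key-concepts.html", "text": "..."},
]
reranked = rerank_by_tag(contexts, predicted_tag="rllib", top_k=3)
print([c["source"] for c in reranked])
```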
  12. Hybrid routing: Query → Embedding model → Vector DB → Retrieved contexts; a supervised classifier routes the query to OSS LLMs or ChatGPT → Response (diagram markers 1 to 5)
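The routing idea on this slide can be sketched as a function that asks a classifier how likely an OSS LLM is to answer well and falls back to the closed model otherwise. Everything below is a hypothetical stand-in: the classifier is a keyword stub rather than a trained supervised model, the two "LLMs" are plain functions, and the 0.5 threshold is arbitrary.

```python
def route_query(query, classifier, oss_llm, closed_llm, threshold=0.5):
    """Send the query to the OSS LLM when the classifier is confident enough in it."""
    confidence = classifier(query)  # assumed: probability that the OSS LLM suffices
    if confidence >= threshold:
        return "oss", oss_llm(query)
    return "closed", closed_llm(query)


# Hypothetical stubs, for illustration only.
def stub_classifier(query):
    return 0.2 if "step by step" in query.lower() else 0.9


def stub_oss_llm(query):
    return f"[oss] answer to: {query}"


def stub_closed_llm(query):
    return f"[closed] answer to: {query}"


route, response = route_query("What is Ray Serve?", stub_classifier, stub_oss_llm, stub_closed_llm)
print(route, response)
```

Routing most traffic to the OSS model and reserving the closed model for harder queries is one way to trade off cost against answer quality.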
  13. Data flywheel
    1. Use feedback to identify underperforming queries (👍/👎, visited source pages, top-k cosine scores).
    2. Inspect the retrieved resources, tokenization, etc. to decide if it's a shortcoming of retrieval, generation or data source.
    3. If something in the data can be improved (add more info, separate into sections, etc.) → fix it!
    4. Evaluate and add to test suite.
    5. Reindex and deploy.
  14. Foundational agents: Query → Embedding model → Vector DB → Retrieved contexts → LLM Agent #1, LLM Agent #2, ..., LLM Agent #N → Response
  15. Who scales AI with Ray? Anyscale - Scale AI, Productionize AI, for less. Founded in 2019. Raised $260M. Developers: 40K+. Contributors: 800+.