Slide 1

Building RAG-based LLM Applications for Production
Goku Mohandas · Philipp Moritz

Slide 2

Ray Assistant

Slide 3

Base LLMs
Diagram: Query → LLM → Response

Slide 4

RAG
Diagram: (1) Query → (2) Embedding model → (3) Vector DB → (4) Retrieved contexts → (5) LLM → Response

Slide 5

RAG
Diagram: Query → Embedding model → Vector DB → Retrieved contexts → LLM → Response

Slide 6

Creating our Vector DB
Diagram: Data sources → load sources → Docs → TextSplitter (chunk) → data source chunks → Embedding model (embed) → Vector DB (index), storing (text, source, embedding)
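
A rough sketch of this pipeline, assuming LangChain's text splitter and a sentence-transformers embedding model (the deck does not prescribe these libraries), with the indexed records kept in a plain list in place of a real vector DB:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

docs = [{"source": "https://docs.ray.io/en/master/rllib/rllib-env.html",
         "text": "RLlib works with several different types of environments ..."}]

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
embedding_model = SentenceTransformer("thenlper/gte-base")  # one of the models swept later

# Chunk each loaded document, embed every chunk, and index the
# (text, source, embedding) triples.
records = []
for doc in docs:
    for chunk in splitter.split_text(doc["text"]):
        records.append({"text": chunk,
                        "source": doc["source"],
                        "embedding": embedding_model.encode(chunk)})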

Slide 7

Sections

{'source': 'https://docs.ray.io/en/master/rllib/rllib-env.html#environments',
 'text': '\nEnvironments#\nRLlib works with several different types of environments, including Farama-Foundation Gymnasium, user-defined, multi-agent, and also batched environments.\nTip\nNot all environments work with all algorithms. Check out the algorithm overview for more information.\n'}

{'source': 'https://docs.ray.io/en/master/rllib/rllib-env.html#configuring-environments',
 'text': '\nConfiguring Environments#\nYou can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gym environment name.\nCustom env classes passed directly to the algorithm must take a single env_config parameter in their constructor:\nimport gymnasium as gym\n...'}

Slide 8

Retrieval
Diagram: Query → Embedding model (embed) → query embedding → Vector DB → top-k (text, source)
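
A sketch of retrieval over the records built above, using plain cosine similarity in NumPy in place of a vector DB query:

import numpy as np

def retrieve(query, records, embedding_model, top_k=5):
    # Embed the query with the same model that embedded the chunks.
    q = embedding_model.encode(query)
    q = q / np.linalg.norm(q)
    scores = [float(np.dot(q, r["embedding"] / np.linalg.norm(r["embedding"])))
              for r in records]
    # Return the top-k (text, source) pairs by cosine similarity.
    top = np.argsort(scores)[::-1][:top_k]
    return [(records[i]["text"], records[i]["source"]) for i in top]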

Slide 9

Generation
Diagram: Query + top-k contexts (text, source) → LLM → Response
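
A sketch of generation that stuffs the retrieved contexts into the prompt, reusing the system prompt shown later in the deck, with the OpenAI chat API as one possible backend:

from openai import OpenAI

client = OpenAI()

def generate(query, contexts, llm="gpt-3.5-turbo"):
    # Concatenate the retrieved (text, source) pairs into one context block.
    context_block = "\n\n".join(text for text, source in contexts)
    response = client.chat.completions.create(
        model=llm,
        messages=[
            {"role": "system",
             "content": "Answer the query using the context provided. Be succinct."},
            {"role": "user", "content": f"query: {query}\ncontext: {context_block}"},
        ],
    )
    return response.choices[0].message.content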

Slide 10

Evaluation (component-wise)
Diagram: the RAG pipeline (Query → Embedding model → Vector DB → Retrieved contexts → LLM → Response), with a retrieval_score measured against the source context and a quality_score judged by an LLM
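
One simple way to compute the retrieval_score, assuming each evaluation example records the source page its answer came from: the fraction of queries whose ground-truth source shows up among the top-k retrieved chunks.

def retrieval_score(eval_set, records, embedding_model, top_k=5):
    # eval_set: [{"question": ..., "source": ...}, ...]
    hits = 0
    for example in eval_set:
        retrieved = retrieve(example["question"], records, embedding_model, top_k)
        hits += example["source"] in {source for text, source in retrieved}
    return hits / len(eval_set)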

Slide 11

Evaluator

Slide 12

Evaluation (overall)
Diagram: the RAG pipeline (Query → Embedding model → Vector DB → Retrieved contexts → LLM → Response), with the response scored against a reference response for an overall quality_score
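
A sketch of the overall quality_score, asking the evaluator LLM to grade the generated response against the reference response. The evaluator prompt is the one shown on the Experiments slide; the 1-to-5 scale is an assumption (the deck shows scores near 3 to 4 but never states the range):

def quality_score(query, generated_answer, reference_answer, evaluator="gpt-4"):
    prompt = (
        f"Your job is to rate the quality of our generated answer {generated_answer} "
        f"given a query {query} and a reference answer {reference_answer}. "
        "Respond with a single number between 1 and 5."  # scale assumed
    )
    response = client.chat.completions.create(
        model=evaluator,
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())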

Slide 13

Experiments
Diagram: Query → Embedding model → Vector DB → Retrieved contexts → LLM → Response, with the components below swept:

llms = ["gpt-3.5-turbo", "gpt-4",
        "meta-llama/Llama-2-7b", "meta-llama/Llama-2-13b", "meta-llama/Llama-2-70b",
        "codellama/CodeLlama-34b-Instruct-hf", "mistralai/Mistral-7B-Instruct-v0.1"]
embedding_model_names = ["thenlper/gte-base", "thenlper/gte-large",
                         "BAAI/bge-large-en", "text-embedding-ada-002"]
chunk_sizes = [100, 300, 500, 700, 900]
num_chunks_list = [1, 3, 5, 7, 9]
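
A sketch of how such a grid might be swept, one axis at a time against a fixed baseline configuration (the sweep strategy is an assumption; the deck does not spell it out), using the generate_responses()/evaluate_responses() helpers named on the next slide, whose signatures are also assumed:

for chunk_size in chunk_sizes:
    responses = generate_responses(
        llm="gpt-3.5-turbo",
        embedding_model_name="thenlper/gte-base",
        chunk_size=chunk_size,
        num_chunks=5,
    )
    # Each configuration gets a retrieval_score and a quality_score.
    print(chunk_size, evaluate_responses(responses))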

Slide 14

Experiments
Diagram: generate_responses() produces generated responses from the source context with the prompt "Answer the query using the context provided. Be succinct."; evaluate_responses() has the Evaluator (GPT-4) score them against reference responses with the prompt "Your job is to rate the quality of our generated answer {generated answer} given a query {query} and a reference answer {reference answer}."

Slide 15

Cold start
LLM-synthesized (question, source, answer) examples to bootstrap the evaluation set:

[{'question': 'Does RLlib work with Farama-Foundation Gymnasium?',
  'source': 'https://docs.ray.io/en/latest/rllib/rllib-env.html#environments',
  'answer': 'Yes, RLlib works with several types of environments including Farama-Foundation Gymnasium.'},
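
With no real user queries yet, such examples can be produced by asking an LLM to write a question and answer for each indexed chunk; a minimal sketch (the prompt wording and response parsing are assumptions), reusing the OpenAI client from the generation sketch:

def synthesize_example(chunk_text, chunk_source, llm="gpt-4"):
    prompt = ("Write a question that the following documentation snippet answers, "
              "then on a new line answer it using only the snippet:\n" + chunk_text)
    response = client.chat.completions.create(
        model=llm, messages=[{"role": "user", "content": prompt}])
    # Naive parse: first line is the question, the rest is the answer.
    question, _, answer = response.choices[0].message.content.partition("\n")
    return {"question": question.strip(), "source": chunk_source,
            "answer": answer.strip()}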

Slide 16

Context

Slide 17

Chunk size

Slide 18

# of chunks

Slide 19

Embedding models

Slide 20

OSS vs. closed LLMs

Slide 21

Fine-tuning
Diagram: Query → Embedding model (with resized tokenizer) → Embedding; train for high similarity with the positive context and low similarity with negative contexts (Multiple Negatives Ranking Loss)
Fine-tuned: full parameter, frozen layers, linear adapter, etc.
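
A minimal sketch of this fine-tuning with sentence-transformers, assuming a train_set of (question, positive context) records (field names hypothetical). MultipleNegativesRankingLoss treats the other in-batch positives as negatives; a hard negative can be appended as a third text per example. The tokenizer resizing and frozen-layer variants from the slide are omitted here.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("thenlper/gte-base")
train_examples = [InputExample(texts=[ex["question"], ex["positive_context"]])
                  for ex in train_set]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)
# Fine-tune the embedding model end to end (full parameter variant).
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)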

Slide 22

Fine-tuning

Slide 23

Prompt engineering
↺ evaluate → compare → iterate
Example experiment: prompt-ignore-irrelevant-contexts · retrieval score: 0.6892655367231638 · quality score: 3.559322033898305
Ideas to try: x-of-thought, multimodal, self-refine, query decomp, additional context, etc.

Slide 24

Lexical search
Diagram: Query → Embedding model → Vector DB → top_k retrieved contexts (semantic search); in parallel, Query → BM25 index → top_i retrieved contexts (lexical search); both sets of contexts feed the LLM
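
A sketch of the lexical half using the rank_bm25 package (one BM25 implementation; whitespace tokenization is a simplification), combined with the semantic retrieve() sketched earlier:

from rank_bm25 import BM25Okapi

# Build a BM25 index over the same chunks that back the vector DB.
corpus = [r["text"] for r in records]
bm25 = BM25Okapi([text.split() for text in corpus])

def hybrid_retrieve(query, top_k=5, top_i=3):
    # Semantic hits from the vector index plus lexical hits from BM25,
    # de-duplicated while preserving order.
    semantic = [text for text, source in retrieve(query, records, embedding_model, top_k)]
    lexical = bm25.get_top_n(query.split(), corpus, n=top_i)
    return list(dict.fromkeys(semantic + lexical))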

Slide 25

Reranking
Diagram: Query → Embedding model → Vector DB → retrieved contexts (high num_chunks) → Reranker → predicted tag → rerank contexts with tag → top_k → LLM
The reranker is a supervised model trained using a (query, tag) dataset (created via self-supervision).
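
A sketch of the tag-based rerank, assuming a trained scikit-learn-style classifier that maps queries to documentation-section tags, and a tag field on each retrieved context (both hypothetical here):

def rerank(query, contexts, tag_classifier, top_k=5):
    # contexts: [{"text": ..., "source": ..., "tag": ...}, ...],
    # retrieved with a deliberately high num_chunks.
    predicted_tag = tag_classifier.predict([query])[0]
    # Promote contexts whose tag matches the prediction, keep order otherwise.
    matches = [c for c in contexts if c["tag"] == predicted_tag]
    others = [c for c in contexts if c["tag"] != predicted_tag]
    return (matches + others)[:top_k]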

Slide 26

Reranking

Slide 27

Cost analysis

Slide 28

Hybrid routing
Diagram: Query → Embedding model → Vector DB → Retrieved contexts, with a supervised classifier routing each query to either OSS LLMs or ChatGPT → Response (steps numbered 1 through 5 on the slide)
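
A sketch of the router, assuming a supervised classifier trained to predict when a cheaper OSS model will answer as well as the closed model (the training signal could come from the evaluation scores gathered earlier). It reuses the retrieve() and generate() sketches; the OSS model name assumes `client` points at an OpenAI-compatible endpoint that serves it, e.g. Anyscale Endpoints:

def route(query, router_clf):
    # Send "easy" queries to the OSS model, the rest to the closed model.
    label = router_clf.predict([query])[0]
    llm = "mistralai/Mistral-7B-Instruct-v0.1" if label == "oss" else "gpt-4"
    contexts = retrieve(query, records, embedding_model, top_k=5)
    return generate(query, contexts, llm=llm)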

Slide 29

Data flywheel
1. Use feedback to identify underperforming queries (👍/👎, visited source pages, top-k cosine scores).
2. Inspect the retrieved resources, tokenization, etc. to decide whether the shortcoming lies in retrieval, generation, or the data sources.
3. If something in the data can be improved (add more info, separate into sections, etc.) → fix it!
4. Evaluate and add to the test suite.
5. Reindex and deploy.

Slide 30

Foundational agents
Diagram: Query → Embedding model → Vector DB → Retrieved contexts → LLM Agent #1, LLM Agent #2, ..., LLM Agent #N → Response

Slide 31

Who scales AI with Ray?
Anyscale: Scale AI, productionize AI, for less
Founded in 2019 · Raised $260M · 40K+ developers · 800+ contributors

Slide 32

Anyscale Endpoints
Fine-tuning, dedicated/private endpoints, optimizations, observability, API compatible, no rate limits, etc.
https://www.anyscale.com/endpoints

Slide 33

Blog posts
https://www.anyscale.com/blog

Slide 34

Development → Production

Slide 35

Development → Production

Slide 36

Thank you!