Developing and serving RAG-Based LLM applications in production

Building RAG-based LLM Applications for Production Goku Mohandas Philipp Moritz

Ray Assistant

Query LLM Response Base LLMs

Vector DB Query LLM Response Retrieved contexts Embedding model 1
2 3 4 5 RAG

Vector DB Query LLM Response Retrieved contexts Embedding model RAG

Docs TextSplitter Vector DB Embedding model Chunk Embed Index Load
sources Data sources Creating our Vector DB Vector DB (text, source, embedding) data source chunks

{'source': 'https://docs.ray.io/en/master/rllib/rllib-env.html#environ ments', 'text': '\nEnvironments#\nRLlib works with several different types
of environments, including Farama-Foundation Gymnasium, user-defined, multi-agent, and also batched environments.\nTip\nNot all environments work with all algorithms. Check out the algorithm overview for more information.\n'} {'source': 'https://docs.ray.io/en/master/rllib/rllib-env.html#configu ring-environments', 'text': '\nConfiguring Environments#\nYou can pass either a string name or a Python class to specify an environment. By default, strings will be interpreted as a gym environment name.\nCustom env classes passed directly to the algorithm must take a single env_config parameter in their constructor:\nimport gymnasium as gym\...} Sections

Vector DB Query embedding (text, source) top-k Query Embedding model
Embed { Retrieval

(text, source) top-k contexts Query LLM Response Generation

Vector DB Query LLM Response Retrieved contexts Embedding model Source
context retrieval_score quality_score (LLM) Evaluation (component-wise)

Evaluator

Vector DB Query LLM Response Retrieved contexts Embedding model quality_score
(overall) Reference response Evaluation (overall)

Vector DB Query LLM Response Retrieved contexts Embedding model llms
= ["gpt-3.5-turbo", "gpt-4", "meta-llama/Llama-2-7b", "meta-llama/Llama-2-13b", "meta-llama/Llama-2-70b", "codellama/CodeLlama-34b-Instruct-hf", "mistralai/Mistral-7B-Instruct-v0.1"] embedding_model_names = ["thenlper/gte-base", "thenlper/gte-large", "BAAI/bge-large-en", "text-embedding-ada-002"] chunk_sizes = [100, 300, 500, 700, 900] num_chunks_list = [1, 3, 5, 7, 9] Experiments

generate_responses() evaluate_responses() Evaluator (GPT-4) Reference responses Source context Generated responses
"Your job is to rate the quality of our generated answer {generated answer} given a query {query} and a reference answer {reference answer}." "Answer the query using the context provided. Be succinct." Experiments

LLM [{'question': 'Does RLlib work with Farma-Foundation Gymnasium?', 'source': 'https://docs.ray.io/en/latest/rlli
b/rllib-env.html#environments', 'answer': 'Yes, RLlib works with several types of environments including Farma-Foundation Gymnasium.'}, Cold start

Context

Chunk size

# of chunks

Embedding models

OSS vs. closed LLMs

Fine-tuning Query Embedding model with resized tokenizer Fine-tuned: full parameter,
frozen layers, linear adapter, etc. Embedding Negative contexts Positive context high similarity low similarity Multiple Negatives Ranking Loss

Fine-tuning

Prompt engineering prompt-ignore-irrelevant-contexts retrieval score: 0.6892655367231638 quality score: 3.559322033898305 ↺
evaluate compare iterate Prompt engineering - x-of-thought - multimodal - self-refine - query decomp - additional context - etc.

Lexical search Query Embedding model BM25 index semantic search lexical
search Vector DB Retrieved contexts Retrieved contexts top_k top_i LLM

Reranking Query Embedding model Vector DB Retrieved contexts high num_chunks
Reranker Predicted tag Reranked contexts rerank contexts with tag top_k LLM supervised model trained using (query, tag) dataset (created via self-supervision)

Reranking

Cost analysis

Vector DB Query OSS LLMs Response Retrieved contexts Embedding model
Supervised classifier ChatGPT 1 2 3 4 5 Hybrid routing

Data flywheel 1. Use feedback to identify underperforming queries (👍/👎,
visited source pages, top-k cosine scores). 2. Inspect the retrieved resources, tokenization, etc. to decide if it's a shortcoming of retrieval, generation or data source. 3. If something in the data can be improved (add more info, separate into sections, etc.) → fix it! 4. Evaluate and add to test suite. 5. Reindex and deploy.

Vector DB Query LLM Agent #1 Response Retrieved contexts Embedding
model LLM Agent #2 LLM Agent #N ... Foundational agents

Who scales AI with Ray? Anyscale - Scale AI, Productionize
AI, for less Founded in 2019 Raised $260M Developers 40K+ Contributors 800+

Anyscale Endpoints Fine-tuning, dedicated/private endpoints, optimizations, observability, API compatible, no
rate limits, etc. https://www.anyscale.com/endpoints

Blog posts https://www.anyscale.com/blog

Development → Production

Thank you!

Developing and serving RAG-Based LLM applicatio...

Developing and serving RAG-Based LLM applications in production

More Decks by Anyscale

Other Decks in Technology

Featured

Transcript