

From Prototype to Production - A Practical Guide to LLMOps Platforms for RAG Applications on Azure

GenAI is hot. Companies are showing increasing interest and are recognising its business value. Specifically, by supercharging LLMs with a technique called RAG, LLM weaknesses such as hallucination and out-of-date knowledge can be overcome. But how do you bring such applications into production? What extra complexities should be taken into account when productionising LLM applications? And what makes LLMOps different from MLOps?

This talk presents a practical approach to building end-to-end LLMOps platforms specifically for Retrieval Augmented Generation (RAG) applications on Azure. Using Infrastructure-as-Code (IaC) principles, it demonstrates how to use Terraform to create a secure, version-controlled, and reproducible platform.

A key focus will be addressing the distinct requirements within the LLMOps lifecycle, with a strong emphasis on LLM observability. Recognising the complexities of deploying LLM applications to production, the talk will highlight the critical role of robust observability mechanisms, showcasing how open-source tooling such as Langfuse can be leveraged to tackle this. Moreover, the talk will show how to self-host Langfuse in your own environment, and how to incorporate the Langfuse SDK to add observability to your existing LLM applications.

The talk aims to equip ML engineers, data scientists, and platform engineers with the knowledge and tools to address the challenges of building and managing RAG applications in production environments, using a solid set of principles that can be effectively implemented in their own projects.


Jeroen Overschie

May 27, 2024


Transcript

  1. A Practical Guide to LLMOps Platforms for RAG Applications on

    Azure From Prototype to Production: AI Community Day 2024 Jeroen Overschie Sander van Donkelaar
  2. About us Jeroen Overschie Machine Learning Engineer @ Xebia Data

    Sander van Donkelaar Machine Learning Engineer @ Xebia Data
  3. A Practical Guide to LLMOps Platforms for RAG Applications on

    Azure From Prototype to Production: Let’s go back to our title slide for a second …
  4. A Practical Guide to LLMOps Platforms for RAG Applications on

    Azure From Prototype to Production: Most importantly LLMOps and RAG. Why are those important?
  5. User acquisition for popular services in days medium.com Well, first

    things first: LLMs are awesome. So, why RAG? 👑 But still, this does not come without challenges.
  6. LLM Large Language Model ? ⤫ Outdated knowledge ⤫ No

    internal company data ⤫ Hallucinations Model interactions without RAG. Knowledge cut-offs as of March 2024: Gemini 1.0 Pro: early 2023 [1]; GPT-3.5 Turbo: September 2021 [2]; GPT-4: September 2021 [3].
  7. LLM Large Language Model ? ✓ Up-to-date knowledge

    ✓ Internal company data ✓ Factual answers Documents ? Model interactions with RAG How RAG can supercharge your GenAI usage.
  8. LLM Large Language Model ? Documents ? Document retrieval 1

    Answer generation 2 The RAG components: Retrieval and Generation. RAG sounds great! What are the main components for RAG?
  9. Document retrieval (1). Ingestion: documents (.txt, .pdf, .html) are converted to text,

    chunked, and turned into metadata & embeddings [x1, …, x768]; embeddings go into the Vector Database, metadata (• Title • Description • Publication date) into the Metadata Database. Retrieval: the question is converted to a search-query embedding, documents are retrieved via keyword search + vector search, and the matching document texts, metadata and document references are returned.
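For illustration, a minimal sketch of this ingestion flow in Python, assuming the openai and azure-search-documents packages; the index name "documents", field names "content"/"contentVector", the embedding deployment name, and the naive chunking are placeholders, not part of the deck:

    import os
    from openai import AzureOpenAI
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    # Azure OpenAI client used to embed text chunks (deployment name is an assumption).
    aoai = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )
    # Azure AI Search client for the (hypothetical) "documents" index.
    search = SearchClient(
        endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
        index_name="documents",
        credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
    )

    def chunk(text: str, size: int = 1000) -> list[str]:
        # Naive fixed-size chunking; swap in any custom chunking strategy.
        return [text[i : i + size] for i in range(0, len(text), size)]

    def ingest(doc_id: str, text: str, title: str) -> None:
        chunks = chunk(text)
        # One embedding per chunk, in the same order as the input.
        embeddings = aoai.embeddings.create(model="text-embedding-ada-002", input=chunks)
        search.upload_documents([
            {
                "id": f"{doc_id}-{i}",
                "title": title,
                "content": chunk_text,
                "contentVector": emb.embedding,  # embedding vector for this chunk
            }
            for i, (chunk_text, emb) in enumerate(zip(chunks, embeddings.data))
        ])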
  10. Blob Storage [x1, …, x768] 🔍 Convert search query to

    embedding ? .txt, .pdf, .html Document retrieval 1 Search query Bring your own embedding Question. Azure AI Search performs keyword + vector search out of the box (Reciprocal Rank Fusion). Embedding model: text-embedding-ada-002. Optional steps: custom ingestion, custom chunking, custom embeddings.
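A matching sketch of the retrieval step, reusing the aoai and search clients from the ingestion sketch above and assuming the VectorizedQuery helper from azure-search-documents 11.4+:

    from azure.search.documents.models import VectorizedQuery

    def retrieve_documents(question: str, top: int = 5) -> list[dict]:
        # Convert the search query to an embedding with the same model used at ingestion time.
        query_vector = aoai.embeddings.create(
            model="text-embedding-ada-002", input=[question]
        ).data[0].embedding

        # Hybrid search: keyword search plus vector search, fused via Reciprocal Rank Fusion.
        results = search.search(
            search_text=question,
            vector_queries=[
                VectorizedQuery(vector=query_vector, k_nearest_neighbors=top, fields="contentVector")
            ],
            top=top,
        )
        return [{"title": r["title"], "content": r["content"]} for r in results]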
  11. f""" A question of a user will follow. Answer the

    question to the best of your ability, using the information provided in the documents. Only answer with truthful and citable information. The answer should be at most five sentences. Question: '{query}' Documents: { format_doc(doc) for each doc in docs } """ Answer generation 2 LLM Large Language Model ? Container App Prompt Answer Question GPT-3.5 or GPT-4
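A sketch of the generation step along these lines, reusing the aoai client and the retrieve_documents helper from the earlier sketches; the chat deployment name "gpt-35-turbo" and the format_doc helper are assumptions:

    def format_doc(doc: dict) -> str:
        # Render a retrieved document for inclusion in the prompt.
        return f"Title: {doc['title']}\n{doc['content']}"

    def generate_answer(query: str) -> str:
        docs = retrieve_documents(query)
        docs_text = "\n\n".join(format_doc(doc) for doc in docs)
        prompt = f"""A question of a user will follow. Answer the question to the best of your ability,
    using the information provided in the documents. Only answer with truthful and citable information.
    The answer should be at most five sentences.
    Question: '{query}'
    Documents: {docs_text}"""

        completion = aoai.chat.completions.create(
            model="gpt-35-turbo",  # or a GPT-4 deployment
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content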
  12. LLM Large Language Model ? Documents ? Document retrieval 1

    Answer generation 2 Great! We understand the main components of Retrieval Augmented Generation on Azure. How to productionize?
  13. We always need to follow these principles when going to

    production: IaC (replace all manual effort with Terraform), CI/CD (all code is provisioned using CI/CD), and Observability (every component is monitored). Benefits: faster go-to-market times, higher auditability, lower operational costs.
  14. A bit about Terraform: a declarative way to provision your

    cloud infrastructure ✓ Automated ✓ Reproducible You define your infrastructure as code (IaC); Terraform will handle the deployment for you.
  15. LLM applications have unique challenges when productionizing (Why → Need). Complex traces & flows:

    LLM applications rely on complex, chained calls → capture full context of LLM execution. Hallucinations: LLMs are unpredictable → Observability & Guardrails. Hard to evaluate: heterogeneous outputs make it hard to evaluate → create strong feedback mechanisms. Costs: costs can skyrocket easily → link individual outputs to costs.
  16. Component #1 We need to have a front-end Frontend LLM

    Orchestrator Monitoring Stack LLM Endpoints Vector Database End users should be able to communicate with our LLM application. If we containerize it, deployment becomes easy! ✓ Easily deploy as a (secured) web service using Azure Container Apps! We used React, but use anything you want.
  17. Component #2 We need to have a LLM orchestrator (backend)

    Frontend LLM Orchestrator Monitoring Stack LLM Endpoints Vector Database The LLM orchestrator acts as a "middleman": it orchestrates subsequent calls to LLM endpoints and vector databases, and incorporates logic to filter & scan model outputs (Guardrails). ✓ Easily deploy on Azure Container Apps.
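As a rough sketch, such an orchestrator could be a small containerised FastAPI backend (FastAPI is our choice here, not prescribed by the deck); generate_answer is the helper sketched earlier and apply_guardrails is a hypothetical placeholder:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        question: str

    class ChatResponse(BaseModel):
        answer: str

    def apply_guardrails(text: str) -> str:
        # Hypothetical placeholder: filter & scan model outputs before returning them.
        return text

    @app.post("/chat", response_model=ChatResponse)
    def chat(request: ChatRequest) -> ChatResponse:
        # The orchestrator acts as the "middleman": retrieval and generation
        # both happen inside generate_answer (sketched earlier).
        answer = generate_answer(request.question)
        return ChatResponse(answer=apply_guardrails(answer))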
  18. Component #3 We need to enable our LLM endpoints Frontend LLM

    Orchestrator Monitoring Stack LLM Endpoints Vector Database Endpoints to generate completions and embeddings should be enabled.
  19. Component #4 We need Azure AI Search for our vector

    database Frontend LLM Orchestrator Monitoring Stack LLM Endpoints Vector Database We use Azure AI Search as our vector database. This allows us to connect (and index) different data sources. But it is much more!
  20. Component #5 We need a way of making our LLM

    applications observable Frontend LLM Orchestrator Monitoring Stack LLM Endpoints Vector Database We need a strong monitoring component for visibility into every step in our LLM application.
  21. ? ⤫ Are the documents added correctly to the prompt?

    ⤫ Is my LLM actually basing its answer on the documents? How to monitor your AI app without visibility? My AI application
  22. This is still a simple example; LLM interactions can become

    very complex! ? Rephrase Query Route Query to index [x1, …, x768] Convert to embedding Retrieve Documents Generate response Evaluate response …
  23. On LLM monitoring Tracing You need to capture the

    full context of LLM execution. A trace outlines the different steps and results in an execution chain. LLM Large Language Model ? [x1, …, x768] 1. Convert to embedding 2. Retrieve Documents 3. Add documents to context 4. Return Response
  24. On LLM monitoring Tracing A trace consists of spans Spans

    represent units of work. They consist of multiple events that together generate a final response; this response is sent back to the user. Trace #1: New Chat Span #1: Convert Query To Embedding Span #2: Retrieve Documents from Azure Search Span #3: Add documents to prompt Span #4: Generate final response
  25. How to structure traces? We don’t have to reinvent the

    wheel! Agent/LLM observability can be achieved by using existing logging patterns! We use OpenTelemetry (OTEL) to instrument our code. OTEL is an open standard for distributed tracing. We "standardize" how we log any LLM interactions, so we can use either LLM-specific tooling or send the logs to existing observability software (Datadog, Prometheus).
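A minimal sketch of such OTEL instrumentation in Python, assuming the opentelemetry-sdk package; retrieve_documents and generate_llm_response are the (no-argument) helpers from the callback example on the next slide, and the console exporter stands in for a real backend (Datadog, OTLP collector, etc.):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Configure a tracer provider; swap ConsoleSpanExporter for an OTLP exporter in production.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("rag-orchestrator")

    def answer_question(question: str) -> str:
        # One trace per chat request, with one span per unit of work.
        with tracer.start_as_current_span("new_chat") as chat_span:
            chat_span.set_attribute("question", question)
            with tracer.start_as_current_span("retrieve_documents"):
                docs = retrieve_documents()
            with tracer.start_as_current_span("generate_response") as gen_span:
                answer = generate_llm_response()
                gen_span.set_attribute("answer.length", len(answer))
            return answer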
  26. How to decouple application and monitoring code? Use Callbacks! Callbacks

    are specific collections of functions that are executed at a given time. We can easily change tooling if necessary. We can define multiple callbacks and log data to any tool that is compatible with the OTEL format.

    callbacks = Callbacks([LangfuseCallback(), AzureCallback()])

    def handle_request(request: Request) -> Response:
        callbacks.on_new_request(request.user_id, request.trace_id, request.question)
        # 1. Retrieve documents.
        callbacks.on_retrieval_start(request.user_id, request.trace_id, request.question)
        docs = retrieve_documents()
        callbacks.on_retrieval_end(request.user_id, request.trace_id, docs)
        # 2. Generate response.
        response = generate_llm_response()
        callbacks.on_llm_output(request.user_id, request.trace_id, request.question, response)
        return response

    1. Define which callback you want to use. 2. Call specific methods at intervals.
  27. Are my end-users actually happy with the results? Always incorporate

    feedback mechanisms! The focus shifts more towards using feedback to evaluate LLM performance. Feedback mechanisms should be treated as a critical component of the platform.
  28. On LLM monitoring Langfuse Langfuse is an open-source LLM observability

    platform. Langfuse is easy to use and can be self-hosted. It has RBAC and integrates with Microsoft Entra ID. Langfuse uses a format very similar to OTEL, so it is very compatible. ✓ Easily monitor your traces, costs and feedback!
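A minimal sketch of incorporating the Langfuse SDK, assuming the v2 Python SDK decorator API and a self-hosted instance; the host URL, keys, helper functions and trace id are placeholders. End-user feedback is stored as a score on the trace:

    import os
    from langfuse import Langfuse
    from langfuse.decorators import observe, langfuse_context

    # Point the SDK at the self-hosted instance (placeholder URL); keys are read
    # from the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY environment variables.
    os.environ.setdefault("LANGFUSE_HOST", "https://langfuse.example.com")
    langfuse = Langfuse()

    @observe()  # creates a trace; function arguments and return value are captured automatically
    def handle_question(question: str) -> dict:
        docs = retrieve_documents()        # hypothetical helpers; decorate them with
        answer = generate_llm_response()   # @observe() as well to get nested spans
        # Return the trace id so feedback can later be linked back to this trace.
        return {"answer": answer, "trace_id": langfuse_context.get_current_trace_id()}

    # Later: store end-user feedback as a score on that trace.
    langfuse.score(trace_id="<trace-id-of-the-request>", name="user-feedback", value=1)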
  29. Easily deploy your own server Langfuse Deploying Langfuse is a

    breeze. You can enable authentication based on Active Directory by creating an Azure App Registration and passing client credentials. ✓ Deploy as a container app on Azure Container Apps to host it as a web service.
  30. Putting it together Frontend LLM Orchestrator Monitoring Stack LLM Endpoints

    Vector Database Backend Database Data Layer Xebia Base RAG Xebia Base RAG
  31. Wrapping up. LLMs are awesome and RAG is very powerful, but

    there are challenges in productionizing RAG, which we can solve with a strong focus on observability and monitoring. Monitoring Stack: ✓ Monitor costs ✓ Tracing ✓ Store feedback. LLM Orchestrator: ✓ Guardrails ✓ Grounding. The LLM challenges (Why → Need): Complex traces & flows: LLM applications rely on complex, chained calls → capture full context of LLM execution. Costs: costs can skyrocket easily → link individual outputs to costs. Hard to evaluate: heterogeneous outputs make it hard to evaluate → create strong feedback mechanisms. Hallucinations: LLMs are unpredictable → observe if the LLM makes mistakes.
  32. xebia.com Jeroen Overschie Sander van Donkelaar AI Community Day 2024

    A Practical Guide to LLMOps Platforms for RAG Applications on Azure From Prototype to Production: