

From Prototype to Production - A Practical Guide to LLMOps Platforms for RAG Applications on Azure

GenAI is hot. Companies are showing increasing interest and are recognising its business value. Specifically, by supercharging LLMs with a technique called RAG, LLM weaknesses such as hallucination and out-of-date knowledge can be overcome. But how do you bring such applications into production? What extra complexities should be taken into account when productionising LLM applications? And what makes LLMOps different from MLOps?

This talk presents a practical approach to building end-to-end LLMOps platforms specifically for Retrieval Augmented Generation (RAG) applications on Azure. Using Infrastructure-as-Code (IaC) principles, it demonstrates how to use Terraform to create a secure, version-controlled, and reproducible platform.

A key focus will be addressing the distinct requirements within the LLMOps lifecycle, with a strong emphasis on LLM observability. Recognising the complexities of deploying LLM applications to production, the talk will highlight the critical role of robust observability mechanisms, showcasing how open-source tooling such as Langfuse can be leveraged to tackle this. Moreover, the talk will show how to self-host Langfuse in your own environment, and how to incorporate the Langfuse SDK to add observability to your existing LLM applications.

The talk aims to equip ML engineers, data scientists, and platform engineers with the knowledge and tools to address the challenges of building and managing RAG applications in production environments, using a solid set of principles that can be effectively implemented in their own projects.


Jeroen Overschie

May 27, 2024


Transcript

  1. A Practical Guide to LLMOps Platforms for RAG Applications on

    Azure From Prototype to Production: AI Community Day 2024 Jeroen Overschie Sander van Donkelaar
  2. About us Jeroen Overschie Machine Learning Engineer @ Xebia Data

    Sander van Donkelaar Machine Learning Engineer @ Xebia Data
  3. A Practical Guide to LLMOps Platforms for RAG Applications on

    Azure From Prototype to Production: Let’s go back to our title slide for a second …
  4. A Practical Guide to LLMOps Platforms for RAG Applications on

    Azure From Prototype to Production: Most importantly LLMOps and RAG. Why are those important?
  5. User acquisition for popular services in days medium.com Well, first

    things first: LLMs are awesome. So, why RAG? 👑 But still, this does not come without challenges.
  6. LLM Large Language Model ? ⤫ Outdated knowledge ⤫ No

    internal company data ⤫ Hallucinations Model interactions without RAG. Knowledge cut-offs as of March 2024: Gemini 1.0 Pro: early 2023 [1]; GPT-3.5 Turbo: September 2021 [2]; GPT-4: September 2021 [3].
  7. LLM Large Language Model ? ✓ Up-to-date knowledge

    ✓ Internal company data ✓ Factual answers Documents ? Model interactions with RAG How RAG can supercharge your GenAI usage.
  8. LLM Large Language Model ? Documents ? Document retrieval 1

    Answer generation 2 The RAG components: Retrieval and Generation. RAG sounds great! What are the main components for RAG?
  9. Document retrieval (1). Ingestion: documents (.txt, .pdf, .html) are converted to text,

    chunked, and turned into metadata & embeddings [x1, …, x768]; embeddings go into the Vector Database, metadata (• Title • Description • Publication date) into the Metadata Database. Retrieval: the question is converted to a search-query embedding, documents are retrieved via keyword search + vector search, and the matching document texts, metadata and document references are returned.
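For illustration, a minimal sketch of this ingestion flow in Python, assuming the openai and azure-search-documents packages; the index name "documents", field names "content"/"contentVector", the embedding deployment name, and the naive chunking are placeholders, not part of the deck:

    import os
    from openai import AzureOpenAI
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    # Azure OpenAI client used to embed text chunks (deployment name is an assumption).
    aoai = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )
    # Azure AI Search client for the (hypothetical) "documents" index.
    search = SearchClient(
        endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
        index_name="documents",
        credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
    )

    def chunk(text: str, size: int = 1000) -> list[str]:
        # Naive fixed-size chunking; swap in any custom chunking strategy.
        return [text[i : i + size] for i in range(0, len(text), size)]

    def ingest(doc_id: str, text: str, title: str) -> None:
        chunks = chunk(text)
        # One embedding per chunk, in the same order as the input.
        embeddings = aoai.embeddings.create(model="text-embedding-ada-002", input=chunks)
        search.upload_documents([
            {
                "id": f"{doc_id}-{i}",
                "title": title,
                "content": chunk_text,
                "contentVector": emb.embedding,  # embedding vector for this chunk
            }
            for i, (chunk_text, emb) in enumerate(zip(chunks, embeddings.data))
        ])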
  10. Blob Storage [x1, …, x768] 🔍 Convert search query to

    embedding ? .txt, .pdf, .html Document retrieval 1 Search query Bring your own embedding Question. Azure AI Search performs keyword + vector search out of the box (Reciprocal Rank Fusion). Embedding model: text-embedding-ada-002. Optional steps: custom ingestion, custom chunking, custom embeddings.
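A matching sketch of the retrieval step, reusing the aoai and search clients from the ingestion sketch above and assuming the VectorizedQuery helper from azure-search-documents 11.4+:

    from azure.search.documents.models import VectorizedQuery

    def retrieve_documents(question: str, top: int = 5) -> list[dict]:
        # Convert the search query to an embedding with the same model used at ingestion time.
        query_vector = aoai.embeddings.create(
            model="text-embedding-ada-002", input=[question]
        ).data[0].embedding

        # Hybrid search: keyword search plus vector search, fused via Reciprocal Rank Fusion.
        results = search.search(
            search_text=question,
            vector_queries=[
                VectorizedQuery(vector=query_vector, k_nearest_neighbors=top, fields="contentVector")
            ],
            top=top,
        )
        return [{"title": r["title"], "content": r["content"]} for r in results]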
  11. f""" A question of a user will follow. Answer the

    question to the best of your ability, using the information provided in the documents. Only answer with truthful and citable information. The answer should be at most five sentences. Question: '{query}' Documents: { format_doc(doc) for each doc in docs } """ Answer generation 2 LLM Large Language Model ? Container App Prompt Answer Question GPT-3.5 or GPT-4
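A sketch of the generation step along these lines, reusing the aoai client and the retrieve_documents helper from the earlier sketches; the chat deployment name "gpt-35-turbo" and the format_doc helper are assumptions:

    def format_doc(doc: dict) -> str:
        # Render a retrieved document for inclusion in the prompt.
        return f"Title: {doc['title']}\n{doc['content']}"

    def generate_answer(query: str) -> str:
        docs = retrieve_documents(query)
        docs_text = "\n\n".join(format_doc(doc) for doc in docs)
        prompt = f"""A question of a user will follow. Answer the question to the best of your ability,
    using the information provided in the documents. Only answer with truthful and citable information.
    The answer should be at most five sentences.
    Question: '{query}'
    Documents: {docs_text}"""

        completion = aoai.chat.completions.create(
            model="gpt-35-turbo",  # or a GPT-4 deployment
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content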
  12. LLM Large Language Model ? Documents ? Document retrieval 1

    Answer generation 2 Great! We understand the main components of Retrieval Augmented Generation on Azure. How to productionize?
  13. We always need to follow these principles when going to

    production: IaC (replace all manual effort with Terraform), CI/CD (all code is provisioned using CI/CD), and Observability (every component is monitored). Benefits: faster go-to-market times, higher auditability, lower operational costs.
  14. A bit about Terraform: a declarative way to provision your

    cloud infrastructure ✓ Automated ✓ Reproducible You define your infrastructure as code (IaC); Terraform will handle the deployment for you.
  15. LLM applications have unique challenges when productionizing (Why → Need). Complex traces & flows:

    LLM applications rely on complex, chained calls → capture full context of LLM execution. Hallucinations: LLMs are unpredictable → Observability & Guardrails. Hard to evaluate: heterogeneous outputs make it hard to evaluate → create strong feedback mechanisms. Costs: costs can skyrocket easily → link individual outputs to costs.
  16. Component #1 We need to have a front-end Frontend LLM

    Orchestrator Monitoring Stack LLM Endpoints Vector Database End users should be able to communicate with our LLM application. If we containerize it, deployment becomes easy! ✓ Easily deploy as a (secured) web service using Azure Container Apps! We used React, but use anything you want.
  17. Component #2 We need to have a LLM orchestrator (backend)

    Frontend LLM Orchestrator Monitoring Stack LLM Endpoints Vector Database The LLM orchestrator acts as a "middleman": it orchestrates subsequent calls to LLM endpoints and vector databases, and incorporates logic to filter & scan model outputs (Guardrails). ✓ Easily deploy on Azure Container Apps.
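As a rough sketch, such an orchestrator could be a small containerised FastAPI backend (FastAPI is our choice here, not prescribed by the deck); generate_answer is the helper sketched earlier and apply_guardrails is a hypothetical placeholder:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatRequest(BaseModel):
        question: str

    class ChatResponse(BaseModel):
        answer: str

    def apply_guardrails(text: str) -> str:
        # Hypothetical placeholder: filter & scan model outputs before returning them.
        return text

    @app.post("/chat", response_model=ChatResponse)
    def chat(request: ChatRequest) -> ChatResponse:
        # The orchestrator acts as the "middleman": retrieval and generation
        # both happen inside generate_answer (sketched earlier).
        answer = generate_answer(request.question)
        return ChatResponse(answer=apply_guardrails(answer))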
  18. Component #3 We need to enable our LLM endpoints Frontend LLM

    Orchestrator Monitoring Stack LLM Endpoints Vector Database Endpoints to generate completions and embeddings should be enabled.
  19. Component #4 We need Azure AI Search for our vector

    database Frontend LLM Orchestrator Monitoring Stack LLM Endpoints Vector Database We use Azure AI Search as our vector database. This allows us to connect (and index) different data sources. But it is much more!
  20. Component #5 We need a way of making our LLM

    applications observable Frontend LLM Orchestrator Monitoring Stack LLM Endpoints Vector Database We need a strong monitoring component for visibility into every step in our LLM application.
  21. ? ⤫ Are the documents added correctly to the prompt?

    ⤫ Is my LLM actually basing its answer on the documents? How to monitor your AI app without visibility? My AI application
  22. This is still a simple example; LLM interactions can become

    very complex! ? Rephrase Query Route Query to index [x1, …, x768] Convert to embedding Retrieve Documents Generate response Evaluate response …
  23. On LLM monitoring Tracing You need to capture the

    full context of LLM execution. A trace outlines the different steps and results in an execution chain. LLM Large Language Model ? [x1, …, x768] 1. Convert to embedding 2. Retrieve Documents 3. Add documents to context 4. Return Response
  24. On LLM monitoring Tracing A trace consists of spans Spans

    represent units of work. They consist of multiple events that together generate a final response; this response is sent back to the user. Trace #1: New Chat Span #1: Convert Query To Embedding Span #2: Retrieve Documents from Azure Search Span #3: Add documents to prompt Span #4: Generate final response
  25. How to structure traces? We don’t have to reinvent the

    wheel! Agent/LLM observability can be achieved by using existing logging patterns! We use OpenTelemetry (OTEL) to instrument our code. OTEL is an open standard for distributed tracing. We "standardize" how we log any LLM interactions, so we can use either LLM-specific tooling or send the logs to existing observability software (Datadog, Prometheus).
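A minimal sketch of such OTEL instrumentation in Python, assuming the opentelemetry-sdk package; retrieve_documents and generate_llm_response are the (no-argument) helpers from the callback example on the next slide, and the console exporter stands in for a real backend (Datadog, OTLP collector, etc.):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Configure a tracer provider; swap ConsoleSpanExporter for an OTLP exporter in production.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("rag-orchestrator")

    def answer_question(question: str) -> str:
        # One trace per chat request, with one span per unit of work.
        with tracer.start_as_current_span("new_chat") as chat_span:
            chat_span.set_attribute("question", question)
            with tracer.start_as_current_span("retrieve_documents"):
                docs = retrieve_documents()
            with tracer.start_as_current_span("generate_response") as gen_span:
                answer = generate_llm_response()
                gen_span.set_attribute("answer.length", len(answer))
            return answer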
  26. How to decouple application and monitoring code? Use Callbacks! Callbacks

    are specific collections of functions that are executed at a given time. We can easily change tooling if necessary. We can define multiple callbacks and log data to any tool that is compatible with the OTEL format.

    callbacks = Callbacks([LangfuseCallback(), AzureCallback()])

    def handle_request(request: Request) -> Response:
        callbacks.on_new_request(request.user_id, request.trace_id, request.question)
        # 1. Retrieve documents.
        callbacks.on_retrieval_start(request.user_id, request.trace_id, request.question)
        docs = retrieve_documents()
        callbacks.on_retrieval_end(request.user_id, request.trace_id, docs)
        # 2. Generate response.
        response = generate_llm_response()
        callbacks.on_llm_output(request.user_id, request.trace_id, request.question, response)
        return response

    1. Define which callback you want to use. 2. Call specific methods at intervals.
  27. Are my end-users actually happy with the results? Always incorporate

    feedback mechanisms! The focus shifts more towards using feedback to evaluate LLM performance. Feedback mechanisms should be treated as a critical component of the platform.
  28. On LLM monitoring Langfuse Langfuse is an open-source LLM observability

    platform. Langfuse is easy to use and can be self-hosted. It has RBAC and integrates with Microsoft Entra ID. Langfuse uses a format very similar to OTEL, so it is very compatible. ✓ Easily monitor your traces, costs and feedback!
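A minimal sketch of incorporating the Langfuse SDK, assuming the v2 Python SDK decorator API and a self-hosted instance; the host URL, keys, helper functions and trace id are placeholders. End-user feedback is stored as a score on the trace:

    import os
    from langfuse import Langfuse
    from langfuse.decorators import observe, langfuse_context

    # Point the SDK at the self-hosted instance (placeholder URL); keys are read
    # from the LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY environment variables.
    os.environ.setdefault("LANGFUSE_HOST", "https://langfuse.example.com")
    langfuse = Langfuse()

    @observe()  # creates a trace; function arguments and return value are captured automatically
    def handle_question(question: str) -> dict:
        docs = retrieve_documents()        # hypothetical helpers; decorate them with
        answer = generate_llm_response()   # @observe() as well to get nested spans
        # Return the trace id so feedback can later be linked back to this trace.
        return {"answer": answer, "trace_id": langfuse_context.get_current_trace_id()}

    # Later: store end-user feedback as a score on that trace.
    langfuse.score(trace_id="<trace-id-of-the-request>", name="user-feedback", value=1)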
  29. Easily deploy your own server Langfuse Deploying Langfuse is a

    breeze. You can enable authentication based on Active Directory by creating an Azure App Registration and passing client credentials. ✓ Deploy as a container app on Azure Container Apps to host it as a web service.
  30. Putting it together Frontend LLM Orchestrator Monitoring Stack LLM Endpoints

    Vector Database Backend Database Data Layer Xebia Base RAG Xebia Base RAG
  31. Wrapping up. LLMs are awesome and RAG is very powerful, but

    there are challenges in productionizing RAG, which we can solve with a strong focus on observability and monitoring. Monitoring Stack: ✓ Monitor costs ✓ Tracing ✓ Store feedback. LLM Orchestrator: ✓ Guardrails ✓ Grounding. The LLM challenges (Why → Need): Complex traces & flows: LLM applications rely on complex, chained calls → capture full context of LLM execution. Costs: costs can skyrocket easily → link individual outputs to costs. Hard to evaluate: heterogeneous outputs make it hard to evaluate → create strong feedback mechanisms. Hallucinations: LLMs are unpredictable → observe if the LLM makes mistakes.
  32. xebia.com Jeroen Overschie Sander van Donkelaar AI Community Day 2024

    A Practical Guide to LLMOps Platforms for RAG Applications on Azure From Prototype to Production: