

LLMariner - Transform your Kubernetes Cluster Into a GenAI platform

This presentation gives an overview of LLMariner.

Kenji Kaneda

October 11, 2024


Transcript

  1. LLMariner: Provide a unified AI/ML platform with efficient GPU and K8s management
     [Diagram: LLMariner layers LLM (inference, fine-tuning, RAG), a Workbench (Jupyter Notebook), and non-LLM training on top of public/private clouds with heterogeneous GPUs (gen G1/arch A1 and gen G2/arch A2).]
  2. Example Use Cases
     • Develop LLM applications with an OpenAI-compatible API (see the sketch below)
       ◦ Leverage the existing ecosystem to build applications such as code auto-completion and chat bots
     • Fine-tune models while keeping data safe and secure in your on-premise datacenter
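     As a sketch of what that compatibility enables, the snippet below points the official OpenAI Python client at an LLMariner deployment. The base URL, API key, and model name are illustrative assumptions, not values from the deck.

        from openai import OpenAI

        # Point the standard OpenAI client at the LLMariner API endpoint.
        # Endpoint, key, and model name are placeholders (assumptions).
        client = OpenAI(
            base_url="http://llmariner.example.com/v1",
            api_key="<your-llmariner-api-key>",
        )

        completion = client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",  # any model the cluster serves
            messages=[{"role": "user", "content": "Summarize what LLMariner does."}],
        )
        print(completion.choices[0].message.content)

     Because only the base URL changes, existing OpenAI-based tooling and applications can be repointed at the cluster without code changes.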
  3. Key Features
     For the AI/ML team:
     • LLM inference
     • LLM fine-tuning
     • RAG
     • Jupyter Notebook
     • General-purpose training
     For the infrastructure team:
     • Flexible deployment model
     • Efficient GPU management
     • Security / access control
     • GPU visibility/showback (*)
     • Highly reliable GPU management (*)
     (*) under development
  4. High Level Architecture
     [Diagram: a control-plane K8s cluster runs the LLMariner Control Plane for AI/ML and exposes the API endpoint; multiple worker GPU K8s clusters each run the LLMariner Agent for AI/ML.]
  5. Features for the AI/ML Team and the Infra Team
     APIs for the AI/ML team:
     • OpenAI-compatible API (chat completion, embedding, RAG, fine-tuning, …)
     • Workbench with Jupyter Notebooks
     [Diagram of the platform components: an inference engine with runtime mgmt (e.g., autoscaling, routing) over vLLM, Nvidia Triton, and Ollama; fine-tuning jobs, general-purpose training jobs, and Jupyter Notebooks, with training scheduled through Kueue; model mgmt covering open models, closed models owned by your org, and fine-tuned models; storage mgmt for files and vector DBs; user mgmt, API authn/authz (Dex), API key mgmt, orgs & projects mgmt, API usage audits, and secure session mgmt; cluster mgmt, cluster federation, and GPU workloads mgmt across worker K8s clusters.]
  6. LLM Inference Serving
     • Compatible with the OpenAI API
       ◦ Can leverage the existing ecosystem and applications
     • Advanced capabilities surpassing standard inference runtimes such as vLLM:
       ◦ Optimized request serving and GPU management
       ◦ Multiple inference runtime support
       ◦ Multiple model support
       ◦ Built-in RAG integration
  7. Multiple Model and Runtime Support
     • Multiple model support (queryable through the API; see the sketch below):
       ◦ Open models from Hugging Face
       ◦ Private models in customers' environments
       ◦ Fine-tuned models generated with LLMariner
     • Multiple inference runtime support:
       ◦ vLLM
       ◦ Ollama
       ◦ Nvidia Triton Inference Server and Hugging Face TGI (tagged upcoming and experimental)
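     As a quick check of which models a given deployment actually serves, the models endpoint can be queried with the same client as before. This assumes LLMariner exposes /v1/models as part of its OpenAI compatibility; the endpoint and key are again placeholders.

        from openai import OpenAI

        client = OpenAI(
            base_url="http://llmariner.example.com/v1",  # hypothetical endpoint
            api_key="<your-llmariner-api-key>",
        )

        # List every model the cluster currently serves: open, private,
        # and fine-tuned models all appear through the same endpoint.
        for model in client.models.list():
            print(model.id)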
  8. Optimized Inference Serving
     • Efficiently utilize GPUs to achieve high throughput and low latency
     • Key technologies:
       ◦ Autoscaling
       ◦ Model-aware request load balancing & routing
       ◦ Multi-model management & caching
       ◦ Multi-cluster/cloud federation
     [Diagram: the LLMariner Inference Manager Engine autoscales runtimes across clusters; Cluster X runs vLLM instances serving Llama 3.1 and Gemma 2, while Cluster Y runs vLLM serving Llama 3.1 and Ollama serving Deepseek Coder.]
  9. Built-in RAG Integration
     • Use the OpenAI-compatible API to manage vector stores and files (see the sketch below)
       ◦ Milvus serves as the underlying vector DB
     • The inference engine retrieves relevant data when processing requests
     [Diagram: files are uploaded and embedded into the vector store, and the LLMariner inference engine retrieves data from it at request time.]
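     A minimal sketch of that flow with the OpenAI Python SDK follows. It assumes LLMariner accepts the SDK's file and vector-store calls (the beta namespace in 2024-era SDK versions); the endpoint, key, and file name are placeholders.

        from openai import OpenAI

        client = OpenAI(
            base_url="http://llmariner.example.com/v1",  # hypothetical endpoint
            api_key="<your-llmariner-api-key>",
        )

        # Upload a document; the platform creates embeddings for it
        # in the underlying Milvus vector DB.
        file = client.files.create(
            file=open("product-manual.pdf", "rb"),
            purpose="assistants",
        )

        # Create a vector store and attach the uploaded file, making it
        # available for retrieval at inference time.
        store = client.beta.vector_stores.create(name="product-docs")
        client.beta.vector_stores.files.create(
            vector_store_id=store.id,
            file_id=file.id,
        )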
  10. Beyond LLM Inference
     • Provide LLM fine-tuning, general-purpose training, and Jupyter Notebook management
     • Empower AI/ML teams to harness the full power of GPUs in a secure, self-contained environment
     [Diagram: a Supervised Fine-tuning Trainer running in a GPU K8s cluster.]
  11. A Fine-tuning Example
     • Submit a fine-tuning job using the OpenAI Python library (see the sketch below)
       ◦ The fine-tuning job runs in an underlying Kubernetes cluster
     • Enforce quotas through integration with the open source Kueue
     [Diagram: submitted fine-tuning jobs run on GPUs in a K8s cluster, with quota enforcement by Kueue.]
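     The submission path can be sketched with the standard fine-tuning calls of the OpenAI Python SDK; the endpoint, key, base model, and training file name are illustrative assumptions.

        from openai import OpenAI

        client = OpenAI(
            base_url="http://llmariner.example.com/v1",  # hypothetical endpoint
            api_key="<your-llmariner-api-key>",
        )

        # Upload the training data (a JSONL file of chat-formatted examples).
        training_file = client.files.create(
            file=open("train.jsonl", "rb"),
            purpose="fine-tune",
        )

        # Submit the job. Per the slide, LLMariner runs it as a job in the
        # underlying Kubernetes cluster, subject to Kueue quota enforcement.
        job = client.fine_tuning.jobs.create(
            model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
            training_file=training_file.id,
        )
        print(f"submitted fine-tuning job {job.id}")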
  12. Enterprise-Ready Access Control
     • Control API scope with "organizations" and "projects"
       ◦ A user in Project X can access fine-tuned models generated by other users in Project X
       ◦ A user in Project Y cannot access the fine-tuned models in Project X
     • Can be integrated with a customer's identity management platform (e.g., SAML, OIDC)
     [Diagram: Users 1 and 2 in Project X create and read a fine-tuned model; User 3 in Project Y cannot access it.]
  13. Supported Deployment Models
     • Single public cloud
     • Single private cloud
     • Air-gapped environment
     • Appliance
     • Hybrid cloud (public & private)
     • Multi-cloud federation
     [Diagram: the LLMariner Control Plane runs in one public or private cloud's K8s cluster while LLMariner Agents run in other clouds.]
     ※ No need to open incoming ports in worker clusters; only outgoing port 443 is required.