• Develop LLM applications with an OpenAI-compatible API
  ◦ Leverage the existing ecosystem to build applications such as code auto-completion and chat bots
• Fine-tune models while keeping data safe and secure in your on-premise datacenter
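Because the endpoint speaks the OpenAI wire format, existing OpenAI clients and tools can target it unchanged. A minimal stdlib-only sketch of building a chat-completion request; the base URL, API key, and model name below are placeholders, not values shipped with LLMariner:

```python
import json
import urllib.request

# Placeholders -- substitute your LLMariner API endpoint and key.
BASE_URL = "http://localhost:8080/v1"
API_KEY = "llmariner-api-key"

def chat_completion_request(model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible POST /v1/chat/completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it requires a running endpoint:
#   with urllib.request.urlopen(chat_completion_request("my-model", "Hi")) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```

The same request shape is what the `openai` Python package produces, so `OpenAI(base_url=..., api_key=...)` can be pointed at the endpoint directly.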
For the AI/ML team:
• Inference
• LLM fine-tuning
• RAG
• Jupyter Notebooks
• General-purpose training

For the infrastructure team:
• Flexible deployment model
• Efficient GPU management
• Security / access control
• GPU visibility/showback (*)
• Highly reliable GPU management (*)

(*) under development
(Architecture diagram: the LLMariner Control Plane for AI/ML runs in a control-plane Kubernetes cluster and exposes the API endpoint; an LLMariner Agent for AI/ML runs in each worker GPU Kubernetes cluster.)
APIs for the AI/ML and infra teams (per Kubernetes cluster):
• OpenAI-compatible API (chat completion, embedding, RAG, fine-tuning, …)
• Workbench with Jupyter Notebooks

Control-plane components:
• Inference engine: runtime mgmt (e.g., autoscaling, routing) across vLLM, Nvidia Triton, and Ollama
• Model mgmt: open models, closed models owned by your org, and fine-tuned models
• GPU workloads mgmt: fine-tuning jobs, general-purpose training jobs, and Jupyter Notebooks, scheduled with Kueue
• Storage mgmt: files and vector DBs
• User mgmt: API authn/authz (Dex), API key mgmt, orgs & projects mgmt, and API usage audits
• Cluster federation: cluster mgmt and secure session mgmt
• Compatible with the OpenAI API
  ◦ Can leverage the existing ecosystem and applications
• Advanced capabilities surpassing standard inference runtimes such as vLLM
  ◦ Optimized request serving and GPU management
  ◦ Multiple inference runtime support
  ◦ Multiple model support
  ◦ Built-in RAG integration
• Multiple model support
  ◦ Open models from Hugging Face
  ◦ Private models in customers’ environments
  ◦ Fine-tuned models generated with LLMariner
• Multiple inference runtime support
  ◦ vLLM
  ◦ Ollama
  ◦ Nvidia Triton Inference Server (upcoming)
  ◦ Hugging Face TGI (experimental)
• Use the OpenAI-compatible API to manage vector stores and files
  ◦ Milvus serves as the underlying vector DB
• The inference engine retrieves relevant data when processing requests

(Diagram: files are uploaded and embedded into the vector store; the LLMariner inference engine retrieves the relevant data at request time.)
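Assuming the endpoint follows the OpenAI vector-store API shape (`POST /v1/vector_stores`), creating a store over previously uploaded files can be sketched as follows; the URL, key, and file ids are placeholders:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # placeholder endpoint
API_KEY = "llmariner-api-key"          # placeholder key

def create_vector_store_request(name: str, file_ids: list[str]) -> urllib.request.Request:
    """Build a POST /v1/vector_stores request. file_ids reference files
    previously uploaded via POST /v1/files; their contents are embedded
    into the store for retrieval at inference time."""
    body = json.dumps({"name": name, "file_ids": file_ids}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/vector_stores",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```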
• Provide LLM fine-tuning, general-purpose training, and Jupyter Notebook management
  ◦ Fine-tuning is built on a supervised fine-tuning trainer
• Empower AI/ML teams to harness the full power of GPUs in a secure, self-contained environment
• Submit a fine-tuning job using the OpenAI Python library
  ◦ The fine-tuning job runs in an underlying Kubernetes cluster
• Enforce quotas through integration with the open-source Kueue project

(Diagram: a submitted fine-tuning job is placed onto GPUs in a Kubernetes cluster, with quota enforcement by Kueue.)
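With the `openai` Python package this is `client.fine_tuning.jobs.create(model=..., training_file=...)` against the LLMariner base URL. A dependency-free sketch of the same wire call, with placeholder URL, key, and ids:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # placeholder endpoint
API_KEY = "llmariner-api-key"          # placeholder key

def create_fine_tuning_job_request(model: str, training_file_id: str) -> urllib.request.Request:
    """Build an OpenAI-compatible POST /v1/fine_tuning/jobs request.
    training_file_id is the id of a JSONL training file uploaded
    beforehand via POST /v1/files with purpose="fine-tune"."""
    body = json.dumps({
        "model": model,
        "training_file": training_file_id,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/fine_tuning/jobs",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

On the server side, the accepted job is what gets scheduled onto a GPU Kubernetes cluster subject to Kueue quotas.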
• Control API scope with “organizations” and “projects”
  ◦ A user in Project X can access fine-tuned models generated by other users in Project X
  ◦ A user in Project Y cannot access the fine-tuned models in Project X
• Can be integrated with a customer’s identity management platform (e.g., SAML, OIDC)
• Supported deployment models: single public cloud, single private cloud, air-gapped environment, appliance, hybrid cloud (public & private), and multi-cloud federation
  ◦ The control plane runs in one Kubernetes cluster while agents run in worker clusters across private clouds, public clouds, or both
※ No need to open incoming ports in worker clusters; only outgoing port 443 is required