Architecting and Building a K8s-based AI Platform #CNN

Developing a scalable and production-ready AI platform poses significant challenges for organisations. Beyond a modular and flexible architecture, critical aspects such as infrastructure automation, orchestration, model deployment, and lifecycle management must be efficiently addressed. Kubernetes and open-source technologies provide a powerful foundation for tackling these challenges.

In this talk, we will explore the conceptual architecture and blueprint of a cloud-native AI platform, outlining the key design principles and best practices that enable scalability, automation, and reproducibility. We will then demonstrate how to build this platform step by step, both locally and in the public cloud, using Kubernetes, open-source tools, and GitOps. The focus will be on creating a highly automated, repeatable, and production-ready environment for machine learning and AI workloads.

Keywords: platform, productivity, kubernetes, open source, ai

M.-Leander Reimer

November 20, 2025

Resources

Tiny K3s AI Platform

https://github.com/lreimer/k3s-ai-platform

A tiny AI platform suitable for running locally on RD, k3s, Kind et al.

K8s-native AI Platform in GCP

https://github.com/lreimer/k8s-native-ai-platform

Demo repository for a Kubernetes-native AI platform.

Transcript

  1. "Too much cognitive load will become a bottleneck for fast flow and high productivity for many DevOps teams." Team Topologies: Organizing Business and Technology Teams for Fast Flow
  2. Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an “Internal Developer Platform” covering the operational necessities of the entire lifecycle of an application. https://platformengineering.org/blog/what-is-platform-engineering
  3. An example reference architecture for an IDP.
    Planes: Developer Control Plane, Integration and Delivery Plane, Monitoring and Logging Plane, Security Plane, Resource Plane
    Components: IDE, Service Catalog / API Catalog, Developer Portal, Application Source Code, Infrastructure & Platform Source Code, Observability, Secrets & Identity Manager, CI Pipeline, Registry, CD Pipeline, Compute, Data, Integration, Networking, Platform Orchestrator, Certificates & Encryption, GitOps
    https://humanitec.com/reference-architectures
  4. qaware.de ... and we have the perfect surfboard! The logical continuation:
    a. From applications to microservices to AI agents
    b. From on-prem to cloud platforms to AI platforms
  5. Micro-Agent: AI agents will be implemented according to the microservice architecture paradigm.
    ❏ Clear responsibility
    ❏ Vertical in terms of expertise
    ❏ Manageably large
    ❏ Potentially reusable
    Diagram labels: GenAI Usage; Prompts, Flow control; Tools (MCP); Response contains calls to OpenAI API; Micro-Agent; A2A; Tool Server; Business Logic; LLM, LAM, SLM, domain-specific foundation models; SSE; HTTP
  6. "According to Gartner, 80% of AI PoCs fail on their way into productive use." https://www.qaware.de/ki-vom-proof-of-concept-poc-zur-entwicklung/
  7. The 80% Fallacy of AI projects. Juan Pablo Bottaro, LinkedIn Engineering Blog
  8. Key challenges: technology, models and tools, scaling.
    ▪ Different challenges are seen depending on the maturity of the group
    ▪ AI newcomers often underestimate the complexity of technologies, models and tools
    ▪ Production and scaling challenges often hinder production readiness
    ▪ High cognitive load and lack of expertise are also drivers for failing projects
    Source: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year
  9. Platform Plane Observability Operability Resource Plane Compute Data Integration Security Delivery FinOps Integration & Delivery Plane Quality Plane Data Plane Model Plane Compliance Plane Service Plane User Serving Plane Access Plane / APIs Orchestration Plane Data Modelling Plane
  10. The Kubernetes cluster topology requires precise planning. Otherwise the costs will go through the roof!
    ▪ There are different GPU machines
    ▪ Not all types are available in all regions
    ▪ Prices vary drastically, accurate research is recommended
    ▪ Additional local SSDs are recommended
    ▪ To be decided:
      – all nodes with GPU
      – different nodes optimised for normal as well as GPU workloads
    https://cloud.google.com/compute/gpus-pricing?hl=de#other-gpu-models
  11. Compliance Plane Integration & Delivery Plane Service Plane Platform Plane Operability Resource Plane Compute Data: Local SSD Integration Security Delivery FinOps Quality Plane Data Plane Model Plane User Serving Plane Access Plane Data Modelling Pl.
  12. Compliance Plane Integration & Delivery Plane Service Plane Platform Plane Operability Resource Plane Compute Data: Local SSD Integration Security Delivery FinOps Quality Plane Data Plane Model Plane User Serving Plane Access Plane Data Modelling Pl.
  13. Ready to go Agentic? Stay up-to-date with the Agentic Layer Newsletter! With your newsletter subscription, you not only stay up to date but also have the chance to win tickets for top tech conferences like the KI Navigator or the CLC. We look forward to continuing our discussion about Agentic AI with you!
  14. What’s your take on Agentic AI? Tell us where you're stuck or curious – and how you'd like to dive deeper into the topic.
  15. The next step? Let's talk! Mario-Leander Reimer, Managing Director, CTO | [email protected] | +49 151 61314748
    QAware GmbH | Aschauer Straße 30 | 81549 München | Managing Directors: Dr. Josef Adersberger, Michael Stehnken, Michael Rohleder, Mario-Leander Reimer | Offices in München, Mainz, Rosenheim, Darmstadt | +49 89 232315-0 | [email protected]
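
Slide 10 leaves the node-pool layout open: either all nodes carry a GPU, or separate pools serve normal and GPU workloads. Below is a minimal Python sketch of how the two options could be compared on cost. The hourly prices, node counts, and the names `NodePool` and `topology_cost` are assumptions made up for illustration. They are not figures from the talk or from any cloud price list. As the slide notes, real prices vary drastically by GPU type and region.

```python
from dataclasses import dataclass
from typing import Iterable

HOURS_PER_MONTH = 730  # common approximation for an average month


@dataclass(frozen=True)
class NodePool:
    """A group of identical cluster nodes (hypothetical helper type)."""
    name: str
    nodes: int
    hourly_price: float  # placeholder price per node-hour, NOT a real quote

    def monthly_cost(self) -> float:
        return self.nodes * self.hourly_price * HOURS_PER_MONTH


def topology_cost(pools: Iterable[NodePool]) -> float:
    """Sum the monthly cost of all node pools in a cluster topology."""
    return sum(pool.monthly_cost() for pool in pools)


# Placeholder prices; replace with researched values for your region.
CPU_NODE_PRICE = 0.20
GPU_NODE_PRICE = 2.50

# Option 1: every node has a GPU, even for non-GPU workloads.
all_gpu = [NodePool("gpu", nodes=6, hourly_price=GPU_NODE_PRICE)]

# Option 2: a general-purpose pool plus a smaller GPU pool.
mixed = [
    NodePool("cpu", nodes=4, hourly_price=CPU_NODE_PRICE),
    NodePool("gpu", nodes=2, hourly_price=GPU_NODE_PRICE),
]

if __name__ == "__main__":
    for label, pools in (("all GPU", all_gpu), ("mixed pools", mixed)):
        print(f"{label}: {topology_cost(pools):,.2f} per month")
```

The sketch only models node-hours. Local SSDs, network egress, and autoscaling behaviour would need to be added before using the numbers for an actual decision.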