Architecting and Building a K8s-based AI Platform #CNS25

qaware.de Architecting and Building a K8s-based AI Platform Mario-Leander Reimer
[email protected] @LeanderReimer @qaware #CloudNativeNerd #gerneperdude

2 Mario-Leander Reimer Managing Director | CTO @LeanderReimer #cloudnativenerd #qaware
#gernperDude

3 QAware 2019

"Too much cognitive load will become a bottleneck for fast
ﬂow and high productivity for many DevOps teams." Team Topologies: Organizing Business and Technology Teams for Fast Flow

Platform engineering is the discipline of designing and building toolchains
and workﬂows that enable self-service capabilities for software engineering organizations in the cloud-native era. Platform engineers provide an integrated product most often referred to as an “Internal Developer Platform” covering the operational necessities of the entire lifecycle of an application. https://platformengineering.org/blog/what-is-platform-engineering

An example reference architecture for an IDP. Developer Control Plane
Integration and Delivery Plane Monitoring and Logging Plane Security Plane IDE Service Catalog / API Catalog Developer Portal Application Source Code Infrastructure & Platform Source Code Observability Secrets & Identity Manager CI Pipeline Registry CD Pipeline Resource Plane Compute Data Integration Networking Platform Orchestrator Certificates & Encryption GitOps https://humanitec.com/reference-architectures

7 QAware 2025

qaware.de A wave is coming!

qaware.de Agentic AI Software engineering agents Domain speciﬁc agentic workloads

qaware.de ... and we have the perfect surfboard! The logical
continuation: a. From applications to microservices to AI agents b. From on-prem to cloud platforms to AI platforms

Micro-Agent GenAI Usage Prompts, Flow control Tools (MCP) Antwort enthält
Aufrufe an OpenAI API ❏ Clear responsibility ❏ Vertical in terms of expertise ❏ manageably large ❏ potentially reusable Micro-Agent A2A AI agents will be implemented according to the microservice architecture paradigm. … … … Tool Server Business Logic LLM, LAM, SLM, domain-speciﬁc foundation models ? SSE HTTP

Why do we need an AI platform?

"According to Gartner, 80% of AI PoCs fail on their
way into productive use." https://www.qaware.de/ki-vom-proof-of-concept-poc-zur-entwicklung/

The 80% Fallacy of AI projects. 14 QAware Juan Pablo
Bottaro, LinkedIn Engineering Blog

Key challenges: technology, models and tools, scaling. Source: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year ▪
Different challenges are seen depending on the maturity of the group ▪ AI newcomers often underestimate the complexity of technologies, models and tools ▪ Production and scaling challenges often hinder production readiness ▪ High cognitive load and lack of expertise are also drivers for failing projects 15

Our proposal for an AI Platform Reference Architecture

Platform Plane Observability Operability Resource Plane Compute Data Integration Security
Delivery FinOps Integration & Delivery Plane Quality Plane Data Plane Model Plane Compliance Plane Service Plane User Serving Plane Access Plane / APIs Orchestration Plane Data Modelling Plane

lreimer/k8s-native-ai-platform lreimer/k3s-ai-platform

Compliance Plane Integration & Delivery Plane Service Plane Platform Plane
Operability Resource Plane Compute Data: Local SSD Integration Security Delivery FinOps Quality Plane Data Plane Model Plane User Serving Plane Access Plane Data Modelling Pl.

The Kubernetes cluster topology requires precise planning. Otherwise the costs
will go through the roof! 21 QAware ▪ There are different GPU machines ▪ Not all types are available in all regions ▪ Prices vary drastically, accurate research is recommended ▪ Additional local SSDs are recommended ▪ To be decided: – all nodes with GPU – different nodes optimised for normal as well as GPU workloads https://cloud.google.com/compute/gpus-pricing?hl=de#other-gpu-models

Compliance Plane Integration & Delivery Plane Service Plane Platform Plane
Operability Resource Plane Compute Data: Local SSD Integration Security Delivery FinOps Quality Plane Data Plane Model Plane User Serving Plane Access Plane Data Modelling Pl.

Meet the Agentic Layer. Your Control Plane for Intelligent Workloads.

Ready to go Agentic? Want to meet the Agentic Layer
and discuss how to turn isolated AI agents into an enterprise-grade workforce? Visit us at our booth Meet us on the ground floor, dive deeper, get scanned – and you might just win tickets to top events like KI Navigator or CLC! CloudNativeNight – July 31 Take the chance and continue the Agentic Layer discussion with our community. Let’s build the future of AI together!

What’s your take on Agentic AI? Tell us where you're
stuck or curious – and how you'd like to dive deeper into the topic.

QAware GmbH | Aschauer Straße 30 | 81549 München |
GF: Dr. Josef Adersberger, Michael Stehnken, Michael Rohleder, Mario-Leander Reimer Niederlassungen in München, Mainz, Rosenheim, Darmstadt | +49 89 232315-0 | [email protected] The next step? Let's talk! Mario-Leander Reimer Managing Director, CTO [email protected] +49 151 61314748

Architecting and Building a K8s-based AI Platfo...

Architecting and Building a K8s-based AI Platform #CNS25

M.-Leander Reimer PRO

Video

More Decks by M.-Leander Reimer

Other Decks in Technology

Featured

Transcript

qaware.de Architecting and Building a K8s-based AI Platform Mario-Leander Reimer

2 Mario-Leander Reimer Managing Director | CTO @LeanderReimer #cloudnativenerd #qaware

3 QAware 2019

"Too much cognitive load will become a bottleneck for fast

Platform engineering is the discipline of designing and building toolchains

An example reference architecture for an IDP. Developer Control Plane

7 QAware 2025

qaware.de A wave is coming!

qaware.de Agentic AI Software engineering agents Domain speciﬁc agentic workloads

qaware.de ... and we have the perfect surfboard! The logical

Micro-Agent GenAI Usage Prompts, Flow control Tools (MCP) Antwort enthält

Why do we need an AI platform?

"According to Gartner, 80% of AI PoCs fail on their

The 80% Fallacy of AI projects. 14 QAware Juan Pablo

Key challenges: technology, models and tools, scaling. Source: https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year ▪

Our proposal for an AI Platform Reference Architecture

Platform Plane Observability Operability Resource Plane Compute Data Integration Security

lreimer/k8s-native-ai-platform lreimer/k3s-ai-platform

Compliance Plane Integration & Delivery Plane Service Plane Platform Plane

The Kubernetes cluster topology requires precise planning. Otherwise the costs

Compliance Plane Integration & Delivery Plane Service Plane Platform Plane

Meet the Agentic Layer. Your Control Plane for Intelligent Workloads.

Ready to go Agentic? Want to meet the Agentic Layer

What’s your take on Agentic AI? Tell us where you're

QAware GmbH | Aschauer Straße 30 | 81549 München |