CloudConf 2026 - Self Healing Production Rollouts

Kevin Dubois
May 06, 2026

Transcript

  1. Self-Healing Rollouts: Automating Production Fixes with Agentic AI

    Kevin Dubois, Sr Principal Dev Advocate, IBM
    Natale Vinto, Technical Director, Evangelism, Red Hat
  2. Kevin Dubois

    ★ Sr. Principal Developer Advocate at IBM
    ★ Java Champion
    ★ Technical Lead, CNCF DevEx TAG
    ★ From Belgium / Live in Switzerland
    ★ 🗣 English, Dutch, French, Italian

    youtube.com/@thekevindubois
    linkedin.com/in/kevindubois
    github.com/kdubois
    @kevindubois.com
  4. Art of Seamless Deployments! Who doesn't love a smooth rollout?

    No more deployment drama, just Argo-mazing progress! #ContinuousDeploymentChronicles
  5. Familiar Progressive Deployment Strategies: BlueGreen

    [Diagram, section "Advanced Deployment Strategies": a load balancer (LB) in front of Application Version 1 and Application Version 2, shown in four stages: 1. Initial Deployment, 2. Deploy New Version, 3. Switch Traffic, 4. Finish]
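
The four blue-green stages on this slide can be sketched as a tiny state machine. This is an illustrative Python model, not Argo Rollouts code: the class `BlueGreenRouter` and its methods are hypothetical names invented for the example.

```python
from typing import Optional


class BlueGreenRouter:
    """Toy model of a load balancer in front of two application versions."""

    def __init__(self, initial_version: str) -> None:
        # Stage 1: only the initial version exists and receives all traffic.
        self.active: str = initial_version
        self.preview: Optional[str] = None

    def deploy(self, new_version: str) -> None:
        # Stage 2: the new version runs next to the old one, without live traffic.
        self.preview = new_version

    def switch_traffic(self) -> None:
        # Stage 3: the load balancer flips to the new version in a single step.
        if self.preview is None:
            raise RuntimeError("no new version has been deployed")
        self.active, self.preview = self.preview, self.active

    def finish(self) -> None:
        # Stage 4: the old version is retired.
        self.preview = None

    def route(self) -> str:
        # Every request goes to whichever version is currently active.
        return self.active
```

Until `finish()` is called, the old version is still held in `preview`, so calling `switch_traffic()` again acts as a rollback in this model.
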
  6. Familiar Progressive Deployment Strategies: Canary

    [Diagram: a load balancer (LB) in front of Application Version 1 and Application Version 2, shown in four stages: 1. Initial Deployment, 2. New Version 10% Traffic, 3. New Version 33% Traffic, 4. New Version All Traffic (100%)]
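
The canary weights on this slide (10%, 33%, 100%) amount to weighted random routing. A minimal sketch, assuming a simple per-request random choice; the function names `pick_version` and `canary_share` are hypothetical, and a real mesh or ingress controller would do this in its own data plane.

```python
import random


def pick_version(canary_weight: int, rng: random.Random) -> str:
    """Return "v2" for roughly canary_weight percent of calls, otherwise "v1"."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    # rng.random() is in [0, 1), so weight 0 never picks v2 and weight 100 always does.
    return "v2" if rng.random() * 100 < canary_weight else "v1"


def canary_share(canary_weight: int, requests: int = 10_000, seed: int = 42) -> float:
    """Simulate a batch of requests and return the fraction routed to the canary."""
    rng = random.Random(seed)
    hits = sum(pick_version(canary_weight, rng) == "v2" for _ in range(requests))
    return hits / requests
```

Calling `canary_share(10)` and `canary_share(33)` should give fractions close to 0.10 and 0.33; the exact values depend on the seed.
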
  7. But the real challenge is GitOps can’t do progressive delivery!!

    “Swiss cheese” approach: code review, CI, staging, generic alerting, but a missing link after deployment.

    [Diagram: GitOps flow between AppDev, Administrator, Source Git Repository, Config Git Repository, Image Registry, CD and Kubernetes, with arrows labeled Push Code, Pull Request, Push and Pull]
  8. Rollout Big Picture

    Rollout, deployed using the Operator (GitOps Operator):

        apiVersion: argoproj.io/v1beta1
        kind: RolloutManager
        metadata:
          name: argo-rollout-manager
          namespace: basic
  9. Rollout Manager

    Deployed using the Operator (GitOps Operator):

        apiVersion: argoproj.io/v1beta1
        kind: RolloutManager
        metadata:
          name: argo-rollout-manager
          namespace: basic

    [Diagram: the Rollout Controller watches the Rollout (image: v1) and creates and manages the Active/Stable Service, the Preview/Canary Service, the Old ReplicaSet and the New ReplicaSet]

    To deploy an application, it needs:
    • Rollout
    • 2 Services that the Rollout Controller will manage
  10. Migrating a Deployment to Rollout

    Reference the Deployment resource in the Rollout object. No downtime and reversible, if done in steps.

        apiVersion: argoproj.io/v1alpha1
        kind: Rollout
        metadata:
          name: example-rollout-canary
        spec:
          replicas: 4
          selector:
            matchLabels:
              app: guestbook
          ……
          workloadRef:
            apiVersion: apps/v1
            kind: Deployment
            name: rollout-ref-deployment

    [Diagram labels: Deployment, Rollout, References, ReplicaSet, Manages, Pod (×4), In existing Deployments]
  11. Metrics Based Rollouts

        strategy:
          canary:
            analysis:
              args:
                - name: service-name
                  value: rollouts-demo-canary.canary.svc.cluster.local
              templates:
                - templateName: success-rate
            canaryService: rollouts-demo-canary
            stableService: rollouts-demo-stable
            trafficRouting:
              istio:
                virtualService:
                  name: rollout-vsvc
                  routes:
                    - primary
            steps:
              - setWeight: 30
              - pause: { duration: 20s }
              - setWeight: 40
              - pause: { duration: 10s }
              - setWeight: 60
              - pause: { duration: 10s }
              - setWeight: 80
              - pause: { duration: 5s }
              - setWeight: 90
              - pause: { duration: 5s }
              - setWeight: 100
              - pause: { duration: 5s }
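
To see how quickly the `steps` list above shifts traffic, the schedule can be replayed offline. This is a sketch only: the tuple encoding of the steps and the helpers `parse_seconds` and `timeline` are assumptions made for the example, and analysis runs or manual pauses that would delay a real rollout are ignored.

```python
import re

# Steps copied from the slide, encoded as (kind, value) pairs.
STEPS = [
    ("setWeight", 30), ("pause", "20s"),
    ("setWeight", 40), ("pause", "10s"),
    ("setWeight", 60), ("pause", "10s"),
    ("setWeight", 80), ("pause", "5s"),
    ("setWeight", 90), ("pause", "5s"),
    ("setWeight", 100), ("pause", "5s"),
]

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_seconds(duration: str) -> int:
    """Convert durations such as "20s" or "2m" to seconds (only s, m, h handled here)."""
    match = re.fullmatch(r"(\d+)([smh])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * UNIT_SECONDS[unit]


def timeline(steps):
    """Return (elapsed_seconds, canary_weight) for each setWeight step, plus total time."""
    elapsed = 0
    points = []
    for kind, value in steps:
        if kind == "setWeight":
            points.append((elapsed, value))
        elif kind == "pause":
            elapsed += parse_seconds(value)
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
    return points, elapsed
```

With the values from the slide, the canary reaches 100% after 50 seconds of pauses and the whole schedule finishes after 55 seconds, which is demo pacing rather than a production setting.
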
  12. apiVersion: argoproj.io/v1alpha1
      kind: AnalysisTemplate
      metadata:
        name: success-rate
      spec:
        args:
          - name: service-name
        metrics:
          - name: success-rate
            interval: 10s
            successCondition: len(result) == 0 || result[0] >= 0.95
            failureLimit: 2
            provider:
              prometheus:
                address: https://internal:[email protected] .local:9090
                query: |
                  sum(irate(istio_requests_total{
                    reporter="source",
                    destination_service=~"{{args.service-name}}",
                    response_code!~"5.*"}[30s])
                  )
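
The interplay of `successCondition` and `failureLimit` in this template can be sketched as follows. This is a simplified model, not the Argo Rollouts implementation: it assumes each measurement is a list of floats as returned by the Prometheus query, and that the analysis fails once the number of failed measurements exceeds `failureLimit`. The function names are hypothetical.

```python
def success(result: list) -> bool:
    """Mirror of the slide's condition: len(result) == 0 || result[0] >= 0.95."""
    # An empty result (no traffic observed yet) counts as a pass.
    return len(result) == 0 or result[0] >= 0.95


def analysis_outcome(measurements: list, failure_limit: int = 2) -> str:
    """Return "Failed" once more than failure_limit measurements fail, else "Successful"."""
    failures = 0
    for result in measurements:
        if not success(result):
            failures += 1
            # Assumption: the limit is tolerated, exceeding it aborts the analysis.
            if failures > failure_limit:
                return "Failed"
    return "Successful"
```

Under this model, two bad measurements out of several are tolerated with `failureLimit: 2`, while a third one fails the analysis and would trigger a rollback of the canary.
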
  13. apiVersion: argoproj.io/v1alpha1
      kind: RolloutManager
      metadata:
        name: argo-rollouts
      spec:
        plugins:
          metric:
            - name: argoproj-labs/metric-ai
              location: https://github.com/argoproj-labs/rollouts-plugin-metric-ai/releases/download/v0.0.1/rollouts-plugin-metric-ai-linux-amd64
  14. apiVersion: argoproj.io/v1alpha1
      kind: AnalysisTemplate
      metadata:
        name: canary-analysis-ai-agent
      spec:
        metrics:
          - interval: 10s
            name: success-rate
            provider:
              plugin:
                argoproj-labs/metric-ai:
                  agentUrl: http://kubernetes-agent:8080
                  stableLabel: role=stable
                  canaryLabel: role=canary
                  extraPrompt: ignore aesthetic changes
            successCondition: result > 0.50
  15. apiVersion: argoproj.io/v1alpha1
      kind: AnalysisTemplate
      metadata:
        name: canary-analysis-ai-agent
      spec:
        metrics:
          - interval: 10s
            name: success-rate
            provider:
              plugin:
                argoproj-labs/metric-ai:
                  agentUrl: http://kubernetes-agent:8080
                  stableLabel: role=stable
                  canaryLabel: role=canary
                  extraPrompt: ignore aesthetic changes
                  githubUrl: …github.com/kdubois/argo-rollouts-quarkus-demo
  16. Lessons learned:

    • Performance was an interesting challenge: went from one AI service to an agentic system with parallel & async agents
    • LLM choice + “context engineering” + tool calling, especially for PR creation
    • Complexity vs portability (e.g. could’ve used Serverless MCP, external code assistant for PR creation, async remote agents, etc.)
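
The "parallel & async agents" lesson can be illustrated with a small concurrency sketch. Everything here is a stand-in: `analyze_metrics` and `analyze_logs` are hypothetical stubs that sleep instead of calling an LLM, and the slides do not show how the actual agents are wired together.

```python
import asyncio


async def analyze_metrics(canary: str) -> dict:
    # Stand-in for an LLM-backed metrics agent; the sleep simulates model latency.
    await asyncio.sleep(0.05)
    return {"agent": "metrics", "canary": canary, "healthy": True}


async def analyze_logs(canary: str) -> dict:
    # Stand-in for an LLM-backed log agent that spots errors in the canary logs.
    await asyncio.sleep(0.05)
    return {"agent": "logs", "canary": canary, "healthy": False}


async def analyze(canary: str) -> dict:
    # gather() runs both agents concurrently, so the total latency is close to
    # the slowest agent rather than the sum of both. Results keep argument order.
    reports = await asyncio.gather(analyze_metrics(canary), analyze_logs(canary))
    return {
        "canary": canary,
        "healthy": all(report["healthy"] for report in reports),
        "reports": list(reports),
    }


if __name__ == "__main__":
    print(asyncio.run(analyze("role=canary")))
```

The combined verdict is unhealthy as soon as one agent reports a problem, which is the conservative choice for a canary gate.
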
  17. Takeaways:

    • Rolling out changes to all users at once is risky
    • Canary rollouts and feature flags are safer
    • AI Agents can automate the loop by analyzing metrics and logs, and even proposing fixes for the failures
    • AI != Python !!! Java with Quarkus is powerful for enterprise AI
  18. Resources: GitOps Cookbook eBook

    The GitOps Cookbook presents useful recipes and examples to follow GitOps practices on Kubernetes. Authors Natale Vinto and Alex Soto Bueno walk you through the necessary steps for successful hands-on application development and deployment with GitOps.