Increased velocity of code deployments with Kubernetes

Build Stuff, 2023-11-16, Vilnius
Edgaras Apšega, Site Reliability Engineer

Vinted infrastructure
Peak load transactions between users
Kubernetes production stats
• All services running on Kubernetes*
• 20k+ running pods
• 700+ physical nodes (80k CPU cores; 330TB memory)
*stateless

Vinted infrastructure
Velocity of deployments: 2000 deployments to production per day

pre-Kubernetes deployments

Capistrano
• Gathers physical server list from Consul SD
• Does rolling update one-by-one (%)
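The rolling update described on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not Capistrano itself: `rolling_batches`, `rolling_update`, the `deploy` callback, and the server names are all hypothetical, and the server list is assumed to have been fetched from service discovery already. The `percent` parameter is one reading of the "(%)" on the slide.

```python
from math import ceil
from typing import Callable, Iterator, List


def rolling_batches(servers: List[str], percent: int) -> Iterator[List[str]]:
    """Yield consecutive batches, each holding at most `percent` % of the servers."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    # Always update at least one server per batch.
    size = max(1, ceil(len(servers) * percent / 100))
    for start in range(0, len(servers), size):
        yield servers[start:start + size]


def rolling_update(
    servers: List[str], percent: int, deploy: Callable[[str], bool]
) -> List[str]:
    """Deploy batch by batch and stop after the first batch with a failure.

    Returns the servers that were updated successfully.
    """
    updated: List[str] = []
    for batch in rolling_batches(servers, percent):
        results = [(server, deploy(server)) for server in batch]
        updated.extend(server for server, ok in results if ok)
        if not all(ok for _, ok in results):
            break  # leave the remaining servers on the old version
    return updated


if __name__ == "__main__":
    # Hypothetical server names; a real list would come from service discovery.
    hosts = ["app1", "app2", "app3", "app4"]
    print(rolling_update(hosts, 25, lambda host: host != "app3"))
```

Stopping at the first failed batch limits the blast radius of a bad release, which is the main reason to roll out in small batches at all.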

ChatOps

Pain points
• Service isolation issues
• Physical servers with different CPUs
• Ruby dependency management
• Custom in-house integrations

Kubernetes
Best-in-class container orchestrator

Kubernetes architecture
• Physical Kubernetes nodes (AMD CPUs, 128 cores each)
• 1 cluster stretched across 3 nearby DCs with low latency between them
• Separate etcd cluster for /events endpoint
[Diagram: one Kubernetes cluster spanning data centers #1, #2, and #3, with three control plane VMs, three etcd VMs, three etcd /events VMs, and Kubernetes physical nodes]

Kubernetes integration with data center’s network
BGP for the win!
On the last Pi Day, Reddit had a 5-hour outage because route reflectors became unavailable after a Kubernetes cluster upgrade.
Our (simplified) architecture:
• iBGP from server to Top-of-Rack switch
• Data center network deployed as BGP fabric for route exchange
• Every Kubernetes node (server) has a /24 network for its pods
• Service IPs advertised as anycast inside data center
[Diagram: two spine switches above two leaf switches, 10.1.1.254/24 and 10.1.15.254/24]
• Kubernetes node under leaf 10.1.1.254/24: eth0 10.1.1.12/24, br0 10.100.23.254/24; pods 10.100.23.15/24 and 10.100.23.123/24; advertises 10.100.23.0/24 and 10.200.1.0/24
• Kubernetes node under leaf 10.1.15.254/24: eth0 10.1.15.12/24, br0 10.100.48.254/24; pods 10.100.48.7/24 and 10.100.48.203/24; advertises 10.100.48.0/24 and 10.200.1.0/24
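The addressing scheme on this slide can be checked with a few lines of Python. This is only a sketch of the prefix logic using the standard `ipaddress` module, not a BGP implementation. The addresses are taken from the diagram; treating 10.200.1.0/24 as the anycast service range is an assumption based on both nodes advertising it, and `advertised_prefixes` and `next_hops` are hypothetical helper names.

```python
import ipaddress
from typing import Dict, List

# Node eth0 address -> node br0 address, as shown in the diagram.
NODES: Dict[str, str] = {
    "10.1.1.12": "10.100.23.254/24",
    "10.1.15.12": "10.100.48.254/24",
}

# Assumption: this prefix is the anycast service range advertised by every node.
SERVICE_PREFIX = ipaddress.ip_network("10.200.1.0/24")


def advertised_prefixes(br0: str) -> list:
    """Return the prefixes a node announces: its own pod /24 plus the service range."""
    pod_network = ipaddress.ip_interface(br0).network
    return [pod_network, SERVICE_PREFIX]


def next_hops(destination: str) -> List[str]:
    """Return every node whose advertised prefixes cover the destination address."""
    address = ipaddress.ip_address(destination)
    return sorted(
        node
        for node, br0 in NODES.items()
        if any(address in prefix for prefix in advertised_prefixes(br0))
    )


if __name__ == "__main__":
    print(next_hops("10.100.23.15"))  # a pod address: exactly one node
    print(next_hops("10.200.1.10"))   # a service address: every node (anycast)
```

A pod address resolves to exactly one node because each node owns a distinct /24, while a service address resolves to every node, which is what makes it anycast.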

GitOps deployments Allows faster automated deployments to production

[Diagram: GitOps pipeline]
Continuous Integration: Git commit and push, Build, Test, Docker push, Git clone config repo, Update manifests
Continuous Deployment: Git clone config repo, Discover manifests, kubectl apply
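The "Update manifests" step in this pipeline usually amounts to rewriting an image tag in the config repo. A minimal sketch of that idea, assuming plain YAML manifests with unquoted `image: name:tag` lines; the function name `set_image_tag`, the registry name, and the tags are hypothetical, and the talk does not say which tool performs this step.

```python
import re


def set_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Replace the tag of every `image: <image>:<tag>` line in a manifest."""
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]*)?image:[ \t]*" + re.escape(image) + r"):\S+[ \t]*$",
        re.MULTILINE,
    )
    # A function replacement keeps any backslashes in new_tag literal.
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{new_tag}", manifest)
    if count == 0:
        raise ValueError(f"image {image!r} not found in manifest")
    return updated


if __name__ == "__main__":
    # Hypothetical manifest fragment.
    manifest = (
        "spec:\n"
        "  containers:\n"
        "    - name: web\n"
        "      image: registry.example.com/web:abc123\n"
    )
    print(set_image_tag(manifest, "registry.example.com/web", "def456"))
```

Committing the rewritten file to the config repo is what triggers the deployment side: the CD tool notices the new commit and applies the manifests. Quoted image values are not handled by this sketch.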

ArgoCD application view

Deployments velocity
• Deployment duration average is 20 minutes
• Core code releases to production every 20-30 minutes

What about rollbacks?
• Disable auto-sync for ArgoCD application
• Apply known good image tag
• Enable auto-sync for ArgoCD application
ChatOps to the rescue!
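Why auto-sync has to be switched off first can be shown with a toy model. The sketch below is not the ArgoCD API: the `App` class and the `reconcile`, `rollback`, and `resume` functions are hypothetical, and it assumes that auto-sync pulls the live state back to whatever the config repo declares. It also assumes auto-sync is re-enabled once Git points at a good tag again, which the slide does not spell out.

```python
from dataclasses import dataclass


@dataclass
class App:
    """Toy stand-in for an application managed by a GitOps controller."""
    git_tag: str            # image tag declared in the config repo
    live_tag: str           # image tag currently running in the cluster
    auto_sync: bool = True


def reconcile(app: App) -> None:
    """With auto-sync on, the live state is pulled back to what Git declares."""
    if app.auto_sync:
        app.live_tag = app.git_tag


def rollback(app: App, known_good_tag: str) -> None:
    """Steps 1 and 2: disable auto-sync, then apply the known good tag."""
    app.auto_sync = False
    app.live_tag = known_good_tag
    reconcile(app)  # no effect while auto-sync is off


def resume(app: App, fixed_tag: str) -> None:
    """Step 3: once Git declares a good tag again, re-enable auto-sync."""
    app.git_tag = fixed_tag
    app.auto_sync = True
    reconcile(app)


if __name__ == "__main__":
    app = App(git_tag="bad", live_tag="bad")
    rollback(app, "good")
    print(app)
    resume(app, "fixed")
    print(app)
```

In this model, applying the known good tag without disabling auto-sync would be undone by the next reconcile, because Git still declares the bad tag.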

GitOps canary deployments Increases reliability of deployments

Canary Deployments with analysis
Argo Rollouts
• Plays well with ArgoCD
• Deployment strategies: Blue/Green, Canary
• Analysis based on Prometheus, DataDog, etc. metrics
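The idea of a metric-driven canary can be sketched in a few lines. This is an illustration of the control loop only, not how Argo Rollouts is configured; the traffic weights, the 0.99 threshold, and the `success_rate` callback (which stands in for a query against a metrics backend) are all assumptions.

```python
from typing import Callable, Sequence


def run_canary(
    weights: Sequence[int],
    success_rate: Callable[[int], float],
    threshold: float = 0.99,
) -> str:
    """Shift traffic to the new version step by step, analysing metrics at each step.

    `success_rate(weight)` returns the observed success ratio of the canary
    while it receives `weight` percent of the traffic.
    """
    for weight in weights:
        rate = success_rate(weight)
        if rate < threshold:
            # Analysis failed: send all traffic back to the previous version.
            return f"rolled back at {weight}%"
    return "promoted"


if __name__ == "__main__":
    # Hypothetical metric: the new version degrades once it gets half the traffic.
    print(run_canary([10, 50, 100], lambda w: 0.95 if w >= 50 else 1.0))
```

Because the analysis runs at every step, a bad release is caught while it serves only part of the traffic, and the rollback needs no human decision.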

Canary Deployments in action

Canary Deployments in action (2)
1. rollback started
2. previous healthy version re-deployed
3. new healthy version is live

Core monolith application deployments
• Deployments to production per day: 30
• Average duration of deployment: 20 minutes
• Running during non-peak: 1k pods
• Requests per second during the peak time: 170k

All applications deployments
• Deployments to production per day: 2000
• Average duration of deployment: 30 seconds
• Running during non-peak: 20k pods
• Kubernetes services: 650

Comparison: then and now
Good old days
• Difficult to manage services running on physical servers (resource isolation, dependency management)
• Semi-automatic deployments using ChatOps
Current solution
• Kubernetes centralizes and automates the management of application isolation, handling resources and dependencies efficiently across physical servers
• Fully automated deployments using GitOps workflow
• Automated rollbacks using canary rollouts

What’s next?

Thank you! Add me to LinkedIn!
