Slide 1

Increased velocity of code deployments with Kubernetes
Edgaras Apšega, Site Reliability Engineer
Build Stuff, 2023-11-16, Vilnius

Slide 2

Vinted infrastructure: peak load, transactions between users

Kubernetes production stats:
● All services running on Kubernetes*
● 20k+ running pods
● 700+ physical nodes (80k CPU cores; 330 TB memory)

*stateless services

Slide 3

Velocity of deployments: Vinted infrastructure

Slide 4

Velocity of deployments: 2000 per day (Vinted infrastructure)

Slide 5

pre-Kubernetes deployments

Slide 6

Capistrano
● Gathers physical server list from Consul SD
● Does rolling updates one-by-one (or in percentage batches)

pre-Kubernetes → Kubernetes → GitOps → Canary deployments

Slide 7

ChatOps

Slide 8

Pain points
● Service isolation issues
● Physical servers with different CPUs
● Ruby dependency management
● Custom in-house integrations

Slide 9

Kubernetes: best-in-class container orchestrator

Slide 10

Kubernetes architecture
● Physical Kubernetes nodes (AMD CPUs, 128 cores each)
● 1 cluster stretched across 3 data centers that are close to each other, with low latency between them
● Separate etcd cluster for the /events endpoint

Diagram: one Kubernetes cluster spanning Data centers #1-#3; each DC hosts a control plane VM, an etcd VM (#1-#3), an etcd /events VM (#1-#3), and Kubernetes physical nodes.
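Routing the high-churn events resource to its own etcd cluster is done with the kube-apiserver `--etcd-servers-overrides` flag. A minimal static-pod sketch of how that wiring might look (image version and etcd hostnames are illustrative assumptions, not taken from the slides):

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment, illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.28.0   # assumed version
    command:
    - kube-apiserver
    # main etcd cluster: stores every object except events
    - --etcd-servers=https://etcd-1:2379,https://etcd-2:2379,https://etcd-3:2379
    # override: send the noisy /events resource to a dedicated etcd cluster
    - --etcd-servers-overrides=/events#https://etcd-events-1:2379,https://etcd-events-2:2379,https://etcd-events-3:2379
```

Isolating events keeps their constant write churn from competing with the rest of the cluster state for etcd disk and network I/O.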

Slide 11

Kubernetes integration with the data center network: BGP for the win!

On the last Pi Day, Reddit had a 5-hour outage because route reflectors became unavailable after a Kubernetes cluster upgrade.

Our (simplified) architecture:
● iBGP from server to top-of-rack switch
● Data center network deployed as a BGP fabric for route exchange
● Every Kubernetes node (server) has a /24 network for its pods
● Service IPs advertised as anycast inside the data center

Diagram: spine/leaf fabric; each Kubernetes node (e.g. eth0 10.1.1.12/24, br0 10.100.23.254/24 with pods 10.100.23.15 and 10.100.23.123) advertises its pod /24 (e.g. 10.100.23.0/24) plus the anycast service range 10.200.1.0/24 to its leaf switch (e.g. 10.1.1.254/24).

Slide 12

GitOps deployments: allow faster, automated deployments to production

Slide 13

Continuous Integration: Git commit and push → Build → Test → Docker push → Git clone config repo → Update manifests
Continuous Deployment: Git clone config repo → Discover manifests → kubectl apply
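The Continuous Deployment half of this flow is what ArgoCD automates: an Application resource points at the config repo, and the controller continuously discovers and applies the manifests. A minimal sketch (application name, repo URL, and paths are illustrative assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: core-app                 # illustrative name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/config-repo.git   # assumed config repo
    targetRevision: main
    path: apps/core-app          # directory of manifests to discover
  destination:
    server: https://kubernetes.default.svc
    namespace: core-app
  syncPolicy:
    automated:
      prune: true        # delete cluster resources removed from git
      selfHeal: true     # revert manual drift back to the git state
```

With `automated` sync enabled, the "kubectl apply" step on the slide is performed by ArgoCD itself whenever CI pushes an updated manifest to the config repo.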

Slide 14

ArgoCD application view

Slide 15

Deployment velocity
● Average deployment duration is 20 minutes
● Core code releases to production every 20-30 minutes

Slide 16

What about rollbacks? ChatOps to the rescue!
● Disable auto-sync for the ArgoCD application
● Apply a known-good image tag
● Enable auto-sync for the ArgoCD application
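In the Application spec, these steps correspond to toggling the `syncPolicy` block around pinning the image. A rough sketch of the three states (image tag and registry are illustrative assumptions):

```yaml
---
# Step 1: disable auto-sync (Application spec fragment)
spec:
  syncPolicy: {}        # dropping `automated` stops ArgoCD from re-syncing
---
# Step 2: apply the known-good image tag, e.g. in the manifests:
#   image: registry.example.com/core-app:v1.41.0   # last healthy tag (illustrative)
---
# Step 3: re-enable auto-sync
spec:
  syncPolicy:
    automated: {}
```

Wrapping these steps in a ChatOps command keeps rollbacks one message away while the git history still records exactly what ran in production.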

Slide 17

GitOps canary deployments: increase the reliability of deployments

Slide 18

Canary deployments with analysis: Argo Rollouts
● Plays well with ArgoCD
● Deployment strategies: blue/green, canary
● Analysis based on metrics from Prometheus, Datadog, etc.
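A canary strategy with Prometheus-backed analysis can be sketched roughly as below; the app name, traffic weights, query, and thresholds are illustrative assumptions, not values from the talk:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: core-app                  # illustrative
spec:
  replicas: 10
  selector:
    matchLabels:
      app: core-app
  template:
    metadata:
      labels:
        app: core-app
    spec:
      containers:
      - name: core-app
        image: registry.example.com/core-app:v1.42.0   # assumed image
  strategy:
    canary:
      steps:
      - setWeight: 10             # shift 10% of pods to the new version
      - pause: {duration: 5m}
      - analysis:                 # gate further rollout on live metrics
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: {duration: 5m}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 1m
    successCondition: result[0] >= 0.99   # abort below 99% non-5xx responses
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus.monitoring:9090   # assumed Prometheus address
        query: |
          sum(rate(http_requests_total{app="core-app",status!~"5.."}[2m]))
          / sum(rate(http_requests_total{app="core-app"}[2m]))
```

If the analysis fails, Argo Rollouts aborts the canary and shifts traffic back to the stable ReplicaSet automatically, which is what makes rollbacks hands-off.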

Slide 19

Canary deployments in action

Slide 20

Canary deployments in action (2)
1. Rollback started
2. Previous healthy version re-deployed
3. New healthy version is live

Slide 21

Core monolith application deployments
● Deployments to production per day: 30
● Average deployment duration: 20 minutes
● Pods running during non-peak: 1k
● Requests per second at peak: 170k

Slide 22

All application deployments
● Deployments to production per day: 2000
● Average deployment duration: 30 seconds
● Pods running during non-peak: 20k
● Kubernetes services: 650

Slide 23

Comparison: then and now

Good old days:
● Difficult to manage services running on physical servers (resource isolation, dependency management)
● Semi-automatic deployments using ChatOps

Current solution:
● Kubernetes centralizes and automates application isolation, handling resources and dependencies efficiently across physical servers
● Fully automated deployments using a GitOps workflow
● Automated rollbacks using canary rollouts

Slide 24

What’s next?

Slide 25

Thank you! Add me on LinkedIn!