• PodDisruptionBudget (PDB) limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions (e.g., drain, autoscaling) • Works well with Deployment or ReplicaSet • Specify minAvailable or maxUnavailable as a number of Pods or a percentage • The application must be running on multiple Pods https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
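The bullets above could translate into a manifest like the following (a sketch; the PDB name and the `app: c` label are assumptions, and older clusters use `policy/v1beta1` instead of `policy/v1`):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: c-pdb            # hypothetical name
spec:
  minAvailable: 50%      # maxUnavailable or an absolute Pod count also work
  selector:
    matchLabels:
      app: c             # assumed label on the 'C' Pods
```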
• PDB for C: minAvailable: 50% (ALLOWED DISRUPTIONS: 1) • kubectl drain node-02 keeps at least two ‘C’ Pods available
[Diagram: Pods A, B, C spread across node-01, node-02, node-03; node-02 becomes SchedulingDisabled and its ‘C’ Pod is evicted and rescheduled onto another node]
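The “ALLOWED DISRUPTIONS: 1” figure follows from the rounding rule: with three ‘C’ Pods, minAvailable: 50% rounds up to two Pods that must stay available, leaving room for one voluntary disruption. A quick sketch of the arithmetic (integer-only ceiling division):

```shell
pods=3       # running 'C' Pods
percent=50   # minAvailable: 50%

# Kubernetes rounds a minAvailable percentage up; emulate ceil()
# with integer arithmetic: ceil(a/100) == (a + 99) / 100
min_required=$(( (pods * percent + 99) / 100 ))
allowed=$(( pods - min_required ))

echo "must keep: $min_required, allowed disruptions: $allowed"
```

With three Pods this prints `must keep: 2, allowed disruptions: 1`, matching the diagram.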
• Upgrading the master of a zonal cluster involves control plane downtime ◦ ~2 minutes ◦ kubectl stops working ◦ CronJobs are not triggered • A regional (HA masters) cluster can be upgraded with zero downtime ◦ Migrating an existing zonal cluster to regional is not supported...
• Nodes are upgraded one by one (the behavior of “automatic node upgrades”) • Make sure that cluster capacity is enough even when one node is down
[Diagram: node-01 v1.10, node-02 v1.10, node-03 v1.10 → node-01 v1.11, node-02 v1.10, node-03 v1.10]
• Create node pools with the new Kubernetes version • Migrate all Pods from old nodes to new nodes ◦ kubectl cordon ◦ kubectl drain • Delete the old node pools • Make sure that enough GCP resource capacity is prepared https://cloudplatform.googleblog.com/2018/06/Kubernetes-best-practices-upgrading-your-clusters-with-zero-downtime.html
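The cordon/drain step could be scripted roughly as follows (a sketch: the node names are hypothetical, and the commands are only printed here rather than executed against a cluster):

```shell
#!/bin/sh
# Hypothetical names of the nodes in the old pool
OLD_NODES="gke-old-pool-a gke-old-pool-b gke-old-pool-c"

# Cordon first so no new Pods land on the node, then drain it;
# drain respects PDBs, so evictions wait until they are allowed.
commands=$(for node in $OLD_NODES; do
  echo "kubectl cordon $node"
  echo "kubectl drain $node --ignore-daemonsets --delete-emptydir-data"
done)

printf '%s\n' "$commands"
```

Draining nodes one at a time, rather than all at once, keeps the PDBs satisfiable throughout the migration.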
1. Prepare enough GCP resources 2. Create PDBs for all microservices 3. Upgrade! a. Upgrade the GKE master (Terraform) b. Create new GKE node pools (Terraform) c. Migrate all Pods (script) d. Delete old GKE node pools (Terraform)
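Step 3b (creating the new node pools) might look roughly like this in Terraform (a sketch only; the pool name, cluster name, region, version string, and machine type are all assumptions):

```hcl
# Hypothetical node pool running the upgraded Kubernetes version
resource "google_container_node_pool" "new_pool" {
  name       = "pool-v1-11"          # assumed pool name
  cluster    = "my-cluster"          # assumed cluster name
  location   = "asia-northeast1"     # assumed region
  version    = "1.11.2-gke.9"        # example target node version
  node_count = 3

  node_config {
    machine_type = "n1-standard-2"   # assumed machine type
  }
}
```

Because the old and new pools are separate resources, deleting the old pool in step 3d is just removing its resource block and applying again.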
• Instead of one-off gcloud commands, we want to manage GKE (and all other cloud) resources on GitHub! ◦ git log as operation history ◦ changes reviewed by other team members ◦ resources modified from CI without manual operation