Why Should We Automate?
"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
Carla Geisser, Google SRE, Chapter 5 - Eliminating Toil
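O(N) Manual Operations Do Not Work with Service Growth
Managing Kubernetes clusters involves recovery from failures and upgrades. If that work grows O(N) as services grow, the clusters can no longer be managed.
Usage at Yahoo Japan (*): 50 clusters in 2018, 400 clusters in 2019 (x8)
Automation is Required
(*) From Cloud Native Days Tokyo 2019, "Yahoo's Cloud Native Initiatives and the System Development Supporting Them"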
Automation Evolution
Lv.01 No Automation
Lv.02 Externally Maintained System-specific Automation
Lv.03 Externally Maintained Generic Automation
Lv.04 Internally Maintained System-specific Automation
Lv.05 Systems That Don't Need Any Automation
Site Reliability Engineering, Chapter 7 - A Hierarchy of Automation Classes
Advantages of k8s #01
What's the Difference between the Procedural Model and the Reconciliation Model?
[Figure, four steps. Procedural Model: an operator runs a script against the server. Reconciliation Model: an operator updates the desired state, and a Reconciler observes the current state and reconciles it.]
Step 1: Both models start from Server v1 (Desire: v1, Current: v1).
Step 2: Running the script, or updating the desired state so that the Reconciler deploys, brings the server to v2.
Step 3: A failure occurs. In the Procedural Model someone has to run a separate fix script (Script: Fix B).
Step 4: In the Reconciliation Model the Reconciler observes the failure and reconciles, so recovery needs no extra script.
Reconciler Provides a Declarative API to Keep a System in Its Desired State
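The observe/reconcile cycle in the figure can be sketched in a few lines of Go. This is a minimal illustration using only the standard library, not the deck's implementation: the `State` type, the `reconcile` function, and the simulated failure are all hypothetical names introduced here.

```go
package main

import "fmt"

// State is a hypothetical, simplified view of the system: one server
// with a version and a health flag.
type State struct {
	Version string
	Healthy bool
}

// reconcile compares the observed state with the desired version and
// returns the resulting state together with the action it took.
func reconcile(desired string, current State) (State, string) {
	switch {
	case !current.Healthy:
		// Failure observed: recover by recreating the desired version.
		return State{Version: desired, Healthy: true}, "recover"
	case current.Version != desired:
		// Drift observed: deploy the desired version.
		return State{Version: desired, Healthy: true}, "deploy"
	default:
		// Current already matches desired: nothing to do.
		return current, "noop"
	}
}

func main() {
	current := State{Version: "v1", Healthy: true}
	desired := "v2" // the operator only updates the desired state

	for step := 0; step < 3; step++ {
		if step == 1 {
			current.Healthy = false // simulate a failure after the deploy
		}
		var action string
		current, action = reconcile(desired, current)
		fmt.Printf("step %d: action=%s current=%s\n", step, action, current.Version)
	}
}
```

The same function handles both the rollout and the failure, which is the point of the slide: no separate fix script is written for the failure case.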
Advantages of k8s #02
Custom Resource Definition (CRD)
A feature for defining your own resources in the Kubernetes API. A resource defined with a CRD is called a Custom Resource (CR).
CRD Defines Custom API: The Automation Platform Can Be Further Extended
API Server Endpoints
API: /apis/apps/v1/namespaces/default/deployments
Extended API: /apis///namespaces/default/
[Figure: platforms call the API and the extended API on the API server.]
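As a small illustration of how a CRD extends the endpoint list, the sketch below builds the REST path of a namespaced resource collection from its API group, version, namespace, and plural name. The helper `resourcePath` is hypothetical, and the `example.com/v1alpha1` group and `mysqls` resource are made-up values, not ones taken from the deck.

```go
package main

import "fmt"

// resourcePath builds the REST path of a namespaced resource collection
// that belongs to a named API group. The helper itself is hypothetical.
func resourcePath(group, version, namespace, plural string) string {
	return fmt.Sprintf("/apis/%s/%s/namespaces/%s/%s", group, version, namespace, plural)
}

func main() {
	// Built-in resource, matching the endpoint shown on the slide.
	fmt.Println(resourcePath("apps", "v1", "default", "deployments"))

	// Custom resource: group, version, and plural name are illustrative
	// assumptions, standing in for whatever a CRD registers.
	fmt.Println(resourcePath("example.com", "v1alpha1", "default", "mysqls"))
}
```

Once a CRD is registered, its resources are served under the same path scheme as built-in ones, so existing clients and tooling can address them in the same way.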
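Design Automation Platform
Which Resource to Observe and How to Reconcile
Because Kubernetes provides the API server and clients, you do not need to develop them yourself. You can concentrate on the business logic: the Reconciler and the design of the CRD it processes.
[Figure: the CRD is registered with the API Server; the Reconciler observes the CR through the API Server, reconciles, and calls the API.]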
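Reliability: Leader Election
Just set the leader election (LE) configuration on the Manager:

ctrlOpts := ctrl.Options{
    LeaderElection:          leConfig,
    LeaderElectionNamespace: leNamespace,
    LeaderElectionID:        leName,
}

Optimistic Resource Lock
Only the leader starts its controllers; the remaining followers stay on hot standby.
Pay attention to the permissions on the ConfigMap (CM).
[Figure: three Operators try to update the leader ConfigMap through the API Server. One succeeds in updating the CM and becomes the leader; the others are rejected because their ResourceVersion is too old.]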
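The optimistic lock in the figure can be sketched as a compare-and-update on a version number. This is a simplified, standard-library-only model and not controller-runtime's implementation: `lockRecord` is a hypothetical stand-in for the leader ConfigMap, and the integer version is an assumption made for brevity (a real Kubernetes resourceVersion is an opaque string).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrConflict is returned when the caller's version is stale.
var ErrConflict = errors.New("conflict: resourceVersion is too old")

// lockRecord is a hypothetical stand-in for the ConfigMap that stores
// the leader record.
type lockRecord struct {
	mu              sync.Mutex
	resourceVersion int
	holder          string
}

// Get returns the current holder and the version the caller has seen.
func (l *lockRecord) Get() (string, int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.holder, l.resourceVersion
}

// Update writes the new holder only if seenVersion is still current,
// then bumps the version so that every other pending writer conflicts.
func (l *lockRecord) Update(holder string, seenVersion int) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if seenVersion != l.resourceVersion {
		return ErrConflict
	}
	l.holder = holder
	l.resourceVersion++
	return nil
}

func main() {
	lock := &lockRecord{}

	// All three operators read the record at the same version.
	_, seen := lock.Get()

	for _, name := range []string{"operator-a", "operator-b", "operator-c"} {
		if err := lock.Update(name, seen); err != nil {
			fmt.Printf("%s: stays follower (%v)\n", name, err)
			continue
		}
		fmt.Printf("%s: became leader\n", name)
	}
}
```

Only the first writer wins; the others hold a stale version and remain followers until they re-read the record and the lease allows another attempt.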
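How to Set Up a Kubernetes Cluster for Testing?
Container (Kind): A tool developed for testing Kubernetes that starts each node as a single container. It makes multi-node and multi-cluster configurations easy to test.
Full Cluster (Your Real Platform): Testing in a real environment is costly, but features that cannot be verified when nodes are containers have to be checked there, so it is used in combination with the other two.
API server (Testing Framework): Starts only two binaries, etcd and the API server, so it is lightweight, but it can verify behavior only at the "API level". There is no ControllerManager or Kubelet, so the tests resemble unit tests.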
Event Recorder: To Understand Internal Behavior
Events are collected and shown by kubectl get event, and kubectl describe lists them under the Events field of the resource they belong to.
They are handy for getting a rough picture of behavior: collecting Events during tests as the upstream Kubernetes e2e tests do, checking them for a first lead when a failure occurs, or watching how resources change while a cluster is being created.

$ kubectl get event
LAST SEEN  TYPE    REASON             OBJECT                       MESSAGE
5s         Normal  Scheduled          pod/cndt-cbb75cdc5-mws7l     Successfully assigned default/cndt-cbb75cdc5-mws7l to worker3
4s         Normal  Pulling            pod/cndt-cbb75cdc5-mws7l     Pulling image "gcr.io/hello-minikube-zero-install/hello-node"
5s         Normal  SuccessfulCreate   replicaset/cndt-cbb75cdc5    Created pod: cndt-cbb75cdc5-mws7l
5s         Normal  ScalingReplicaSet  deployment/cndt              Scaled up replica set cndt-cbb75cdc5 to 1
Operator Metrics: To Understand Internal Behavior
The metrics needed to operate a controller, such as Workqueue queuing and errors, are already provided by controller-runtime, so exposing them only requires adding settings to the Manager.
The metrics data is in the Prometheus metric format.

$ curl http://localhost:8080/metrics
# HELP controller_runtime_reconcile_errors_total Total number of reconcile errors per controller
# TYPE controller_runtime_reconcile_errors_total counter
controller_runtime_reconcile_errors_total{controller="mysql-controller"} 10
# HELP controller_runtime_reconcile_queue_length Length of reconcile queue per controller
# TYPE controller_runtime_reconcile_queue_length gauge
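To show how this text output can be consumed, here is a rough Go sketch that reads sample lines from the Prometheus text format and looks up the reconcile error counter. The `parseSamples` helper is hypothetical and deliberately simplified: it assumes samples carry no trailing timestamp, and it is not a replacement for a real Prometheus client or parser.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples reads Prometheus text-format output and returns a map
// from series (metric name plus labels) to value. Comment lines such as
// "# HELP" and "# TYPE" are skipped. Simplification: each sample line is
// assumed to end with its value, with no trailing timestamp.
func parseSamples(text string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		i := strings.LastIndex(line, " ")
		if i < 0 {
			return nil, fmt.Errorf("malformed sample: %q", line)
		}
		v, err := strconv.ParseFloat(line[i+1:], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[line[:i]] = v
	}
	return samples, sc.Err()
}

func main() {
	// Sample taken from the curl output on the slide.
	out := `# HELP controller_runtime_reconcile_errors_total Total number of reconcile errors per controller
# TYPE controller_runtime_reconcile_errors_total counter
controller_runtime_reconcile_errors_total{controller="mysql-controller"} 10
`
	samples, err := parseSamples(out)
	if err != nil {
		panic(err)
	}
	key := `controller_runtime_reconcile_errors_total{controller="mysql-controller"}`
	fmt.Printf("reconcile errors: %v\n", samples[key])
}
```

In practice a Prometheus server scrapes the endpoint directly; a parser like this is only useful for quick checks in tests or scripts.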