Slide 1

Slide 1 text


Extending Kubernetes to Automate Day-to-Day Operations
Cloud Native Days Tokyo 2019
Aya Igarashi @Ladicle

Slide 2

Slide 2 text

@Ladicle Software Engineer

Slide 3

Slide 3 text

Z Lab Automates Kubernetes Cluster Management
We use Kubernetes to automate operations on Kubernetes clusters, such as creation, deletion, upgrades, and failure recovery.

Slide 4

Slide 4 text

Why Should We Automate?

Slide 5

Slide 5 text

Why Should We Automate?
"If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow."
Carla Geisser, Google SRE, Chapter 5 - Eliminating Toil

Slide 6

Slide 6 text

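O(N) Manual Operations Do Not Work with Service Growth
Managing Kubernetes clusters involves failure recovery and upgrades. If the manual work grows at O(N) along with the service, it becomes unmanageable.
Automation is Required
Usage at Yahoo! JAPAN (※): 50 clusters in 2018, 400 clusters in 2019 (x8)
※ From the Cloud Native Days Tokyo 2019 talk on Yahoo! JAPAN's cloud native efforts and the system development that supports them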

Slide 7

Slide 7 text

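The Profit of Automation
Faster: Running a program is faster than a human copy-pasting through a runbook, and it has no time constraints such as late-night work.
Scalability: Programs excel at fast, repetitive processing, so automation scales with system growth more easily than manual work.
Reliability: Humans make mistakes, so repeating the same task accurately enough to maintain reliability is close to impossible.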

Slide 8

Slide 8 text

Why Kubernetes?

Slide 9

Slide 9 text

Automation Evolution
Lv.01 No Automation
Lv.02 Externally Maintained System-specific Automation
Lv.03 Externally Maintained Generic Automation
Lv.04 Internally Maintained System-specific Automation
Lv.05 Systems That Don’t Need Any Automation
Site Reliability Engineering, Chapter 7 - A Hierarchy of Automation Classes

Slide 10

Slide 10 text

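Platform Style Automation Makes Your System Robust
Observability: Collecting metrics from scripts is hard, but with platform-style automation you can use your existing monitoring system, which helps you understand the internal state of the system.
Testability: Deploy scripts tend not to be tested frequently, but automation built as a platform can be tested like any other software.
Maintainability: Because the automation logic is part of the system, the design cannot drift away from the automation scripts, and the logic keeps being maintained.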

Slide 11

Slide 11 text

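2 Reasons Why You Should Customize Kubernetes For Automation
#01 Reconciliation Model: It lets you build autonomous systems and provide a declarative API rather than a procedural one.
#02 Custom Resource Definition: Kubernetes offers rich extension mechanisms, starting with CRDs, which makes it convenient as a platform for building platforms.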

Slide 12

Slide 12 text

Advantages of k8s #01
What’s the Difference between the Procedural Model and the Reconciliation Model?
[Diagram: Procedural Model (Run Script, Server v1) next to Reconciliation Model (Update Desired State, Reconciler, Observe, Desired and Current state, Server v1).]

Slide 13

Slide 13 text

Advantages of k8s #01
What’s the Difference between the Procedural Model and the Reconciliation Model?
[Diagram: Procedural Model (Run Script, Server v2) next to Reconciliation Model (Update Desired State, Reconciler, Observe, Reconcile, Deploy, Server v2).]

Slide 14

Slide 14 text

Advantages of k8s #01
What’s the Difference between the Procedural Model and the Reconciliation Model?
[Diagram: a Failure occurs in both models. Procedural Model (Run Script, Script: Fix B) next to Reconciliation Model (Reconciler, Observe, Desired and Current state, Server v2).]

Slide 15

Slide 15 text

Advantages of k8s #01
The Reconciler Provides a Declarative API to Keep a System in Its Desired State
[Diagram: Procedural Model (Run Script, Script: Fix B) next to Reconciliation Model (Reconciler, Observe, Reconcile, Recovery, Desired and Current state).]
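The contrast in the four slides above can be sketched in a few lines of Go. This is only an illustration under assumptions: the State type and the reconcile function are hypothetical names, not code from the talk or from Kubernetes. It shows that one idempotent function handles both a version change and a failure, so no separate fix script is needed.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the observed server.
type State struct {
	Version string
	Healthy bool
}

// reconcile compares the desired version with the current state and returns
// the corrected state plus the action taken. Calling it again when nothing
// has drifted is a no-op.
func reconcile(desired string, current State) (State, string) {
	switch {
	case !current.Healthy:
		return State{Version: desired, Healthy: true}, "recover"
	case current.Version != desired:
		return State{Version: desired, Healthy: true}, "deploy"
	default:
		return current, "noop"
	}
}

func main() {
	current := State{Version: "v1", Healthy: true}
	desired := "v2" // the user only updates the desired state

	var action string
	current, action = reconcile(desired, current)
	fmt.Println(action, current.Version)

	// A failure is repaired by the same loop.
	current.Healthy = false
	current, action = reconcile(desired, current)
	fmt.Println(action, current.Version)

	current, action = reconcile(desired, current)
	fmt.Println(action, current.Version)
}
```

In a real controller the function would be triggered by watch events rather than called by hand, but the observe, compare, act shape stays the same.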

Slide 16

Slide 16 text

Advantages of k8s #02
Custom Resource Definition (CRD)
A feature for defining your own resources in the Kubernetes API. A resource defined through a CRD is called a custom resource (CR).
API Server Endpoints:
/apis/apps/v1/namespaces/default/deployments
/apis///namespaces/default/
CRD Defines Custom API
The Automation Platform Can Be Further Extended
[Diagram: Platforms call the API and the Extended API.]
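As a small illustration of how a custom resource gets an endpoint of the same shape as a built-in one, the sketch below builds the REST path for a namespaced resource in a named API group. The helper name resourcePath and the example.com/v1alpha1 chaossets resource are assumptions made up for this example; only the deployments path comes from the slide.

```go
package main

import (
	"fmt"
	"path"
)

// resourcePath builds the REST path of a namespaced resource collection that
// belongs to a named API group.
func resourcePath(group, version, namespace, plural string) string {
	return path.Join("/apis", group, version, "namespaces", namespace, plural)
}

func main() {
	// Built-in resource, as shown on the slide.
	fmt.Println(resourcePath("apps", "v1", "default", "deployments"))
	// Hypothetical custom resource; group, version, and plural are made up.
	fmt.Println(resourcePath("example.com", "v1alpha1", "default", "chaossets"))
}
```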

Slide 17

Slide 17 text

Z Lab Automates Kubernetes Cluster Management Using k8s
[Diagram: Kubernetes Cluster CRD and CR, Reconcilers, API Server; arrows labeled Register, Observe, Reconcile, and Call API.]

Slide 18

Slide 18 text

How Do We Customize Kubernetes?

Slide 19

Slide 19 text

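Development Flow
#01 Design: Understand what Kubernetes guarantees and how far those guarantees go, and understand the characteristics of the reconciliation loop.
#02 Implementation: Development tooling is maturing, so make active use of frameworks. Kubernetes itself is good sample code.
#03 Testing: As in ordinary software development, choose the implementation approach and the test environment according to the purpose of each test.
#04 Maintenance: Many features, such as version upgrades, are still evolving, so keep up with updates to Kubernetes and the surrounding products.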

Slide 20

Slide 20 text

#01 Design
Understand what Kubernetes guarantees and how far those guarantees go, and understand the characteristics of the reconciliation loop.

Slide 21

Slide 21 text

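Design Automation Platform
Which Resource to Observe and How to Reconcile
[Diagram: CRD, CR, Reconciler, API Server; arrows labeled Register, Observe, Reconcile, and Call API.]
With Kubernetes you do not need to develop an API server or a client, so you can concentrate on the Reconciler, which holds the business logic, and on the design of the CRD it processes.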

Slide 22

Slide 22 text

Design CRD and Reconciler
Example: Scaling the Number of RS replicas from 1 to 2
[Diagram: the ReplicaSet Controller observes ReplicaSet & Pod through the API Server and reconciles.]

    apiVersion: apps/v1
    kind: ReplicaSet
    metadata:
      name: sample
    spec:
      replicas: 2
      …
    status:
      replicas: 1
      …

    apiVersion: v1
    kind: Pod
    metadata:
      name: sample-6f477f….
      …
    spec:
      containers:

1. Analyse Resource
2. Create Pod
3. Update .Status

In other words, you only need to design the following:
- which resource is the target of reconciliation
- which resources to watch as triggers for reconciliation
- how the resource is reconciled
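The three numbered steps above can be sketched as follows. This is a simplified model under assumptions, not the real ReplicaSet controller: the ReplicaSet struct, the pod naming, and the reconcileReplicas function are hypothetical, and scale-down is left out.

```go
package main

import "fmt"

// ReplicaSet is a simplified, hypothetical model of the resource on the slide.
type ReplicaSet struct {
	Name           string
	SpecReplicas   int // .spec.replicas (desired)
	StatusReplicas int // .status.replicas (observed)
}

// reconcileReplicas follows the three steps: analyse the resource, create the
// missing Pods, and update .status. It returns the resulting Pod names.
func reconcileReplicas(rs *ReplicaSet, pods []string) []string {
	// 1. Analyse: how many Pods are missing?
	missing := rs.SpecReplicas - len(pods)
	// 2. Create Pods until the observed count matches the spec.
	for i := 0; i < missing; i++ {
		pods = append(pods, fmt.Sprintf("%s-%d", rs.Name, len(pods)))
	}
	// 3. Update .status with what now exists.
	rs.StatusReplicas = len(pods)
	return pods
}

func main() {
	rs := &ReplicaSet{Name: "sample", SpecReplicas: 2, StatusReplicas: 1}
	pods := reconcileReplicas(rs, []string{"sample-0"})
	fmt.Println(pods, rs.StatusReplicas)
}
```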

Slide 23

Slide 23 text

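CRD Design Tips
Organize the data needed for reconciliation
A CRD lets you define any schema you like, but unless you define .Spec/.Status you miss out on many of the benefits Kubernetes provides. The structure of the standard resources is a useful reference, so it pays to understand their schemas and characteristics.

Values for Analysis and Action: Resource values fall roughly into two groups: values used to detect drift from the definition through the difference between .Spec and .Status, and options used only when acting to correct that drift.
Changeability: Options are easier to change when they are kept in a structured form.
Condition: Conditions change often, so collect them in an array.

    status:
      conditions:
      - lastProbeTime: null
        lastTransitionTime: "2019-05-01T15:05:05Z"
        status: "True"
        type: Initialized
      - lastProbeTime: null
        lastTransitionTime: "2019-05-01T15:32:41Z"
        status: "True"
        type: Ready
      - lastProbeTime: null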
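A minimal sketch of these tips as Go types, assuming a hypothetical Gopher resource (all type and field names here are invented for illustration). Drift-detection values and action-only options are separated in the spec, and conditions live in an array that is updated through one helper.

```go
package main

import "fmt"

// Condition is one entry of .status.conditions.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	LastTransitionTime string // RFC 3339 timestamp, as in the YAML above
}

// GopherSpec separates drift-detection values from action-only options.
type GopherSpec struct {
	Replicas int           // compared with the status to detect drift
	Options  GopherOptions // used only while correcting drift
}

// GopherOptions is kept as its own struct so it can grow without
// touching the drift-detection fields.
type GopherOptions struct {
	Image string
}

// GopherStatus holds the observed state.
type GopherStatus struct {
	Replicas   int
	Conditions []Condition
}

// setCondition updates the condition of the given type or appends it, and
// changes LastTransitionTime only when the status actually flips.
func setCondition(conds []Condition, typ, status, now string) []Condition {
	for i := range conds {
		if conds[i].Type == typ {
			if conds[i].Status != status {
				conds[i].Status = status
				conds[i].LastTransitionTime = now
			}
			return conds
		}
	}
	return append(conds, Condition{Type: typ, Status: status, LastTransitionTime: now})
}

func main() {
	spec := GopherSpec{Replicas: 2, Options: GopherOptions{Image: "gopher:1.0"}}
	st := GopherStatus{Replicas: 1}
	ready := "False"
	if st.Replicas == spec.Replicas {
		ready = "True"
	}
	st.Conditions = setCondition(st.Conditions, "Ready", ready, "2019-05-01T15:05:05Z")
	fmt.Printf("%+v\n", st.Conditions)
}
```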

Slide 24

Slide 24 text

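Reconciler Design Tips
What to reconcile, when, and how
Limit which Reconcilers process a given .spec/.status, and as a rule keep each Reconciler's target to a single resource to reduce dependencies. When related resources have to be referenced, design the Labels and the .spec schema so that the amount of computation stays small.
[Diagram: ChaosCron Reconciler, ChaosSet Reconciler, and Chaos Reconciler; resources ChaosCron (A), ChaosSet (B), and Chaos (C); labels ChaosCron=A and ChaosSet=B; arrows labeled Create .Spec and Update .Status.]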

Slide 25

Slide 25 text

How to Delete Resources Correctly?
Handle termination based on Metadata, not on .Spec.

Deletion Propagation Policy
Background (GC default), Parent and Child linked by Owner Reference:
1. Delete Object
2. Delete Child
Foreground (Finalizer), Parent and Child:
1. Delete Object
2. Finalizer added by API Server
3. Delete Child and Remove Finalizer
4. Deleted

- Set an OwnerReference on objects so that the k8s GC can handle them.
- Use a Finalizer for objects whose children are external resources that the k8s GC does not cover.
- Finalizers on standard k8s resources are validated, so take care with naming: / (ex. ladicle.com/gopher-cleanup)
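The Finalizer flow for external resources might look like the sketch below. It is a stdlib-only model under assumptions: the Object type, the DeletionRequested flag (standing in for a set metadata.deletionTimestamp), and reconcileDeletion are hypothetical; only the finalizer name comes from the slide.

```go
package main

import (
	"errors"
	"fmt"
)

// Finalizer name taken from the slide's example.
const cleanupFinalizer = "ladicle.com/gopher-cleanup"

var errCleanup = errors.New("external cleanup failed")

// Object is a hypothetical stand-in for object metadata.
type Object struct {
	Finalizers        []string
	DeletionRequested bool // stands in for a set metadata.deletionTimestamp
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	out := make([]string, 0, len(list))
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}

// reconcileDeletion decides from metadata alone what to do. It returns true
// once no finalizer blocks removal of the object.
func reconcileDeletion(obj *Object, cleanup func() error) (bool, error) {
	if !obj.DeletionRequested {
		// Normal path: make sure our finalizer is registered.
		if !contains(obj.Finalizers, cleanupFinalizer) {
			obj.Finalizers = append(obj.Finalizers, cleanupFinalizer)
		}
		return false, nil
	}
	if contains(obj.Finalizers, cleanupFinalizer) {
		if err := cleanup(); err != nil {
			return false, err // keep the finalizer so cleanup is retried
		}
		obj.Finalizers = remove(obj.Finalizers, cleanupFinalizer)
	}
	return len(obj.Finalizers) == 0, nil
}

func main() {
	obj := &Object{}
	done, _ := reconcileDeletion(obj, func() error { return nil })
	fmt.Println(done, obj.Finalizers)

	obj.DeletionRequested = true
	_, err := reconcileDeletion(obj, func() error { return errCleanup })
	fmt.Println(err, obj.Finalizers)

	done, _ = reconcileDeletion(obj, func() error { return nil })
	fmt.Println(done, obj.Finalizers)
}
```

Keeping the finalizer when cleanup fails is the important design choice: the object stays visible until the external resource is really gone.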

Slide 26

Slide 26 text

#02 Implementation
Development tooling is maturing, so make active use of frameworks. Kubernetes itself is good sample code.

Slide 27

Slide 27 text

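Libraries and Tools for Developing Reconciler and CR
[Diagram: Kubebuilder, controller-tools, controller-runtime, client-go.]
client-go: The base client library for k8s. controller-runtime uses its cache-related functionality.
controller-runtime: A Kubebuilder subproject. A library of the boilerplate needed to develop a Reconciler.
controller-tools: Also a Kubebuilder subproject. A generator for the manifests and code used by a Reconciler.
Kubebuilder: An SDK that generates everything needed for development, such as code scaffolding, build files, and test environment setup scripts.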

Slide 28

Slide 28 text

controller-runtime Overview
[Diagram: API Server; Operator containing a Manager with a Cache (KV) and Controllers, each with a Reconciler; events (Evt).]

Slide 29

Slide 29 text

Reliability Scalability Operator Development Tips

Slide 30

Slide 30 text

Scalability: Filter Reconciliation Target
[Diagram: Controller with Watchers; each Watcher consists of Source, Event Handler, and Predicate; objects (Obj) and events (Evt) become requests (Req); For( ) and Owns( ) set up the default Watchers.]

    …
    ctrl.NewControllerManagedBy(mgr).
        // Set the default Watcher for the reconciled resource
        For(&event.CNDT{}).
        // Set the default Watcher for owned resources
        Owns(&event.Session{}).
        // Set a custom Watcher
        Watches(
            &source.Channel{Source: events},
            &handler.EnqueueRequestsFromMapFunc{
                ToRequests: handler.ToRequestsFunc(
                    func(a handler.MapObject) []reconcile.Request {
                        // Ignore objects that do not carry the expected label
                        if a.Meta.GetLabels()["year"] != "2019" {
                            return nil
                        }
                        return []reconcile.Request{
                            {NamespacedName: types.NamespacedName{
                                Name:      a.Meta.GetLabels()["cndt"],
                                Namespace: a.Meta.GetNamespace(),
                            }},
                        }
                    }),
            },
            &predicate.Funcs{

Slide 31

Slide 31 text

Scalability: Set an appropriate number of workers, and choose a requeue method
[Diagram: Watchers with Event Handlers put requests (Req) into the Controller's Workqueue (Deduplication & RateLimited); Workers pass each request to the Reconciler Handler; requests can be requeued.]
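The deduplication noted on the Workqueue can be illustrated with a small stdlib-only queue. This is a single-goroutine sketch under assumptions: the Queue type and its methods are hypothetical, and it leaves out rate limiting and the concurrency handling a real work queue needs.

```go
package main

import "fmt"

// Queue is a FIFO of request keys in which a key that is already waiting is
// not added a second time.
type Queue struct {
	items   []string
	waiting map[string]bool
}

func NewQueue() *Queue {
	return &Queue{waiting: map[string]bool{}}
}

// Add enqueues key unless it is already waiting.
func (q *Queue) Add(key string) {
	if q.waiting[key] {
		return
	}
	q.waiting[key] = true
	q.items = append(q.items, key)
}

// Get removes and returns the oldest key, or false if the queue is empty.
func (q *Queue) Get() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	delete(q.waiting, key)
	return key, true
}

// Len reports how many keys are waiting.
func (q *Queue) Len() int { return len(q.items) }

func main() {
	q := NewQueue()
	// Three events for the same object collapse into one request.
	q.Add("default/cndt")
	q.Add("default/cndt")
	q.Add("default/cndt")
	q.Add("default/other")
	fmt.Println(q.Len())
}
```

Because many events for one object collapse into a single request, the Reconciler should read the latest state itself instead of relying on the content of the event.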

Slide 32

Slide 32 text

Scalability & Reliability: Error and Retry
[Diagram: Workers pass requests (Req) to the Reconciler Handler; the result decides how the request is requeued.]
Requeue Method and the condition that selects it:
- RateLimited: Error != nil, or Error == nil && Requeue = True
- After X duration (Reset Counter): Error == nil && RequeueAfter = X
- Delete from WorkQueue: Else
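The requeue rules above can be written as a single decision function. This is a sketch under assumptions: Result only mirrors the Requeue and RequeueAfter fields of controller-runtime's reconcile.Result, and requeueAction and its return strings are hypothetical names for illustration. When both Requeue and RequeueAfter are set, the sketch lets RequeueAfter win, which is an assumption rather than something the slide states.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result mirrors the Requeue and RequeueAfter fields a Reconciler returns.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

var errTransient = errors.New("transient failure")

// requeueAction maps a Reconciler's return values to the requeue method.
func requeueAction(res Result, err error) string {
	switch {
	case err != nil:
		return "rate-limited"
	case res.RequeueAfter > 0:
		return "after-duration"
	case res.Requeue:
		return "rate-limited"
	default:
		return "forget"
	}
}

func main() {
	fmt.Println(requeueAction(Result{}, errTransient))
	fmt.Println(requeueAction(Result{Requeue: true}, nil))
	fmt.Println(requeueAction(Result{RequeueAfter: 30 * time.Second}, nil))
	fmt.Println(requeueAction(Result{}, nil))
}
```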

Slide 33

Slide 33 text

Reliability: Leader Election
Just set LE configuration to Manager

    ctrlOpts := ctrl.Options{
        LeaderElection:          leConfig,
        LeaderElectionNamespace: leNamespace,
        LeaderElectionID:        leName,
    }

Optimistic Resource Lock
[Diagram: three Operators try to update a ConfigMap through the API Server. The Leader succeeds in updating the CM; the others fail with "ResourceVersion is too Old".]
Only the Leader starts its Controllers; the remaining Followers act as hot standbys.
Make sure the Operator has the required permissions on the ConfigMap (CM).
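The optimistic lock in the diagram can be modeled with a version check on update. This is a stdlib-only sketch under assumptions: ConfigMap, Store, and tryAcquire are hypothetical stand-ins for the API server behavior, and lease renewal and expiry are left out.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("resourceVersion is too old")

// ConfigMap is a hypothetical lock record.
type ConfigMap struct {
	ResourceVersion int
	Leader          string
}

// Store stands in for the API server: an update succeeds only if it is based
// on the latest resource version.
type Store struct {
	cm ConfigMap
}

func (s *Store) Get() ConfigMap { return s.cm }

func (s *Store) Update(cm ConfigMap) error {
	if cm.ResourceVersion != s.cm.ResourceVersion {
		return errConflict
	}
	cm.ResourceVersion++
	s.cm = cm
	return nil
}

// tryAcquire attempts to become leader based on a previously read copy.
func tryAcquire(s *Store, seen ConfigMap, id string) bool {
	if seen.Leader != "" {
		return false
	}
	seen.Leader = id
	return s.Update(seen) == nil
}

func main() {
	s := &Store{}
	// Both operators read the same version before either one writes.
	a, b := s.Get(), s.Get()
	fmt.Println(tryAcquire(s, a, "operator-a"))
	fmt.Println(tryAcquire(s, b, "operator-b"))
	fmt.Println(s.Get().Leader)
}
```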

Slide 34

Slide 34 text

Syntax and Semantic Validation Reliability KubeCon EU 2018 – Sig API Machinery Deep

Slide 35

Slide 35 text

#03 Testing
As in ordinary software development, choose the implementation approach and the test environment according to the purpose of each test.

Slide 36

Slide 36 text

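2 Testing Methods
End-to-End Test (Ginkgo & Gomega): End-to-end testing checks whether an application's flow runs as designed from start to finish. Ginkgo is a BDD-style Go testing framework, typically used together with the Gomega matcher library. controller-runtime provides libraries for building an e2e test environment (binary-based) and for loading kubeconfig.
Unit-Test (FakeClient): Write ordinary Go unit tests. Replace the Kubernetes API client passed to the Reconciler with a fakeClient, then use it to set up the expected API actions and to check the state of resources after the action. (Finalizers and similar features are not supported.)
Kubebuilder discourages the FakeClient approach because it is hard to maintain.
Operator SDK uses a two-tier setup of unit tests and e2e tests.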

Slide 37

Slide 37 text

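How to Set Up a Kubernetes Cluster for Testing?
Container (Kind): A tool developed for testing Kubernetes that starts each node as one container. It makes it easy to test multi-node and multi-cluster configurations.
Full Cluster (Your Real Platform): Testing in a real environment is costly, but it is combined with the other two approaches to check features that cannot be verified when nodes are containers and to confirm behavior in the real environment.
API server (Testing Framework): Only two binaries, etcd and the API server, are started, so it is lightweight, but it can only verify behavior at the "API level". There is no ControllerManager and no Kubelet, so the tests behave like unit tests.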

Slide 38

Slide 38 text

#04 Maintenance
Many features, such as version upgrades, are still evolving, so keep up with updates to Kubernetes and the surrounding products.

Slide 39

Slide 39 text

Event Recorder: To Understand Internal Behavior
Events are what kubectl get event shows, and what kubectl describe collects and displays in the Events field of the relevant resource.
They are handy for getting a rough picture of behavior: collecting Events during tests as the upstream k8s e2e tests do, checking them to get a first lead when a failure occurs, or watching how resources change while a cluster is being created.

    $ kubectl get event
    LAST SEEN  TYPE    REASON             OBJECT                     MESSAGE
    5s         Normal  Scheduled          pod/cndt-cbb75cdc5-mws7l   Successfully assigned default/cndt-cbb75cdc5-mws7l to worker3
    4s         Normal  Pulling            pod/cndt-cbb75cdc5-mws7l   Pulling image "gcr.io/hello-minikube-zero-install/hello-node"
    5s         Normal  SuccessfulCreate   replicaset/cndt-cbb75cdc5  Created pod: cndt-cbb75cdc5-mws7l
    5s         Normal  ScalingReplicaSet  deployment/cndt            Scaled up replica set cndt-cbb75cdc5 to 1

Slide 40

Slide 40 text

Operator Metrics: To Understand Internal Behavior
controller-runtime already provides what is needed to expose the metrics required for operating a Controller, such as the number of items queued in the Workqueue and the number of errors, so you only need to add a setting to the Manager.
Metrics are exposed in the Prometheus Metric Format.

    $ curl http://localhost:8080/metrics
    # HELP controller_runtime_reconcile_errors_total Total number of reconcile errors per controller
    # TYPE controller_runtime_reconcile_errors_total counter
    controller_runtime_reconcile_errors_total{controller="mysql-controller"} 10
    # HELP controller_runtime_reconcile_queue_length Length of reconcile queue per controller
    # TYPE controller_runtime_reconcile_queue_length gauge

Slide 41

Slide 41 text

KubeCon EU 2018 – Sig API Machinery Deep Operator Upgrade

Slide 42

Slide 42 text

How to Deploy the Automated Platform?
[Diagram: Operators are packaged into an Operator Bundle, which deploys the Automated Platform.]

Slide 43

Slide 43 text

Conclusion
How do you extend Kubernetes to automate day-to-day operations?

Slide 44

Slide 44 text

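Customize Kubernetes For Automation
- Automation is necessary to keep up with system growth.
- Platform-style automation makes the system more robust and opens the way to further automation.
- For this kind of automation, Kubernetes, which adopts the Reconciliation Model and has rich extension mechanisms such as CRDs, is an excellent platform for building platforms.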

Slide 45

Slide 45 text

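Designing Automation Platform for RSM

Reliability
✓ Watch and reconcile appropriately so that the system can self-heal
✓ Configure appropriate retries
✓ Use Finalizers to terminate resources cleanly
✓ Run tests
✓ Configure leader election (LE) to prepare for failures

Scalability
✓ Narrow down what the Reconciler watches and when it runs
✓ Split Managers according to the watched resources to make good use of the cache
✓ Set an appropriate number of Workers on each Controller
✓ Keep the reconciliation loop simple

Maintainability
✓ Use an SDK
✓ Set up EventRecorder and Metrics to keep track of state
✓ Keep backward compatibility with a ConversionWebhook
✓ Include the platform's own release and deployment in the automation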

Slide 46

Slide 46 text

WE ARE HIRING!
If you are interested, please contact a Z Lab employee directly.

Slide 47

Slide 47 text

Thank You! For your time & we’ll see you soon @ladicle