Testable Kubernetes Operators? by Marcel Müller

Slide 1

Slide 1 text

Testable Kubernetes Operators?
Marcel Müller, Platform Engineer
@muemarcel

Slide 2

Slide 2 text

What are Kubernetes operators?

Slide 3

Slide 3 text

Operator Definition
- Kubernetes’ controllers concept lets you extend the cluster’s behaviour without modifying the code of Kubernetes itself.
- Operators are clients of the Kubernetes API that act as controllers for a Custom Resource.
Source: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

Slide 4

Slide 4 text

Controller Definition
- A controller tracks at least one Kubernetes resource type.
- These objects have a spec field that represents the desired state.
- The controller(s) for that resource are responsible for making the current state come closer to that desired state.
Source: https://kubernetes.io/docs/concepts/architecture/controller/

Slide 5

Slide 5 text

- Operators act like controllers
- Operators are clients of the Kubernetes API
[Diagram: K8s-API, Operator (Watch CR), Controller manager, Kube scheduler]

Slide 6

Slide 6 text

- Desired state in Spec
- Current state as reality
- Reconcile by applying diff to current state
- Periodically get desired state through list
[Diagram: Observe (Kubernetes watch|list) -> Evaluate against current state -> Reconcile]
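The observe, evaluate, reconcile cycle can be sketched without a cluster at all. The following is a minimal illustration, not how any particular operator implements it: `FakeCluster`, the replica-count model, and the action tuples are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical stand-in for the real system the operator manages."""
    replicas: dict = field(default_factory=dict)  # name -> running replica count


def reconcile(desired: dict, cluster: FakeCluster) -> list:
    """Compare desired state (spec) with current state and apply the diff.

    Returns the list of actions taken so a test can assert on them.
    """
    actions = []
    for name, want in desired.items():
        have = cluster.replicas.get(name)
        if have is None:
            cluster.replicas[name] = want
            actions.append(("create", name, want))
        elif have != want:
            cluster.replicas[name] = want
            actions.append(("scale", name, want))
    # Anything running that is no longer desired gets removed.
    for name in list(cluster.replicas):
        if name not in desired:
            del cluster.replicas[name]
            actions.append(("delete", name, None))
    return actions


cluster = FakeCluster(replicas={"old": 1, "web": 1})
desired = {"web": 3, "db": 1}
first = reconcile(desired, cluster)
second = reconcile(desired, cluster)  # already converged, so no actions
```

Because the function only applies the difference, running it a second time on a converged state does nothing, which is the property a periodic re-list relies on.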

Slide 7

Slide 7 text

Operator Definition
[Diagram: Custom Resource (Definition) submits -> Kubernetes API -> watch events -> Operator -> Take a decision]

Slide 8

Slide 8 text

Example: Prometheus-Operator
- Watches Prometheus CR
- Creates Prometheus pod deployments
- Continuously reconciles desired configuration with actual deployment
https://github.com/coreos/prometheus-operator

Slide 9

Slide 9 text

Example: AWS-Operator
- Watches Cluster CR
- Creates Kubernetes clusters on AWS matching CR Spec
- Continuously reconciles desired configuration with actual cluster
https://github.com/kubernetes-sigs/cluster-api
https://github.com/giantswarm/aws-operator

Slide 10

Slide 10 text

What makes operators hard to test?

Slide 11

Slide 11 text

External APIs
- Operators as a tool for infrastructure management
- Managing stateful resources outside of k8s

Slide 12

Slide 12 text

External APIs
- Same challenges as other applications
- Already has hard dependency on k8s API

Slide 13

Slide 13 text

Eventual consistency
- Reconciliation might never reach consistent state
- Multiple loops might be needed for consistency

Slide 14

Slide 14 text

External APIs + Eventual consistency = Flapping integration tests

Slide 15

Slide 15 text

Kubernetes concepts which make life easier

Slide 16

Slide 16 text

Finalizers
- []string in the object metadata
- Object will only be deleted from the k8s API once the list is empty
- Deletion events will be replayed while finalizers exist

Slide 17

Slide 17 text

- deletionTimestamp indicates that object should be deleted
- Controllers should remove themselves from the finalizers list!

    apiVersion: cluster.k8s.io/v1alpha1
    kind: Cluster
    metadata:
      …
      deletionTimestamp: "2019-11-12T12:45:47Z"
      finalizers:
      - aws-operator-cluster-controller
      - cluster-operator-cluster-controller
      name: demo
      namespace: default
    ...
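As a rough sketch of the finalizer flow, the following treats the custom resource as a plain dict. The `handle` and `api_would_delete` helpers and the `cleanup` callback are hypothetical and exist only for this illustration; only the finalizer names and the timestamp come from the example object on the slide.

```python
MY_FINALIZER = "aws-operator-cluster-controller"  # name taken from the slide's example


def handle(obj: dict, cleanup) -> str:
    """Process one event for a custom resource represented as a plain dict."""
    meta = obj["metadata"]
    finalizers = meta.setdefault("finalizers", [])
    if meta.get("deletionTimestamp") is None:
        # Object is live: make sure our finalizer is registered.
        if MY_FINALIZER not in finalizers:
            finalizers.append(MY_FINALIZER)
        return "reconciled"
    if MY_FINALIZER in finalizers:
        # Object is being deleted: clean up, then remove only our own entry.
        cleanup(obj)
        finalizers.remove(MY_FINALIZER)
        return "finalized"
    return "nothing to do"


def api_would_delete(obj: dict) -> bool:
    """Mimics the rule from the slides: deletion happens once the list is empty."""
    meta = obj["metadata"]
    return meta.get("deletionTimestamp") is not None and not meta.get("finalizers")


cleaned = []
cluster = {
    "metadata": {
        "name": "demo",
        "deletionTimestamp": "2019-11-12T12:45:47Z",
        "finalizers": [
            "aws-operator-cluster-controller",
            "cluster-operator-cluster-controller",
        ],
    }
}
result = handle(cluster, lambda o: cleaned.append(o["metadata"]["name"]))
```

After this call the second controller's finalizer is still present, so the object stays in the API until that controller has also finished its cleanup.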

Slide 18

Slide 18 text

Finalizers
- Improve stability in case of missed events
- Absolutely necessary if multiple controllers watch the same object!

Slide 19

Slide 19 text

Resource Version
- Identifies the state of an object as a number
- Changes only if the object has changed

Slide 20

Slide 20 text

- Should be stored when reading the object
- Can be applied again when updating the object
-> Ensures that it has not changed in the meantime

    apiVersion: cluster.k8s.io/v1alpha1
    kind: Cluster
    metadata:
      …
      name: demo
      namespace: default
      resourceVersion: "22453751"
    ...
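The read, modify, update pattern with a stored resource version can be illustrated with a small in-memory store. This is a simplified sketch: `FakeStore` and `Conflict` are hypothetical names, and the integer counter is an assumption made for readability (a real client should compare resource versions only for equality).

```python
class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion."""


class FakeStore:
    """Hypothetical in-memory stand-in for the API server's handling of one object."""

    def __init__(self, obj: dict):
        self._obj = dict(obj)
        self._version = 1

    def get(self) -> dict:
        # Every read returns the object together with its current version.
        return {**self._obj, "resourceVersion": str(self._version)}

    def update(self, obj: dict) -> dict:
        # Reject the write if the caller's copy is older than what is stored.
        if obj.get("resourceVersion") != str(self._version):
            raise Conflict("object changed since it was read")
        self._obj = {k: v for k, v in obj.items() if k != "resourceVersion"}
        self._version += 1
        return self.get()


store = FakeStore({"name": "demo", "replicas": 1})

mine = store.get()       # the controller reads and remembers the version
theirs = store.get()     # someone else reads the same version
theirs["replicas"] = 5
store.update(theirs)     # their write succeeds and bumps the version

mine["replicas"] = 2
try:
    store.update(mine)   # stale version: rejected instead of overwriting
    outcome = "overwrote"
except Conflict:
    outcome = "conflict"
```

On a conflict the usual reaction is to read the object again and retry, rather than to force the write through.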

Slide 21

Slide 21 text

Resource Version
- Prevents most simple race conditions
- Prevents accidental object manipulation in test suites!

Slide 22

Slide 22 text

Status
- Status as reflection of current state
- Reflect error and failure states

Slide 23

Slide 23 text

- Status is defined by you
- Treat the CR as an API

    apiVersion: v1
    kind: Node
    metadata:
      ...
    spec:
      ...
    status:
      ...
      conditions:
      - lastHeartbeatTime: "2019-11-12T13:02:14Z"
        lastTransitionTime: "2019-11-12T13:02:14Z"
        message: Calico is running on this node
        reason: CalicoIsUp
        status: "False"
        type: NetworkUnavailable
      ...

Slide 24

Slide 24 text

Status
- Good status implementation allows tests to fail fast
- Transition timestamps give performance insights
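One way a test could use status conditions to fail fast is sketched below. The condition types "Created" and "Failed", the `wait_for_condition` helper, and the `get_status` callback are assumptions made for this example; the talk does not prescribe specific condition names.

```python
import time


class TerminalFailure(Exception):
    """Raised when the status reports a state the test cannot recover from."""


def wait_for_condition(get_status, cond_type, timeout_s=5.0, interval_s=0.01):
    """Poll a status until `cond_type` is "True"; fail fast on a "Failed" condition."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        conditions = get_status().get("conditions", [])
        by_type = {c["type"]: c for c in conditions}
        failed = by_type.get("Failed")
        if failed and failed["status"] == "True":
            # No point in waiting for the timeout: the operator reported an error.
            raise TerminalFailure(failed.get("message", ""))
        wanted = by_type.get(cond_type)
        if wanted and wanted["status"] == "True":
            return wanted
        time.sleep(interval_s)
    raise TimeoutError(f"{cond_type} not reached within {timeout_s}s")


# Simulated sequence of statuses as the test would observe them over time.
statuses = iter([
    {"conditions": []},
    {"conditions": [{"type": "Creating", "status": "True"}]},
    {"conditions": [{"type": "Failed", "status": "True", "message": "quota exceeded"}]},
])
try:
    wait_for_condition(lambda: next(statuses), "Created")
    verdict = "created"
except TerminalFailure as err:
    verdict = f"failed fast: {err}"
```

Without the failure condition in the status, the same test could only discover the problem by running into its full timeout.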

Slide 25

Slide 25 text

Experiences from testing our operators

Slide 26

Slide 26 text

Never wait during reconciliation
- Waiting in the reconciliation loops introduces more timeouts
- Tests should decide when an action is taking too long
- Enforce SLAs in tests, monitor them in production
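A possible shape for a non-waiting reconcile step is sketched here: it checks the external resource once and asks to be called again later, while the test driver owns the time budget. `Result`, `FakeCloud`, `run_until_done`, and the 30-second requeue interval are all hypothetical choices for illustration, and the clock is simulated so the example runs instantly.

```python
from dataclasses import dataclass


@dataclass
class Result:
    done: bool
    requeue_after_s: float = 0.0


class FakeCloud:
    """Hypothetical external API whose resource becomes ready after a few polls."""

    def __init__(self, polls_until_ready: int):
        self.remaining = polls_until_ready

    def is_ready(self) -> bool:
        self.remaining -= 1
        return self.remaining <= 0


def reconcile(cloud: FakeCloud) -> Result:
    """Check once and return immediately; never sleep inside the loop."""
    if cloud.is_ready():
        return Result(done=True)
    return Result(done=False, requeue_after_s=30.0)


def run_until_done(cloud: FakeCloud, budget_s: float) -> float:
    """Test-side driver with a simulated clock: the test decides what is too long."""
    elapsed = 0.0
    while True:
        result = reconcile(cloud)
        if result.done:
            return elapsed
        elapsed += result.requeue_after_s
        if elapsed > budget_s:
            raise TimeoutError(f"not ready after {elapsed}s (budget {budget_s}s)")


took = run_until_done(FakeCloud(polls_until_ready=3), budget_s=120.0)
```

The budget lives in one place, the test, so tightening or relaxing an SLA does not require touching the operator code.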

Slide 27

Slide 27 text

Get all the logs
- Issues can be very complex in a distributed system
- Don’t settle for some logs
- kind export logs
https://github.com/kubernetes-sigs/kind

Slide 28

Slide 28 text

“It’s just another flap”
- Don’t get complacent
- Most issues with flapping are actual issues with the operator

Slide 29

Slide 29 text

Testable Kubernetes operators?

Slide 30

Slide 30 text

Questions?
Stay in touch
- Twitter @muemarcel
- GitHub MarcelMue
- Meet me at the conference!
Thank you!