SRE Principle and Operator Practice

SRE Principle and [Kubernetes] Operator Practice
Josh Wood, Developer Advocate, Red Hat
[email protected] – @joshixisjosh9 – joshix.com

Why should you care about Operators?

Any application in any system must be installed, configured, managed and upgraded over time.
Patching is critical to security.

“Anything that isn’t automated is slowing you down”

$ kubectl scale deploy/staticweb --replicas=3

Deploying a database is easy

$ kubectl create deployment db --image=quay.io/my/db

Running a database over time is harder

• Resize/Upgrade
• Reconfigure
• Backup
• Healing

If only Kubernetes knew...

Extending the Kubernetes API

1. Application-specific custom controllers
2. Custom resource definitions (CRD)

How Does an Operator Work?

Developer / Kubernetes User -> Custom Resource -> K8s API

    kind: ProductionReadyDatabase
    apiVersion: database.example.com/v1alpha1
    metadata:
      name: my-important-database
    spec:
      connectionPoolSize: 300
      readReplicas: 2
      version: v4.0.1

Kubernetes Operator = Custom Kubernetes Controller (Watch Events, Reconciliation) + Custom Resource Definition

Native Kubernetes Resources: Deployments, StatefulSets, Autoscalers, Secrets, Config maps, PersistentVolume
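The watch-and-reconcile idea on this slide can be sketched in plain Python. This is a minimal illustration, not how any real Operator is implemented: `DatabaseSpec`, `Cluster`, `reconcile` and `run_until_converged` are hypothetical names, and the in-memory `Cluster` object stands in for the native Kubernetes resources a real controller would read and change through the API.

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseSpec:
    """Desired state, mirroring the spec fields of the custom resource."""
    version: str
    read_replicas: int
    connection_pool_size: int


@dataclass
class Cluster:
    """Hypothetical stand-in for the observed state of the running database."""
    version: str = ""
    read_replicas: int = 0
    connection_pool_size: int = 0
    actions: list = field(default_factory=list)


def reconcile(spec: DatabaseSpec, cluster: Cluster) -> bool:
    """Move observed state one step toward desired state.

    Returns True when a change was made, so the caller knows to run again.
    """
    if cluster.version != spec.version:
        cluster.actions.append(f"upgrade to {spec.version}")
        cluster.version = spec.version
        return True
    if cluster.read_replicas != spec.read_replicas:
        cluster.actions.append(f"scale read replicas to {spec.read_replicas}")
        cluster.read_replicas = spec.read_replicas
        return True
    if cluster.connection_pool_size != spec.connection_pool_size:
        cluster.actions.append(f"set pool size to {spec.connection_pool_size}")
        cluster.connection_pool_size = spec.connection_pool_size
        return True
    return False


def run_until_converged(spec: DatabaseSpec, cluster: Cluster, max_steps: int = 10) -> int:
    """Call reconcile repeatedly until nothing is left to change."""
    steps = 0
    while steps < max_steps and reconcile(spec, cluster):
        steps += 1
    return steps
```

The loop is level-triggered: each pass compares the whole desired state with the whole observed state, so a missed event only delays convergence rather than losing a change.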

Custom Resource (CR)

    kind: ProductionReadyDatabase
    apiVersion: database.example.com/v1alpha1
    metadata:
      name: my-production-ready-database
    spec:
      clusterSize: 3
      readReplicas: 2
      version: v4.0.1
    [...]

Operators are automated software managers, software SREs, that manage the entire lifecycle of Kubernetes applications

[Figure: controllers. Source: Hausenblas, Schimanski. Programming Kubernetes. O’Reilly, 2019.]

Value of Operators

• Improve the “time to first value” for your customers
• Minimize software upgrade risk and associated operational costs
• Embed best practices from the experts – you – into the Operator
• Provide a cloud-like "As a Service" experience

TYPES OF OPERATORS

• Red Hat Products
• ISV Partners
• Community

OPERATOR HUB
Operator Hub allows administrators to selectively make operators available from curated sources to users in the cluster.

OPERATORS ACROSS THE INDUSTRY
...and many more

Operator Maturity Model

• Phase I, Basic Install: Automated application provisioning and configuration management
• Phase II, Seamless Upgrades: Patch and minor version upgrades supported
• Phase III, Full Lifecycle: App lifecycle, storage lifecycle (backup, failure recovery)
• Phase IV, Deep Insights: Metrics, alerts, log processing and workload analysis
• Phase V, Auto Pilot: Horizontal/vertical scaling, auto config tuning, abnormal detection, scheduling tuning

Site Reliability Engineering (SRE)

• O’Reilly “SRE Book” (Beyer et al.)
• Carla Geisser, paraphrased: ~“Human intervention… is a bug”
• SREs write code to fix those bugs
• SREs write software to run other software
• SREs write Kubernetes Operators

Level 1 Installation - Deployment

• Can you set Operand configuration in the CR?
• Do CR changes cause non-disruptive updates to the Operand?
• Does CR status show what has and hasn’t been applied?
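One way to answer the last question is for the Operator to record what it has actually applied and report it per field next to the spec. The sketch below assumes a hypothetical `applied` record kept in the CR status; neither that field nor the `applied_report` helper comes from the talk or from any particular Operator.

```python
_MISSING = object()  # sentinel so a spec value of None is still compared correctly


def applied_report(spec: dict, applied: dict) -> dict:
    """Label every spec field as 'applied' or 'pending'.

    spec    -- the desired configuration from the custom resource
    applied -- what the Operator says it has rolled out so far (hypothetical status field)
    """
    report = {}
    for key, wanted in spec.items():
        current = applied.get(key, _MISSING)
        report[key] = "applied" if current == wanted else "pending"
    return report


if __name__ == "__main__":
    spec = {"clusterSize": 3, "readReplicas": 2, "version": "v4.0.1"}
    applied = {"clusterSize": 3, "readReplicas": 1}
    for name, state in applied_report(spec, applied).items():
        print(f"{name}: {state}")
```

A user reading this report can tell at a glance that the replica change and the version change are still in flight, without having to inspect the underlying Deployments.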

Level 2 Upgrades

• Can the Operator upgrade its Operand?
• Without disruption?
• Does CR status show what has and hasn’t been upgraded?
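An Operator at this level needs some rule for which upgrades it will attempt. The sketch below assumes one possible policy, taken from the maturity model's wording "patch and minor version upgrades supported": the target must be newer and share the same major version. The `vMAJOR.MINOR.PATCH` format and both function names are assumptions for illustration.

```python
def parse_version(text: str) -> tuple:
    """Turn a string such as 'v4.0.1' into the tuple (4, 0, 1)."""
    major, minor, patch = text.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def upgrade_supported(current: str, target: str) -> bool:
    """Allow only forward upgrades that stay within the same major version."""
    cur = parse_version(current)
    tgt = parse_version(target)
    return tgt[0] == cur[0] and tgt > cur


if __name__ == "__main__":
    for target in ("v4.0.2", "v4.1.0", "v5.0.0", "v4.0.0"):
        print(target, upgrade_supported("v4.0.1", target))
```

Rejecting unsupported targets up front gives the Operator something clear to write into the CR status instead of starting an upgrade it cannot finish.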

Level 3 Full Lifecycle Management

• Can your Operator back up its Operand?
• Can your Operator restore from a previous Operand backup?
• Ready/Live probes? Active monitoring of basic execution state?
• CPU and other requests and limits set for Operand?

Level 4 Deep Insights

• Does the Operator expose metrics about its own health?
• Metrics and alerts for the Operand?
• Does CR status show what has and hasn’t been applied?

RED
Rate (aka Traffic) - Errors - Duration (aka Latency)

The RED Method defines three key metrics for every service:
• Rate (the number of requests per second)
• Errors (the number of those requests that are failing)
• Duration (the amount of time those requests take)
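The three RED numbers can be computed from a window of request records. The following is a minimal sketch under assumed names: the `Request` record and `red_metrics` function are hypothetical, and a real Operand would more likely export counters and histograms to a metrics system than compute these in process.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    duration_seconds: float
    failed: bool


def red_metrics(requests: list, window_seconds: float) -> dict:
    """Summarize a window of requests as Rate, Errors and Duration."""
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    durations = [r.duration_seconds for r in requests]
    return {
        # Rate: requests per second over the window
        "rate": total / window_seconds,
        # Errors: failing requests per second, plus the failing share of all requests
        "errors": failures / window_seconds,
        "error_ratio": failures / total if total else 0.0,
        # Duration: mean and worst-case time taken per request
        "duration_mean": sum(durations) / total if total else 0.0,
        "duration_max": max(durations) if durations else 0.0,
    }


if __name__ == "__main__":
    window = [
        Request(0.5, False),
        Request(0.25, False),
        Request(0.25, True),
        Request(1.0, False),
    ]
    print(red_metrics(window, window_seconds=2.0))
```

Means hide slow outliers, so a fuller version would likely report percentiles for Duration; the mean and maximum are used here only to keep the sketch short.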

Level 5 Auto Pilot

• Marine autopilots are reasonable models, especially with rudder position feedback
• Auto scaling, healing, tuning
  ◦ Detect condition from metrics, scale horizontally (Replicas) or vertically (Requests/Limits)
  ◦ Think especially about scaling back down; resource savings
  ◦ Detect deterioration in Operand(s) (based on Level 4’s metrics) and take action to redeploy or reconfigure
• CR Status, custom Events: Clear status and especially error conditions
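The "detect condition from metrics, then scale" step can be sketched as a proportional rule with a dead band, similar in spirit to the calculation documented for the Horizontal Pod Autoscaler. The function name, the bounds and the 10% tolerance are assumptions chosen for illustration; an Operator would pick its own metric and limits.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Pick a replica count that brings the per-replica metric back to target.

    current -- replicas running now
    metric  -- observed per-replica value (for example, requests per second)
    target  -- per-replica value the Operator wants to hold
    """
    ratio = metric / target
    # Dead band: ignore small deviations so the system does not flap.
    if abs(ratio - 1.0) <= tolerance:
        return current
    wanted = math.ceil(current * ratio)
    # Clamp in both directions; the lower bound is what makes scale-down safe.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(3, metric=90, target=60))   # load is high, scale up
    print(desired_replicas(4, metric=15, target=60))   # load is low, scale back down
    print(desired_replicas(3, metric=63, target=60))   # within tolerance, no change
```

The same rule handles scaling back down, which is where the resource savings come from; the tolerance plays the role of the rudder feedback in the autopilot analogy by damping overcorrection.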

Level 5 (cont.) Auto Pilot

“Toil Not, Neither Spin” (Kubernetes Operators, Dobies & Wood)

SRE defines “toil” as:
• Automatable - your computer would enjoy it!
• Without enduring value - needs to be done but doesn’t change the system
• Grows linearly with growth of the system


Experiments/Challenges
“...left as an exercise for the reader…”

• SRE stuff: Add metrics awareness and tuning to your Operator
• Other APIs / API representations: k8fs?
• K8fs presents Kubernetes API as a synthetic file hierarchy
  • % cp manifest.yaml /mnt/k8s/ns/default/deployments/
  • % echo 3 >/mnt/k8s/ns/default/deployments/myapp/replicas
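To make the file-hierarchy idea concrete, here is a guess at how the two writes above might translate into Kubernetes API requests. This is not k8fs's implementation, which the slides do not describe: `translate_write` is a hypothetical helper, and it only understands the two path shapes shown. The request paths follow the `apps/v1` Deployment API and its `scale` subresource.

```python
def translate_write(path: str, data: str, mount: str = "/mnt/k8s") -> tuple:
    """Map a write under the mount point to (HTTP method, API path, body)."""
    if not path.startswith(mount + "/"):
        raise ValueError(f"{path} is not under {mount}")
    parts = [p for p in path[len(mount):].split("/") if p]
    # Expected layout: ns/<namespace>/deployments[/<name>/replicas]
    if len(parts) < 3 or parts[0] != "ns" or parts[2] != "deployments":
        raise ValueError(f"unsupported path: {path}")
    base = f"/apis/apps/v1/namespaces/{parts[1]}/deployments"
    if len(parts) == 3:
        # Copying a manifest into the directory creates a Deployment.
        return ("POST", base, data)
    if len(parts) == 5 and parts[4] == "replicas":
        # Writing a number to the replicas file patches the scale subresource.
        body = {"spec": {"replicas": int(data)}}
        return ("PATCH", f"{base}/{parts[3]}/scale", body)
    raise ValueError(f"unsupported path: {path}")


if __name__ == "__main__":
    print(translate_write("/mnt/k8s/ns/default/deployments/", "<manifest.yaml contents>"))
    print(translate_write("/mnt/k8s/ns/default/deployments/myapp/replicas", "3\n"))
```

The mapping works because both a file tree and the Kubernetes API are hierarchies of named resources, which is what makes this an appealing experiment.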

Resources

• https://operatorframework.io
• https://operatorhub.io
• https://learn.openshift.com/operatorframework/
• http://bit.ly/kubernetes-operators

Thank you

• linkedin.com/showcase/red-hat-developer
• youtube - bit.ly/2YRIWTk
• facebook.com/redhatdeveloperprogram
• twitter.com/rhdevelopers

Josh Wood
[email protected] – @joshixisjosh9 – joshix.com
