SRE Principle and Operator Practice

Red Hat DevNation Tech Talk: The Kubernetes Operator pattern is rooted in Site Reliability Engineering (SRE) automation principles[1], and SRE's key service metrics called "golden signals" provide the basic feedback stream for sophisticated, proactive controller logic in an Operator. Consider it a bug when your Operator's application requires human intervention. Operators are where you write code to fix those bugs and automate operations chores. The Operator maturity model is a way of thinking about iterative development of this kind of advanced automation on Kubernetes.

[1]: https://coreos.com/blog/introducing-operators.html

Josh Wood

July 16, 2020

Transcript

  1. Any application in any system must be installed, configured, managed and upgraded over time

    Patching is critical to security
  2. How Does an Operator Work?

    Diagram labels: Developer / Kubernetes User; Custom Resource; K8s API; Kubernetes Operator (Custom Kubernetes Controller + Custom Resource Definition); Watch Events; Reconciliation; Native Kubernetes Resources (Deployments, StatefulSets, Autoscalers, Secrets, Config maps, PersistentVolume).

    Custom Resource shown on the slide:

        kind: ProductionReadyDatabase
        apiVersion: database.example.com/v1alpha1
        metadata:
          name: my-important-database
        spec:
          connectionPoolSize: 300
          readReplicas: 2
          version: v4.0.1
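
The watch-and-reconcile cycle named on this slide can be sketched in a few lines. This is only an illustration, not how any particular Operator is written: the `DatabaseSpec` and `ObservedState` types and the `reconcile` function are hypothetical names, and the fields simply mirror the `spec` of the example `ProductionReadyDatabase` resource. A real controller would read both states from the Kubernetes API rather than take them as arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseSpec:
    """Desired state, mirroring the spec fields of the example custom resource."""
    connection_pool_size: int
    read_replicas: int
    version: str


@dataclass(frozen=True)
class ObservedState:
    """What a hypothetical controller found running in the cluster."""
    connection_pool_size: int
    read_replicas: int
    version: str


def reconcile(desired: DatabaseSpec, observed: ObservedState) -> list[str]:
    """Return the actions needed to move the observed state toward the desired state.

    An empty list means both states already match, so calling this repeatedly is safe.
    """
    actions = []
    if observed.version != desired.version:
        actions.append(f"upgrade {observed.version} -> {desired.version}")
    if observed.read_replicas != desired.read_replicas:
        actions.append(
            f"scale read replicas {observed.read_replicas} -> {desired.read_replicas}"
        )
    if observed.connection_pool_size != desired.connection_pool_size:
        actions.append(f"set connectionPoolSize to {desired.connection_pool_size}")
    return actions


# Desired values come from the slide; the observed values are made up for the example.
desired = DatabaseSpec(connection_pool_size=300, read_replicas=2, version="v4.0.1")
observed = ObservedState(connection_pool_size=300, read_replicas=1, version="v4.0.0")
print(reconcile(desired, observed))
```

The point of the sketch is that the function compares states instead of reacting to individual events, so a missed event is corrected on the next pass.
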
  3. Value of Operators

    • Improve the “time to first value” for your customers
    • Minimize software upgrade risk and associated operational costs
    • Embed best practices from the experts (you) into the Operator
    • Provide a cloud-like "As a Service" experience
  4. Types of Operators: Red Hat Products, ISV Partners, Community

    Operator Hub - Allows administrators to selectively make operators available from curated sources to users in the cluster.
  5. Operator Maturity Model

    • Phase I, Basic Install: Automated application provisioning and configuration management
    • Phase II, Seamless Upgrades: Patch and minor version upgrades supported
    • Phase III, Full Lifecycle: App lifecycle, storage lifecycle (backup, failure recovery)
    • Phase IV, Deep Insights: Metrics, alerts, log processing and workload analysis
    • Phase V, Auto Pilot: Horizontal/vertical scaling, auto config tuning, abnormal detection, scheduling tuning
  6. Site Reliability Engineering (SRE)

    • O’Reilly “SRE Book” (Beyer et al)
    • Carla Geisser (al) paraphrased: ~“Human intervention… is a bug”
    • SREs write code to fix those bugs
    • SREs write software to run other software
    • SREs write Kubernetes Operators
  7. Level 1 Installation - Deployment

    • Can you set operand configuration in the CR?
    • Do CR changes cause non-disruptive updates to the Operand?
    • Does CR status show what has and hasn’t been applied?
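
One common way to answer the "what has and hasn't been applied" question is to compare `metadata.generation` with a `status.observedGeneration` field that the controller writes back after acting on a spec. The sketch below assumes the Operator follows that convention, which the slide does not state; `spec_applied` is a hypothetical helper name and the custom resource is represented as a plain dictionary.

```python
def spec_applied(custom_resource: dict) -> bool:
    """Report whether the controller has acted on the latest spec.

    Assumes the controller records the generation it last reconciled in
    status.observedGeneration (a convention, not something the API enforces).
    """
    generation = custom_resource.get("metadata", {}).get("generation")
    observed = custom_resource.get("status", {}).get("observedGeneration")
    return generation is not None and observed == generation


# The spec was edited (generation 3) but the controller last saw generation 2.
pending = {"metadata": {"generation": 3}, "status": {"observedGeneration": 2}}
# The controller has caught up with the latest spec.
current = {"metadata": {"generation": 3}, "status": {"observedGeneration": 3}}

print(spec_applied(pending), spec_applied(current))
```
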
  8. Level 2 Upgrades

    • Can the Operator upgrade its Operand?
    • Without disruption?
    • Does CR status show what has and hasn’t been upgraded?
  9. Level 3 Full Lifecycle Management

    • Can your Operator back up its Operand?
    • Can your Operator restore from a previous Operand backup?
    • Ready/Live probes? Active monitoring of basic execution state?
    • CPU and other requests and limits set for Operand?
  10. Level 4 Deep Insights

    • Does the Operator expose metrics about its own health?
    • Metrics and alerts for the Operand?
    • Does CR status show what has and hasn’t been applied?
  11. RED: Rate (aka Traffic) - Errors - Duration (aka Latency)

    The RED Method defines three key metrics for every service:
    • Rate (the number of requests per second)
    • Errors (the number of those requests that are failing)
    • Duration (the amount of time those requests take)
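
As a rough illustration of the three RED metrics, the sketch below computes them from a list of request records collected over a fixed window. The `Request` type, the `red_metrics` function and the sample values are assumptions made for the example. Duration is summarized as a simple mean here, although in practice it is usually tracked as a distribution or percentiles.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_seconds: float
    failed: bool


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize one observation window as Rate, Errors and Duration."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    failed = sum(1 for r in requests if r.failed)
    mean_duration = (
        sum(r.duration_seconds for r in requests) / total if total else 0.0
    )
    return {
        "rate": total / window_seconds,  # requests per second
        "errors": failed,                # failing requests in the window
        "duration": mean_duration,       # mean seconds per request
    }


# Four made-up requests observed during a two-second window.
sample = [
    Request(0.25, False),
    Request(0.5, False),
    Request(0.25, True),
    Request(1.0, False),
]
print(red_metrics(sample, window_seconds=2.0))
```
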
  12. Level 5 Auto Pilot

    • Marine autopilots are reasonable models, especially with rudder position feedback
    • Auto scaling, healing, tuning
      ◦ Detect condition from metrics, scale horizontally (Replicas) or vertically (Requests/Limits)
      ◦ Think especially about scaling back down; resource savings
      ◦ Detect deterioration in Operand(s) (based on Level 4’s metrics) and take action to redeploy or reconfigure
    • CR Status, custom Events: Clear status and especially error conditions
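
A minimal sketch of the horizontal-scaling decision described on this slide, including the scale-back-down case. It uses a proportional rule of the kind the Kubernetes Horizontal Pod Autoscaler documentation describes (desired = ceil(current × observed / target), skipped when the ratio is within a tolerance). The function name, the per-replica metric, the bounds and the default tolerance are assumptions for the example, not something the talk prescribes.

```python
import math


def desired_replicas(
    current: int,
    observed_per_replica: float,
    target_per_replica: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Pick a replica count that moves the per-replica metric toward its target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    ratio = observed_per_replica / target_per_replica
    if abs(ratio - 1.0) <= tolerance:
        wanted = current  # close enough to target: avoid flapping
    else:
        wanted = math.ceil(current * ratio)
    # Keep the result inside the configured bounds in every case.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, observed_per_replica=150.0, target_per_replica=100.0))  # overloaded
print(desired_replicas(4, observed_per_replica=25.0, target_per_replica=100.0))   # mostly idle
```

Because the same rule runs in both directions, load dropping to a quarter of the target shrinks the replica count as readily as a spike grows it, which is where the resource savings come from.
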
  13. Level 5 (cont.) Auto Pilot

    “Toil Not, Neither Spin” (Kubernetes Operators, Dobies & Wood)

    SRE defines “toil” as:
    • Automatable - your computer would enjoy it!
    • Without enduring value - needs doing but doesn’t change the system
    • Grows linearly with growth of the system
  14. Operator Maturity Model

    • Phase I, Basic Install: Automated application provisioning and configuration management
    • Phase II, Seamless Upgrades: Patch and minor version upgrades supported
    • Phase III, Full Lifecycle: App lifecycle, storage lifecycle (backup, failure recovery)
    • Phase IV, Deep Insights: Metrics, alerts, log processing and workload analysis
    • Phase V, Auto Pilot: Horizontal/vertical scaling, auto config tuning, abnormal detection, scheduling tuning
  15. Experiments/Challenges: “...left as an exercise for the reader…”

    • SRE stuff: Add metrics awareness and tuning to your Operator
    • Other APIs / API representations: k8fs?
    • K8fs presents Kubernetes API as a synthetic file hierarchy
    • % cp manifest.yaml /mnt/k8s/ns/default/deployments/
    • % echo 3 >/mnt/k8s/ns/default/deployments/myapp/replicas
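
To make the file-hierarchy idea concrete, the sketch below maps a path like the two on this slide to the API object it would address. The layout `ns/<namespace>/<resource>/<name>/<field>` is inferred only from those two example paths; the real k8fs layout may differ, and `Target` and `parse_k8fs_path` are hypothetical names.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class Target(NamedTuple):
    namespace: str
    resource: str
    name: Optional[str]   # None when the path addresses the whole collection
    field: Optional[str]  # None when the path addresses the whole object


def parse_k8fs_path(path: str, mount: str = "/mnt/k8s") -> Target:
    """Split a path under the mount point into namespace, resource, name and field."""
    # relative_to raises ValueError when the path is outside the mount point.
    parts = PurePosixPath(path).relative_to(mount).parts
    if not 3 <= len(parts) <= 5 or parts[0] != "ns":
        raise ValueError(f"unrecognized path layout: {path}")
    name = parts[3] if len(parts) > 3 else None
    field = parts[4] if len(parts) > 4 else None
    return Target(parts[1], parts[2], name, field)


print(parse_k8fs_path("/mnt/k8s/ns/default/deployments/"))
print(parse_k8fs_path("/mnt/k8s/ns/default/deployments/myapp/replicas"))
```

Under this reading, the `cp` on the slide would create an object in the `deployments` collection of the `default` namespace, and the `echo 3` would set the `replicas` field of `myapp`.
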