Kubernetes Operators: Managing Complex Software with Software
Josh Wood
DocOps at CoreOS
@joshixisjosh9
Jesus Carrillo
Sr. Systems Engineer at Ticketmaster
Overview
Scaling Stateless Apps
$ kubectl scale --replicas=3 ...
ReplicaSet
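A fuller form of the command, as a sketch (the Deployment name and image are hypothetical):

$ kubectl run webapp --image=quay.io/my/webapp    # creates a Deployment, which manages a ReplicaSet
$ kubectl scale --replicas=3 deployment/webapp    # grow the app to three Pods
$ kubectl get rs                                  # the ReplicaSet reports 3 desired / 3 ready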
Overview
What about apps that… store data?
$ kubectl run db --image=quay.io/my/db
Creating a Database is Easy
● Resize/Upgrade - coordination for availability
● Reconfigure - tedious generation / templating
● Backup - requires coordination on instances
● Healing - restore backups, rejoin
Managing a Distributed Database is Harder
etcd Overview
● Distributed key-value store
● Primary datastore of Kubernetes
● Auto-leader election for availability
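A quick illustration of the key-value model, assuming a reachable etcd endpoint (key and value are placeholders):

$ export ETCDCTL_API=3                        # use the v3 API
$ etcdctl put /config/feature-x "enabled"     # write a key
$ etcdctl get /config/feature-x               # read it back from the cluster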
Operator Construction
● Operators build on core Kubernetes concepts
● Resources: who, what, where; the desired state
● Controllers: Observe, Analyze, Act to reconcile actual state with that desired state (sketched below)
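A minimal Observe/Analyze/Act loop, sketched in shell against a hypothetical etcdcluster third-party resource named example (a real operator does this in Go with the Kubernetes client libraries):

# Sketch only: assumes an "etcdcluster" TPR and etcd Pods labeled cluster=example.
while true; do
  desired=$(kubectl get etcdcluster example -o jsonpath='{.spec.size}')          # Observe: desired state
  actual=$(kubectl get pods -l app=etcd,cluster=example --no-headers | wc -l)    # Observe: actual state
  if [ "$actual" -ne "$desired" ]; then                                          # Analyze: compare the two
    echo "reconciling: $actual -> $desired members"                              # Act: a real operator adds/removes members here
  fi
  sleep 10
done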
Third Party Resources
● TPRs extend the Kubernetes API with new API object types
● Akin to a database table’s schema - the data model
● Designed with custom automation mechanisms in mind
● https://kubernetes.io/docs/user-guide/thirdpartyresources/
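A hedged sketch of the mechanism (names, group, and fields are placeholders): registering a TPR adds a new type to the API, and instances of that type are then created like any other object for an operator to watch:

$ kubectl create -f - <<EOF
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: etcd-cluster.example.com        # yields kind EtcdCluster in group example.com
description: "Managed etcd clusters (illustration only)"
versions:
  - name: v1
EOF

$ kubectl create -f - <<EOF
apiVersion: example.com/v1
kind: EtcdCluster
metadata:
  name: example
spec:
  size: 3                               # the desired state the operator reconciles toward
EOF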
CoreOS runs the world’s containers
We’re hiring: [email protected][email protected]
OPEN SOURCE: 90+ Projects on GitHub, 1,000+ Contributors (coreos.com)
ENTERPRISE: Support plans, training and more
coreos.com/fest
@coreosfest
May 31 - June 1, 2017
San Francisco
[email protected]
@joshixisjosh9
joshix.com
QUESTIONS?
Thanks!
We’re hiring: coreos.com/careers
Let’s talk!
CoreOS-User Google group
More events: coreos.com/community
LONGER CHAT?
How Ticketmaster uses Prometheus
● As we transition to a DevOps model:
○ Replace OpenTSDB
○ Replace legacy alerting systems
○ Each team manages its own monitoring and alerting
Prometheus POC
POC
● Limited to:
○ Teams with already-instrumented apps.
○ Teams in the process of migrating to AWS.
● Architecture (configuration sketch below):
○ Prometheus and Alertmanager running on EC2 instances.
○ Shared between teams.
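A hypothetical minimal prometheus.yml for the shared PoC stack (region, ports, and the Alertmanager address are placeholders; EC2 discovery assumes an instance IAM role with ec2:DescribeInstances):

$ cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: ec2-node-exporters
    ec2_sd_configs:
      - region: us-east-1       # discover targets via the AWS EC2 API
        port: 9100              # node_exporter default port

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.internal:9093']
EOF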
POC architecture (diagram)
Problems & lessons learned
Problems:
● Federation scrape timeouts.
● Bad configurations can disrupt the service.
● Tweaking the storage parameters takes time (example flags below).
● Network ACLs.
Lessons learned:
● Each team should have its own Prometheus stack.
● Divide and conquer.
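For reference, the kind of Prometheus 1.x local-storage flags that needed tuning (values are examples only; the right numbers depend on instance memory and series churn):

$ prometheus \
    -config.file=prometheus.yml \
    -storage.local.path=/data \
    -storage.local.retention=360h \
    -storage.local.memory-chunks=2097152 \
    -storage.local.max-chunks-to-persist=1048576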
Prometheus as a Service
Prometheus as a service
● Must:
○ Allow teams to quickly provision a dedicated stack.
○ Not represent any additional burden to the teams.
○ Provide pre-configured EC2 and k8s service discovery.
○ Helm-based deployment (example below).
● Ticketmaster exporter database:
○ Provides a well-known port range for the exporters.
○ Managed Network ACLs.
○ Scrape jobs are generated from this list.
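Provisioning a dedicated per-team stack then becomes one command, sketched here with a hypothetical chart and values:

$ helm install ./prometheus-stack \
    --name prometheus-team-tickets \
    --namespace team-tickets \
    --set team=tickets,retention=360h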
The Prometheus Operator
Prometheus Operator
● Allows us to easily model complex configuration (sketch below).
● Storage configuration is auto-tuned.
● Alertmanager HA by default.
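A sketch of what modeling the configuration looks like through the Operator's custom resources (field names follow the early v1alpha1 TPRs and are illustrative; the team label and names are placeholders):

$ kubectl create -f - <<EOF
apiVersion: monitoring.coreos.com/v1alpha1
kind: Prometheus
metadata:
  name: team-tickets
spec:
  replicas: 2
  serviceMonitorSelector:       # scrape every ServiceMonitor labeled team=tickets
    matchLabels:
      team: tickets
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: Alertmanager
metadata:
  name: team-tickets
spec:
  replicas: 3                   # clustered Alertmanager for HA
EOF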
Looking forward:
● Federation and sharding.
● Grafana integration.
Company adoption rate:
● Everyone loves it!