Kubernetes Operators: Managing Complex Software with Software
Josh Wood
DocOps at CoreOS
@joshixisjosh9
Jesus Carrillo
Sr. Systems Engineer at Ticketmaster
Overview
Scaling Stateless Apps
$ kubectl scale --replicas=3 ...
ReplicaSet
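A fuller form of the command, as a sketch (the Deployment name and image are hypothetical):

$ kubectl run webapp --image=quay.io/my/webapp    # creates a Deployment, which manages a ReplicaSet
$ kubectl scale --replicas=3 deployment/webapp    # grow the app to three Pods
$ kubectl get rs                                  # the ReplicaSet reports 3 desired / 3 ready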
Overview
What about apps that… store data?
$ kubectl run db --image=quay.io/my/db
Creating a Database is Easy
● Resize/Upgrade - coordination for availability
● Reconfigure - tedious generation / templating
● Backup - requires coordination on instances
● Healing - restore backups, rejoin
Managing a Distributed Database is Harder
etcd Overview
● Distributed key-value store
● Primary datastore of Kubernetes
● Auto-leader election for availability
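A quick illustration of the key-value model, assuming a reachable etcd endpoint (key and value are placeholders):

$ export ETCDCTL_API=3                        # use the v3 API
$ etcdctl put /config/feature-x "enabled"     # write a key
$ etcdctl get /config/feature-x               # read it back from the cluster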
Operator Construction
● Operators build on core Kubernetes concepts
● Resources: who, what, where; the desired state
● Controllers: Observe, Analyze, Act to reconcile actual state with that desired state (sketched below)
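A minimal Observe/Analyze/Act loop, sketched in shell against a hypothetical etcdcluster third-party resource named example (a real operator does this in Go with the Kubernetes client libraries):

# Sketch only: assumes an "etcdcluster" TPR and etcd Pods labeled cluster=example.
while true; do
  desired=$(kubectl get etcdcluster example -o jsonpath='{.spec.size}')          # Observe: desired state
  actual=$(kubectl get pods -l app=etcd,cluster=example --no-headers | wc -l)    # Observe: actual state
  if [ "$actual" -ne "$desired" ]; then                                          # Analyze: compare the two
    echo "reconciling: $actual -> $desired members"                              # Act: a real operator adds/removes members here
  fi
  sleep 10
done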
Third Party Resources
● TPRs extend the Kubernetes API with new API object types
● Akin to a database table’s schema - the data model
● Designed with custom automation mechanisms in mind
● https://kubernetes.io/docs/user-guide/thirdpartyresources/
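A hedged sketch of the mechanism (names, group, and fields are placeholders): registering a TPR adds a new type to the API, and instances of that type are then created like any other object for an operator to watch:

$ kubectl create -f - <<EOF
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: etcd-cluster.example.com        # yields kind EtcdCluster in group example.com
description: "Managed etcd clusters (illustration only)"
versions:
  - name: v1
EOF

$ kubectl create -f - <<EOF
apiVersion: example.com/v1
kind: EtcdCluster
metadata:
  name: example
spec:
  size: 3                               # the desired state the operator reconciles toward
EOF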
CoreOS runs the world’s containers
We’re hiring: [email protected][email protected]
OPEN SOURCE: 90+ Projects on GitHub, 1,000+ Contributors (coreos.com)
ENTERPRISE: Support plans, training and more
coreos.com/fest
@coreosfest
May 31 - June 1, 2017
San Francisco
[email protected]
@joshixisjosh9
joshix.com
QUESTIONS?
Thanks!
We’re hiring: coreos.com/careers
Let’s talk!
CoreOS-User Google group
More events: coreos.com/community
LONGER CHAT?
How Ticketmaster uses Prometheus
● As we transition to a DevOps model:
○ Replace OpenTSDB
○ Replace legacy alerting systems
○ Each team manages its own monitoring and alerting
Prometheus POC
POC
● Limited to:
○ Teams with already-instrumented apps.
○ Teams in the process of migrating to AWS.
● Architecture (configuration sketch below):
○ Prometheus and Alertmanager running on EC2 instances.
○ Shared between teams.
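A hypothetical minimal prometheus.yml for the shared PoC stack (region, ports, and the Alertmanager address are placeholders; EC2 discovery assumes an instance IAM role with ec2:DescribeInstances):

$ cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: ec2-node-exporters
    ec2_sd_configs:
      - region: us-east-1       # discover targets via the AWS EC2 API
        port: 9100              # node_exporter default port

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.internal:9093']
EOF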
POC architecture (diagram)
Problems & lessons learned
Problems:
● Federation scrape timeouts.
● Bad configurations can disrupt the service.
● Tweaking the storage parameters takes time (example flags below).
● Network ACLs.
Lessons learned:
● Each team should have its own Prometheus stack.
● Divide and conquer.
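For reference, the kind of Prometheus 1.x local-storage flags that needed tuning (values are examples only; the right numbers depend on instance memory and series churn):

$ prometheus \
    -config.file=prometheus.yml \
    -storage.local.path=/data \
    -storage.local.retention=360h \
    -storage.local.memory-chunks=2097152 \
    -storage.local.max-chunks-to-persist=1048576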
Prometheus as a Service
Prometheus as a service
● Must:
○ Allow teams to quickly provision a dedicated stack.
○ Not represent any additional burden to the teams.
○ Provide pre-configured EC2 and k8s service discovery.
○ Helm-based deployment (example below).
● Ticketmaster exporter database:
○ Provides a well-known port range for the exporters.
○ Managed Network ACLs.
○ Scrape jobs are generated from this list.
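Provisioning a dedicated per-team stack then becomes one command, sketched here with a hypothetical chart and values:

$ helm install ./prometheus-stack \
    --name prometheus-team-tickets \
    --namespace team-tickets \
    --set team=tickets,retention=360h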
The Prometheus Operator
Prometheus Operator
● Allows us to easily model complex configuration (sketch below).
● Storage configuration is auto-tuned.
● Alertmanager HA by default.
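A sketch of what modeling the configuration looks like through the Operator's custom resources (field names follow the early v1alpha1 TPRs and are illustrative; the team label and names are placeholders):

$ kubectl create -f - <<EOF
apiVersion: monitoring.coreos.com/v1alpha1
kind: Prometheus
metadata:
  name: team-tickets
spec:
  replicas: 2
  serviceMonitorSelector:       # scrape every ServiceMonitor labeled team=tickets
    matchLabels:
      team: tickets
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: Alertmanager
metadata:
  name: team-tickets
spec:
  replicas: 3                   # clustered Alertmanager for HA
EOF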
Looking forward:
● Federation and sharding.
● Grafana integration.
Company adoption rate:
● Everyone loves it!