
Kubernetes Operators Principle and Practice

with Jesus Carrillo, Ticketmaster - Cloud Native Conf - Berlin, Germany - http://sched.co/9Td4

Josh Wood

March 29, 2017

Transcript

  1. Kubernetes Operators: Managing Complex Software with Software. Josh Wood, DocOps at CoreOS (@joshixisjosh9); Jesus Carrillo, Sr. Systems Engineer at Ticketmaster
  2. Managing a Distributed Database Is Harder • Resize/Upgrade - coordination for availability • Reconfigure - tedious generation/templating • Backup - requires coordination on instances • Healing - restore backups, rejoin
  3. etcd Overview • Distributed key-value store • Primary datastore of Kubernetes • Automatic leader election for availability
  4. Operator Construction • Operators build on Kubernetes concepts • Resources: who, what, where; desired state • Controllers: Observe, Analyze, Act to reconcile resources
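The Observe, Analyze, Act cycle above can be sketched as a simple control loop. This is a minimal illustration, not the actual operator code; the etcd-membership example and its helper functions are invented for the sketch:

```python
def reconcile(desired, observe, act):
    """One pass of the operator loop: Observe, Analyze, Act."""
    current = observe()        # Observe: read actual cluster state
    if current != desired:     # Analyze: diff against the desired state
        act(current, desired)  # Act: move the cluster toward desired state

# Hypothetical example: an operator growing an etcd cluster to 3 members.
state = {"members": 1}
desired = {"members": 3}

def observe():
    return dict(state)

def act(current, desired):
    # Change one member per pass, as a real operator would do gradually
    # to preserve availability during resizes.
    step = 1 if desired["members"] > current["members"] else -1
    state["members"] += step

while observe() != desired:
    reconcile(desired, observe, act)

print(state["members"])  # converges to 3
```

Real controllers run this loop continuously against the API server rather than until a one-shot condition is met.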
  5. Third Party Resources • TPRs extend the Kubernetes API with new API object types • Akin to a database table’s schema - the data model • Designed with custom automation mechanisms in mind • https://kubernetes.io/docs/user-guide/thirdpartyresources/
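A TPR is registered by posting a small manifest; the sketch below shows the general Kubernetes-1.5-era `ThirdPartyResource` shape as a Python dict, with a hypothetical `etcd-cluster.coreos.com` resource name chosen for illustration:

```python
import json

# Hypothetical TPR manifest: registering it would add a new custom
# object type (here, an etcd cluster) to the Kubernetes API.
tpr = {
    "apiVersion": "extensions/v1beta1",
    "kind": "ThirdPartyResource",
    "metadata": {"name": "etcd-cluster.coreos.com"},
    "description": "Manages etcd clusters",
    "versions": [{"name": "v1"}],
}

print(json.dumps(tpr, indent=2))
```

An operator then watches instances of the new type and reconciles them, which is the "custom automation mechanisms" point above.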
  6. etcd Operator - Current Work • Kubernetes self-hosting etcd • Easy HA setups on Kubernetes (Tectonic 1.5.5-t.2) • Automated backup to object store
  7. Prometheus Operator • Operates Prometheus on k8s • Handles common tasks: ◦ Create/Destroy ◦ Monitor configuration ◦ Service targets via labels • Configured by resources
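"Service targets via labels" and "configured by resources" mean the operator reads a resource whose label selector picks out the Services to scrape. A sketch of such a ServiceMonitor-style resource follows; the field layout is modeled on the 2017-era Prometheus Operator, and the names and values are illustrative assumptions:

```python
import json

# Sketch of a ServiceMonitor-style resource: the operator translates this
# into Prometheus scrape configuration, selecting Services by label.
service_monitor = {
    "apiVersion": "monitoring.coreos.com/v1alpha1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "example-app"},
    "spec": {
        "selector": {"matchLabels": {"app": "example-app"}},
        "endpoints": [{"port": "web", "interval": "30s"}],
    },
}

print(json.dumps(service_monitor, indent=2))
```

Teams thus declare what to monitor in Kubernetes objects instead of hand-editing Prometheus config files.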
  8. Next Steps • Read more at coreos.com/blog • Test and extend the open source Operators • Build and discuss other Operators (Redis, Postgres, MySQL)
  9. CoreOS runs the world’s containers • We’re hiring: [email protected] [email protected] • OPEN SOURCE: 90+ projects on GitHub, 1,000+ contributors • ENTERPRISE: support plans, training and more • coreos.com
  10. How Ticketmaster Uses Prometheus • As we transition to a DevOps model: ◦ Replace OpenTSDB ◦ Replace legacy alerting systems ◦ Each team manages its own monitoring and alerting
  11. POC • Limited to: ◦ Teams with already-instrumented apps ◦ Teams in the process of migrating to AWS • Architecture: ◦ Prometheus and Alertmanager running on EC2 instances ◦ Shared between teams
  12. Problems &amp; Lessons Learned • Problems: ◦ Federation scrape timeouts ◦ Bad configurations can disrupt the service ◦ Tweaking the storage parameters takes time ◦ Network ACLs • Lessons learned: ◦ Each team should have its own Prometheus stack ◦ Divide and conquer
  13. Prometheus as a Service • Must: ◦ Allow teams to quickly provision a dedicated stack ◦ Not represent any additional burden to the teams ◦ Provide pre-configured EC2 and k8s service discovery ◦ Helm-based deployment • Ticketmaster exporter database: ◦ Provides a well-known port range for the exporters ◦ Managed Network ACLs ◦ Scrape jobs are generated from this list
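Generating scrape jobs from the exporter database could look like the sketch below; the exporter names, port assignments, and output shape are invented for illustration, not Ticketmaster's actual tooling:

```python
# Hypothetical exporter database: exporter name -> port in the
# well-known range reserved for exporters.
EXPORTER_DB = {
    "node_exporter": 9100,
    "app_exporter": 9200,
}

def scrape_jobs(hosts, exporter_db):
    """Build Prometheus scrape_config-style entries from the exporter list."""
    return [
        {
            "job_name": name,
            "static_configs": [
                {"targets": [f"{host}:{port}" for host in hosts]}
            ],
        }
        for name, port in sorted(exporter_db.items())
    ]

jobs = scrape_jobs(["10.0.0.1", "10.0.0.2"], EXPORTER_DB)
print(jobs[0]["job_name"])  # app_exporter (sorted order)
```

Keeping ports in one managed list is also what makes the network ACLs manageable: the allowed port range is known in advance.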
  14. Prometheus Operator • Allows us to easily model complex configuration • Storage configuration is auto-tuned • Alertmanager HA by default • Looking forward: ◦ Federation and sharding ◦ Grafana integration • Company adoption: everyone loves it!