
Production Backbone Monitoring Containerized Apps

Brandon Philips

October 13, 2017

Transcript

  1. Monitoring is the backbone of a production application. With many

    teams shifting their application infrastructure to Kubernetes and containerization, new opportunities to introduce better monitoring and alerting open up. Come learn about the fundamentals of container monitoring, best practices, and the abstractions Kubernetes gives teams to create production-ready monitoring infrastructure. We will discuss Prometheus and Kubernetes, and the demos will be done on the CoreOS Kubernetes platform, Tectonic.
  2. $ while read host; do ssh $host …; done < hosts

    Problems: no monitoring, no state to recover
  3. - Hybrid Core: consistent installation, management, and automated operations across AWS, Azure, VMware, and bare metal

    - Enterprise Governance: federation with corporate identity and enforcement of access across all interfaces
    - Cloud Services: etcd and Prometheus open cloud services
    - Monitoring and Management: powered by Prometheus

    https://coreos.com/tectonic
  4. Prometheus - Container Native Monitoring

    • Multi-dimensional time series
    • Metrics, not logging, not tracing
    • No magic!
  5. A lot of traffic to monitor. Monitoring should consume only a fraction

    of user traffic. Solution: a compact metrics format.
  6. Target (container) /metrics

    # HELP http_requests_total Total number of HTTP requests made.
    # TYPE http_requests_total counter
    http_requests_total{code="200",path="/status"} 8
  7. Target (container) /metrics

    # HELP http_requests_total Total number of HTTP requests made.
    # TYPE http_requests_total counter
    http_requests_total{code="200",path="/status"} 8

    Metric name: http_requests_total
  8. Target (container) /metrics

    # HELP http_requests_total Total number of HTTP requests made.
    # TYPE http_requests_total counter
    http_requests_total{code="200",path="/status"} 8

    Label: {code="200",path="/status"}
  9. Target (container) /metrics

    # HELP http_requests_total Total number of HTTP requests made.
    # TYPE http_requests_total counter
    http_requests_total{code="200",path="/status"} 8

    Value: 8
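To make the three parts called out on the slides above concrete, here is a minimal Python sketch that splits one sample line into metric name, labels, and value. It is not the Prometheus parser: the regular expressions and the helper names `parse_sample` and `parse_metrics` are assumptions for illustration, and the sketch ignores escaping, timestamps, and the other details of the real exposition format.

```python
import re

# Simplified patterns (assumptions): a metric name, an optional {label} block,
# and a value. Escaped quotes and trailing timestamps are not handled.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def parse_metrics(text):
    """Parse a /metrics body, skipping blank lines and # HELP / # TYPE comments."""
    samples = []
    for line in text.splitlines():
        if line.strip() and not line.lstrip().startswith("#"):
            samples.append(parse_sample(line))
    return samples


body = """# HELP http_requests_total Total number of HTTP requests made.
# TYPE http_requests_total counter
http_requests_total{code="200",path="/status"} 8
"""
name, labels, value = parse_metrics(body)[0]
print(name)    # http_requests_total
print(labels)  # {'code': '200', 'path': '/status'}
print(value)   # 8.0
```

The whole sample fits on one short text line, which is what keeps the scrape cheap relative to user traffic.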
  10. Demo - real metrics endpoint

    • Deploy example app
    • See example app in Console
    • Visit the website and metrics site
  11. Web  w0xjp  frontend  philips  prod
      Web  7wtk3  frontend  rithu    dev
      Web  09xtx  backend   rithu    dev
      Web  c010m  backend   philips  prod
  12. Web  w0xjp  frontend  philips  prod
      Web  7wtk3  frontend  rithu    dev
      Web  09xtx  backend   rithu    dev
      Web  c010m  backend   philips  prod
  13. Web  w0xjp  frontend  philips  prod
      Web  7wtk3  frontend  rithu    dev
      Web  09xtx  backend   rithu    dev
      Web  c010m  backend   philips  prod
  14. Web  w0xjp  frontend  philips  dev
      Web  7wtk3  frontend  rithu    prod
      Web  09xtx  backend   rithu    dev
      Web  c010m  backend   philips  prod
  15. Demo - Prometheus targets

    • Spin up new Prometheus instance
    • Prometheus will select targets based on labels
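Label-based target selection can be sketched in a few lines of Python. This is only an illustration of the idea, not how Prometheus or Kubernetes implement it. The pod data comes from the earlier slides, but the column names `tier`, `owner`, and `env` are assumptions because the slides show the columns without headers, and `select` is a hypothetical helper.

```python
# Pods from the earlier slides. The keys "tier", "owner", and "env" are
# assumed names for the unlabeled columns.
PODS = [
    {"name": "w0xjp", "tier": "frontend", "owner": "philips", "env": "prod"},
    {"name": "7wtk3", "tier": "frontend", "owner": "rithu", "env": "dev"},
    {"name": "09xtx", "tier": "backend", "owner": "rithu", "env": "dev"},
    {"name": "c010m", "tier": "backend", "owner": "philips", "env": "prod"},
]


def select(pods, **selector):
    """Return the names of pods whose labels match every key/value in selector."""
    return [
        pod["name"]
        for pod in pods
        if all(pod.get(key) == value for key, value in selector.items())
    ]


print(select(PODS, env="prod"))                      # ['w0xjp', 'c010m']
print(select(PODS, tier="frontend", owner="rithu"))  # ['7wtk3']
```

Changing a label on a pod changes which selectors match it, with no change to the selector itself.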
  16. Demo - querying time series data

    • Show all metrics
    • Build a query for selecting all pods latency
    • Build a query to summarize
  17. Demo - correlating two time series

    • Make node unschedulable
    • Cause containers to be scheduled
    • Correlate node "outage" and deployment stall
  18. Prometheus Alerts

    ALERT <alert name>
      IF <PromQL vector expression>
      FOR <duration>
      LABELS { ... }
      ANNOTATIONS { ... }

    Each result entry is one alert:

    <elem1> <val1>
    <elem2> <val2>
    <elem3> <val3>
    ...
  19. ALERT DiskWillFillIn4Hours
      IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
      FOR 5m
      ANNOTATIONS {
        summary = "device filling up",
        description = "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."
      }
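The idea behind `predict_linear` is a least-squares line fitted through the samples in the range and extrapolated forward. The Python sketch below shows that calculation under simplifying assumptions: it extrapolates from the newest sample rather than from the rule evaluation time, and the free-space samples are hypothetical. The function name mirrors the PromQL function only for readability.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it seconds_ahead past the newest sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    newest = max(t for t, _ in samples)
    return slope * (newest + seconds_ahead) + intercept


# Hypothetical data: one sample per minute for an hour, starting at 10 GB
# free and losing 1 MB of free space per second.
samples = [(t, 10_000_000_000 - 1_000_000 * t) for t in range(0, 3601, 60)]

predicted = predict_linear(samples, 4 * 3600)
print(predicted < 0)  # True: the trend reaches zero free bytes within 4 hours
```

Alerting on the predicted value rather than on the current value gives the operator lead time before the disk is actually full.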
  20. Demo - alerting on exceptional data

    • Critical metrics on nodes
    • Graph those metrics
    • Introduce the concept of alerting
  21. - Hybrid Core: consistent installation, management, and automated operations across AWS, Azure, VMware, and bare metal

    - Enterprise Governance: federation with corporate identity and enforcement of access across all interfaces
    - Cloud Services: etcd and Prometheus open cloud services
    - Monitoring and Management: powered by Prometheus

    https://coreos.com/tectonic
  22. Prometheus Alerts

    ALERT <alert name>
      IF <PromQL vector expression>
      FOR <duration>
      LABELS { ... }
      ANNOTATIONS { ... }

    Each result entry is one alert:

    <elem1> <val1>
    <elem2> <val2>
    <elem3> <val3>
    ...
  23. Prometheus Alerts

    ALERT EtcdNoLeader
      IF etcd_has_leader == 0
      FOR 1m
      LABELS { severity="page" }

    Query result:

    {job="etcd",instance="A"} 0.0
    {job="etcd",instance="B"} 0.0

    Resulting alerts:

    {job="etcd",alertname="EtcdNoLeader",severity="page",instance="A"}
    {job="etcd",alertname="EtcdNoLeader",severity="page",instance="B"}
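The mapping from result entries to alerts on this slide can be sketched as follows. This is a simplified illustration, not Prometheus code: `alerts_from_result` is a hypothetical helper, and it ignores the `FOR` duration, which additionally requires the condition to hold for that long before an alert fires.

```python
def alerts_from_result(alert_name, rule_labels, result):
    """Turn each (labels, value) entry of a query result into one alert label set."""
    alerts = []
    for element_labels, _value in result:
        labels = dict(element_labels)     # labels of the matching time series
        labels.update(rule_labels)        # labels from the rule's LABELS clause
        labels["alertname"] = alert_name  # the rule name becomes a label too
        alerts.append(labels)
    return alerts


# The two entries returned by etcd_has_leader == 0 on the slide.
result = [
    ({"job": "etcd", "instance": "A"}, 0.0),
    ({"job": "etcd", "instance": "B"}, 0.0),
]

for alert in alerts_from_result("EtcdNoLeader", {"severity": "page"}, result):
    print(alert)
```

Two entries in the result produce two alerts, one per etcd instance, each carrying the series labels plus `alertname` and `severity`.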
  24. ALERT HighErrorRate IF sum(rate(request_errors_total[5m])) > 500

    {} 534

    Ehhh: an absolute threshold alerting rule needs constant tuning as traffic changes.
  25. ALERT HighErrorRate IF sum(rate(request_errors_total[5m])) / sum(rate(requests_total[5m])) * 100 > 1

    {} 1.8354

    Meehh: no dimensionality in the result; loss of detail, signal cancellation.
  26. ALERT HighErrorRate IF sum(rate(request_errors_total[5m])) / sum(rate(requests_total[5m])) * 100 > 1

    {} 1.8354

    Diagram labels: high error / low traffic, low error / high traffic, total sum.
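A small numeric example shows why a fully aggregated ratio can hide a failing series. The rates below are hypothetical, chosen only to match the two cases named on the slide, and `error_percent` is an illustrative helper.

```python
# Hypothetical per-second rates for the two kinds of series on the slide.
rates = {
    "high error / low traffic": {"errors": 10.0, "requests": 20.0},
    "low error / high traffic": {"errors": 10.0, "requests": 2000.0},
}


def error_percent(errors, requests):
    """Error ratio as a percentage, as in the alert expression."""
    return errors / requests * 100


# Ratio per series: this is what a rule that keeps dimensions would see.
per_series = {
    name: error_percent(r["errors"], r["requests"]) for name, r in rates.items()
}

# Ratio of the totals: this is what the fully aggregated rule sees.
total = error_percent(
    sum(r["errors"] for r in rates.values()),
    sum(r["requests"] for r in rates.values()),
)

for name, percent in per_series.items():
    print(f"{name}: {percent:.2f}%")
print(f"total sum: {total:.2f}%")
print("aggregated rule fires:", total > 1)
```

With these numbers one series fails half of its requests, yet the aggregated value stays under the 1% threshold because the high-traffic series dominates the sum.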
  27. ALERT HighErrorRate IF sum by(instance, path) (rate(request_errors_total[5m])) / sum by(instance, path) (rate(requests_total[5m])) * 100 > 0.01

    {instance="web-2", path="/api/comments"} 2.435
    {instance="web-1", path="/api/comments"} 1.0055
    {instance="web-2", path="/api/profile"} 34.124
  28. ALERT HighErrorRate IF sum by(instance, path) (rate(request_errors_total[5m])) / sum by(instance, path) (rate(requests_total[5m])) * 100 > 1

    {instance="web-2", path="/api/v1/comments"} 2.435
    ...

    Booo: wrong dimensions, aggregates away dimensions of fault-tolerance.
  29. ALERT HighErrorRate IF sum by(instance, path) (rate(request_errors_total[5m])) / sum by(instance, path) (rate(requests_total[5m])) * 100 > 1

    {instance="web-2", path="/api/v1/comments"} 2.435
    ...

    Diagram labels: instance 1, instance 2..1000.
  30. ALERT HighErrorRate IF sum without(instance) (rate(request_errors_total[5m])) / sum without(instance) (rate(requests_total[5m])) * 100 > 1

    {method="GET", path="/api/v1/comments"} 2.435
    {method="POST", path="/api/v1/comments"} 1.0055
    {method="POST", path="/api/v1/profile"} 34.124
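The effect of `sum without(instance)` can be sketched in Python: drop the `instance` label, sum the series that now share a label set, then divide errors by requests for matching label sets. The helpers `sum_without` and `error_percent_without` and all sample values are hypothetical. The sketch also skips groups with zero requests, which is a simplification of what PromQL does.

```python
from collections import defaultdict


def sum_without(samples, dropped_label):
    """Sum values grouped by every label except dropped_label."""
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != dropped_label))
        totals[key] += value
    return dict(totals)


def error_percent_without(errors, requests, dropped_label):
    """Error percentage per remaining label set, matching errors to requests."""
    err = sum_without(errors, dropped_label)
    req = sum_without(requests, dropped_label)
    return {
        key: err[key] / req[key] * 100
        for key in err
        if key in req and req[key] > 0
    }


# Hypothetical per-second rates for two instances.
web1_get = {"instance": "web-1", "method": "GET", "path": "/api/v1/comments"}
web2_get = {"instance": "web-2", "method": "GET", "path": "/api/v1/comments"}
web1_post = {"instance": "web-1", "method": "POST", "path": "/api/v1/profile"}

errors = [(web1_get, 1.0), (web2_get, 3.0), (web1_post, 0.0)]
requests = [(web1_get, 100.0), (web2_get, 100.0), (web1_post, 50.0)]

percentages = error_percent_without(errors, requests, "instance")
firing = {key: pct for key, pct in percentages.items() if pct > 1}
for key, pct in firing.items():
    print(dict(key), pct)
```

Only `instance` is aggregated away, so `method` and `path` stay in the result and each alert still says which endpoint is failing.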
  31. ALERT DiskWillFillIn4Hours
      IF predict_linear(node_filesystem_free{job='node'}[1h], 4*3600) < 0
      FOR 5m
      ANNOTATIONS {
        summary = "device filling up",
        description = "{{$labels.device}} mounted on {{$labels.mountpoint}} on {{$labels.instance}} will fill up within 4 hours."
      }