Efficient monitoring in modern environments
Tobias Schmidt - ContainerDays Hamburg 2016
@dagrobie - github.com/grobie
Introduction
About myself
Production Engineer for 5+ years
Container orchestration (in-house, Kubernetes)
Service discovery
Monitoring (Prometheus)
Production readiness
Monitoring
Collecting, processing, aggregating, and displaying real-time
quantitative data about a system, such as query counts and types,
processing times, and server lifetimes.
Site Reliability Engineering - O’Reilly 2016
Why monitor?
Enable automatic alerting
Analysis of long-term trends
Validate new features/experiments/implementations
Debugging
Blackbox vs. Whitebox
Blackbox: Externally observed
What the user sees
Whitebox: Data exposed by the system
Allows acting on imminent issues
Metrics
Export detailed metrics
Attach all relevant information
Use aggregations later in alerts and dashboards
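As a rough illustration of "export detailed, aggregate later", the sketch below keeps one count per label combination and only sums labels away when a coarser view is needed. The names `requests_total`, `observe_request`, and `sum_by_status` are hypothetical and are not the API of any real client library.

```python
from collections import Counter

# Hypothetical in-process counter keyed by (method, path, status).
# Every label is kept at export time; nothing is pre-aggregated.
requests_total = Counter()

def observe_request(method: str, path: str, status: int) -> None:
    """Record one request with all relevant labels attached."""
    requests_total[(method, path, status)] += 1

def sum_by_status(counter: Counter) -> dict:
    """Aggregate later: drop method and path, keep only the status label."""
    totals = {}
    for (_method, _path, status), count in counter.items():
        totals[status] = totals.get(status, 0) + count
    return totals

observe_request("GET", "/users", 200)
observe_request("GET", "/users", 500)
observe_request("POST", "/orders", 200)
by_status = sum_by_status(requests_total)
```

Because the detailed series are kept, the same data could equally be summed by path or by method for a different dashboard or alert.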
Four golden signals
Minimum set of metrics every service should have
Coined by Google SRE
Latency
Time to serve user requests
Median doesn’t reflect user experience
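A small worked example of why the median can mislead, using made-up latency samples: 95 fast requests and 5 very slow ones. The numbers are illustrative assumptions, not measurements from the talk.

```python
import statistics

# Made-up latency samples in milliseconds: 95 fast requests, 5 slow ones.
latencies_ms = [20.0] * 95 + [2000.0] * 5

median_ms = statistics.median(latencies_ms)

# quantiles(n=100) returns the 99 cut points between percentiles;
# index 98 is the 99th percentile.
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]

print(f"median={median_ms} ms, p99={p99_ms} ms")
```

Here the median stays at the fast value even though one request in twenty takes two seconds, which is why high percentiles are usually tracked alongside it.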
Traffic
Demand placed on a system
(HTTP requests, network throughput, transactions, …)
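Traffic is often derived from a cumulative request counter rather than measured directly. The sketch below estimates a per-second rate from timestamped counter samples, assuming that a drop in the counter means the process restarted and began counting from zero. The function name `per_second_rate` and the sample data are hypothetical.

```python
def per_second_rate(samples):
    """Estimate a per-second rate from (timestamp_seconds, counter_value)
    pairs, ordered oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # Assumption: a lower value means the counter was reset to zero,
        # so everything counted since the reset is new traffic.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0

# Counter resets between t=15 and t=30 (250 -> 40).
samples = [(0, 100), (15, 250), (30, 40), (45, 190)]
rate = per_second_rate(samples)
```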
Errors
Failure responses to user requests
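One common way to turn failure responses into a single signal is an error ratio. The sketch below assumes HTTP semantics where only 5xx responses count as failures; that cutoff and the name `error_ratio` are assumptions for illustration, and a real service may define failure differently.

```python
def error_ratio(responses_by_status):
    """Fraction of responses that failed, given {status_code: count}."""
    total = sum(responses_by_status.values())
    if total == 0:
        return 0.0  # no traffic, so no observed failures
    # Assumption: only server-side errors (5xx) are failures to the user.
    failed = sum(count for status, count in responses_by_status.items()
                 if status >= 500)
    return failed / total

ratio = error_ratio({200: 970, 404: 20, 503: 10})
```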
Saturation & Utilization
Consumption of constrained resources
(Memory, I/O, CPU slices, …)
Alerting
Use symptom-based alerting
Monitor for your users
Four golden signals (traffic is tricky)
Only page if something needs immediate human intervention
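A minimal sketch of a symptom-based paging rule: it looks at a user-facing symptom (the error ratio) and pages only when the problem is sustained, so a single bad data point does not wake anyone up. The function `should_page`, the 5% threshold, and the three-point window are hypothetical values chosen for illustration.

```python
def should_page(error_ratios, threshold=0.05, sustained_points=3):
    """Page only if the most recent `sustained_points` error ratios
    all exceed `threshold`. `error_ratios` is ordered oldest first."""
    if len(error_ratios) < sustained_points:
        return False  # not enough data to call the problem sustained
    recent = error_ratios[-sustained_points:]
    return all(ratio > threshold for ratio in recent)

# A single spike that recovers on its own does not page ...
spike = should_page([0.0, 0.2, 0.0, 0.01])
# ... but a sustained elevated error ratio does.
outage = should_page([0.01, 0.08, 0.09, 0.12])
```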