Efficient monitoring in modern environments

Slide 1

Slide 1 text

Efficient monitoring in modern environments
Tobias Schmidt - ContainerDays Hamburg 2016
@dagrobie - github.com/grobie

Slide 2

Slide 2 text

Introduction
About myself
  Production Engineer for 5+ years
  Container orchestration (in-house, Kubernetes)
  Service discovery
  Monitoring (Prometheus)
  Production readiness

Slide 3

Slide 3 text

Monitoring

Slide 4

Slide 4 text

Monitoring
Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, processing times, and server lifetimes.
Site Reliability Engineering - O’Reilly 2016

Slide 5

Slide 5 text

Monitoring

Slide 6

Slide 6 text

Monitoring
Why monitor?
  Enable automatic alerting
  Analysis of long-term trends
  Validate new features/experiments/implementations
  Debugging

Slide 7

Slide 7 text

Monitoring
Blackbox vs. Whitebox
  Blackbox: Externally observed
    What the user sees
  Whitebox: Data exposed by the system
    Allows acting on imminent issues
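One way whitebox data lets you act on an imminent issue is by extrapolating a trend before users notice anything. The sketch below fits a least-squares line to free disk space samples and projects it forward, similar in spirit to the `predict_linear` function in Prometheus. The sample data, the function body, and the four-hour horizon are illustrative assumptions, not taken from the talk.

```python
def predict_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and
    extrapolate it seconds_ahead past the last timestamp.

    Assumes at least two samples with distinct timestamps.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical whitebox metric: free disk space in bytes, sampled every 10 minutes.
free_bytes = [(0, 10e9), (600, 9e9), (1200, 8e9), (1800, 7e9)]

# A projected value below zero means the disk would be full within the horizon.
will_fill_within_4h = predict_linear(free_bytes, 4 * 3600) < 0
print("disk predicted to fill within 4h:", will_fill_within_4h)
```

A blackbox probe would only report the failure once writes start to fail; the exported metric makes the trend visible hours earlier.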

Slide 8

Slide 8 text

Metrics

Slide 9

Slide 9 text

Metrics
Instrument everything
  Host (CPU, memory, I/O, network, filesystem, …)
  Container (CPU, memory, restarts, OOM, throttling, …)
  Applications (throughput, latency, queues, …)

Slide 10

Slide 10 text

Metrics
Export detailed metrics
  Attach all relevant information
  Use aggregations later in alerts and dashboards
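A minimal sketch of "export detailed, aggregate later", assuming a hypothetical request counter keyed by method, path, and status. The metric name, the label set, and the `sum_by` helper are made up for illustration; a real setup would do the aggregation in the monitoring system's query language.

```python
from collections import Counter

# Hypothetical request log: one (method, path, status) label set per request.
requests = [
    ("GET", "/users", 200),
    ("GET", "/users", 500),
    ("POST", "/users", 201),
    ("GET", "/health", 200),
]

# Export side: keep every label, one counter per unique label combination.
http_requests_total = Counter(requests)


def sum_by(counter, label_index):
    """Aggregate a detailed counter down to a single label dimension."""
    totals = Counter()
    for labels, value in counter.items():
        totals[labels[label_index]] += value
    return totals


# Dashboard/alert side: aggregate only when a question is asked.
by_method = sum_by(http_requests_total, 0)
by_status = sum_by(http_requests_total, 2)
print(dict(by_method))
print(dict(by_status))
```

Because the detail is kept at export time, a new question (for example, requests per path) needs no change to the instrumented service.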

Slide 11

Slide 11 text

Metrics
Four golden signals
  Minimum set of metrics every service should have
  Coined by Google SRE

Slide 12

Slide 12 text

Four golden signals
Latency
  Time to serve user requests
  Median doesn’t reflect user experience
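To illustrate why the median can hide what users experience, here is a small sketch with made-up latencies: 95 fast requests and 5 very slow ones. The nearest-rank percentile helper and the numbers are assumptions for illustration only.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 95 fast requests, 5 slow ones.
latencies_ms = [20] * 95 + [2000] * 5

median = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print("median:", median, "ms")
print("p99:", p99, "ms")
```

In this data set the median looks healthy while one request in twenty takes two seconds, which only shows up in the high percentiles.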

Slide 13

Slide 13 text

Four golden signals
Traffic
  Demand placed on a system (HTTP requests, network throughput, transactions, …)

Slide 14

Slide 14 text

Four golden signals
Errors
  Failure responses to user requests

Slide 15

Slide 15 text

Four golden signals
Saturation & Utilization
  Consumption of constrained resources (Memory, I/O, CPU slices, …)
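The four signals can be sketched as a thin wrapper around a request handler. Everything here is a hypothetical illustration: the `GoldenSignals` class, its field names, and the in-flight limit used as the constrained resource are assumptions, and a real service would use a metrics client library instead.

```python
import time


class GoldenSignals:
    """Records traffic, errors, latency, and saturation for one handler."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # assumed capacity of the service
        self.requests_total = 0             # traffic
        self.errors_total = 0               # errors
        self.durations = []                 # latency samples in seconds
        self.in_flight = 0                  # currently running requests

    def observe(self, handler, *args):
        """Run handler(*args) and record all four signals around it."""
        self.requests_total += 1
        self.in_flight += 1
        start = time.monotonic()
        try:
            return handler(*args)
        except Exception:
            self.errors_total += 1
            raise
        finally:
            self.durations.append(time.monotonic() - start)
            self.in_flight -= 1

    def saturation(self):
        """Fraction of the assumed capacity currently in use."""
        return self.in_flight / self.max_in_flight


def double(x):
    return x * 2


def broken(x):
    return x / 0


signals = GoldenSignals(max_in_flight=10)
result = signals.observe(double, 21)
try:
    signals.observe(broken, 1)
except ZeroDivisionError:
    pass  # the error is counted, then re-raised to the caller

print(signals.requests_total, signals.errors_total, len(signals.durations))
```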

Slide 16

Slide 16 text

Alerting

Slide 17

Slide 17 text

Alerting
Use symptom-based alerting
  Monitor for your users
  Four golden signals (traffic is tricky)
  Only page if something needs immediate human intervention
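A symptom-based rule looks at what users experience, for example the share of failed responses, rather than at internal causes such as CPU load. The sketch below routes on the error ratio; the two thresholds and the "page" / "ticket" / "none" outcomes are hypothetical values chosen for illustration, not recommendations from the talk.

```python
def error_ratio(statuses):
    """Fraction of responses that failed from the user's point of view (HTTP 5xx)."""
    if not statuses:
        return 0.0
    failures = sum(1 for status in statuses if status >= 500)
    return failures / len(statuses)


def route_alert(statuses, page_above=0.05, ticket_above=0.01):
    """Page only when users are clearly affected; otherwise file a ticket or stay quiet.

    Both thresholds are assumed values for this sketch.
    """
    ratio = error_ratio(statuses)
    if ratio > page_above:
        return "page"
    if ratio > ticket_above:
        return "ticket"
    return "none"


# Hypothetical windows of recent response codes.
print(route_alert([200] * 90 + [500] * 10))  # many users affected
print(route_alert([200] * 98 + [503] * 2))   # minor degradation
print(route_alert([200] * 100))              # healthy
```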

Slide 18

Slide 18 text

Alerting
Prevent alert fatigue
  Alert grouping
  Provide easy silencing
  Dependencies
  Avoid static thresholds
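Alert grouping can be sketched as collapsing alerts that share selected labels into a single notification. The alert names, label keys, and the `group_alerts` helper are hypothetical; alerting tools typically offer this as configuration rather than code.

```python
from collections import defaultdict

# Hypothetical firing alerts, each described by a set of labels.
alerts = [
    {"alertname": "HighErrorRatio", "service": "api", "instance": "api-1"},
    {"alertname": "HighErrorRatio", "service": "api", "instance": "api-2"},
    {"alertname": "HighLatency", "service": "web", "instance": "web-1"},
]


def group_alerts(alerts, group_by=("alertname", "service")):
    """Bucket alerts by the chosen labels so each bucket sends one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


notifications = group_alerts(alerts)
for key, members in notifications.items():
    print(key, [alert["instance"] for alert in members])
```

Three firing alerts become two notifications here; during a large outage the same grouping keeps one failing service from paging once per instance.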

Slide 19

Slide 19 text

Alerting
Use ticketing system
  Avoid email spam
  Warnings are tasks like new features

Slide 20

Slide 20 text

Alerting
Provide runbooks (playbooks)
  Keep them concise
  Explanation, hints, links
  Dynamic - include recent observations
  Discuss with non-experts

Slide 21

Slide 21 text

Alerting
Practice outages
  “Game days”
  Repeat regularly

Slide 22

Slide 22 text

Acknowledgements
Matt T. Proud, Julius Volz, Björn Rabenstein, Matthias Rampke
Philosophy on Alerting - Rob Ewaschuk

Slide 23

Slide 23 text

Thank you
May the queries flow, and your pagers be quiet.
Tobias Schmidt - ContainerDays Hamburg 2016
@dagrobie - github.com/grobie