ITT 2016 - Björn Rabenstein - Instrumenting Code for Modern Service Monitoring

Instrumenting Code for Modern Service Monitoring Istanbul Tech Talks 2016-04-05
Björn “Beorn” Rabenstein, Production Engineer, SoundCloud Ltd.

Prometheus https://prometheus.io

Instrumentation…
…refers to an ability to monitor or measure the level of a product's performance, to diagnose errors and to write trace information. Programmers implement instrumentation in the form of code instructions that monitor specific components in a system (for example, instructions may output logging information to appear on screen).
(according to Wikipedia)

Dimensions of monitoring
➢ white-box (needs instrumentation) vs. black-box (no changes required)
➢ host-based (“traditional”) vs. service-based (“modern”)

Challenges of service-based monitoring
Scale and semantics
➢ Need a fleet-wide 3,048m view.
  ◦ What’s my overall 99th percentile latency?
  ◦ How many qps am I serving on the /foo endpoint?
  ◦ What’s my error percentage serving from the US-East data center?
➢ Still need to be able to drill down for troubleshooting.
  ◦ Which instance causes those errors I’m seeing?
  ◦ How does the canary instance compare to the others in terms of resource usage and latency?
➢ Meaningful alerting.
  ◦ Symptom-based alerting for pages, cause-based alerting for warnings.
  ◦ See Rob Ewaschuk’s My philosophy on alerting https://goo.gl/2vrpSO
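The fleet-wide view and the drill-down can both come from the same labelled numbers, as long as every sample carries its dimensions. A minimal sketch, assuming made-up request counts and hypothetical label names (`instance`, `path`, `status`); `sum_by` is an illustrative helper, not a Prometheus API:

```python
from collections import defaultdict

# Hypothetical samples: (labels, request count). All values are invented.
SAMPLES = [
    ({"instance": "a", "path": "/foo", "status": "200"}, 980),
    ({"instance": "a", "path": "/foo", "status": "500"}, 20),
    ({"instance": "b", "path": "/foo", "status": "200"}, 1000),
    ({"instance": "canary", "path": "/foo", "status": "200"}, 60),
    ({"instance": "canary", "path": "/foo", "status": "500"}, 40),
]


def sum_by(samples, *label_names):
    """Sum sample values, keeping only the given labels as dimensions."""
    totals = defaultdict(int)
    for labels, value in samples:
        key = tuple(labels[name] for name in label_names)
        totals[key] += value
    return dict(totals)


def error_ratio(samples, *label_names):
    """Share of requests with status 500, per combination of the given labels."""
    errors = sum_by([s for s in samples if s[0]["status"] == "500"], *label_names)
    totals = sum_by(samples, *label_names)
    return {key: errors.get(key, 0) / total for key, total in totals.items()}


# Fleet-wide view: no labels kept, one number for everything.
fleet = error_ratio(SAMPLES)
# Drill-down: keep the instance label to see who causes the errors.
per_instance = error_ratio(SAMPLES, "instance")
print(fleet, per_instance)
```

Aggregating away labels gives the overall picture; keeping one label back answers "which instance?" without a second monitoring system.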

Monitor everything
On all levels. Ideally with the same system.

Level                What to monitor (examples)                        What exposes metrics (example)
Host (OS, hardware)  Hardware failure, provisioning, host resources.   Node exporter
Container            Resource usage, performance characteristics.      cAdvisor
Application          Latency, errors, qps, internal state.             Your own code
Orchestration        Cluster resources, scheduling.                    Kubernetes components

Let’s stare at code together A toy microservice Code!

Instrument with logging 12factor.net compliant Code!

Instrument with logging
➢ 10,000 qps served to users
➢ 1KiB logs per request
➢ 10MiB of logs to ingest per second
➢ With microservices, each external request easily triggers 100 internal service calls.
➢ Now 1M events and 1GiB of logs per second!
➢ Could downsample.
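The arithmetic behind those numbers, as a small sketch. The function name and the `fanout` parameter are illustrative; the inputs are the figures from the slide.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def log_volume(requests_per_second, bytes_per_request, fanout=1):
    """Return (events per second, log bytes per second).

    fanout is the number of internal service calls per external request.
    """
    events = requests_per_second * fanout
    return events, events * bytes_per_request


# Monolith: one log event per user request.
events, volume = log_volume(10_000, 1 * KIB)
print(f"{events} events/s, {volume / MIB:.2f} MiB/s")   # 10000 events/s, 9.77 MiB/s

# Microservices: each external request triggers 100 internal calls.
events, volume = log_volume(10_000, 1 * KIB, fanout=100)
print(f"{events} events/s, {volume / GIB:.2f} GiB/s")   # 1000000 events/s, 0.95 GiB/s
```

The slide rounds 9.77 MiB/s up to 10MiB and 0.95 GiB/s up to 1GiB.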

Metrics!
Just send out labelled numbers now and then.
➢ Gauge: a numerical measurement, goes up and down
➢ Counter: counts things, only ever goes up
➢ Histogram: bucketed counter
Code!
Image attribution: Qwfp at English Wikipedia, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.
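To make the three metric types concrete, here is a minimal stdlib-only sketch. The class names mirror the slide, but this is not the Prometheus client library; the methods and the bucket layout are assumptions chosen for illustration.

```python
import bisect


class Gauge:
    """A numerical measurement that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Counter:
    """Counts things; only ever goes up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Histogram:
    """A bucketed counter: one count per upper bound, plus an overflow bucket."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end for observations above every bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Index of the first bound that is >= value ("less than or equal" bucket).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Running totals per bucket, so the last entry equals count."""
        totals, running = [], 0
        for n in self.bucket_counts:
            running += n
            totals.append(running)
        return totals


latency = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds, made up
for seconds in (0.05, 0.1, 0.3, 2.0):
    latency.observe(seconds)
print(latency.bucket_counts, latency.cumulative())
```

Every bucket is itself a counter, which is why histograms inherit the useful properties of counters.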

Why gauges suck and counters rule
Graphs shamelessly stolen from Jamie Wilkinson
[Graphs: sampling at intervals of Δt]
https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem
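One way to see the sampling argument: a burst that falls between two scrapes is invisible to a sampled gauge, but it still shows up in the difference between two counter readings. A small simulation, assuming an invented traffic pattern and a 10-second scrape interval (both are illustrative, not from the talk):

```python
def sample_gauge(per_second, interval):
    """Read the instantaneous requests-per-second value at each scrape time."""
    return [per_second[t] for t in range(0, len(per_second), interval)]


def counter_rates(per_second, interval):
    """Scrape a running total and derive the average rate between scrapes."""
    total = 0
    readings = []
    for t, requests in enumerate(per_second):
        if t % interval == 0:
            readings.append(total)  # scrape before counting second t
        total += requests
    readings.append(total)  # final scrape at the end of the window
    return [(b - a) / interval for a, b in zip(readings, readings[1:])]


# 30 seconds of traffic at 5 requests/s with a one-second burst at t=13.
traffic = [5] * 30
traffic[13] = 500

gauge_view = sample_gauge(traffic, interval=10)
counter_view = counter_rates(traffic, interval=10)
print(gauge_view)    # [5, 5, 5]
print(counter_view)  # [5.0, 54.5, 5.0]
```

The gauge samples never land on the burst, so it is lost. The counter cannot say when within the interval the burst happened, but no request goes uncounted.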

Apache has instrumentation: mod_status
$ curl http://localhost/server-status/?auto
Total Accesses: 126      ← Counter apache_accesses_total
Total kBytes: 710        ← Counter apache_sent_kilobytes_total
CPULoad: .571685
Uptime: 558
ReqPerSec: .225806       ← Gauge rate(apache_accesses_total[10m])
BytesPerSec: 1302.94     ← Gauge rate(apache_sent_kilobytes_total[10m])
BytesPerReq: 5770.16     ← Gauge rate(apache_sent_kilobytes_total[10m]) / rate(apache_accesses_total[10m])
BusyWorkers: 1
IdleWorkers: 9
See https://github.com/neezgee/apache_exporter
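The three gauges in that output carry no information beyond the two counters and the uptime. A minimal sketch that recomputes them from the values shown on the slide; the parsing helper is illustrative, and the factor of 1024 bytes per kByte is an assumption that happens to reproduce the slide's numbers:

```python
STATUS = """\
Total Accesses: 126
Total kBytes: 710
Uptime: 558
"""


def parse_status(text):
    """Parse 'Key: value' lines from mod_status ?auto output into floats."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = float(value)
    return fields


def derive_gauges(fields):
    """Recompute mod_status' lifetime averages from its counters."""
    accesses = fields["Total Accesses"]
    sent_bytes = fields["Total kBytes"] * 1024  # assumed: 1 kByte = 1024 bytes
    uptime = fields["Uptime"]
    return {
        "ReqPerSec": accesses / uptime,
        "BytesPerSec": sent_bytes / uptime,
        "BytesPerReq": sent_bytes / accesses,
    }


derived = derive_gauges(parse_status(STATUS))
for name, value in derived.items():
    print(f"{name}: {value:.6g}")
```

These are averages since server start. With the raw counters collected over time, the same quantities can be computed over any window, which is what the `rate(...[10m])` expressions on the slide do.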

Next steps

Collect metrics
With Prometheus from…

Level                What to monitor (examples)                        What exposes metrics (example)
Host (OS, hardware)  Hardware failure, provisioning, host resources.   Node exporter
Container            Resource usage, performance characteristics.      cAdvisor
Application          Latency, errors, qps, internal state.             Your own code
Orchestration        Cluster resources, scheduling.                    Kubernetes components

Dashboards with

Alerts with Alertmanager

The End
