A practical introduction to observability

Nikolay Stoitsev, Engineering Manager @ Halo DX

Monitoring • Logging • Distributed Tracing

Monitoring

Monitoring system components
• Applications
• Monitoring System
• Time Series Database: Prometheus, Graphite, m3db
• Dashboard: Prometheus UI, Grafana

Counter

Counter increase

Labels
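To make the counter and label ideas concrete, here is a minimal in-memory sketch. It is not the API of Prometheus or any other metrics client; the `LabeledCounter` class and its methods are hypothetical names chosen for illustration. It shows the two properties the slides rely on: a counter only increases, and each distinct combination of label values becomes its own series.

```python
from collections import Counter


class LabeledCounter:
    """Toy counter (hypothetical, not a real client API) keyed by label values."""

    def __init__(self, name):
        self.name = name
        self._values = Counter()

    def inc(self, amount=1, **labels):
        if amount < 0:
            raise ValueError("a counter can only increase")
        # Sort the labels so the same label set always maps to the same series.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def value(self, **labels):
        # Unknown label sets read as 0 and do not create a new series.
        return self._values[tuple(sorted(labels.items()))]

    def series_count(self):
        return len(self._values)


search_success = LabeledCounter("search.success")
search_success.inc(app_version="1", type="Patient")
search_success.inc(app_version="1", type="Patient")
search_success.inc(app_version="2", type="Exam")

print(search_success.value(app_version="1", type="Patient"))
print(search_success.series_count())
```

Real metrics libraries add thread safety, exposition formats, and rate calculations on top of this idea.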

What to watch out for?

Cardinality

Cardinality
• search.success, app_version=1, type=Patient
• search.success, app_version=1, type=Exam
• search.success, app_version=2, type=Patient
• search.success, app_version=2, type=Exam

#1. Don’t add high cardinality tags
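A rough way to see why high cardinality tags hurt: in the worst case, the number of time series for one metric is the product of the number of distinct values of each label. The sketch below uses the two labels from the search.success example; the `user_id` label and its 10,000 values are an assumption added only to show the blow-up.

```python
import math


def series_count(label_values):
    """Upper bound on distinct time series for one metric name."""
    return math.prod(len(values) for values in label_values.values())


labels = {"app_version": ["1", "2"], "type": ["Patient", "Exam"]}
low = series_count(labels)

# Hypothetical high cardinality label: one value per user.
labels_with_user = dict(labels, user_id=[str(i) for i in range(10_000)])
high = series_count(labels_with_user)

print(low, high)
```

Two labels with two values each give 4 series; adding the per-user label multiplies that by 10,000.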

Metrics are not accurate
• DB engine optimizes for faster operations
• When performing some operations for a different time resolution
• When archiving metrics for long term storage

#2. Don’t rely on metrics infrastructure for BI

Don’t use average values
• Averages hide the outliers
• Averages don’t represent typical behavior

Use percentiles
• p90 represents the worst experience 90% of the time
• Can measure p90, p95, p99
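A small worked example of averages versus percentiles, using made-up latency samples (95 fast requests and 5 very slow ones). The `percentile` helper uses the simple nearest-rank definition; real metrics systems often interpolate or estimate percentiles from buckets, so treat this as a sketch.

```python
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, to avoid floating point surprises.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Made-up data: 95 requests at 100 ms and 5 requests at 5000 ms.
latencies_ms = [100] * 95 + [5000] * 5

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)

print(avg, p50, p90, p99)
```

The average comes out at 345 ms, a latency that no request actually had. p50 and p90 show the typical 100 ms experience, while p99 surfaces the 5000 ms outliers.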

Histograms
• Show the whole distribution
• Configurable buckets
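A sketch of what configurable buckets mean in practice: each sample is counted in the first bucket whose upper bound is greater than or equal to it, with one extra overflow bucket. The bounds and samples are invented for illustration, and this is not how any particular metrics library stores histograms.

```python
from bisect import bisect_left


def histogram(samples, upper_bounds):
    """Count samples per bucket; the last slot catches values above every bound."""
    counts = [0] * (len(upper_bounds) + 1)
    for sample in samples:
        # Index of the first bound that is >= sample.
        counts[bisect_left(upper_bounds, sample)] += 1
    return counts


bounds_ms = [50, 100, 250, 500, 1000]  # assumed bucket bounds, must be sorted
samples_ms = [20, 45, 80, 100, 120, 300, 950, 4000]
counts = histogram(samples_ms, bounds_ms)

for bound, count in zip(bounds_ms + ["+Inf"], counts):
    print(f"<= {bound}: {count}")
```

Choosing the bounds is the design decision: buckets that are too coarse around the latencies you care about hide the same detail an average would.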

#3. Use percentiles or histograms

Example alert

Alert Levels
• Send Slack/Teams message
• Send alert to oncall

Alerting tool is usually built into the metrics system

Alerts should be
• urgent
• important
• actionable
• real

Should represent either ongoing or imminent problems

#1. Better to remove an alert when it’s noisy

#2. Use success rate
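A minimal sketch of a success-rate alert with the two alert levels mentioned earlier (a chat message and a page to oncall). The thresholds of 99% and 95% and the function names are assumptions for illustration; in practice the rule would live in the alerting tool of the metrics system. One reason a rate is usually easier to alert on than a raw error count is that it stays comparable as traffic grows or shrinks.

```python
def success_rate(successes, total):
    """Fraction of successful requests, or None when there was no traffic."""
    if total == 0:
        return None
    return successes / total


def alert_level(rate, warn_below=0.99, page_below=0.95):
    """Map a success rate to an alert level (thresholds are assumed values)."""
    if rate is None:
        return "none"          # no traffic, nothing to judge
    if rate < page_below:
        return "page-oncall"   # urgent: send alert to oncall
    if rate < warn_below:
        return "chat-message"  # lower level: send Slack/Teams message
    return "none"


print(alert_level(success_rate(999, 1000)))
print(alert_level(success_rate(980, 1000)))
print(alert_level(success_rate(900, 1000)))
```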

Symptom-based monitoring
• Number of 5xx HTTP response codes
• Response time
• Email sending is not working
• Users can’t log in

Cause-based monitoring
• Free disk space on database server
• Memory utilisation
• Free file descriptors

Many causes may trigger a symptom

User impact is most important

#3. Focus on symptom-based alerts

Cause-based alerts are also necessary

Picking alerts to start with
Components: Front-end, Load Balancer, Back-end, DB
• Count rate of successful log-ins
• Count request success rate

Logging

Logging system
• Applications
• Log Collectors: Logstash, Fluentd
• Log Aggregation
• Database: Elasticsearch, Loki
• Dashboard: Kibana

Log messages

Finding logs
Can search by:
• content of log message: message : *notification*
• all logs from a service: kubernetes.labels.app/name.keyword : "api-gateway"
• many more, thanks to the flexible query schema

#1. Use appropriate log level - info, warn, error

Structured logging
• Append useful key=value pairs
• Can group (aggregate) by the keys
• Can sort by aggregations

#2. Use structured logging
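A sketch of structured logging with Python's standard `logging` module. The `kv` and `parse_fields` helpers, the logger name, and the field names are all hypothetical; the output is captured in memory here only so it can be inspected. It appends key=value pairs to each message, then groups the resulting lines by one key, which is the kind of aggregation a log dashboard would do for you.

```python
import io
import logging
from collections import Counter


def kv(message, **fields):
    """Append key=value pairs to a log message (values must not contain spaces)."""
    pairs = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"{message} {pairs}" if pairs else message


def parse_fields(line):
    """Recover the key=value pairs from one log line."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("notification-service")  # assumed service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info(kv("notification sent", user_id=42, channel="email", duration_ms=130))
logger.error(kv("notification failed", user_id=43, channel="sms", reason="timeout"))

output = stream.getvalue()
by_channel = Counter(parse_fields(line)["channel"] for line in output.splitlines())
print(output, by_channel)
```

Many teams emit JSON instead of key=value text for the same purpose; the point is that fields are machine-readable rather than buried in free text.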

Too many logs
Components: Applications, Log Scrapers, Log Aggregation, Real Time Search Engine, Dashboard
• Reduce log retention period
• Add Cold Storage with a Query UI

#3. Use proper retention period or cold storage

Distributed tracing https://www.youtube.com/watch?v=rM1z7Q1TxR0

End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem
3. Use structured logging to find the root cause of the problem easily
4. Fix problems and make sure all metrics are always back to normal

Thank you! Q&A Nikolay Stoitsev Engineering Manager at Halo DX
Photo by Pixabay, Şahin Sezer Dinçer, Andrea Piacquadio, Ian Beckley from Pexels
