Slide 1

Slide 1 text

A practical introduction to observability
Nikolay Stoitsev
Engineering Manager @ Halo DX

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Monitoring

Slide 5

Slide 5 text

Monitoring, Logging

Slide 6

Slide 6 text

Monitoring, Logging, Distributed Tracing

Slide 7

Slide 7 text

Monitoring

Slide 8

Slide 8 text

Monitoring system components
[Diagram: three Application instances, Monitoring System, Time Series Database, Dashboard]

Slide 9

Slide 9 text

Monitoring system components
[Diagram: three Application instances, Monitoring System, Time Series Database, Dashboard]
Time series database examples: Prometheus, Graphite, m3db

Slide 10

Slide 10 text

Monitoring system components
[Diagram: three Application instances, Monitoring System, Time Series Database, Dashboard]
Dashboard examples: Prometheus UI, Grafana

Slide 11

Slide 11 text

Counter

Slide 12

Slide 12 text

Counter increase

Slide 13

Slide 13 text

Timer

Slide 14

Slide 14 text

Labels
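The three building blocks named on the preceding slides (counter, timer, labels) can be sketched with a toy in-memory registry. This is only an illustration: the `Metrics` class, the `search` function, and the `search.duration` metric name are hypothetical and do not correspond to any particular client library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Toy in-memory registry (hypothetical, not a real client library)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    @staticmethod
    def _key(name, labels):
        # A series is identified by its name plus its sorted label pairs.
        return (name, tuple(sorted(labels.items())))

    def increment(self, name, **labels):
        # Counter: a value that only ever goes up.
        self.counters[self._key(name, labels)] += 1

    @contextmanager
    def timer(self, name, **labels):
        # Timer: records how long the wrapped block took, in seconds.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.timings[self._key(name, labels)].append(elapsed)


metrics = Metrics()


def search(kind):
    # Labels (here: type) split one metric name into several series.
    with metrics.timer("search.duration", type=kind):
        metrics.increment("search.success", type=kind)


search("Patient")
search("Patient")
search("Exam")
```

After these three calls the registry holds two `search.success` series, one per `type` label value.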

Slide 15

Slide 15 text

What to watch out for?

Slide 16

Slide 16 text

Cardinality

Slide 17

Slide 17 text

Cardinality
● search.success, app_version=1, type=Patient
● search.success, app_version=1, type=Exam
● search.success, app_version=2, type=Patient
● search.success, app_version=2, type=Exam

Slide 18

Slide 18 text

#1. Don’t add high cardinality tags
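The cardinality of a metric is the product of the number of distinct values of each label, which is why one unbounded label is enough to cause trouble. A minimal sketch of that arithmetic, reusing the `search.success` example; the `user_id` label and its 10,000 values are assumptions added for illustration.

```python
from itertools import product
from math import prod


def series_count(label_values):
    """Number of distinct time series = product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


def enumerate_series(name, label_values):
    """Yield every name/label combination the metrics store would have to keep."""
    keys = sorted(label_values)
    for combo in product(*(label_values[k] for k in keys)):
        yield name + "".join(f", {k}={v}" for k, v in zip(keys, combo))


labels = {"app_version": ["1", "2"], "type": ["Patient", "Exam"]}
series = list(enumerate_series("search.success", labels))  # 2 x 2 = 4 series

# Hypothetical high-cardinality label: one value per user.
labels_with_user = dict(labels, user_id=[str(i) for i in range(10_000)])
exploded = series_count(labels_with_user)  # 2 x 2 x 10,000 series
```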

Slide 19

Slide 19 text

Metrics are not accurate
● The DB engine optimizes for faster operations
● Some operations are performed at a different time resolution
● Metrics are archived for long-term storage
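One way a change of time resolution loses accuracy can be sketched as follows. The sketch assumes the store rolls up raw points by averaging them; real time series databases may use other aggregation functions, and the sample values are invented.

```python
from statistics import mean


def downsample(points, factor):
    """Roll up consecutive groups of `factor` points into their average."""
    return [mean(points[i:i + factor]) for i in range(0, len(points), factor)]


# Hypothetical latency samples in ms, one every 10 seconds, with a single spike.
raw = [100, 100, 100, 100, 100, 2000, 100, 100, 100, 100, 100, 100]

# The same data at one-minute resolution (6 points per bucket).
rolled = downsample(raw, 6)

raw_peak = max(raw)        # the spike is visible in the raw data
rolled_peak = max(rolled)  # the spike is smoothed away after the rollup
```

The raw series peaks at 2000 ms, while the rolled-up series never exceeds roughly 417 ms, so any exact figure read from the coarser data would be wrong.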

Slide 20

Slide 20 text

#2. Don’t rely on metrics infrastructure for BI

Slide 21

Slide 21 text

Don’t use average values
● Averages hide the outliers
● Averages don’t represent typical behavior

Slide 22

Slide 22 text

Use percentiles
● p90 represents the worst experience 90% of the time
● Can measure p90, p95, p99
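The difference between an average and a percentile can be shown with a small calculation. This sketch uses the nearest-rank definition of a percentile, which is one of several common definitions, and the latency samples are invented for illustration.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in ms: 95 fast requests and 5 very slow ones.
latencies_ms = [100] * 95 + [5000] * 5

average = mean(latencies_ms)       # 345 ms, a latency no request actually had
p90 = percentile(latencies_ms, 90)  # 100 ms
p95 = percentile(latencies_ms, 95)  # 100 ms
p99 = percentile(latencies_ms, 99)  # 5000 ms, exposes the slow outliers
```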

Slide 23

Slide 23 text

Histograms
● Shows the whole distribution
● Configurable buckets
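A histogram with configurable buckets can be sketched in a few lines. The bucket boundaries and sample values below are assumptions chosen for illustration, and the counts are per bucket rather than cumulative.

```python
from bisect import bisect_left


def histogram(values, upper_bounds):
    """Count values per bucket.

    Bucket i holds values <= upper_bounds[i] (and above the previous bound);
    the final extra bucket holds everything above the last bound.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        counts[bisect_left(upper_bounds, value)] += 1
    return counts


# Hypothetical bucket upper bounds in ms; these are the "configurable" part.
bounds_ms = [100, 250, 500, 1000]
samples_ms = [80, 100, 120, 300, 450, 700, 5000]

counts = histogram(samples_ms, bounds_ms)
```

Choosing the bounds is a trade-off: more buckets give a finer view of the distribution but produce more series to store.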

Slide 24

Slide 24 text

#3. Use percentiles or histograms

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Example alert

Slide 27

Slide 27 text

Alert levels
Send a Slack/Teams message

Slide 28

Slide 28 text

Alert levels
Send an alert to on-call

Slide 29

Slide 29 text

Alerting tool is usually built into the metrics system

Slide 30

Slide 30 text

Alerts should be
● urgent
● important
● actionable
● real

Slide 31

Slide 31 text

Should represent either ongoing or imminent problems

Slide 32

Slide 32 text

What to watch out for?

Slide 33

Slide 33 text

#1. Better to remove an alert when it’s noisy

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

#2. Use success rate
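Alerting on a success rate rather than a raw error count keeps the alert meaningful as traffic grows or shrinks. A minimal sketch that also maps the rate onto the two alert levels from the earlier slides; the 99% and 95% thresholds and the level names are assumptions, not values from the talk.

```python
def success_rate(successes, total):
    """Fraction of successful requests; treat 'no traffic' as healthy."""
    return 1.0 if total == 0 else successes / total


def alert_level(rate, warn_below=0.99, page_below=0.95):
    """Map a success rate onto an alert level (thresholds are hypothetical)."""
    if rate < page_below:
        return "page-oncall"
    if rate < warn_below:
        return "chat-message"
    return "ok"


# 20 failures mean very different things at different traffic volumes.
busy = alert_level(success_rate(99_980, 100_000))  # 99.98% success
quiet = alert_level(success_rate(80, 100))          # 80% success
```

With a fixed threshold on the error count, both cases would look identical; the rate separates them.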

Slide 36

Slide 36 text

Symptom-based monitoring
● Number of 5xx HTTP response codes
● Response time
● Email sending is not working
● Users can’t log in

Slide 37

Slide 37 text

Cause-based monitoring
● Free disk space on database server
● Memory utilisation
● Free file descriptors

Slide 38

Slide 38 text

Many causes may trigger a symptom

Slide 39

Slide 39 text

User impact is most important

Slide 40

Slide 40 text

#3. Focus on symptom-based alerts

Slide 41

Slide 41 text

Cause-based alerts are also necessary

Slide 42

Slide 42 text

Picking alerts to start with
[Diagram: Front-end, Load Balancer, Back-end, DB]
● Count the rate of successful log-ins
● Count the request success rate

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Logging

Slide 45

Slide 45 text

Logging system
[Diagram: three Application instances, three Log Collectors, Log Aggregation Database, Dashboard]
Log collector examples: Logstash, Fluentd

Slide 46

Slide 46 text

Logging system
[Diagram: three Application instances, three Log Collectors, Log Aggregation Database, Dashboard]
Log aggregation database examples: Elasticsearch, Loki

Slide 47

Slide 47 text

Logging system
[Diagram: three Application instances, three Log Collectors, Log Aggregation Database, Dashboard]
Dashboard example: Kibana

Slide 48

Slide 48 text

Log messages

Slide 49

Slide 49 text

Finding logs
Can search by:
● content of the log message: message : *notification*
● all logs from a service: kubernetes.labels.app/name.keyword : "api-gateway"
● many more, thanks to the flexible query schema

Slide 50

Slide 50 text

What to watch out for?

Slide 51

Slide 51 text

#1. Use appropriate log level - info, warn, error
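The effect of log levels can be sketched with Python's standard `logging` module. The logger name `notifications` and the messages are hypothetical; the point is that a level threshold lets low-priority messages be dropped without code changes.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("notifications")  # hypothetical service name
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.WARNING)  # drop everything below warn

logger.info("notification queued")              # routine event: filtered out
logger.warning("retrying notification delivery")  # unusual but handled
logger.error("notification delivery failed")      # needs attention

output = stream.getvalue().splitlines()
```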

Slide 52

Slide 52 text

Structured logging
● Append useful key=value pairs
● Can group (aggregate) by the keys
● Can sort by aggregations

Slide 53

Slide 53 text

#2. Use structured logging
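Structured logs can be grouped and sorted by their keys without any text parsing. A minimal sketch that emits JSON lines, which is one common way to encode key/value pairs; the field names (`service`, `channel`) and the messages are assumptions for illustration.

```python
import json
from collections import Counter


def log_line(level, message, **fields):
    """Render one structured log record as a JSON line."""
    return json.dumps({"level": level, "message": message, **fields}, sort_keys=True)


lines = [
    log_line("error", "notification failed", service="api-gateway", channel="email"),
    log_line("error", "notification failed", service="api-gateway", channel="sms"),
    log_line("info", "notification sent", service="api-gateway", channel="email"),
    log_line("error", "notification failed", service="billing", channel="email"),
]

# Group (aggregate) error records by the "service" key...
errors_by_service = Counter(
    record["service"]
    for record in map(json.loads, lines)
    if record["level"] == "error"
)

# ...and sort by the aggregated count, highest first.
ranked = errors_by_service.most_common()
```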

Slide 54

Slide 54 text

Too many logs
[Diagram: three Application instances, three Log Scrapers, Log Aggregation, Real Time Search Engine, Dashboard]

Slide 55

Slide 55 text

Too many logs
[Diagram: three Application instances, three Log Scrapers, Log Aggregation, Real Time Search Engine, Dashboard]
Reduce log retention period

Slide 56

Slide 56 text

Too many logs
[Diagram: three Application instances, three Log Scrapers, Log Aggregation, Real Time Search Engine, Dashboard, Cold Storage, Query UI]

Slide 57

Slide 57 text

#3. Use proper retention period or cold storage
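A retention policy with a cold-storage tier can be expressed as a simple age check. This is only a sketch: the tier names and the 14-day and 365-day limits are assumptions, and real log platforms usually apply such policies through their own lifecycle configuration.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(log_time, now, hot_days=14, cold_days=365):
    """Decide where a log record belongs based on its age (limits are hypothetical)."""
    age = now - log_time
    if age <= timedelta(days=hot_days):
        return "hot"     # kept in the real-time search engine
    if age <= timedelta(days=cold_days):
        return "cold"    # moved to cheaper cold storage
    return "delete"      # past the retention period


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
recent = storage_tier(now - timedelta(days=1), now)
older = storage_tier(now - timedelta(days=30), now)
expired = storage_tier(now - timedelta(days=400), now)
```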

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

Distributed tracing
https://www.youtube.com/watch?v=rM1z7Q1TxR0

Slide 60

Slide 60 text

End-to-end summary
1. Configure automated alerts

Slide 61

Slide 61 text

End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem

Slide 62

Slide 62 text

End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem
3. Use structured logging to find the root cause of the problem easily

Slide 63

Slide 63 text

End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem
3. Use structured logging to find the root cause of the problem easily
4. Fix problems and make sure all metrics are always back to normal

Slide 64

Slide 64 text

Thank you! Q&A
Nikolay Stoitsev
Engineering Manager at Halo DX
Photos by Pixabay, Şahin Sezer Dinçer, Andrea Piacquadio, and Ian Beckley from Pexels