Metrics are not accurate
● The DB engine trades precision for faster operations
● Precision is lost when operations run at a different time resolution
● Precision is lost when metrics are archived for long-term storage
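A minimal sketch of how a change in time resolution loses detail. It assumes a hypothetical series of per-second latency samples that is rolled up into one averaged point per minute, as a metrics store might do when archiving. The `downsample` helper and all numbers are illustrative, not taken from any specific metrics system.

```python
from statistics import mean

def downsample(samples, window):
    """Roll raw samples up into one averaged point per window (hypothetical helper)."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

# 60 per-second latency samples in ms: steady 100 ms with one 5-second spike.
raw = [100] * 55 + [2000] * 5

# One point per minute, as an archive at coarser resolution might keep.
rolled_up = downsample(raw, window=60)

print(max(raw))     # the spike is visible at full resolution
print(rolled_up)    # a single averaged point; the spike is no longer visible
```

The rolled-up point is much closer to the steady value than to the spike, which is why archived metrics are a poor source for exact figures.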
Slide 20
Slide 20 text
#2. Don’t rely on metrics
infrastructure for BI
Slide 21
Slide 21 text
Don’t use average values
● Averages hide the outliers
● They don’t represent typical behavior
Slide 22
Slide 22 text
Use percentiles
● p90 is the worst experience in 90% of cases
● Can measure p90, p95, p99
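A minimal sketch comparing an average with percentiles, assuming the nearest-rank definition of a percentile and a made-up set of response times. The `percentile` helper is illustrative; real metrics systems often estimate percentiles from histogram buckets instead.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical response times in ms: mostly fast, with a few very slow requests.
latencies = [50] * 90 + [100] * 5 + [5000] * 5

average = sum(latencies) / len(latencies)
p90 = percentile(latencies, 90)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

print(average, p90, p95, p99)
```

In this data the average matches no real request, while p90 and p95 describe what most requests see and p99 exposes the slow outliers.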
Slide 23
Slide 23 text
Histograms
● Show the whole distribution
● Configurable buckets
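A minimal sketch of a histogram with configurable buckets, assuming each bucket is defined by an inclusive upper bound and that anything above the last bound falls into an overflow bucket. The bucket bounds and sample values are hypothetical.

```python
from bisect import bisect_left

def histogram(values, upper_bounds):
    """Count values into buckets with inclusive upper bounds, plus one overflow bucket."""
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        # bisect_left finds the first bound that is >= v, i.e. the bucket v belongs to.
        counts[bisect_left(upper_bounds, v)] += 1
    return counts

buckets_ms = [100, 250, 500, 1000]               # configurable bucket bounds
latencies = [40, 90, 100, 120, 300, 450, 800, 2500]

counts = histogram(latencies, buckets_ms)

labels = [f"<= {b} ms" for b in buckets_ms] + [f"> {buckets_ms[-1]} ms"]
for label, count in zip(labels, counts):
    print(f"{label:>10}: {count}")
```

Choosing the bounds is the main design decision: buckets that are too wide hide the same outliers that an average hides.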
Slide 24
Slide 24 text
#3. Use percentiles or
histograms
Slide 26
Slide 26 text
Example alert
Slide 27
Slide 27 text
Alert Levels
Send Slack/Teams Message
Slide 28
Slide 28 text
Alert Levels
Send alert to on-call
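A minimal sketch of the two alert levels, assuming the lower level goes to a chat channel and the higher level pages the on-call engineer. The level names, thresholds, and channel names are hypothetical placeholders, and nothing is actually sent.

```python
from enum import Enum

class Level(Enum):
    WARNING = "warning"
    CRITICAL = "critical"

def classify(error_rate, warn_at=0.01, page_at=0.05):
    """Map an error rate to an alert level, or None when no alert is needed."""
    if error_rate >= page_at:
        return Level.CRITICAL
    if error_rate >= warn_at:
        return Level.WARNING
    return None

def route(level):
    """Pick a notification channel for an alert level (channel names are placeholders)."""
    if level is Level.CRITICAL:
        return "page-oncall"
    if level is Level.WARNING:
        return "chat-message"
    return None

for rate in (0.001, 0.02, 0.10):
    print(rate, route(classify(rate)))
```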
Slide 29
Slide 29 text
An alerting tool is usually built
into the metrics system
Slide 30
Slide 30 text
Alerts should be
● urgent
● important
● actionable
● real
Slide 31
Slide 31 text
Alerts should represent either
ongoing or imminent
problems
Slide 32
Slide 32 text
What to watch out for?
Slide 33
Slide 33 text
#1. Better to remove an alert
when it’s noisy
Slide 35
Slide 35 text
#2. Use success rate
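A minimal sketch of one reading of this advice: alert on the fraction of successful requests rather than on a raw error count, so the alert behaves the same at low and high traffic. The 99% threshold and the choice to treat an idle window as healthy are assumptions.

```python
def success_rate(successes, total):
    """Fraction of successful requests; an idle window is treated as healthy."""
    if total == 0:
        return 1.0
    return successes / total

def should_alert(successes, total, threshold=0.99):
    """Alert when the success rate drops below the threshold."""
    return success_rate(successes, total) < threshold

# Both windows contain exactly 10 failed requests.
print(should_alert(9990, 10000))   # busy period: 99.9% success
print(should_alert(90, 100))       # quiet period: 90% success
```

A fixed error-count threshold would treat both windows the same, even though users are affected very differently.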
Slide 36
Slide 36 text
Symptom-based monitoring
● Number of 5xx HTTP response codes
● Response time
● Email sending is not working
● Users can’t log in
Slide 37
Slide 37 text
Cause-based monitoring
● Free disk space on database server
● Memory utilisation
● Free file descriptors
Slide 38
Slide 38 text
Many causes may trigger a
symptom
Slide 39
Slide 39 text
User impact is most
important
Slide 40
Slide 40 text
#3. Focus on
symptom-based alerts
Slide 41
Slide 41 text
Cause-based alerts are
also necessary
Slide 42
Slide 42 text
Picking alerts to start with
[Diagram: Front-end, Load Balancer, Back-end, DB]
● Count the rate of successful log-ins
● Count the request success rate
Finding logs
Can search by:
● content of log message
  message : *notification*
● all logs from a service
  kubernetes.labels.app/name.keyword : "api-gateway"
● many more criteria, thanks to the flexible query schema
Slide 50
Slide 50 text
What to watch out for?
Slide 51
Slide 51 text
#1. Use the appropriate log
level: info, warn, error
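A minimal sketch of the three levels using Python's standard `logging` module. The logger name and messages are hypothetical, and the mapping of situations to levels is one common convention rather than a rule from the talk.

```python
import logging

logging.basicConfig(format="%(levelname)s %(name)s %(message)s")
log = logging.getLogger("checkout")
log.setLevel(logging.INFO)   # emit info and above from this logger

log.info("order placed")                    # normal, expected event
log.warning("payment retry scheduled")      # unexpected, but handled
log.error("payment provider unreachable")   # the operation failed
```

Consistent levels make it possible to filter a search down to errors only, or to alert on the rate of error-level lines.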
Slide 52
Slide 52 text
Structured logging
● Append useful key=value pairs
● Can group (aggregate) by the keys
● Can sort by aggregations
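A minimal sketch of the three points above: append key=value pairs, group by a key, and sort by the aggregated count. The `format_kv` and `parse_kv` helpers, the field names, and the `notifications` service are hypothetical; a real setup would usually rely on a logging library and the log search engine for this.

```python
from collections import Counter

def format_kv(message, **fields):
    """Render a log line as the message followed by key=value pairs."""
    pairs = " ".join(f"{key}={value}" for key, value in fields.items())
    return f"{message} {pairs}".strip()

def parse_kv(line):
    """Extract key=value pairs from a log line (values must not contain spaces)."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)

lines = [
    format_kv("request failed", service="api-gateway", status=502),
    format_kv("request failed", service="api-gateway", status=504),
    format_kv("request failed", service="notifications", status=500),
]

# Group (aggregate) by a key, then sort by the aggregated count.
by_service = Counter(parse_kv(line)["service"] for line in lines)
top = by_service.most_common()

print(lines[0])
print(top)
```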
Slide 53
Slide 53 text
#2. Use structured logging
Slide 54
Slide 54 text
Too many logs
[Diagram of the logging pipeline: Application (×3), Log Scraper (×3), Log Aggregation, Real Time Search Engine, Dashboard]
Slide 55
Slide 55 text
Too many logs
[Diagram of the logging pipeline: Application (×3), Log Scraper (×3), Log Aggregation, Real Time Search Engine, Dashboard]
● Reduce log retention period
Slide 56
Slide 56 text
Too many logs
[Diagram of the logging pipeline: Application (×3), Log Scraper (×3), Log Aggregation, Real Time Search Engine, Dashboard, plus Cold Storage and a Query UI]
Slide 63
Slide 63 text
End-to-end summary
1. Configure automated alerts
2. Use metrics and tracing to pinpoint the problem
3. Use structured logging to find the root cause of the problem easily
4. Fix the problems and make sure all metrics are back to normal
Slide 64
Slide 64 text
Thank you! Q&A
Nikolay Stoitsev
Engineering Manager at Halo DX
Photo by Pixabay, Şahin Sezer Dinçer, Andrea Piacquadio, Ian Beckley from Pexels