A practical introduction to observability

Good observability is essential for modern software. It gives us confidence that our systems are working properly and lets us debug issues efficiently. In this talk, we’ll explore what you need to know to start applying good observability to your projects, along with the most common pitfalls. We’ll start with the tools and basic concepts of monitoring and go over the three most common mistakes people make with it. Then we’ll see how to set up automatic alerts to detect issues and touch on the principles of good alerting. Finally, we’ll see how to build a logging system and how to use it to debug issues efficiently.

Nikolay Stoitsev

August 14, 2021

Transcript

  1. A practical introduction to observability
    Nikolay Stoitsev
    Engineering Manager @ Halo DX

  3. Monitoring
    Logging
    Distributed Tracing

  5. Monitoring system components
    [Diagram: Application (×3), Monitoring System, Time Series Database, Dashboard]
    Time series database examples: Prometheus, Graphite, m3db

  6. Monitoring system components
    [Diagram: Application (×3), Monitoring System, Time Series Database, Dashboard]
    Dashboard examples: Prometheus UI, Grafana

  7. Counter increase
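
A counter only goes up, except when the process restarts and it begins again from zero. The sketch below shows one way an increase over a window could be computed from raw readings, treating any drop as a reset. The function name `counter_increase` and the readings are illustrative assumptions; real systems such as Prometheus also extrapolate to the window boundaries, which this sketch does not.

```python
def counter_increase(samples):
    """Total increase across a list of counter readings, oldest first.

    A reading lower than the previous one is treated as a counter reset
    (for example a process restart), so the new value counts in full.
    """
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # counter restarted from zero
    return total


# Hypothetical readings of a request counter; the drop from 130 to 5 is a restart.
readings = [100, 120, 130, 5, 25]
increase = counter_increase(readings)
```

Subtracting the first reading from the last would give a negative number here, which is why the reset has to be handled explicitly.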

  8. What to watch out for?

  9. Cardinality
    ● search.success, app_version=1, type=Patient
    ● search.success, app_version=1, type=Exam
    ● search.success, app_version=2, type=Patient
    ● search.success, app_version=2, type=Exam
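
Each combination of tag values becomes its own time series, so the number of series is at most the product of the distinct values per tag. A minimal sketch using the tags from the list above; the `user_id` tag and the helper names are hypothetical and only illustrate how quickly the count grows.

```python
from itertools import product
from math import prod

# Tag values taken from the slide: two app versions and two search types.
tags = {
    "app_version": ["1", "2"],
    "type": ["Patient", "Exam"],
}


def series_count(tag_values):
    """Upper bound on time series: product of distinct values per tag."""
    return prod(len(set(values)) for values in tag_values.values())


def series_names(metric, tag_values):
    """Enumerate every metric/tag combination as a readable string."""
    keys = list(tag_values)
    return [
        metric + "".join(f", {k}={v}" for k, v in zip(keys, combo))
        for combo in product(*(tag_values[k] for k in keys))
    ]


low = series_count(tags)  # 2 * 2 combinations

# Hypothetical high-cardinality tag: one value per user.
with_user_id = dict(tags, user_id=[str(i) for i in range(10_000)])
high = series_count(with_user_id)
```

Adding a single per-user tag multiplies the four series into tens of thousands, each of which the time series database has to store and index.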

  10. #1. Don’t add high cardinality tags

  11. Metrics are not accurate
    ● The DB engine optimizes for speed over exactness
    ● Accuracy is lost when operations run at a different time resolution
    ● Accuracy is lost when metrics are archived for long-term storage
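
To make the resolution point concrete, here is a small sketch of what can happen when fine-grained points are rolled up into coarser ones. Averaging is only one assumed roll-up strategy, and the readings are invented for illustration.

```python
from statistics import mean


def downsample(values, factor):
    """Average consecutive groups of `factor` points into one point."""
    return [mean(values[i:i + factor]) for i in range(0, len(values), factor)]


# Hypothetical per-second latency readings in ms with one short spike.
per_second = [10, 10, 10, 10, 10, 500, 10, 10, 10, 10, 10, 10]
per_six_seconds = downsample(per_second, 6)

raw_peak = max(per_second)          # the spike as it actually happened
stored_peak = max(per_six_seconds)  # what survives after the roll-up
```

The 500 ms spike shrinks to under 100 ms in the rolled-up series, so any exact figure read back later would be wrong.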

  12. #2. Don’t rely on metrics infrastructure for BI

  13. Don’t use average values
    ● Averages hide the outliers
    ● They don’t represent typical behavior

  14. Use percentiles
    ● p90 represents the worst experience 90% of the time
    ● Can measure p90, p95, p99
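
A short sketch of why the average can mislead while percentiles do not. It uses the nearest-rank definition of a percentile, which is one of several definitions in use; the latency values are made up.

```python
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100.

    Returns the smallest value that has at least p% of the data at or below it.
    """
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceiling of p/100 * n, in integers
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in ms: most are fast, a few are very slow.
latencies = [100] * 90 + [200] * 5 + [3000] * 5

avg = mean(latencies)
p90 = percentile(latencies, 90)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
```

The average comes out at 250 ms, a value no request actually experienced, while p90, p95 and p99 report 100 ms, 200 ms and 3000 ms and so expose the slow tail.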

  15. Histograms
    ● Show the whole distribution
    ● Configurable buckets
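
A minimal sketch of bucketing observations into a histogram. The bucket bounds and the sample values are assumptions chosen for illustration, and the counts are kept per bucket for readability, whereas Prometheus, for example, stores cumulative bucket counts.

```python
from bisect import bisect_left

# Assumed bucket upper bounds in milliseconds; anything larger goes to "+Inf".
BUCKET_BOUNDS = [50, 100, 250, 500, 1000]


def histogram(values, bounds=BUCKET_BOUNDS):
    """Count observations per bucket; v lands in the first bucket whose bound is >= v."""
    counts = [0] * (len(bounds) + 1)  # last slot is the overflow bucket
    for v in values:
        counts[bisect_left(bounds, v)] += 1
    labels = [f"<= {b}" for b in bounds] + ["+Inf"]
    return dict(zip(labels, counts))


# Hypothetical response times in ms.
observed = [12, 48, 50, 75, 120, 240, 260, 900, 4000]
buckets = histogram(observed)
```

Choosing the bounds is the configurable part: buckets that are too wide hide the shape of the distribution, and too many buckets multiply the number of stored series.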

  16. #3. Use percentiles or histograms

  17. Example alert

  18. Alert Levels
    Send Slack/Teams Message

  19. Alert Levels
    Send alert to oncall

  20. The alerting tool is usually built into the metrics system

  21. Alerts should be
    ● urgent
    ● important
    ● actionable
    ● real

  22. Alerts should represent either ongoing or imminent problems

  23. What to watch out for?

  24. #1. Better to remove an alert when it’s noisy

  25. #2. Use success rate
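
One likely reason to alert on a success rate rather than a raw error count is that a fixed count means very different things at different traffic levels. The sketch below illustrates this under assumed thresholds; the level names `chat` and `oncall` mirror the two alert destinations mentioned earlier in the deck, and every number is hypothetical.

```python
def success_rate(successes, total):
    """Fraction of successful requests; None when there was no traffic."""
    if total == 0:
        return None
    return successes / total


def alert_level(rate, warn_below=0.99, page_below=0.95):
    """Map a success rate to an alert level using assumed thresholds."""
    if rate is None:
        return "ok"  # no traffic, nothing to judge (a design choice)
    if rate < page_below:
        return "oncall"
    if rate < warn_below:
        return "chat"
    return "ok"


# The same 50 failures matter very differently depending on traffic.
quiet_hour = alert_level(success_rate(50, 100))          # 50% success
busy_hour = alert_level(success_rate(99_950, 100_000))   # 99.95% success
```

An alert on "more than 40 errors" would fire in both cases, while the rate-based check pages only for the quiet hour, where half of all requests failed.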

  26. Symptom-based monitoring
    ● Number of 5xx HTTP response codes
    ● Response time
    ● Email sending is not working
    ● Users can’t log in

  27. Cause-based monitoring
    ● Free disk space on database server
    ● Memory utilisation
    ● Free file descriptors

  28. Many causes may trigger a symptom

  29. User impact is most important

  30. #3. Focus on symptom-based alerts

  31. Cause-based alerts are also necessary

  32. Picking alerts to start with
    [Diagram: Front-end, Load Balancer, Back-end, DB]
    ● Count rate of successful log-in
    ● Count request success rate

  33. Logging system
    [Diagram: Application (×3), Log Collector (×3), Log Aggregation Database, Dashboard]
    Log collector examples: Logstash, Fluentd

  34. Logging system
    [Diagram: Application (×3), Log Collector (×3), Log Aggregation Database, Dashboard]
    Log aggregation database examples: Elasticsearch, Loki

  35. Logging system
    [Diagram: Application (×3), Log Collector (×3), Log Aggregation Database, Dashboard]
    Dashboard example: Kibana

  36. Log messages

  37. Finding logs
    Can search by:
    ● content of log message
      message : *notification*
    ● all logs from a service
      kubernetes.labels.app/name.keyword : "api-gateway"
    ● many more, thanks to the flexible query schema
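
The two searches above can be mimicked locally to show what they select. This is only an in-memory sketch over hypothetical records, not a call to any log database; the field name `app` stands in for the longer Kubernetes label.

```python
from fnmatch import fnmatchcase

# Hypothetical log records, shaped like the fields used in the queries above.
logs = [
    {"message": "sent notification to user", "app": "api-gateway"},
    {"message": "cache miss", "app": "api-gateway"},
    {"message": "notification queue full", "app": "mailer"},
]


def by_message(records, pattern):
    """Wildcard match on the message text, like message : *notification*."""
    return [r for r in records if fnmatchcase(r["message"], pattern)]


def by_app(records, name):
    """Exact match on the service name."""
    return [r for r in records if r["app"] == name]


notification_logs = by_message(logs, "*notification*")
gateway_logs = by_app(logs, "api-gateway")
```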

  38. What to watch out for?

  39. #1. Use the appropriate log level: info, warn, error
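
A small sketch of the three levels using Python's standard `logging` module. The logger name `payments` and the messages are hypothetical; output is captured in memory only so the example is self-contained.

```python
import io
import logging

stream = io.StringIO()  # capture output in memory instead of the console
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("payments")  # hypothetical service name
logger.setLevel(logging.INFO)           # anything below INFO is dropped
logger.addHandler(handler)
logger.propagate = False

logger.debug("cache lookup took 3 ms")            # dropped: below INFO
logger.info("payment accepted")                   # normal operation
logger.warning("retrying after gateway timeout")  # recoverable problem
logger.error("payment failed after 3 retries")    # needs attention

lines = stream.getvalue().splitlines()
```

With consistent levels, a search for errors returns only the lines that need attention instead of every routine message.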

  40. Structured logging
    ● Append useful key=value pairs
    ● Can group (aggregate) by the keys
    ● Can sort by aggregations
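
The three points above can be sketched in a few lines: append key=value pairs when writing, then parse them back to group and sort. The helper names and fields are illustrative assumptions, and the simple parser assumes values contain no spaces.

```python
from collections import Counter


def format_log(message, **fields):
    """Append key=value pairs to a message."""
    pairs = " ".join(f"{k}={v}" for k, v in fields.items())
    return f"{message} {pairs}".strip()


def parse_fields(line):
    """Recover the key=value pairs from a formatted line."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


# Hypothetical events; the field names are illustrative.
lines = [
    format_log("search failed", user_type="Patient", status=500),
    format_log("search failed", user_type="Exam", status=500),
    format_log("search failed", user_type="Patient", status=503),
]

# Group by a key, then sort by the aggregated count.
by_user_type = Counter(parse_fields(line)["user_type"] for line in lines)
ranked = by_user_type.most_common()
```

In practice the log database does the grouping and sorting; the point is that it can only do so when the fields are present in a predictable form.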

  41. #2. Use structured logging

  43. Too many logs
    [Diagram: Application (×3), Log Scraper (×3), Log Aggregation, Real Time Search Engine, Dashboard]
    ● Reduce log retention period

  44. Too many logs
    [Diagram: Application (×3), Log Scraper (×3), Log Aggregation, Real Time Search Engine, Dashboard, Cold Storage, Query UI]

  45. #3. Use a proper retention period or cold storage
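
A retention policy can be thought of as a rule that maps a log's age to where it lives. The sketch below assumes a 14-day searchable window and a one-year cold-storage window; both numbers and the tier names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed policy: searchable for 14 days, then cold storage until 365 days.
HOT_DAYS = 14
COLD_DAYS = 365


def storage_tier(log_time, now):
    """Decide where a log record of a given age should live."""
    age = now - log_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"   # kept in the real-time search engine
    if age <= timedelta(days=COLD_DAYS):
        return "cold"  # moved to cheaper storage, queried on demand
    return "delete"


now = datetime(2021, 8, 14, tzinfo=timezone.utc)
tiers = [storage_tier(now - timedelta(days=d), now) for d in (1, 30, 400)]
```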

  46. Distributed tracing
    https://www.youtube.com/watch?v=rM1z7Q1TxR0

  50. End-to-end summary
    1. Configure automated alerts
    2. Use metrics and tracing to pinpoint the problem
    3. Use structured logging to find the root cause of the problem easily
    4. Fix problems and make sure all metrics are always back to normal

  51. Thank you! Q&A
    Nikolay Stoitsev
    Engineering Manager at Halo DX
    Photo by Pixabay, Şahin Sezer Dinçer, Andrea Piacquadio, Ian Beckley from Pexels
