Effective Infrastructure Monitoring with Grafana

David
September 20, 2019

In this talk David will show Grafana's advanced features to manage a fleet of Linux hosts. He will also show relevant metrics from node exporter and how they can be turned into alerts.

Transcript

  1. Effective infrastructure monitoring with Grafana

    David Kaltschmidt, @davkals
    All Systems Go! Sept. 2019 #AllSystemsGo
  2. I’m David

    Working on Explore, Prometheus, and Loki at Grafana Labs
    Previously: Unifying metrics/logs/traces at Kausal, work on WeaveScope
    david@grafana.com
    Twitter: @davkals
  3. Monitoring at Grafana Labs

  4. Monitoring at Grafana Labs

    K8s on GKE
    Prometheus for metrics
    Loki for logs
    Jaeger for distributed tracing
    Monitoring mixins
  5. Monitoring by alerting Photo by Randy Tarampi on Unsplash

  6. Monitoring by alerting

    Photo by Randy Tarampi on Unsplash
    Alerts are part of monitoring mixins
    Ideally linked to a runbook
  7. Prometheus with Node Exporter

  8. Node exporter

    arp bcache bonding boottime conntrack cpu diskstats entropy filesystem ipvs loadavg meminfo netclass netstat nfs pressure sockstat stat textfile time uname vmstat xfs zfs
  9. Node exporter

    arp bcache bonding boottime conntrack cpu diskstats entropy filesystem ipvs loadavg meminfo netclass netstat nfs pressure sockstat stat textfile time uname vmstat xfs zfs
    Kimchi!!! https://gitlab.com/bjk-gitlab
  10. Mapping of Linux observability tools

  11. System chart

    Applications, System Libraries, System Call Interface, VFS, Sockets, Scheduler, File Systems, TCP/UDP, Volume Manager, IP, Block Device Interface, Ethernet, Virtual Memory, Device Drivers, I/O Bridge, I/O Controller, Disk, Network Controller, Swap, Port, Port, CPU, DRAM, CPU
  12. System chart with node_exporter metrics: node_….

    [System chart, as on slide 11]
    Metric groups overlaid on the chart: up, vmstat, filesystem, xfs, zfs, drbd, diskstats, mdadm, bcache, conntrack, arp, netclass, netdev, wifi, infiniband, bonding, netstat
    Hardware: thermal_zone, edac, entropy, hwmon, timex
  13. CPU utilisation: seconds spent on the CPU per second

    [System chart, as on slide 11]
    (1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100
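
The arithmetic behind the slide 13 query can be sketched in a few lines of Python. This is an illustration only, not how Prometheus evaluates the expression: the real `rate()` also handles counter resets and extrapolation, and the function name and sample values below are hypothetical.

```python
def cpu_utilisation_percent(idle_start, idle_end, interval_seconds):
    """Utilisation of one host from per-CPU idle counters sampled twice.

    idle_start / idle_end map a CPU name to its cumulative idle seconds,
    like node_cpu_seconds_total{mode="idle"} at two points in time.
    """
    # rate(...): per-second increase of each CPU's idle counter.
    idle_rates = [
        (idle_end[cpu] - idle_start[cpu]) / interval_seconds
        for cpu in idle_start
    ]
    # avg without (cpu, mode): average idle fraction across all CPUs.
    avg_idle = sum(idle_rates) / len(idle_rates)
    # (1 - idle) * 100: busy share as a percentage.
    return (1 - avg_idle) * 100


# Hypothetical samples taken 60 s apart on a two-CPU host.
start = {"0": 1000.0, "1": 2000.0}
end = {"0": 1045.0, "1": 2015.0}
utilisation = cpu_utilisation_percent(start, end, 60)
```

In this made-up sample CPU 0 was idle for 45 of the 60 seconds and CPU 1 for 15, so the average idle fraction is 0.5 and the host is 50% utilised.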
  14. CPU saturation: Load average / number of CPUs

    [System chart, as on slide 11]
    node_load1 / count without (cpu) (count without (mode) (node_cpu_seconds_total))
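
The nested `count` in the slide 14 query derives the number of CPUs from the labels of `node_cpu_seconds_total`. A minimal Python sketch of that idea follows, assuming each series carries `instance`, `cpu`, and `mode` labels; the helper names and the sample host are hypothetical.

```python
def cpus_per_instance(series):
    """Count CPUs per instance from node_cpu_seconds_total label sets."""
    cpus = {}
    for labels in series:
        # Ignoring "mode" leaves one entry per (instance, cpu) pair.
        cpus.setdefault(labels["instance"], set()).add(labels["cpu"])
    # Ignoring "cpu" as well leaves one count per instance.
    return {instance: len(names) for instance, names in cpus.items()}


def load_per_cpu(load1, series):
    """Divide each instance's 1-minute load average by its CPU count."""
    counts = cpus_per_instance(series)
    return {instance: load1[instance] / counts[instance] for instance in load1}


# Hypothetical host with two CPUs and three modes each (six series).
series = [
    {"instance": "host-a", "cpu": cpu, "mode": mode}
    for cpu in ("0", "1")
    for mode in ("idle", "user", "system")
]
saturation = load_per_cpu({"host-a": 3.0}, series)
```

A result above 1 for an instance means more runnable work than CPUs on average over the last minute.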
  15. CPU saturation: Track waiting time instead of number of processes

    [System chart, as on slide 11]
    node_load1
    node_pressure...
    Needs Linux (kernel 4.20+ and/or CONFIG_PSI)
  16. Memory utilisation and “saturation”

    [System chart, as on slide 11; highlighted: Virtual Memory, DRAM]
    1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
    rate(node_vmstat_pgpgin[1m]) + rate(node_vmstat_pgpgout[1m])
  17. Disk utilisation and disk IO queue length

    [System chart, as on slide 11; highlighted: I/O Controller, Disk, Swap]
    rate(node_disk_io_time_seconds_total[5m])
    iostat avgqu-sz equivalent: rate(node_disk_io_time_weighted_seconds_total[5m])
    Details: https://www.robustperception.io/mapping-iostat-to-the-node-exporters-node_disk_-metrics
  18. Available disk space and disk space alerts; multiple alerts with varying severity

    [System chart, as on slide 11; highlighted: VFS, File Systems, Volume Manager, Block Device Interface]
    Query:
    1 - (max by (device) (node_filesystem_avail_bytes) / max by (device) (node_filesystem_size_bytes))
    Alert:
    node_filesystem_avail_bytes/node_filesystem_size_bytes < 0.4
    and predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0
    and node_filesystem_readonly == 0
    for 1h
    severity: 'warning'
    Details: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
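
The `predict_linear` term in the slide 18 alert fits a straight line through recent samples and extrapolates it. The sketch below shows that idea with an ordinary least-squares fit in Python. It is a simplification: it extrapolates from the last sample rather than from the evaluation time as Prometheus does, and the filesystem size and sample values are hypothetical.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares extrapolation of (timestamp_seconds, value) samples.

    Returns the fitted value seconds_ahead after the last sample.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Slope of the best-fit line: covariance(t, v) / variance(t).
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


GIB = 2**30
size = 20 * GIB
# Hypothetical hourly samples over 6 h: free space falls from 10 GiB to 4 GiB.
samples = [(hour * 3600, (10 - hour) * GIB) for hour in range(7)]
predicted = predict_linear(samples, 24 * 60 * 60)
avail_now = samples[-1][1]
# First two conditions of the alert: under 40% free and predicted to hit zero.
would_match = avail_now / size < 0.4 and predicted < 0
```

With free space shrinking by 1 GiB per hour, the fitted line reaches about -20 GiB a day later, so both conditions hold. The slide's alert additionally requires a writable filesystem and that the expression stays true for one hour.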
  19. Network throughput and

    [System chart, as on slide 11; highlighted: Sockets, TCP/UDP, IP, Ethernet, Network Controller]
    rate(node_network_receive_bytes_total{device!="lo"}[5m])
    rate(node_network_transmit_bytes_total{device!="lo"}[5m])
    rate(node_network_receive_drop_total{device!="lo"}[5m])
    rate(node_network_transmit_drop_total{device!="lo"}[5m])
  20. K8s host connectivity issues: conntrack table limits

    [System chart, as on slide 11; highlighted: Sockets, TCP/UDP, IP, Ethernet]
    node_nf_conntrack_entries / node_nf_conntrack_entries_limit
  21. Let’s build a conntrack alert

  22. Collector gotchas

    - Some collectors produce lots of time-series (filesystem, firewalls, networking, systemd)
    - Some legacy collectors execute scripts or programs
    - Dropping via relabeling rules is tedious
    - Pro tip: run two node exporters, one with minimal and one with full configuration
    - 10x savings on number of time-series
    - Low overhead since everything is lazily loaded on scrape
    https://github.com/RichiH Entropy!!
  23. Bonus: Node exporter’s textfile collector Coffee!!! https://github.com/beorn7

  24. Textfile collector

    - Includes text files from a given directory to be part of the scrape
    - Make sure to write atomically
    - Don’t let node_exporter run scripts as root, use crontab as root to write output to text file instead
    # INFO Last time Ansible successfully ran
    ansible_last_run_timestamp 1568903175
    # INFO Last time backup successfully ran
    backup_last_run_timestamp 1568903175
    # INFO Track which features are enabled on host
    my_bare_metal_feature_enabled 1
    # INFO Track SSD wearout
    smartmon_media_wearout_indicator 95665
    https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
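
One common way to follow the "write atomically" advice on slide 24 is to write to a temporary file in the same directory and rename it into place, so a scrape never reads a half-written file. The Python sketch below assumes that approach. The function name and the demo directory are hypothetical, and it uses the `# HELP` comment form of the Prometheus text exposition format where the slide writes `# INFO`.

```python
import os
import tempfile
import time


def write_prom_file(directory, filename, metrics):
    """Atomically write metrics (name -> (help_text, value)) to directory/filename."""
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"{name} {value}")
    body = "\n".join(lines) + "\n"

    # Create the temp file in the target directory so the rename stays on
    # one filesystem; it does not end in ".prom", so it is not scraped.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(body)
        # mkstemp creates the file as 0600; make it readable for the exporter.
        os.chmod(tmp_path, 0o644)
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        os.unlink(tmp_path)
        raise


# Hypothetical usage: a backup job records its last successful run.
demo_dir = tempfile.mkdtemp()
write_prom_file(
    demo_dir,
    "backup.prom",
    {"backup_last_run_timestamp": ("Last time backup successfully ran", int(time.time()))},
)
```

A job run from root's crontab, as the slide suggests, could call such a function at the end of each run, with the directory set to whatever the node exporter's textfile collector is configured to read.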
  25. Fleet monitoring

  26. Example from dashboards.gitlab.com based on node_uname_info

  27. Deviations: Gamifying fleet management

  28. Fleet overview: Example from dashboards.gitlab.com

  29. Fleet overview: cluster vs. node

    Template query: label_values(node_exporter_build_info, instance)
    Node exporter mixin: https://github.com/prometheus/node_exporter/tree/master/docs/node-mixin
  30. Thank you! Questions? @davkals We're hiring remote/EU/US-east. https://grafana.com/about/careers/

  31. Log aggregation with Loki

  32. See your logs in Grafana

  33. Deploy Loki

    Bare metal options:
    - static service discovery
    - systemd journal support
    Example documentation: https://github.com/grafana/loki/blob/master/docs/promtail/examples.md