Move over Graphite, Prometheus is here - php[tek]

We all agree metrics are important, and Graphite’s a great tool for capturing them. However, in the last few years lots of great tools have appeared in the metrics space that blow Graphite out of the water. One of them is Prometheus, from SoundCloud. Prometheus allows you to query any dimension of your data while still storing it in a highly efficient format.

Together, we’ll take a look at how to get started with Prometheus, including how to create dashboards with Grafana and alerts using AlertManager. By the time you leave, you’ll understand how Prometheus works and will be itching to add it to your projects!


Michael Heap

May 31, 2018

Transcript

  1. #phptek @mheap Move over Graphite Prometheus is here

  2. #phptek @mheap Metrics?

  3. #phptek @mheap How much disk space are we using? What’s the average CPU utilisation?
  4. #phptek @mheap How many 500 errors in the last 5 minutes? Is our data processing rate better, worse, or the same as this time last week? How many concurrent users do we have?
  5. #phptek @mheap How many active phone calls are there? What’s the average call duration? How many calls have there been to 441234567890 today?
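
Questions like the ones on these slides map directly onto PromQL queries once metrics carry labels. A hedged sketch (the metric names `http_requests_total` and `active_calls` are illustrative, not from the talk):

```
# 500 errors per second, averaged over the last 5 minutes
rate(http_requests_total{status="500"}[5m])

# how many concurrent calls are active right now (assuming a gauge)
active_calls
```
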
  6. #phptek @mheap

  7. #phptek @mheap

  8. #phptek @mheap call.ringing

  9. #phptek @mheap call.447700900000.ringing

  10. #phptek @mheap ded2585.call.447700900000.ringing

  11. #phptek @mheap ded2585.call.447700900000.ringing ded2585.call.447700900000.answered ded2585.call.447700900000.complete

  12. #phptek @mheap ded2585.call.*.placed

  13. #phptek @mheap *.call.*.placed

  14. #phptek @mheap *.call.447700900000.placed
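
Graphite answers the wildcard queries on slides 12–14 by pattern-matching dotted metric names, where `*` matches exactly one dotted segment. A minimal Python sketch of that matching rule (the names come from the slides; the `graphite_match` helper is my own illustration, not Graphite’s API):

```python
from fnmatch import fnmatch

# Graphite-style dotted metric names, as shown on the slides
names = [
    "ded2585.call.447700900000.ringing",
    "ded2585.call.447700900000.answered",
    "ded2585.call.447700900000.placed",
    "web01.call.447700900001.placed",
]

def graphite_match(pattern: str, name: str) -> bool:
    # Match segment by segment so each "*" spans exactly one dotted
    # segment, mirroring Graphite's wildcard behaviour.
    parts_p = pattern.split(".")
    parts_n = name.split(".")
    if len(parts_p) != len(parts_n):
        return False
    return all(fnmatch(n, p) for p, n in zip(parts_p, parts_n))

# "*.call.*.placed" matches placed calls on every host
print([n for n in names if graphite_match("*.call.*.placed", n)])
```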

  15. #phptek @mheap

  16. #phptek @mheap

  17. #phptek @mheap

  18. #phptek @mheap

  19. #phptek @mheap Hello, I’m Michael

  20. #phptek @mheap @mheap

  21. #phptek @mheap

  22. #phptek @mheap

  23. #phptek @mheap Open Source

  24. #phptek @mheap Mostly Go (A little Ruby)

  25. #phptek @mheap Cloud Native Computing Foundation

  26. #phptek @mheap

  27. #phptek @mheap

  28. #phptek @mheap How does it work?

  29. #phptek @mheap Pulls metrics

  30. #phptek @mheap Disk storage

  31. #phptek @mheap Efficient collection

  32. #phptek @mheap Consul integration

  33. #phptek @mheap Prometheus.yml
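
The `prometheus.yml` file on slide 33 is what tells Prometheus which targets to pull metrics from. A minimal sketch (the job name and target address are illustrative):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']   # e.g. a node_exporter instance
```
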

  34. #phptek @mheap # HELP node_filesystem_free_bytes Filesystem free space in bytes. # TYPE node_filesystem_free_bytes gauge node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10 node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11 node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11
  35. #phptek @mheap # HELP node_filesystem_free_bytes Filesystem free space in bytes.

  36. #phptek @mheap node_filesystem_free_bytes{ device="/dev/disk1s1", fstype="apfs", mountpoint="/Volumes/Macintosh HD" } 4.9138515968e+10

  37. #phptek @mheap node_filesystem_free_bytes{ device="/dev/disk1s1", fstype="apfs", mountpoint="/Volumes/Macintosh HD" } 4.9138515968e+10
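
Each line of the exposition format on this slide is a metric name, an optional set of labels, and a sample value. A simplified Python sketch of pulling one such line apart (it ignores timestamps and label-value escaping, which the real format also allows):

```python
import re

# One sample line from the node_exporter output shown on the slide
line = ('node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",'
        'mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10')

def parse_sample(line: str):
    # metric_name{label="value",...} sample_value
    m = re.match(r'(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                 r'(\{(?P<labels>.*)\})?\s+(?P<value>\S+)$', line)
    name = m.group("name")
    labels = dict(re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"',
                             m.group("labels") or ""))
    return name, labels, float(m.group("value"))

name, labels, value = parse_sample(line)
print(name, labels["mountpoint"], value)
```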

  38. #phptek @mheap node_filesystem_free_bytes billing_notifications_total process_cpu_seconds_total http_request_duration_seconds

  39. #phptek @mheap node_filesystem_free_bytes billing_notifications_total process_cpu_seconds_total http_request_duration_seconds

  40. #phptek @mheap node_filesystem_free_bytes billing_notifications_total process_cpu_seconds_total http_request_duration_seconds

  41. #phptek @mheap node_filesystem_free_bytes billing_notifications_total process_cpu_seconds_total http_request_duration_seconds

  42. #phptek @mheap node_filesystem_free_bytes billing_notifications_total process_cpu_seconds_total http_request_duration_seconds

  43. #phptek @mheap node_filesystem_free_bytes billing_notifications_total process_cpu_seconds_total http_request_duration_seconds

  44. #phptek @mheap node_filesystem_free_bytes billing_notifications_total process_cpu_seconds_total http_request_duration_seconds

  45. #phptek @mheap node_filesystem_free_bytes{ device="/dev/disk1s1", fstype="apfs", mountpoint="/Volumes/Macintosh HD" } 4.9138515968e+10

  46. #phptek @mheap node_filesystem_free_bytes{ device="/dev/disk1s1", fstype="apfs", mountpoint="/Volumes/Macintosh HD" } 4.9138515968e+10

  47. #phptek @mheap node_filesystem_free_bytes{ device="/dev/disk1s1", fstype="apfs", mountpoint="/Volumes/Macintosh HD" } 4.9138515968e+10

  48. #phptek @mheap # HELP go_gc_duration_seconds A summary of the GC

    invocation durations. # TYPE go_gc_duration_seconds summary go_gc_duration_seconds{quantile="0"} 0 go_gc_duration_seconds{quantile="0.25"} 0 go_gc_duration_seconds{quantile="0.5"} 0 go_gc_duration_seconds{quantile="0.75"} 0 go_gc_duration_seconds{quantile="1"} 0 go_gc_duration_seconds_sum 0 go_gc_duration_seconds_count 0 # HELP go_goroutines Number of goroutines that currently exist. # TYPE go_goroutines gauge go_goroutines 6 # HELP go_info Information about the Go environment. # TYPE go_info gauge go_info{version="go1.10"} 1 # HELP go_memstats_alloc_bytes Number of bytes allocated and still in use. # TYPE go_memstats_alloc_bytes gauge go_memstats_alloc_bytes 827952 # HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed. # TYPE go_memstats_alloc_bytes_total counter go_memstats_alloc_bytes_total 827952 # HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table. # TYPE go_memstats_buck_hash_sys_bytes gauge go_memstats_buck_hash_sys_bytes 1.443286e+06 # HELP go_memstats_frees_total Total number of frees. # TYPE go_memstats_frees_total counter go_memstats_frees_total 243 # HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started. # TYPE go_memstats_gc_cpu_fraction gauge go_memstats_gc_cpu_fraction 0 # HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata. # TYPE go_memstats_gc_sys_bytes gauge go_memstats_gc_sys_bytes 169984 # HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use. # TYPE go_memstats_heap_alloc_bytes gauge go_memstats_heap_alloc_bytes 827952 # HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used. # TYPE go_memstats_heap_idle_bytes gauge go_memstats_heap_idle_bytes 761856 # HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use. 
# TYPE go_memstats_heap_inuse_bytes gauge go_memstats_heap_inuse_bytes 1.990656e+06 # HELP go_memstats_heap_objects Number of allocated objects. # TYPE go_memstats_heap_objects gauge go_memstats_heap_objects 7710 # HELP go_memstats_heap_released_bytes Number of heap bytes released to OS. # TYPE go_memstats_heap_released_bytes gauge go_memstats_heap_released_bytes 0 # HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system. # TYPE go_memstats_heap_sys_bytes gauge go_memstats_heap_sys_bytes 2.752512e+06 # HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection. # TYPE go_memstats_last_gc_time_seconds gauge go_memstats_last_gc_time_seconds 0 # HELP go_memstats_lookups_total Total number of pointer lookups. # TYPE go_memstats_lookups_total counter go_memstats_lookups_total 5 # HELP go_memstats_mallocs_total Total number of mallocs. # TYPE go_memstats_mallocs_total counter go_memstats_mallocs_total 7953 # HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures. # TYPE go_memstats_mcache_inuse_bytes gauge go_memstats_mcache_inuse_bytes 6944 # HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system. # TYPE go_memstats_mcache_sys_bytes gauge go_memstats_mcache_sys_bytes 16384 # HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures. # TYPE go_memstats_mspan_inuse_bytes gauge go_memstats_mspan_inuse_bytes 30096 # HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system. # TYPE go_memstats_mspan_sys_bytes gauge go_memstats_mspan_sys_bytes 32768 # HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place. # TYPE go_memstats_next_gc_bytes gauge go_memstats_next_gc_bytes 4.473924e+06 # HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations. 
# TYPE go_memstats_other_sys_bytes gauge go_memstats_other_sys_bytes 1.059618e+06 # HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator. # TYPE go_memstats_stack_inuse_bytes gauge go_memstats_stack_inuse_bytes 393216 # HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator. # TYPE go_memstats_stack_sys_bytes gauge go_memstats_stack_sys_bytes 393216 # HELP go_memstats_sys_bytes Number of bytes obtained from system. # TYPE go_memstats_sys_bytes gauge go_memstats_sys_bytes 5.867768e+06 # HELP go_threads Number of OS threads created. # TYPE go_threads gauge go_threads 7 # HELP node_cpu_seconds_total Seconds the cpus spent in each mode. # TYPE node_cpu_seconds_total counter node_cpu_seconds_total{cpu="0",mode="idle"} 537140.75 node_cpu_seconds_total{cpu="0",mode="nice"} 0 node_cpu_seconds_total{cpu="0",mode="system"} 202810.13 node_cpu_seconds_total{cpu="0",mode="user"} 236956.35 node_cpu_seconds_total{cpu="1",mode="idle"} 789924.55 node_cpu_seconds_total{cpu="1",mode="nice"} 0 node_cpu_seconds_total{cpu="1",mode="system"} 76430.46 node_cpu_seconds_total{cpu="1",mode="user"} 110379.86 node_cpu_seconds_total{cpu="2",mode="idle"} 521434.82 node_cpu_seconds_total{cpu="2",mode="nice"} 0 node_cpu_seconds_total{cpu="2",mode="system"} 206715.68 node_cpu_seconds_total{cpu="2",mode="user"} 248584.66 node_cpu_seconds_total{cpu="3",mode="idle"} 788754.35 node_cpu_seconds_total{cpu="3",mode="nice"} 0 node_cpu_seconds_total{cpu="3",mode="system"} 76188.77 node_cpu_seconds_total{cpu="3",mode="user"} 111791.47 # HELP node_disk_read_bytes_total The total number of bytes read successfully. # TYPE node_disk_read_bytes_total counter node_disk_read_bytes_total{device="disk0"} 6.22708862976e+11 node_disk_read_bytes_total{device="disk3"} 1.12842752e+08 # HELP node_disk_read_seconds_total The total number of seconds spent by all reads. 
# TYPE node_disk_read_seconds_total counter node_disk_read_seconds_total{device="disk0"} 22165.627411002 node_disk_read_seconds_total{device="disk3"} 67.88703918 # HELP node_disk_read_sectors_total The total number of sectors read successfully. # TYPE node_disk_read_sectors_total counter node_disk_read_sectors_total{device="disk0"} 4327.06494140625 node_disk_read_sectors_total{device="disk3"} 7.34765625 # HELP node_disk_reads_completed_total The total number of reads completed successfully. # TYPE node_disk_reads_completed_total counter node_disk_reads_completed_total{device="disk0"} 1.7723658e+07 node_disk_reads_completed_total{device="disk3"} 3762 # HELP node_disk_write_seconds_total This is the total number of seconds spent by all writes. # TYPE node_disk_write_seconds_total counter node_disk_write_seconds_total{device="disk0"} 8632.255762983 node_disk_write_seconds_total{device="disk3"} 0 # HELP node_disk_writes_completed_total The total number of writes completed successfully. # TYPE node_disk_writes_completed_total counter node_disk_writes_completed_total{device="disk0"} 1.9779856e+07 node_disk_writes_completed_total{device="disk3"} 0 # HELP node_disk_written_bytes_total The total number of bytes written successfully. # TYPE node_disk_written_bytes_total counter node_disk_written_bytes_total{device="disk0"} 6.94838308864e+11 node_disk_written_bytes_total{device="disk3"} 0 # HELP node_disk_written_sectors_total The total number of sectors written successfully. # TYPE node_disk_written_sectors_total counter node_disk_written_sectors_total{device="disk0"} 4829.06640625 node_disk_written_sectors_total{device="disk3"} 0 # HELP node_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which node_exporter was built. 
# TYPE node_exporter_build_info gauge node_exporter_build_info{branch="HEAD",goversion="go1.10",revision="002c1ca02917406cbecc457162e2bdb1f29c2f49",version="0.16.0-rc.0"} 1 # HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes. # TYPE node_filesystem_avail_bytes gauge node_filesystem_avail_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.7416078336e+10 node_filesystem_avail_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.58532878336e+11 node_filesystem_avail_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 3.58565429248e+11 node_filesystem_avail_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07 node_filesystem_avail_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_avail_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device. # TYPE node_filesystem_device_error gauge node_filesystem_device_error{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0 node_filesystem_device_error{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0 node_filesystem_device_error{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0 node_filesystem_device_error{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 0 node_filesystem_device_error{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_device_error{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_filesystem_files Filesystem total file nodes. 
# TYPE node_filesystem_files gauge node_filesystem_files{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854776e+18 node_filesystem_files{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036854776e+18 node_filesystem_files{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18 node_filesystem_files{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294967279e+09 node_filesystem_files{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_files{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_filesystem_files_free Filesystem total free file nodes. # TYPE node_filesystem_files_free gauge node_filesystem_files_free{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854352e+18 node_filesystem_files_free{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036853541e+18 node_filesystem_files_free{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18 node_filesystem_files_free{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294964965e+09 node_filesystem_files_free{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_files_free{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_filesystem_free_bytes Filesystem free space in bytes. 
# TYPE node_filesystem_free_bytes gauge node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10 node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11 node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11 node_filesystem_free_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07 node_filesystem_free_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_free_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_filesystem_readonly Filesystem read-only status. # TYPE node_filesystem_readonly gauge node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0 node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0 node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0 node_filesystem_readonly{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1 node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_filesystem_size_bytes Filesystem size in bytes. 
# TYPE node_filesystem_size_bytes gauge node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10 node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11 node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11 node_filesystem_size_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1.3418496e+08 node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_load1 1m load average. # TYPE node_load1 gauge node_load1 2.451171875 # HELP node_load15 15m load average. # TYPE node_load15 gauge node_load15 2.7646484375 # HELP node_load5 5m load average. # TYPE node_load5 gauge node_load5 2.6083984375 # HELP node_memory_active_bytes_total Memory information field active_bytes_total. # TYPE node_memory_active_bytes_total gauge node_memory_active_bytes_total 3.251331072e+09 # HELP node_memory_bytes_total Memory information field bytes_total. # TYPE node_memory_bytes_total gauge node_memory_bytes_total 1.7179869184e+10 # HELP node_memory_free_bytes_total Memory information field free_bytes_total. # TYPE node_memory_free_bytes_total gauge node_memory_free_bytes_total 5.61926144e+08 # HELP node_memory_inactive_bytes_total Memory information field inactive_bytes_total. # TYPE node_memory_inactive_bytes_total gauge node_memory_inactive_bytes_total 3.997949952e+09 # HELP node_memory_swapped_in_pages_total Memory information field swapped_in_pages_total. # TYPE node_memory_swapped_in_pages_total gauge node_memory_swapped_in_pages_total 2.51926528e+09 # HELP node_memory_swapped_out_pages_total Memory information field swapped_out_pages_total. 
# TYPE node_memory_swapped_out_pages_total gauge node_memory_swapped_out_pages_total 3.131211776e+09 # HELP node_memory_wired_bytes_total Memory information field wired_bytes_total. # TYPE node_memory_wired_bytes_total gauge node_memory_wired_bytes_total 3.211726848e+09 # HELP node_network_receive_bytes_total Network device statistic receive_bytes. # TYPE node_network_receive_bytes_total counter node_network_receive_bytes_total{device="XHC0"} 0 node_network_receive_bytes_total{device="XHC1"} 0 node_network_receive_bytes_total{device="XHC20"} 0 node_network_receive_bytes_total{device="awdl0"} 5120 node_network_receive_bytes_total{device="bridge0"} 0 node_network_receive_bytes_total{device="en0"} 1.214772224e+09 node_network_receive_bytes_total{device="en1"} 0 node_network_receive_bytes_total{device="en2"} 0 node_network_receive_bytes_total{device="en3"} 0 node_network_receive_bytes_total{device="en4"} 0 node_network_receive_bytes_total{device="en5"} 1.000448e+06 node_network_receive_bytes_total{device="gif0"} 0 node_network_receive_bytes_total{device="lo0"} 2.01657344e+09 node_network_receive_bytes_total{device="p2p0"} 0 node_network_receive_bytes_total{device="stf0"} 0 node_network_receive_bytes_total{device="utun0"} 0 node_network_receive_bytes_total{device="utun1"} 505856 node_network_receive_bytes_total{device="utun2"} 23552 node_network_receive_bytes_total{device="utun3"} 46080 node_network_receive_bytes_total{device="utun4"} 0 node_network_receive_bytes_total{device="utun5"} 0 node_network_receive_bytes_total{device="utun6"} 0 node_network_receive_bytes_total{device="vboxnet0"} 1.631232e+06 # HELP node_network_receive_errs_total Network device statistic receive_errs. 
# TYPE node_network_receive_errs_total counter node_network_receive_errs_total{device="XHC0"} 0 node_network_receive_errs_total{device="XHC1"} 0 node_network_receive_errs_total{device="XHC20"} 0 node_network_receive_errs_total{device="awdl0"} 0 node_network_receive_errs_total{device="bridge0"} 0 node_network_receive_errs_total{device="en0"} 0 node_network_receive_errs_total{device="en1"} 0 node_network_receive_errs_total{device="en2"} 0 node_network_receive_errs_total{device="en3"} 0 node_network_receive_errs_total{device="en4"} 0 node_network_receive_errs_total{device="en5"} 0 node_network_receive_errs_total{device="gif0"} 0 node_network_receive_errs_total{device="lo0"} 0 node_network_receive_errs_total{device="p2p0"} 0 node_network_receive_errs_total{device="stf0"} 0 node_network_receive_errs_total{device="utun0"} 0 node_network_receive_errs_total{device="utun1"} 0 node_network_receive_errs_total{device="utun2"} 0 node_network_receive_errs_total{device="utun3"} 0 node_network_receive_errs_total{device="utun4"} 0 node_network_receive_errs_total{device="utun5"} 0 node_network_receive_errs_total{device="utun6"} 0 node_network_receive_errs_total{device="vboxnet0"} 0 # HELP node_network_receive_multicast_total Network device statistic receive_multicast. 
# TYPE node_network_receive_multicast_total counter node_network_receive_multicast_total{device="XHC0"} 0 node_network_receive_multicast_total{device="XHC1"} 0 node_network_receive_multicast_total{device="XHC20"} 0 node_network_receive_multicast_total{device="awdl0"} 33 node_network_receive_multicast_total{device="bridge0"} 0 node_network_receive_multicast_total{device="en0"} 5.331321e+06 node_network_receive_multicast_total{device="en1"} 0 node_network_receive_multicast_total{device="en2"} 0 node_network_receive_multicast_total{device="en3"} 0 node_network_receive_multicast_total{device="en4"} 0 node_network_receive_multicast_total{device="en5"} 4 node_network_receive_multicast_total{device="gif0"} 0 node_network_receive_multicast_total{device="lo0"} 266605 node_network_receive_multicast_total{device="p2p0"} 0 node_network_receive_multicast_total{device="stf0"} 0 node_network_receive_multicast_total{device="utun0"} 0 node_network_receive_multicast_total{device="utun1"} 0 node_network_receive_multicast_total{device="utun2"} 0 node_network_receive_multicast_total{device="utun3"} 0 node_network_receive_multicast_total{device="utun4"} 0 node_network_receive_multicast_total{device="utun5"} 0 node_network_receive_multicast_total{device="utun6"} 0 node_network_receive_multicast_total{device="vboxnet0"} 98 # HELP node_network_receive_packets_total Network device statistic receive_packets. 
# TYPE node_network_receive_packets_total counter node_network_receive_packets_total{device="XHC0"} 0 node_network_receive_packets_total{device="XHC1"} 0 node_network_receive_packets_total{device="XHC20"} 0 node_network_receive_packets_total{device="awdl0"} 42 node_network_receive_packets_total{device="bridge0"} 0 node_network_receive_packets_total{device="en0"} 5.6394197e+07 node_network_receive_packets_total{device="en1"} 0 node_network_receive_packets_total{device="en2"} 0 node_network_receive_packets_total{device="en3"} 0 node_network_receive_packets_total{device="en4"} 0 node_network_receive_packets_total{device="en5"} 4299 node_network_receive_packets_total{device="gif0"} 0 node_network_receive_packets_total{device="lo0"} 3.243677e+06 node_network_receive_packets_total{device="p2p0"} 0 node_network_receive_packets_total{device="stf0"} 0 node_network_receive_packets_total{device="utun0"} 0 node_network_receive_packets_total{device="utun1"} 3548 node_network_receive_packets_total{device="utun2"} 168 node_network_receive_packets_total{device="utun3"} 226 node_network_receive_packets_total{device="utun4"} 0 node_network_receive_packets_total{device="utun5"} 0 node_network_receive_packets_total{device="utun6"} 0 node_network_receive_packets_total{device="vboxnet0"} 1533 # HELP node_network_transmit_bytes_total Network device statistic transmit_bytes. 
# TYPE node_network_transmit_bytes_total counter node_network_transmit_bytes_total{device="XHC0"} 0 node_network_transmit_bytes_total{device="XHC1"} 0 node_network_transmit_bytes_total{device="XHC20"} 0 node_network_transmit_bytes_total{device="awdl0"} 1.50016e+06 node_network_transmit_bytes_total{device="bridge0"} 0 node_network_transmit_bytes_total{device="en0"} 2.575358976e+09 node_network_transmit_bytes_total{device="en1"} 0 node_network_transmit_bytes_total{device="en2"} 0 node_network_transmit_bytes_total{device="en3"} 0 node_network_transmit_bytes_total{device="en4"} 0 node_network_transmit_bytes_total{device="en5"} 483328 node_network_transmit_bytes_total{device="gif0"} 0 node_network_transmit_bytes_total{device="lo0"} 2.01657344e+09 node_network_transmit_bytes_total{device="p2p0"} 0 node_network_transmit_bytes_total{device="stf0"} 0 node_network_transmit_bytes_total{device="utun0"} 0 node_network_transmit_bytes_total{device="utun1"} 493568 node_network_transmit_bytes_total{device="utun2"} 23552 node_network_transmit_bytes_total{device="utun3"} 46080 node_network_transmit_bytes_total{device="utun4"} 0 node_network_transmit_bytes_total{device="utun5"} 0 node_network_transmit_bytes_total{device="utun6"} 0 node_network_transmit_bytes_total{device="vboxnet0"} 1.695744e+06 # HELP node_network_transmit_errs_total Network device statistic transmit_errs. 
# TYPE node_network_transmit_errs_total counter node_network_transmit_errs_total{device="XHC0"} 0 node_network_transmit_errs_total{device="XHC1"} 0 node_network_transmit_errs_total{device="XHC20"} 0 node_network_transmit_errs_total{device="awdl0"} 0 node_network_transmit_errs_total{device="bridge0"} 0 node_network_transmit_errs_total{device="en0"} 0 node_network_transmit_errs_total{device="en1"} 0 node_network_transmit_errs_total{device="en2"} 0 node_network_transmit_errs_total{device="en3"} 0 node_network_transmit_errs_total{device="en4"} 0 node_network_transmit_errs_total{device="en5"} 0 node_network_transmit_errs_total{device="gif0"} 0 node_network_transmit_errs_total{device="lo0"} 0 node_network_transmit_errs_total{device="p2p0"} 0 node_network_transmit_errs_total{device="stf0"} 0 node_network_transmit_errs_total{device="utun0"} 0 node_network_transmit_errs_total{device="utun1"} 0 node_network_transmit_errs_total{device="utun2"} 0 node_network_transmit_errs_total{device="utun3"} 0 node_network_transmit_errs_total{device="utun4"} 0 node_network_transmit_errs_total{device="utun5"} 0 node_network_transmit_errs_total{device="utun6"} 0 node_network_transmit_errs_total{device="vboxnet0"} 0 # HELP node_network_transmit_multicast_total Network device statistic transmit_multicast. 
# TYPE node_network_transmit_multicast_total counter node_network_transmit_multicast_total{device="XHC0"} 0 node_network_transmit_multicast_total{device="XHC1"} 0 node_network_transmit_multicast_total{device="XHC20"} 0 node_network_transmit_multicast_total{device="awdl0"} 0 node_network_transmit_multicast_total{device="bridge0"} 0 node_network_transmit_multicast_total{device="en0"} 0 node_network_transmit_multicast_total{device="en1"} 0 node_network_transmit_multicast_total{device="en2"} 0 node_network_transmit_multicast_total{device="en3"} 0 node_network_transmit_multicast_total{device="en4"} 0 node_network_transmit_multicast_total{device="en5"} 0 node_network_transmit_multicast_total{device="gif0"} 0 node_network_transmit_multicast_total{device="lo0"} 0 node_network_transmit_multicast_total{device="p2p0"} 0 node_network_transmit_multicast_total{device="stf0"} 0 node_network_transmit_multicast_total{device="utun0"} 0 node_network_transmit_multicast_total{device="utun1"} 0 node_network_transmit_multicast_total{device="utun2"} 0 node_network_transmit_multicast_total{device="utun3"} 0 node_network_transmit_multicast_total{device="utun4"} 0 node_network_transmit_multicast_total{device="utun5"} 0 node_network_transmit_multicast_total{device="utun6"} 0 node_network_transmit_multicast_total{device="vboxnet0"} 0 # HELP node_network_transmit_packets_total Network device statistic transmit_packets. 
# TYPE node_network_transmit_packets_total counter node_network_transmit_packets_total{device="XHC0"} 0 node_network_transmit_packets_total{device="XHC1"} 0 node_network_transmit_packets_total{device="XHC20"} 0 node_network_transmit_packets_total{device="awdl0"} 6691 node_network_transmit_packets_total{device="bridge0"} 1 node_network_transmit_packets_total{device="en0"} 3.2582836e+07 node_network_transmit_packets_total{device="en1"} 0 node_network_transmit_packets_total{device="en2"} 0 node_network_transmit_packets_total{device="en3"} 0 node_network_transmit_packets_total{device="en4"} 0 node_network_transmit_packets_total{device="en5"} 4145 node_network_transmit_packets_total{device="gif0"} 0 node_network_transmit_packets_total{device="lo0"} 3.243677e+06 node_network_transmit_packets_total{device="p2p0"} 0 node_network_transmit_packets_total{device="stf0"} 0 node_network_transmit_packets_total{device="utun0"} 2 node_network_transmit_packets_total{device="utun1"} 3236 node_network_transmit_packets_total{device="utun2"} 160 node_network_transmit_packets_total{device="utun3"} 223 node_network_transmit_packets_total{device="utun4"} 2 node_network_transmit_packets_total{device="utun5"} 2 node_network_transmit_packets_total{device="utun6"} 2 node_network_transmit_packets_total{device="vboxnet0"} 73766 # HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape. 
# TYPE node_scrape_collector_duration_seconds gauge node_scrape_collector_duration_seconds{collector="cpu"} 0.00013298 node_scrape_collector_duration_seconds{collector="diskstats"} 0.000803364 node_scrape_collector_duration_seconds{collector="filesystem"} 0.000119007 node_scrape_collector_duration_seconds{collector="loadavg"} 2.3448e-05 node_scrape_collector_duration_seconds{collector="meminfo"} 5.3036e-05 node_scrape_collector_duration_seconds{collector="netdev"} 0.000338404 node_scrape_collector_duration_seconds{collector="textfile"} 1.7727e-05 node_scrape_collector_duration_seconds{collector="time"} 2.8571e-05 # HELP node_scrape_collector_success node_exporter: Whether a collector succeeded. # TYPE node_scrape_collector_success gauge node_scrape_collector_success{collector="cpu"} 1 node_scrape_collector_success{collector="diskstats"} 1 node_scrape_collector_success{collector="filesystem"} 1 node_scrape_collector_success{collector="loadavg"} 1 node_scrape_collector_success{collector="meminfo"} 1 node_scrape_collector_success{collector="netdev"} 1 node_scrape_collector_success{collector="textfile"} 1 node_scrape_collector_success{collector="time"} 1 # HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise # TYPE node_textfile_scrape_error gauge node_textfile_scrape_error 0 # HELP node_time_seconds System time in seconds since epoch (1970). # TYPE node_time_seconds gauge node_time_seconds 1.5210412225783854e+09 # HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served. # TYPE promhttp_metric_handler_requests_in_flight gauge promhttp_metric_handler_requests_in_flight 1 # HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code. # TYPE promhttp_metric_handler_requests_total counter promhttp_metric_handler_requests_total{code="200"} 0 promhttp_metric_handler_requests_total{code="500"} 0 promhttp_metric_handler_requests_total{code="503"} 0
  49. #phptek @mheap # HELP node_filesystem_readonly Filesystem read-only status. # TYPE node_filesystem_readonly gauge node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0 node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0 node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0 node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_filesystem_size_bytes Filesystem size in bytes. # TYPE node_filesystem_size_bytes gauge node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10 node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11 node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11 node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0 node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0 # HELP node_load1 1m load average. # TYPE node_load1 gauge node_load1 2.451171875 # HELP node_load15 15m load average. # TYPE node_load15 gauge node_load15 2.7646484375
  50. #phptek @mheap Node Exporter?

  51. #phptek @mheap Exporter?

  52. #phptek @mheap Exposes metrics

  53. #phptek @mheap node_exporter

    arp: Exposes ARP statistics from /proc/net/arp.
    cpu: Exposes CPU statistics
    filesystem: Exposes filesystem statistics, such as disk space used.
    ipvs: Exposes IPVS status from /proc/net/ip_vs and stats from /proc/net/ip_vs_stats.
    netstat: Exposes network statistics from /proc/net/netstat. This is the same information as netstat -s.
    uname: Exposes system information as provided by the uname system call.
  54. #phptek @mheap mysqld_exporter

    perf_schema.tablelocks: Collect metrics from performance_schema.table_lock_waits_summary_by_table
    info_schema.processlist: Collect thread state counts from information_schema.processlist
    binlog_size: Collect the current size of all registered binlog files
    auto_increment.columns: Collect auto_increment columns and max values from information_schema
  55. #phptek @mheap haproxy_exporter

    current_queue: Current number of queued requests assigned to this server
    current_sessions: Current number of active sessions
    bytes_in_total: Current total of incoming bytes
    connection_errors_total: Total of connection errors
  56. #phptek @mheap memcached_exporter

    bytes_read: Total number of bytes read by this server from network
    connections_total: Total number of connections opened since the server started running
    items_evicted_total: Total number of valid items removed from cache to free memory for new items
    commands_total: Total number of all requests broken down by command (get, set, etc.) and status per slab
  57. #phptek @mheap Create your own metrics

  58. #phptek @mheap calls_received_total{ network="o2", number="447700900000", type="mobile" } 11

  64. #phptek @mheap Increment a counter Serve on /metrics

  65. #phptek @mheap

    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: nexmo_calls
        static_configs:
          - targets: ['localhost:3000']
  66. #phptek @mheap Increment a counter Serve on /metrics
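Those two steps can be sketched without any client library: keep a counter in memory and render it in the Prometheus text exposition format, ready to serve on /metrics for Prometheus to scrape. A minimal Python sketch (the `render_metrics` helper and its sample data are invented for illustration; in a real project you would use an official client library):

```python
# Minimal sketch: track a labelled counter and render it in the
# Prometheus text exposition format (the body served on /metrics).

def render_metrics(name, help_text, samples):
    """samples: dict mapping a tuple of (label, value) pairs to a count."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, count in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {count}")
    return "\n".join(lines) + "\n"

# Increment a counter...
counters = {}
key = (("network", "o2"), ("number", "447700900000"), ("type", "mobile"))
counters[key] = counters.get(key, 0) + 1

# ...and this is what Prometheus would scrape from localhost:3000/metrics.
print(render_metrics("calls_received_total", "Total calls received.", counters))
```

Serving this string from any HTTP endpoint is all an exporter fundamentally does.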

  67. #phptek @mheap Hard in PHP

  68. #phptek @mheap Pushgateway

    The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.
    https://github.com/prometheus/pushgateway
    https://github.com/Lazyshot/prometheus-php
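A push is just one HTTP request: the Pushgateway accepts the text exposition format on PUT or POST to /metrics/job/&lt;job&gt;, which is why it suits short-lived PHP requests. A hedged sketch in Python (the gateway address, job name, and metric body are made up; the send itself is commented out):

```python
import urllib.request

PUSHGATEWAY = "http://localhost:9091"  # assumed local Pushgateway


def build_push_request(job, body):
    """Build a PUT request that replaces all metrics for the given job."""
    url = f"{PUSHGATEWAY}/metrics/job/{job}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


body = 'calls_received_total{network="o2"} 11\n'
req = build_push_request("nexmo_calls", body)
# urllib.request.urlopen(req)  # uncomment once a Pushgateway is running
print(req.full_url)
```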
  69. #phptek @mheap Things to know

  70. #phptek @mheap > 5-10 labels is bad

  71. #phptek @mheap Secure /metrics

  72. #phptek @mheap Metric Types

  73. #phptek @mheap Counters calls_placed_total

  74. #phptek @mheap Gauges calls_active

  75. #phptek @mheap Histograms calls_duration

  76. #phptek @mheap Summaries calls_duration

  77. #phptek @mheap

    calls_duration_bucket{le="10",network="o2",number="447700900000",type="mobile"} 16
    calls_duration_bucket{le="30",network="o2",number="447700900000",type="mobile"} 63
    calls_duration_bucket{le="60",network="o2",number="447700900000",type="mobile"} 123
    calls_duration_bucket{le="120",network="o2",number="447700900000",type="mobile"} 253
    calls_duration_bucket{le="300",network="o2",number="447700900000",type="mobile"} 618
  78. #phptek @mheap

    calls_duration{quantile="0.5"} 85
    calls_duration{quantile="0.9"} 123
    calls_duration{quantile="0.99"} 221
    calls_duration_sum 13130
    calls_duration_count 6
  79. #phptek @mheap

    Counters: Use for counting events that happen (e.g. total number of requests) and query using rate()
    Gauges: Use to instrument the current state of a metric (e.g. memory usage, jobs in queue)
    Histograms: Use to sample observations in order to analyse distribution of a data set (e.g. request latency)
    Summaries: Use for pre-calculated quantiles on client side, but be mindful of calculation cost and aggregation limitations
  80. #phptek @mheap Show me graphs

  81. #phptek @mheap

  82. #phptek @mheap PromQL

  83. #phptek @mheap calls_placed_total

    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} 4
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"} 8
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"} 1
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447700900000",type="mobile"} 6
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447908249481",type="mobile"} 7
  84. #phptek @mheap calls_placed_total{number="441234567890"}

    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} 4

  85. #phptek @mheap calls_placed_total{number="441234567890"}[3m]

    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}
      3 @1521482766.23
      4 @1521482769.23
      12 @1521482772.229
      16 @1521482775.229
      21 @1521482778.23
      25 @1521482781.23
      27 @1521482784.229
      31 @1521482787.229
      35 @1521482790.229
  86. #phptek @mheap calls_placed_total{number="441234567890"}[3m] offset 1w

    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}
      2 @1521311766.23
      7 @1521311769.23
      18 @1523112772.229
      20 @1523112775.229
      27 @1523112778.23
      28 @1523112781.23
      30 @1523112784.229
      36 @1523112787.229
      39 @1523112790.229
  87. #phptek @mheap

    # Total number of calls regardless of any labels
    sum(calls_placed_total)

    # Total number of requests, broken down by the number label
    sum(calls_placed_total) by (number)

    # Total per-second rate over the last 5 minutes by number
    sum(rate(calls_placed_total[5m])) by (number)
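rate() is worth demystifying: it walks a counter's samples over the window, compensates for counter resets (a restart drops the value back towards zero), and divides the total increase by the time spanned. A simplified Python sketch (Prometheus additionally extrapolates to the window edges, which this ignores; the sample data is invented):

```python
def simple_rate(samples):
    """samples: list of (timestamp, value) pairs for one counter series.
    Returns per-second increase, correcting for counter resets."""
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 < v0:          # counter reset: the process restarted from 0
            increase += v1   # everything since the reset still counts
        else:
            increase += v1 - v0
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# 60 calls placed over 30 seconds -> 2 calls per second
print(simple_rate([(0, 0), (15, 30), (30, 60)]))  # 2.0

# A reset mid-window (60 -> 5) still yields the true increase
print(simple_rate([(0, 0), (15, 60), (30, 5)]))
```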
  88. #phptek @mheap sum(calls_placed_total{network="EE", type="mobile"})

  89. #phptek @mheap sum(calls_placed_total{network=~"E.*", type="mobile"})

  90. #phptek @mheap sum(calls_placed_total{network!="EE", type="mobile"})

  91. #phptek @mheap rate(calls_duration_sum{network="EE"}[5m]) / rate(calls_duration_count{network="EE"}[5m])

  92. #phptek @mheap histogram_quantile( 0.95, calls_duration_bucket{number=~"[[number]]"} )
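histogram_quantile() estimates a quantile from cumulative buckets: find the bucket the target rank lands in, then interpolate linearly inside it. A simplified re-implementation in Python (the bucket values are invented; real histograms always end with a le="+Inf" bucket, whose lower bound is returned when the rank falls there):

```python
import math


def histogram_quantile(q, buckets):
    """buckets: sorted list of (le, cumulative_count), ending with (inf, total).
    Estimates the q-quantile by linear interpolation within the bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if rank <= count:
            if math.isinf(le):
                return prev_le  # can't interpolate into the +Inf bucket
            in_bucket = count - prev_count
            fraction = (rank - prev_count) / in_bucket if in_bucket else 0.0
            return prev_le + fraction * (le - prev_le)
        prev_le, prev_count = le, count


# Invented call-duration buckets: 100 observations, cumulative per upper bound
buckets = [(10, 16), (30, 63), (60, 85), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))
```

This is why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile falls in.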

  93. #phptek @mheap sum without (duration) (rate(calls_placed_total{number=~"[[number]]"} [3m]))

  94. #phptek @mheap predict_linear(calls_active[1h], 86400)
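predict_linear() fits a least-squares line through the samples in the range and extrapolates it the given number of seconds ahead, which is how the disk-full alert later in the deck works. A sketch in Python (the sample data is invented; PromQL extrapolates from the query evaluation time, this sketch uses the last sample's timestamp):

```python
def predict_linear(samples, seconds):
    """samples: list of (timestamp, value) pairs.
    Least-squares fit, then extrapolate `seconds` past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds) + intercept


# A gauge climbing by 1 per second: one day out we predict 86400 more
samples = [(0, 0), (60, 60), (120, 120)]
print(predict_linear(samples, 86400))  # 86520.0
```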

  95. #phptek @mheap Grafana

  96. None
  97. None
  98. None
  99. #phptek @mheap Gauges calls_active

  100. #phptek @mheap

  101. #phptek @mheap Counters calls_placed_total

  102. #phptek @mheap

  103. #phptek @mheap

  104. #phptek @mheap Histograms calls_duration

  105. #phptek @mheap

  106. #phptek @mheap

  107. #phptek @mheap

  108. #phptek @mheap Version 5

  109. #phptek @mheap Alertmanager

  110. #phptek @mheap Alertmanager rules

  111. #phptek @mheap

    alert: CallsMonitorDown
    expr: up{job="nexmo_calls"} == 0
    for: 5m
    labels:
      severity: critical
  112. #phptek @mheap

    alert: LotsOfJobsInQueue
    expr: sum(jobs_in_queue) > 100
    for: 5m
    labels:
      severity: major
  113. #phptek @mheap

    alert: DiskFullInFourHours
    expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: major
  114. #phptek @mheap

    alert: HighCallsBeingPlacedOnLandline
    expr: rate(calls_placed_total{network=~".*", type="landline"}[1m]) > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Unusually high call count on {{ $labels.network }}'
      summary: 'High call count on {{ $labels.network }}'
  115. #phptek @mheap Alertmanager alerts

  116. #phptek @mheap

    [ smtp_from: <tmpl_string> ]
    [ slack_api_url: <string> ]
    [ victorops_api_key: <string> ]
    [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
    [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
    [ opsgenie_api_key: <string> ]
    [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
    [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
    [ hipchat_auth_token: <secret> ]
  117. #phptek @mheap Alertmanager routes

  118. #phptek @mheap

    route:
      receiver: 'default-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [cluster, alertname]
      routes:
        - receiver: 'database-pager'
          group_wait: 10s
          match_re:
            service: mysql|cassandra
        - receiver: 'frontend-pager'
          group_by: [product, environment]
          match:
            team: frontend
  122. #phptek @mheap

    receivers:
      - name: 'team-X-mails'
        email_configs:
          - to: 'team-X+alerts@example.org'
      - name: 'team-X-pager'
        email_configs:
          - to: 'team-X+critical@example.org'
        pagerduty_configs:
          - routing_key: <team-X-key>
      - name: 'team-Y-mails'
        email_configs:
          - to: 'team-Y+alerts@example.org'
      - name: 'team-Y-pager'
        pagerduty_configs:
          - routing_key: <team-Y-key>
      - name: 'team-DB-pager'
        pagerduty_configs:
          - routing_key: <team-DB-key>
  123. #phptek @mheap Alertmanager inhibits

  124. #phptek @mheap

    ! Database is down
    ! User login failure > 100
    ! Report generation failure > 15
    ! GET /healthcheck returned 500
  128. #phptek @mheap The DB is the root cause

  129. #phptek @mheap

    inhibit_rules:
      - source_match:
          alertname: 'DatabaseDown'
        target_match:
          alertname: 'UserLoginFailure'
        equal: ['instance']
  130. #phptek @mheap

  131. #phptek @mheap Ingester Grouper Deduplicator Silencer Throttler Notifier


  138. #phptek @mheap Does it scale?

  139. #phptek @mheap Yes.

  140. #phptek @mheap

    4.6M time series per server
    72k samples ingested per second, per server
    185 production Prometheus servers
  141. #phptek @mheap Prometheus federates

  142. #phptek @mheap Alertmanager gossips

  143. #phptek @mheap So that’s Prometheus

  144. #phptek @mheap So that’s Prometheus (and PromQL, Grafana and Alertmanager)

  145. #phptek @mheap @MHEAP M@MICHAELHEAP.COM HTTPS://JOIND.IN/TALK/845A7