Move over Graphite, Prometheus is here - Longhorn PHP

We all agree metrics are important, and Graphite's a great tool for capturing them. In the last few years, though, lots of great tools have appeared in the metrics space that blow Graphite out of the water. One of them is Prometheus, originally built at SoundCloud. Prometheus lets you query your data along any dimension while still storing it in a highly efficient format.

Together, we’ll take a look at how to get started with Prometheus, including how to create dashboards with Grafana and alerts using AlertManager. By the time you leave, you’ll understand how Prometheus works and will be itching to add it to your projects!

Michael Heap

April 20, 2018

Transcript

  1. @mheap
    #longhornphp
    Move over Graphite
    Prometheus is here

  2. Metrics?

  3. How much disk space are we using?
    What’s the average CPU utilisation?

  4. How many 500 errors in the last 5 minutes?
    Is our data processing rate better, worse, or the same as this time last week?
    How many concurrent users do we have?

  5. How many active phone calls are there?
    What’s the average call duration?
    How many calls have there been to 441234567890 today?

  8. call.ringing

  9. call.447700900000.ringing

  10. ded2585.call.447700900000.ringing

  11. ded2585.call.447700900000.ringing
    ded2585.call.447700900000.answered
    ded2585.call.447700900000.complete

  12. ded2585.call.*.placed

  13. *.call.*.placed

  14. *.call.447700900000.placed
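Those dotted paths encode every dimension (host, phone number, call state) positionally, so a query has to know which segment means what. Prometheus instead keeps one metric name and moves the dimensions into labels. A hypothetical sketch of the same call event named both ways (the metric and label names here are illustrative, not from the deck):

```python
# Graphite bakes every dimension into a fixed-order dotted path;
# Prometheus uses one metric name plus free-form labels, so any
# dimension can be selected or aggregated directly.

def graphite_path(host, number, state):
    """Graphite-style hierarchical name: segment order is fixed."""
    return f"{host}.call.{number}.{state}"

def prometheus_series(host, number, state):
    """Prometheus-style: one metric name, dimensions as labels."""
    return f'call_state_total{{host="{host}",number="{number}",state="{state}"}}'

print(graphite_path("ded2585", "447700900000", "placed"))
# → ded2585.call.447700900000.placed
print(prometheus_series("ded2585", "447700900000", "placed"))
# → call_state_total{host="ded2585",number="447700900000",state="placed"}
```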

  19. Hello, I’m Michael

  20. @mheap

  23. Open Source

  24. Mostly Go
    (A little Ruby)

  25. Cloud Native Computing Foundation

  28. How does it work?

  29. Pulls metrics
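Unlike Graphite's push model, Prometheus pulls: it scrapes an HTTP endpoint, conventionally /metrics, that returns the current values in a plain-text exposition format. A minimal sketch of such an endpoint without the official client library (the active_calls metric is made up for illustration):

```python
# Minimal sketch of the pull model: Prometheus periodically GETs
# /metrics and parses whatever text the target returns.
from http.server import BaseHTTPRequestHandler, HTTPServer

ACTIVE_CALLS = 3  # stand-in for a real gauge read from the application

def render_metrics():
    """Render current values in the exposition format Prometheus scrapes."""
    return (
        "# HELP active_calls Number of calls in progress.\n"
        "# TYPE active_calls gauge\n"
        f"active_calls {ACTIVE_CALLS}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To actually expose it, you would run:
#   HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

In practice you would use a client library rather than hand-rolling the handler, but the scrape contract really is this simple: plain text over HTTP.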

  30. Disk storage

  31. Efficient collection

  32. Consul integration

  33. prometheus.yml
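prometheus.yml is where you tell Prometheus what to scrape and how often. A minimal sketch, assuming a single node_exporter running on its default port (the hostnames are illustrative):

```yaml
# Minimal prometheus.yml sketch: scrape one node_exporter every 15s.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
```

Service discovery (such as the Consul integration mentioned above) replaces the static_configs block, so targets come and go without editing this file.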

  34. # HELP node_filesystem_free_bytes Filesystem free space in bytes.
    # TYPE node_filesystem_free_bytes gauge
    node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10
    node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11
    node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11

  35. # HELP node_filesystem_free_bytes Filesystem free space in bytes.

  36. node_filesystem_free_bytes{
      device="/dev/disk1s1",
      fstype="apfs",
      mountpoint="/Volumes/Macintosh HD"
    } 4.9138515968e+10
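Each sample line breaks down into three pieces: a metric name, a set of label pairs, and a value. A quick sketch of pulling those pieces apart (this is a simple regex for well-formed lines like the one above, not Prometheus' own parser):

```python
# Split one exposition-format sample into name, labels, and value.
import re

SAMPLE = ('node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",'
          'mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10')

def parse_sample(line):
    """Return (metric_name, labels_dict, float_value) for one sample line."""
    name, raw_labels, value = re.fullmatch(
        r'(\w+)\{(.*)\}\s+(\S+)', line).groups()
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, float(value)

name, labels, value = parse_sample(SAMPLE)
print(name)                 # → node_filesystem_free_bytes
print(labels["fstype"])     # → apfs
print(value)
```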

  38. node_filesystem_free_bytes
    billing_notifications_total
    process_cpu_seconds_total
    http_request_duration_seconds

  45. node_filesystem_free_bytes{
      device="/dev/disk1s1",
      fstype="apfs",
      mountpoint="/Volumes/Macintosh HD"
    } 4.9138515968e+10

  48. # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 0
    go_gc_duration_seconds{quantile="0.25"} 0
    go_gc_duration_seconds{quantile="0.5"} 0
    go_gc_duration_seconds{quantile="0.75"} 0
    go_gc_duration_seconds{quantile="1"} 0
    go_gc_duration_seconds_sum 0
    go_gc_duration_seconds_count 0
    # HELP go_goroutines Number of goroutines that currently exist.
    # TYPE go_goroutines gauge
    go_goroutines 6
    # HELP go_info Information about the Go environment.
    # TYPE go_info gauge
    go_info{version="go1.10"} 1
    # HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
    # TYPE go_memstats_alloc_bytes gauge
    go_memstats_alloc_bytes 827952
    # HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
    # TYPE go_memstats_alloc_bytes_total counter
    go_memstats_alloc_bytes_total 827952
    # HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
    # TYPE go_memstats_buck_hash_sys_bytes gauge
    go_memstats_buck_hash_sys_bytes 1.443286e+06
    # HELP go_memstats_frees_total Total number of frees.
    # TYPE go_memstats_frees_total counter
    go_memstats_frees_total 243
    # HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
    # TYPE go_memstats_gc_cpu_fraction gauge
    go_memstats_gc_cpu_fraction 0
    # HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
    # TYPE go_memstats_gc_sys_bytes gauge
    go_memstats_gc_sys_bytes 169984
    # HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
    # TYPE go_memstats_heap_alloc_bytes gauge
    go_memstats_heap_alloc_bytes 827952
    # HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
    # TYPE go_memstats_heap_idle_bytes gauge
    go_memstats_heap_idle_bytes 761856
    # HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
    # TYPE go_memstats_heap_inuse_bytes gauge
    go_memstats_heap_inuse_bytes 1.990656e+06
    # HELP go_memstats_heap_objects Number of allocated objects.
    # TYPE go_memstats_heap_objects gauge
    go_memstats_heap_objects 7710
    # HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
    # TYPE go_memstats_heap_released_bytes gauge
    go_memstats_heap_released_bytes 0
    # HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
    # TYPE go_memstats_heap_sys_bytes gauge
    go_memstats_heap_sys_bytes 2.752512e+06
    # HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
    # TYPE go_memstats_last_gc_time_seconds gauge
    go_memstats_last_gc_time_seconds 0
    # HELP go_memstats_lookups_total Total number of pointer lookups.
    # TYPE go_memstats_lookups_total counter
    go_memstats_lookups_total 5
    # HELP go_memstats_mallocs_total Total number of mallocs.
    # TYPE go_memstats_mallocs_total counter
    go_memstats_mallocs_total 7953
    # HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
    # TYPE go_memstats_mcache_inuse_bytes gauge
    go_memstats_mcache_inuse_bytes 6944
    # HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
    # TYPE go_memstats_mcache_sys_bytes gauge
    go_memstats_mcache_sys_bytes 16384
    # HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
    # TYPE go_memstats_mspan_inuse_bytes gauge
    go_memstats_mspan_inuse_bytes 30096
    # HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
    # TYPE go_memstats_mspan_sys_bytes gauge
    go_memstats_mspan_sys_bytes 32768
    # HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
    # TYPE go_memstats_next_gc_bytes gauge
    go_memstats_next_gc_bytes 4.473924e+06
    # HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
    # TYPE go_memstats_other_sys_bytes gauge
    go_memstats_other_sys_bytes 1.059618e+06
    # HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
    # TYPE go_memstats_stack_inuse_bytes gauge
    go_memstats_stack_inuse_bytes 393216
    # HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
    # TYPE go_memstats_stack_sys_bytes gauge
    go_memstats_stack_sys_bytes 393216
    # HELP go_memstats_sys_bytes Number of bytes obtained from system.
    # TYPE go_memstats_sys_bytes gauge
    go_memstats_sys_bytes 5.867768e+06
    # HELP go_threads Number of OS threads created.
    # TYPE go_threads gauge
    go_threads 7
    # HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
    # TYPE node_cpu_seconds_total counter
    node_cpu_seconds_total{cpu="0",mode="idle"} 537140.75
    node_cpu_seconds_total{cpu="0",mode="nice"} 0
    node_cpu_seconds_total{cpu="0",mode="system"} 202810.13
    node_cpu_seconds_total{cpu="0",mode="user"} 236956.35
    node_cpu_seconds_total{cpu="1",mode="idle"} 789924.55
    node_cpu_seconds_total{cpu="1",mode="nice"} 0
    node_cpu_seconds_total{cpu="1",mode="system"} 76430.46
    node_cpu_seconds_total{cpu="1",mode="user"} 110379.86
    node_cpu_seconds_total{cpu="2",mode="idle"} 521434.82
    node_cpu_seconds_total{cpu="2",mode="nice"} 0
    node_cpu_seconds_total{cpu="2",mode="system"} 206715.68
    node_cpu_seconds_total{cpu="2",mode="user"} 248584.66
    node_cpu_seconds_total{cpu="3",mode="idle"} 788754.35
    node_cpu_seconds_total{cpu="3",mode="nice"} 0
    node_cpu_seconds_total{cpu="3",mode="system"} 76188.77
    node_cpu_seconds_total{cpu="3",mode="user"} 111791.47
    # HELP node_disk_read_bytes_total The total number of bytes read successfully.
    # TYPE node_disk_read_bytes_total counter
    node_disk_read_bytes_total{device="disk0"} 6.22708862976e+11
    node_disk_read_bytes_total{device="disk3"} 1.12842752e+08
    # HELP node_disk_read_seconds_total The total number of seconds spent by all reads.
    # TYPE node_disk_read_seconds_total counter
    node_disk_read_seconds_total{device="disk0"} 22165.627411002
    node_disk_read_seconds_total{device="disk3"} 67.88703918
    # HELP node_disk_read_sectors_total The total number of sectors read successfully.
    # TYPE node_disk_read_sectors_total counter
    node_disk_read_sectors_total{device="disk0"} 4327.06494140625
    node_disk_read_sectors_total{device="disk3"} 7.34765625
    # HELP node_disk_reads_completed_total The total number of reads completed successfully.
    # TYPE node_disk_reads_completed_total counter
    node_disk_reads_completed_total{device="disk0"} 1.7723658e+07
    node_disk_reads_completed_total{device="disk3"} 3762
    # HELP node_disk_write_seconds_total This is the total number of seconds spent by all writes.
    # TYPE node_disk_write_seconds_total counter
    node_disk_write_seconds_total{device="disk0"} 8632.255762983
    node_disk_write_seconds_total{device="disk3"} 0
    # HELP node_disk_writes_completed_total The total number of writes completed successfully.
    # TYPE node_disk_writes_completed_total counter
    node_disk_writes_completed_total{device="disk0"} 1.9779856e+07
    node_disk_writes_completed_total{device="disk3"} 0
    # HELP node_disk_written_bytes_total The total number of bytes written successfully.
    # TYPE node_disk_written_bytes_total counter
    node_disk_written_bytes_total{device="disk0"} 6.94838308864e+11
    node_disk_written_bytes_total{device="disk3"} 0
    # HELP node_disk_written_sectors_total The total number of sectors written successfully.
    # TYPE node_disk_written_sectors_total counter
    node_disk_written_sectors_total{device="disk0"} 4829.06640625
    node_disk_written_sectors_total{device="disk3"} 0
    # HELP node_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which node_exporter was built.
    # TYPE node_exporter_build_info gauge
    node_exporter_build_info{branch="HEAD",goversion="go1.10",revision="002c1ca02917406cbecc457162e2bdb1f29c2f49",version="0.16.0-rc.0"} 1
    # HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
    # TYPE node_filesystem_avail_bytes gauge
    node_filesystem_avail_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.7416078336e+10
    node_filesystem_avail_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.58532878336e+11
    node_filesystem_avail_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 3.58565429248e+11
    node_filesystem_avail_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07
    node_filesystem_avail_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_avail_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_device_error Whether an error occurred while getting statistics for the given device.
    # TYPE node_filesystem_device_error gauge
    node_filesystem_device_error{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0
    node_filesystem_device_error{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0
    node_filesystem_device_error{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0
    node_filesystem_device_error{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 0
    node_filesystem_device_error{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_device_error{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_files Filesystem total file nodes.
    # TYPE node_filesystem_files gauge
    node_filesystem_files{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854776e+18
    node_filesystem_files{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036854776e+18
    node_filesystem_files{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18
    node_filesystem_files{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294967279e+09
    node_filesystem_files{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_files{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_files_free Filesystem total free file nodes.
    # TYPE node_filesystem_files_free gauge
    node_filesystem_files_free{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 9.223372036854352e+18
    node_filesystem_files_free{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 9.223372036853541e+18
    node_filesystem_files_free{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 9.223372036854776e+18
    node_filesystem_files_free{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 4.294964965e+09
    node_filesystem_files_free{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_files_free{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_free_bytes Filesystem free space in bytes.
    # TYPE node_filesystem_free_bytes gauge
    node_filesystem_free_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 4.9138515968e+10
    node_filesystem_free_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 3.62441240576e+11
    node_filesystem_free_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.36741931008e+11
    node_filesystem_free_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 2.322432e+07
    node_filesystem_free_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_free_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_readonly Filesystem read-only status.
    # TYPE node_filesystem_readonly gauge
    node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0
    node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0
    node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0
    node_filesystem_readonly{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1
    node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_size_bytes Filesystem size in bytes.
    # TYPE node_filesystem_size_bytes gauge
    node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10
    node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11
    node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11
    node_filesystem_size_bytes{device="/dev/disk3s2",fstype="hfs",mountpoint="/Volumes/Deckset"} 1.3418496e+08
    node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_load1 1m load average.
    # TYPE node_load1 gauge
    node_load1 2.451171875
    # HELP node_load15 15m load average.
    # TYPE node_load15 gauge
    node_load15 2.7646484375
    # HELP node_load5 5m load average.
    # TYPE node_load5 gauge
    node_load5 2.6083984375
    # HELP node_memory_active_bytes_total Memory information field active_bytes_total.
    # TYPE node_memory_active_bytes_total gauge
    node_memory_active_bytes_total 3.251331072e+09
    # HELP node_memory_bytes_total Memory information field bytes_total.
    # TYPE node_memory_bytes_total gauge
    node_memory_bytes_total 1.7179869184e+10
    # HELP node_memory_free_bytes_total Memory information field free_bytes_total.
    # TYPE node_memory_free_bytes_total gauge
    node_memory_free_bytes_total 5.61926144e+08
    # HELP node_memory_inactive_bytes_total Memory information field inactive_bytes_total.
    # TYPE node_memory_inactive_bytes_total gauge
    node_memory_inactive_bytes_total 3.997949952e+09
    # HELP node_memory_swapped_in_pages_total Memory information field swapped_in_pages_total.
    # TYPE node_memory_swapped_in_pages_total gauge
    node_memory_swapped_in_pages_total 2.51926528e+09
    # HELP node_memory_swapped_out_pages_total Memory information field swapped_out_pages_total.
    # TYPE node_memory_swapped_out_pages_total gauge
    node_memory_swapped_out_pages_total 3.131211776e+09
    # HELP node_memory_wired_bytes_total Memory information field wired_bytes_total.
    # TYPE node_memory_wired_bytes_total gauge
    node_memory_wired_bytes_total 3.211726848e+09
    # HELP node_network_receive_bytes_total Network device statistic receive_bytes.
    # TYPE node_network_receive_bytes_total counter
    node_network_receive_bytes_total{device="XHC0"} 0
    node_network_receive_bytes_total{device="XHC1"} 0
    node_network_receive_bytes_total{device="XHC20"} 0
    node_network_receive_bytes_total{device="awdl0"} 5120
    node_network_receive_bytes_total{device="bridge0"} 0
    node_network_receive_bytes_total{device="en0"} 1.214772224e+09
    node_network_receive_bytes_total{device="en1"} 0
    node_network_receive_bytes_total{device="en2"} 0
    node_network_receive_bytes_total{device="en3"} 0
    node_network_receive_bytes_total{device="en4"} 0
    node_network_receive_bytes_total{device="en5"} 1.000448e+06
    node_network_receive_bytes_total{device="gif0"} 0
    node_network_receive_bytes_total{device="lo0"} 2.01657344e+09
    node_network_receive_bytes_total{device="p2p0"} 0
    node_network_receive_bytes_total{device="stf0"} 0
    node_network_receive_bytes_total{device="utun0"} 0
    node_network_receive_bytes_total{device="utun1"} 505856
    node_network_receive_bytes_total{device="utun2"} 23552
    node_network_receive_bytes_total{device="utun3"} 46080
    node_network_receive_bytes_total{device="utun4"} 0
    node_network_receive_bytes_total{device="utun5"} 0
    node_network_receive_bytes_total{device="utun6"} 0
    node_network_receive_bytes_total{device="vboxnet0"} 1.631232e+06
    # HELP node_network_receive_errs_total Network device statistic receive_errs.
    # TYPE node_network_receive_errs_total counter
    node_network_receive_errs_total{device="XHC0"} 0
    node_network_receive_errs_total{device="XHC1"} 0
    node_network_receive_errs_total{device="XHC20"} 0
    node_network_receive_errs_total{device="awdl0"} 0
    node_network_receive_errs_total{device="bridge0"} 0
    node_network_receive_errs_total{device="en0"} 0
    node_network_receive_errs_total{device="en1"} 0
    node_network_receive_errs_total{device="en2"} 0
    node_network_receive_errs_total{device="en3"} 0
    node_network_receive_errs_total{device="en4"} 0
    node_network_receive_errs_total{device="en5"} 0
    node_network_receive_errs_total{device="gif0"} 0
    node_network_receive_errs_total{device="lo0"} 0
    node_network_receive_errs_total{device="p2p0"} 0
    node_network_receive_errs_total{device="stf0"} 0
    node_network_receive_errs_total{device="utun0"} 0
    node_network_receive_errs_total{device="utun1"} 0
    node_network_receive_errs_total{device="utun2"} 0
    node_network_receive_errs_total{device="utun3"} 0
    node_network_receive_errs_total{device="utun4"} 0
    node_network_receive_errs_total{device="utun5"} 0
    node_network_receive_errs_total{device="utun6"} 0
    node_network_receive_errs_total{device="vboxnet0"} 0
    # HELP node_network_receive_multicast_total Network device statistic receive_multicast.
    # TYPE node_network_receive_multicast_total counter
    node_network_receive_multicast_total{device="XHC0"} 0
    node_network_receive_multicast_total{device="XHC1"} 0
    node_network_receive_multicast_total{device="XHC20"} 0
    node_network_receive_multicast_total{device="awdl0"} 33
    node_network_receive_multicast_total{device="bridge0"} 0
    node_network_receive_multicast_total{device="en0"} 5.331321e+06
    node_network_receive_multicast_total{device="en1"} 0
    node_network_receive_multicast_total{device="en2"} 0
    node_network_receive_multicast_total{device="en3"} 0
    node_network_receive_multicast_total{device="en4"} 0
    node_network_receive_multicast_total{device="en5"} 4
    node_network_receive_multicast_total{device="gif0"} 0
    node_network_receive_multicast_total{device="lo0"} 266605
    node_network_receive_multicast_total{device="p2p0"} 0
    node_network_receive_multicast_total{device="stf0"} 0
    node_network_receive_multicast_total{device="utun0"} 0
    node_network_receive_multicast_total{device="utun1"} 0
    node_network_receive_multicast_total{device="utun2"} 0
    node_network_receive_multicast_total{device="utun3"} 0
    node_network_receive_multicast_total{device="utun4"} 0
    node_network_receive_multicast_total{device="utun5"} 0
    node_network_receive_multicast_total{device="utun6"} 0
    node_network_receive_multicast_total{device="vboxnet0"} 98
    # HELP node_network_receive_packets_total Network device statistic receive_packets.
    # TYPE node_network_receive_packets_total counter
    node_network_receive_packets_total{device="XHC0"} 0
    node_network_receive_packets_total{device="XHC1"} 0
    node_network_receive_packets_total{device="XHC20"} 0
    node_network_receive_packets_total{device="awdl0"} 42
    node_network_receive_packets_total{device="bridge0"} 0
    node_network_receive_packets_total{device="en0"} 5.6394197e+07
    node_network_receive_packets_total{device="en1"} 0
    node_network_receive_packets_total{device="en2"} 0
    node_network_receive_packets_total{device="en3"} 0
    node_network_receive_packets_total{device="en4"} 0
    node_network_receive_packets_total{device="en5"} 4299
    node_network_receive_packets_total{device="gif0"} 0
    node_network_receive_packets_total{device="lo0"} 3.243677e+06
    node_network_receive_packets_total{device="p2p0"} 0
    node_network_receive_packets_total{device="stf0"} 0
    node_network_receive_packets_total{device="utun0"} 0
    node_network_receive_packets_total{device="utun1"} 3548
    node_network_receive_packets_total{device="utun2"} 168
    node_network_receive_packets_total{device="utun3"} 226
    node_network_receive_packets_total{device="utun4"} 0
    node_network_receive_packets_total{device="utun5"} 0
    node_network_receive_packets_total{device="utun6"} 0
    node_network_receive_packets_total{device="vboxnet0"} 1533
    # HELP node_network_transmit_bytes_total Network device statistic transmit_bytes.
    # TYPE node_network_transmit_bytes_total counter
    node_network_transmit_bytes_total{device="XHC0"} 0
    node_network_transmit_bytes_total{device="XHC1"} 0
    node_network_transmit_bytes_total{device="XHC20"} 0
    node_network_transmit_bytes_total{device="awdl0"} 1.50016e+06
    node_network_transmit_bytes_total{device="bridge0"} 0
    node_network_transmit_bytes_total{device="en0"} 2.575358976e+09
    node_network_transmit_bytes_total{device="en1"} 0
    node_network_transmit_bytes_total{device="en2"} 0
    node_network_transmit_bytes_total{device="en3"} 0
    node_network_transmit_bytes_total{device="en4"} 0
    node_network_transmit_bytes_total{device="en5"} 483328
    node_network_transmit_bytes_total{device="gif0"} 0
    node_network_transmit_bytes_total{device="lo0"} 2.01657344e+09
    node_network_transmit_bytes_total{device="p2p0"} 0
    node_network_transmit_bytes_total{device="stf0"} 0
    node_network_transmit_bytes_total{device="utun0"} 0
    node_network_transmit_bytes_total{device="utun1"} 493568
    node_network_transmit_bytes_total{device="utun2"} 23552
    node_network_transmit_bytes_total{device="utun3"} 46080
    node_network_transmit_bytes_total{device="utun4"} 0
    node_network_transmit_bytes_total{device="utun5"} 0
    node_network_transmit_bytes_total{device="utun6"} 0
    node_network_transmit_bytes_total{device="vboxnet0"} 1.695744e+06
    # HELP node_network_transmit_errs_total Network device statistic transmit_errs.
    # TYPE node_network_transmit_errs_total counter
    node_network_transmit_errs_total{device="XHC0"} 0
    node_network_transmit_errs_total{device="XHC1"} 0
    node_network_transmit_errs_total{device="XHC20"} 0
    node_network_transmit_errs_total{device="awdl0"} 0
    node_network_transmit_errs_total{device="bridge0"} 0
    node_network_transmit_errs_total{device="en0"} 0
    node_network_transmit_errs_total{device="en1"} 0
    node_network_transmit_errs_total{device="en2"} 0
    node_network_transmit_errs_total{device="en3"} 0
    node_network_transmit_errs_total{device="en4"} 0
    node_network_transmit_errs_total{device="en5"} 0
    node_network_transmit_errs_total{device="gif0"} 0
    node_network_transmit_errs_total{device="lo0"} 0
    node_network_transmit_errs_total{device="p2p0"} 0
    node_network_transmit_errs_total{device="stf0"} 0
    node_network_transmit_errs_total{device="utun0"} 0
    node_network_transmit_errs_total{device="utun1"} 0
    node_network_transmit_errs_total{device="utun2"} 0
    node_network_transmit_errs_total{device="utun3"} 0
    node_network_transmit_errs_total{device="utun4"} 0
    node_network_transmit_errs_total{device="utun5"} 0
    node_network_transmit_errs_total{device="utun6"} 0
    node_network_transmit_errs_total{device="vboxnet0"} 0
    # HELP node_network_transmit_multicast_total Network device statistic transmit_multicast.
    # TYPE node_network_transmit_multicast_total counter
    node_network_transmit_multicast_total{device="XHC0"} 0
    node_network_transmit_multicast_total{device="XHC1"} 0
    node_network_transmit_multicast_total{device="XHC20"} 0
    node_network_transmit_multicast_total{device="awdl0"} 0
    node_network_transmit_multicast_total{device="bridge0"} 0
    node_network_transmit_multicast_total{device="en0"} 0
    node_network_transmit_multicast_total{device="en1"} 0
    node_network_transmit_multicast_total{device="en2"} 0
    node_network_transmit_multicast_total{device="en3"} 0
    node_network_transmit_multicast_total{device="en4"} 0
    node_network_transmit_multicast_total{device="en5"} 0
    node_network_transmit_multicast_total{device="gif0"} 0
    node_network_transmit_multicast_total{device="lo0"} 0
    node_network_transmit_multicast_total{device="p2p0"} 0
    node_network_transmit_multicast_total{device="stf0"} 0
    node_network_transmit_multicast_total{device="utun0"} 0
    node_network_transmit_multicast_total{device="utun1"} 0
    node_network_transmit_multicast_total{device="utun2"} 0
    node_network_transmit_multicast_total{device="utun3"} 0
    node_network_transmit_multicast_total{device="utun4"} 0
    node_network_transmit_multicast_total{device="utun5"} 0
    node_network_transmit_multicast_total{device="utun6"} 0
    node_network_transmit_multicast_total{device="vboxnet0"} 0
    # HELP node_network_transmit_packets_total Network device statistic transmit_packets.
    # TYPE node_network_transmit_packets_total counter
    node_network_transmit_packets_total{device="XHC0"} 0
    node_network_transmit_packets_total{device="XHC1"} 0
    node_network_transmit_packets_total{device="XHC20"} 0
    node_network_transmit_packets_total{device="awdl0"} 6691
    node_network_transmit_packets_total{device="bridge0"} 1
    node_network_transmit_packets_total{device="en0"} 3.2582836e+07
    node_network_transmit_packets_total{device="en1"} 0
    node_network_transmit_packets_total{device="en2"} 0
    node_network_transmit_packets_total{device="en3"} 0
    node_network_transmit_packets_total{device="en4"} 0
    node_network_transmit_packets_total{device="en5"} 4145
    node_network_transmit_packets_total{device="gif0"} 0
    node_network_transmit_packets_total{device="lo0"} 3.243677e+06
    node_network_transmit_packets_total{device="p2p0"} 0
    node_network_transmit_packets_total{device="stf0"} 0
    node_network_transmit_packets_total{device="utun0"} 2
    node_network_transmit_packets_total{device="utun1"} 3236
    node_network_transmit_packets_total{device="utun2"} 160
    node_network_transmit_packets_total{device="utun3"} 223
    node_network_transmit_packets_total{device="utun4"} 2
    node_network_transmit_packets_total{device="utun5"} 2
    node_network_transmit_packets_total{device="utun6"} 2
    node_network_transmit_packets_total{device="vboxnet0"} 73766
    # HELP node_scrape_collector_duration_seconds node_exporter: Duration of a collector scrape.
    # TYPE node_scrape_collector_duration_seconds gauge
    node_scrape_collector_duration_seconds{collector="cpu"} 0.00013298
    node_scrape_collector_duration_seconds{collector="diskstats"} 0.000803364
    node_scrape_collector_duration_seconds{collector="filesystem"} 0.000119007
    node_scrape_collector_duration_seconds{collector="loadavg"} 2.3448e-05
    node_scrape_collector_duration_seconds{collector="meminfo"} 5.3036e-05
    node_scrape_collector_duration_seconds{collector="netdev"} 0.000338404
    node_scrape_collector_duration_seconds{collector="textfile"} 1.7727e-05
    node_scrape_collector_duration_seconds{collector="time"} 2.8571e-05
    # HELP node_scrape_collector_success node_exporter: Whether a collector succeeded.
    # TYPE node_scrape_collector_success gauge
    node_scrape_collector_success{collector="cpu"} 1
    node_scrape_collector_success{collector="diskstats"} 1
    node_scrape_collector_success{collector="filesystem"} 1
    node_scrape_collector_success{collector="loadavg"} 1
    node_scrape_collector_success{collector="meminfo"} 1
    node_scrape_collector_success{collector="netdev"} 1
    node_scrape_collector_success{collector="textfile"} 1
    node_scrape_collector_success{collector="time"} 1
    # HELP node_textfile_scrape_error 1 if there was an error opening or reading a file, 0 otherwise
    # TYPE node_textfile_scrape_error gauge
    node_textfile_scrape_error 0
    # HELP node_time_seconds System time in seconds since epoch (1970).
    # TYPE node_time_seconds gauge
    node_time_seconds 1.5210412225783854e+09
    # HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
    # TYPE promhttp_metric_handler_requests_in_flight gauge
    promhttp_metric_handler_requests_in_flight 1
    # HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
    # TYPE promhttp_metric_handler_requests_total counter
    promhttp_metric_handler_requests_total{code="200"} 0
    promhttp_metric_handler_requests_total{code="500"} 0
    promhttp_metric_handler_requests_total{code="503"} 0

    View Slide

  49. @mheap
    #longhornphp
    # HELP node_filesystem_readonly Filesystem read-only status.
    # TYPE node_filesystem_readonly gauge
    node_filesystem_readonly{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 0
    node_filesystem_readonly{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 0
    node_filesystem_readonly{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 0
    node_filesystem_readonly{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_readonly{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_filesystem_size_bytes Filesystem size in bytes.
    # TYPE node_filesystem_size_bytes gauge
    node_filesystem_size_bytes{device="/dev/disk1s1",fstype="apfs",mountpoint="/Volumes/Macintosh HD"} 5.9999997952e+10
    node_filesystem_size_bytes{device="/dev/disk2s1",fstype="apfs",mountpoint="/"} 4.3996317696e+11
    node_filesystem_size_bytes{device="/dev/disk2s4",fstype="apfs",mountpoint="/private/var/vm"} 4.3996317696e+11
    node_filesystem_size_bytes{device="map -hosts",fstype="autofs",mountpoint="/net"} 0
    node_filesystem_size_bytes{device="map auto_home",fstype="autofs",mountpoint="/home"} 0
    # HELP node_load1 1m load average.
    # TYPE node_load1 gauge
    node_load1 2.451171875
    # HELP node_load15 15m load average.
    # TYPE node_load15 gauge
    node_load15 2.7646484375

    View Slide

  50. @mheap
    #longhornphp
    Node Exporter?

    View Slide

  51. @mheap
    #longhornphp
    Exporter?

    View Slide

  52. @mheap
    #longhornphp
    Exposes metrics

    View Slide

  53. @mheap
    #longhornphp
    node_exporter
    Key Description
    arp Exposes ARP statistics from /proc/net/arp.
    cpu Exposes CPU statistics
    filesystem Exposes filesystem statistics, such as disk space used.
    ipvs Exposes IPVS status from /proc/net/ip_vs and stats from /proc/net/ip_vs_stats.
    netstat Exposes network statistics from /proc/net/netstat. This is the same information as netstat -s.
    uname Exposes system information as provided by the uname system call.

    View Slide

  54. @mheap
    #longhornphp
    mysqld_exporter
    Key Description
    perf_schema.tablelocks Collect metrics from performance_schema.table_lock_waits_summary_by_table
    info_schema.processlist Collect thread state counts from information_schema.processlist
    binlog_size Collect the current size of all registered binlog files
    auto_increment.columns Collect auto_increment columns and max values from information_schema

    View Slide

  55. @mheap
    #longhornphp
    haproxy_exporter
    Key Description
    current_queue Current number of queued requests assigned to this server
    current_sessions Current number of active sessions
    bytes_in_total Current total of incoming bytes
    connection_errors_total Total of connection errors

    View Slide

  56. @mheap
    #longhornphp
    memcached_exporter
    Key Description
    bytes_read Total number of bytes read by this server from network
    connections_total Total number of connections opened since the server started running
    items_evicted_total Total number of valid items removed from cache to free memory for new items
    commands_total Total number of all requests broken down by command (get, set, etc.) and status per slab

    View Slide

  57. @mheap
    #longhornphp
    Create your own metrics

    View Slide

  58. @mheap
    #longhornphp
    calls_received_total{
    network="o2",
    number="447700900000",
    type="mobile"
    }
    11

    View Slide

  59. @mheap
    #longhornphp
    calls_received_total{
    network="o2",
    number="447700900000",
    type="mobile"
    }
    11

    View Slide

  60. @mheap
    #longhornphp
    calls_received_total{
    network="o2",
    number="447700900000",
    type="mobile"
    }
    11

    View Slide

  61. @mheap
    #longhornphp
    calls_received_total{
    network="o2",
    number="447700900000",
    type="mobile"
    }
    11

    View Slide

  62. @mheap
    #longhornphp
    calls_received_total{
    network="o2",
    number="447700900000",
    type="mobile"
    }
    11

    View Slide

  63. @mheap
    #longhornphp
    calls_received_total{
    network="o2",
    number="447700900000",
    type="mobile"
    }
    11

    View Slide

  64. @mheap
    #longhornphp
    Increment a counter
    Serve on /metrics
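    A minimal sketch of those two steps using the official Python client, prometheus_client (a PHP client follows the same shape; the metric name and labels mirror the earlier slides):

    ```python
    from prometheus_client import CollectorRegistry, Counter, generate_latest

    registry = CollectorRegistry()

    # Step 1: define and increment a counter. The client appends the
    # _total suffix in the exposition format.
    calls = Counter(
        'calls_received', 'Calls received, by number',
        ['network', 'number', 'type'],
        registry=registry,
    )
    calls.labels(network='o2', number='447700900000', type='mobile').inc()

    # Step 2: this byte string is exactly what a /metrics endpoint returns;
    # start_http_server(port) would serve it for Prometheus to scrape.
    print(generate_latest(registry).decode())
    ```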

    View Slide

  65. @mheap
    #longhornphp
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: nexmo_calls
        static_configs:
          - targets: ['localhost:3000']

    View Slide

  66. @mheap
    #longhornphp
    Increment a counter
    Serve on /metrics

    View Slide

  67. @mheap
    #longhornphp
    Hard in PHP

    View Slide

  68. @mheap
    #longhornphp
    Pushgateway
    The Prometheus Pushgateway exists to allow ephemeral and batch
    jobs to expose their metrics to Prometheus. Since these kinds of
    jobs may not exist long enough to be scraped, they can instead
    push their metrics to a Pushgateway. The Pushgateway then
    exposes these metrics to Prometheus
    https://github.com/prometheus/pushgateway
    https://github.com/Lazyshot/prometheus-php
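    As a sketch of the push side in Python's prometheus_client (the gateway address and job name here are assumptions; Lazyshot/prometheus-php offers the equivalent from PHP):

    ```python
    from prometheus_client import CollectorRegistry, Counter, push_to_gateway

    registry = CollectorRegistry()
    jobs_done = Counter('batch_jobs_processed', 'Jobs processed by this run',
                        registry=registry)
    jobs_done.inc(42)

    def push_metrics():
        # One HTTP PUT per run; Prometheus then scrapes the gateway on its
        # normal interval. Assumes a Pushgateway listening on localhost:9091.
        push_to_gateway('localhost:9091', job='nightly_batch', registry=registry)
    ```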

    View Slide

  69. @mheap
    #longhornphp
    Things to know

    View Slide

  70. @mheap
    #longhornphp
    > 5-10 labels is bad

    View Slide

  71. @mheap
    #longhornphp
    Secure /metrics

    View Slide

  72. @mheap
    #longhornphp
    Metric Types

    View Slide

  73. @mheap
    #longhornphp
    Counters
    calls_placed_total

    View Slide

  74. @mheap
    #longhornphp
    Gauges
    calls_active

    View Slide

  75. @mheap
    #longhornphp
    Histograms
    calls_duration

    View Slide

  76. @mheap
    #longhornphp
    Summaries
    calls_duration

    View Slide

  77. @mheap
    #longhornphp
    calls_duration_bucket{le="10",network="o2",number="447700900000",type="mobile"} 16
    calls_duration_bucket{le="30",network="o2",number="447700900000",type="mobile"} 63
    calls_duration_bucket{le="60",network="o2",number="447700900000",type="mobile"} 123
    calls_duration_bucket{le="120",network="o2",number="447700900000",type="mobile"} 253
    calls_duration_bucket{le="300",network="o2",number="447700900000",type="mobile"} 618

    View Slide

  78. @mheap
    #longhornphp
    calls_duration{quantile="0.5"} 85
    calls_duration{quantile="0.9"} 123
    calls_duration{quantile="0.99"} 221
    calls_duration_sum 13130
    calls_duration_count 6

    View Slide

  79. @mheap
    #longhornphp
    Counters
    Use for counting events that happen (e.g. total number of requests) and query using rate()
    Gauges
    Use to instrument the current state of a metric (e.g. memory usage, jobs in queue)
    Histograms
    Use to sample observations and analyse the distribution of a data set (e.g. request latency)
    Summaries
    Use for quantiles pre-calculated on the client side, but be mindful of calculation cost and aggregation limitations
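    The same types in Python's prometheus_client, as an illustrative sketch (metric names mirror the slides; the bucket boundaries are an assumption):

    ```python
    from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

    registry = CollectorRegistry()

    # Counter: only ever goes up; query with rate().
    calls_placed = Counter('calls_placed', 'Calls placed', registry=registry)
    calls_placed.inc()

    # Gauge: current state; can go up and down.
    calls_active = Gauge('calls_active', 'Currently active calls', registry=registry)
    calls_active.inc()
    calls_active.inc()
    calls_active.dec()  # two calls started, one hung up

    # Histogram: server-side buckets, aggregatable with histogram_quantile().
    calls_duration = Histogram('calls_duration_seconds', 'Call duration',
                               buckets=[10, 30, 60, 120, 300], registry=registry)
    calls_duration.observe(85)
    ```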

    View Slide

  80. @mheap
    #longhornphp
    Show me graphs

    View Slide

  81. @mheap
    #longhornphp

    View Slide

  82. @mheap
    #longhornphp
    PromQL

    View Slide

  83. @mheap
    #longhornphp
    calls_placed_total
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} 4
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"} 8
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="442079460000",type="landline"} 1
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447700900000",type="mobile"} 6
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="o2",number="447908249481",type="mobile"} 7

    View Slide

  84. @mheap
    #longhornphp
    calls_placed_total{number="441234567890"}
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"} 4

    View Slide

  85. @mheap
    #longhornphp
    calls_placed_total{number="441234567890"}[3m]
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}
    3 @1521482766.23
    4 @1521482769.23
    12 @1521482772.229
    16 @1521482775.229
    21 @1521482778.23
    25 @1521482781.23
    27 @1521482784.229
    31 @1521482787.229
    35 @1521482790.229

    View Slide

  86. @mheap
    #longhornphp
    calls_placed_total{number="441234567890"}[3m] offset 1w
    Element Value
    calls_placed_total{instance="localhost:3000",job="nexmo_calls",network="BT",number="441234567890",type="landline"}
    2 @1521311766.23
    7 @1521311769.23
    18 @1523112772.229
    20 @1523112775.229
    27 @1523112778.23
    28 @1523112781.23
    30 @1523112784.229
    36 @1523112787.229
    39 @1523112790.229

    View Slide

  87. @mheap
    #longhornphp
    # Total number of calls regardless of any labels
    sum(calls_placed_total)
    # Total number of calls, broken down by the number label
    sum(calls_placed_total) by (number)
    # Total per-second rate over the last 5 minutes by number
    sum(rate(calls_placed_total[5m])) by (number)

    View Slide

  88. @mheap
    #longhornphp
    sum(calls_placed_total{network="EE", type="mobile"})

    View Slide

  89. @mheap
    #longhornphp
    sum(calls_placed_total{network=~"E.*", type="mobile"})

    View Slide

  90. @mheap
    #longhornphp
    sum(calls_placed_total{network!="EE", type="mobile"})

    View Slide

  91. @mheap
    #longhornphp
    rate(calls_duration_sum{network="EE"}[5m])
    /
    rate(calls_duration_count{network="EE"}[5m])

    View Slide

  92. @mheap
    #longhornphp
    histogram_quantile(
    0.95,
    calls_duration_bucket{number=~"[[number]]"}
    )
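    histogram_quantile works by finding the bucket that contains the target rank and interpolating linearly within it. A hypothetical re-implementation of that arithmetic in Python, fed with the cumulative bucket counts from the earlier calls_duration_bucket slide (treating the last bucket's count as the total):

    ```python
    def histogram_quantile(q, buckets):
        """buckets: sorted list of (upper_bound, cumulative_count)."""
        total = buckets[-1][1]
        rank = q * total
        prev_bound, prev_count = 0.0, 0
        for bound, count in buckets:
            if count >= rank:
                # Linear interpolation inside the bucket containing the rank.
                return prev_bound + (bound - prev_bound) * (
                    (rank - prev_count) / (count - prev_count))
            prev_bound, prev_count = bound, count
        return buckets[-1][0]

    # le=10 -> 16 calls, le=30 -> 63, ... (from the calls_duration_bucket slide)
    buckets = [(10, 16), (30, 63), (60, 123), (120, 253), (300, 618)]
    print(histogram_quantile(0.95, buckets))
    ```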

    View Slide

  93. @mheap
    #longhornphp
    sum without (duration) (rate(calls_placed_total{number=~"[[number]]"}[3m]))

    View Slide

  94. @mheap
    #longhornphp
    predict_linear(calls_active[1h], 86400)
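    predict_linear fits a least-squares line to the samples in the range and extrapolates it the given number of seconds ahead. A rough sketch of that calculation, with made-up samples (a gauge growing by one every 15 seconds over the last hour):

    ```python
    def predict_linear(samples, seconds_ahead):
        """samples: list of (timestamp, value) pairs. Fit a least-squares
        line and evaluate it seconds_ahead past the last sample."""
        n = len(samples)
        t_end = samples[-1][0]
        xs = [t - t_end for t, _ in samples]  # measure time from "now"
        ys = [v for _, v in samples]
        mx = sum(xs) / n
        my = sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        intercept = my - slope * mx
        return intercept + slope * seconds_ahead

    # Hypothetical calls_active samples: +1 every 15s for an hour.
    samples = [(t, t / 15) for t in range(0, 3601, 15)]
    print(predict_linear(samples, 86400))  # value predicted 24h from now
    ```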

    View Slide

  95. @mheap
    #longhornphp
    Grafana

    View Slide

  96. View Slide

  97. View Slide

  98. View Slide

  99. @mheap
    #longhornphp
    Gauges
    calls_active

    View Slide

  100. @mheap
    #longhornphp

    View Slide

  101. @mheap
    #longhornphp
    Counters
    calls_placed_total

    View Slide

  102. @mheap
    #longhornphp

    View Slide

  103. @mheap
    #longhornphp

    View Slide

  104. @mheap
    #longhornphp
    Histograms
    calls_duration

    View Slide

  105. @mheap
    #longhornphp

    View Slide

  106. @mheap
    #longhornphp

    View Slide

  107. @mheap
    #longhornphp

    View Slide

  108. @mheap
    #longhornphp
    Version 5

    View Slide

  109. @mheap
    #longhornphp
    Alertmanager

    View Slide

  110. @mheap
    #longhornphp
    Alertmanager
    rules

    View Slide

  111. @mheap
    #longhornphp
    alert: CallsMonitorDown
    expr: up{job="nexmo_calls"} == 0
    for: 5m
    labels:
      severity: critical

    View Slide

  112. @mheap
    #longhornphp
    alert: LotsOfJobsInQueue
    expr: sum(jobs_in_queue) > 100
    for: 5m
    labels:
      severity: major

    View Slide

  113. @mheap
    #longhornphp
    alert: DiskFullInFourHours
    expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: major

    View Slide

  114. @mheap
    #longhornphp
    alert: HighCallsBeingPlacedOnLandline
    expr: rate(calls_placed_total{network=~".*", type="landline"}[1m]) > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Unusually high call count on {{ $labels.network }}'
      summary: 'High call count on {{ $labels.network }}'

    View Slide

  115. @mheap
    #longhornphp
    Alertmanager
    alerts

    View Slide

  116. @mheap
    #longhornphp
    [ smtp_from: <tmpl_string> ]
    [ slack_api_url: <secret> ]
    [ victorops_api_key: <secret> ]
    [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
    [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
    [ opsgenie_api_key: <secret> ]
    [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
    [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
    [ hipchat_auth_token: <secret> ]

    View Slide

  117. @mheap
    #longhornphp
    Alertmanager
    routes

    View Slide

  118. @mheap
    #longhornphp
    route:
      receiver: 'default-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [cluster, alertname]
      routes:
      - receiver: 'database-pager'
        group_wait: 10s
        match_re:
          service: mysql|cassandra
      - receiver: 'frontend-pager'
        group_by: [product, environment]
        match:
          team: frontend

    View Slide

  119. @mheap
    #longhornphp
    route:
      receiver: 'default-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [cluster, alertname]
      routes:
      - receiver: 'database-pager'
        group_wait: 10s
        match_re:
          service: mysql|cassandra
      - receiver: 'frontend-pager'
        group_by: [product, environment]
        match:
          team: frontend

    View Slide

  120. @mheap
    #longhornphp
    route:
      receiver: 'default-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [cluster, alertname]
      routes:
      - receiver: 'database-pager'
        group_wait: 10s
        match_re:
          service: mysql|cassandra
      - receiver: 'frontend-pager'
        group_by: [product, environment]
        match:
          team: frontend

    View Slide

  121. @mheap
    #longhornphp
    route:
      receiver: 'default-receiver'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      group_by: [cluster, alertname]
      routes:
      - receiver: 'database-pager'
        group_wait: 10s
        match_re:
          service: mysql|cassandra
      - receiver: 'frontend-pager'
        group_by: [product, environment]
        match:
          team: frontend

    View Slide

  122. @mheap
    #longhornphp
    receivers:
    - name: 'team-X-mails'
      email_configs:
      - to: '[email protected]'
    - name: 'team-X-pager'
      email_configs:
      - to: '[email protected]'
      pagerduty_configs:
      - routing_key:
    - name: 'team-Y-mails'
      email_configs:
      - to: '[email protected]'
    - name: 'team-Y-pager'
      pagerduty_configs:
      - routing_key:
    - name: 'team-DB-pager'
      pagerduty_configs:
      - routing_key:

    View Slide

  123. @mheap
    #longhornphp
    Alertmanager
    inhibits

    View Slide

  124. @mheap
    #longhornphp
    ! Database is down
    ! User login failure > 100
    ! Report generation failure > 15
    ! GET /healthcheck returned 500

    View Slide

  125. @mheap
    #longhornphp
    ! Database is down
    ! User login failure > 100
    ! Report generation failure > 15
    ! GET /healthcheck returned 500

    View Slide

  126. @mheap
    #longhornphp
    ! Database is down
    ! User login failure > 100
    ! Report generation failure > 15
    ! GET /healthcheck returned 500

    View Slide

  127. @mheap
    #longhornphp
    ! Database is down
    ! User login failure > 100
    ! Report generation failure > 15
    ! GET /healthcheck returned 500

    View Slide

  128. @mheap
    #longhornphp
    The DB is the
    root cause

    View Slide

  129. @mheap
    #longhornphp
    inhibit_rules:
    - source_match:
        alertname: 'DatabaseDown'
      target_match:
        alertname: 'UserLoginFailure'
      equal: ['instance']

    View Slide

  130. @mheap
    #longhornphp

    View Slide

  131. @mheap
    #longhornphp
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

    View Slide

  132. @mheap
    #longhornphp
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

    View Slide

  133. @mheap
    #longhornphp
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

    View Slide

  134. @mheap
    #longhornphp
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

    View Slide

  135. @mheap
    #longhornphp
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

    View Slide

  136. @mheap
    #longhornphp
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

    View Slide

  137. @mheap
    #longhornphp
    Ingester
    Grouper
    Deduplicator
    Silencer
    Throttler
    Notifier

    View Slide

  138. @mheap
    #longhornphp
    Does it scale?

    View Slide

  139. @mheap
    #longhornphp
    Yes.

    View Slide

  140. @mheap
    #longhornphp
    4.6M time series per server
    72k samples ingested per second, per server
    185 production prometheus servers

    View Slide

  141. @mheap
    #longhornphp
    Prometheus
    federates

    View Slide

  142. @mheap
    #longhornphp
    Alertmanager
    gossips

    View Slide

  143. @mheap
    #longhornphp
    So that’s Prometheus

    View Slide

  144. @mheap
    #longhornphp
    So that’s Prometheus
    (and PromQL, Grafana and Alertmanager)

    View Slide

  145. @mheap
    #longhornphp
    I’LL TAKE QUESTIONS OFF STAGE
    @MHEAP
    HTTPS://JOIND.IN/TALK/B0900

    View Slide