Get Instrumented: How Prometheus Can Unify Your Metrics

Hynek Schlawack

May 31, 2016

Transcript

  1. Hynek Schlawack Get Instrumented How Prometheus Can Unify Your Metrics

  2–6. Goals

  7–10. Service Level Indicator Objective (Agreement)

  11–14. Metrics: at 12:00 12:01 12:02 12:03 12:04, avg latency 0.3 0.5 0.8 1.1 2.6; server load 0.3 1.0 2.3 3.5 5.2

  15. None
  16–20. Instrument

  21. None
  22. None
  23–28. Metric Types ❖ counter ❖ gauge ❖ summary ❖ histogram ❖ buckets (1s, 0.5s, 0.25s, …)
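
These four types map directly onto classes in the Python client used later in the talk; a minimal sketch (the metric names here are made up for illustration):

    from prometheus_client import Counter, Gauge, Summary, Histogram

    # counter: only ever goes up (requests served, errors seen, ...)
    REQS = Counter("app_http_reqs_total", "Total HTTP requests.")

    # gauge: can go up and down (requests currently in flight)
    IN_FLIGHT = Gauge("app_in_progress_requests", "Requests in progress.")

    # summary: running count and sum of observations
    LATENCY_SUMMARY = Summary("app_req_latency_seconds", "Request latency.")

    # histogram: observations counted into cumulative buckets (1s, 0.5s, 0.25s, ...)
    LATENCY_HISTOGRAM = Histogram(
        "app_req_latency_hist_seconds", "Request latency.",
        buckets=(0.25, 0.5, 1.0))
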
  29–36. Averages ❖ avg(request time) ≠ avg(UX) ❖ avg({1, 1, 1, 1, 10}) = 2.8 ❖ median({1, 1, 1, 1, 10}) = 1 ❖ median({1, 1, 100_000}) = 1
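
The arithmetic behind those bullets is easy to check with the standard library; the data sets are the slide's own:

    from statistics import mean, median

    print(mean([1, 1, 1, 1, 10]))    # 2.8  -> one outlier drags the average up
    print(median([1, 1, 1, 1, 10]))  # 1
    print(median([1, 1, 100_000]))   # 1    -> but the median hides outliers entirely
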
  37–38. Percentiles: the nth percentile P of a data set is a value that is ≥ n% of the values

  39. None
  40–41. 50th percentile = 1 ms: 50% of requests are done within 1 ms

  42–44. Percentiles of {1, 1, 100_000}: 50th = 1, 95th = 90_000
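
The 95th-percentile figure assumes linear interpolation between the two largest values; NumPy (not used in the talk, just a convenient check) reproduces it:

    import numpy as np

    data = [1, 1, 100_000]
    print(np.percentile(data, 50))  # 1.0
    print(np.percentile(data, 95))  # 90000.1, i.e. the slide's ~90_000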

  45. None
  46. None
  47. None
  48–53. Naming
    backend1_app_http_reqs_msgs_post
    backend1_app_http_reqs_msgs_get
    …
    app_http_reqs_total
    app_http_reqs_total{meth="POST", path="/msgs", backend="1"}
    app_http_reqs_total{meth="GET", path="/msgs", backend="1"}
    …
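
With the Python client, folding per-backend metric names into one metric with labels looks roughly like this; the Counter type and the increments are assumptions, the names are the slide's:

    from prometheus_client import Counter

    # one metric family; method, path and backend become label dimensions
    HTTP_REQS = Counter(
        "app_http_reqs_total", "Total HTTP requests.",
        labelnames=["meth", "path", "backend"])

    # every distinct label combination is its own time series
    HTTP_REQS.labels(meth="POST", path="/msgs", backend="1").inc()
    HTTP_REQS.labels(meth="GET", path="/msgs", backend="1").inc()
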
  54. None
  55. None
  56–57. (1) resolution = scraping interval, (2) missing scrapes = lower resolution

  58. Pull: Problems ❖ short-lived jobs

  59. None
  60. Pull: Problems ❖ short-lived jobs ❖ target discovery

  61–64. Configuration
    scrape_configs:
      - job_name: 'prometheus'
        target_groups:
          - targets:
            - 'localhost:9090'
    {instance="localhost:9090",job="prometheus"}
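
A scrape is nothing more than an HTTP GET of the target's /metrics endpoint at each interval; with the demo config above (Prometheus scraping itself on localhost:9090) you can do the same by hand:

    import urllib.request

    # fetch the text exposition format exactly as the scraper would
    with urllib.request.urlopen("http://localhost:9090/metrics") as resp:
        print(resp.read().decode("utf-8"))
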
  65. None
  66. Pull: Problems ❖ target discovery ❖ short-lived jobs ❖ Heroku/NATed systems

  67–71. Pull: Advantages ❖ multiple Prometheis easy ❖ outage detection ❖ predictable, no self-DoS ❖ easy to instrument 3rd parties

  72–76. Metrics Format
    # HELP req_seconds Time spent processing a request in seconds.
    # TYPE req_seconds histogram
    req_seconds_count 390.0
    req_seconds_sum 177.0319407
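
The Python client renders this exposition format for you; a small sketch (req_seconds is the slide's metric, the single observation is made up):

    from prometheus_client import Histogram, generate_latest

    REQ_TIME = Histogram(
        "req_seconds", "Time spent processing a request in seconds.")
    REQ_TIME.observe(0.42)  # record one request that took 0.42 s

    # prints the # HELP / # TYPE lines plus _bucket, _count and _sum samples
    print(generate_latest().decode("utf-8"))
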
  77–79. Percentiles
    req_seconds_bucket{le="0.05"} 0.0
    req_seconds_bucket{le="0.25"} 1.0
    req_seconds_bucket{le="0.5"} 273.0
    req_seconds_bucket{le="0.75"} 369.0
    req_seconds_bucket{le="1.0"} 388.0
    req_seconds_bucket{le="2.0"} 390.0
    req_seconds_bucket{le="+Inf"} 390.0
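
Because the buckets are cumulative (le means "less than or equal"), the share of requests finishing within each bound falls straight out of these counts:

    # cumulative bucket counts from the slide; 390 requests in total
    buckets = {0.05: 0, 0.25: 1, 0.5: 273, 0.75: 369, 1.0: 388, 2.0: 390}
    total = 390

    for le, count in buckets.items():
        print(f"<= {le}s: {count / total:.1%}")  # e.g. 70.0% of requests within 0.5s
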
  80. None
  81–85. Aggregation sum( rate( req_seconds_count[1m] ) )

  86. Aggregation sum( rate( req_seconds_count{dc="west"}[1m] ) )

  87. Aggregation sum( rate( req_seconds_count[1m] ) ) by (dc)

  88–92. Percentiles histogram_quantile( 0.9, rate( req_seconds_bucket[10m] ))
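
These expressions can also be evaluated outside the web UI, for example via Prometheus's HTTP API; a sketch assuming a server on localhost:9090 and the metrics from the earlier slides:

    import json
    import urllib.parse
    import urllib.request

    def promql(expr, base="http://localhost:9090"):
        # evaluate a PromQL expression via the /api/v1/query endpoint
        url = base + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["data"]["result"]

    # overall request rate and 90th percentile latency, as on the slides
    print(promql("sum(rate(req_seconds_count[1m]))"))
    print(promql("histogram_quantile(0.9, rate(req_seconds_bucket[10m]))"))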

  93. None
  94. None
  95–98. Internal ❖ great for ad-hoc ❖ 1 expr per graph ❖ templating

  99–102. PromDash ❖ best integration ❖ former official ❖ now deprecated ❖ don’t bother

  103–107. Grafana ❖ pretty & powerful ❖ many integrations ❖ mix and match! ❖ use this!

  108. None
  109. Alerts & Scrying

  110. Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <

    0 FOR 5m
  111. Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <

    0 FOR 5m
  112. Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <

    0 FOR 5m
  113. Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <

    0 FOR 5m
  114. Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <

    0 FOR 5m
  115. Alerts & Scrying ALERT DiskWillFillIn4Hours IF predict_linear( node_filesystem_free[1h], 4*3600) <

    0 FOR 5m
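
predict_linear() fits a line through the samples of the last hour and extrapolates it four hours ahead; the alert fires once the extrapolated free space goes negative and stays that way for five minutes. The idea, stripped of Prometheus, is plain least-squares extrapolation (a toy illustration, not Prometheus's actual implementation):

    # toy predict_linear: least-squares fit over (timestamp, value) samples,
    # then extrapolate `seconds` past the newest sample
    def predict_linear(samples, seconds):
        n = len(samples)
        mean_t = sum(t for t, _ in samples) / n
        mean_v = sum(v for _, v in samples) / n
        slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples) /
                 sum((t - mean_t) ** 2 for t, _ in samples))
        intercept = mean_v - slope * mean_t
        return slope * (samples[-1][0] + seconds) + intercept

    # a disk losing ~1 MB/s: the 4 h prediction goes negative, so the alert would fire
    samples = [(t, 10_000_000_000 - 1_000_000 * t) for t in range(0, 3600, 60)]
    print(predict_linear(samples, 4 * 3600) < 0)  # True
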
  116. None
  117. None
  118. None
  119. Environment

  120. None
  121–122. Apache nginx Django PostgreSQL MySQL MongoDB CouchDB redis Varnish etcd Kubernetes Consul collectd HAProxy statsd graphite InfluxDB SNMP

  123–124. node_exporter cAdvisor

  125–131. System Insight ❖ load ❖ procs ❖ memory ❖ network ❖ disk ❖ I/O

  132–135. mtail ❖ follow (log) files ❖ extract metrics using regex ❖ can be better than direct
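
mtail's model, follow a log file and turn regex matches into metrics, can be approximated in a few lines of Python when a full deployment is overkill; everything below (log path, log format, metric name) is made up for illustration:

    import re
    import time
    from prometheus_client import Counter, start_http_server

    # pull the HTTP status code out of an access-log line
    LINE_RE = re.compile(r'" (?P<status>\d{3}) ')
    STATUS = Counter("log_http_responses_total",
                     "Responses seen in the log.", ["status"])

    def follow(path):
        # "tail -f" in Python: yield lines as they are appended
        with open(path) as f:
            f.seek(0, 2)
            while True:
                line = f.readline()
                if line:
                    yield line
                else:
                    time.sleep(0.5)

    if __name__ == "__main__":
        start_http_server(8001)  # expose the counter for scraping
        for line in follow("/var/log/nginx/access.log"):
            m = LINE_RE.search(line)
            if m:
                STATUS.labels(status=m.group("status")).inc()
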
  136–140. Moar ❖ Edges: web servers/HAProxy ❖ black box ❖ databases ❖ network

  141–144. So Far ❖ system stats ❖ outside look ❖ 3rd party components

  145. Code

  146–149. cat-or.not ❖ HTTP service ❖ upload picture ❖ meow!/nope

  150–152.
    from flask import Flask, g, request
    from cat_or_not import is_cat

    app = Flask(__name__)

    @app.route("/analyze", methods=["POST"])
    def analyze():
        g.auth.check(request)
        return ("meow!" if is_cat(request.files["pic"])
                else "nope!")

    if __name__ == "__main__":
        app.run()

  153. pip install prometheus_client

  154.
    from prometheus_client import \
        start_http_server

    # …

    if __name__ == "__main__":
        start_http_server(8000)
        app.run()

  155–160.
    process_virtual_memory_bytes 156393472.0
    process_resident_memory_bytes 20480000.0
    process_start_time_seconds 1460214325.21
    process_cpu_seconds_total 0.169999999998
    process_open_fds 8.0
    process_max_fds 1024.0

  161. None
  162–164.
    from prometheus_client import \
        Histogram, Gauge

    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")
    ANALYZE_TIME = Histogram(
        "cat_or_not_analyze_seconds",
        "Time spent analyzing pictures.")
    IN_PROGRESS = Gauge(
        "cat_or_not_in_progress_requests",
        "Number of requests in progress.")

  165–166.
    # app.route must be the outermost decorator so that Flask
    # dispatches to the metric-wrapped function
    @app.route("/analyze", methods=["POST"])
    @IN_PROGRESS.track_inprogress()
    @REQUEST_TIME.time()
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"

  167–170.
    AUTH_TIME = Histogram("auth_seconds",
        "Time spent authenticating.")
    AUTH_ERRS = Counter("auth_errors_total",
        "Errors while authing.")
    AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total",
        "Wrong credentials.")

    class Auth:
        # ...
        @AUTH_TIME.time()
        def auth(self, request):
            while True:
                try:
                    return self._auth(request)
                except WrongCredsError:
                    AUTH_WRONG_CREDS.inc()
                    raise
                except Exception:
                    AUTH_ERRS.inc()

  171.
    @app.route("/analyze", methods=["POST"])
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"

  172. pip install prometheus_async

  173. Wrapper
    from prometheus_async.aio import time

    @time(REQUEST_TIME)
    async def view(request):
        # ...

  174–177. Goodies ❖ aiohttp-based metrics export ❖ also in thread! ❖ Consul Agent integration
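
The "also in thread!" bullet refers to prometheus_async's aiohttp-based exporter being usable from synchronous code as well; roughly like the sketch below, which is based on the aio.web helpers as I understand them, so check the prometheus_async docs for the exact signatures:

    from prometheus_async.aio import web

    # serve /metrics from a background thread; works even if the rest of
    # the application is not asyncio-based
    web.start_http_server_in_thread(port=8000)
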
  178–182. Wrap Up ✓ ✓ ✓

  183. ox.cx/p @hynek vrmd.de