Get Instrumented: How Prometheus Can Unify Your Metrics

Hynek Schlawack

May 31, 2016

Transcript

  1. 2.
  2. 3.
  3. 4.
  4. 5.
  5. 6.
  6. 11.
  7. 14.

    Metrics

                 12:00  12:01  12:02  12:03  12:04
    avg latency    0.3    0.5    0.8    1.1    2.6
    server load    0.3    1.0    2.3    3.5    5.2
  8. 15.
  9. 21.
  10. 22.
  11. 29.
  12. 34.

    Averages

    ❖ avg(request time) ≠ avg(UX)
    ❖ avg({1, 1, 1, 1, 10}) = 2.8
    ❖ median({1, 1, 1, 1, 10}) = 1
  14. 36.

    Averages

    ❖ avg(request time) ≠ avg(UX)
    ❖ avg({1, 1, 1, 1, 10}) = 2.8
    ❖ median({1, 1, 1, 1, 10}) = 1
    ❖ median({1, 1, 100_000}) = 1
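The arithmetic on the Averages slide can be checked with Python's standard `statistics` module. This is only a small sketch using the numbers from the slide; the variable name `request_times` is made up for illustration.

```python
from statistics import mean, median

# Request times from the slide: four fast requests and one slow one.
request_times = [1, 1, 1, 1, 10]

# The single slow request drags the average far away from the typical case.
print(mean(request_times))
print(median(request_times))

# The median has the opposite blind spot: it does not show how bad the outlier is.
print(median([1, 1, 100_000]))
```

Neither number alone describes what users experience, which is the point the slide makes.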
  15. 39.
  16. 45.
  17. 46.
  18. 47.
  19. 48.
  20. 54.
  21. 55.
  22. 59.
  23. 65.
  24. 71.

    Pull: Advantages

    ❖ multiple Prometheis easy
    ❖ outage detection
    ❖ predictable, no self-DoS
    ❖ easy to instrument 3rd parties
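The "outage detection" bullet can be illustrated with a toy scraper: when the server pulls, a target that fails to answer is itself a data point (Prometheus records this per target in an `up` series). The sketch below is a stand-in, not Prometheus code; `scrape`, `fetch`, the two fake targets and the URLs are all hypothetical, and `fetch` replaces a real HTTP GET so the example runs without a network.

```python
from typing import Callable, Tuple


def scrape(fetch: Callable[[str], str], url: str) -> Tuple[int, str]:
    """Pull metrics from one target and return (up, body)."""
    try:
        return 1, fetch(url)
    except OSError:
        # No answer from the target: report it as down instead of failing.
        return 0, ""


def healthy_target(url: str) -> str:
    # Hypothetical target that answers with one metric line.
    return "req_seconds_count 390.0\n"


def dead_target(url: str) -> str:
    # Hypothetical target whose process is gone.
    raise ConnectionRefusedError(url)


print(scrape(healthy_target, "http://app-1:8000/metrics"))
print(scrape(dead_target, "http://app-2:8000/metrics"))
```

With a push model the second case would simply look like silence, which is harder to tell apart from an idle service.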
  25. 72.

    Metrics Format

    # HELP req_seconds Time spent processing a request in seconds.
    # TYPE req_seconds histogram
    req_seconds_count 390.0
    req_seconds_sum 177.0319407
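Because the format is plain text, it is easy to read by hand or by script. The sketch below parses the sample from the slide with the standard library only and derives the mean request time from `_sum` and `_count`. The helper `parse_samples` is hypothetical and only handles label-free `name value` lines; a real histogram also exposes `_bucket` lines with labels, which this sketch does not attempt to parse.

```python
SAMPLE = """\
# HELP req_seconds Time spent processing a request in seconds.
# TYPE req_seconds histogram
req_seconds_count 390.0
req_seconds_sum 177.0319407
"""


def parse_samples(text: str) -> dict:
    """Parse label-free `name value` lines, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and empty lines carry no sample
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


samples = parse_samples(SAMPLE)
mean_seconds = samples["req_seconds_sum"] / samples["req_seconds_count"]
print(f"{mean_seconds:.3f} s per request on average")
```

Note that this is exactly the kind of average the Averages slide warns about; the sum and count alone cannot tell a median from an outlier.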
  30. 80.
  31. 93.
  32. 94.
  33. 95.
  34. 99.
  35. 103.
  36. 108.
  37. 116.
  38. 117.
  39. 118.
  40. 120.
  41. 121.

    Apache, nginx, Django, PostgreSQL, MySQL, MongoDB, CouchDB, redis,
    Varnish, etcd, Kubernetes, Consul, collectd, HAProxy, statsd,
    graphite, InfluxDB, SNMP
  43. 132.
  44. 136.
  45. 141.
  46. 145.
  47. 146.
  48. 150.

    from flask import Flask, g, request
    from cat_or_not import is_cat

    app = Flask(__name__)

    @app.route("/analyze", methods=["POST"])
    def analyze():
        g.auth.check(request)
        return ("meow!" if is_cat(request.files["pic"]) else "nope!")

    if __name__ == "__main__":
        app.run()
  51. 154.

    from prometheus_client import start_http_server

    # …

    if __name__ == "__main__":
        start_http_server(8000)
        app.run()
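To make the idea behind that one extra line concrete, here is a standard-library sketch of the same pattern: a second HTTP server in a background thread that answers scrapes while the main application keeps running. It is not the `prometheus_client` implementation; `start_metrics_server`, `MetricsHandler` and the metric `cat_or_not_up` are hypothetical names chosen for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

METRICS_TEXT = b"cat_or_not_up 1.0\n"  # hypothetical placeholder metric


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every GET with the current metrics as plain text.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(METRICS_TEXT)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_metrics_server(port: int) -> ThreadingHTTPServer:
    """Serve metrics from a daemon thread so the caller is not blocked."""
    server = ThreadingHTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_metrics_server(0)  # port 0 lets the OS pick a free port
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("GET", "/metrics")
body = conn.getresponse().read()
conn.close()
print(body.decode())

server.shutdown()
server.server_close()
```

Keeping metrics on their own port means the scrape endpoint does not have to share routing or authentication with the application itself.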
  52. 161.
  53. 163.

    from prometheus_client import Histogram, Gauge

    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")
    ANALYZE_TIME = Histogram(
        "cat_or_not_analyze_seconds",
        "Time spent analyzing pictures.")
  54. 164.

    from prometheus_client import Histogram, Gauge

    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")
    ANALYZE_TIME = Histogram(
        "cat_or_not_analyze_seconds",
        "Time spent analyzing pictures.")
    IN_PROGRESS = Gauge(
        "cat_or_not_in_progress_requests",
        "Number of requests in progress.")
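What the two metric types record can be sketched in plain Python. A histogram keeps a running count, a running sum and cumulative bucket counters; a gauge is a value that can go up and down. The classes below are toy stand-ins, not the `prometheus_client` classes, and the bucket bounds are arbitrary values picked for the example.

```python
import bisect
from contextlib import contextmanager


class SketchHistogram:
    """Toy cumulative-bucket histogram; the bounds are arbitrary."""

    def __init__(self, bounds=(0.1, 0.5, 1.0)):
        self.bounds = list(bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        self.count += 1
        self.sum += value
        # Cumulative buckets: every bucket whose upper bound is >= value counts it.
        first = bisect.bisect_left(self.bounds, value)
        for i in range(first, len(self.bounds)):
            self.bucket_counts[i] += 1


class SketchGauge:
    """Toy gauge that tracks how many blocks are currently running."""

    def __init__(self):
        self.value = 0

    @contextmanager
    def track_inprogress(self):
        self.value += 1
        try:
            yield
        finally:
            self.value -= 1  # decremented even if the block raises


request_time = SketchHistogram()
in_progress = SketchGauge()

for seconds in (0.05, 0.3, 0.3, 2.0):
    with in_progress.track_inprogress():
        request_time.observe(seconds)

print(request_time.bucket_counts)  # [1, 3, 3, 4]
print(request_time.count, in_progress.value)
```

Because only counters are stored, the cost per observation stays constant no matter how many requests are served.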
  55. 167.

    from prometheus_client import Counter, Histogram

    AUTH_TIME = Histogram("auth_seconds", "Time spent authenticating.")
    AUTH_ERRS = Counter("auth_errors_total", "Errors while authing.")
    AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total", "Wrong credentials.")

    class Auth:
        # ...
        @AUTH_TIME.time()
        def auth(self, request):
            # Retry on unexpected errors; wrong credentials are final.
            while True:
                try:
                    return self._auth(request)
                except WrongCredsError:
                    AUTH_WRONG_CREDS.inc()
                    raise
                except Exception:
                    AUTH_ERRS.inc()
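The control flow on this slide is easier to follow when it runs without any dependencies. In the sketch below a list and a dict stand in for the histogram and the two counters, and `timed`, `flaky_backend` and `outcomes` are hypothetical helpers. It shows that one timed call covers all retries, that transient errors are counted and retried, and that wrong credentials are counted once and re-raised.

```python
import time
from functools import wraps

durations = []  # stand-in for the AUTH_TIME histogram
counts = {"errors": 0, "wrong_creds": 0}  # stand-ins for the two counters


class WrongCredsError(Exception):
    pass


def timed(func):
    """Record how long each call takes; in this sketch, even if it raises."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            durations.append(time.perf_counter() - start)
    return wrapper


@timed
def auth(backend):
    while True:
        try:
            return backend()
        except WrongCredsError:
            counts["wrong_creds"] += 1
            raise  # wrong credentials will not get better by retrying
        except Exception:
            counts["errors"] += 1  # transient failure: count it and retry


# Hypothetical backend that fails twice before succeeding.
outcomes = iter([RuntimeError("timeout"), RuntimeError("timeout"), "alice"])


def flaky_backend():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


user = auth(flaky_backend)
print(user, counts, len(durations))
```

Two errors are counted but only one duration is recorded, so the timing reflects what the caller waited for, retries included.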
  59. 174.
  60. 178.
  61. 179.