Get Instrumented: How Prometheus Can Unify Your Metrics

Hynek Schlawack

May 31, 2016

Transcript

  1. Hynek Schlawack
    Get Instrumented: How Prometheus Can Unify Your Metrics

  2. Goals

  10. Service Level
    Indicator
    Objective
    (Agreement)

  14. Metrics
                   12:00  12:01  12:02  12:03  12:04
    avg latency    0.3    0.5    0.8    1.1    2.6
    server load    0.3    1.0    2.3    3.5    5.2

  16. Instrument

  28. Metric Types
    ❖counter
    ❖gauge
    ❖summary
    ❖histogram
    ❖ buckets (1s, 0.5s, 0.25s, …)
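
A histogram sorts observations into cumulative buckets and also keeps a running count and sum. The following is a minimal stdlib-only sketch of that idea, not the `prometheus_client` implementation: the class name `BucketHistogram` and the request times are hypothetical, and the bucket bounds simply follow the slide.

```python
import math


class BucketHistogram:
    """Toy histogram with cumulative upper-bound buckets (illustrative only)."""

    def __init__(self, bounds=(0.25, 0.5, 1.0)):
        # Always finish with +Inf so every observation lands somewhere.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        self.count += 1
        self.sum += value
        # Cumulative: an observation is counted in every bucket whose
        # upper bound is greater than or equal to the value.
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1


hist = BucketHistogram()
for seconds in (0.1, 0.3, 0.4, 0.9, 3.0):  # made-up request times
    hist.observe(seconds)

print(dict(zip(hist.bounds, hist.bucket_counts)))
```

Because the buckets are cumulative, the last (`+Inf`) bucket always equals the total number of observations.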

  36. Averages
    ❖ avg(request time) ≠ avg(UX)
    ❖ avg({1, 1, 1, 1, 10}) = 2.8
    ❖ median({1, 1, 1, 1, 10}) = 1
    ❖ median({1, 1, 100_000}) = 1
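
The arithmetic on this slide can be checked with Python's `statistics` module. This is only a quick sketch; the variable name `request_times` is an assumption, the numbers are the ones from the slide.

```python
from statistics import mean, median

request_times = [1, 1, 1, 1, 10]

print(mean(request_times))       # 2.8: one slow request drags the average up
print(median(request_times))     # 1: what the typical request looks like
print(median([1, 1, 100_000]))   # 1: the median hides the outlier completely
```

Neither number alone describes the user experience, which is why percentiles come next.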

  38. Percentiles
    nth percentile P of a data set = P ≥ n% of values

  41. 50th percentile = 1 ms
    50% of requests done by 1 ms

  44. Percentiles
    P     {1, 1, 100_000}
    50th  1
    95th  90_000
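
The 95th percentile of roughly 90_000 is consistent with linear interpolation between the two highest values. The sketch below assumes that interpolation method (other percentile definitions give other results), and the function name `percentile` is hypothetical.

```python
def percentile(values, n):
    """Return the n-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * n / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [1, 1, 100_000]
p50 = percentile(data, 50)
p95 = percentile(data, 95)
print(p50, round(p95))
```

Unlike the median, the high percentile makes the single very slow request visible.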

  53. Naming
    backend1_app_http_reqs_msgs_post
    backend1_app_http_reqs_msgs_get

    app_http_reqs_total{meth="POST", path="/msgs", backend="1"}
    app_http_reqs_total{meth="GET", path="/msgs", backend="1"}

    app_http_reqs_total
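
Moving the method, path and backend out of the metric name and into labels is what makes later aggregation possible. The following sketch models that with a plain dictionary; it is not how Prometheus stores data, the helpers `inc` and `total` are hypothetical, and the `backend="2"` series is made up for illustration.

```python
from collections import Counter

# Keys are (metric name, sorted label pairs); values are counts.
samples = Counter()


def inc(name, **labels):
    """Increment the series identified by name plus label set."""
    samples[(name, tuple(sorted(labels.items())))] += 1


inc("app_http_reqs_total", meth="POST", path="/msgs", backend="1")
inc("app_http_reqs_total", meth="GET", path="/msgs", backend="1")
inc("app_http_reqs_total", meth="GET", path="/msgs", backend="2")


def total(name, **match):
    """Sum all series of `name` whose labels include `match`."""
    return sum(
        value
        for (series, labels), value in samples.items()
        if series == name and match.items() <= dict(labels).items()
    )


print(total("app_http_reqs_total"))              # all requests
print(total("app_http_reqs_total", meth="GET"))  # only GETs, any backend
```

With the old naming scheme, each of these questions would need its own hand-maintained list of metric names.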

  57. 1. resolution = scraping interval
    2. missing scrapes = less resolution

  60. Pull: Problems
    ❖ short lived jobs
    ❖ target discovery

  64. Configuration
    scrape_configs:
      - job_name: 'prometheus'
        target_groups:
          - targets:
              - 'localhost:9090'

    {instance="localhost:9090",job="prometheus"}

  66. Pull: Problems
    ❖ target discovery
    ❖ short lived jobs
    ❖ Heroku/NATed systems

  71. Pull: Advantages
    ❖ multiple Prometheis easy
    ❖ outage detection
    ❖ predictable, no self-DoS
    ❖ easy to instrument 3rd parties

  72. Metrics Format
    # HELP req_seconds Time spent processing a request in seconds.
    # TYPE req_seconds histogram
    req_seconds_count 390.0
    req_seconds_sum 177.0319407
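
The text format is simple enough to read by hand. As a rough sketch, the snippet below parses the sample lines from this slide and derives the average request time from `_sum` and `_count`. It only handles label-free samples without timestamps, so it is an illustration and not a full parser.

```python
exposition = """\
# HELP req_seconds Time spent processing a request in seconds.
# TYPE req_seconds histogram
req_seconds_count 390.0
req_seconds_sum 177.0319407
"""

samples = {}
for line in exposition.splitlines():
    if not line or line.startswith("#"):
        continue  # skip blanks and HELP/TYPE comments
    name, value = line.rsplit(" ", 1)
    samples[name] = float(value)

average = samples["req_seconds_sum"] / samples["req_seconds_count"]
print(round(average, 3))
```

The result is roughly 0.45 seconds per request, with all the caveats about averages mentioned earlier.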

  77. Percentiles
    req_seconds_bucket{le="0.05"} 0.0
    req_seconds_bucket{le="0.25"} 1.0
    req_seconds_bucket{le="0.5"} 273.0
    req_seconds_bucket{le="0.75"} 369.0
    req_seconds_bucket{le="1.0"} 388.0
    req_seconds_bucket{le="2.0"} 390.0
    req_seconds_bucket{le="+Inf"} 390.0
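
Cumulative buckets like these are enough to estimate a percentile. The sketch below interpolates linearly inside the bucket that contains the requested rank. It follows the general idea behind bucket-based quantile estimation but is not Prometheus's implementation, and `estimate_quantile` is a hypothetical name.

```python
import math

# Cumulative bucket counts from the slide: (upper bound, count).
buckets = [
    (0.05, 0.0), (0.25, 1.0), (0.5, 273.0), (0.75, 369.0),
    (1.0, 388.0), (2.0, 390.0), (math.inf, 390.0),
]


def estimate_quantile(q, buckets):
    """Estimate quantile q (0..1) by interpolating inside one bucket."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound  # nothing to interpolate across
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return math.nan


p90 = estimate_quantile(0.9, buckets)
print(p90)
```

The estimate can only be as precise as the bucket boundaries, so buckets should be chosen around the latencies that matter.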

  82. Aggregation
    sum(
      rate(
        req_seconds_count[1m]
      )
    )

  86. Aggregation
    sum(
      rate(
        req_seconds_count{dc="west"}[1m]
      )
    )

  87. Aggregation
    sum(
      rate(
        req_seconds_count[1m]
      )
    ) by (dc)
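
Conceptually, this query first turns each counter series into a per-second rate and then adds the rates up per data center. The following Python sketch mimics that on hypothetical scraped samples. `simple_rate` is a simplification: unlike the real `rate()`, it ignores counter resets and does no extrapolation.

```python
from collections import defaultdict

# Hypothetical scraped counter samples: labels -> [(unix_time, value), ...]
series = {
    ("west", "web1"): [(0, 100.0), (30, 130.0), (60, 160.0)],
    ("west", "web2"): [(0, 50.0), (30, 80.0), (60, 110.0)],
    ("east", "web3"): [(0, 10.0), (30, 25.0), (60, 40.0)],
}


def simple_rate(samples):
    """Per-second increase between the first and last sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Equivalent in spirit to: sum(rate(...[1m])) by (dc)
per_dc = defaultdict(float)
for (dc, _instance), samples in series.items():
    per_dc[dc] += simple_rate(samples)

print(dict(per_dc))
```

Taking the rate before summing matters: each series is a separate counter that may restart independently.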

  88. Percentiles
    histogram_quantile(
      0.9, rate(
        req_seconds_bucket[10m]
    ))

  98. Internal
    ❖ great for ad-hoc
    ❖ 1 expr per graph
    ❖ templating

  102. PromDash
    ❖ best integration
    ❖ former official
    ❖ now deprecated
    ❖ don’t bother

  107. Grafana
    ❖ pretty & powerful
    ❖ many integrations
    ❖ mix and match!
    ❖ use this!

  110. Alerts & Scrying
    ALERT DiskWillFillIn4Hours
      IF predict_linear(
        node_filesystem_free[1h], 4*3600) < 0
      FOR 5m
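
The alert fits a straight line through the last hour of free-space samples and checks whether that line drops below zero four hours from now. The sketch below shows the idea with an ordinary least-squares fit. The Python function `predict_linear` and the sample data are hypothetical, and the sketch extrapolates from the last sample rather than from the evaluation time.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (time, value) samples, extrapolated."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


# Hypothetical free-space samples over one hour, losing 1 MB per second.
free_bytes = [(t, 10e9 - 1e6 * t) for t in range(0, 3601, 60)]

predicted = predict_linear(free_bytes, 4 * 3600)
print(predicted < 0)  # True means the disk is predicted to be full within 4 hours
```

The `FOR 5m` clause in the rule then requires the prediction to stay negative for five minutes before the alert fires.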

  119. Environment

  121. Apache
    nginx
    Django
    PostgreSQL
    MySQL
    MongoDB
    CouchDB
    redis
    Varnish
    etcd
    Kubernetes
    Consul
    collectd
    HAProxy
    statsd
    graphite
    InfluxDB
    SNMP

  124. node_exporter
    cAdvisor

  131. System Insight
    ❖ load
    ❖ procs
    ❖ memory
    ❖ network
    ❖ disk
    ❖ I/O

  135. mtail
    ❖ follow (log) files
    ❖ extract metrics using regex
    ❖ can be better than direct instrumentation
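
mtail has its own program language, so the following Python sketch only illustrates the principle of turning log lines into metrics with a regular expression. The log format, `LINE_RE` and the metric names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical access-log lines; real formats vary.
log_lines = [
    '127.0.0.1 "POST /analyze HTTP/1.1" 200 0.231',
    '127.0.0.1 "POST /analyze HTTP/1.1" 500 1.502',
    '127.0.0.1 "GET /health HTTP/1.1" 200 0.002',
]

LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<secs>[\d.]+)$'
)

requests_total = Counter()  # labelled by (method, status)
seconds_sum = 0.0
for line in log_lines:
    match = LINE_RE.search(line)
    if match is None:
        continue  # ignore lines that do not look like requests
    requests_total[(match["method"], match["status"])] += 1
    seconds_sum += float(match["secs"])

print(requests_total)
print(round(seconds_sum, 3))
```

This approach needs no code changes in the application, which is why it can be attractive for third-party software.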

  140. Moar
    ❖ Edges: web servers/HAProxy
    ❖ black box
    ❖ databases
    ❖ network

  144. So Far
    ❖ system stats
    ❖ outside look
    ❖ 3rd party components

  145. Code

  149. cat-or.not
    ❖ HTTP service
    ❖ upload picture
    ❖ meow!/nope
    meow!

  150. from flask import Flask, g, request
    from cat_or_not import is_cat
    app = Flask(__name__)
    @app.route("/analyze", methods=["POST"])
    def analyze():
        g.auth.check(request)
        return ("meow!"
                if is_cat(request.files["pic"])
                else "nope!")
    if __name__ == "__main__":
        app.run()

  153. pip install prometheus_client

  154. from prometheus_client import start_http_server
    # …
    if __name__ == "__main__":
        start_http_server(8000)
        app.run()

  155. process_virtual_memory_bytes 156393472.0
    process_resident_memory_bytes 20480000.0
    process_start_time_seconds 1460214325.21
    process_cpu_seconds_total 0.169999999998
    process_open_fds 8.0
    process_max_fds 1024.0

  164. from prometheus_client import Histogram, Gauge
    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")
    ANALYZE_TIME = Histogram(
        "cat_or_not_analyze_seconds",
        "Time spent analyzing pictures.")
    IN_PROGRESS = Gauge(
        "cat_or_not_in_progress_requests",
        "Number of requests in progress.")

  165. @app.route("/analyze", methods=["POST"])
    # route() must come first so Flask registers the instrumented function.
    @IN_PROGRESS.track_inprogress()
    @REQUEST_TIME.time()
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"
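
Decorator order matters with registering decorators. Flask's `route()` registers the function it receives and returns it unchanged, so only decorators listed below `@app.route` end up in the registered view. The sketch shows the effect with a hypothetical stand-in registry (`route`, `ROUTES`) and a hypothetical `counted` decorator instead of Flask and `prometheus_client`.

```python
import functools

ROUTES = {}            # hypothetical stand-in for Flask's URL map
CALLS = {"count": 0}   # hypothetical stand-in for a metric


def route(path):
    """Register the function as-is and return it unchanged."""
    def decorator(func):
        ROUTES[path] = func
        return func
    return decorator


def counted(func):
    """Hypothetical instrumentation: count every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        CALLS["count"] += 1
        return func(*args, **kwargs)
    return wrapper


@counted            # applied after registration: the router never sees it
@route("/wrong")
def wrong():
    return "nope!"


@route("/right")    # registers the already instrumented function
@counted
def right():
    return "meow!"


ROUTES["/wrong"]()
ROUTES["/right"]()
print(CALLS["count"])
```

Only the call routed to `/right` is counted, because that is the only registered function that carries the instrumentation.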

  167. from prometheus_client import Counter, Histogram
    AUTH_TIME = Histogram("auth_seconds",
                          "Time spent authenticating.")
    AUTH_ERRS = Counter("auth_errors_total",
                        "Errors while authing.")
    AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total",
                               "Wrong credentials.")
    class Auth:
        # ...
        @AUTH_TIME.time()
        def auth(self, request):
            while True:
                try:
                    return self._auth(request)
                except WrongCredsError:
                    AUTH_WRONG_CREDS.inc()
                    raise
                except Exception:
                    # Count the error, then retry.
                    AUTH_ERRS.inc()

  171. @app.route("/analyze", methods=["POST"])
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"

  172. pip install prometheus_async

  173. Wrapper
    from prometheus_async.aio import time
    @time(REQUEST_TIME)
    async def view(request):
        # ...
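
Coroutines need their own wrapper because the measured time has to include awaiting the coroutine, not just creating it. The sketch below shows one way such a decorator could look using only the standard library. It is not the `prometheus_async` implementation: `timed` and `OBSERVED` are hypothetical names, and a plain list stands in for a histogram.

```python
import asyncio
import functools
import time as clock

OBSERVED = []  # hypothetical stand-in for a histogram's observations


def timed(observe):
    """Decorator factory: report each awaited call's duration to `observe`."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = clock.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions alike.
                observe(clock.perf_counter() - start)
        return wrapper
    return decorator


@timed(OBSERVED.append)
async def view(request):
    await asyncio.sleep(0.01)  # pretend to do some I/O
    return "meow!"


result = asyncio.run(view(None))
print(result, len(OBSERVED))
```

A synchronous timing decorator applied to `view` would stop the clock as soon as the coroutine object is returned, long before the work is done.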

  177. Goodies
    ❖ aiohttp-based metrics export
    ❖ also in thread!
    ❖ Consul Agent integration

  181. Wrap Up
    ✓ ✓

  183. ox.cx/p
    @hynek
    vrmd.de
