Get Instrumented: How Prometheus Can Unify Your Metrics

Hynek Schlawack

May 31, 2016

Transcript

  1. Hynek Schlawack
    Get Instrumented: How Prometheus Can Unify Your Metrics

  2. Service Level

  3. Service Level Indicator

  4. Service Level Indicator
    Objective

  5. Service Level Indicator
    Objective
    (Agreement)

  6. Metrics
    avg latency 0.3 0.5 0.8 1.1 2.6

  7. Metrics
                12:00  12:01  12:02  12:03  12:04
    avg latency   0.3    0.5    0.8    1.1    2.6

  8. Metrics
                12:00  12:01  12:02  12:03  12:04
    avg latency   0.3    0.5    0.8    1.1    2.6
    server load   0.3    1.0    2.3    3.5    5.2

  9. Metric Types

  10. Metric Types
    ❖counter

  11. Metric Types
    ❖counter
    ❖gauge

  12. Metric Types
    ❖counter
    ❖gauge
    ❖summary

  13. Metric Types
    ❖counter
    ❖gauge
    ❖summary
    ❖histogram

  14. Metric Types
    ❖counter
    ❖gauge
    ❖summary
    ❖histogram
    ❖ buckets (1s, 0.5s, 0.25s, …)

  15. Averages
    ❖ avg(request time) ≠ avg(UX)

  16. Averages
    ❖ avg(request time) ≠ avg(UX)
    ❖ avg({1, 1, 1, 1, 10}) = 2.8

  19. Averages
    ❖ avg(request time) ≠ avg(UX)
    ❖ avg({1, 1, 1, 1, 10}) = 2.8
    ❖ median({1, 1, 1, 1, 10}) = 1

  21. Averages
    ❖ avg(request time) ≠ avg(UX)
    ❖ avg({1, 1, 1, 1, 10}) = 2.8
    ❖ median({1, 1, 1, 1, 10}) = 1
    ❖ median({1, 1, 100_000}) = 1
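
The figures on the Averages slides can be checked with Python's standard library. This is only a quick sketch; the variable name `request_times` is made up for illustration.

```python
from statistics import mean, median

# The sample from the slide: four fast requests and one slow one.
request_times = [1, 1, 1, 1, 10]

average = mean(request_times)        # pulled up by the single slow request
typical = median(request_times)      # what most users actually experienced
outlier_blind = median([1, 1, 100_000])

print(average, typical, outlier_blind)
```

The mean overstates the typical request, while the median hides the outlier entirely, which is why neither number alone describes user experience.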

  22. Percentiles
    The nth percentile P of a data set is the value with P ≥ n% of the values.

  23. 50th percentile = 1 ms

  24. 50th percentile = 1 ms
    50% of requests done by 1 ms

  25. Percentiles
    P {1, 1, 100_000}
    50th 1

  26. Percentiles
    P {1, 1, 100_000}
    50th 1
    95th 90_000
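
One way to arrive at the slide's numbers is linear interpolation between neighbouring ranks. That method is an assumption here (the slides do not say which one was used), and other percentile definitions give different values for such a tiny data set. The helper `percentile` is hypothetical.

```python
import math

def percentile(values, n):
    """Return the nth percentile using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * n / 100      # fractional index into the sorted data
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

latencies = [1, 1, 100_000]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(p50, round(p95))
```

Under this definition the 95th percentile lands at roughly 90_000, so unlike the median it exposes the slow request.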

  27. Naming
    backend1_app_http_reqs_msgs_post
    backend1_app_http_reqs_msgs_get

  28. Naming
    backend1_app_http_reqs_msgs_post
    backend1_app_http_reqs_msgs_get

    app_http_reqs_total

  31. Naming
    backend1_app_http_reqs_msgs_post
    backend1_app_http_reqs_msgs_get

    app_http_reqs_total{meth="POST", path="/msgs", backend="1"}
    app_http_reqs_total{meth="GET", path="/msgs", backend="1"}

    app_http_reqs_total
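
Why labels beat name-encoded dimensions can be shown with a small plain-Python model. It is a sketch only: the request counts are invented, and `format_series` and `total_where` are hypothetical helpers, not part of any Prometheus library.

```python
def format_series(name, labels):
    """Render a metric name plus labels the way the slide writes them."""
    rendered = ", ".join(f'{key}="{value}"' for key, value in labels.items())
    return f"{name}{{{rendered}}}"

# Hypothetical request counts per label combination.
requests = [
    ({"meth": "POST", "path": "/msgs", "backend": "1"}, 12),
    ({"meth": "GET", "path": "/msgs", "backend": "1"}, 30),
    ({"meth": "GET", "path": "/msgs", "backend": "2"}, 8),
]

def total_where(samples, **wanted):
    """Sum all samples whose labels match every wanted key/value pair."""
    return sum(
        value
        for labels, value in samples
        if all(labels.get(key) == val for key, val in wanted.items())
    )

first_series = format_series("app_http_reqs_total", requests[0][0])
all_gets = total_where(requests, meth="GET")        # any backend
backend_one = total_where(requests, backend="1")    # any method
print(first_series, all_gets, backend_one)
```

With one metric name and labels, any slice is a filter; with names like `backend1_app_http_reqs_msgs_get`, every slice needs its own string matching.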

  32. 1. resolution = scraping interval

  33. 1. resolution = scraping interval
    2. missing scrapes = less resolution

  34. Pull: Problems
    ❖ short lived jobs

  35. Pull: Problems
    ❖ short lived jobs
    ❖ target discovery

  36. Configuration
    scrape_configs:
      - job_name: 'prometheus'
        target_groups:
          - targets:
              - 'localhost:9090'

  39. Configuration
    scrape_configs:
      - job_name: 'prometheus'
        target_groups:
          - targets:
              - 'localhost:9090'
    {instance="localhost:9090",job="prometheus"}

  40. Pull: Problems
    ❖ target discovery
    ❖ short lived jobs
    ❖ Heroku/NATed systems

  41. Pull: Advantages

  42. Pull: Advantages
    ❖ multiple Prometheis easy

  43. Pull: Advantages
    ❖ multiple Prometheis easy
    ❖ outage detection

  44. Pull: Advantages
    ❖ multiple Prometheis easy
    ❖ outage detection
    ❖ predictable, no self-DoS

  45. Pull: Advantages
    ❖ multiple Prometheis easy
    ❖ outage detection
    ❖ predictable, no self-DoS
    ❖ easy to instrument 3rd parties

  46. Metrics Format
    # HELP req_seconds Time spent processing a request in seconds.
    # TYPE req_seconds histogram
    req_seconds_count 390.0
    req_seconds_sum 177.0319407
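
Because the format is plain text, the `_sum` and `_count` samples are easy to read back. The parser below is a minimal sketch that assumes label-free `name value` lines only; the real exposition format also allows labels and timestamps, which `parse_samples` (a hypothetical helper) ignores.

```python
def parse_samples(text):
    """Parse 'name value' lines, skipping comments and blank lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples

exposition = """\
# HELP req_seconds Time spent processing a request in seconds.
# TYPE req_seconds histogram
req_seconds_count 390.0
req_seconds_sum 177.0319407
"""

samples = parse_samples(exposition)
# Total time divided by number of requests gives the mean request duration.
average = samples["req_seconds_sum"] / samples["req_seconds_count"]
print(f"{average:.3f} s per request on average")
```

Dividing sum by count yields only the mean; the bucket samples are what make percentiles possible.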

  51. Percentiles
    req_seconds_bucket{le="0.05"} 0.0
    req_seconds_bucket{le="0.25"} 1.0
    req_seconds_bucket{le="0.5"} 273.0
    req_seconds_bucket{le="0.75"} 369.0
    req_seconds_bucket{le="1.0"} 388.0
    req_seconds_bucket{le="2.0"} 390.0
    req_seconds_bucket{le="+Inf"} 390.0

  54. Aggregation
    sum(
      rate(
        req_seconds_count[1m]
      )
    )

  58. Aggregation
    sum(
      rate(
        req_seconds_count{dc="west"}[1m]
      )
    )

  59. Aggregation
    sum(
      rate(
        req_seconds_count[1m]
      )
    ) by (dc)
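
What `sum(rate(...)) by (dc)` computes can be approximated in plain Python. This is a simplified sketch: the series data and the `instance` label are invented, and `simple_rate` ignores the counter-reset handling and extrapolation that PromQL's `rate()` performs.

```python
from collections import defaultdict

def simple_rate(points):
    """Per-second increase between the first and last (timestamp, value) point."""
    (t_first, v_first), (t_last, v_last) = points[0], points[-1]
    return (v_last - v_first) / (t_last - t_first)

# Hypothetical one-minute windows of req_seconds_count, one series per instance.
series = [
    ({"dc": "west", "instance": "a"}, [(0, 100), (30, 130), (60, 160)]),
    ({"dc": "west", "instance": "b"}, [(0, 50), (30, 80), (60, 110)]),
    ({"dc": "east", "instance": "c"}, [(0, 10), (30, 25), (60, 40)]),
]

def sum_rate_by(all_series, label):
    """Add up the per-series rates, grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, points in all_series:
        totals[labels[label]] += simple_rate(points)
    return dict(totals)

per_dc = sum_rate_by(series, "dc")     # like: sum(rate(...)) by (dc)
overall = sum(per_dc.values())         # like: sum(rate(...))
print(per_dc, overall)
```

The rate is taken per series first and summed afterwards, matching the nesting order of the query.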

  60. Percentiles
    histogram_quantile(
      0.9,
      rate(req_seconds_bucket[10m])
    )
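
The estimate behind `histogram_quantile` can be sketched as linear interpolation inside the bucket that contains the requested rank. The version below is a simplification: it works on the raw cumulative counts from the `req_seconds_bucket` slide rather than on `rate()` output, and it assumes 0 < q ≤ 1 and at least one observation.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]                 # the +Inf bucket holds every observation
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound         # cannot interpolate into +Inf
            share = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count

# Cumulative bucket counts taken from the req_seconds_bucket slide.
req_seconds_bucket = [
    (0.05, 0.0), (0.25, 1.0), (0.5, 273.0), (0.75, 369.0),
    (1.0, 388.0), (2.0, 390.0), (float("inf"), 390.0),
]

p90 = histogram_quantile(0.9, req_seconds_bucket)
p50 = histogram_quantile(0.5, req_seconds_bucket)
print(p90, p50)
```

The result is only as precise as the bucket layout: the 90th percentile falls somewhere in the 0.5 s to 0.75 s bucket, and the interpolation assumes requests are spread evenly within it.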

  65. Internal
    ❖ great for ad-hoc

  66. Internal
    ❖ great for ad-hoc
    ❖ 1 expr per graph

  67. Internal
    ❖ great for ad-hoc
    ❖ 1 expr per graph
    ❖ templating

  68. PromDash
    ❖ best integration

  69. PromDash
    ❖ best integration
    ❖ former official

  70. PromDash
    ❖ best integration
    ❖ former official
    ❖ now deprecated
    ❖ don’t bother

  71. Grafana
    ❖ pretty & powerful

  72. Grafana
    ❖ pretty & powerful
    ❖ many integrations

  73. Grafana
    ❖ pretty & powerful
    ❖ many integrations
    ❖ mix and match!

  74. Grafana
    ❖ pretty & powerful
    ❖ many integrations
    ❖ mix and match!
    ❖ use this!

  75. Alerts & Scrying

  76. Alerts & Scrying
    ALERT DiskWillFillIn4Hours
      IF predict_linear(
        node_filesystem_free[1h], 4*3600) < 0
      FOR 5m
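
The idea behind `predict_linear` is a straight-line fit through the samples in the range, extended into the future. The sketch below uses ordinary least squares on invented data and extrapolates from the newest sample, whereas Prometheus extrapolates from the evaluation time; treat it as an illustration, not the actual implementation.

```python
def predict_linear(points, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) points and
    extrapolate it seconds_ahead past the newest timestamp."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    newest = max(t for t, _ in points)
    return intercept + slope * (newest + seconds_ahead)

# Hypothetical free-space samples over one hour: losing 2 bytes per second.
free_bytes = [(t, 10_000 - 2 * t) for t in range(0, 3601, 600)]

forecast = predict_linear(free_bytes, 4 * 3600)
will_fill = forecast < 0      # the alert condition from the slide
print(forecast, will_fill)
```

In the rule itself, `FOR 5m` additionally requires the condition to hold continuously for five minutes before the alert fires, which filters out short blips in the trend.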

  82. Apache
    nginx
    Django
    PostgreSQL
    MySQL
    MongoDB
    CouchDB
    redis
    Varnish
    etcd
    Kubernetes
    Consul
    collectd
    HAProxy
    statsd
    graphite
    InfluxDB
    SNMP

  84. node_exporter

  85. node_exporter
    cAdvisor

  86. System Insight

  87. System Insight
    ❖ load

  88. System Insight
    ❖ load
    ❖ procs

  89. System Insight
    ❖ load
    ❖ procs
    ❖ memory

  90. System Insight
    ❖ load
    ❖ procs
    ❖ memory
    ❖ network

  91. System Insight
    ❖ load
    ❖ procs
    ❖ memory
    ❖ network
    ❖ disk

  92. System Insight
    ❖ load
    ❖ procs
    ❖ memory
    ❖ network
    ❖ disk
    ❖ I/O

  93. mtail
    ❖ follow (log) files

  94. mtail
    ❖ follow (log) files
    ❖ extract metrics using regex

  95. mtail
    ❖ follow (log) files
    ❖ extract metrics using regex
    ❖ can be better than direct
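
The mtail approach of turning log lines into metrics can be imitated in a few lines of Python. This is a stand-in only: mtail has its own program language, the log format here is invented, and the sketch processes a fixed list of lines instead of following a file.

```python
import re
from collections import Counter

# Hypothetical access-log format: method, path, status, duration in seconds.
LINE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3}) (?P<seconds>[\d.]+)$"
)

def extract_metrics(lines):
    """Count requests per status code and add up the time spent serving them."""
    requests_total = Counter()
    seconds_sum = 0.0
    for line in lines:
        match = LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that are not access-log entries
        requests_total[match["status"]] += 1
        seconds_sum += float(match["seconds"])
    return requests_total, seconds_sum

log = [
    "POST /analyze 200 0.31",
    "POST /analyze 500 1.20",
    "worker restarted",
    "POST /analyze 200 0.27",
]
requests_total, seconds_sum = extract_metrics(log)
print(dict(requests_total), seconds_sum)
```

The application itself stays untouched, which is the appeal when the code cannot or should not be instrumented directly.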

  96. Moar
    ❖ Edges: web servers/HAProxy

  97. Moar
    ❖ Edges: web servers/HAProxy
    ❖ black box

  98. Moar
    ❖ Edges: web servers/HAProxy
    ❖ black box
    ❖ databases

  99. Moar
    ❖ Edges: web servers/HAProxy
    ❖ black box
    ❖ databases
    ❖ network

  100. So Far
    ❖ system stats

  101. So Far
    ❖ system stats
    ❖ outside look

  102. So Far
    ❖ system stats
    ❖ outside look
    ❖ 3rd party components

  103. cat-or.not
    ❖ HTTP service

  104. cat-or.not
    ❖ HTTP service
    ❖ upload picture

  105. cat-or.not
    ❖ HTTP service
    ❖ upload picture
    ❖ meow!/nope
    meow!

  106. from flask import Flask, g, request
    from cat_or_not import is_cat
    app = Flask(__name__)
    @app.route("/analyze", methods=["POST"])
    def analyze():
        g.auth.check(request)
        return ("meow!"
                if is_cat(request.files["pic"])
                else "nope!")
    if __name__ == "__main__":
        app.run()

  109. pip install prometheus_client

  110. from prometheus_client import \
        start_http_server
    # …
    if __name__ == "__main__":
        start_http_server(8000)
        app.run()

  111. process_virtual_memory_bytes 156393472.0
    process_resident_memory_bytes 20480000.0
    process_start_time_seconds 1460214325.21
    process_cpu_seconds_total 0.169999999998
    process_open_fds 8.0
    process_max_fds 1024.0

  117. from prometheus_client import \
        Histogram, Gauge
    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")

  118. from prometheus_client import \
        Histogram, Gauge
    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")
    ANALYZE_TIME = Histogram(
        "cat_or_not_analyze_seconds",
        "Time spent analyzing pictures.")

  119. from prometheus_client import \
        Histogram, Gauge
    REQUEST_TIME = Histogram(
        "cat_or_not_request_seconds",
        "Time spent in HTTP requests.")
    ANALYZE_TIME = Histogram(
        "cat_or_not_analyze_seconds",
        "Time spent analyzing pictures.")
    IN_PROGRESS = Gauge(
        "cat_or_not_in_progress_requests",
        "Number of requests in progress.")

  120. @app.route("/analyze", methods=["POST"])
    # route() goes on top so Flask registers the instrumented function.
    @IN_PROGRESS.track_inprogress()
    @REQUEST_TIME.time()
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"

  122. from prometheus_client import Counter, Histogram
    AUTH_TIME = Histogram("auth_seconds",
                          "Time spent authenticating.")
    AUTH_ERRS = Counter("auth_errors_total",
                        "Errors while authing.")
    AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total",
                               "Wrong credentials.")
    class Auth:
        # ...
        @AUTH_TIME.time()
        def auth(self, request):
            # Retry until _auth() succeeds or the credentials are rejected.
            while True:
                try:
                    return self._auth(request)
                except WrongCredsError:
                    AUTH_WRONG_CREDS.inc()
                    raise
                except Exception:
                    AUTH_ERRS.inc()

  126. @app.route("/analyze", methods=["POST"])
    def analyze():
        g.auth.check(request)
        with ANALYZE_TIME.time():
            result = is_cat(
                request.files["pic"].stream)
        return "meow!" if result else "nope!"
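
What a timing block like `with ANALYZE_TIME.time():` records can be illustrated with a tiny stand-in class. `SketchHistogram` is hypothetical and much simpler than the `prometheus_client` `Histogram`; it only shows the bookkeeping of cumulative buckets, sum and count.

```python
from contextlib import contextmanager
from time import perf_counter

class SketchHistogram:
    """Tiny stand-in for a histogram metric: cumulative buckets, sum, count."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [float("inf")]
        self.bucket_counts = {bound: 0 for bound in self.bounds}
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        self.sum += value
        self.count += 1
        # Buckets are cumulative: a value counts toward every bound it fits under.
        for bound in self.bounds:
            if value <= bound:
                self.bucket_counts[bound] += 1

    @contextmanager
    def time(self):
        """Observe how long the enclosed block took, even if it raises."""
        start = perf_counter()
        try:
            yield
        finally:
            self.observe(perf_counter() - start)

analyze_time = SketchHistogram([0.25, 0.5, 1.0])
for seconds in (0.1, 0.3, 0.6):
    analyze_time.observe(seconds)
with analyze_time.time():
    sum(range(1000))          # stands in for is_cat(...)
print(analyze_time.count, analyze_time.bucket_counts)
```

Each observation costs a few increments, and the exported `_bucket`, `_sum` and `_count` values are all that the server needs for rates, averages and percentile estimates.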

  127. pip install prometheus_async

  128. Wrapper
    from prometheus_async.aio import time
    @time(REQUEST_TIME)
    async def view(request):
        # ...

  129. Goodies
    ❖ aiohttp-based metrics export

  130. Goodies
    ❖ aiohttp-based metrics export
    ❖ also in thread!

  131. Goodies
    ❖ aiohttp-based metrics export
    ❖ also in thread!
    ❖ Consul Agent integration

  132. Wrap Up
    ✓ ✓

  134. ox.cx/p
    @hynek
    vrmd.de
