Hynek Schlawack
May 31, 2016

# Get Instrumented: How Prometheus Can Unify Your Metrics

## Transcript

1. Hynek Schlawack
Get Instrumented: How Prometheus Can Unify Your Metrics

2. Goals

7. Service Level

8. Service Level Indicator

9. Service Level Indicator
Objective

10. Service Level Indicator
Objective
(Agreement)

11. Metrics

12. Metrics
avg latency 0.3 0.5 0.8 1.1 2.6

13. Metrics
12:00 12:01 12:02 12:03 12:04
avg latency 0.3 0.5 0.8 1.1 2.6

14. Metrics
12:00 12:01 12:02 12:03 12:04
avg latency 0.3 0.5 0.8 1.1 2.6
server load 0.3 1.0 2.3 3.5 5.2

15. Instrument

20. Metric Types

21. Metric Types
❖counter

22. Metric Types
❖counter
❖gauge

23. Metric Types
❖counter
❖gauge
❖summary

24. Metric Types
❖counter
❖gauge
❖summary
❖histogram

25. Metric Types
❖counter
❖gauge
❖summary
❖histogram
❖ buckets (1s, 0.5s, 0.25s, …)
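
The types differ mainly in which operations they allow. The following is a minimal pure-Python sketch of those semantics, not the `prometheus_client` API: the class names `TinyCounter`, `TinyGauge`, and `TinyHistogram` are hypothetical, the bucket bounds are borrowed from the slide, and summaries are left out.

```python
import bisect
import math


class TinyCounter:
    """Only ever goes up (e.g. requests served)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class TinyGauge:
    """Can go up and down (e.g. requests in progress)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class TinyHistogram:
    """Counts observations in buckets keyed by their upper bound."""

    def __init__(self, bounds=(0.25, 0.5, 1.0)):
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per bucket, not yet cumulative
        self.total = 0.0

    def observe(self, value):
        # Pick the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return {upper_bound: observations <= upper_bound}."""
        running, out = 0, {}
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out[bound] = running
        return out


hist = TinyHistogram()
for seconds in (0.1, 0.3, 0.3, 0.7, 4.0):
    hist.observe(seconds)
print(hist.cumulative())
```

Because the buckets are cumulative, the `+Inf` bucket always equals the total number of observations.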

26. Averages

27. Averages
❖ avg(request time) ≠ avg(UX)

28. Averages
❖ avg(request time) ≠ avg(UX)
❖ avg({1, 1, 1, 1, 10}) = 2.8

31. Averages
❖ avg(request time) ≠ avg(UX)
❖ avg({1, 1, 1, 1, 10}) = 2.8
❖ median({1, 1, 1, 1, 10}) = 1

33. Averages
❖ avg(request time) ≠ avg(UX)
❖ avg({1, 1, 1, 1, 10}) = 2.8
❖ median({1, 1, 1, 1, 10}) = 1
❖ median({1, 1, 100_000}) = 1

34. Percentiles

35. Percentiles
nth percentile of a data set = the value P that is ≥ n% of the values

36. 50th percentile = 1 ms

37. 50th percentile = 1 ms
50% of requests done by 1 ms

38. Percentiles

39. Percentiles
P {1, 1, 100_000}
50th 1

40. Percentiles
P {1, 1, 100_000}
50th 1
95th 90_000
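
The numbers on the Averages and Percentiles slides can be reproduced with a few lines of Python. This sketch assumes linear interpolation between the closest ranks, which is one common percentile definition among several; the helper name `percentile` is made up for the example.

```python
import statistics


def percentile(values, n):
    """n-th percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * n / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


requests = [1, 1, 1, 1, 10]
outlier = [1, 1, 100_000]

print(statistics.mean(requests))       # 2.8
print(statistics.median(requests))     # 1
print(statistics.median(outlier))      # 1
print(percentile(outlier, 50))         # 1.0
print(round(percentile(outlier, 95)))  # 90000
```

The median hides the outlier completely, while a high percentile exposes it.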

41. Naming

42. Naming
backend1_app_http_reqs_msgs_post
backend1_app_http_reqs_msgs_get

43. Naming
backend1_app_http_reqs_msgs_post
backend1_app_http_reqs_msgs_get

app_http_reqs_total

46. Naming
backend1_app_http_reqs_msgs_post
backend1_app_http_reqs_msgs_get

app_http_reqs_total{meth="POST", path="/msgs", backend="1"}
app_http_reqs_total{meth="GET", path="/msgs", backend="1"}

app_http_reqs_total
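
Moving method, path, and backend out of the metric name and into labels is what makes slicing possible later. A small sketch of the idea in plain Python, with made-up sample values and hypothetical helpers `select` and `total`:

```python
# One metric name (app_http_reqs_total), many label sets.
# The values are invented for illustration.
samples = [
    ({"meth": "POST", "path": "/msgs", "backend": "1"}, 12.0),
    ({"meth": "GET", "path": "/msgs", "backend": "1"}, 30.0),
    ({"meth": "GET", "path": "/msgs", "backend": "2"}, 8.0),
]


def select(samples, **matchers):
    """Keep samples whose labels equal every given matcher."""
    return [
        (labels, value)
        for labels, value in samples
        if all(labels.get(key) == wanted for key, wanted in matchers.items())
    ]


def total(samples):
    """Sum the values of all given samples."""
    return sum(value for _, value in samples)


print(total(samples))                       # all requests: 50.0
print(total(select(samples, meth="GET")))   # GET only: 38.0
print(total(select(samples, backend="1")))  # one backend: 42.0
```

With names like `backend1_app_http_reqs_msgs_get`, each of these questions would need its own hand-maintained list of metric names.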

47. 1. resolution = scraping interval

48. 1. resolution = scraping interval
2. missing scrapes = less resolution

49. Pull: Problems
❖ short lived jobs

50. Pull: Problems
❖ short lived jobs
❖ target discovery

51. Configuration
scrape_configs:
  - job_name: 'prometheus'
    target_groups:
      - targets:
          - 'localhost:9090'

54. Configuration
scrape_configs:
  - job_name: 'prometheus'
    target_groups:
      - targets:
          - 'localhost:9090'

{instance="localhost:9090",job="prometheus"}
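
How the configuration turns into the label set on the slide can be sketched as follows. This is a simplification that assumes no relabeling is configured; the config is represented as a plain dict and `target_labels` is a hypothetical helper, not part of Prometheus.

```python
# The scrape configuration from the slide, as a Python dict.
config = {
    "scrape_configs": [
        {
            "job_name": "prometheus",
            "target_groups": [{"targets": ["localhost:9090"]}],
        }
    ]
}


def target_labels(config):
    """Yield the labels attached to every series scraped from each target."""
    for job in config["scrape_configs"]:
        for group in job["target_groups"]:
            for target in group["targets"]:
                yield {"instance": target, "job": job["job_name"]}


print(list(target_labels(config)))
```

Every sample scraped from that target carries these two labels in addition to the ones the application sets itself.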

55. Pull: Problems
❖ target discovery
❖ short lived jobs
❖ Heroku/NATed systems

❖ multiple Prometheis easy

❖ multiple Prometheis easy
❖ outage detection

❖ multiple Prometheis easy
❖ outage detection
❖ predictable, no self-DoS

❖ multiple Prometheis easy
❖ outage detection
❖ predictable, no self-DoS
❖ easy to instrument 3rd parties

61. Metrics Format
# HELP req_seconds Time spent processing a request in seconds.
# TYPE req_seconds histogram
req_seconds_count 390.0
req_seconds_sum 177.0319407

66. Percentiles
req_seconds_bucket{le="0.05"} 0.0
req_seconds_bucket{le="0.25"} 1.0
req_seconds_bucket{le="0.5"} 273.0
req_seconds_bucket{le="0.75"} 369.0
req_seconds_bucket{le="1.0"} 388.0
req_seconds_bucket{le="2.0"} 390.0
req_seconds_bucket{le="+Inf"} 390.0
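
The text format is simple enough to read with a short script. The parser below is a rough sketch that only covers the kinds of lines shown on the slides (no timestamps, no escaped quotes in label values); the regular expressions and the function name `parse` are assumptions made for the example.

```python
import re

# metric_name, optional {labels}, whitespace, value
SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>.*)\})?"
    r"\s+(?P<value>\S+)$"
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return (name, labels, value) tuples; comment lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match is None:
            raise ValueError(f"unparseable line: {line!r}")
        labels = dict(LABEL.findall(match["labels"] or ""))
        samples.append((match["name"], labels, float(match["value"])))
    return samples


text = """\
# TYPE req_seconds histogram
req_seconds_bucket{le="0.5"} 273.0
req_seconds_bucket{le="+Inf"} 390.0
req_seconds_count 390.0
req_seconds_sum 177.0319407
"""
for sample in parse(text):
    print(sample)
```

Note that the histogram is exposed as several plain series: one per bucket, plus `_count` and `_sum`.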

69. Aggregation

70. Aggregation
sum(
  rate(
    req_seconds_count[1m]
  )
)

74. Aggregation
sum(
  rate(
    req_seconds_count{dc="west"}[1m]
  )
)

75. Aggregation
sum(
  rate(
    req_seconds_count[1m]
  )
) by (dc)
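
What `rate(...)` followed by `sum(...) by (dc)` computes can be approximated in plain Python. This is a simplified sketch: it handles counter resets but, unlike Prometheus, does not extrapolate to the edges of the time window. The sample data, the `instance` label values, and the helper names `simple_rate` and `sum_by` are invented for illustration.

```python
from collections import defaultdict


def simple_rate(points):
    """Per-second increase of a counter over (timestamp, value) points."""
    increase = 0.0
    for (_, previous), (_, current) in zip(points, points[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # A drop means the process restarted and the counter began at 0.
            increase += current
    return increase / (points[-1][0] - points[0][0])


def sum_by(rates, label):
    """Add up per-series rates that share the same value for `label`."""
    totals = defaultdict(float)
    for labels, rate in rates:
        totals[labels[label]] += rate
    return dict(totals)


# Made-up one-minute windows of req_seconds_count for three instances.
series = [
    ({"dc": "west", "instance": "a"}, [(0, 100), (30, 130), (60, 160)]),
    ({"dc": "west", "instance": "b"}, [(0, 50), (30, 80), (60, 20)]),
    ({"dc": "east", "instance": "c"}, [(0, 10), (30, 25), (60, 40)]),
]
rates = [(labels, simple_rate(points)) for labels, points in series]
print(sum_by(rates, "dc"))
```

Taking the rate per series first and summing afterwards matters: summing raw counters first would turn a restart of one instance into a misleading dip in the total.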

76. Percentiles
histogram_quantile(
  0.9, rate(
    req_seconds_bucket[10m]
))
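
The estimate behind `histogram_quantile` can be sketched as linear interpolation inside the bucket that contains the requested rank. The function below is an approximation of that idea, not the Prometheus source. For simplicity it works on the raw cumulative counts from the earlier bucket slide instead of on `rate(...)` output, and the name `quantile_from_buckets` is made up.

```python
import math


def quantile_from_buckets(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) pairs.

    Assumes observations are spread evenly inside each bucket.
    """
    buckets = sorted(buckets)
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf; report the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    raise ValueError("q must be between 0 and 1")


# Cumulative bucket counts as shown on the req_seconds_bucket slide.
buckets = [
    (0.05, 0.0), (0.25, 1.0), (0.5, 273.0), (0.75, 369.0),
    (1.0, 388.0), (2.0, 390.0), (math.inf, 390.0),
]
print(quantile_from_buckets(0.9, buckets))  # about 0.703
```

The result can only be as precise as the bucket layout: all that is really known here is that the 90th percentile lies between 0.5 s and 0.75 s.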

81. Internal

83. Internal
❖ 1 expr per graph

84. Internal
❖ 1 expr per graph
❖ templating

85. PromDash

86. PromDash
❖ best integration

87. PromDash
❖ best integration
❖ former official

88. PromDash
❖ best integration
❖ former official
❖ now deprecated
❖ don’t bother

89. Grafana

90. Grafana
❖ pretty & powerful

91. Grafana
❖ pretty & powerful
❖ many integrations

92. Grafana
❖ pretty & powerful
❖ many integrations
❖ mix and match!

93. Grafana
❖ pretty & powerful
❖ many integrations
❖ mix and match!
❖ use this!

IF predict_linear(
node_filesystem_free[1h], 4*3600) < 0
FOR 5m
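
The alert condition reads: fit a line through the last hour of free-space samples, extend it four hours into the future, and fire if the result is below zero for five minutes. A minimal sketch of that extrapolation using ordinary least squares follows. The sample data is made up, and this version extrapolates from the last sample, which is a simplification.

```python
def predict_linear(points, seconds_ahead):
    """Fit a least-squares line through (timestamp, value) points and
    extrapolate it `seconds_ahead` past the last timestamp."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in points)
    variance = sum((t - mean_t) ** 2 for t, _ in points)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (points[-1][0] + seconds_ahead)


# Made-up data: free space drops by 1 GB every 10 minutes for one hour.
GB = 10**9
points = [(minute * 60, (20 - minute // 10) * GB) for minute in range(0, 61, 10)]

predicted = predict_linear(points, 4 * 3600)
print(predicted < 0)  # True: the disk is on course to fill up within 4 hours
```

Alerting on the trend instead of a fixed threshold gives a warning while there is still time to act, regardless of disk size.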

101. Environment

102. Apache
nginx
Django
PostgreSQL
MySQL
MongoDB
CouchDB
redis
Varnish
etcd
Kubernetes
Consul
collectd
HAProxy
statsd
graphite
InfluxDB
SNMP

104. node_exporter

106. System Insight

108. System Insight
❖ procs

109. System Insight
❖ procs
❖ memory

110. System Insight
❖ procs
❖ memory
❖ network

111. System Insight
❖ procs
❖ memory
❖ network
❖ disk

112. System Insight
❖ procs
❖ memory
❖ network
❖ disk
❖ I/O

113. mtail

115. mtail
❖ extract metrics using regex

116. mtail
❖ extract metrics using regex
❖ can be better than direct instrumentation

117. Moar

118. Moar
❖ Edges: web servers/HAProxy

119. Moar
❖ Edges: web servers/HAProxy
❖ black box

120. Moar
❖ Edges: web servers/HAProxy
❖ black box
❖ databases

121. Moar
❖ Edges: web servers/HAProxy
❖ black box
❖ databases
❖ network

122. So Far

123. So Far
❖ system stats

124. So Far
❖ system stats
❖ outside look

125. So Far
❖ system stats
❖ outside look
❖ 3rd party components

126. Code

127. cat-or.not

128. cat-or.not
❖ HTTP service

130. cat-or.not
❖ HTTP service
❖ meow!/nope
meow!

from cat_or_not import is_cat

@app.route("/analyze", methods=["POST"])
def analyze():
    g.auth.check(request)
    return ("meow!"
            if is_cat(request.files["pic"])
            else "nope!")

if __name__ == "__main__":
    app.run()

134. pip install prometheus_client

135. from prometheus_client import start_http_server
# …
if __name__ == "__main__":
    start_http_server(8000)
    app.run()

136. process_virtual_memory_bytes 156393472.0
process_resident_memory_bytes 20480000.0
process_start_time_seconds 1460214325.21
process_cpu_seconds_total 0.169999999998
process_open_fds 8.0
process_max_fds 1024.0

144. from prometheus_client import Histogram, Gauge

REQUEST_TIME = Histogram(
    "cat_or_not_request_seconds",
    "Time spent in HTTP requests.")
ANALYZE_TIME = Histogram(
    "cat_or_not_analyze_seconds",
    "Time spent analyzing pictures.")
IN_PROGRESS = Gauge(
    "cat_or_not_in_progress_requests",
    "Number of requests in progress.")

145. # app.route goes on top so that the registered view is the instrumented one
@app.route("/analyze", methods=["POST"])
@IN_PROGRESS.track_inprogress()
@REQUEST_TIME.time()
def analyze():
    g.auth.check(request)
    with ANALYZE_TIME.time():
        result = is_cat(
            request.files["pic"].stream)
    return "meow!" if result else "nope!"
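
The `.time()` helpers work both as decorators and as context managers. What such a helper does can be sketched with the standard library alone. `Timer` below is a hypothetical stand-in that reports durations to a callback; it is not how `prometheus_client` is implemented, and it is neither reentrant nor thread-safe.

```python
import time
from contextlib import ContextDecorator


class Timer(ContextDecorator):
    """Reports how long the wrapped block or function took."""

    def __init__(self, observe):
        self._observe = observe  # callback receiving the duration in seconds

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        # Runs on success and on exceptions, so failures are timed too.
        self._observe(time.perf_counter() - self._start)
        return False  # never swallow exceptions


durations = []


@Timer(durations.append)
def analyze():
    time.sleep(0.01)
    return "meow!"


analyze()                       # used as a decorator
with Timer(durations.append):   # used as a context manager
    time.sleep(0.01)
print(len(durations))
```

In the real client the callback would be the histogram's `observe` method, so each duration lands in the matching bucket.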

147. from prometheus_client import Counter, Histogram

AUTH_TIME = Histogram("auth_seconds",
    "Time spent authenticating.")
AUTH_ERRS = Counter("auth_errors_total",
    "Errors while authing.")
AUTH_WRONG_CREDS = Counter("auth_wrong_creds_total",
    "Wrong credentials.")

class Auth:
    # ...
    @AUTH_TIME.time()
    def auth(self, request):
        # Retry on unexpected errors; give up only on wrong credentials.
        while True:
            try:
                return self._auth(request)
            except WrongCredsError:
                AUTH_WRONG_CREDS.inc()
                raise
            except Exception:
                AUTH_ERRS.inc()

151. @app.route("/analyze", methods=["POST"])
def analyze():
    g.auth.check(request)
    with ANALYZE_TIME.time():
        result = is_cat(
            request.files["pic"].stream)
    return "meow!" if result else "nope!"

152. pip install prometheus_async

153. Wrapper
from prometheus_async.aio import time

@time(REQUEST_TIME)
async def view(request):
    # ...
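
A separate wrapper is needed for coroutines because calling an `async def` function returns immediately with a coroutine object; the work happens only when it is awaited. The sketch below shows the idea with `asyncio` only. `time_async` is a hypothetical helper that reports to a callback and is not the implementation used by `prometheus_async`.

```python
import asyncio
import functools
import time


def time_async(observe):
    """Decorator: report the awaited duration of a coroutine to `observe`."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                # Measure across the await, not just the call.
                return await func(*args, **kwargs)
            finally:
                observe(time.perf_counter() - start)

        return wrapper

    return decorator


durations = []


@time_async(durations.append)
async def view(request):
    await asyncio.sleep(0.01)
    return "meow!"


print(asyncio.run(view(None)), len(durations))
```

The `finally` clause ensures that requests ending in an exception are timed as well.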

154. Goodies

155. Goodies
❖ aiohttp-based metrics export

157. Goodies
❖ aiohttp-based metrics export
❖ Consul Agent integration

158. Wrap Up

161. Wrap Up
✓ ✓

163. ox.cx/p
@hynek
vrmd.de