Prometheus - A Whirlwind Tour

A presentation on Prometheus at OSCON 2017.

Cindy Sridharan

May 10, 2017

Transcript

  1. Prometheus A Whirlwind Tour Cindy Sridharan OSCON 2017 Austin, Texas

  2. @copyconstruct @copyconstruct @copyconstruct

  3. The Future?

  4. None
  5. None
  6. None
  7. OBSERVABILITY > TESTING

  8. Things testing cannot detect

  9. elasticity of the production environment

  10. unpredictability of inputs

  11. the vagaries of upstream and downstream dependencies

  12. Cloud native architectures need best in class observability

  13. None
  14. We cannot understand software unless we observe it

  15. Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away! - Bryan Cantrill

  16. OBSERVABILITY must also be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away!

  17. OBSERVABILITY cannot be an afterthought

  18. Instrumentation should be a requirement for a PR to be merged

  19. OBSERVABILITY needs to be a part of system design and development

  20. But … what even is “observability” ?

  21. There are three pillars that make up a modern Observability stack

  22. Logging Tracing Metrics

  23. All three are examples of whitebox “monitoring”

  24. WHITEBOX: Observability data gathered from the internals of the target system. Capable of providing warning about a problem before it occurs. BLACKBOX: Observes external functionality as seen by an end user of the system. Helps detect when a problem is ongoing and contributing to external symptoms.
  25. None
  26. Blackbox methods test your Service Level Objectives

  27. None
  28. Whitebox methods monitor your Service Level Agreements

  29. None
  30. Different systems have different blackbox monitoring and whitebox instrumentation requirements, given their agreed-upon SLO and SLA

  31. Where does Prometheus fit in here?

  32. None
  33. None
  34. Prometheus

  35. Whitebox monitoring toolkit and a TSDB for metrics

  36. Monitoring Toolkit

  37. Client Instrumentation, Metrics Ingestion, Metrics Processing and Storage, Querying and Visualization, Analysis, Alerting

  38. Client instrumentation

  39. What even is a “metric”?

  40. A set of numbers that give information about a particular process or activity

  41. Metrics are usually measured over intervals of time — in other words, a time series
  42. None
  43. What metrics to collect?

  44. The Four Golden Signals Proposed by the SRE book

  45. Latency, Traffic, Errors, Saturation (proposed by the SRE book)

  46. USE method by Brendan Gregg

  47. Utilization: average time the resource is busy servicing work. Saturation: degree to which the resource has extra work which it can't service, often queued. Errors: count of error events. - Brendan Gregg

  48. RED method by Tom Wilkie

  49. How busy is my service? Request rate. Are there any errors in my service? Error rate. What is the latency in my service? Duration of requests. - Tom Wilkie
  50. None
  51. Prometheus has stateful client libraries in all major languages

  52. Server is agnostic to the type of metric

  53. The Prometheus client libraries support four types of metrics

  54. Counters, Gauges, Histograms, Summaries
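
These four types map directly onto the official client libraries. A minimal sketch using the Python client (prometheus_client); the metric names here are illustrative, not from the talk:

    from prometheus_client import Counter, Gauge, Histogram, Summary

    # Counter: a value that only ever goes up (e.g. requests served)
    REQUESTS = Counter('app_requests_total', 'Total requests handled')

    # Gauge: a value that can go up and down (e.g. current queue depth)
    QUEUE_DEPTH = Gauge('app_queue_depth', 'Current queue depth')

    # Histogram: observations counted into buckets (e.g. request latency)
    LATENCY = Histogram('app_request_latency_seconds', 'Request latency in seconds')

    # Summary: observations tracked as a running count and sum
    PAYLOAD = Summary('app_payload_bytes', 'Payload size of handled requests')

    REQUESTS.inc()
    QUEUE_DEPTH.set(42)
    LATENCY.observe(0.23)
    PAYLOAD.observe(512)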

  55. “Target” discovery happens via service discovery

  56. None
  57. Metrics ingestion

  58. None
  59. Pull over HTTP

  60. Does Pull scale?

  61. Prometheus isn’t an event-based system, nor is it like Nagios, which spawns a subprocess for each check while “pulling”

  62. Pull lowers risk of DDoSing your monitoring system

  63. Pull-based systems monitor whether a service is down (if a scrape fails) as a part of gathering metrics
  64. None
  65. None
  66. With statsd-type systems, the application sends a UDP message for every event it observes

  67. Monitoring traffic increases proportionally to user traffic, or whatever traffic is generating monitoring data

  68. Prometheus clients aggregate metrics in memory, which the Prometheus server scrapes at regular intervals
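
A rough sketch of that model with the Python client: values are kept in process memory and exposed over HTTP for the server to pull (the port and metric name are assumptions for illustration):

    import random
    import time

    from prometheus_client import Counter, start_http_server

    EVENTS = Counter('app_events_total', 'Events observed by this process')

    # Expose the in-process registry at http://localhost:8000/metrics;
    # the Prometheus server scrapes that endpoint on its own schedule.
    start_http_server(8000)

    while True:
        EVENTS.inc()                 # aggregation happens in process memory
        time.sleep(random.random())  # no per-event network traffic, unlike statsd
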
  69. None
  70. If you want to push, there’s a PUSHGATEWAY for short-lived jobs
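
For a short-lived job, the flow with the Python client might look roughly like this (the gateway address and job name are made-up placeholders):

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    registry = CollectorRegistry()
    last_success = Gauge('batch_job_last_success_unixtime',
                         'Last time the batch job finished successfully',
                         registry=registry)
    last_success.set_to_current_time()

    # The short-lived job pushes its metrics once before exiting; the
    # Prometheus server then scrapes the Pushgateway like any other target.
    push_to_gateway('pushgateway.example.com:9091', job='nightly_batch',
                    registry=registry)
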
  71. EXPORTERS

  72. Exporters expose existing metrics from third-party systems as Prometheus metrics.

  73. JMX, SNMP, HAProxy, MySQL, Blackbox, cAdvisor, Node (system metrics)
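
When no off-the-shelf exporter exists, one common approach is a small custom collector that translates a third-party system's stats into Prometheus metric families on each scrape. A hedged sketch with the Python client; the queue-depth source and metric name are invented for illustration:

    import time

    from prometheus_client import start_http_server
    from prometheus_client.core import GaugeMetricFamily, REGISTRY

    class QueueExporter:
        """Translates stats from an external system into Prometheus metrics."""

        def collect(self):
            depth = 123  # in reality: query the third-party system here
            yield GaugeMetricFamily('thirdparty_queue_depth',
                                    'Queue depth reported by the external system',
                                    value=depth)

    REGISTRY.register(QueueExporter())
    start_http_server(9101)  # becomes a scrape target for the Prometheus server

    while True:
        time.sleep(60)  # keep the process alive so it can be scraped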

  74. STORAGE

  75. Single node, no clustering

  76. For HA, run 2 identical Prometheus servers

  77. None
  78. In Prometheus, a time series has an ID and a sample
  79. None
  80. An ID is a combination of both the metric name and its associated labels

  81. A sample is a combination of a millisecond-precision timestamp and a float64 value
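
Purely as a mental model (not a depiction of Prometheus internals), the ID and samples of one series could be pictured like this in Python terms:

    # The series ID: the metric name plus its label set.
    series_id = {
        '__name__': 'api_http_requests_total',
        'method': 'post',
        'endpoint': '/authenticate',
    }

    # Samples: (millisecond-precision timestamp, float64 value) pairs.
    samples = [
        (1494374400000, 1027.0),
        (1494374415000, 1043.0),
    ]
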
  82. Requirements of *any* TSDB? Effective queries and effective writes

  83. Write optimized. Requires parallel queries and aggregation for diverse query patterns during read time
  84. None
  85. None
  86. None
  87. None
  88. Write pattern is horizontal: a TSDB ingests potentially several time series from every target at specific intervals of time
  89. None
  90. None
  91. None
  92. None
  93. Reads are random: we read not entire rows or columns but sparse matrices

  94. Read optimized: write data in such a way that it is closely aligned for reads
  95. None
  96. None
  97. The time series are stored in a one-file-per-time-series format on disk
  98. None
  99. Incoming time series are stored in chunks in memory. Chunks are flushed to disk when they are full
  100. None
  101. Incomplete chunks are checkpointed to disk so as to be able to recover after a crash
  102. None
  103. All data required to evaluate a PromQL expression needs to be in memory. This data is also cached aggressively for future queries.
  104. None
  105. None
  106. None
  107. None
  108. Prometheus supports two types of rules, which may be configured and then evaluated at regular intervals: Recording rules and Alerting rules.

  109. The same chunk eviction policy applies while evaluating Alerting and Recording rules

  110. RECORDING RULES: Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series

  111. RECORDING RULES: Querying the precomputed result will then often be much faster than executing the original expression every time it is needed

  112. RECORDING RULES: Come in handy while creating dashboards, where the same expression is evaluated every time a dashboard is refreshed

  113. ALERTING RULES: Allow defining alert conditions based on PromQL expressions and sending notifications about firing alerts to an external service.

  114. Drawbacks of V2 storage

  115. Single file per time series

  116. High resource utilization because of time-series churn

  117. Checkpointing to disk can take longer than acceptable

  118. Deletion of stale time-series is prohibitively expensive

  119. SQOF, a.k.a. Single Query of Failure

  120. None
  121. None
  122. None
  123. None
  124. None
  125. None
  126. None
  127. FEDERATION

  128. Federation allows a Prometheus server to scrape selected time series from another Prometheus server
  129. None
  130. CROSS-SERVICE FEDERATION

  131. A Prometheus server of one service is configured to scrape selected data from another service's Prometheus server to enable alerting and queries against both datasets within a single server
  132. None
  133. HIERARCHICAL FEDERATION

  134. The federation topology resembles a tree, with higher-level Prometheus servers collecting aggregated time series data from a larger number of subordinated servers
  135. None
  136. REMOTE STORAGE

  137. None
  138. None
  139. None
  140. Weave Cortex (DynamoDB + S3), Chronix (Solr), Vulcan (Kafka + Cassandra)

  141. VISUALIZATION

  142. None
  143. ANALYSIS

  144. PromQL is one of the defining features of Prometheus

  145. Labels > Hierarchy

  146. stats.timers.accounts.ios.http.post.authenticate.response_time.upper_95

  147. { resource=accounts, method=post, protocol=http, user_agent=ios, endpoint=/authenticate, name=response_time, }

  148. Better exploration because of dimensional queries
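
On the instrumentation side, those dimensions are just labels declared on the metric. A small sketch with the Python client; the metric and label names mirror the example above and are illustrative:

    from prometheus_client import Counter

    HTTP_REQUESTS = Counter(
        'api_http_requests_total', 'HTTP requests handled',
        ['resource', 'method', 'protocol', 'user_agent', 'endpoint'])

    # One metric name, many dimensions; no per-combination metric names needed.
    HTTP_REQUESTS.labels(resource='accounts', method='post', protocol='http',
                         user_agent='ios', endpoint='/authenticate').inc()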

  149. PromQL: rate(api_http_requests_total[5m])   SQL: SELECT job, instance, method, status, path, rate(value, 5m) FROM api_http_requests_total
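
The same kind of expression can also be run programmatically against the server's HTTP query API; a rough sketch using Python and the requests library (the server address is an assumption):

    import requests

    resp = requests.get('http://localhost:9090/api/v1/query',
                        params={'query': 'rate(api_http_requests_total[5m])'})

    # Each result carries the label set ('metric') and the latest value.
    for series in resp.json()['data']['result']:
        print(series['metric'], series['value'])
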
  150. ALERTING

  151. No automatic anomaly detection

  152. ALERT <alert name> IF <expression> [ FOR <duration> ] [ LABELS <label set> ] [ ANNOTATIONS <label set> ]
  153. None
  154. ALERT ConsulRaftPeersLow IF consul_raft_peers < 5 FOR 1m LABELS {severity="page", team="infra"} ANNOTATIONS {description="consul raft peer count low: {{$value}}", summary="consul raft peer count low: {{$value}}"}

  155. ALERT QueueCritical IF sum(broker_q{svc_pref="prod"}) > 5000 FOR 10m LABELS {severity="page", team="product"} ANNOTATIONS {description="service: {{$labels.service}} instance: {{$labels.instance}} queue length: {{$value}} for too long", summary="service: {{$labels.service}} instance: {{$labels.instance}} queue length: {{$value}} for too long"}

  156. ALERTMANAGER

  157. Deduplication, Grouping, Routing, and Suppression of Alerts

  158. None
  159. CASE STUDY

  160. None
  161. None
  162. None
  163. 24 employees, 8 engineers

  164. Requirements for a monitoring system?

  165. Ease of Use

  166. Ease of Operation

  167. Cost Effective!

  168. None
  169. None
  170. Cost Effective “at scale”

  171. Scale?

  172. imgix

  173. imgix

  174. imgix: our last outage, when we were both shedding load and serving up errors
  175. None
  176. CONCLUSION

  177. None
  178. None
  179. Our stack is C, Lua, Go, Python

  180. Fantastic official Go and Python clients

  181. Custom LuaJIT client for counters, gauges and histograms

  182. None
  183. None
  184. Single statically linked Go binary

  185. No clustering. No dependency on Zookeeper et al.

  186. ~2 years of Prometheus use in production

  187. None
  188. The only “cost” has been SSD upgrades on boxes

  189. None
  190. Let’s not answer that last question!

  191. Thank You! @copyconstruct