Prometheus - A Whirlwind Tour

A presentation on Prometheus at OSCON 2017.

Cindy Sridharan

May 10, 2017


  1. Debugging must be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away! - Bryan Cantrill
  2. OBSERVABILITY must also be viewed as the process by which systems are understood and improved, not merely as the process by which bugs are made to go away!
  3. WHITEBOX: Observability data gathered from the internals of the target system. Is capable of providing warning about a problem before it occurs. BLACKBOX: Observes external functionality as seen by an end user of the system. Helps detect when a problem is ongoing and contributing to external symptoms.
  4. Utilization: average time the resource is busy servicing work. Saturation: degree to which the resource has extra work which it can't service, often queued. Errors: count of error events. (Brendan Gregg)
  5. How busy is my service? Request rate. Are there any errors in my service? Error rate. What is the latency in my service? Duration of requests. (Tom Wilkie)
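
The three questions above reduce to simple arithmetic over a window of request records. A minimal sketch, assuming a hypothetical `Request` record and treating status codes of 500 and above as errors; none of these names or conventions come from the talk:

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP-style status code (assumed convention)
    duration_s: float  # time taken to serve the request, in seconds


def red_summary(requests, window_s):
    """Return (request rate, error rate, mean duration) for one window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    rate = total / window_s          # requests per second
    error_rate = errors / window_s   # errors per second
    mean_duration = sum(r.duration_s for r in requests) / total if total else 0.0
    return rate, error_rate, mean_duration
```

A mean is used here only to keep the sketch short; latency is more often reported as percentiles.
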
  6. Pull-based systems detect that a service is down (a scrape fails) as a part of gathering metrics
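
The point of this slide can be sketched in a few lines: the same call that collects metrics also yields a health signal, which Prometheus exposes as the `up` series (1 for a successful scrape, 0 for a failed one). The `fetch` callable below is a hypothetical stand-in for the HTTP request to the target, not a real client API:

```python
def scrape(fetch):
    """Attempt one scrape and return (up, body).

    `fetch` is a hypothetical zero-argument callable that returns the
    target's metrics text, or raises OSError if the target is unreachable.
    """
    try:
        body = fetch()
    except OSError:
        return 0, ""   # failed scrape: the target is considered down
    return 1, body     # successful scrape: target is up, metrics collected
```
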
  7. With statsd-type systems, the application sends a UDP message for every event it observes
  8. Prometheus clients aggregate metrics in memory, which are scraped by the Prometheus server at regular intervals
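
The contrast between the two models can be illustrated with a toy counter: each event is only a local increment, and the totals leave the process only when a scrape asks for them. This is a simplified sketch, not the real Prometheus client library; `InMemoryCounter`, `inc`, and `render` are hypothetical names, and the output only loosely follows the text exposition format:

```python
from collections import defaultdict


class InMemoryCounter:
    """Toy counter: events are summed in process memory, not sent one by one."""

    def __init__(self, name):
        self.name = name
        self._values = defaultdict(float)  # label tuple -> running total

    def inc(self, **labels):
        # Called once per observed event; only a local increment happens.
        self._values[tuple(sorted(labels.items()))] += 1

    def render(self):
        # Called when the server scrapes: emit one line per label set.
        lines = []
        for labels, value in sorted(self._values.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{self.name}{{{label_str}}} {value:g}")
        return "\n".join(lines) + "\n"
```

However many events occur between scrapes, the amount of data transferred depends on the number of label sets, not on the number of events.
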
  9. Write pattern is horizontal: a TSDB ingests potentially several time series from every target at specific intervals of time
  10. Incoming time series are stored in chunks in memory. Chunks are flushed to disk when they are full.
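
A rough sketch of the buffering described on this slide, assuming for simplicity that a chunk is "full" after a fixed number of samples (the real storage engine sizes chunks differently) and that `flush` is a hypothetical callable that persists a chunk:

```python
class ChunkedSeries:
    """Toy model of one time series buffered in fixed-size chunks."""

    def __init__(self, chunk_size, flush):
        self.chunk_size = chunk_size  # samples per chunk (simplifying assumption)
        self.flush = flush            # callable that persists a full chunk
        self.head = []                # the open chunk, held in memory

    def append(self, timestamp, value):
        self.head.append((timestamp, value))
        if len(self.head) >= self.chunk_size:
            self.flush(self.head)     # chunk is full: hand it off to disk
            self.head = []            # start a fresh in-memory chunk
```
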
  11. All data required to evaluate a PromQL expression needs to be in memory. This data is also cached aggressively for future queries.
  12. Prometheus supports two types of rules which may be configured and then evaluated at regular intervals: recording rules and alerting rules.
  13. RECORDING RULES: Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series
  14. RECORDING RULES: Querying the precomputed result will then often be much faster than executing the original expression every time it is needed
  15. RECORDING RULES: Come in handy when creating dashboards, where the same expression is evaluated every time a dashboard is refreshed
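
The idea behind recording rules can be sketched as "evaluate once on a schedule, store the answer under a new name". In the toy model below, `store` is a hypothetical dict mapping series names to lists of `(timestamp, value)` samples, and `total_requests` stands in for an expensive expression; neither reflects Prometheus internals:

```python
def evaluate_recording_rule(store, new_name, expression, now):
    """Evaluate `expression` once and save the result as a sample of `new_name`."""
    value = expression(store)  # the expensive part, paid once per interval
    store.setdefault(new_name, []).append((now, value))


def total_requests(store):
    # Stand-in for a costly query: sum the latest sample of every raw series.
    return sum(samples[-1][1] for name, samples in store.items()
               if name.startswith("requests_"))
```

A dashboard would then read the stored series by name on every refresh instead of recomputing the sum.
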
  16. ALERTING RULES: Allow you to define alert conditions based on PromQL expressions and to send notifications about firing alerts to an external service.
  17. A Prometheus server of one service is configured to scrape selected data from another service's Prometheus server to enable alerting and queries against both datasets within a single server
  18. The federation topology resembles a tree, with higher level Prometheus servers collecting aggregated time series data from a larger number of subordinated servers
  19. stats.timers.accounts.ios.http.post.authenticate.response_time.upper_95
  20. ALERT <alert name> IF <expression> [ FOR <duration> ] [ LABELS <label set> ] [ ANNOTATIONS <label set> ]
  21. ALERT ConsulRaftPeersLow IF consul_raft_peers < 5 FOR 1m LABELS {severity="page", team="infra"} ANNOTATIONS {description="consul raft peer count low: {{$value}}", summary="consul raft peer count low: {{$value}}"}
  22. ALERT QueueCritical IF sum (broker_q{svc_pref="prod"}) > 5000 FOR 10m LABELS {severity="page", team="product"} ANNOTATIONS {description="service: {{$labels.service}} instance: {{$labels.instance}} queue length: {{$value}} for too long", summary="service: {{$labels.service}} instance: {{$labels.instance}} queue length: {{$value}} for too long"}
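
The `FOR` clause in the rules above means the condition must hold continuously for the given duration before the alert fires; until then the alert is pending. A minimal sketch of that state machine, assuming a hypothetical `AlertRule` class evaluated at explicit timestamps in seconds (this is an illustration, not Prometheus code):

```python
class AlertRule:
    """Toy state machine for one alert with a FOR duration."""

    def __init__(self, condition, for_seconds):
        self.condition = condition      # callable: value -> bool (the IF part)
        self.for_seconds = for_seconds  # the FOR part, in seconds
        self.active_since = None        # when the condition first held

    def evaluate(self, value, now):
        if not self.condition(value):
            self.active_since = None    # condition cleared: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

Modelled this way, the `QueueCritical` rule would be `AlertRule(lambda v: v > 5000, 600)`: a queue length that dips below the threshold even once resets the ten-minute wait.
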