Deep Dive to Prometheus Monitoring

Deep Dive to Prometheus Monitoring Nancy Chauhan Software Developer @Grofers
H Y D E R A B A D

© 2020 Cloud Native Computing Foundation 2 $ whoami •
Developer @Grofers • Loves to make cool things ranging from hardware to infra. • Has hosted her code at @Nancy-Chauhan • Speaks her mind at @_nancychauhan • Lives in Gurgaon, India

© 2020 Cloud Native Computing Foundation 3 Agenda • Prometheus
• Metric types of Prometheus • Configuring Prometheus for reliability • Scaling Prometheus • Prometheus HA using 3rd party tools like thanos • Monitoring Prometheus • Alternate Architecture for HA

© 2020 Cloud Native Computing Foundation 4 Monitoring is for
symptom based alerting !

© 2020 Cloud Native Computing Foundation 5 Why monitor ?
• Know when things go wrong ◦ To call in a human to prevent a business-level issue • Be able to debug and gain insight • Trending to see changes over time, and drive technical/business decisions • To feed into other systems/processes

© 2020 Cloud Native Computing Foundation 12 Prometheus Metric Type
Counter Gauge Histogram Summary Monotonically Increasing A Time Series Cumulative Histogram of Values Snapshot of Values in Time Window

© 2020 Cloud Native Computing Foundation 13 Counter : the
only way is up Use counters for counting events, jobs, money etc. where a cumulative value is useful Example # Total number of completed cleanup jobs by datacenter (will be affected by restarts) sum(batch_jobs_completed_total{job_type="hou rly-cleanup"}) by (datacenter) # Number of completed cleanup jobs per second, by datacenter (not affected by restarts) sum(rate(batch_jobs_completed_total{job_type ="hourly-cleanup"}[5m])) by (datacenter)

© 2020 Cloud Native Computing Foundation 14 Gauges: the current
picture of your infrastructure Use where the current value is important – CPU, RAM, JVM memory usage etc. # Amount of memory currently used memory_bytes_used # Number of jobs currently in queue batch_jobs_in_queue{job_type="hourly-cleanup"}

© 2020 Cloud Native Computing Foundation 15 Histograms: Sampling Observations
Use where a overall picture over a time frame is required – query times, http response times, # Request duration 90th percentile histogram_quantile(0.9, rate(http_request_duration_milliseconds_bucket [5m]))

© 2020 Cloud Native Computing Foundation 16 Summaries:client side quantiles
Similar in spirit to the Histogram, with the difference being that quantiles are calculated on the client-side as well. Use when you start using quantile values frequently with one or more histogram metrics. Example: the built-in Golang garbage collector summary reporting various quartiles (as reported by client): go_gc_duration_seconds{quantile= "0"} 4.274e-05 go_gc_duration_seconds{quantile= "0.25"} 6.8508e-05 go_gc_duration_seconds{quantile= "0.5"} 0.000275171 go_gc_duration_seconds{quantile= "0.75"} 0.002328529 go_gc_duration_seconds{quantile= "1"} 0.201453313 go_gc_duration_seconds_sum 0.467543895 go_gc_duration_seconds_count 92

© 2020 Cloud Native Computing Foundation 17 Setting up Prometheus
for Reliability • Avoid prometheus being a central point of failure • Scale prometheus to handle the volume of metrics generated by your application • Dedicated monitoring solution for Prometheus itself • Have centralized observability • Consider looking into alternate architecture if nothing works such as introducing a queue before Prometheus

© 2020 Cloud Native Computing Foundation 18 Scaling Prometheus Prometheus
is stateful and doesn’t allow for replication of its database Traditional methods such as load-balancing over replicas will not work.

© 2020 Cloud Native Computing Foundation 19 Prometheus HA :
Sharding Prometheus • First step towards scaling your Prometheus architecture. • Prepare application groups • Assign a Prometheus instance to monitor a single app group. • You can make such a grouping per-cluster or by application’s relative importance, SLAs etc.

© 2020 Cloud Native Computing Foundation 20 Thanos: Highly Available
Prometheus Setup • Functional sharding of Prometheus with a central view • Unlimited long-term storage of data • Behaves as a meta-prometheus allowing to query multiple prometheus instances from a single point

© 2020 Cloud Native Computing Foundation 21 Thanos: Working •
Thanos sidecar to pull prometheus data and send to Thanos Store • Thanos store stores data in a long-term storage after deduplication. • Thanos querier to query multiple prometheus instances simultaneously • Thanos compactor for downsampling historical data • Even if your local prometheus is unavailable, data can be fetched from Thanos Store.

© 2020 Cloud Native Computing Foundation 22 Monitoring Prometheus •
Who watches the watchers? • Monitoring prometheus is critical • Have a dedicated prometheus instance to manage all other prometheus instances • You can consider using a third-party service such as Datadog or NewRelic

© 2020 Cloud Native Computing Foundation 23 Alternate architectures for
HA • Push based architecture ◦ Application push data to a statsd like server (collector) ◦ Collector publishes data to a queue (such as Kafka) ◦ Prometheus scrapes from a purpose-built exporter that reads from Kafka • Advantages: ◦ Reduces loss of data due to prometheus unavailability. ◦ Easy long term storage by consuming queue data • Disadvantages: ◦ Non-standard ◦ Complex

© 2020 Cloud Native Computing Foundation 24 Alternate architectures for
HA Utilising Integration with prometheus Lots of companies use prometheus as scraper and export the data into separate system. • CloudWatch • Cortex • Uber M3

Thank you H Y D E R A B A
D

Deep Dive to Prometheus Monitoring

Deep Dive to Prometheus Monitoring

Nancy Chauhan

More Decks by Nancy Chauhan

Other Decks in Programming

Featured

Transcript

Deep Dive to Prometheus Monitoring Nancy Chauhan Software Developer @Grofers

© 2020 Cloud Native Computing Foundation 2 $ whoami •

© 2020 Cloud Native Computing Foundation 3 Agenda • Prometheus

© 2020 Cloud Native Computing Foundation 4 Monitoring is for

© 2020 Cloud Native Computing Foundation 5 Why monitor ?

© 2020 Cloud Native Computing Foundation 6 Architecture Diagram

© 2020 Cloud Native Computing Foundation 7 Prometheus Architecture

© 2020 Cloud Native Computing Foundation 8 Prometheus Architecture

© 2020 Cloud Native Computing Foundation 9 Prometheus Architecture

© 2020 Cloud Native Computing Foundation 10 Prometheus Architecture

© 2020 Cloud Native Computing Foundation 11 Prometheus Architecture

© 2020 Cloud Native Computing Foundation 12 Prometheus Metric Type

© 2020 Cloud Native Computing Foundation 13 Counter : the

© 2020 Cloud Native Computing Foundation 14 Gauges: the current

© 2020 Cloud Native Computing Foundation 15 Histograms: Sampling Observations

© 2020 Cloud Native Computing Foundation 16 Summaries:client side quantiles

© 2020 Cloud Native Computing Foundation 17 Setting up Prometheus

© 2020 Cloud Native Computing Foundation 18 Scaling Prometheus Prometheus

© 2020 Cloud Native Computing Foundation 19 Prometheus HA :

© 2020 Cloud Native Computing Foundation 20 Thanos: Highly Available

© 2020 Cloud Native Computing Foundation 21 Thanos: Working •

© 2020 Cloud Native Computing Foundation 22 Monitoring Prometheus •

© 2020 Cloud Native Computing Foundation 23 Alternate architectures for

© 2020 Cloud Native Computing Foundation 24 Alternate architectures for

Thank you H Y D E R A B A