Slide 1

Slide 1 text

Deep Dive to Prometheus Monitoring
Nancy Chauhan, Software Developer @Grofers
HYDERABAD

Slide 2

Slide 2 text

© 2020 Cloud Native Computing Foundation
$ whoami
• Developer @Grofers
• Loves to make cool things ranging from hardware to infra
• Hosts her code at @Nancy-Chauhan
• Speaks her mind at @_nancychauhan
• Lives in Gurgaon, India

Slide 3

Slide 3 text

Agenda
• Prometheus
• Metric types of Prometheus
• Configuring Prometheus for reliability
• Scaling Prometheus
• Prometheus HA using third-party tools like Thanos
• Monitoring Prometheus
• Alternate architecture for HA

Slide 4

Slide 4 text

Monitoring is for symptom-based alerting!

Slide 5

Slide 5 text

Why monitor?
● Know when things go wrong
  ○ To call in a human to prevent a business-level issue
● Be able to debug and gain insight
● Trending to see changes over time, and drive technical/business decisions
● To feed into other systems/processes

Slide 6

Slide 6 text

Architecture Diagram

Slide 7

Slide 7 text

Prometheus Architecture

Slide 8

Slide 8 text

Prometheus Architecture

Slide 9

Slide 9 text

Prometheus Architecture

Slide 10

Slide 10 text

Prometheus Architecture

Slide 11

Slide 11 text

Prometheus Architecture

Slide 12

Slide 12 text

Prometheus Metric Types
• Counter: monotonically increasing
• Gauge: a time series
• Histogram: cumulative histogram of values
• Summary: snapshot of values in a time window

Slide 13

Slide 13 text

Counter: the only way is up
Use counters for counting events, jobs, money, etc., where a cumulative value is useful.
Example:
    # Total number of completed cleanup jobs by datacenter (will be affected by restarts)
    sum(batch_jobs_completed_total{job_type="hourly-cleanup"}) by (datacenter)
    # Number of completed cleanup jobs per second, by datacenter (not affected by restarts)
    sum(rate(batch_jobs_completed_total{job_type="hourly-cleanup"}[5m])) by (datacenter)
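
The claim that the rate() query is "not affected by restarts" rests on counter-reset handling: a drop in a counter's value is read as a restart from zero rather than as a negative change. The Python sketch below illustrates that idea only. The function names and sample data are made up for illustration, and the real rate() additionally extrapolates to the edges of the window, which this sketch does not do.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter never decreases on its own, so a lower value means the
        # process restarted and began counting again from zero.
        total += cur if cur < prev else cur - prev
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Per-second rate between the first and last sample (no extrapolation)."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical samples: the process restarts between t=120 and t=180,
# so the raw counter value drops from 160 to 20.
window = [(0, 100.0), (60, 130.0), (120, 160.0), (180, 20.0), (240, 50.0)]
print(counter_increase(window))  # 110.0
print(simple_rate(window))
```

Summing the raw counter, as in the first query, would instead show a sudden drop at the restart.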

Slide 14

Slide 14 text

Gauges: the current picture of your infrastructure
Use where the current value is important: CPU, RAM, JVM memory usage, etc.
    # Amount of memory currently used
    memory_bytes_used
    # Number of jobs currently in queue
    batch_jobs_in_queue{job_type="hourly-cleanup"}
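
To make the contrast with counters concrete, here is a minimal Python sketch of a gauge as a value that can move in both directions. The Gauge class and the jobs_in_queue variable are hypothetical stand-ins for illustration, not the API of any Prometheus client library.

```python
import threading


class Gauge:
    """A value that may go up or down; only the latest value matters."""

    def __init__(self) -> None:
        self._value = 0.0
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0) -> None:
        with self._lock:
            self._value += amount

    def dec(self, amount: float = 1.0) -> None:
        with self._lock:
            self._value -= amount

    def set(self, value: float) -> None:
        with self._lock:
            self._value = value

    def get(self) -> float:
        with self._lock:
            return self._value


# Hypothetical queue-depth gauge, mirroring batch_jobs_in_queue above.
jobs_in_queue = Gauge()
for _ in range(3):
    jobs_in_queue.inc()   # three jobs enqueued
jobs_in_queue.dec()       # one job picked up by a worker
print(jobs_in_queue.get())  # 2.0
```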

Slide 15

Slide 15 text

Histograms: sampling observations
Use where an overall picture over a time frame is required: query times, HTTP response times, etc.
    # Request duration, 90th percentile
    histogram_quantile(0.9, rate(http_request_duration_milliseconds_bucket[5m]))
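
A quantile derived from a histogram is an estimate: the server finds the cumulative bucket that contains the requested rank and interpolates linearly inside it. The Python sketch below shows that estimation under simplifying assumptions (non-negative bucket bounds, a final +Inf bucket). The bucket counts are invented for illustration, and this is not the Prometheus source code.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    buckets = sorted(buckets)
    if not 0 <= q <= 1 or len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, below = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if count == below:
        return lower
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical request durations in milliseconds (cumulative counts).
duration_buckets = [(100, 50), (250, 80), (500, 95), (1000, 99), (math.inf, 100)]
print(histogram_quantile(0.5, duration_buckets))  # 100.0
print(histogram_quantile(0.9, duration_buckets))  # roughly 416.67
```

Because of the interpolation, accuracy depends on how well the bucket boundaries match the real latency distribution.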

Slide 16

Slide 16 text

Summaries: client-side quantiles
Similar in spirit to the Histogram, with the difference being that quantiles are calculated on the client side as well.
Use when you start using quantile values frequently with one or more histogram metrics.
Example: the built-in Golang garbage collector summary reporting various quartiles (as reported by client):
    go_gc_duration_seconds{quantile="0"} 4.274e-05
    go_gc_duration_seconds{quantile="0.25"} 6.8508e-05
    go_gc_duration_seconds{quantile="0.5"} 0.000275171
    go_gc_duration_seconds{quantile="0.75"} 0.002328529
    go_gc_duration_seconds{quantile="1"} 0.201453313
    go_gc_duration_seconds_sum 0.467543895
    go_gc_duration_seconds_count 92
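
Besides the precomputed quantiles, a summary exposes a _sum and a _count series, which can be combined (and, unlike the quantiles, aggregated across instances) to get an average. The Python sketch below parses the sample output above and derives the mean GC pause. The parser is a deliberately minimal assumption: it handles only "name value" lines without timestamps, not the full exposition format.

```python
from typing import Dict

EXPOSITION = """\
go_gc_duration_seconds{quantile="0"} 4.274e-05
go_gc_duration_seconds{quantile="0.25"} 6.8508e-05
go_gc_duration_seconds{quantile="0.5"} 0.000275171
go_gc_duration_seconds{quantile="0.75"} 0.002328529
go_gc_duration_seconds{quantile="1"} 0.201453313
go_gc_duration_seconds_sum 0.467543895
go_gc_duration_seconds_count 92
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' to its value; comments and blanks are skipped."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumes no trailing timestamp: the value is the last field.
        name, value = line.rsplit(" ", 1)
        samples[name] = float(value)
    return samples


samples = parse_samples(EXPOSITION)
mean = samples["go_gc_duration_seconds_sum"] / samples["go_gc_duration_seconds_count"]
median = samples['go_gc_duration_seconds{quantile="0.5"}']
print(f"mean={mean:.6f}s median={median:.6f}s")
```

In this sample the mean is far above the median, which suggests a few long pauses dominate the total.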

Slide 17

Slide 17 text

Setting up Prometheus for Reliability
● Avoid Prometheus being a central point of failure
● Scale Prometheus to handle the volume of metrics generated by your application
● Have a dedicated monitoring solution for Prometheus itself
● Have centralized observability
● If nothing else works, consider an alternate architecture, such as introducing a queue in front of Prometheus

Slide 18

Slide 18 text

Scaling Prometheus
Prometheus is stateful and doesn’t allow for replication of its database.
Traditional methods such as load-balancing over replicas will not work.

Slide 19

Slide 19 text

Prometheus HA: Sharding Prometheus
● First step towards scaling your Prometheus architecture.
● Prepare application groups.
● Assign a Prometheus instance to monitor a single app group.
● You can make such a grouping per cluster or by the application’s relative importance, SLAs, etc.
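
The grouping step above can be sketched as a simple mapping from scrape targets to Prometheus instances. In the Python sketch below, every target, group, and instance name is hypothetical. In practice the output would feed each instance's own scrape configuration.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical inventory: scrape target -> application group.
TARGET_GROUPS = {
    "checkout-1:8080": "payments",
    "checkout-2:8080": "payments",
    "search-1:8080": "catalog",
    "cron-1:9100": "batch",
}

# Hypothetical assignment: application group -> Prometheus instance.
GROUP_TO_SHARD = {
    "payments": "prometheus-critical",
    "catalog": "prometheus-general",
    "batch": "prometheus-general",
}


def targets_per_shard(
    target_groups: Dict[str, str], group_to_shard: Dict[str, str]
) -> Dict[str, List[str]]:
    """Return the list of targets each Prometheus instance should scrape."""
    shards: Dict[str, List[str]] = defaultdict(list)
    for target, group in sorted(target_groups.items()):
        # A target with an unmapped group raises KeyError, so nothing is
        # silently left unmonitored.
        shards[group_to_shard[group]].append(target)
    return dict(shards)


for shard, targets in sorted(targets_per_shard(TARGET_GROUPS, GROUP_TO_SHARD).items()):
    print(shard, targets)
```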

Slide 20

Slide 20 text

Thanos: Highly Available Prometheus Setup
● Functional sharding of Prometheus with a central view
● Unlimited long-term storage of data
● Behaves as a meta-Prometheus, allowing you to query multiple Prometheus instances from a single point

Slide 21

Slide 21 text

Thanos: Working
● Thanos Sidecar runs next to each Prometheus, reads its data, and uploads it to long-term object storage
● Thanos Store serves the data kept in long-term storage
● Thanos Querier queries multiple Prometheus instances simultaneously and deduplicates results from replicas
● Thanos Compactor compacts and downsamples historical data
● Even if your local Prometheus is unavailable, data can be fetched from Thanos Store.
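
When two Prometheus replicas scrape the same targets, a central query layer has to merge their overlapping series into one. The Python sketch below shows the general idea: drop a replica label, then fill gaps in one replica with points from the other. The label name "replica", the data, and the aligned timestamps are assumptions for illustration. Thanos uses its own, more involved deduplication logic.

```python
from typing import Dict, FrozenSet, List, Tuple

Labels = Dict[str, str]
Series = Tuple[Labels, Dict[int, float]]  # (labels, {timestamp: value})
SeriesKey = FrozenSet[Tuple[str, str]]


def dedup(series: List[Series], replica_label: str = "replica") -> Dict[SeriesKey, Dict[int, float]]:
    """Merge series that differ only in the replica label."""
    merged: Dict[SeriesKey, Dict[int, float]] = {}
    for labels, points in series:
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        target = merged.setdefault(key, {})
        for ts, value in points.items():
            # Keep the first value seen for a timestamp; later replicas
            # only fill the gaps.
            target.setdefault(ts, value)
    return merged


# Hypothetical data: replica "a" missed the scrape at t=30.
replica_a = ({"job": "api", "replica": "a"}, {0: 1.0, 15: 2.0, 45: 4.0})
replica_b = ({"job": "api", "replica": "b"}, {0: 1.0, 15: 2.0, 30: 3.0, 45: 4.0})

result = dedup([replica_a, replica_b])
for key, points in result.items():
    print(dict(key), sorted(points.items()))
```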

Slide 22

Slide 22 text

Monitoring Prometheus
● Who watches the watchers?
● Monitoring Prometheus is critical
● Have a dedicated Prometheus instance to monitor all other Prometheus instances
● You can consider using a third-party service such as Datadog or New Relic
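
One simple form of watching the watchers is an external check against each server's health endpoint. Prometheus serves one at /-/healthy. The Python sketch below polls a list of instances and reports the ones that do not answer. The instance URLs and the fake probe are hypothetical, and a real setup would more likely use alerting rules or an external service, as the slide suggests.

```python
import urllib.request
from typing import Callable, List


def http_probe(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET <base_url>/-/healthy answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def unhealthy_instances(
    instances: List[str], probe: Callable[[str], bool] = http_probe
) -> List[str]:
    """Return the instances whose health probe failed."""
    return [url for url in instances if not probe(url)]


# Hypothetical fleet, checked with a fake probe so the example needs no network.
fleet = ["http://prom-payments:9090", "http://prom-catalog:9090"]


def fake_probe(url: str) -> bool:
    return url != "http://prom-catalog:9090"


print(unhealthy_instances(fleet, probe=fake_probe))  # ['http://prom-catalog:9090']
```

Passing the probe in as a parameter keeps the check testable and lets the same loop run from a second Prometheus host or a cron job.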

Slide 23

Slide 23 text

Alternate architectures for HA
● Push-based architecture
  ○ Applications push data to a statsd-like server (collector)
  ○ Collector publishes data to a queue (such as Kafka)
  ○ Prometheus scrapes from a purpose-built exporter that reads from Kafka
● Advantages:
  ○ Reduces loss of data due to Prometheus unavailability
  ○ Easy long-term storage by consuming queue data
● Disadvantages:
  ○ Non-standard
  ○ Complex
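
The push-based flow above can be sketched end to end in a few lines of Python. Here an in-process queue.Queue stands in for Kafka, the collector accepts only statsd-style counter lines ("name:value|c"), and the exporter renders text for Prometheus to scrape. All names are hypothetical, and a production exporter would also need durable offsets and an HTTP endpoint.

```python
import queue
from collections import defaultdict
from typing import Dict


def collect(line: str, bus: queue.Queue) -> None:
    """Collector: parse a statsd-style counter line and publish it to the queue."""
    name, rest = line.split(":", 1)
    value, kind = rest.split("|", 1)
    if kind != "c":
        raise ValueError(f"only counters are handled in this sketch: {line!r}")
    bus.put((name, float(value)))


class QueueExporter:
    """Exporter: drains the queue and exposes running totals for scraping."""

    def __init__(self, bus: queue.Queue) -> None:
        self._bus = bus
        self._totals: Dict[str, float] = defaultdict(float)

    def drain(self) -> None:
        while True:
            try:
                name, value = self._bus.get_nowait()
            except queue.Empty:
                return
            self._totals[name] += value

    def render(self) -> str:
        """Text a Prometheus scrape of this exporter would receive."""
        self.drain()
        lines = []
        for name in sorted(self._totals):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {self._totals[name]}")
        return "\n".join(lines) + "\n"


bus: queue.Queue = queue.Queue()   # stand-in for a Kafka topic
for pushed in ["orders_total:1|c", "orders_total:2|c", "payments_total:1|c"]:
    collect(pushed, bus)

exporter = QueueExporter(bus)
output = exporter.render()
print(output)
```

Events pushed while Prometheus is down stay in the queue and are counted at the next scrape, which is the data-loss advantage listed above.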

Slide 24

Slide 24 text

Alternate architectures for HA
Utilising integration with Prometheus
Lots of companies use Prometheus as a scraper and export the data into a separate system:
• CloudWatch
• Cortex
• Uber M3

Slide 25

Slide 25 text

Thank you
HYDERABAD