Slide 1

Slide 1 text

Monitoring as a Service (scalability in monitoring) by Antonio C. C.

Slide 2

Slide 2 text

About ● Hi! I’m Antonio ● SRE @ Cloudflare, based in Singapore ● Involved with the SIN team on our monitoring solutions

Slide 3

Slide 3 text

Introduction ● As SREs, monitoring and visibility are key ● The challenges we faced as we scaled up our systems ● How we solved them

Slide 4

Slide 4 text

Once upon a time… 2008 ● <50 servers ● Slow growth ● ~100 servers was considered a huge environment

Slide 5

Slide 5 text

Initial situation ● Monitoring: Nagios - A well-known, standard monitoring tool - Worked great at that scale - Plenty of documentation and plug-ins

Slide 6

Slide 6 text

As we grow… Problems ● Scalability - Started missing monitoring points as the systems grew ● SPOF - Frequent crashes when too many events arrive - Not a really good HA setup ● Rigidity - Alerts are not easy to modify; they rely on scripts

Slide 7

Slide 7 text

As we grow… Problems

Slide 8

Slide 8 text

Problems ● Centralized daemon - Needs to manage a high number of connections - Very common to lose real-time data and alert freshness ● Config file - Centralized, rigid and complex - A single mistake breaks everything - High number of dependencies

Slide 9

Slide 9 text

Moving forward - Evaluating solutions ● What were we looking for? ● Active/standby setup ● Ability to write custom alerting methods (PagerDuty, chat) ● Easy to customize alert conditions as our environment changes fast ● Nodes should be capable of self-registration, with auto-discovery of services

Slide 10

Slide 10 text

Option: Prometheus ● Why? ● Robust ● Highly available ● Easy troubleshooting ● Can handle millions of time series (monitoring points) ● Easily create new alerts via PromQL (see the rule sketch below)
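As an illustration of what a PromQL-based alert looks like, here is a minimal alerting rule in the Prometheus 2.x YAML rule format; the alert name, metric, threshold and labels are hypothetical, not the ones used in this migration:

```yaml
groups:
  - name: example-rules
    rules:
      - alert: HighErrorRate          # hypothetical alert name
        # Fire when more than 5% of requests return 5xx for 10 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: High 5xx error rate
```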

Slide 11

Slide 11 text

Implementation: Migration Challenges ● Not trivial - two different monitoring philosophies ● During the process both systems need to co-exist ● Need to ensure reliability across both systems ● No room for downtime

Slide 12

Slide 12 text

Option: Prometheus ● Implementation - Horizontal sharding (one common approach is sketched below) - Verifying HTTP endpoints - Aggregating metrics - Defining alerting rules - Making sure the rule logic matches our current monitoring points
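One common way to shard Prometheus horizontally is hashmod relabelling, where each server keeps only the targets that hash to its shard. This is a sketch of that pattern, not our exact setup; the job name, target source, modulus and shard number are illustrative:

```yaml
scrape_configs:
  - job_name: node
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]   # placeholder target source
    relabel_configs:
      # Hash every target address into one of 4 shards
      - source_labels: [__address__]
        modulus: 4
        target_label: __tmp_shard
        action: hashmod
      # This Prometheus server keeps only shard 0; its peers keep 1, 2 and 3
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep
```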

Slide 13

Slide 13 text

Option: Prometheus ● Implementation - Once the alert / metric is ready: deploy - Verify: make sure the alert fires and escalates properly (one way to check is sketched below) - Disable the alert in Nagios - Remove the corresponding chunk of config - Reload the Nagios service
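One way to verify that a newly deployed alert is actually firing is to query the Prometheus HTTP API. A minimal Python sketch; the server URL and alert name are hypothetical:

```python
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"  # hypothetical server address
ALERT_NAME = "HighErrorRate"                            # hypothetical alert name

# /api/v1/alerts lists every alert that is currently pending or firing
resp = requests.get(f"{PROMETHEUS}/api/v1/alerts", timeout=10)
resp.raise_for_status()
alerts = resp.json()["data"]["alerts"]

matching = [a for a in alerts if a["labels"].get("alertname") == ALERT_NAME]
for alert in matching:
    print(alert["labels"]["alertname"], alert["state"])  # "pending" or "firing"
if not matching:
    print(f"{ALERT_NAME} is not active right now")
```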

Slide 14

Slide 14 text

Migration: Defining the architecture

Slide 15

Slide 15 text

Migrating Alerts: Nagios2Prometheus - Alerts in Nagios… - Based on scripts - Nagios evaluates the script’s exit code - A small threshold change requires modifying the script (see the sketch below)
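As a sketch of the Nagios model: a check plugin signals state through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and the thresholds live inside the script, so changing them means editing and redeploying the script. The thresholds here are illustrative:

```python
#!/usr/bin/env python3
"""Minimal Nagios-style disk usage check (illustrative thresholds)."""
import shutil
import sys

WARN_PCT = 80   # hard-coded: changing this means changing the script
CRIT_PCT = 90

usage = shutil.disk_usage("/")
used_pct = usage.used / usage.total * 100

if used_pct >= CRIT_PCT:
    print(f"CRITICAL - disk usage {used_pct:.1f}%")
    sys.exit(2)
elif used_pct >= WARN_PCT:
    print(f"WARNING - disk usage {used_pct:.1f}%")
    sys.exit(1)
else:
    print(f"OK - disk usage {used_pct:.1f}%")
    sys.exit(0)
```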

Slide 16

Slide 16 text

Migrating Alerts: Nagios2Prometheus - Prometheus requires only a metric - Changing an alert just means modifying the alerting condition (example below)
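For contrast, the same kind of disk check expressed as a Prometheus alerting rule against node_exporter filesystem metrics: raising or lowering the threshold is a one-line edit to the rule, with no script to redeploy. The alert name, mountpoint filter and threshold are illustrative:

```yaml
- alert: DiskUsageHigh               # hypothetical alert name
  # Fire when the root filesystem is more than 90% full for 15 minutes;
  # changing the threshold is just an edit to this expression
  expr: |
    (1 - node_filesystem_avail_bytes{mountpoint="/"}
       / node_filesystem_size_bytes{mountpoint="/"}) > 0.90
  for: 15m
  labels:
    severity: warning
```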

Slide 17

Slide 17 text

Migrating Alerts ● Prometheus relies on exporters ● Exporters expose data from third-party systems ○ Like PostgreSQL or Nginx ● The blackbox exporter is the most common one for HTTP/TCP probes/metrics (scrape config sketched below)
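A typical way to wire the blackbox exporter into Prometheus is the probe-style scrape config below; the probed URL and the exporter address are placeholders, and the http_2xx module is assumed to be defined in the exporter's own configuration:

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]             # module defined in blackbox.yml
    static_configs:
      - targets:
          - https://www.example.com  # site to probe (placeholder)
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
```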

Slide 18

Slide 18 text

Migrating Alerts ● Exporters make metrics available on an HTTP endpoint ● The Prometheus server scrapes the metrics ● Alertmanager makes sure the alarm is triggered (a minimal exporter sketch follows)
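A minimal sketch of that first step using the official Python client library, prometheus_client; the metric name, its value and the port are made up for illustration:

```python
from prometheus_client import Gauge, start_http_server
import random
import time

# Hypothetical metric; Prometheus would scrape it from http://<host>:8000/metrics
QUEUE_DEPTH = Gauge("demo_queue_depth", "Current depth of the work queue")

if __name__ == "__main__":
    start_http_server(8000)                      # expose /metrics on port 8000
    while True:
        QUEUE_DEPTH.set(random.randint(0, 100))  # stand-in for a real measurement
        time.sleep(15)
```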

Slide 19

Slide 19 text

Migrating Alerts ● But… it is not always possible to use exporters ● Exporters require firewall changes ● You have to deploy an additional process and reserve a new port ● Or the exporter simply does not exist

Slide 20

Slide 20 text

Migrating Alerts ● Solution: textfile exporter ● Prometheus allows exposing any metric that follows the text format convention ● Runs as a cronjob ● The metric is exposed as a text file (sketch below)
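A minimal sketch of such a cron job, assuming node_exporter's textfile collector is watching the directory below; the directory path and metric name are illustrative. Writing to a temporary file and then renaming it avoids the collector ever reading a half-written file:

```python
#!/usr/bin/env python3
"""Cron job sketch: publish one metric via the textfile collector."""
import os
import tempfile
import time

# Directory node_exporter watches via --collector.textfile.directory (illustrative path)
TEXTFILE_DIR = "/var/lib/node_exporter/textfile_collector"

METRIC = "cron_job_last_success_timestamp_seconds"   # hypothetical metric name

body = (
    f"# HELP {METRIC} Unix timestamp of the last successful run\n"
    f"# TYPE {METRIC} gauge\n"
    f"{METRIC} {time.time():.0f}\n"
)

# Write atomically: temp file first, then rename over the final .prom file
fd, tmp_path = tempfile.mkstemp(dir=TEXTFILE_DIR)
with os.fdopen(fd, "w") as f:
    f.write(body)
os.rename(tmp_path, os.path.join(TEXTFILE_DIR, f"{METRIC}.prom"))
```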

Slide 21

Slide 21 text

Future challenges ● Outdated metrics ○ Claimed to be solved in the next Prometheus version ● The HA setup needs improvement

Slide 22

Slide 22 text

Thanks!