Antonio Cocera - Monitoring As A Service

Monitoring As A Service (Scalability on monitoring) By Antonio C. C.

2 About
• Hi! I’m Antonio
• SRE @ Cloudflare, based in Singapore
• Involved with the SIN team on our monitoring solutions

3 Introduction
• As SREs, monitoring and visibility are key
• Challenges when we scale up our systems
• How we solved them

4 Once upon a time… 2008
• <50 servers
• Slow growth
• ~100 servers was considered a huge environment

5 Initial situation
• Monitoring: Nagios
  - Well-known, standard monitoring tool
  - Worked great at that scale
  - Plenty of documentation and plug-ins

6 As we grow… Problems
• Scalability
  - We start missing monitoring points as the systems grow
  - Frequent crashes when too many events come in
• SPOF
  - No really good HA setup
• Rigidity
  - Alerts are not easy to modify; they rely on scripts

7 As we grow… Problems

8 Problems
• Centralized daemon
  - Needs to manage a high number of connections
  - It is very common to lose real-time data and to get freshness alerts
• Config file
  - Centralized, rigid and complex
  - A single mistake breaks everything
  - High number of dependencies

9 Moving forward - Evaluating solutions
• What were we looking for?
• Active/standby setup
• Ability to write custom alerting methods (PagerDuty, chat)
• Easy customization of alert conditions, as our environment changes fast
• Nodes should be capable of self-registration, with auto-discovery of services

10 Option: Prometheus
• Why?
• Robust
• Highly available
• Easy troubleshooting
• Can handle millions of time series (monitoring points)
• Easily create new alerts via PromQL

11 Implementation: Migration Challenges
• Not trivial - 2 different monitoring philosophies
• During the process both systems need to co-exist
• Need to ensure reliability across both systems
• No room for downtime

12 Option: Prometheus
• Implementation
  - Horizontal sharding
  - Verifying HTTP endpoints
  - Aggregating metrics
  - Defining alerting rules
  - Make sure the rule logic matches our current monitoring points
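
The deck does not say how targets were split across Prometheus servers. As a rough sketch of what horizontal sharding can look like, the snippet below assigns each scrape target to one of N shards by hashing its address, loosely in the spirit of Prometheus's `hashmod` relabel action. The function names and host names are hypothetical, and the shard numbers are not guaranteed to match what Prometheus itself would compute.

```python
import hashlib
from typing import Dict, Iterable, List


def shard_for(target: str, shard_count: int) -> int:
    """Return a stable shard index in [0, shard_count) for a target address."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


def assign_targets(targets: Iterable[str], shard_count: int) -> Dict[int, List[str]]:
    """Group targets by shard; each target lands in exactly one shard."""
    shards: Dict[int, List[str]] = {index: [] for index in range(shard_count)}
    for target in targets:
        shards[shard_for(target, shard_count)].append(target)
    return shards


# Hypothetical fleet of 20 hosts split over 3 Prometheus servers.
example_targets = [f"host{i}.example.com:9100" for i in range(20)]
example_shards = assign_targets(example_targets, 3)
```

Because the assignment depends only on the target string and the shard count, every server can compute its own share independently, at the cost of reshuffling targets whenever the shard count changes.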

13 Option: Prometheus
• Implementation
  - Once the alert / metric is ready: deploy
  - Verify
  - Make sure the alert fires and escalates properly
  - Disable the alert in Nagios
  - Remove the chunk of config
  - Reload the Nagios service

14 Migration: Defining architecture

15 Migrating Alerts: Nagios2Prometheus
- Alerts in Nagios...
- Based on scripts
- Evaluate the script's exit code
- A small change of threshold requires modifying the script
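
To make the script-based model concrete, here is a minimal sketch of a Nagios-style check. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard plugin convention; the disk check itself and the threshold values are hypothetical and not taken from the talk.

```python
import shutil
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical thresholds (percent of disk used), hard-coded in the script:
# changing them means editing and redeploying the script.
WARN_PCT = 80.0
CRIT_PCT = 90.0


def check_disk(used_pct: float) -> Tuple[int, str]:
    """Map a disk-usage percentage to an exit code and a status line."""
    if used_pct >= CRIT_PCT:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used"
    if used_pct >= WARN_PCT:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used"
    return OK, f"DISK OK - {used_pct:.1f}% used"


def main(path: str = "/") -> int:
    """Run the check once and return the exit code Nagios would evaluate."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, message = check_disk(100.0 * usage.used / usage.total)
    print(message)
    return code

# Deployed as a plugin, the script would end with: sys.exit(main())
```

Nagios only sees the exit code and the first line of output, so the threshold logic lives entirely inside the script.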

16 Migrating Alerts: Nagios2Prometheus
- Prometheus requires only a metric
- Changing an alert just requires modifying the alerting condition
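
The difference can be illustrated with a toy model in which the alert condition is data evaluated against collected metrics, rather than logic buried in a check script. This is not Prometheus's rule format or API (real rules are PromQL expressions in rule files); `AlertRule`, `firing`, and the metric name are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Dict

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    comparison: str  # one of ">", ">=", "<", "<="
    threshold: float


def firing(rule: AlertRule, samples: Dict[str, float]) -> bool:
    """Return True when the latest sample for the rule's metric breaches it."""
    value = samples.get(rule.metric)
    if value is None:
        return False  # no data: this toy model simply stays silent
    return _OPS[rule.comparison](value, rule.threshold)


# The metric stays the same; only the condition changes.
samples = {"disk_used_percent": 85.0}
loose = AlertRule("DiskAlmostFull", "disk_used_percent", ">", 90.0)
strict = AlertRule("DiskAlmostFull", "disk_used_percent", ">", 80.0)
```

With the samples above, `firing(loose, samples)` is false and `firing(strict, samples)` is true, without touching whatever produces the metric.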

17 Migrating Alerts
• Prometheus relies on exporters
• Exporters expose the data from third-party systems
  ◦ Like PostgreSQL or Nginx
• Blackbox exporter is the most common for HTTP/TCP probes/metrics

18 Migrating Alerts
• Exporters make metrics available on an HTTP endpoint
• The Prometheus server scrapes the metrics
• Alertmanager makes sure the alarm is triggered
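
As a rough sketch of the exporter side of this flow, the snippet below serves one gauge in the Prometheus text exposition format on `/metrics`, using only the Python standard library. The metric name `demo_uptime_seconds` and port 9999 are hypothetical; real exporters normally use an official client library instead.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()


def render_metrics(uptime_seconds: float) -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_uptime_seconds Seconds since this demo exporter started.",
        "# TYPE demo_uptime_seconds gauge",
        f"demo_uptime_seconds {uptime_seconds:.3f}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        # Prometheus scrapes /metrics by default; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(time.time() - START_TIME).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9999) -> None:
    """Block forever, answering scrapes on the given (hypothetical) port."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; a Prometheus server configured with this address as a scrape target would then pull the gauge on every scrape interval.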

19 Migrating Alerts
• But … it is not always possible to use exporters
• Exporters require changes on firewalls
• Deploy an additional process and reserve a new port
• Or the exporter simply does not exist

20 Migrating Alerts
• Solution: Textfile exporter
• Prometheus allows exposing any metric that follows the convention
• Runs as a cron job
• Metric is exposed as text
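
Assuming the slide refers to node_exporter's textfile collector (which reads `*.prom` files from the directory given by `--collector.textfile.directory`), a cron-driven script could look roughly like the sketch below. The function name, metric name, and file name are hypothetical. The file is written to a temporary name and then renamed, so the collector never reads a half-written file.

```python
import os
import tempfile
from typing import Dict


def write_prom_file(directory: str, filename: str, metrics: Dict[str, float]) -> str:
    """Atomically write gauges (name -> value) to directory/filename."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    body = "\n".join(lines) + "\n"

    final_path = os.path.join(directory, filename)
    # The temporary file does not end in .prom, so the collector ignores it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(body)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
        os.replace(tmp_path, final_path)  # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

A cron job might call, for example, `write_prom_file("/var/lib/node_exporter/textfile", "backup.prom", {"demo_backup_last_success_timestamp_seconds": time.time()})`, where the directory is whatever path the collector was configured with.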

21 Future challenges
• Outdated metrics
  ◦ Claimed to be solved in the next Prometheus version
• HA setup needs improvement

22 Thanks !

DevOpsDays Singapore
