Antonio Cocera - Monitoring As A Service

Monitoring As A Service

The exponential growth in the number of servers and so-called microservices over the last few years has brought challenges to most areas of IT.

Monitoring is no exception. This critical part of any infrastructure required a huge extra effort to migrate, as there was no room for a monitoring blackout.

Our monitoring simply did not work. We were flooded with timeouts as we kept adding services, and it became a pain for the on-call SREs.

After a long and thoughtful evaluation, we decided to move to Prometheus, but the process was not straightforward. Planning the full transition while keeping the service online was painful: we had to build a bridge so that the new system and the old one could co-exist.

With all that in place, we started migrating the alerts, which involved not just a rewrite but also a change in the way we understood monitoring and a better understanding of scalability concepts.

During the talk we will go through the following:

1) History: initial situation at Cloudflare
2) Background of the issues we were encountering
3) Strategy for the migration
4) Process and execution

Why would this talk be a good fit for the DevOpsDays audience?
All DevOps engineers have to think about scalability. Monitoring, rather than being the exception, is common across all platforms, tools and technologies. I believe sharing this experience will be valuable for teams that struggle to move away from legacy systems or simply want to solve similar issues.
DevOpsDays Singapore 2017


DevOpsDays Singapore

October 26, 2017

  1. Monitoring As A Service (Scalability on monitoring) By Antonio C.

  2. About
     • Hi! I’m Antonio
     • SRE @ Cloudflare, based in Singapore
     • Involved with SIN team on our monitoring solutions
  3. Introduction
     • As SREs, monitoring and visibility are key
     • Challenges when we scale up our systems
     • How we solved it
  4. Once upon a time… 2008
     • <50 servers
     • Slow growth
     • ~100 servers was considered a huge environment
  5. Initial situation
     • Monitoring: Nagios
       - Well-known, standard monitoring tool
       - Worked great at that scale
       - Plenty of documentation and plug-ins
  6. As we grow… Problems
     • Scalability and SPOF
       - Start missing monitoring points as systems grow
       - Frequent crashes as too many events come in
       - Not really a good HA setup
     • Rigidity
       - Alerts are not easy to modify; they rely on scripts
  7. As we grow… Problems

  8. Problems
     • Centralized daemon
       - Needs to manage a high number of connections
       - Very common to lose real-time data and trigger freshness alerts
     • Config file
       - Centralized, rigid and complex
       - A single mistake breaks everything
       - High number of dependencies
  9. Moving forward - Evaluating solutions
     • What were we looking for?
     • Active/standby setup.
     • Ability to write custom alerting methods (PagerDuty, chat).
     • Easy to customize alert conditions, as our environment changes fast.
     • Nodes should be capable of self-registration, with auto-discovery of services
  10. Option: Prometheus
      • Why?
      • Robust
      • Highly available
      • Easy troubleshooting
      • Can handle millions of time series (monitoring points)
      • Easily create new alerts via PromQL
  11. Implementation: Migration Challenges
      • Not trivial - 2 different monitoring philosophies
      • During the process both systems need to co-exist
      • Need to ensure reliability across both systems
      • No room for downtime
  12. Option: Prometheus
      • Implementation
        - Horizontal sharding
        - Verifying HTTP endpoints
        - Aggregating metrics
        - Defining alerting rules
        - Make sure the rule logic matches our current monitoring points
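
Slide 12 lists horizontal sharding as an implementation step without saying how targets were split. One common approach is to hash each scrape target and assign it to one of N Prometheus servers, similar in spirit to Prometheus's `hashmod` relabel action. The following is a minimal sketch under that assumption; the host names, the shard count and the helper names `shard_for` and `assign` are hypothetical, and this is not necessarily the scheme used at Cloudflare.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index in [0, num_shards)."""
    # A stable hash keeps the assignment identical across restarts,
    # unlike Python's built-in hash(), which is randomized per process.
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign(targets, num_shards):
    """Group targets by the shard (Prometheus server) that would scrape them."""
    shards = {index: [] for index in range(num_shards)}
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return shards


# Hypothetical targets, for illustration only.
example_targets = [f"host{i}.example.com:9100" for i in range(10)]
example_shards = assign(example_targets, 3)
```

Each target lands on exactly one shard, so adding servers spreads the scrape load, at the cost of reshuffling targets whenever the shard count changes.
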
  13. Option: Prometheus
      • Implementation
        - Once the alert / metric is ready, deploy
        - Verify
        - Make sure the alert fires and escalates properly
        - Disable alert in Nagios
        - Remove chunk of config
        - Reload Nagios service
  14. Migration: Defining architecture

  15. Migrating Alerts: Nagios2Prometheus
      - Alerts in Nagios...
      - Based on scripts
      - Evaluate the script's exit code
      - A small change of threshold requires script modification
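
To make slide 15 concrete, here is a minimal sketch of a Nagios-style check script. Nagios plugins report state through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The disk-usage check, the threshold values and the function names are hypothetical and only illustrate why a threshold change means editing the script.

```python
import shutil
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Thresholds live inside the script (assumed values), so tuning an
# alert means editing and redeploying the script itself.
WARN_PCT = 80.0
CRIT_PCT = 90.0


def status_for(used_pct: float) -> int:
    """Translate a disk usage percentage into a Nagios status code."""
    if used_pct >= CRIT_PCT:
        return CRITICAL
    if used_pct >= WARN_PCT:
        return WARNING
    return OK


def check_disk(path: str = "/") -> int:
    """Print a one-line status message and return the exit code to use."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    used_pct = 100.0 * usage.used / usage.total
    status = status_for(used_pct)
    label = ("OK", "WARNING", "CRITICAL")[status]
    print(f"DISK {label} - {used_pct:.1f}% used on {path}")
    return status


if __name__ == "__main__":
    sys.exit(check_disk())
```
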
  16. Migrating Alerts: Nagios2Prometheus
      - Prometheus requires only a metric
      - Changing an alert only requires modifying the alerting condition
  17. Migrating Alerts
      • Prometheus relies on exporters
      • Exporters expose the data from third-party systems
        ◦ Like PostgreSQL or Nginx
      • Blackbox exporter is the most common for HTTP/TCP probes/metrics
  18. Migrating Alerts
      • Exporters make metrics available on an HTTP endpoint
      • The Prometheus server scrapes the metrics
      • Alertmanager makes sure the alarm is triggered
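
The exporter side of slide 18 can be sketched with nothing but the Python standard library: a small HTTP server that answers `/metrics` with the Prometheus text exposition format. The metric name `demo_queue_depth`, its value and the port are hypothetical; real exporters usually rely on an official client library rather than hand-written output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth: int) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_queue_depth Jobs currently waiting in the queue.\n"
        "# TYPE demo_queue_depth gauge\n"
        f"demo_queue_depth {queue_depth}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current metrics on /metrics and 404 everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # A real exporter would read the value from the monitored system;
        # 7 is a placeholder.
        body = render_metrics(queue_depth=7).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint. A Prometheus server configured to scrape that host and port would then collect `demo_queue_depth` on every scrape interval.
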
  19. Migrating Alerts
      • But … it is not always possible to use exporters
      • Exporters require changes on firewalls
      • Need to deploy an additional process and reserve a new port
      • Or an exporter simply does not exist
  20. Migrating Alerts
      • Solution: Textfile exporter
      • Prometheus allows exposing any metric that follows the convention
      • Run as a cron job
      • Metric is exposed as text
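
The textfile approach on slide 20 can be sketched as a small script run from cron that writes a metric into a `.prom` file, which the textfile collector in node_exporter then reads from a configured directory. The directory, the metric name and the helper name `write_textfile_metric` below are hypothetical. The write goes to a temporary file first and is then renamed, so a scrape should never see a half-written file.

```python
import os
import tempfile


def write_textfile_metric(directory: str, name: str, value, help_text: str) -> str:
    """Atomically write a single gauge to <directory>/<name>.prom."""
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    final_path = os.path.join(directory, f"{name}.prom")
    # Create the temporary file in the same directory so the rename
    # stays on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file readable by its owner only; widen the
    # permissions so a collector running as another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)
    return final_path
```

A cron entry would call this with, for example, the timestamp of the last successful backup. An alerting condition on the Prometheus side can then compare that timestamp with the current time.
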
  21. Future challenges
      • Outdated metrics
        ◦ Claimed to be solved in the next Prometheus version
      • HA setup needs improvement
  22. Thanks!