Locaweb Monitoring

Locaweb Monitoring Plan Highway to heaven Tuesday, April 2, 13

Project Goals
๏ Deliver a functional, automated way to create and use monitoring in the Locaweb environment, making sure that all applications created and services running are monitored, and that our operations team can handle the monitoring workflow without a huge number of false-positive checks.
๏ Train our developers to write applications with monitoring and the quality of service delivered to our customers in mind.
๏ Document application dependencies automatically. Through this process we will be able to measure problems and find bottlenecks easily.
๏ Reach an Autonomic Computing (AC) architecture, in which the system is capable of regulating itself, thus enabling self-management and self-healing. AC was inspired by the operation of the human central nervous system and draws an analogy between it and a complex, distributed information system. Unconscious processes, such as control over the rate of breathing, do not require human effort. The goal of AC is to minimize the need for human intervention in a similar way, by replacing it with self-regulation. Comprehensive monitoring can provide an effective means to achieve this end.

Prerequisites
๏ Cfengine
๏ Conductor Audit
๏ Python
๏ Leela client
๏ Development teams adopt YAML for app monitoring

Current Nagios Environment
๏ Main issues
  ๏ Manual setup through the SOAP API
  ๏ Managed using configuration files, without automation
  ๏ Acts as a proxy for configuration when it should be a manager
  ๏ Template creation and management are manual
  ๏ New checks are not added once a service is in production
  ๏ Configuration is spread across the environment
  ๏ No central dashboard for the whole environment
  ๏ Environments that cannot be replicated with Cfengine
  ๏ Checks without SSL
  ๏ Security risk: NRPE running with dont_blame_nrpe = 1
[Diagram: Internal Customer / Machine, Soap API, Nagios Server 1, Nagios Server 2, Nagios Server 3]

Current Nagios Environment
๏ Operational issues
  ๏ SLA is terrible to calculate
  ๏ Missing escalation module
  ๏ Lots of false-positive checks
  ๏ Missing better threshold definitions
  ๏ Missing investigation into why each incident happened
  ๏ High number of constantly alarming services, usually ~1000
  ๏ 20000 alarms per day, of which 150 turn into real incidents
[Diagram: Internal Customer / Machine, Soap API, Nagios Server 1, Nagios Server 2, Nagios Server 3]
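To put the last two figures in proportion, here is a small calculation. It is only a sketch using the numbers on the slide. The variable names are illustrative, and an alarm that does not become an incident is not necessarily a false positive.

```python
# Figures taken from the slide above.
alarms_per_day = 20000
real_incidents_per_day = 150

# Share of alarms that end up as real incidents, and the remainder.
incident_ratio = real_incidents_per_day / alarms_per_day
non_incident_ratio = 1 - incident_ratio

print(f"{incident_ratio:.2%} of alarms turn into real incidents")
print(f"{non_incident_ratio:.2%} of alarms do not")
```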

Current Zabbix and Cacti Environments
๏ Issue
  ๏ Everything is done manually inside the monitoring server

Desired Final Environment
[Diagram: Locaweb App, Standard Linux Service, Network Devices, Configuration Agent, Cegonha, NetL2Api, Simplemon, Nagios, Nagios Probe, Monitoring DB, Dashboard, Notification Agent]
๏ Notification is generated by a Nagios check
๏ NetL2Api sends network device information
๏ Conductor Audit sends host information
๏ Cegonha sends an XMPP feed to Simplemon with activations, changes and deactivations
๏ Simplemon reads the Cegonha feed, generates the Nagios configuration and sends it to the database
๏ Notification Agent triggers the action for the alarm: automatic incident creation, SMS or e-mail
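The Simplemon step (read a feed event, produce Nagios configuration) could look roughly like the sketch below. The event schema (`action`, `host`, `service`, `check_command`, `template`) is hypothetical, and a plain dict stands in for the Monitoring DB. Only the Nagios object directives (`use`, `host_name`, `service_description`, `check_command`) are standard Nagios syntax.

```python
def render_nagios_service(event):
    """Render one Nagios service definition from a feed event (hypothetical schema)."""
    lines = [
        "define service {",
        f"    use                  {event['template']}",
        f"    host_name            {event['host']}",
        f"    service_description  {event['service']}",
        f"    check_command        {event['check_command']}",
        "}",
    ]
    return "\n".join(lines)


def apply_event(config, event):
    """Apply an activation, change or deactivation to the config store (a dict here)."""
    key = (event["host"], event["service"])
    if event["action"] in ("activation", "change"):
        config[key] = render_nagios_service(event)
    elif event["action"] == "deactivation":
        config.pop(key, None)
    else:
        raise ValueError(f"unknown action: {event['action']}")
    return config


config = {}
apply_event(config, {
    "action": "activation",
    "host": "www1",
    "service": "httpd",
    "check_command": "check_http",
    "template": "generic-service",
})
print(config[("www1", "httpd")])
```

Keying the store by host and service means a "change" event simply overwrites the previous definition, and a "deactivation" removes it.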

Effort needed
[Diagram: Locaweb App, Standard Linux Service, Cegonha, Conductor Audit]
๏ Conductor Audit sends host information

Effort needed
๏ Dashboard
  ๏ Create a dashboard on top of the Simplemon Monitoring DB, making it possible to see the health of our environment
  ๏ Using the relations between services, it is possible to know the size of the crash in a crisis
[Diagram: Monitoring DB, Dashboard]

Effort needed
๏ Check MK
  ๏ Develop an interface to use leela instead of nagios pnp, and a new backend to collect perfdata and send it to leela
  ๏ Our performance data becomes highly available, and it becomes easy to replace monitoring machines without losing data
[Diagram: Check MK, perfdata, cagent, perfdata, leela. Labels: "read performance data and write to leela"; "replace pnp graphs using leela"]
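A backend that collects perfdata first has to parse it. The sketch below parses the standard Nagios plugin performance-data format (`'label'=value[UOM];warn;crit;min;max`). The function name and the returned dict layout are assumptions, and forwarding the result to leela is deliberately left out because the slides do not describe that interface.

```python
import re

# One perfdata item: optional quoted label, numeric value, optional unit,
# then up to four ';'-separated fields (warn, crit, min, max).
PERF_ITEM = re.compile(
    r"(?:'(?P<quoted>[^']+)'|(?P<bare>[^\s=']+))="
    r"(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[a-zA-Z%]*)"
    r"(?P<rest>(?:;[^;\s]*)*)"
)


def parse_perfdata(perfdata):
    """Parse a Nagios perfdata string into a list of metric dicts."""
    metrics = []
    for match in PERF_ITEM.finditer(perfdata):
        label = match.group("quoted") or match.group("bare")
        fields = match.group("rest").split(";")[1:]
        fields += [""] * (4 - len(fields))  # pad missing trailing fields
        warn, crit, minimum, maximum = fields[:4]
        metrics.append({
            "label": label,
            "value": float(match.group("value")),
            "uom": match.group("uom"),
            # Thresholds stay as strings because they may be ranges.
            "warn": warn,
            "crit": crit,
            "min": minimum,
            "max": maximum,
        })
    return metrics


sample = "load1=0.42;5;10;0 '/var usage'=71%;80;90;0;100"
for metric in parse_perfdata(sample):
    print(metric)
```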

Phase 1 categorization and configuration
๏ YAML
  ๏ Validate format
  ๏ Define thresholds for each service
  ๏ Create team templates
[Diagram: audit server, planet express, Configuration Agent, cfengine, nagios, check mk]
๏ YAML for: services, applications, base, override
๏ Threshold and base
๏ Talks with the server, discovers what needs checking and configures the agent
๏ Configuration Agent reads the audit URL feeds
๏ App YAML goes to the servers through cfengine, or apps create their own base monitoring
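The "validate format" step could be sketched as below. It assumes the YAML has already been parsed into a Python dict by a YAML loader, and it uses the field names that appear in the YAML examples later in this deck (the key `aplication-name` is spelled as on the slides). The specific rules are assumptions for illustration, not rules stated in the deck.

```python
def validate_app(doc):
    """Return a list of problems found in one parsed monitoring document."""
    errors = []

    # Key spelled as in the slides ("aplication-name").
    name = doc.get("aplication-name")
    if not isinstance(name, str) or not name:
        errors.append("aplication-name must be a non-empty string")

    # services is expected to be a list of single-key mappings.
    for entry in doc.get("services", []):
        for service, options in entry.items():
            options = options or {}
            sla = options.get("sla")
            if not isinstance(sla, bool):
                errors.append(f"{service}: sla must be true or false")
            elif sla and "sla-group" not in options:
                errors.append(f"{service}: sla-group is required when sla is true")

    metric = doc.get("sla-metric")
    if metric is not None and not 0 < metric <= 100:
        errors.append("sla-metric must be a percentage in (0, 100]")

    return errors


example = {
    "aplication-name": "mysqldb",
    "provides": "db-master",
    "requires": "dns db-slave",
    "services": [
        {"mysql": {"sla": True, "sla-group": "db-master"}},
        {"disk": {"sla": False}},
    ],
}
print(validate_app(example))
```

Returning a list of messages instead of raising on the first problem lets a team fix every issue in a file in one pass.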

Phase 1 categorization and configuration

Monitoring Relationships - lamp stack
[Diagram: produto, produto web, produto adm, dns, LB1, LB2 (standby), www1, www2, www4, DB1 (master), DB2 (slave)]

Monitoring Relationships - lamp stack
DB1 master

    aplication-name: mysqldb
    provides: db-master
    requires: dns db-slave
    services:
      - mysql:
          sla: true
          sla-group: db-master
      - disk:
          sla: true
          sla-group: db-master-disk
      - network:
          sla: true
          sla-group: db-master-network

Monitoring Relationships - lamp stack
DB2 slave

    aplication-name: mysqldb
    provides: db-slave
    requires: dns db-slave
    services:
      - mysql:
          sla: false
      - disk:
          sla: false
      - network:
          sla: false

Monitoring Relationships - lamp stack
www1

    aplication-name: apache
    provides: web
    requires: dns db-master
    services:
      - httpd:
          sla: true
          sla-group: apache
      - disk:
          sla: true
          sla-group: apache-disk
      - network:
          sla: true
          sla-group: apache-network

Monitoring Relationships - lamp stack
YAML :: produto web

    aplication-name: produto-web
    provides: produto-web
    requires: load-balancer
    sla-metric: 99.7

Monitoring Relationships - lamp stack
YAML :: produto adm

    aplication-name: produto-adm
    provides: produto-adm
    requires: web-adm
    sla-metric: 97.0
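If `sla-metric` is read as an availability percentage, the two values above translate into very different downtime budgets. The sketch below assumes a 30-day measurement window, which the slides do not state. The function name is illustrative.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # assumed measurement window


def allowed_downtime_minutes(sla_metric, window_minutes=MINUTES_PER_30_DAYS):
    """Minutes of downtime permitted by an availability percentage."""
    return round((1 - sla_metric / 100) * window_minutes, 1)


for app, metric in [("produto-web", 99.7), ("produto-adm", 97.0)]:
    print(app, metric, allowed_downtime_minutes(metric), "minutes")
```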

Monitoring Relationships - lamp stack
[Diagram: produto, produto web, produto adm, dns, LB1, LB2 (standby), www1, www2, www4, DB1 (master), DB2 (slave)]
YAML :: produto

    aplication-name: produto
    requires: produto-adm produto-web
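With `provides` and `requires` declared for each application, the size of a crash can be estimated by walking the dependency graph backwards. The sketch below uses the relations from the slides, keyed by host or product name. The `LB1` entry is hypothetical (the slides show no YAML for the load balancer) and is added only so the chain from `web` to `load-balancer` is connected. Redundancy such as www2 or the standby LB is ignored.

```python
from collections import deque

APPS = {
    "DB1": {"provides": ["db-master"], "requires": ["dns", "db-slave"]},
    "DB2": {"provides": ["db-slave"], "requires": ["dns", "db-slave"]},
    "www1": {"provides": ["web"], "requires": ["dns", "db-master"]},
    # Hypothetical entry, not in the slides.
    "LB1": {"provides": ["load-balancer"], "requires": ["web"]},
    "produto-web": {"provides": ["produto-web"], "requires": ["load-balancer"]},
    "produto-adm": {"provides": ["produto-adm"], "requires": ["web-adm"]},
    "produto": {"provides": [], "requires": ["produto-adm", "produto-web"]},
}


def impacted_by(failed_capability, apps):
    """Return the names of apps that directly or transitively require a capability."""
    impacted = set()
    seen = {failed_capability}
    queue = deque([failed_capability])
    while queue:
        capability = queue.popleft()
        for name, app in apps.items():
            if capability in app["requires"] and name not in impacted:
                impacted.add(name)
                # Whatever this app provides is now considered degraded too.
                for provided in app["provides"]:
                    if provided not in seen:
                        seen.add(provided)
                        queue.append(provided)
    return impacted


print(sorted(impacted_by("db-master", APPS)))
```

The `seen` set keeps the walk finite even when an entry requires something it also provides, as the DB2 YAML does on the slide.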

Monitoring Relationships - lamp stack
Store your files in: /etc/locaweb/monitoring/*.yaml
- no subdirs
- no file limit
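A consumer of that directory might collect the files as in the sketch below. The function name is illustrative, and parsing the files is left to whatever YAML loader is in use.

```python
import glob
import os

MONITORING_DIR = "/etc/locaweb/monitoring"


def list_monitoring_files(directory=MONITORING_DIR):
    """Return the sorted *.yaml files directly inside directory (no subdirectories)."""
    pattern = os.path.join(directory, "*.yaml")
    # A non-recursive glob never descends into subdirectories; isfile() also
    # skips any directory that happens to be named like "something.yaml".
    return sorted(path for path in glob.glob(pattern) if os.path.isfile(path))


for path in list_monitoring_files():
    print(path)
```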

The end... Questions?
Juliano Martinez, Francisco Freire
PS: The last 7 slides were shamelessly stolen from Vechiato
