Monitoring by Zabbix: discovering problems long before they turn into real disasters

Webconf Riga
November 15, 2015

It does not matter how many servers or business-critical applications you have. Monitoring can be your best friend when it comes to early problem detection and automatic resolution. Zabbix provides great functionality that helps to identify existing and potential issues efficiently. No more downtime!

Author: Alexei Vladishev
WebConf Riga 2015

Transcript

  1. Who am I? Alexei Vladishev, creator of Zabbix; CEO, Architect and Product Manager. Locations: Riga, Tokyo, New York. @avladishev
  2. What is Zabbix? Enterprise-level Free and Open Source monitoring solution
    Benefits of Zabbix
    • True Free software
    • All in one solution
    • Easy to maintain
    • Mature, high quality and reliable
    • De-facto monitoring solution in many countries
  4. Zabbix performance
    • 200,000 hosts
    • 5,000,000 metrics
    • 2,000,000 triggers
    • Minimum 5 TB history
    • 11,000 proxies
    • 20,000 checks per second
  5. More than 1,000,000 metrics and a database size of more than 1 TB
    #1: MySQL 80%
    #2: PostgreSQL 15%
    #3: Oracle & DB2 5%
  6. Methods of data collection
    Pull
    • Service checks: HTTP, SSH, IMAP, NTP, etc.
    • Passive agent
    • Script execution using SSH and Telnet
    Push
    • Active agent
    • Zabbix Trapper and SNMP Traps
    • Monitoring of log files and Windows event logs
  8. Triggers
    Example: {server:system.cpu.load.last()} > 5
    Operators: - + / * < > = <> <= >= or and not
    Functions: min max avg last count date time diff regexp and much more!
    Analyse everything: any metric and any host
    {node1:system.cpu.load.last()} > 5 and {node2:system.cpu.load.last()} > 5 and {nodes:tps.last()} > 5000
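
To make the multi-host expression above concrete, here is a minimal Python sketch of how such a condition could be evaluated. It is not how Zabbix is implemented; the `latest` dictionary, the `last` helper and the sample numbers are assumptions made up for illustration.

```python
# Latest collected value per (host, item key). The numbers are invented.
latest = {
    ("node1", "system.cpu.load"): 6.2,
    ("node2", "system.cpu.load"): 5.4,
    ("nodes", "tps"): 7200.0,
}

def last(host, key):
    """Return the most recent value, in the spirit of last() in a trigger."""
    return latest[(host, key)]

def cluster_overloaded():
    # Mirrors: {node1:system.cpu.load.last()} > 5 and
    #          {node2:system.cpu.load.last()} > 5 and {nodes:tps.last()} > 5000
    return (
        last("node1", "system.cpu.load") > 5
        and last("node2", "system.cpu.load") > 5
        and last("nodes", "tps") > 5000
    )
```

The point of the sketch is only that one condition can combine metrics from several hosts; all three comparisons must hold for the problem to fire.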
  11. [Chart: CPU load from 10:00 to 10:50, y-axis 0 to 10, with the trigger condition CPU load > 5]
  12. Too sensitive! [Chart: CPU load from 10:00 to 10:50, y-axis 0 to 10, with the trigger condition CPU load > 5; marked events: Problem (3 times) and Recovery (2 times)]
  13. HTTP check failed: {server:net.tcp.service[http].last()} = 0 [Chart: check result, 0 or 1, from 10:01 to 10:14]
  14. Too sensitive! HTTP check failed: {server:net.tcp.service[http].last()} = 0 [Chart: check result, 0 or 1, from 10:01 to 10:14; marked events: Problem (3 times) and Recovery (3 times)]
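
The flapping on the charts above can be reproduced with a few lines of Python. This is a sketch under assumptions: the function name and the sample values are invented, and the logic simply re-evaluates "last value > threshold" on every new sample.

```python
def events_for_last(samples, threshold=5.0):
    """Emit an event every time 'last value > threshold' changes state."""
    events = []
    in_problem = False
    for value in samples:
        now_problem = value > threshold
        if now_problem != in_problem:
            events.append("Problem" if now_problem else "Recovery")
            in_problem = now_problem
    return events

# CPU load hovering around the threshold (invented values).
noisy_load = [3, 6, 4, 7, 4.5, 8]
flapping = events_for_last(noisy_load)
```

With a load that oscillates around 5, every crossing produces a new event, which is exactly the "too sensitive" behaviour.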
  15. Properly define problem conditions and think carefully! What does it really mean when
    • a system is overloaded?
    • a server is down?
    • a service is not available?
  16. Take advantage of history
    System performance
    {server:system.cpu.load.min(10m)} > 5
    Service availability
    {server:net.tcp.service[http].max(5m)} = 0
    {server:net.tcp.service[http].max(#3)} = 0
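
As a rough illustration of why history helps, the sketch below mimics min(10m) and max(#3) in plain Python. It assumes samples are stored as time-ordered (timestamp, value) pairs; the helper names are hypothetical and this is not Zabbix code.

```python
from datetime import datetime, timedelta

def min_over(samples, now, window):
    """Smallest value with a timestamp in (now - window, now], or None."""
    recent = [value for ts, value in samples if now - window < ts <= now]
    return min(recent) if recent else None

def max_of_last(samples, count):
    """Largest of the last `count` values (fewer if history is shorter)."""
    recent = [value for _, value in samples[-count:]]
    return max(recent) if recent else None

def cpu_problem(samples, now):
    # Mirrors {server:system.cpu.load.min(10m)} > 5:
    # even the lowest value of the last 10 minutes is above 5.
    lowest = min_over(samples, now, timedelta(minutes=10))
    return lowest is not None and lowest > 5

def http_problem(samples):
    # Mirrors {server:net.tcp.service[http].max(#3)} = 0:
    # none of the last three checks succeeded.
    return max_of_last(samples, 3) == 0
```

A single spike no longer raises a problem, because the minimum over the window stays low; likewise one failed HTTP check is ignored until three in a row have failed.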
  18. Performance: {server:system.cpu.load.min(10m)} > 5 [Chart: CPU load from 10:00 to 11:10, y-axis 0 to 10; marked events: one Problem, one Recovery]
  19. Availability: {server:net.tcp.service[http].max(#3)} = 0 [Chart: check result from 10:01 to 10:15, y-axis 0 to 1; marked events: one Problem, one Recovery]
  20. A few examples
    • Problem: free disk space < 10%. No problem: free disk space = 10.001%
    • Problem: CPU load > 5. No problem: CPU load = 4.99. Resolved?
    • Problem: SSH check failed. No problem: SSH is up. Resolved?
  23. Different conditions for problem and recovery!
    Before: CPU load > 5
    Now: Problem: CPU load > 5, Recovery: CPU load < 1
  24. Hysteresis. Problem: CPU load > 5, Recovery: CPU load < 1 [Chart: CPU load from 10:00 to 10:50, y-axis 0 to 10; marked events: one Problem, one Recovery]
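
Hysteresis can be sketched as a small state machine with separate thresholds for entering and leaving the problem state. The function name and the sample values below are assumptions for illustration; the thresholds 5 and 1 come from the slide.

```python
def events_with_hysteresis(samples, problem_above=5.0, recover_below=1.0):
    """Problem when a value exceeds problem_above; Recovery only when a
    later value drops below recover_below."""
    events = []
    in_problem = False
    for value in samples:
        if not in_problem and value > problem_above:
            in_problem = True
            events.append("Problem")
        elif in_problem and value < recover_below:
            in_problem = False
            events.append("Recovery")
    return events

# Load oscillating around 5, then finally calming down (invented values).
load = [3, 6, 4, 7, 4.5, 8, 2, 0.5]
calm = events_with_hysteresis(load)
```

Values between 1 and 5 change nothing while a problem is open, so the oscillation around 5 yields one Problem and one Recovery instead of a stream of events.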
  25. A few examples
    • Problem if CPU load > 3 for the last 5 minutes. Recovery if CPU load < 1 for the last 2 minutes
    • Problem if Free disk space < 10%. Recovery if Free disk space > 30% for the last 15 minutes
    • Problem if 3 consecutive checks of REST service failed. Recovery if 10 consecutive checks of REST service are OK
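
The REST service rule above (3 consecutive failures to open a problem, 10 consecutive successes to close it) can be sketched by counting streaks. This is an illustrative Python sketch, not Zabbix syntax; the function name and the check sequence are invented.

```python
def rest_service_events(checks, fail_needed=3, ok_needed=10):
    """checks: iterable of booleans, True meaning the check passed."""
    events = []
    in_problem = False
    streak_value = None  # result shared by the current run of checks
    streak_len = 0       # length of that run
    for ok in checks:
        if ok == streak_value:
            streak_len += 1
        else:
            streak_value, streak_len = ok, 1
        if not in_problem and not ok and streak_len >= fail_needed:
            in_problem = True
            events.append("Problem")
        elif in_problem and ok and streak_len >= ok_needed:
            in_problem = False
            events.append("Recovery")
    return events

history = (
    [True, False, False, True]   # two failures only: ignored
    + [False] * 3                # three in a row: Problem
    + [True] * 9 + [False]       # nine OK, then a failure: still a problem
    + [True] * 10                # ten OK in a row: Recovery
)
rest_events = rest_service_events(history)
```

The asymmetry is deliberate: it is quick to declare a problem and slow to declare recovery, so a service that briefly answers again is not reported as fixed.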
  28. Average CPU load for the last hour is more than twice the average CPU load for the same period a week ago
    {server:system.cpu.load.avg(1h)} > 2 * {server:system.cpu.load.avg(1h,7d)}
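
The week-over-week comparison can be sketched as two averages over windows of the same length, one ending now and one ending seven days earlier. The code assumes samples are (timestamp, value) pairs; `avg_over` and `is_anomaly` are hypothetical names, not Zabbix functions.

```python
from datetime import datetime, timedelta

def avg_over(samples, end, window):
    """Average of values with a timestamp in (end - window, end], or None."""
    values = [value for ts, value in samples if end - window < ts <= end]
    return sum(values) / len(values) if values else None

def is_anomaly(samples, now, window=timedelta(hours=1),
               shift=timedelta(days=7), factor=2.0):
    # Mirrors avg(1h) > 2 * avg(1h,7d): compare the last hour with the
    # same hour one week earlier.
    current = avg_over(samples, now, window)
    baseline = avg_over(samples, now - shift, window)
    if current is None or baseline is None:
        return False  # not enough history to compare
    return current > factor * baseline
```

Comparing against the same period a week ago, rather than a fixed threshold, lets the baseline follow the usual weekly pattern of the system.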
  29. Anomaly: compare with last week [Chart: values from 10:00 to 11:10, y-axis 0 to 10]
  31. Forecasting the future: when a value will be reached, and the value after a period of time [Chart: values from 7:00 to 21:00, y-axis 0 to 50, with trend line y = -2.9455x + 48.309]
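
Both forecasting questions can be answered from a fitted straight line. The sketch below uses an ordinary least-squares fit in plain Python; the function names are assumptions, and the slide does not state the unit of x, so the results are in whatever unit the chart's x-axis uses.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def value_after(slope, intercept, x):
    """Predicted value at position x on the trend line."""
    return slope * x + intercept

def x_when_reaches(slope, intercept, target):
    """Position at which the trend line hits target, or None if flat."""
    if slope == 0:
        return None
    return (target - intercept) / slope

# Trend line from the slide: y = -2.9455x + 48.309
slope, intercept = -2.9455, 48.309
x_at_zero = x_when_reaches(slope, intercept, 0)   # about 16.4
y_at_ten = value_after(slope, intercept, 10)      # about 18.85
```

A linear trend is the simplest possible model; it is used here only because the slide shows a straight trend line.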
  32. Possible reactions
    • Sending alerts to user and user group
    • Instructions on how to fix it
    • Automatic problem resolution
    • Opening tickets in Helpdesk systems
  35. Example [Timeline from 0 min to 20 min in 5-minute steps: Critical problem, Repeated Email, SMS and ticket, Service restart, SMS to manager]
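
An escalation like the one on the timeline can be sketched as a list of (delay, action) steps. Note that the transcript does not say which action belongs to which minute, so the mapping in `STEPS` below is purely an assumption for illustration, as are the names.

```python
# (minutes since the problem started, action). The assignment of actions
# to minutes is ASSUMED; the slide transcript does not preserve it.
STEPS = [
    (0, "Email"),
    (5, "Repeated email"),
    (10, "SMS and ticket"),
    (15, "Service restart"),
    (20, "SMS to manager"),
]

def actions_due(steps, minutes_unresolved):
    """Actions whose delay has elapsed while the problem is still open."""
    return [action for delay, action in steps if delay <= minutes_unresolved]

after_12_minutes = actions_due(STEPS, 12)
```

The longer a problem stays unresolved, the further down the list the reaction goes, from notifying people to attempting an automatic fix and finally involving management.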
  36. Summary
    • Be smart about problem detection
    • No problem != solution: use different conditions for problem and recovery
    • Predict problems: anomaly detection & forecasting
    • Resolve common problems automatically
    • Do not be afraid to escalate!
  37. Join our team in Riga, Tokyo and New York to build open source (free) software!