Monitoring by Zabbix: discovering problems long before they turn into real disasters

Zabbix: detect problems before end users notice them

Who am I?
Alexei Vladishev (@avladishev)
• Creator of Zabbix
• CEO, Architect and Product Manager
• Locations: Riga, Tokyo, New York

My talk
• What is Zabbix?
• Large scale monitoring
• Basic problem detection
• Advanced problem detection

What is Zabbix?
Enterprise level Free and Open Source monitoring solution
Benefits of Zabbix
• True Free software
• All in one solution
• Easy to maintain
• Mature, high quality and reliable
• De-facto monitoring solution in many countries

Large scale monitoring

11 terabits of outgoing traffic, 80 points of presence

25,000 hosts
• 6,000,000 metrics
• 3,000,000 triggers
• 90 proxies
• Zabbix performance: 7,510 checks per second

60,000 hosts
• 2,000,000 metrics
• 20,000,000 triggers
• 6 TB history
• 40 proxies
• Zabbix performance: 21,000 checks per second

200,000 hosts
• 5,000,000 metrics
• 2,000,000 triggers
• Minimum 5 TB history
• 11,000 proxies
• Zabbix performance: 20,000 checks per second

Why do they use Zabbix?

Because their business depends on IT services

More than 1,000,000 metrics and database size is more than 1 TB
• #1: MySQL 80 %
• #2: PostgreSQL 15 %
• #3: Oracle & DB2 5 %

How Zabbix works (diagram: Zabbix and its database)
• Data collection
• History
• Analysis
• Visualisation
• Notifications

Data collection: availability, performance, integrity, environmental checks, KPI & SLA

Methods of data collection
Pull
• Service checks: HTTP, SSH, IMAP, NTP, etc.
• Passive agent
• Script execution using SSH and Telnet
Push
• Active agent
• Zabbix Trapper and SNMP Traps
• Monitoring of log files and Windows event logs

Push vs Pull

How to detect problems in this data ﬂow?

Triggers!

A trigger is a problem condition

Triggers
Example: {server:system.cpu.load.last()} > 5
Operators: - + / * < > = <> <= >= or and not
Functions: min max avg last count date time diff regexp and much more!
Analyse everything: any metric and any host
{node1:system.cpu.load.last()} > 5 and {node2:system.cpu.load.last()} > 5 and {nodes:tps.last()} > 5000
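
A minimal Python sketch of how the combined condition above could be evaluated, assuming the latest values are already available in a plain dictionary. The `latest` mapping and the `cluster_overloaded` function are hypothetical illustrations, not part of Zabbix.

```python
# Hypothetical snapshot of the latest collected values, keyed by (host, item).
latest = {
    ("node1", "system.cpu.load"): 6.2,
    ("node2", "system.cpu.load"): 5.4,
    ("nodes", "tps"): 5200,
}


def cluster_overloaded(values):
    """Mirror the slide's condition: both nodes loaded and TPS above 5000."""
    return (
        values[("node1", "system.cpu.load")] > 5
        and values[("node2", "system.cpu.load")] > 5
        and values[("nodes", "tps")] > 5000
    )


print(cluster_overloaded(latest))
```

The point of the sketch is that one condition can combine any metrics from any hosts; all three comparisons must hold before a problem is raised.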

Junior level: system is overloaded
CPU load > 5

Too sensitive!
CPU load > 5
[Chart: CPU load on a 0 to 10 scale, 10:00 to 10:50; three Problem and two Recovery events marked]

Junior level: web server is down
HTTP check failed

Too sensitive!
HTTP check failed: {server:net.tcp.service[http].last()} = 0
[Chart: HTTP check result (0 or 1), 10:01 to 10:14; three Problem and three Recovery events marked]

Too sensitive leads to false positives

No trust!

How to get rid of false positives?

Properly define problem conditions and think carefully!
What does it really mean that:
• a system is overloaded?
• a server is down?
• a service is not available?

Take advantage of history
System performance: {server:system.cpu.load.min(10m)} > 5
Service availability: {server:net.tcp.service[http].max(5m)} = 0
{server:net.tcp.service[http].max(#3)} = 0
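
A rough Python sketch of the idea behind these expressions: judge a window of history, not the last value alone. The helper names, the sample data and the inclusive window boundaries are assumptions made for illustration; this is not how Zabbix implements `min()` or `max()`.

```python
def min_over_window(samples, now, window_seconds):
    """Smallest value among samples with a timestamp inside the window."""
    recent = [value for ts, value in samples if now - window_seconds <= ts <= now]
    return min(recent) if recent else None


def max_of_last(samples, count):
    """Largest value among the last `count` samples, in the spirit of max(#3)."""
    recent = [value for _, value in samples[-count:]]
    return max(recent) if recent else None


# Hypothetical CPU load sampled once a minute: (timestamp in seconds, value).
cpu = [(t * 60, load) for t, load in enumerate([6.1, 7.0, 4.2, 6.5, 6.8])]
# One dip to 4.2 inside the 10-minute window keeps the condition false.
cpu_problem = min_over_window(cpu, now=240, window_seconds=600) > 5

# Hypothetical HTTP check results: 1 = up, 0 = down.
http = [(0, 1), (60, 0), (120, 0), (180, 0)]
# All of the last three checks failed, so the maximum is 0.
http_problem = max_of_last(http, 3) == 0

print(cpu_problem, http_problem)
```

Using the minimum for load and the maximum for availability means a single spike or a single failed check cannot raise a problem on its own.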

Performance
{server:system.cpu.load.min(10m)} > 5
[Chart: CPU load on a 0 to 10 scale, 10:00 to 11:10; one Problem and one Recovery marked]

Availability
{server:net.tcp.service[http].max(#3)} = 0
[Chart: HTTP check result on a 0 to 1 scale, 10:01 to 10:15; one Problem and one Recovery marked]

Problem disappeared != problem is resolved

A few examples
• Problem: free disk space < 10%. No problem: free disk space = 10.001%
• Problem: CPU load > 5. No problem: CPU load = 4.99. Resolved?
• Problem: SSH check failed. No problem: SSH is up. Resolved?

Solution: different conditions for problem and recovery!
Before: CPU load > 5
Now:
• Problem: CPU load > 5
• Recovery: CPU load < 1

Hysteresis
Problem: CPU load > 5
Recovery: CPU load < 1
[Chart: CPU load on a 0 to 10 scale, 10:00 to 10:50; one Problem and one Recovery marked]
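
A small Python sketch of hysteresis as described above, assuming the thresholds from the slide (problem above 5, recovery below 1). The function name and the sample load values are hypothetical.

```python
def apply_hysteresis(values, problem_above=5.0, recover_below=1.0):
    """Return the trigger state after each value: True means PROBLEM."""
    in_problem = False
    states = []
    for value in values:
        if not in_problem and value > problem_above:
            in_problem = True   # problem condition met
        elif in_problem and value < recover_below:
            in_problem = False  # separate, stricter recovery condition met
        states.append(in_problem)
    return states


loads = [2.0, 6.0, 4.5, 5.5, 3.0, 0.8, 2.0]
print(apply_hysteresis(loads))
# → [False, True, True, True, True, False, False]
```

With a single threshold of 5 the same data would report two separate problems (at 6.0 and at 5.5); with hysteresis it is one problem that ends only when the load drops below 1.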

Finally we can trust alerts!

A few examples
• Problem if CPU load > 3 for the last 5 minutes
  Recovery if CPU load < 1 for the last 2 minutes
• Problem if free disk space < 10%
  Recovery if free disk space > 30% for the last 15 minutes
• Problem if 3 consecutive checks of REST service failed
  Recovery if 10 consecutive checks of REST service are OK
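
The REST service example above can be sketched in Python as a streak counter. This is an illustration only; `track_service` and its parameters are hypothetical names, and the check results are made up.

```python
def track_service(results, fail_needed=3, ok_needed=10):
    """results: booleans, True = check passed. Returns states, True = PROBLEM."""
    in_problem = False
    streak_value = None  # outcome of the current run of identical results
    streak_len = 0
    states = []
    for ok in results:
        if ok == streak_value:
            streak_len += 1
        else:
            streak_value, streak_len = ok, 1
        if not in_problem and not ok and streak_len >= fail_needed:
            in_problem = True
        elif in_problem and ok and streak_len >= ok_needed:
            in_problem = False
        states.append(in_problem)
    return states


# One pass, three failures, nine passes, one failure, ten passes.
checks = [True] + [False] * 3 + [True] * 9 + [False] + [True] * 10
states = track_service(checks)
print(states[3], states[12], states[23])
```

Nine good checks are not enough to recover here: the single failure resets the streak, and the problem closes only after ten good checks in a row.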

Anomalies

How to detect? Compare with a norm, where the norm is the system state in the past.

Average CPU load for the last hour is 2x higher than the CPU load for the same period a week ago:
{server:system.cpu.load.avg(1h)} > 2 * {server:system.cpu.load.avg(1h,7d)}
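
A minimal Python sketch of this comparison, assuming samples are stored as (timestamp in seconds, value) pairs. The function names and the half-open time ranges are assumptions for illustration, not Zabbix internals.

```python
from statistics import mean


def avg_in_range(samples, start, end):
    """Average of values with start <= timestamp < end, or None if empty."""
    values = [value for ts, value in samples if start <= ts < end]
    return mean(values) if values else None


def is_anomaly(samples, now, period=3600, shift=7 * 24 * 3600, factor=2.0):
    """Compare the last `period` seconds with the same period `shift` seconds earlier."""
    current = avg_in_range(samples, now - period, now)
    baseline = avg_in_range(samples, now - shift - period, now - shift)
    if current is None or baseline is None:
        return False  # not enough history to decide
    return current > factor * baseline


WEEK = 7 * 24 * 3600
now = WEEK + 3600
# Hypothetical data: load 1.0 during the hour a week ago, 2.5 during the last hour.
samples = [(t, 1.0) for t in range(0, 3600, 600)]
samples += [(WEEK + t, 2.5) for t in range(0, 3600, 600)]
print(is_anomaly(samples, now))
```

Returning False when either window is empty is a design choice made for the sketch: without a baseline there is no norm to compare against.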

Anomaly
[Chart: values on a 0 to 10 scale, 10:00 to 11:10, compared with last week]

Forecasting
[Chart: values on a 0 to 50 scale, 7:00 to 21:00, with trend line y = -2.9455x + 48.309 extended into the future]
Forecast: when, and the value after a period of time
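
A short Python sketch of linear forecasting in the spirit of the trend line above: fit a straight line to past values, then ask what the value will be later and when a threshold will be reached. The data points, units and function names are hypothetical, and an ordinary least-squares fit is assumed here; the slides do not say how the trend line was computed.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def value_after(slope, intercept, x):
    """Predicted value at position x on the fitted line."""
    return slope * x + intercept


def x_when(slope, intercept, threshold):
    """x at which the line reaches the threshold, or None for a flat line."""
    if slope == 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical free disk space in percent, measured once per hour.
points = [(0, 48.0), (1, 46.0), (2, 44.0), (3, 42.0)]
slope, intercept = fit_line(points)
print(value_after(slope, intercept, 10))  # predicted value at hour 10
print(x_when(slope, intercept, 10.0))     # hour at which 10% would be reached
```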

How to react to problems?

Possible reactions
• Sending alerts to user and user group
• Instructions on how to fix it
• Automatic problem resolution
• Opening tickets in Helpdesk systems

Escalate!
• Repeated notifications
• Escalation to a new level
• Notification if automatic action failed

Example: critical problem (timeline 0 to 20 min)
• 0 min: repeated email
• 5 min: SMS and ticket
• 10 min: service restart
• 15 min: SMS to manager
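
The escalation example above can be sketched in Python as a list of steps with delays. The pairing of actions with the 0, 5, 10 and 15 minute marks is an assumption read off the slide's timeline, and `ESCALATION_STEPS` and `actions_due` are hypothetical names.

```python
# Assumed escalation plan: (minutes since the problem started, action).
ESCALATION_STEPS = [
    (0, "repeated email"),
    (5, "SMS and ticket"),
    (10, "service restart"),
    (15, "SMS to manager"),
]


def actions_due(minutes_unresolved, steps=ESCALATION_STEPS):
    """Actions whose delay has elapsed while the problem is still open."""
    return [action for delay, action in steps if delay <= minutes_unresolved]


print(actions_due(12))
```

A problem that recovers after 7 minutes never reaches the service restart or the manager; only problems that stay open climb the ladder.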

Monitoring is a must-have for business-critical environments

Summary
• Be smart about problem detection
• No problem != solution: use different conditions for problem and recovery
• Predict problems: anomaly detection & forecasting
• Resolve common problems automatically
• Do not be afraid to escalate!

We are looking for good front/back-end developers, support engineers and
technical marketing guys!

Join our team in Riga, Tokyo and New York to
build open source (free) software!

Thank you! twitter.com/zabbix
