Monitoring by Zabbix: discovering problems long before they turn into real disasters

Slide 1

Slide 1 text

Zabbix: detect problems before end users notice them

Slide 2

Slide 2 text

Who am I?
Alexei Vladishev
Creator of Zabbix
CEO, Architect and Product Manager
Locations: Riga, Tokyo, New York
@avladishev

Slide 3

Slide 3 text

My talk
• What is Zabbix?
• Large scale monitoring
• Basic problem detection
• Advanced problem detection

Slide 4

Slide 4 text

What is Zabbix?
Enterprise level Free and Open Source monitoring solution
Benefits of Zabbix
• True Free software
• All in one solution
• Easy to maintain
• Mature, high quality and reliable
• De-facto monitoring solution in many countries

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Large scale monitoring

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

11 terabits of outgoing traffic
80 points of presence

Slide 10

Slide 10 text

25,000 hosts

Slide 11

Slide 11 text

25,000 hosts
6,000,000 metrics
3,000,000 triggers
90 proxies
Zabbix performance: 7,510 checks per second

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

60,000 hosts

Slide 14

Slide 14 text

60,000 hosts
2,000,000 metrics
20,000,000 triggers
6 TB history
40 proxies
Zabbix performance: 21,000 checks per second

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

200,000 hosts

Slide 17

Slide 17 text

200,000 hosts
5,000,000 metrics
2,000,000 triggers
Minimum 5 TB history
11,000 proxies
Zabbix performance: 20,000 checks per second

Slide 18

Slide 18 text

Why do they use Zabbix?

Slide 19

Slide 19 text

Because their business depends on IT services

Slide 20

Slide 20 text

More than 1,000,000 metrics and database size is more than 1 TB
#1:
#2:
#3: Oracle & DB2 5%

Slide 21

Slide 21 text

More than 1,000,000 metrics and database size is more than 1 TB
#1: MySQL 80%
#2: PostgreSQL 15%
#3: Oracle & DB2 5%

Slide 22

Slide 22 text

How Zabbix works DATABASE ZABBIX History Data collection

Slide 23

Slide 23 text

How Zabbix works DATABASE ZABBIX Visualisation History Analysis Data collection

Slide 24

Slide 24 text

How Zabbix works DATABASE ZABBIX Visualisation History Analysis Data collection Notifications

Slide 25

Slide 25 text

Data collection
Availability, performance, integrity, environmental checks, KPI & SLA

Slide 26

Slide 26 text

Methods of data collection

Slide 27

Slide 27 text

Methods of data collection
Pull
• Service checks: HTTP, SSH, IMAP, NTP, etc.
• Passive agent
• Script execution using SSH and Telnet
Push
• Active agent
• Zabbix Trapper and SNMP Traps
• Monitoring of log files and Windows event logs

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Push vs Pull

Slide 30

Slide 30 text

Push vs Pull

Slide 31

Slide 31 text

How to detect problems in this data flow?

Slide 32

Slide 32 text

Triggers!

Slide 33

Slide 33 text

A trigger is a problem condition

Slide 34

Slide 34 text

Triggers
Example
{server:system.cpu.load.last()} > 5
Operators
- + / * < > = <> <= >= or and not
Functions
min max avg last count date time diff regexp and much more!
Analyse everything: any metric and any host
{node1:system.cpu.load.last()} > 5 and {node2:system.cpu.load.last()} > 5 and {nodes:tps.last()} > 5000

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Junior level
System is overloaded
CPU load > 5

Slide 38

Slide 38 text

[Chart: CPU load from 0 to 10, 10:00 to 10:50]
CPU load > 5

Slide 39

Slide 39 text

Too sensitive!
[Chart: CPU load from 0 to 10, 10:00 to 10:50, marked with three Problem and two Recovery events]
CPU load > 5

Slide 40

Slide 40 text

Junior level
Web server is down
HTTP check failed

Slide 41

Slide 41 text

[Chart: HTTP check result (0 or 1), 10:01 to 10:14]
{server:net.tcp.service[http].last()} = 0
HTTP check failed

Slide 42

Slide 42 text

Too sensitive!
[Chart: HTTP check result (0 or 1), 10:01 to 10:14, marked with three Problem and three Recovery events]
{server:net.tcp.service[http].last()} = 0
HTTP check failed

Slide 43

Slide 43 text

Too sensitive leads to false positives

Slide 44

Slide 44 text

No trust!

Slide 45

Slide 45 text

How to get rid of false positives?

Slide 46

Slide 46 text

Properly define problem conditions and think carefully!

Slide 47

Slide 47 text

Properly define problem conditions and think carefully!
What does it really mean?
• system is overloaded
• a server is down
• a service is not available

Slide 48

Slide 48 text

Take advantage of history
System performance
{server:system.cpu.load.min(10m)} > 5
Service availability
{server:net.tcp.service[http].max(5m)} = 0
{server:net.tcp.service[http].max(#3)} = 0
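The effect of looking at history instead of only the last value can be sketched outside Zabbix. The following Python snippet is only an illustration under stated assumptions: the function names and the sample data are hypothetical, and the window is counted in samples rather than minutes. It compares a "last value > 5" condition with a "minimum over the window > 5" condition and counts how often each one changes state.

```python
from collections import deque


def naive_trigger(values, threshold):
    """Problem whenever the latest value exceeds the threshold."""
    return [v > threshold for v in values]


def min_window_trigger(values, threshold, window):
    """Problem only when every one of the last `window` samples exceeds the threshold.

    Assumption: the condition stays False until a full window has been collected.
    """
    recent = deque(maxlen=window)
    states = []
    for v in values:
        recent.append(v)
        states.append(len(recent) == window and min(recent) > threshold)
    return states


def count_flips(states):
    """Number of Problem/Recovery transitions in a list of states."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Hypothetical CPU load samples with short spikes and one sustained overload.
cpu_load = [2, 6, 3, 7, 2, 6, 6, 6, 6, 2, 1]
naive = naive_trigger(cpu_load, 5)
smoothed = min_window_trigger(cpu_load, 5, 3)

print("naive flips:", count_flips(naive))
print("windowed flips:", count_flips(smoothed))
```

With this sample data the naive condition changes state on every short spike, while the windowed condition reports only the sustained overload.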

Slide 50

Slide 50 text

Performance
[Chart: CPU load from 0 to 10, 10:00 to 11:10, marked with one Problem and one Recovery event]
{server:system.cpu.load.min(10m)} > 5

Slide 51

Slide 51 text

Availability
[Chart: HTTP check result from 0 to 1, 10:01 to 10:15, marked with one Problem and one Recovery event]
{server:net.tcp.service[http].max(#3)} = 0

Slide 52

Slide 52 text

Problem disappeared != problem is resolved

Slide 53

Slide 53 text

A few examples
Problem: free disk space < 10%
No problem: free disk space = 10.001%
Problem: CPU load > 5
No problem: CPU load = 4.99
Resolved?
Problem: SSH check failed
No problem: SSH is up
Resolved?

Slide 56

Slide 56 text

Solution
Before
CPU load > 5
Now
Problem: CPU load > 5
Recovery: CPU load < 1

Slide 57

Slide 57 text

Different conditions for problem and recovery!
Before
CPU load > 5
Now
Problem: CPU load > 5
Recovery: CPU load < 1

Slide 58

Slide 58 text

Hysteresis
[Chart: CPU load from 0 to 10, 10:00 to 10:50, marked with one Problem and one Recovery event]
Problem: CPU load > 5
Recovery: CPU load < 1
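Hysteresis with separate problem and recovery conditions can be sketched as a small state machine. This is a Python illustration only, not Zabbix code; the function name and the sample values are assumptions made for the example.

```python
def hysteresis_states(values, problem_above, recover_below):
    """Track the problem state using different thresholds for problem and recovery.

    The state switches to Problem when a value rises above `problem_above`
    and switches back only when a value drops below `recover_below`.
    """
    in_problem = False
    states = []
    for v in values:
        if not in_problem and v > problem_above:
            in_problem = True
        elif in_problem and v < recover_below:
            in_problem = False
        states.append(in_problem)
    return states


# Hypothetical CPU load samples: the load dips to 4.99 and 3 but does not recover.
cpu_load = [0.5, 6, 4.99, 3, 5.5, 2, 0.8, 0.5]
states = hysteresis_states(cpu_load, problem_above=5, recover_below=1)
print(states)
```

A load of 4.99 no longer counts as a recovery: the problem stays open until the load falls below 1.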

Slide 59

Slide 59 text

Finally we can trust alerts!

Slide 60

Slide 60 text

A few examples
Problem if CPU load > 3 for the last 5 minutes
Recovery if CPU load < 1 for the last 2 minutes
Problem if free disk space < 10%
Recovery if free disk space > 30% for the last 15 minutes
Problem if 3 consecutive checks of REST service failed
Recovery if 10 consecutive checks of REST service are OK
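The last example on this slide (problem after 3 consecutive failed checks, recovery after 10 consecutive successful checks) can be sketched in Python as follows. The function name, parameters and check sequence are hypothetical and serve only to show the counting logic.

```python
def consecutive_check_states(results, fail_needed=3, ok_needed=10):
    """Problem after `fail_needed` consecutive failures,
    recovery after `ok_needed` consecutive successes.

    `results` is a sequence of booleans: True for an OK check, False for a failure.
    """
    in_problem = False
    fails = 0
    oks = 0
    states = []
    for ok in results:
        if ok:
            oks += 1
            fails = 0
        else:
            fails += 1
            oks = 0
        if not in_problem and fails >= fail_needed:
            in_problem = True
        elif in_problem and oks >= ok_needed:
            in_problem = False
        states.append(in_problem)
    return states


# Hypothetical check history: one isolated failure, three failures in a row,
# nine OK checks interrupted by a failure, then ten OK checks in a row.
checks = [True, False, True, False, False, False] + [True] * 9 + [False] + [True] * 10
states = consecutive_check_states(checks)
print(states.index(True), len(states) - 1, states[-1])
```

The isolated failure is ignored, and nine good checks followed by another failure are not enough to close the problem.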

Slide 62

Slide 62 text

Slide 63

Slide 63 text

Anomalies

Slide 64

Slide 64 text

How to detect? Compare with a norm, where the norm is the system state in the past.

Slide 65

Slide 65 text

Average CPU load for the last hour is 2x higher than the CPU load for the same period a week ago
{server:system.cpu.load.avg(1h)} > 2 * {server:system.cpu.load.avg(1h,7d)}
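The comparison with the same period a week ago can be sketched in Python. This is an illustration under assumptions, not how Zabbix evaluates the expression: samples are taken to be (timestamp in seconds, value) pairs, and all names and data below are hypothetical.

```python
from statistics import mean

HOUR = 3600
WEEK = 7 * 24 * HOUR


def window_avg(samples, end, length):
    """Average of the values whose timestamp lies in (end - length, end]."""
    values = [v for t, v in samples if end - length < t <= end]
    return mean(values) if values else None


def is_anomaly(samples, now, window=HOUR, shift=WEEK, factor=2.0):
    """True if the current window average exceeds `factor` times
    the average of the same window `shift` seconds earlier."""
    current = window_avg(samples, now, window)
    baseline = window_avg(samples, now - shift, window)
    if current is None or baseline is None:
        return False  # assumption: no data means no anomaly is reported
    return current > factor * baseline


now = 2 * WEEK
last_week = [(now - WEEK - 1800, 1.0), (now - WEEK - 600, 1.2)]
spike = is_anomaly(last_week + [(now - 1800, 2.5), (now - 600, 3.0)], now)
normal = is_anomaly(last_week + [(now - 1800, 1.5), (now - 600, 2.0)], now)
print(spike, normal)
```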

Slide 66

Slide 66 text

Anomaly
[Chart: values from 0 to 10, 10:00 to 11:10, compared with last week]

Slide 67

Slide 67 text

Forecasting
[Chart: values from 0 to 50, 7:00 to 21:00, with trend line y = -2.9455x + 48.309]

Slide 68

Slide 68 text

Forecasting
[Chart: values from 0 to 50, 7:00 to 21:00, with trend line y = -2.9455x + 48.309 extended into the future]
When a threshold will be reached, and the value after a period of time
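The two forecasting questions on this slide (what the value will be after some time, and when a threshold will be reached) can be sketched with a least-squares line fit. This is a Python illustration with hypothetical function names and data; it does not claim to reproduce the fitting method Zabbix itself uses.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(slope, intercept, x):
    """Predicted value at time x."""
    return slope * x + intercept


def time_until(slope, intercept, threshold, now_x):
    """Time from now_x until the line reaches the threshold, or None if it never will."""
    if slope == 0:
        return None
    x_hit = (threshold - intercept) / slope
    return x_hit - now_x if x_hit > now_x else None


# Hypothetical data: free disk space in GB, sampled once per hour.
hours = [0, 1, 2, 3, 4]
free_gb = [48, 45, 42, 39, 36]
slope, intercept = fit_line(hours, free_gb)
in_two_hours = forecast(slope, intercept, 6)
hours_left = time_until(slope, intercept, threshold=10, now_x=4)
print(slope, intercept, in_two_hours, hours_left)
```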

Slide 69

Slide 69 text

How to react to problems?

Slide 70

Slide 70 text

Possible reactions
• Sending alerts to user and user group
• Instructions on how to fix it
• Automatic problem resolution
• Opening tickets in Helpdesk systems

Slide 73

Slide 73 text

Escalate!
Repeated notifications
Escalation to a new level
Notification if automatic action failed

Slide 74

Slide 74 text

Example: Critical problem
[Escalation timeline with marks at 0 min and 5 min. Steps: Repeated Email; SMS and ticket]

Slide 75

Slide 75 text

Example: Critical problem
[Escalation timeline with marks at 0, 5, 10 and 15 min. Steps: Repeated Email; SMS and ticket; Service restart; SMS to manager]

Slide 76

Slide 76 text

Example: Critical problem
[Escalation timeline with marks at 0, 5, 10, 15 and 20 min. Steps: Repeated Email; SMS and ticket; Service restart; SMS to manager]

Slide 77

Slide 77 text

Monitoring is a must-have for business-critical environments

Slide 78

Slide 78 text

Summary
• Be smart about problem detection
• No problem != solution: use different conditions for problem and recovery
• Predict problems: anomaly detection & forecasting
• Resolve common problems automatically
• Do not be afraid to escalate!

Slide 79

Slide 79 text

We are looking for good front/back-end developers, support engineers and technical marketing guys!