Slide 1

Slide 1 text

Monitorama PDX 2014 Workshop

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

What is flapjack?

Slide 4

Slide 4 text

Monitoring alert routing system

Slide 5

Slide 5 text

Composable

Slide 6

Slide 6 text

Rollup

Slide 7

Slide 7 text

Alert routing

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

• event

Slide 10

Slide 10 text

• event • ↪ notify?

Slide 11

Slide 11 text

• event • ↪ notify? • ↪ who?

Slide 12

Slide 12 text

• event • ↪ notify? • ↪ who? • ↪ how?

Slide 13

Slide 13 text

API driven

Slide 14

Slide 14 text

No restarts required

Slide 15

Slide 15 text

Developed + used in production at:

Slide 16

Slide 16 text

Developed + used in production at:
Developers: Ali Graham, Jesse Reynolds
Project manager: Lindsay Holmwood

Slide 17

Slide 17 text

+

Slide 18

Slide 18 text

Designed for humans

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

Why flapjack?

Slide 22

Slide 22 text

Specific use cases

Slide 23

Slide 23 text

Large numbers of potential contacts

Slide 24

Slide 24 text

Multi-tenant

Slide 25

Slide 25 text

Segregated responsibility

Slide 26

Slide 26 text

Works well with others

Slide 27

Slide 27 text

Check engine independence

Slide 28

Slide 28 text

Single pane of glass

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

Getting started with flapjack

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

• vagrant box add precise64 precise64.box

Slide 33

Slide 33 text

• vagrant box add precise64 precise64.box
• git clone https://github.com/flpjck/vagrant-flapjack.git

Slide 34

Slide 34 text

• vagrant box add precise64 precise64.box
• git clone https://github.com/flpjck/vagrant-flapjack.git
• vagrant up

Slide 35

Slide 35 text

• vagrant box add precise64 precise64.box
• git clone https://github.com/flpjck/vagrant-flapjack.git
• vagrant up
• # Flapjack web UI
• open http://localhost:3080

Slide 36

Slide 36 text

• vagrant box add precise64 precise64.box
• git clone https://github.com/flpjck/vagrant-flapjack.git
• vagrant up
• # Flapjack web UI
• open http://localhost:3080
• # Check execution engines
• open http://localhost:3083/nagios3
• open http://localhost:3083/icinga

Slide 37

Slide 37 text

Simulate a failing check

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

• simulate-failed-check fail-and-recover \

Slide 40

Slide 40 text

• simulate-failed-check fail-and-recover \
•   --entity foo-app-01.example.com \

Slide 41

Slide 41 text

• simulate-failed-check fail-and-recover \
•   --entity foo-app-01.example.com \
•   --check Sausage \

Slide 42

Slide 42 text

• simulate-failed-check fail-and-recover \
•   --entity foo-app-01.example.com \
•   --check Sausage \
•   --time 3

Slide 43

Slide 43 text

• simulate-failed-check fail-and-recover \
•   --entity foo-app-01.example.com \
•   --check Sausage \
•   --time 3
• # View result
• open 'http://localhost:3080/check?entity=foo-app-01.example.com&check=Sausage'

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

Check provisioning quality control

Slide 46

Slide 46 text

Grace period on notifications

Slide 47

Slide 47 text

processor:
  enabled: yes
  queue: events
  notifier_queue: notifications
  archive_events: true
  events_archive_maxage: 10800
  new_check_scheduled_maintenance_duration: 100 years
  logger:
    level: INFO
    syslog_errors: yes

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

No content

Slide 50

Slide 50 text

How does it work?

Slide 51

Slide 51 text

Data model

Slide 52

Slide 52 text

Contact

Slide 53

Slide 53 text

Contact Checks Checks Media

Slide 54

Slide 54 text

Contact Checks Checks Media Checks Checks Notification Rules

Slide 55

Slide 55 text

Contact Checks Checks Media Checks Checks Notification Rules Checks Checks Entities

Slide 56

Slide 56 text

Contact Checks Checks Media Checks Checks Notification Rules Checks Checks Checks Checks Checks Entities

Slide 57

Slide 57 text

Contact Checks Checks Media Checks Checks Notification Rules History (maintenance, acks, state changes) Checks Checks Checks Checks Checks Entities

Slide 58

Slide 58 text

Contacts

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

No content

Slide 63

Slide 63 text

Contact Checks Checks Media Checks Checks Notification Rules History (maintenance, acks, state changes) Checks Checks Checks Checks Checks Entities

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

Contact Checks Checks Media Checks Checks Notification Rules History (maintenance, acks, state changes) Checks Checks Checks Checks Checks Entities

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

Architecture

Slide 68

Slide 68 text

No content

Slide 69

Slide 69 text

event producers

Slide 70

Slide 70 text

event producers processors

Slide 71

Slide 71 text

event producers processors gateways

Slide 72

Slide 72 text

Event producers

Slide 73

Slide 73 text

processors gateways event producers

Slide 74

Slide 74 text

processors gateways Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder

Slide 75

Slide 75 text

# /etc/icinga/icinga.cfg
broker_module=/usr/local/lib/flapjackfeeder.o redis_host=localhost,redis_port=6380

Slide 76

Slide 76 text

What is an event?

Slide 77

Slide 77 text

JSON

Slide 78

Slide 78 text

{ "entity": ENTITY, "check": CHECK, "tags": TAGS, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY }

Slide 79

Slide 79 text

{ "entity": ENTITY, } ENTITY (string) Name of the relevant entity (e.g. FQDN)

Slide 80

Slide 80 text

{ "check": CHECK, } CHECK (string) The check name

Slide 81

Slide 81 text

{ "tags": TAGS, } TAGS (string[]) Array of tags pertaining to the event

Slide 82

Slide 82 text

{ "state": STATE, } STATE (string) One of 'ok', 'warning', 'critical', 'unknown'

Slide 83

Slide 83 text

{ "time": TIMESTAMP, } TIMESTAMP (integer) UNIX timestamp of the event's creation

Slide 84

Slide 84 text

{ "summary": SUMMARY } SUMMARY (string) The check output

Slide 85

Slide 85 text

{ "entity": ENTITY, "check": CHECK, "tags": TAGS, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY }

Slide 86

Slide 86 text

{ "entity": ENTITY, "check": CHECK, "tags": TAGS, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY }
simulate-failed-check fail-and-recover \
  --entity foo-app-01 \
  --check Sausage \
  --time 3

Slide 87

Slide 87 text

{ "entity": ENTITY, "check": CHECK, "tags": TAGS, "state": STATE, "time": TIMESTAMP, "summary": SUMMARY }
simulate-failed-check fail-and-recover \
  --entity foo-app-01 \
  --check Sausage \
  --time 3
{ "entity": "foo-app-01", "check": "Sausage", "tags": null, "state": "critical", "time": 1399241087, "summary": "Simulated check" }
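
As a rough illustration of the event format on the slides above, the Python sketch below builds and validates an event before serialising it to JSON. The helper name `build_event` and its validation rule are assumptions made for this example and are not part of Flapjack; only the six field names and the four state values come from the slides.

```python
import json
import time

# The four states listed on the slides; this sketch rejects anything else.
VALID_STATES = {"ok", "warning", "critical", "unknown"}


def build_event(entity, check, state, summary, tags=None, timestamp=None):
    """Return a dict with the six fields shown in the event format.

    Hypothetical helper for illustration; Flapjack does not ship it.
    """
    if state not in VALID_STATES:
        raise ValueError(f"state must be one of {sorted(VALID_STATES)}")
    return {
        "entity": entity,
        "check": check,
        "tags": tags,
        "state": state,
        # Default to "now" as a UNIX timestamp when none is supplied.
        "time": int(time.time()) if timestamp is None else int(timestamp),
        "summary": summary,
    }


event = build_event(
    "foo-app-01", "Sausage", "critical", "Simulated check", timestamp=1399241087
)
payload = json.dumps(event)
print(payload)
```

An event producer would then hand `payload` to Flapjack; how it is delivered is outside the scope of this sketch.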

Slide 88

Slide 88 text

Processors

Slide 89

Slide 89 text

processors gateways Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder

Slide 90

Slide 90 text

processor gateways notifier Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder

Slide 91

Slide 91 text

processor gateways notifier Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder

Slide 92

Slide 92 text

processor gateways notifier Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder • Processor: processor

Slide 93

Slide 93 text

processor gateways notifier Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder • Processor: • Read events processor

Slide 94

Slide 94 text

processor gateways notifier Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder • Processor: • Read events • Update state (counters, history) processor

Slide 95

Slide 95 text

processor gateways notifier Sensu Sensu::Extension::Flapjack Nagios nagios-receiver Icinga flapjackfeeder • Processor: • Read events • Update state (counters, history) • Determine if notification should be sent processor

Slide 96

Slide 96 text

Notifier

Slide 97

Slide 97 text

processor gateways notifier Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver

Slide 98

Slide 98 text

processor gateways notifier Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver • Notifier: notifier

Slide 99

Slide 99 text

processor gateways notifier Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver • Notifier: • Reads notifications notifier

Slide 100

Slide 100 text

processor gateways notifier Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver • Notifier: • Reads notifications • Routes notifications notifier

Slide 101

Slide 101 text

processor gateways notifier Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver • Notifier: • Reads notifications • Routes notifications • Dispatches alerts to gateways notifier

Slide 102

Slide 102 text

How are notifications routed?

Slide 103

Slide 103 text

event

Slide 104

Slide 104 text

event filters Find failing events

Slide 105

Slide 105 text

event filters Find failing events Processor Notifier

Slide 106

Slide 106 text

notification event filters Find failing events Processor Notifier

Slide 107

Slide 107 text

Find people interested in entity map [ alice, bob, carol ] notification event filters Find failing events Processor Notifier

Slide 108

Slide 108 text

Find people interested in entity map map Find media owned by people [ [ alice, email ], [ alice, sms ], [ bob, email ], [ bob, sms ], [ carol, sms ], ] notification event filters Find failing events Processor Notifier

Slide 109

Slide 109 text

Find people interested in entity map map reduce Find media owned by people Delete media based on tags, severity, time of day [ [ alice, email ], [ alice, sms ], [ bob, sms ], ] notification event filters Find failing events Processor Notifier

Slide 110

Slide 110 text

Find people interested in entity map map reduce reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes [ [ alice, sms ], [ bob, sms ], ] notification event filters Find failing events Processor Notifier

Slide 111

Slide 111 text

Find people interested in entity map map reduce reduce reduce Find media owned by people Delete media based on tags, severity, time of day Delete media based on blackholes Delete media based on notification intervals notification event filters Find failing events [ [ alice, sms ], [ bob, sms ], ] Processor Notifier

Slide 112

Slide 112 text

Processor: event, filters (Find failing events), notification
Notifier:
map: Find people interested in entity
map: Find media owned by people
reduce: Delete media based on tags, severity, time of day
reduce: Delete media based on blackholes
reduce: Delete media based on notification intervals
[ [ alice, sms ], [ bob, sms ], ]
alert, alert
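
The routing steps on the preceding slides can be sketched in Python as two expansion steps followed by three filters. This is a simplified illustration, not Flapjack's implementation: the function name `route`, every data structure, and the sample data are assumptions chosen to reproduce the slide example, and the rule step only looks at severity (tags and time of day are left out). The slides label the last three steps "reduce"; here they are modelled as plain filters.

```python
def route(entity, severity, now, contacts_by_entity, media_by_contact,
          allowed_media, blackholed, last_sent, interval):
    """Return the (contact, medium) pairs that should receive an alert."""
    # map: find people interested in the entity
    contacts = contacts_by_entity.get(entity, [])
    # map: find media owned by those people
    pairs = [(c, m) for c in contacts for m in media_by_contact.get(c, [])]
    # reduce: keep only media the contact's rules allow for this severity
    pairs = [(c, m) for c, m in pairs if m in allowed_media.get((c, severity), ())]
    # reduce: drop media that are blackholed
    pairs = [p for p in pairs if p not in blackholed]
    # reduce: drop media that were notified less than `interval` seconds ago
    pairs = [p for p in pairs
             if now - last_sent.get(p, float("-inf")) >= interval]
    return pairs


# Sample data invented to mirror the slides.
contacts_by_entity = {"foo-app-01": ["alice", "bob", "carol"]}
media_by_contact = {
    "alice": ["email", "sms"],
    "bob": ["email", "sms"],
    "carol": ["sms"],
}
allowed_media = {
    ("alice", "critical"): {"email", "sms"},
    ("bob", "critical"): {"sms"},
    ("carol", "critical"): set(),
}
blackholed = {("alice", "email")}

alerts = route("foo-app-01", "critical", 1399241087, contacts_by_entity,
               media_by_contact, allowed_media, blackholed,
               last_sent={}, interval=900)
print(alerts)
```

Keeping each step as a separate pass makes it easy to inspect the intermediate list after any one of them, which is how the slides present the example.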

Slide 113

Slide 113 text

Working with notification rules

Slide 114

Slide 114 text

curl -w 'response: %{http_code} \n' -X POST \
  -H "Content-Type: application/vnd.api+json" -d \
  '{
    "notification_rules": [
      {
        "entities": [ "foo-app-01.example.com" ],
        "regex_entities": [ "^foo-\\S{3}-\\d{2}.example.com$" ],
        "tags": [ "database", "physical" ],
        "regex_tags": [],
        "time_restrictions": [],
        "unknown_media": [],
        "warning_media": [ "email" ],
        "critical_media": [ "sms", "email" ],
        "unknown_blackhole": false,
        "warning_blackhole": false,
        "critical_blackhole": false
      }
    ]
  }' \
  http://localhost:3081/contacts/5/notification_rules
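
For readers who prefer a script to curl, the same request can be prepared with Python's standard library. This is a sketch under the assumption that the API accepts exactly the body shown on the slide; the `send` helper is a hypothetical convenience and is deliberately not called here, so the example runs without a Flapjack instance.

```python
import json
import urllib.request

# Rule fields copied from the slide; the regex is a raw string so that
# json.dumps emits correctly escaped backslashes.
rule = {
    "entities": ["foo-app-01.example.com"],
    "regex_entities": [r"^foo-\S{3}-\d{2}.example.com$"],
    "tags": ["database", "physical"],
    "regex_tags": [],
    "time_restrictions": [],
    "unknown_media": [],
    "warning_media": ["email"],
    "critical_media": ["sms", "email"],
    "unknown_blackhole": False,
    "warning_blackhole": False,
    "critical_blackhole": False,
}

body = json.dumps({"notification_rules": [rule]}).encode("utf-8")
request = urllib.request.Request(
    "http://localhost:3081/contacts/5/notification_rules",
    data=body,
    headers={"Content-Type": "application/vnd.api+json"},
    method="POST",
)


def send(req):
    """Send the prepared request and return the HTTP status code."""
    with urllib.request.urlopen(req) as response:
        return response.status


print(request.get_method(), request.full_url)
```

With the Vagrant environment running, `send(request)` would perform the POST and return the response code that the curl command prints.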

Slide 115

Slide 115 text

flapjack.io/docs/jsonapi

Slide 116

Slide 116 text

No content

Slide 117

Slide 117 text

Gateways

Slide 118

Slide 118 text

processor notifier Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver gateways

Slide 119

Slide 119 text

processor notifier Email SMS Jabber PagerDuty Web API Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver

Slide 120

Slide 120 text

Unidirectional vs Bidirectional

Slide 121

Slide 121 text

processor notifier Email SMS Jabber PagerDuty Web API Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver

Slide 122

Slide 122 text

processor notifier Email SMS Jabber PagerDuty Web API Nagios flapjackfeeder Sensu Sensu::Extension::Flapjack Icinga nagios-receiver Jabber PagerDuty

Slide 123

Slide 123 text

PagerDuty + Nagios Double ACK

Slide 124

Slide 124 text

Flapjack syncs PagerDuty state

Slide 125

Slide 125 text

No content

Slide 126

Slide 126 text

Killer features

Slide 127

Slide 127 text

Self-checking (oobetet)

Slide 128

Slide 128 text

No content

Slide 129

Slide 129 text

event producers

Slide 130

Slide 130 text

event producers flapjack

Slide 131

Slide 131 text

event producers flapjack oobetet

Slide 132

Slide 132 text

No content

Slide 133

Slide 133 text

flapper

Slide 134

Slide 134 text

event producers flapper

Slide 135

Slide 135 text

event producers flapjack flapper

Slide 136

Slide 136 text

event producers flapjack jabber room flapper

Slide 137

Slide 137 text

event producers flapjack jabber room flapper oobetet

Slide 138

Slide 138 text

event producers flapjack jabber room flapper oobetet PagerDuty

Slide 139

Slide 139 text

No content

Slide 140

Slide 140 text

Alert summarisation (rollup)

Slide 141

Slide 141 text

No content

Slide 142

Slide 142 text

• per media,

Slide 143

Slide 143 text

• per media, • per contact

Slide 144

Slide 144 text

• per media, • per contact • thresholds

Slide 145

Slide 145 text

Contact Checks Checks Media Checks Checks Notification Rules History (maintenance, acks, state changes) Checks Checks Checks Checks Checks Entities

Slide 146

Slide 146 text

Contact Checks Checks Media Checks Checks Notification Rules History (maintenance, acks, state changes) Checks Checks Checks Checks Checks Entities Checks Checks Summary Threshold
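
A minimal sketch of the rollup idea: each medium of a contact carries a summary threshold, and once the number of failing checks reaches it, one summary replaces the individual alerts. The function name `plan_alerts`, the message wording, and the "greater than or equal" comparison are assumptions for illustration; the slides only state that rollup is per media, per contact, and threshold based.

```python
def plan_alerts(failing_checks, rollup_threshold):
    """Return the messages to send on one medium of one contact.

    failing_checks: names of checks currently failing for this contact.
    rollup_threshold: summary threshold for the medium, or None to disable.
    """
    checks = sorted(failing_checks)
    if rollup_threshold is not None and len(checks) >= rollup_threshold:
        # At or above the threshold: one summary instead of many alerts.
        return [f"{len(checks)} checks failing: " + ", ".join(checks)]
    # Below the threshold: one alert per failing check.
    return [f"{check} is failing" for check in checks]


print(plan_alerts({"foo-app-01:PING", "foo-app-01:SSH"}, rollup_threshold=3))
print(plan_alerts({"foo-app-01:PING", "foo-app-01:SSH", "foo-db-01:PING"},
                  rollup_threshold=3))
```

Because the threshold belongs to the medium, a contact could in principle roll up SMS early while still receiving individual emails.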

Slide 147

Slide 147 text

No content

Slide 148

Slide 148 text

Tagging [watch this space]

Slide 149

Slide 149 text

No content

Slide 150

Slide 150 text

Things that may surprise you

Slide 151

Slide 151 text

Constant heartbeat

Slide 152

Slide 152 text

No one-off events

Slide 153

Slide 153 text

How long has a check been failing?

Slide 154

Slide 154 text

NOT "How many times has the check failed?"

Slide 155

Slide 155 text

Why?

Slide 156

Slide 156 text

No HARD/SOFT states
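
The point of the preceding slides (ask how long a check has been failing rather than how many times it has failed) can be sketched as a small tracker keyed on entity and check. The class name `FailureTracker` and the 30 second delay are assumptions for this example, not Flapjack's actual code or defaults. The sketch relies on the constant heartbeat of events: each incoming event either extends or ends the current failure period.

```python
class FailureTracker:
    """Decide on notifications by failure duration, not failure count."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.failing_since = {}  # (entity, check) -> time of first failure

    def should_notify(self, event):
        key = (event["entity"], event["check"])
        if event["state"] == "ok":
            # Recovery ends the failure period.
            self.failing_since.pop(key, None)
            return False
        # Remember when this failure period started, if not already known.
        since = self.failing_since.setdefault(key, event["time"])
        return event["time"] - since >= self.delay


tracker = FailureTracker(delay_seconds=30)
timeline = [(0, "critical"), (20, "critical"), (30, "critical"),
            (40, "ok"), (50, "critical")]
for t, state in timeline:
    event = {"entity": "foo-app-01", "check": "Sausage",
             "state": state, "time": t}
    print(t, state, tracker.should_notify(event))
```

A duration threshold behaves the same regardless of how often a check engine reports, which a retry counter does not.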

Slide 157

Slide 157 text

Broadcast delay

Slide 158

Slide 158 text

Alert summarisation (Rollup)

Slide 159

Slide 159 text

No content

Slide 160

Slide 160 text

Open Source

Slide 161

Slide 161 text

• github.com/flapjack/flapjack

Slide 162

Slide 162 text

Release planning in the open: github.com/flapjack/flapjack/wiki/Releasing

Slide 163

Slide 163 text

Policy for triaging bugs + features github.com/flapjack/flapjack/wiki/Releasing

Slide 164

Slide 164 text

Semantic versioning (2.0.0)

Slide 165

Slide 165 text

We write tests! Unit + Integration

Slide 166

Slide 166 text

We write tests! ~80% coverage

Slide 167

Slide 167 text

We run tests!

Slide 168

Slide 168 text

No content

Slide 169

Slide 169 text

• github.com/flapjack

Slide 170

Slide 170 text

• github.com/flapjack
• github.com/flapjack/flapjack

Slide 171

Slide 171 text

• github.com/flapjack
• github.com/flapjack/flapjack
• github.com/flapjack/flapjack-diner

Slide 172

Slide 172 text

• github.com/flapjack
• github.com/flapjack/flapjack
• github.com/flapjack/flapjack-diner
• github.com/flapjack/vagrant-flapjack

Slide 173

Slide 173 text

• github.com/flapjack
• github.com/flapjack/flapjack
• github.com/flapjack/flapjack-diner
• github.com/flapjack/vagrant-flapjack
• github.com/flapjack/omnibus-flapjack

Slide 174

Slide 174 text

• github.com/flapjack
• github.com/flapjack/flapjack
• github.com/flapjack/flapjack-diner
• github.com/flapjack/vagrant-flapjack
• github.com/flapjack/omnibus-flapjack
• github.com/flapjack/packages.flapjack.io

Slide 175

Slide 175 text

• github.com/flapjack
• github.com/flapjack/flapjack
• github.com/flapjack/flapjack-diner
• github.com/flapjack/vagrant-flapjack
• github.com/flapjack/omnibus-flapjack
• github.com/flapjack/packages.flapjack.io

Slide 176

Slide 176 text

packages.flapjack.io

Slide 177

Slide 177 text

fat packages (omnibus)

Slide 178

Slide 178 text

.deb provided .rpm in the works (by you)

Slide 179

Slide 179 text

Quality documentation github.com/flapjack/flapjack/wiki

Slide 180

Slide 180 text

Bad documentation? BUG

Slide 181

Slide 181 text

Bad first experience? BUG

Slide 182

Slide 182 text

No content

Slide 183

Slide 183 text

Large numbers of potential contacts

Slide 184

Slide 184 text

Segregated responsibility

Slide 185

Slide 185 text

Single pane of glass

Slide 186

Slide 186 text

No content

Slide 187

Slide 187 text

Thank you! flapjack.io • github.com/flapjack/flapjack

Slide 188

Slide 188 text

Thank you! Liked the workshop? Let @auxesis + @jessereynolds know! flapjack.io • github.com/flapjack/flapjack

Slide 189

Slide 189 text

Post-Monitorama Monitoring Meetup @Jive - 18.30

Slide 190

Slide 190 text

Credits:
http://www.flickr.com/photos/lizadaly/4373330774
http://www.flickr.com/photos/meltwater/420749031
http://www.flickr.com/photos/jonmould/5393395335
http://vmfarms.com/static/img/logos/ruby-logo.png
http://www.flickr.com/photos/l1v32r1d3bmx/3985457584
http://www.flickr.com/photos/thomasforsyth/4313764488
http://www.flickr.com/photos/rubodewig/5161937181
http://www.flickr.com/photos/ronwls/7001551988
http://tosbourn.com/wp-content/uploads/2013/12/redis-logo.png?e0df77
http://www.flickr.com/photos/sparktography/83217827
http://www.flickr.com/photos/sdphotography/1570906849
http://www.flickr.com/photos/derekskey/9530097369
http://giphy.com/gifs/yeUxljCJjH1rW
http://en.wikipedia.org/wiki/Broadcast_delay
http://www.flickr.com/photos/whatknot/8642836187
http://www.flickr.com/photos/karen_d/8448507872
http://www.flickr.com/photos/buzzhoffman/4127280540
http://i.imgur.com/2UduUZ5.gif