Monitoring on a budget

Slide 1

Slide 1 text

monitoring on a budget

Slide 2

Slide 2 text

a few animated gifs with the Twelfth Doctor (0 cats)

Slide 3

Slide 3 text

C J Silverio, VP of engineering, @ceejbot

Slide 4

Slide 4 text

let's talk npm by the numbers

Slide 5

Slide 5 text

205 million packages Tuesday
10K requests/sec

Slide 6

Slide 6 text

npm is 25 people
4 of us run the registry

Slide 7

Slide 7 text

when the company was formed: 5 people total

Slide 8

Slide 8 text

you outsource many services when you're tiny

Slide 9

Slide 9 text

you pull them back in-house when you succeed

Slide 10

Slide 10 text

success is sometimes a catastrophe

Slide 11

Slide 11 text

npm's scale: runaway success
npm's staff: wouldn't this be neat

Slide 12

Slide 12 text

mission: know this on a budget

Slide 14

Slide 14 text

2 questions: is the registry up? how well is it performing?

Slide 15

Slide 15 text

is the registry up? monitoring

Slide 16

Slide 16 text

how well is it performing? metrics

Slide 17

Slide 17 text

monitoring

Slide 18

Slide 18 text

monitoring == pull
ask questions that you know the right answers for

Slide 19

Slide 19 text

Is this host up? Is this cert about to expire? Is the DB replication keeping up?

Slide 20

Slide 20 text

if you get the wrong answer somebody gets paged
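
A minimal sketch of one such question ("is this cert about to expire?") written as a pull-style check. It follows the Nagios plugin convention that exit status 0 means OK, 1 means WARNING, and 2 means CRITICAL. The function name, the thresholds, and the dates are illustrative assumptions, not anything taken from npm's setup.

```python
from datetime import datetime, timezone

# Nagios plugin convention: the exit status carries the verdict.
OK, WARNING, CRITICAL = 0, 1, 2


def check_cert_expiry(not_after, now, warn_days=30, crit_days=7):
    """Return (exit_code, status_line) for a cert that expires at not_after.

    warn_days and crit_days are hypothetical thresholds; pick your own.
    """
    days_left = (not_after - now).days
    if days_left < crit_days:
        return CRITICAL, f"CRITICAL - cert expires in {days_left} days"
    if days_left < warn_days:
        return WARNING, f"WARNING - cert expires in {days_left} days"
    return OK, f"OK - cert expires in {days_left} days"


# Hypothetical dates; a real check would read the expiry from the live cert.
expiry = datetime(2030, 1, 1, tzinfo=timezone.utc)
code, line = check_cert_expiry(expiry, datetime(2029, 12, 29, tzinfo=timezone.utc))
print(code, line)
```

A real plugin would print the status line and exit with the code; a non-zero exit is the "wrong answer" that leads to a page.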

Slide 21

Slide 21 text

nagios: state of the art in free

Slide 23

Slide 23 text

It's okay. We never look at it. It just triggers PagerDuty.

Slide 24

Slide 24 text

nagios’s virtues: reliability & custom checks

Slide 25

Slide 25 text

goal: never page anybody

Slide 26

Slide 26 text

self-healing checks: automate the fix if you can!
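
One way to picture a self-healing check, as a sketch only: run the check, try the automated fix if it fails, and page a human only when the fix does not help. The function name, the return values, and the fake service below are all hypothetical.

```python
def check_with_self_heal(is_healthy, attempt_fix, max_attempts=1):
    """Run a check; on failure try the automated fix before paging anyone.

    Returns "ok", "healed", or "page".
    """
    if is_healthy():
        return "ok"
    for _ in range(max_attempts):
        attempt_fix()
        if is_healthy():
            return "healed"
    return "page"


# Hypothetical service that a restart brings back to life.
service = {"running": False}


def restart():
    service["running"] = True


result = check_with_self_heal(lambda: service["running"], restart)
print(result)
```

Only the "page" outcome should ever reach a person, which is how this supports the goal of never paging anybody.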

Slide 27

Slide 27 text

monitoring == unit tests
a ratchet for continuous improvement

Slide 28

Slide 28 text

external monitoring: ping services

Slide 29

Slide 29 text

you must monitor, but that's just the start

Slide 30

Slide 30 text

monitoring tells you what; it doesn't tell you why

Slide 31

Slide 31 text

metrics

Slide 32

Slide 32 text

Q: What's a metric? A: A name + a value + a time.

Slide 33

Slide 33 text

counter: it happened N times
gauge: it's Y-sized right now
rate: it's happening N times/second
timing: it took X milliseconds
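
The "name + value + time" definition and the four kinds of metric can be sketched as a small data structure. This is an illustration under assumed names (`Metric`, `rate_per_second`, and the `registry.*` metric names are all hypothetical), not the format of any particular library.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Metric:
    """A metric is a name + a value + a time."""
    name: str
    value: float
    time: float = field(default_factory=time.time)  # seconds since the epoch


def rate_per_second(earlier, later):
    """Derive a rate from two readings of the same counter."""
    return (later.value - earlier.value) / (later.time - earlier.time)


# counter: it happened N times (two readings taken ten seconds apart)
first = Metric("registry.requests", 1000, time=100.0)
second = Metric("registry.requests", 1500, time=110.0)

# gauge: it's Y-sized right now
queue_depth = Metric("registry.queue_depth", 42)

# timing: it took X milliseconds
latency = Metric("registry.response_ms", 12.5)

# rate: it's happening N times/second
print(rate_per_second(first, second))
```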

Slide 34

Slide 34 text

metrics == push
the app gives you numbers

Slide 35

Slide 35 text

emit from a service
store in timeseries db
query & graph

Slide 36

Slide 36 text

the usual stack: statsd ➜ graphite ➜ grafana

Slide 39

Slide 39 text

statsd uses UDP

Slide 40

Slide 40 text

Q: Why not send metrics over UDP? A: You care about receiving them.
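
To make the UDP point concrete, here is a sketch that builds a line in the statsd text format (`name:value|type`, where `c` is a counter, `g` a gauge, and `ms` a timing) and sends it as a single datagram. The helper names and the metric name are hypothetical. Port 8125 is the usual statsd default.

```python
import socket


def statsd_line(name, value, kind):
    """Format one metric in the statsd text protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{kind}".encode("ascii")


def send_udp(payload, host="127.0.0.1", port=8125):
    """Fire and forget one datagram.

    The return value is the number of bytes handed to the OS.
    It says nothing about whether anything received them.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))


payload = statsd_line("registry.requests", 1, "c")
print(payload)
```

Calling `send_udp(payload)` returns normally whether or not anything is listening on the port, so a dropped metric is invisible to the sender.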

Slide 41

Slide 41 text

just try to install graphite

Slide 42

Slide 42 text

for-pay/SaaS services exist, but I can't afford them

Slide 43

Slide 43 text

monitoring 400 processes right now
12+ GB of log data a day

Slide 44

Slide 44 text

interlude: when should you pay?

Slide 45

Slide 45 text

convert the £$€ cost into engineer hours/month

Slide 46

Slide 46 text

pay when it's cheaper than investing an engineer (be honest about the cost)
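
The pay-or-build rule is plain arithmetic. A sketch with made-up numbers follows; the price, the hourly cost, and the hour estimate are all hypothetical placeholders for your own figures.

```python
def cost_in_engineer_hours(monthly_price, loaded_hourly_cost):
    """Convert a service's monthly price into engineer hours per month."""
    return monthly_price / loaded_hourly_cost


def should_pay(monthly_price, loaded_hourly_cost, in_house_hours_per_month):
    """Pay when the service costs fewer engineer hours than running it yourself."""
    return cost_in_engineer_hours(monthly_price, loaded_hourly_cost) < in_house_hours_per_month


# Hypothetical: a 2000/month service, an engineer costing 100/hour fully loaded,
# and an honest estimate of 30 hours/month to run the equivalent in-house.
hours = cost_in_engineer_hours(2000, 100)
print(hours, should_pay(2000, 100, 30))
```

Being honest about the cost mostly means being honest about the last argument: ongoing upkeep counts, not only the initial build.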

Slide 47

Slide 47 text

numbat was born
“How hard can it be?” I said.

Slide 49

Slide 49 text

https://github.com/numbat-metrics
numbat-powered metrics

Slide 51

Slide 51 text

npm’s stack: numbat ➜ influxdb ➜ grafana

Slide 53

Slide 53 text

so easy to emit a metric that we just do it any time something interesting happens

Slide 54

Slide 54 text

4000 metrics/sec from the registry

Slide 59

Slide 59 text

metrics ➜ alerts

Slide 60

Slide 60 text

Server handling expected traffic?
Latency higher than normal?
Error rate higher than usual?
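
Questions like "higher than usual?" need a definition of usual. One simple option, shown here only as a sketch, is to compare the current value against the mean and standard deviation of recent history. The function name, the 3-sigma threshold, and the sample error rates are assumptions, not the method described in the talk.

```python
from statistics import mean, stdev


def is_anomalous(history, current, sigmas=3.0):
    """True when current is more than `sigmas` standard deviations above
    the mean of history. Needs at least two historical points."""
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + sigmas * spread


# Hypothetical recent error rates (fraction of requests that failed).
recent_error_rates = [0.010, 0.012, 0.011, 0.009, 0.013]

print(is_anomalous(recent_error_rates, 0.050))
print(is_anomalous(recent_error_rates, 0.012))
```

The same comparison works for latency. For "handling expected traffic?" you would also want to alert on values far below the baseline, which this one-sided sketch does not do.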