Slide 1

Slide 1 text

MONITORING IS DEAD

Slide 2

Slide 2 text

Greg Poirier, CTO @ opsee.com, @grepory

Slide 3

Slide 3 text

A BRIEF HISTORY

Slide 4

Slide 4 text

The big five

Slide 5

Slide 5 text

CPU

Slide 6

Slide 6 text

uptime | mailx -s "cpu" root

Slide 7

Slide 7 text

Memory

Slide 8

Slide 8 text

free | mailx -s "mem" root

Slide 9

Slide 9 text

Disk

Slide 10

Slide 10 text

(df -h; du -sh /home/*) | mailx -s "disk" root

Slide 11

Slide 11 text

Process aliveness

Slide 12

Slide 12 text

(ps -ef | grep important) | mailx -s "proc" root

Slide 13

Slide 13 text

System aliveness

Slide 14

Slide 14 text

ping -c 4 giganews.com | mailx -s "ping" root

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Thresholds for the threshold god.

Slide 17

Slide 17 text

OK: it < something

Slide 18

Slide 18 text

WARN: something < it < something

Slide 19

Slide 19 text

CRITICAL: it > something
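
The three threshold states on the preceding slides can be sketched roughly as follows. This is an illustration only, not code from the talk: the function name, the parameter names `warn_at` and `critical_at`, and the choice to treat boundary values as the more severe state are all assumptions, since the slides leave "something" undefined.

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    WARN = "WARN"
    CRITICAL = "CRITICAL"


def classify(value: float, warn_at: float, critical_at: float) -> Status:
    """Map a single reading onto the three classic states.

    warn_at and critical_at are hypothetical thresholds. A reading that
    sits exactly on a threshold is assigned the more severe state.
    """
    if warn_at > critical_at:
        raise ValueError("warn_at must not exceed critical_at")
    if value >= critical_at:
        return Status.CRITICAL
    if value >= warn_at:
        return Status.WARN
    return Status.OK


if __name__ == "__main__":
    # Made-up load averages for a hypothetical host.
    for load in (0.7, 3.2, 6.5):
        print(load, classify(load, warn_at=3.0, critical_at=5.0).value)
```

Note that the function sees one reading at a time and returns one of three labels, which is exactly the limitation the next slides complain about.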

Slide 20

Slide 20 text

What’s the problem?

Slide 21

Slide 21 text

You're either a one or a zero. Alive or dead.

Slide 22

Slide 22 text

But Greeeeeegggg…

Slide 23

Slide 23 text

Time-series data!

Slide 24

Slide 24 text

Derivatives!

Slide 25

Slide 25 text

Percentiles!
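
For readers who want the three objections above in concrete form, here is a minimal sketch of a derivative and a percentile over time-series samples. It is not from the talk: the `Sample` layout, the function names, and the nearest-rank percentile definition are assumptions chosen to keep the example short.

```python
import math
from typing import List, Tuple

# Assumed sample layout: (timestamp in seconds, value).
Sample = Tuple[float, float]


def derivative(samples: List[Sample]) -> List[Sample]:
    """Per-second rate of change between consecutive samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of
    the data at or below it."""
    if not values:
        raise ValueError("no values")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    counter = [(0.0, 100.0), (10.0, 150.0), (20.0, 150.0)]
    print(derivative(counter))
    latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 13, 14]
    print(percentile(latencies_ms, 50), percentile(latencies_ms, 100))
```

Both functions still end in a number that gets compared with a threshold, which may be the point of the slide that follows.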

Slide 26

Slide 26 text

#slowclap

Slide 27

Slide 27 text

Everything we know is wrong.

Slide 28

Slide 28 text

What the hell is going on?

Slide 29

Slide 29 text

Calm the hell down, friends.

Slide 30

Slide 30 text

What is this? DevOps?

Slide 31

Slide 31 text

Let’s change the conversation.

Slide 32

Slide 32 text

A LINE IN THE SAND

Slide 33

Slide 33 text

What is monitoring?

Slide 34

Slide 34 text

Observability

Slide 35

Slide 35 text

A system is observable iff you can determine the behavior of the system based on its outputs.

Slide 36

Slide 36 text

A system is observable iff you can determine the behavior of the system based on its outputs.

Slide 37

Slide 37 text

A system is a set of connected components.

Slide 38

Slide 38 text

A system is observable iff you can determine the behavior of the system based on its outputs.

Slide 39

Slide 39 text

The manner in which a system acts is its behavior.

Slide 40

Slide 40 text

A system is observable iff you can determine the behavior of the system based on its outputs.

Slide 41

Slide 41 text

The outputs of a system are the concrete results of its behaviors.

Slide 42

Slide 42 text

A system is observable iff you can determine the behavior of the system based on its outputs.

Slide 43

Slide 43 text

What about monitoring?

Slide 44

Slide 44 text

A system is a set of connected components.

Slide 45

Slide 45 text

One or more sensors observe the state of a component.

Slide 46

Slide 46 text

An agent interprets data emitted by a sensor.
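
The vocabulary on the last three slides (components, sensors, agents) could be modelled along these lines. This is a hedged sketch rather than anything shown in the talk: `Reading`, `Agent`, `observe`, and the simple "maximum acceptable value" rule are hypothetical names and rules picked for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Reading:
    component: str  # which component of the system was observed
    name: str       # what was observed, e.g. "queue_length"
    value: float


# A sensor observes one aspect of a component's state and emits a reading.
Sensor = Callable[[], Reading]


class Agent:
    """Interprets readings emitted by sensors (hypothetical rule set)."""

    def __init__(self, limits: Dict[str, float]):
        self.limits = limits  # reading name -> maximum acceptable value

    def interpret(self, reading: Reading) -> str:
        limit = self.limits.get(reading.name)
        if limit is None:
            return "unknown"
        return "healthy" if reading.value <= limit else "unhealthy"


def observe(sensors: List[Sensor], agent: Agent) -> Dict[str, str]:
    """Poll every sensor once and collect the agent's interpretation."""
    results = {}
    for sensor in sensors:
        reading = sensor()
        results[f"{reading.component}.{reading.name}"] = agent.interpret(reading)
    return results


if __name__ == "__main__":
    sensors = [
        lambda: Reading("worker", "queue_length", 12.0),
        lambda: Reading("db", "replication_lag_s", 45.0),
    ]
    agent = Agent({"queue_length": 100.0, "replication_lag_s": 30.0})
    print(observe(sensors, agent))
```

Keeping the sensor (what is observed) separate from the agent (what it means) mirrors the split the slides draw between emitting data and interpreting it.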

Slide 47

Slide 47 text

JFC, Greg. What is monitoring?

Slide 48

Slide 48 text

Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.

Slide 49

Slide 49 text

LET US DO THIS

Slide 50

Slide 50 text

Things change.

Slide 51

Slide 51 text

Fault detection

Slide 52

Slide 52 text

The ‘FLP result’ [1]

Slide 53

Slide 53 text

Byzantine Generals Problem [2]

Slide 54

Slide 54 text

Respond too slowly / Fail to respond [3][4]
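
The two failure modes named on this slide can be told apart, at least heuristically, with a timed probe. The sketch below is an assumption-laden illustration, not the speaker's code: `probe`, `slow_after`, and the convention that the probed callable raises an exception (for example a timeout) when no response arrives are all hypothetical.

```python
import time
from typing import Callable


def probe(call: Callable[[], object], slow_after: float,
          clock: Callable[[], float] = time.monotonic) -> str:
    """Classify one call as 'ok', 'slow', or 'failed'.

    call is expected to raise (for example TimeoutError) when the
    component does not respond; slow_after is a latency budget in seconds.
    """
    start = clock()
    try:
        call()
    except Exception:
        return "failed"  # fail to respond
    elapsed = clock() - start
    return "slow" if elapsed > slow_after else "ok"  # respond too slowly


if __name__ == "__main__":
    def healthy():
        return "pong"

    def dead():
        raise TimeoutError("no response")

    print(probe(healthy, slow_after=0.1), probe(dead, slow_after=0.1))
```

A timeout is only a guess: from the outside, a component that is very slow and one that has crashed look the same until the deadline passes, so the budget chosen for `slow_after` is a judgment call.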

Slide 55

Slide 55 text

Service Level Objectives
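
One common way to make a service level objective checkable is to count the fraction of requests that met a latency budget and compare it with a target. The sketch below assumes that formulation; the function name, the 250 ms budget, and the 90% target are made-up values, not figures from the talk.

```python
from typing import List, Tuple


def slo_report(latencies_ms: List[float], budget_ms: float,
               target: float) -> Tuple[float, bool]:
    """Return (fraction of requests within budget, whether target was met)."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for latency in latencies_ms if latency <= budget_ms)
    achieved = good / len(latencies_ms)
    return achieved, achieved >= target


if __name__ == "__main__":
    # Hypothetical request latencies for one evaluation window.
    window = [80, 120, 95, 300, 110, 90, 105, 99, 101, 2000]
    print(slo_report(window, budget_ms=250, target=0.9))
```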

Slide 56

Slide 56 text

Better health checks

Slide 57

Slide 57 text

DevOps this shit up.

Slide 58

Slide 58 text

Monitoring is part of building software.

Slide 59

Slide 59 text

.monitor.yml:

    metrics:
      - metric: "nsq.queue_length.results"
        assertions:
          - comparison: "> 0"
            time: "5m"
    http_checks:
      - method: "GET"
        path: "/health"
        assertions:
          - body_json: ".database.health = true"
            code: 200
            rtt: "10ms"
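
To show how the http_checks assertions in .monitor.yml might be evaluated, here is a rough sketch. The slide does not define the file's semantics, so everything below is an assumption: the check is written as a Python dict to avoid needing a YAML parser, `body_json` is read as "dotted path = JSON literal", `rtt` is read as an upper bound in milliseconds, and the function names are hypothetical.

```python
import json
from typing import Any, Dict, List

# The http_checks entry from the slide, as a plain dict.
CHECK = {
    "method": "GET",
    "path": "/health",
    "assertions": [
        {"body_json": ".database.health = true", "code": 200, "rtt": "10ms"},
    ],
}


def eval_body_json(expr: str, body: Any) -> bool:
    """Evaluate a '.a.b = literal' expression against parsed JSON.

    Assumed grammar for this sketch only: dotted path, '=', JSON literal.
    """
    path, sep, literal = expr.partition("=")
    if not sep:
        raise ValueError(f"unsupported expression: {expr!r}")
    expected = json.loads(literal.strip())
    node = body
    for key in path.strip().lstrip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return node == expected


def parse_ms(text: str) -> float:
    """Parse a duration such as '10ms' (only milliseconds are handled)."""
    if not text.endswith("ms"):
        raise ValueError(f"unsupported duration: {text!r}")
    return float(text[:-2])


def evaluate(assertion: Dict[str, Any], status: int, body_text: str,
             rtt_ms: float) -> List[str]:
    """Return human-readable failures; an empty list means the check passed."""
    failures = []
    if "code" in assertion and status != assertion["code"]:
        failures.append(f"code: expected {assertion['code']}, got {status}")
    if "rtt" in assertion and rtt_ms > parse_ms(assertion["rtt"]):
        failures.append(f"rtt: {rtt_ms}ms exceeds {assertion['rtt']}")
    if "body_json" in assertion:
        try:
            body = json.loads(body_text)
        except ValueError:
            body = None  # unparseable body fails the assertion below
        if not eval_body_json(assertion["body_json"], body):
            failures.append(f"body_json: {assertion['body_json']} did not hold")
    return failures


if __name__ == "__main__":
    # Simulated response; a real runner would issue the GET itself.
    print(evaluate(CHECK["assertions"][0], 200,
                   '{"database": {"health": true}}', rtt_ms=4.2))
```

The response here is simulated so the sketch stays self-contained; the interesting part is that the assertions live in a file next to the code, in line with the previous slide.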

Slide 60

Slide 60 text

Understand how your systems behave.

Slide 61

Slide 61 text

Build better tools.

Slide 62

Slide 62 text

Think about distributed systems.

Slide 63

Slide 63 text

1. Fischer, M., Lynch, N., and Paterson, M. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). In Journal of the Association for Computing Machinery, Vol. 32, No. 2, April 1985, pp. 374-382.
2. Lamport, L., Shostak, R., and Pease, M. [The Byzantine Generals Problem](http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf). In ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
3. Poledna, S., Burns, A., Wellings, A., and Barrett, P. [Replica Determinism and Flexible Scheduling in Hard Real-Time Dependable Systems](https://people.cs.pitt.edu/~melhem/courses/3530/papers/ft5.pdf). In IEEE Transactions on Computers, Vol. 49, No. 2, February 2000, pp. 100-111.
4. Videla, A. [Failure Modes in Distributed Systems](http://videlalvaro.github.io/2013/12/failure-modes-in-distributed-systems.html). Blog post, December 2013.

• [A Brief Tour of FLP Impossibility](http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/)
• [Fault Management in Distributed Systems](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1960&context=cis_reports)
• [The Case for Byzantine Fault Detection](https://www.usenix.org/legacy/event/hotdep06/tech/prelim_papers/haeberlen/haeberlen_html/index.html)
• [Fail-Stop Processors](https://www.cs.cornell.edu/fbs/publications/FailStop.pdf)
• [The Phi Accrual Failure Detector](http://fubica.lsd.ufcg.edu.br/hp/cursos/cfsc/papers/hayashibara04theaccrual.pdf)
• [GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems](http://docs.hcs.ufl.edu/pubs/GEMS2005.pdf)
• [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf)
• [A Fault Detection Service for Wide Area Distributed Computations](http://toolkit.globus.org/ftppub/globus/papers/hbm.pdf)
• [Fault Detection and Identification in Computer Networks: A Soft Computing Approach](https://uwspace.uwaterloo.ca/bitstream/handle/10012/4905/NFMS_PhD_Thesis_Jan7.pdf)
• [BAR Fault Tolerance for Cooperative Services](https://www.cs.utexas.edu/~lorenzo/papers/sosp05.pdf)
• [PeerReview: Practical Accountability for Distributed Systems](http://www.sosp2007.org/papers/sosp118-haeberlen.pdf)
• [The Verification of a Distributed System](http://queue.acm.org/detail.cfm?id=2889274)
• [Practical Byzantine Fault Tolerance and Proactive Recovery](http://research.microsoft.com/en-us/um/people/mcastro/publications/p398-castro-bft-tocs.pdf)
• [Accrual Failure Detectors](http://www.jaist.ac.jp/jinzai/Report16/Hayashibara.pdf)
• [Consistency in a Partitioned Network: A Survey](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1669&context=cis_reports)
• [A Gossip-Style Failure Detection Service](https://www.cs.cornell.edu/home/rvr/papers/GossipFD.pdf)
• [SWIM: Scalable Weakly-Consistent Infection-style Process Group Membership Protocol](https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
• [Adaptive Diagnosis in Distributed Systems](http://www.research.ibm.com/people/r/rish/papers/IEEE.pdf)
• [Dempster-Shafer theory](https://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory)
• [A Simple View of the Dempster-Shafer Theory of Evidence and its Implication for the Rule of Combination](http://people.eecs.berkeley.edu/~zadeh/papers/Dempster-Shafer_1986.pdf)
• [Beyond Breakpoints: A Tour of Dynamic Analysis](https://github.com/dijkstracula/QConNYC2016)
• [Trust but Verify: Accountability for Network Services](http://issg.cs.duke.edu/publications/trust-ew04.pdf)

Slide 64

Slide 64 text

<3 @grepory github.com/grepory/monitorama2016 opsee.com