Monitoring is Dead

MONITORING IS DEAD

Greg Poirier CTO @ opsee.com @grepory

A BRIEF HISTORY

The big five

CPU

uptime | mailx -s "cpu" root

Memory

free | mailx -s "mem" root

Disk

(df -h; du -sh /home/*) | mailx -s "disk" root

Process aliveness

(ps -ef | grep important) | mailx -s "proc" root

System aliveness

ping -c 4 giganews.com | mailx -s "ping" root

Thresholds for the threshold god.

OK: it < something

WARN: something < it < something

CRITICAL: it > something
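
The three-state check above can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the parameter names, and the example thresholds are assumptions, since the slides leave "something" unspecified. Readings that land exactly on a threshold are assigned to the higher severity here, which is also an assumption.

```python
def classify(value, warn, crit):
    """Map one reading onto OK / WARN / CRITICAL using two thresholds.

    `warn` and `crit` are hypothetical thresholds with warn < crit.
    A reading equal to a threshold is treated as the higher severity.
    """
    if warn >= crit:
        raise ValueError("warn must be below crit")
    if value < warn:
        return "OK"
    if value < crit:
        return "WARN"
    return "CRITICAL"


if __name__ == "__main__":
    # Hypothetical load averages checked against warn=4.0, crit=8.0.
    for load in (0.7, 5.2, 9.9):
        print(load, classify(load, warn=4.0, crit=8.0))
```

Whatever the thresholds, the output is still one of a handful of states per reading, which is the limitation the next slides poke at.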

What’s the problem?

You're either a one or a zero. Alive or dead.

But Greeeeeegggg…

Time-series data!

Derivatives!

Percentiles!

#slowclap

Everything we know is wrong.

What the hell is going on?

Calm the hell down, friends.

What is this? DevOps?

Let’s change the conversation.

A LINE IN THE SAND

What is monitoring?

Observability

A system is observable iff you can determine the behavior
of the system based on its outputs.

A system is a set of connected components.

The manner in which a system acts is its behavior.

The outputs of a system are the concrete results of
its behaviors.

What about monitoring?

A system is a set of connected components.

One or more sensors observe the state of a component.

An agent interprets data emitted by a sensor.

JFC, Greg. What is monitoring?

Monitoring is the action of observing and checking the behavior
and outputs of a system and its components over time.
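
One way to picture these definitions is a small Python sketch in which sensors observe a component and an agent interprets their readings over time. The class names `Sensor` and `Agent`, the `tick` counter, and the windowed mean used as the "interpretation" are all assumptions made for illustration, not something taken from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Sensor:
    """Observes one aspect of a component's state by calling `read`."""
    name: str
    read: Callable[[], float]


@dataclass
class Agent:
    """Collects sensor readings over time and interprets them."""
    sensors: List[Sensor]
    history: List[Tuple[int, str, float]] = field(default_factory=list)

    def observe(self, tick: int) -> None:
        # Record one reading per sensor, tagged with the observation time.
        for sensor in self.sensors:
            self.history.append((tick, sensor.name, sensor.read()))

    def interpret(self, name: str, window: int) -> float:
        """Return the mean of the last `window` readings from one sensor."""
        values = [v for _, n, v in self.history if n == name][-window:]
        if not values:
            raise LookupError(f"no readings for {name}")
        return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical queue-depth readings standing in for a real component.
    fake_readings = iter([3.0, 5.0, 10.0])
    agent = Agent([Sensor("queue_length", lambda: next(fake_readings))])
    for tick in range(3):
        agent.observe(tick)
    print(agent.interpret("queue_length", window=2))
```

The point of the sketch is the separation of roles: the sensor only emits data, and any judgment about behavior over time lives in the agent.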

LET US DO THIS

Things change.

Fault detection

The ‘FLP result’ [1]

Byzantine Generals Problem [2]

Respond too slowly / Fail to respond [3][4]

Service Level Objectives

Better health checks

DevOps this shit up.

Monitoring is part of building software.

.monitor.yml:

    metrics:
      - metric: "nsq.queue_length.results"
        assertions:
          - comparison: "> 0"
            time: "5m"
    http_checks:
      - method: "GET"
        path: "/health"
        assertions:
          - body_json: ".database.health = true"
            code: 200
            rtt: "10ms"
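
To make the `http_checks` assertions in `.monitor.yml` concrete, here is a rough Python sketch of how one response to `GET /health` might be evaluated against them. It does not reproduce any real tool: the function names `lookup` and `check_health`, the dotted-path handling for `body_json`, and the reading of `rtt` as an upper bound in milliseconds are assumptions. The response is passed in as plain values so the sketch runs without a network.

```python
import json


def lookup(doc, path):
    """Follow a dotted path such as '.database.health' through nested dicts."""
    node = doc
    for key in path.strip(".").split("."):
        node = node[key]
    return node


def check_health(status, rtt_ms, body, *, want_status=200, max_rtt_ms=10.0,
                 json_path=".database.health", want_value=True):
    """Return a list of failed assertions; an empty list means healthy.

    The defaults mirror the example config: code 200, rtt within 10ms,
    and .database.health equal to true in the JSON body.
    """
    failures = []
    if status != want_status:
        failures.append(f"code: got {status}, want {want_status}")
    if rtt_ms > max_rtt_ms:
        failures.append(f"rtt: {rtt_ms}ms exceeds {max_rtt_ms}ms")
    try:
        actual = lookup(json.loads(body), json_path)
    except (ValueError, KeyError, TypeError):
        # Body was not JSON, or the path did not exist in it.
        failures.append(f"body_json: {json_path} missing or body not JSON")
    else:
        if actual != want_value:
            failures.append(f"body_json: {json_path} is {actual!r}")
    return failures


if __name__ == "__main__":
    # Hypothetical responses: one healthy, one failing every assertion.
    print(check_health(200, 4.2, '{"database": {"health": true}}'))
    print(check_health(503, 25.0, '{"database": {"health": false}}'))
```

Returning every failed assertion, instead of a single up or down bit, keeps the reason for the failure attached to the check result.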

Understand how your systems behave.

Build better tools.

Think about distributed systems.

• 1. Fischer, M., Lynch, N., and Paterson, M. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). In Journal of the Association for Computing Machinery, Vol. 32, No. 2, April 1985, pp. 374-382.
• 2. Lamport, L., Shostak, R., and Pease, M. [The Byzantine Generals Problem](http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf). In ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
• 3. Poledna, S., Burns, A., Wellings, A., and Barrett, P. [Replica Determinism and Flexible Scheduling in Hard Real-Time Dependable Systems](https://people.cs.pitt.edu/~melhem/courses/3530/papers/ft5.pdf). In IEEE Transactions on Computers, Vol. 49, No. 2, February 2000, pp. 100-111.
• 4. Videla, A. [Failure Modes in Distributed Systems](http://videlalvaro.github.io/2013/12/failure-modes-in-distributed-systems.html). On his blog, December 2013.
• [A Brief Tour of FLP Impossibility](http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/)
• [Fault Management in Distributed Systems](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1960&context=cis_reports)
• [The Case for Byzantine Fault Detection](https://www.usenix.org/legacy/event/hotdep06/tech/prelim_papers/haeberlen/haeberlen_html/index.html)
• [Fail-Stop Processors](https://www.cs.cornell.edu/fbs/publications/FailStop.pdf)
• [The Phi Accrual Failure Detector](http://fubica.lsd.ufcg.edu.br/hp/cursos/cfsc/papers/hayashibara04theaccrual.pdf)
• [GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems](http://docs.hcs.ufl.edu/pubs/GEMS2005.pdf)
• [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf)
• [A Fault Detection Service for Wide Area Distributed Computations](http://toolkit.globus.org/ftppub/globus/papers/hbm.pdf)
• [Fault Detection and Identification in Computer Networks: A Soft Computing Approach](https://uwspace.uwaterloo.ca/bitstream/handle/10012/4905/NFMS_PhD_Thesis_Jan7.pdf)
• [BAR Fault Tolerance for Cooperative Services](https://www.cs.utexas.edu/~lorenzo/papers/sosp05.pdf)
• [PeerReview: Practical Accountability for Distributed Systems](http://www.sosp2007.org/papers/sosp118-haeberlen.pdf)
• [The Verification of a Distributed System](http://queue.acm.org/detail.cfm?id=2889274)
• [Practical Byzantine Fault Tolerance and Proactive Recovery](http://research.microsoft.com/en-us/um/people/mcastro/publications/p398-castro-bft-tocs.pdf)
• [Accrual Failure Detectors](http://www.jaist.ac.jp/jinzai/Report16/Hayashibara.pdf)
• [Consistency in a Partitioned Network: A Survey](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1669&context=cis_reports)
• [A Gossip-Style Failure Detection Service](https://www.cs.cornell.edu/home/rvr/papers/GossipFD.pdf)
• [SWIM: Scalable Weakly-Consistent Infection-style Process Group Membership Protocol](https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
• [Adaptive Diagnosis in Distributed Systems](http://www.research.ibm.com/people/r/rish/papers/IEEE.pdf)
• [Dempster-Shafer theory](https://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory)
• [A Simple View of the Dempster-Shafer Theory of Evidence and its Implication for the Rule of Combination](http://people.eecs.berkeley.edu/~zadeh/papers/Dempster-Shafer_1986.pdf)
• [Beyond Breakpoints: A Tour of Dynamic Analysis](https://github.com/dijkstracula/QConNYC2016)
• [Trust but Verify: Accountability for Network Services](http://issg.cs.duke.edu/publications/trust-ew04.pdf)

<3 @grepory github.com/grepory/monitorama2016 opsee.com
