Monitoring is Dead

Monitoring systems have not changed significantly in 20 years and have fallen behind the way we build software. Our software now consists of large distributed systems made up of many non-uniform, interacting components, while the core functionality of monitoring systems has stagnated. Furthermore, the people responsible for monitoring and operating these systems often lack expert knowledge of them. In this talk, we will explore how our current monitoring capabilities are failing us and discuss how we can build systems that are both reliable and observable while making our lives (or the lives of the people responsible for operating them in production) easier.


Greg Poirier

June 27, 2016



  2. Greg Poirier CTO @ opsee.com @grepory


  4. The big five

  5. CPU

  6. uptime | mailx -s "cpu" root

  7. Memory

  8. free | mailx -s "mem" root

  9. Disk

  10. (df -h; du -sh /home/*) | mailx -s "disk" root

  11. Process aliveness

  12. (ps -ef | grep important) | mailx -s "proc" root

  13. System aliveness

  14. ping -c 4 giganews.com | mailx -s "ping" root

  16. Thresholds for the threshold god.

  17. OK: it < something

  18. WARN: something < it < something

  19. CRITICAL: it > something
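
The three-state threshold check on the slides above can be sketched in a few lines of Python. This is only an illustration: the names `Status`, `classify`, `warn_at` and `critical_at` are hypothetical, and treating a value that sits exactly on a threshold as the more severe state is an assumption, since the slides leave the boundaries undefined.

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    WARN = "WARN"
    CRITICAL = "CRITICAL"


def classify(value: float, warn_at: float, critical_at: float) -> Status:
    """Map one sample to a status using two fixed thresholds."""
    if warn_at >= critical_at:
        raise ValueError("warn_at must be below critical_at")
    # Assumption: a value equal to a threshold counts as the worse state.
    if value >= critical_at:
        return Status.CRITICAL
    if value >= warn_at:
        return Status.WARN
    return Status.OK


if __name__ == "__main__":
    # Hypothetical load averages checked against hypothetical limits.
    for load in (0.4, 4.2, 9.7):
        print(load, classify(load, warn_at=4.0, critical_at=8.0).value)
```

Note that each sample collapses to a single label with no memory of earlier samples, which is the limitation the next slides complain about.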

  20. What’s the problem?

  21. You're either a one or a zero. Alive or dead.

  22. But Greeeeeegggg…

  23. Time-series data!

  24. Derivatives!

  25. Percentiles!

  26. #slowclap

  27. Everything we know is wrong.

  28. What the hell is going on?

  29. Calm the hell down, friends.

  30. What is this? DevOps?

  31. Let’s change the conversation.


  33. What is monitoring?

  34. Observability

  35. A system is observable iff you can determine the behavior of the system based on its outputs.
  37. A system is a set of connected components.

  39. The manner in which a system acts is its behavior.

  41. The outputs of a system are the concrete results of its behaviors.
  43. What about monitoring?

  44. A system is a set of connected components.

  45. One or more sensors observe the state of a component.

  46. An agent interprets data emitted by a sensor.
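
One way to picture the component / sensor / agent vocabulary from the last three slides is the small Python sketch below. The class names, the `queue_rule` function and its limit of 100 are all hypothetical, chosen only to make the roles concrete; the talk does not prescribe any particular structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sensor:
    """Observes one aspect of a single component's state."""
    component: str
    name: str
    read: Callable[[], float]


@dataclass
class Agent:
    """Interprets the data emitted by its sensors."""
    sensors: List[Sensor]
    interpret: Callable[[Sensor, float], str]

    def observe(self) -> Dict[str, str]:
        # Read every sensor once and return one interpretation per reading.
        return {
            f"{s.component}.{s.name}": self.interpret(s, s.read())
            for s in self.sensors
        }


def queue_rule(sensor: Sensor, value: float) -> str:
    # Hypothetical interpretation: more than 100 queued items is a backlog.
    return "backlogged" if value > 100 else "draining"


if __name__ == "__main__":
    agent = Agent(
        sensors=[Sensor("worker", "queue_length", lambda: 250.0)],
        interpret=queue_rule,
    )
    print(agent.observe())
```

The sensor only reports state; any judgment about what that state means lives in the agent, which is why the agent needs knowledge of the system it watches.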

  47. JFC, Greg. What is monitoring?

  48. Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.

  50. Things change.

  51. Fault detection

  52. The ‘FLP result’ [1]

  53. Byzantine Generals Problem [2]

  54. Respond too slowly / Fail to respond [3][4]

  55. Service Level Objectives

  56. Better health checks

  57. DevOps this shit up.

  58. Monitoring is part of building software.

  59. .monitor.yml:

        metrics:
          - metric: "nsq.queue_length.results"
            assertions:
              - comparison: "> 0"
                time: "5m"
        http_checks:
          - method: "GET"
            path: "/health"
            assertions:
              - body_json: ".database.health = true"
                code: 200
                rtt: "10ms"
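
To make the `http_checks` assertions in `.monitor.yml` concrete, here is a rough Python sketch of how one assertion might be evaluated against an observed response. Everything in it is an assumption rather than the behavior of any real tool: the function names are hypothetical, `rtt` is read as an upper bound on round-trip time, and `body_json` is treated as a dotted path compared for equality with a JSON literal. The response is passed in as plain values so the sketch stays free of network calls.

```python
import json
from typing import Any, Dict, List


def parse_duration_ms(text: str) -> float:
    """Convert strings such as "10ms", "2s" or "5m" to milliseconds."""
    units = {"ms": 1.0, "s": 1000.0, "m": 60000.0}
    for suffix in ("ms", "s", "m"):  # "ms" must be tried before "s"
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unknown duration: {text!r}")


def body_json_holds(expr: str, body: Any) -> bool:
    """Check an expression of the assumed form '.a.b = <json literal>'."""
    path, _, literal = expr.partition("=")
    value = body
    for key in path.strip().lstrip(".").split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    return value == json.loads(literal.strip())


def evaluate_http_check(
    assertion: Dict[str, Any], code: int, rtt_ms: float, body_text: str
) -> List[str]:
    """Return a list of failed assertions; an empty list means healthy."""
    failures: List[str] = []
    if "code" in assertion and code != assertion["code"]:
        failures.append(f"code: expected {assertion['code']}, got {code}")
    if "rtt" in assertion and rtt_ms > parse_duration_ms(assertion["rtt"]):
        failures.append(f"rtt: {rtt_ms}ms exceeds {assertion['rtt']}")
    if "body_json" in assertion:
        try:
            body = json.loads(body_text)
        except json.JSONDecodeError:
            body = None  # an unparseable body fails the assertion
        if not body_json_holds(assertion["body_json"], body):
            failures.append(f"body_json: {assertion['body_json']} not satisfied")
    return failures


if __name__ == "__main__":
    assertion = {"body_json": ".database.health = true", "code": 200, "rtt": "10ms"}
    print(evaluate_http_check(assertion, 200, 4.0, '{"database": {"health": true}}'))
    print(evaluate_http_check(assertion, 503, 25.0, '{"database": {"health": false}}'))
```

Returning the list of failed assertions, rather than a single boolean, keeps the reason for a failure visible instead of collapsing it to alive or dead.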
  60. Understand how your systems behave.

  61. Build better tools.

  62. Think about distributed systems.

  63. • 1. Fischer, M., Lynch, N., and Paterson, M. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). Journal of the Association for Computing Machinery, Vol. 32, No. 2, April 1985, pp. 374-382.
    • 2. Lamport, L., Shostak, R., and Pease, M. [The Byzantine Generals Problem](http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf). ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
    • 3. Poledna, S., Burns, A., Wellings, A., and Barrett, P. [Replica Determinism and Flexible Scheduling in Hard Real-Time Dependable Systems](https://people.cs.pitt.edu/~melhem/courses/3530/papers/ft5.pdf). IEEE Transactions on Computers, Vol. 49, No. 2, February 2000, pp. 100-111.
    • 4. Videla, A. [Failure Modes in Distributed Systems](http://videlalvaro.github.io/2013/12/failure-modes-in-distributed-systems.html). Blog post, December 2013.
    • [A Brief Tour of FLP Impossibility](http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/)
    • [Fault Management in Distributed Systems](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1960&context=cis_reports)
    • [The Case for Byzantine Fault Detection](https://www.usenix.org/legacy/event/hotdep06/tech/prelim_papers/haeberlen/haeberlen_html/index.html)
    • [Fail-Stop Processors](https://www.cs.cornell.edu/fbs/publications/FailStop.pdf)
    • [The Phi Accrual Failure Detector](http://fubica.lsd.ufcg.edu.br/hp/cursos/cfsc/papers/hayashibara04theaccrual.pdf)
    • [GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems](http://docs.hcs.ufl.edu/pubs/GEMS2005.pdf)
    • [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf)
    • [A Fault Detection Service for Wide Area Distributed Computations](http://toolkit.globus.org/ftppub/globus/papers/hbm.pdf)
    • [Fault Detection and Identification in Computer Networks: A Soft Computing Approach](https://uwspace.uwaterloo.ca/bitstream/handle/10012/4905/NFMS_PhD_Thesis_Jan7.pdf)
    • [BAR Fault Tolerance for Cooperative Services](https://www.cs.utexas.edu/~lorenzo/papers/sosp05.pdf)
    • [PeerReview: Practical Accountability for Distributed Systems](http://www.sosp2007.org/papers/sosp118-haeberlen.pdf)
    • [The Verification of a Distributed System](http://queue.acm.org/detail.cfm?id=2889274)
    • [Practical Byzantine Fault Tolerance and Proactive Recovery](http://research.microsoft.com/en-us/um/people/mcastro/publications/p398-castro-bft-tocs.pdf)
    • [Accrual Failure Detectors](http://www.jaist.ac.jp/jinzai/Report16/Hayashibara.pdf)
    • [Consistency in a Partitioned Network: A Survey](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1669&context=cis_reports)
    • [A Gossip-Style Failure Detection Service](https://www.cs.cornell.edu/home/rvr/papers/GossipFD.pdf)
    • [SWIM: Scalable Weakly-Consistent Infection-style Process Group Membership Protocol](https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
    • [Adaptive Diagnosis in Distributed Systems](http://www.research.ibm.com/people/r/rish/papers/IEEE.pdf)
    • [Dempster-Shafer theory](https://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory)
    • [A Simple View of the Dempster-Shafer Theory of Evidence and its Implication for the Rule of Combination](http://people.eecs.berkeley.edu/~zadeh/papers/Dempster-Shafer_1986.pdf)
    • [Beyond Breakpoints: A Tour of Dynamic Analysis](https://github.com/dijkstracula/QConNYC2016)
    • [Trust but Verify: Accountability for Network Services](http://issg.cs.duke.edu/publications/trust-ew04.pdf)
  64. <3 @grepory github.com/grepory/monitorama2016 opsee.com