Monitoring is Dead

Monitoring systems have not changed significantly in 20 years and have fallen behind the way we build software. Our software now consists of large distributed systems made up of many non-uniform, interacting components, while the core functionality of monitoring systems has stagnated. Furthermore, the people responsible for monitoring and operating these systems often lack expert knowledge of them. In this talk, we will explore how our current monitoring capabilities are failing us and discuss how we can build systems that are both reliable and observable while making our lives (or the lives of the people responsible for operating them in production) easier.

Greg Poirier

June 27, 2016

Transcript

  1. MONITORING
    IS DEAD

  2. Greg Poirier
    CTO @ opsee.com
    @grepory

  3. A BRIEF
    HISTORY

  4. The big five

  5. CPU

  6. uptime |
    mailx -s "cpu" root

  7. Memory

  8. free |
    mailx -s "mem" root

  9. Disk

  10. (df -h; du -sh /home/*)|
    mailx -s "disk" root

  11. Process aliveness

  12. (ps -ef | grep important)|
    mailx -s "proc" root

  13. System aliveness

  14. ping -c 4 giganews.com |
    mailx -s "ping" root

  16. Thresholds for
    the threshold god.

  17. OK:
    it < something

  18. WARN:
    something < it < something

  19. CRITICAL:
    it > something
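
The three threshold slides above amount to a classifier over a single measured value. A minimal Python sketch, assuming hypothetical `warn` and `crit` bounds (the slides only say "something") and assuming that a value sitting exactly on a bound counts toward the more severe state:

```python
from enum import Enum


class Status(Enum):
    OK = "OK"
    WARN = "WARN"
    CRITICAL = "CRITICAL"


def classify(value: float, warn: float, crit: float) -> Status:
    """Map one measurement to a status using two hypothetical thresholds."""
    if warn > crit:
        raise ValueError("warn threshold must not exceed crit threshold")
    if value >= crit:
        return Status.CRITICAL
    if value >= warn:
        return Status.WARN
    return Status.OK


# Example: a load average checked against made-up thresholds.
print(classify(0.7, warn=2.0, crit=4.0).value)
print(classify(3.1, warn=2.0, crit=4.0).value)
print(classify(9.5, warn=2.0, crit=4.0).value)
```

Whatever the thresholds are, the output collapses to a handful of discrete states.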

  20. What’s the problem?

  21. You're either a one or a zero.
    Alive or dead.

  22. But Greeeeeegggg…

  23. Time-series data!

  24. Derivatives!

  25. Percentiles!

  26. #slowclap

  27. Everything we know
    is wrong.

  28. What the hell is
    going on?

  29. Calm the hell down,
    friends.

  30. What is this?
    DevOps?

  31. Let’s change the
    conversation.

  32. A LINE
    IN THE
    SAND

  33. What is monitoring?

  34. Observability

  35. A system is observable iff you
    can determine the behavior of
    the system based on its outputs.

  36. A system is observable iff you
    can determine the behavior of
    the system based on its outputs.

  37. A system is a set of
    connected components.

  38. A system is observable iff you
    can determine the behavior of
    the system based on its outputs.

  39. The manner in which a
    system acts is its
    behavior.

  40. A system is observable iff you
    can determine the behavior of
    the system based on its outputs.

  41. The outputs of a system
    are the concrete results
    of its behaviors.

  42. A system is observable iff you
    can determine the behavior of
    the system based on its outputs.

  43. What about
    monitoring?

  44. A system is a set of
    connected components.

  45. One or more sensors
    observe the state of a
    component.

  46. An agent interprets data
    emitted by a sensor.
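
The sensor and agent definitions above can be read as two small interfaces. A rough Python sketch, not taken from the talk: the `Reading` shape, the `queue_depth` name, and the upper-bound rule used by the agent are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Reading:
    component: str
    name: str
    value: float


# A sensor observes one component and emits a reading (hypothetical shape).
Sensor = Callable[[], Reading]


class Agent:
    """Interprets readings emitted by sensors; here, a simple upper-bound rule."""

    def __init__(self, limits: dict[str, float]):
        self.limits = limits

    def interpret(self, reading: Reading) -> str:
        limit = self.limits.get(reading.name)
        if limit is None:
            return "unknown"
        return "healthy" if reading.value <= limit else "unhealthy"


def queue_depth_sensor() -> Reading:
    # Stand-in for a real measurement; the value is made up for illustration.
    return Reading(component="worker", name="queue_depth", value=12.0)


agent = Agent(limits={"queue_depth": 100.0})
print(agent.interpret(queue_depth_sensor()))
```

Keeping observation and interpretation separate means the same sensor output can feed more than one agent, each with its own rule.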

  47. JFC, Greg.
    What is monitoring?

  48. Monitoring is the action of
    observing and checking the
    behavior and outputs of a system
    and its components over time.

  49. LET US
    DO THIS

  50. Things change.

  51. Fault detection

  52. The ‘FLP result’ [1]

  53. Byzantine Generals
    Problem [2]

  54. Respond too slowly/
    Fail to respond [3][4]

  55. Service Level Objectives

  56. Better health checks

  57. DevOps this shit up.

  58. Monitoring is part of
    building software.

  59. .monitor.yml:
    metrics:
      - metric: "nsq.queue_length.results"
        assertions:
          - comparison: "> 0"
            time: "5m"
    http_checks:
      - method: "GET"
        path: "/health"
        assertions:
          - body_json: ".database.health = true"
            code: 200
            rtt: "10ms"
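
One way to read the `http_checks` assertions above is as a function over a single observed response. The Python sketch below is an assumption about how such a file might be evaluated, not the speaker's tooling: `Observation`, `lookup`, and `check` are hypothetical names, `rtt` is treated as an upper bound in milliseconds, and the `body_json` expression is simplified to a dotted path plus an expected value.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What a prober saw for one request (hypothetical shape)."""
    code: int
    body: str
    rtt_ms: float


def lookup(doc, dotted_path: str):
    """Resolve a path like '.database.health' against parsed JSON."""
    current = doc
    for key in dotted_path.strip(".").split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def check(obs: Observation, expect_code: int, max_rtt_ms: float,
          path: str, expected) -> list[str]:
    """Return the failed assertions; an empty list means the check passed."""
    failures = []
    if obs.code != expect_code:
        failures.append(f"code: got {obs.code}, want {expect_code}")
    if obs.rtt_ms > max_rtt_ms:
        failures.append(f"rtt: {obs.rtt_ms}ms exceeds {max_rtt_ms}ms")
    try:
        doc = json.loads(obs.body)
    except json.JSONDecodeError:
        failures.append("body_json: body is not valid JSON")
    else:
        if lookup(doc, path) != expected:
            failures.append(f"body_json: {path} is not {expected!r}")
    return failures


healthy = Observation(code=200, body='{"database": {"health": true}}', rtt_ms=4.2)
slow = Observation(code=200, body='{"database": {"health": true}}', rtt_ms=35.0)
print(check(healthy, 200, 10.0, ".database.health", True))
print(check(slow, 200, 10.0, ".database.health", True))
```

Returning the list of failed assertions, instead of a single pass/fail bit, keeps "responded too slowly" distinguishable from "responded with the wrong answer".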

  60. Understand how your
    systems behave.

  61. Build better tools.

  62. Think about
    distributed systems.

  63. • 1. Fischer, M., Lynch, N., and Paterson, M. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). In Journal of the Association for Computing Machinery, Vol. 32, No. 2, April 1985, pp. 374-382.
    • 2. Lamport, L., Shostak, R., and Pease, M. [The Byzantine Generals Problem](http://research.microsoft.com/en-us/um/people/lamport/pubs/byz.pdf). In ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, July 1982, pp. 382-401.
    • 3. Poledna, S., Burns, A., Wellings, A., and Barrett, P. [Replica Determinism and Flexible Scheduling in Hard Real-Time Dependable Systems](https://people.cs.pitt.edu/~melhem/courses/3530/papers/ft5.pdf). In IEEE Transactions on Computers, Vol. 49, No. 2, February 2000, pp. 100-111.
    • 4. Videla, A. [Failure Modes in Distributed Systems](http://videlalvaro.github.io/2013/12/failure-modes-in-distributed-systems.html). Blog post, December 2013.
    • [A Brief Tour of FLP Impossibility](http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/)
    • [Fault Management in Distributed Systems](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1960&context=cis_reports)
    • [The Case for Byzantine Fault Detection](https://www.usenix.org/legacy/event/hotdep06/tech/prelim_papers/haeberlen/haeberlen_html/index.html)
    • [Fail-Stop Processors](https://www.cs.cornell.edu/fbs/publications/FailStop.pdf)
    • [The Phi Accrual Failure Detector](http://fubica.lsd.ufcg.edu.br/hp/cursos/cfsc/papers/hayashibara04theaccrual.pdf)
    • [GEMS: Gossip-Enabled Monitoring Service for Scalable Heterogeneous Distributed Systems](http://docs.hcs.ufl.edu/pubs/GEMS2005.pdf)
    • [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf)
    • [A Fault Detection Service for Wide Area Distributed Computations](http://toolkit.globus.org/ftppub/globus/papers/hbm.pdf)
    • [Fault Detection and Identification in Computer Networks: A Soft Computing Approach](https://uwspace.uwaterloo.ca/bitstream/handle/10012/4905/NFMS_PhD_Thesis_Jan7.pdf)
    • [BAR Fault Tolerance for Cooperative Services](https://www.cs.utexas.edu/~lorenzo/papers/sosp05.pdf)
    • [PeerReview: Practical Accountability for Distributed Systems](http://www.sosp2007.org/papers/sosp118-haeberlen.pdf)
    • [The Verification of a Distributed System](http://queue.acm.org/detail.cfm?id=2889274)
    • [Practical Byzantine Fault Tolerance and Proactive Recovery](http://research.microsoft.com/en-us/um/people/mcastro/publications/p398-castro-bft-tocs.pdf)
    • [Accrual Failure Detectors](http://www.jaist.ac.jp/jinzai/Report16/Hayashibara.pdf)
    • [Consistency in a Partitioned Network: A Survey](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1669&context=cis_reports)
    • [A Gossip-Style Failure Detection Service](https://www.cs.cornell.edu/home/rvr/papers/GossipFD.pdf)
    • [SWIM: Scalable Weakly-Consistent Infection-style Process Group Membership Protocol](https://www.cs.cornell.edu/~asdas/research/dsn02-swim.pdf)
    • [Adaptive Diagnosis in Distributed Systems](http://www.research.ibm.com/people/r/rish/papers/IEEE.pdf)
    • [Dempster-Shafer theory](https://en.wikipedia.org/wiki/Dempster%E2%80%93Shafer_theory)
    • [A Simple View of the Dempster-Shafer Theory of Evidence and its Implication for the Rule of Combination](http://people.eecs.berkeley.edu/~zadeh/papers/Dempster-Shafer_1986.pdf)
    • [Beyond Breakpoints: A Tour of Dynamic Analysis](https://github.com/dijkstracula/QConNYC2016)
    • [Trust but Verify: Accountability for Network Services](http://issg.cs.duke.edu/publications/trust-ew04.pdf)

  64. <3
    @grepory
    github.com/grepory/monitorama2016
    opsee.com
