5 Years of Metrics & Monitoring

Video of this talk from DevOpsDays Ghent: http://www.ustream.tv/recorded/54694069

---

5 years ago, monitoring was just beginning to emerge from the dark ages.

Since then there's been a Cambrian explosion of tools, a rough formalisation of how the tools should be strung together, the emergence of the #monitoringsucks meme, the transformation of #monitoringsucks into #monitoringlove, and the rise of a sister community around Monitorama.

Alert fatigue has become a concept that's entered the devops consciousness, and more advanced shops along the monitoring continuum are analysing their alerting data to help humans and machines work better together.

But Nagios is still the dominant check executor. Plenty of sites still use RRDtool. And plenty of people are still chained to their pagers, with no relief in sight.

What's holding us back? What will the next 5 years look like? Will we still be using Nagios? Have we misjudged our audience? What are our biggest challenges?

### Sources ###

Font: http://www.fontsquirrel.com/fonts/sketchetica
The Gospel of Graphs, according to Cleveland: http://www.amazon.com/Elements-Graphing-Data-William-Cleveland/dp/0963488414

Lindsay Holmwood

October 27, 2014

Transcript

  1. 5 Years of Metrics & Monitoring Lindsay Holmwood @auxesis

  2. Cultural & Technical

  3. • Key retrospective questions • What did we do well? • What did we learn? • What should we do differently next time? • What still puzzles us?

  4. What got us here won’t get us there

  5. What did we do well? (that if we don’t talk about, we might forget)

  6. The Pipeline

  7. None
  8. collection

  9. storage collection

  10. storage checking collection

  11. storage checking alerting collection

  12. storage checking alerting collection graphing

  13. storage checking alerting collection graphing aggregation

  14. None
  15. collection storage checking alerting graphing aggregation

  16. collection storage checking alerting graphing aggregation collectd & statsd

  17. collection storage checking alerting graphing aggregation Graphite & OpenTSDB & InfluxDB

  18. collection storage checking alerting graphing aggregation Riemann
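
A minimal sketch of the collection stage these slides name: statsd accepts single-line, plain-text datagrams over UDP, so instrumenting an application is one `sendto` per event. The host, port, and metric names below are placeholder assumptions, not from the talk.

```python
import socket

# statsd line protocol: "name:value|type", one metric per UDP datagram.
# Assumes a statsd (or compatible) daemon listening on localhost:8125.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(metric: str) -> None:
    sock.sendto(metric.encode("ascii"), ("localhost", 8125))

emit("web.requests:1|c")          # counter: one request served
emit("web.response_time:320|ms")  # timer: this request took 320 ms
emit("web.active_sessions:42|g")  # gauge: an absolute value, not a delta
```

statsd then aggregates these on a flush interval and forwards the results to a storage backend such as Graphite, which is where the rest of the pipeline picks them up.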

  19. Alert fatigue has become a recognised problem

  20. Cottage industry

  21. PagerDuty & VictorOps & OpsGenie

  22. #monitoringsucks

  23. #monitoringlove

  24. None
  25. What would we do differently next time?

  26. Graphs & Dashboards

  27. Apparently the hardest problem in monitoring is graphing and dashboarding.

  28. What we’re doing wrong

  29. Strip charts

  30. None
  31. None
  32. None
  33. We have a problem

  34. Strip charts: the PHP hammer of graphing

  35. What can the data tell us?

  36. What is the distribution?

  37. It’s not a problem with the tools

  38. Our approach is tainted

  39. “graphing problems we have” vs. “graphing problems serviced by strip charts”

  40. None
  41. Basic graph layout

  42. Black on white

  43. bounding box with x + y axes labels [figure: example axes with tick values]

  44. Colour

  45. Differential colour engine

  46. None
  47. Maximum of 15 colours on-screen

  48. 8%

  49. Adjust saturation, not hue

  50. Use minimal hue to call out data
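
The two colour rules above lend themselves to code. Here is a sketch, using only the standard library, of deriving a series palette from a single hue by varying saturation, with one contrasting hue reserved for the series being called out; the specific hue and saturation values are arbitrary choices for illustration, not from the talk.

```python
import colorsys

def palette(n, hue=0.6, callout=None):
    """n series colours from one hue, varying only saturation; the callout
    series alone gets a contrasting hue to draw the eye."""
    colours = []
    for i in range(n):
        saturation = 0.25 + 0.65 * i / max(n - 1, 1)  # spread saturation, fix hue
        h = 0.05 if i == callout else hue             # minimal hue: one accent only
        r, g, b = colorsys.hls_to_rgb(h, 0.5, saturation)
        colours.append("#%02x%02x%02x" % (round(r * 255), round(g * 255), round(b * 255)))
    return colours

print(palette(5, callout=2))  # four blues of rising saturation, one warm accent
```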

  51. None
  52. Fucking Pie Charts

  53. None
  54. Experiment: Compare segment sizes

  55. None
  56. None
  57. None
  58. None
  59. “This allows us to see very clearly that the pie chart judgements are less accurate than the bar chart judgements.” – William S. Cleveland, The Elements of Graphing Data, p. 86

  60. Pie chart comparisons are more error prone

  61. The only time you should use a pie chart: a pie chart of “Pie eaten” vs. “Pie not eaten”

  62. None
  63. What did we learn?

  64. Democratisation of graphing tool development

  65. Scratch our itches

  66. Same poor UX, better paint job

  67. None
  68. None
  69. We get the graphing tools we deserve

  70. Nagios is here to stay (at least for ops)

  71. Inertia

  72. No strong, compelling alternative

  73. Sensu

  74. When I hear people say “I'm not using Sensu because it's too complex”, I think “and Nagios isn't hiding the same complexity from you?”

  75. This is a problem

  76. None
  77. We don’t know stats

  78. storage checking alerting collection graphing aggregation

  79. storage checking alerting collection graphing aggregation checks

  80. Numbers & Strings & Behaviour

  81. Numbers

  82. Fault detection (thresholding)

  83. Anomaly detection (trend analysis)
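
A toy illustration of the distinction drawn on the last two slides: fault detection asks whether the current value crosses a fixed threshold, while anomaly detection asks whether the newest point deviates from the recent trend. The 3-sigma rule and the sample data are assumptions for the sketch, not from the talk.

```python
from statistics import mean, stdev

def fault(value, critical=90.0):
    """Fault detection: a fixed threshold on the current value."""
    return value >= critical

def anomaly(series, sigmas=3.0):
    """Anomaly detection: is the newest point far off the recent trend?"""
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return False  # not enough history to establish a trend
    spread = stdev(history)
    return spread > 0 and abs(latest - mean(history)) > sigmas * spread

cpu_percent = [41, 43, 40, 44, 42, 43, 41, 78]
print(fault(cpu_percent[-1]))   # False: 78% is still under the 90% threshold
print(anomaly(cpu_percent))     # True: 78% is far outside the recent trend
```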

  84. Monitoring is CI for Production

  85. Continuous Integration

  86. 1. checkout Continuous Integration

  87. 1. checkout 2. build Continuous Integration

  88. 1. checkout 2. build 3. test Continuous Integration

  89. 1. checkout 2. build 3. test 4. notify Continuous Integration

  90. 1. checkout 2. build 3. test 4. notify Continuous Integration / Monitoring

  91. 1. checkout 2. build 3. test 4. notify “can I see my app?” Continuous Integration / Monitoring

  92. • serverspec • sensu
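
These slides pair serverspec-style assertions with sensu as the executor. As a hedged stand-in, here is the previous slide's “can I see my app?” question as a minimal check following the Nagios plugin exit-code convention, which both Nagios and Sensu execute; the URL is a placeholder assumption.

```python
#!/usr/bin/env python3
import sys
from urllib.request import urlopen
from urllib.error import URLError

URL = "http://localhost:8080/health"  # placeholder for your app's endpoint

# Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
try:
    status = urlopen(URL, timeout=5).status
except URLError as err:
    print(f"CRITICAL: cannot see the app at {URL}: {err.reason}")
    sys.exit(2)

if status == 200:
    print(f"OK: {URL} answered {status}")
    sys.exit(0)
print(f"CRITICAL: {URL} answered {status}")
sys.exit(2)
```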

  93. None
  94. What still puzzles us? (or, what might the future look like?)

  95. The future is analysing & acting on our alert data

  96. • Last 5 years • Building new tools • Formalising relationships • Search for parallels in other industries • Measuring the human impact

  97. • Next • Stabilisation of tools • Emerging standards • Exploiting parallels • Mitigating the human impact

  98. Analysis: Ops Weekly

  99. None
  100. None
  101. Context: Nagios Herald

  102. None
  103. None
  104. The future is richer metadata about our metrics

  105. Metrics 2.0

  106. { server: dfs1, what: diskspace, mountpoint: srv/node/dfs10, unit: B, type: used, metric_type: gauge } meta: { agent: diamond, processed_by: statsd2 }

  107. Self-describing

  108. The future is richer metadata about our metrics

  109. The future is richer metadata about our metrics to automatically build appropriate visualisations

  110. • Aggregation • Grouping • Unit conversions • Scaling • Axes labelling • …

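A sketch of why self-describing metrics enable the list above: because the Metrics 2.0 example on slide 106 declares `unit: B` and `type: used`, a renderer can label axes and pick a human-readable scale with no per-graph configuration. The helper names here are invented for illustration.

```python
def axis_label(metric):
    """Build an axis label from Metrics 2.0 tags instead of per-graph config."""
    return f"{metric['what']} ({metric['type']}, {metric['unit']})"

def autoscale(value):
    """Unit conversion driven by the declared unit (binary prefixes for bytes)."""
    for prefix in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024:
            return f"{value:.1f} {prefix}"
        value /= 1024
    return f"{value:.1f} PiB"

metric = {"server": "dfs1", "what": "diskspace", "unit": "B",
          "type": "used", "metric_type": "gauge"}
print(axis_label(metric))        # diskspace (used, B)
print(autoscale(3_221_225_472))  # 3.0 GiB
```
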
  111. Death to strip charts

  112. The future is monitoring tools for devs

  113. Ops must be enablers, not gatekeepers

  114. What has made sense about ops being gatekeepers?

  115. Monitoring is treated as an operational responsibility

  116. Ops team own ops

  117. We’ve won the battles

  118. Ops team own ops

  119. This is no longer the world we live in

  120. How do we become enablers?

  121. Technical & Cultural

  122. None
  123. • Technical

  124. • Technical • Ops provide the platform

  125. • Technical • Ops provide the platform • Maintain, monitor, and scale the platform

  126. — Adrian Cockcroft

  127. None
  128. • Cultural

  129. • Cultural • Coach on what makes a good check • Coach on what is good alert design • Listen to the needs of the end-user

  130. Provide monitoring as a service

  131. Monitoring is a core deliverable on every service

  132. Ship checks & config with your applications
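
One hedged way to read “ship checks & config with your applications”: the check script and its scheduling config live in the application's repo and deploy with the app, rather than in an ops-owned monitoring tree. The sketch below emits a check definition in the style of Sensu 0.x configuration; every name and path is a placeholder assumption.

```python
import json

# Hypothetical check definition, shipped in the application's repo and
# installed with each deploy, written in the style of a Sensu 0.x check.
check = {
    "checks": {
        "myapp_http": {                        # placeholder check name
            "command": "check_myapp_http.py",  # the check script from the repo
            "subscribers": ["myapp"],          # clients that should run it
            "interval": 60,                    # seconds between runs
        }
    }
}

# A deploy step would drop this where the monitoring agent loads its config
# (for Sensu 0.x, conventionally /etc/sensu/conf.d/).
with open("myapp-check.json", "w") as handle:
    json.dump(check, handle, indent=2)
```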

  133. Example: Yelp

  134. None
  135. What’s the barrier to entry?

  136. Does the idea just not have traction?

  137. Are the tools not up to scratch?

  138. Does monitoring need to be SaaS (or SaaS-like) to make this achievable at scale?

  139. “The future is here – it’s just not very evenly distributed” – William Gibson

  140. Monitoring is still insular

  141. We’re building tools for operations teams

  142. Not the developers who need them most

  143. None
  144. Monitoring is like a joke.

  145. Monitoring is like a joke. If you have to explain it, it’s not that good.

  146. storage checking alerting collection graphing aggregation

  147. What can we do better?

  148. I’m Lindsay @auxesis

  149. Dank je wel! (Thank you!)

  150. Dank je wel! Liked the talk? Let @auxesis know.