5 Years of Metrics & Monitoring

Video of this talk from DevOpsDays Ghent: http://www.ustream.tv/recorded/54694069

---

5 years ago, monitoring was just beginning to emerge from the dark ages.

Since then there's been a Cambrian explosion of tools, a rough formalisation of how the tools should be strung together, the emergence of the #monitoringsucks meme, the transformation of #monitoringsucks into #monitoringlove, and the rise of a sister community around Monitorama.

Alert fatigue has entered the devops consciousness, and shops further along the monitoring continuum are analysing their alerting data to help humans and machines work better together.

But Nagios is still the dominant check executor. Plenty of sites still use RRDtool. And plenty of people are still chained to their pagers, with no relief in sight.

What's holding us back? What will the next 5 years look like? Will we still be using Nagios? Have we misjudged our audience? What are our biggest challenges?

### Sources ###

Font: http://www.fontsquirrel.com/fonts/sketchetica
The Gospel of Graphs, according to Cleveland: http://www.amazon.com/Elements-Graphing-Data-William-Cleveland/dp/0963488414

Lindsay Holmwood

October 27, 2014

Transcript

  1. 5 Years of
    Metrics &
    Monitoring
    Lindsay Holmwood
    @auxesis

  2. Cultural &
    Technical


  3. Key retrospective questions

    What did we do well?

    What did we learn?

    What should we do differently next time?

    What still puzzles us?

  4. What got us here
    won’t get us there

  5. What did we do well?
    (that if we don’t talk about, we might forget)

  6. The Pipeline

  8. collection

  9. storage
    collection

  10. storage checking
    collection

  11. storage checking alerting
    collection

  12. storage checking alerting
    collection
    graphing

  13. storage checking alerting
    collection
    graphing
    aggregation

  15. collection storage checking alerting
    graphing
    aggregation

  16. collection storage checking alerting
    graphing
    aggregation
    collectd &
    statsd

  17. collection storage checking alerting
    graphing
    aggregation
    Graphite &
    OpenTSDB &
    InfluxDB

  18. collection storage checking alerting
    graphing
    aggregation
    Riemann

  19. Alert fatigue
    has become a
    recognised
    problem

  20. Cottage industry

  21. PagerDuty &
    VictorOps &
    OpsGenie

  22. #monitoringsucks

  23. #monitoringlove

  25. What would we do
    differently next time?

  26. Graphs &
    Dashboards

  27. Apparently the hardest
    problem in monitoring is
    graphing and dashboarding.

  28. What we’re doing
    wrong

  29. Strip charts

  33. We have a problem

  34. Strip charts: the PHP hammer of graphing

  35. What can the
    data tell us?

  36. What is the
    distribution?

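One way to ask "what is the distribution?" of a metric, instead of reading it off a strip chart, is to summarise the samples with percentiles and a histogram. The following is a minimal sketch, not something shown in the talk: the function names `summarise` and `histogram` and the sample latencies are invented for illustration.

```python
import statistics


def summarise(samples):
    """Return min, median, p95, p99 and max for a list of numeric samples."""
    ordered = sorted(samples)
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "min": ordered[0],
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": ordered[-1],
    }


def histogram(samples, bucket_width):
    """Count samples per fixed-width bucket, keyed by the bucket's lower bound."""
    counts = {}
    for value in samples:
        lower = int(value // bucket_width) * bucket_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical response times in milliseconds: mostly fast, with one slow outlier.
latencies_ms = [12, 14, 15, 15, 16, 18, 19, 21, 22, 250]
```

For these made-up samples the mean is 40.2 ms while the median is 17.0 ms, and the histogram puts nine of the ten samples below 30 ms. A single averaged line would hide that shape.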

  37. It’s not a problem
    with the tools

  38. Our approach
    is tainted

  39. graphing problems we have
    graphing problems serviced by strip charts

  41. Basic graph layout

  42. Black on white

  43. bounding box with
    x + y axes labels

  44. Colour

  45. Differential
    colour engine

  47. Maximum of 15
    colours on-screen

  48. 8%

  49. Adjust saturation,
    not hue

  50. Use minimal hue
    to call out data

  52. Fucking Pie Charts

  54. Experiment:
    Compare segment sizes

  59. “This allows us to see very clearly that the pie
    chart judgements are less accurate than the
    bar chart judgements.”
    – William S. Cleveland, The Elements of Graphing Data, p. 86

  60. Pie chart comparisons
    are more error prone

  61. The only time you should use a pie chart:
    Pie eaten
    Pie not eaten

  63. What did we learn?

  64. Democratisation of
    graphing tool
    development

  65. Scratch our itches

  66. Same poor UX,
    better paint job

  69. We get the graphing
    tools we deserve

  70. Nagios is
    here to stay
    (at least for ops)

  71. Inertia

  72. No
    strong, compelling
    alternative

  73. Sensu

  74. When I hear people say
    “I'm not using Sensu because it's too complex”
    I think
    “and Nagios isn't hiding the same complexity from you?”

  75. This is a problem

  77. We don’t know stats

  78. storage checking alerting
    collection
    graphing
    aggregation

  79. storage checking alerting
    collection
    graphing
    aggregation
    checks

  80. Numbers &
    Strings &
    Behaviour

  81. Numbers

  82. Fault detection
    (thresholding)

  83. Anomaly detection
    (trend analysis)

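The difference between the two kinds of numeric check can be sketched in a few lines. This is an illustration under stated assumptions, not the method from the talk: `threshold_fault` and `is_anomaly` are hypothetical names, the disk usage samples are invented, and a simple standard-deviation test stands in for real trend analysis.

```python
import statistics


def threshold_fault(value, warning, critical):
    """Fault detection: compare one sample against fixed limits."""
    if value >= critical:
        return "CRITICAL"
    if value >= warning:
        return "WARNING"
    return "OK"


def is_anomaly(history, value, max_sigma=3.0):
    """Anomaly detection: flag a sample that strays too far from recent history."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is unusual.
        return value != mean
    return abs(value - mean) / spread > max_sigma


# Hypothetical disk usage percentages from the last eight samples.
disk_used_pct = [61, 62, 61, 63, 62, 62, 61, 63]
```

With these samples, a reading of 75% is still "OK" against an 80/90 threshold, yet `is_anomaly(disk_used_pct, 75)` flags it because it is far outside the recent range. The two checks answer different questions.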

  84. Monitoring is
    CI for
    Production

  85. Continuous Integration

  86. 1. checkout
    Continuous Integration

  87. 1. checkout
    2. build
    Continuous Integration

  88. 1. checkout
    2. build
    3. test
    Continuous Integration

  89. 1. checkout
    2. build
    3. test
    4. notify
    Continuous Integration

  90. 1. checkout
    2. build
    3. test
    4. notify
    Continuous Integration
    Monitoring

  91. 1. checkout
    2. build
    3. test
    4. notify
    can I see my app?
    Continuous Integration
    Monitoring

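The "can I see my app?" step can be written as a check that behaves like a test: it runs, reaches a verdict, and reports it. Below is a rough sketch that follows the Nagios plugin convention of exit codes 0 to 3. The function names, the status-to-severity mapping, and the injectable `fetch` parameter are assumptions made for this example, not part of any tool named in the talk.

```python
import urllib.error
import urllib.request

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_status(url, timeout=5):
    """Return the HTTP status code for url; connection failures raise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success status.
        return err.code


def check_app(url, fetch=fetch_status):
    """Run the check and return (exit_code, one-line message)."""
    try:
        status = fetch(url)
    except Exception as err:
        return CRITICAL, f"CRITICAL: cannot reach {url}: {err}"
    if status == 200:
        return OK, f"OK: {url} returned 200"
    if 500 <= status < 600:
        return CRITICAL, f"CRITICAL: {url} returned {status}"
    return WARNING, f"WARNING: {url} returned {status}"
```

A wrapper script would print the message and call `sys.exit()` with the code. Passing `fetch` as a parameter lets the check itself be exercised in CI without touching the network.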


  92. serverspec &

    sensu

  94. What still puzzles us?
    (or, what might the future look like?)

  95. The future is
    analysing &
    acting on our
    alert data

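As a small illustration of what analysing alert data could mean in practice, the sketch below counts which checks page most often and what share of pages land outside working hours. The record layout (`check` and `at` keys), the function names, and the 09:00 to 18:00 working day are all assumptions for this example; it is not how any particular tool stores or reports alerts.

```python
from collections import Counter
from datetime import datetime


def noisiest_checks(alerts, top=3):
    """Return the checks that alerted most often, as (check, count) pairs."""
    return Counter(alert["check"] for alert in alerts).most_common(top)


def out_of_hours_share(alerts, start_hour=9, end_hour=18):
    """Return the fraction of alerts raised outside working hours."""
    if not alerts:
        return 0.0
    outside = sum(
        1 for alert in alerts if not start_hour <= alert["at"].hour < end_hour
    )
    return outside / len(alerts)


# Hypothetical alert log for one on-call shift.
alerts = [
    {"check": "disk_space", "at": datetime(2014, 10, 27, 3, 12)},
    {"check": "disk_space", "at": datetime(2014, 10, 27, 4, 40)},
    {"check": "http_latency", "at": datetime(2014, 10, 27, 11, 5)},
    {"check": "disk_space", "at": datetime(2014, 10, 27, 23, 59)},
]
```

For this invented shift, `disk_space` accounts for three of the four pages and three quarters of all pages arrived out of hours, which is the kind of number that makes the human cost of a noisy check visible.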


  96. Last 5 years

    Building new tools

    Formalising relationships

    Search for parallels in other industries

    Measuring the human impact


  97. Next

    Stabilisation of tools

    Emerging standards

    Exploiting parallels

    Mitigating the human impact

  98. Analysis:
    Ops Weekly

  101. Context:
    Nagios Herald

  104. The future is
    richer metadata
    about our metrics

  105. Metrics 2.0

  106. {
      server: dfs1
      what: diskspace
      mountpoint: srv/node/dfs10
      unit: B
      type: used
      metric_type: gauge
    }
    meta: {
      agent: diamond,
      processed_by: statsd2
    }

  107. Self-describing

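To make "self-describing" concrete, here is a minimal sketch that holds the metric from the slide as plain tag dictionaries and works with the tags directly instead of parsing a dotted name. The `tags`/`meta` split mirrors the slide. The helper names `metric_id` and `select`, and the choice to leave `meta` out of the identity, are assumptions for illustration rather than anything defined by Metrics 2.0.

```python
def metric_id(metric):
    """Build an order-independent identity string from the tags only."""
    return ",".join(
        f"{key}={value}" for key, value in sorted(metric["tags"].items())
    )


def select(metrics, **wanted):
    """Return the metrics whose tags match every key=value pair in wanted."""
    return [
        metric
        for metric in metrics
        if all(metric["tags"].get(key) == value for key, value in wanted.items())
    ]


# The example from the slide, split into identifying tags and extra metadata.
disk_metric = {
    "tags": {
        "server": "dfs1",
        "what": "diskspace",
        "mountpoint": "srv/node/dfs10",
        "unit": "B",
        "type": "used",
        "metric_type": "gauge",
    },
    "meta": {"agent": "diamond", "processed_by": "statsd2"},
}
```

Because every dimension is a named tag, a query such as `select(metrics, what="diskspace", unit="B")` needs no knowledge of how a metric path happens to be laid out.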

  108. The future is
    richer metadata
    about our metrics

  109. The future is
    richer metadata
    about our metrics
    to automatically build
    appropriate
    visualisations


  110. Aggregation &

    Grouping &

    Unit conversions &

    Scaling &

    Axes labelling &


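The items above are the sort of work a graphing tool could do unaided once metrics carry a unit and descriptive tags. The sketch below shows three of them (grouping with aggregation, scaling, and axis labelling) driven purely by tags. All function names and sample values are hypothetical, and the binary-prefix scaling is one possible choice among several.

```python
def sum_by(points, tag):
    """Aggregate values by one tag, e.g. total bytes used per server."""
    totals = {}
    for tags, value in points:
        totals[tags[tag]] = totals.get(tags[tag], 0) + value
    return totals


def scale_bytes(value):
    """Scale a byte count to the largest binary prefix that fits."""
    prefixes = ["B", "KiB", "MiB", "GiB", "TiB"]
    scaled = float(value)
    for prefix in prefixes:
        if scaled < 1024 or prefix == prefixes[-1]:
            return scaled, prefix
        scaled /= 1024


def axis_label(tags):
    """Derive a y-axis label from the metric's own tags."""
    return f"{tags['what']} {tags['type']} ({tags['unit']})"


# Hypothetical data points: (tags, value in bytes).
points = [
    ({"server": "dfs1", "what": "diskspace", "type": "used", "unit": "B"}, 1024),
    ({"server": "dfs1", "what": "diskspace", "type": "used", "unit": "B"}, 512),
    ({"server": "dfs2", "what": "diskspace", "type": "used", "unit": "B"}, 2048),
]
```

Here `sum_by(points, "server")` totals usage per server, `scale_bytes(1536)` turns the `dfs1` total into 1.5 KiB, and `axis_label` produces "diskspace used (B)" without anyone typing a label by hand.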

  111. Death to strip charts

  112. The future is
    monitoring tools
    for devs

  113. Ops must be enablers,
    not gatekeepers

  114. What has made sense
    about ops being
    gatekeepers?

  115. Monitoring is treated
    as an operational
    responsibility

  116. Ops team
    own ops

  117. We’ve won
    the battles

  118. Ops team
    own ops

  119. This is
    no longer
    the world
    we live in

  120. How do we
    become enablers?

  121. Technical
    &
    Cultural


  123. Technical


  124. Technical

    Ops provide the platform


  125. Technical

    Ops provide the platform

    Maintain, monitor, and scale the platform

  126. — Adrian Cockcroft


  128. Cultural


  129. Cultural

    Coach on what makes a good check

    Coach on what is good alert design

    Listen to the needs of the end-user

  130. Provide monitoring
    as a service

  131. Monitoring is a
    core deliverable
    on every service

  132. Ship checks & config
    with your applications

  133. Example: Yelp

  135. What’s the
    barrier
    to entry?

  136. Does the idea just
    not have traction?

  137. Are the tools
    not up to scratch?

  138. Does monitoring need to be
    SaaS (or SaaS-like) to make
    this achievable at scale?

  139. “The future is here – it’s just
    not very evenly distributed”
    – William Gibson

  140. Monitoring is
    still insular

  141. We’re building tools
    for operations teams

  142. Not the developers
    who need them most

  144. Monitoring is like a joke.

  145. Monitoring is like a joke.
    If you have to explain it,
    it’s not that good.

  146. storage checking alerting
    collection
    graphing
    aggregation

  147. What can we
    do better?

  148. I’m Lindsay
    @auxesis

  149. Dank je wel!

  150. Dank je wel!
    Liked the talk? Let @auxesis know.
