5 Years of Metrics & Monitoring

Video of this talk from DevOpsDays Ghent: http://www.ustream.tv/recorded/54694069

---

5 years ago, monitoring was just beginning to emerge from the dark ages.

Since then there's been a Cambrian explosion of tools, a rough formalisation of how the tools should be strung together, the emergence of the #monitoringsucks meme, the transformation of #monitoringsucks into #monitoringlove, and the rise of a sister community around Monitorama.

Alert fatigue has entered the devops consciousness, and shops further along the monitoring continuum are analysing their alerting data to help humans and machines work better together.

But Nagios is still the dominant check executor. Plenty of sites still use RRDtool. And plenty of people are still chained to their pagers, with no relief in sight.

What's holding us back? What will the next 5 years look like? Will we still be using Nagios? Have we misjudged our audience? What are our biggest challenges?

### Sources ###

Font: http://www.fontsquirrel.com/fonts/sketchetica
The Gospel of Graphs, according to Cleveland: http://www.amazon.com/Elements-Graphing-Data-William-Cleveland/dp/0963488414

Lindsay Holmwood

October 27, 2014

Transcript

  1. 5 Years of
    Metrics &
    Monitoring
    Lindsay Holmwood
    @auxesis

  2. Cultural &
    Technical

  3. Key retrospective questions

    What did we do well?

    What did we learn?

    What should we do differently next time?

    What still puzzles us?

  4. What got us here
    won’t get us there

  5. What did we do well?
    (that if we don’t talk about, we might forget)

  6. The Pipeline

  7. storage
    collection

  8. storage checking
    collection

  9. storage checking alerting
    collection

  10. storage checking alerting
    collection
    graphing

  11. storage checking alerting
    collection
    graphing
    aggregation

  12. collection storage checking alerting
    graphing
    aggregation

  13. collection storage checking alerting
    graphing
    aggregation
    collectd &
    statsd

  14. collection storage checking alerting
    graphing
    aggregation
    Graphite &
    OpenTSDB &
    InfluxDB

  15. collection storage checking alerting
    graphing
    aggregation
    Riemann

  16. Alert fatigue
    has become a
    recognised
    problem

  17. Cottage industry

  18. PagerDuty &
    VictorOps &
    OpsGenie

  19. #monitoringsucks

  20. #monitoringlove

  21. What would we do
    differently next time?

  22. Graphs &
    Dashboards

  23. Apparently the hardest
    problem in monitoring is
    graphing and dashboarding.

  24. What we’re doing
    wrong

  25. Strip charts

  26. We have a problem

  27. Strip charts: the PHP hammer of graphing

  28. What can the
    data tell us?

  29. What is the
    distribution?

  30. It’s not a problem
    with the tools

  31. Our approach
    is tainted

  32. graphing problems we have
    graphing problems serviced by strip charts

  33. Basic graph layout

  34. Black on white

  35. bounding box with
    x + y axes labels
    [Figure: example plot frame; x-axis ticks 1 to 5, y-axis ticks 1, 3, 5]

  36. Differential
    colour engine

  37. Maximum of 15
    colours on-screen

  38. Adjust saturation,
    not hue

  39. Use minimal hue
    to call out data

  40. Fucking Pie Charts

  41. Experiment:
    Compare segment sizes

  42. "This allows us to see very clearly that the pie
    chart judgements are less accurate than the
    bar chart judgements."
    – William S. Cleveland, The Elements of Graphing Data, p. 86

  43. Pie chart comparisons
    are more error prone

  44. The only time you should use a pie chart
    [Pie chart segments: "Pie eaten", "Pie not eaten"]

  45. What did we learn?

  46. Democratisation of
    graphing tool
    development

  47. Scratch our itches

  48. Same poor UX,
    better paint job

  49. We get the graphing
    tools we deserve

  50. Nagios is
    here to stay
    (at least for ops)

  51. No
    strong, compelling
    alternative

  52. When I hear people say
    “I'm not using Sensu because it's too complex”
    I think
    “and Nagios isn't hiding the same complexity from you?”

  53. This is a problem

  54. We don’t know stats

  55. storage checking alerting
    collection
    graphing
    aggregation

  56. storage checking alerting
    collection
    graphing
    aggregation
    checks

  57. Numbers &
    Strings &
    Behaviour

  58. Fault detection
    (thresholding)

  59. Anomaly detection
    (trend analysis)

  60. Monitoring is
    CI for
    Production

  61. Continuous Integration

  62. 1. checkout
    Continuous Integration

  63. 1. checkout
    2. build
    Continuous Integration

  64. 1. checkout
    2. build
    3. test
    Continuous Integration

  65. 1. checkout
    2. build
    3. test
    4. notify
    Continuous Integration

  66. 1. checkout
    2. build
    3. test
    4. notify
    Continuous Integration
    Monitoring

  67. 1. checkout
    2. build
    3. test
    4. notify
    can I see my app?
    Continuous Integration
    Monitoring

  68. serverspec &

    sensu

  69. What still puzzles us?
    (or, what might the future look like?)

  70. The future is
    analysing &
    acting on our
    alert data

  71. Last 5 years

    Building new tools

    Formalising relationships

    Search for parallels in other industries

    Measuring the human impact

  72. Next

    Stabilisation of tools

    Emerging standards

    Exploiting parallels

    Mitigating the human impact

  73. Analysis:
    Ops Weekly

  74. Context:
    Nagios Herald

  75. The future is
    richer metadata
    about our metrics

  76. {
      server: dfs1
      what: diskspace
      mountpoint: srv/node/dfs10
      unit: B
      type: used
      metric_type: gauge
    }
    meta: {
      agent: diamond,
      processed_by: statsd2
    }

  77. Self-describing

  78. The future is
    richer metadata
    about our metrics

  79. The future is
    richer metadata
    about our metrics
    to automatically build
    appropriate
    visualisations

  80. Aggregation &

    Grouping &

    Unit conversions &

    Scaling &

    Axes labelling &


  81. Death to strip charts

  82. The future is
    monitoring tools
    for devs

  83. Ops must be enablers,
    not gatekeepers

  84. What has made sense
    about ops being
    gatekeepers?

  85. Monitoring is treated
    as an operational
    responsibility

  86. Ops team
    own ops

  87. We’ve won
    the battles

  88. Ops team
    own ops

  89. This is
    no longer
    the world
    we live in

  90. How do we
    become enablers?

  91. Technical
    &
    Cultural

  92. Technical

  93. Technical

    Ops provide the platform

  94. Technical

    Ops provide the platform

    Maintain, monitor, and scale the platform

  95. — Adrian Cockcroft

  96. Cultural

    Coach on what makes a good check

    Coach on what is good alert design

    Listen to the needs of the end-user

  97. Provide monitoring
    as a service

  98. Monitoring is a
    core deliverable
    on every service

  99. Ship checks & config
    with your applications

  100. Example: Yelp

  101. What’s the
    barrier
    to entry?

  102. Does the idea just
    not have traction?

  103. Are the tools
    not up to scratch?

  104. Does monitoring need to be
    SaaS (or SaaS-like) to make
    this achievable at scale?

  105. "The future is here – it’s just
    not very evenly distributed"
    – William Gibson

  106. Monitoring is
    still insular

  107. We’re building tools
    for operations teams

  108. Not the developers
    who need them most

  109. Monitoring is like a joke.

  110. Monitoring is like a joke.
    If you have to explain it,
    it’s not that good.

  111. storage checking alerting
    collection
    graphing
    aggregation

  112. What can we
    do better?

  113. I’m Lindsay
    @auxesis

  114. Dank je wel!

  115. Dank je wel!
    Liked the talk? Let @auxesis know.
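
Slides 76 to 80 describe self-describing metrics whose metadata could drive grouping, unit conversion, scaling and axis labelling. A minimal Python sketch of that idea follows, under stated assumptions: the tag names and values are copied from slide 76, while `scale_bytes`, `axis_label` and `group_by` are hypothetical helpers invented here for illustration. They are not part of diamond, statsd or any other tool named in the talk.

```python
from collections import defaultdict

# Tags copied from slide 76. Everything below them is an illustrative assumption.
metric = {
    "server": "dfs1",
    "what": "diskspace",
    "mountpoint": "srv/node/dfs10",
    "unit": "B",
    "type": "used",
    "metric_type": "gauge",
}
meta = {"agent": "diamond", "processed_by": "statsd2"}

BINARY_PREFIXES = ["", "Ki", "Mi", "Gi", "Ti", "Pi"]


def scale_bytes(value):
    """Scale a byte count to the largest binary prefix that keeps it >= 1."""
    scaled = float(value)
    index = 0
    while scaled >= 1024 and index < len(BINARY_PREFIXES) - 1:
        scaled /= 1024
        index += 1
    return scaled, BINARY_PREFIXES[index] + "B"


def axis_label(tags):
    """Build a y-axis label from the metric's own tags (hypothetical format)."""
    return f"{tags['what']} {tags['type']} ({tags['unit']})"


def group_by(metrics, tag):
    """Group metric tag dictionaries by the value of one tag."""
    groups = defaultdict(list)
    for tags in metrics:
        groups[tags[tag]].append(tags)
    return dict(groups)


if __name__ == "__main__":
    value, unit = scale_bytes(5 * 1024 ** 3)
    print(axis_label(metric))
    print(f"{value:.1f} {unit}")
```

Because the unit and type travel with the metric, a graphing tool could in principle pick the label and scale without per-graph configuration, which is the point slide 79 makes.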

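
Slides 58 and 59 contrast fault detection (thresholding) with anomaly detection (trend analysis). The Python sketch below illustrates the difference under loud assumptions: the z-score test is a simple stand-in for trend analysis, not a method the talk describes, and the function names, sample values and three-sigma default are all hypothetical.

```python
from statistics import mean, stdev


def threshold_breached(value, limit):
    """Fault detection: compare a reading against a fixed limit."""
    return value > limit


def is_anomalous(history, value, sigmas=3.0):
    """Anomaly detection sketch: flag a reading far from recent behaviour.

    Uses a z-score against the sample mean and standard deviation of
    `history`; real trend analysis would usually account for seasonality.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    centre = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != centre
    return abs(value - centre) / spread > sigmas


if __name__ == "__main__":
    cpu_history = [50, 52, 49, 51, 50]  # made-up sample readings
    reading = 70
    print("threshold breached:", threshold_breached(reading, limit=90))
    print("anomalous:", is_anomalous(cpu_history, reading))
```

With these made-up readings, a value of 70 stays under a static limit of 90 yet sits far outside recent behaviour, which is the kind of case a fixed threshold cannot see.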