
How post-mortems can close the loop on IT metrics

j.hand
August 19, 2015

The point of a postmortem is to learn. In this presentation, Jason Hand demonstrates how postmortems also help close the (OODA) loop on IT metrics. The presentation was originally a webinar; you can view and hear the recording here: https://victorops.com/knowledge-drop/webinars/post-mortems-can-close-loop-metrics/





  1. How Postmortems can close the loop on IT metrics @jasonhand

  2. Jason Hand, VictorOps DevOps evangelist @jasonhand

  3. automate, measure, share, learn

  4. automate

  5. measure

  6. share

  7. learn

  8. increase feedback: Shortening feedback loops leads to...

  9. continuous... integration (of code), deployment (of software/product), and ...

  10. improvement

  11. OODA loop: Observe, Orient, Decide, Act (by John Boyd)

  12. why we measure

  13. observe & orient

  14. which leads to deciding & acting

  15. What metrics should we be looking at? CPU, Memory, Network, and Disk metrics (duh)

  16. but wait! there's more

  17. look between the spaces

  18. understand what we're building

  19. between the spaces, look for metrics to help: - weed out edge case scenarios - confirm assumptions

  20. should we?

  21. actually.. more like..

  22. collect all the things

  23. "We don't know what the context is and we don't know what may be interesting to us in the future. But if something goes down and we don't have metrics for it, we have no perspective... and that's probably the worst-case scenario." - Jason Dixon (librato.com & monitorama), The Ship Show (podcast): Episode 56

  24. Fact: measuring and looking at data all the time isn't that helpful. But: when you need it to understand a problem, you'll want it then.

  25. during an incident, leverage metrics to observe & orient to help you decide & take action

  26. all to help: - triage - investigate - identify what's happening.

  27. why do we do postmortems?

  28. Learn... which leads to decisions and action

  29. get your story straight: understand the story of what took place following an incident.

  30. accountability & empathy: By sharing an accurate account and attempting to understand and empathize around exactly what took place, teams can learn from that incident and improve their processes.

  31. how do we do postmortems

  32. capture everything and in one place

  33. victorops timeline

  34. chat: slack, hipchat

  35. blameless: remove blame and go after the facts

  36. The idea is to learn

  37. punish

  38. remediation items: "Learning from a postmortem is only as useful as what you put into practice afterwards and we realized that without any action items after the meeting, it was more or less just a Greek Senate debate" - Ben VanEvery (box.com)

  39. assign ownership: it's not about sharing feelings and theories without accomplishing anything

  40. tie it with a bow: Now you have a very accurate story

  41. a story telling us ...

  42. monitoring data

  43. who was alerted

  44. how quickly they responded

  45. who was involved throughout the incident management lifecycle

  46. conversations that were had

  47. commands that were run (i.e. ChatOps)

  48. context

  49. Memorialized: - monitoring data - alerts - acknowledgments - context (graphs, logs, runbooks, notes) - actions - conversations - remediation

  50. loop closed: We've taken the important monitoring data and metrics .. start back over

  51. Jason Hand @jasonhand @victorops

  52. ChatOps For Dummies: jhand.co/ChatOps4Dummies
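
The "story" that slides 42 to 49 describe (monitoring data, who was alerted, how quickly they responded, who was involved, conversations, commands, remediation) can be pictured as one record captured in one place. The following is only a minimal sketch under stated assumptions: `IncidentEvent`, `PostmortemRecord`, the `kind` values, and the sample names are all hypothetical, chosen for illustration, and do not represent the VictorOps timeline or any real tool's schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentEvent:
    """One timeline entry. The `kind` values are illustrative, not a real schema."""
    at: datetime
    kind: str          # assumed values: "alert", "ack", "chat", "command", "note"
    actor: str         # person or system that produced the entry
    detail: str = ""


@dataclass
class PostmortemRecord:
    """Hypothetical container for everything captured about one incident."""
    title: str
    events: List[IncidentEvent] = field(default_factory=list)
    remediation_items: List[str] = field(default_factory=list)

    def timeline(self) -> List[IncidentEvent]:
        # The story of the incident, in chronological order.
        return sorted(self.events, key=lambda e: e.at)

    def time_to_acknowledge(self) -> Optional[timedelta]:
        # "How quickly they responded": first alert to first acknowledgment.
        alerts = [e.at for e in self.events if e.kind == "alert"]
        if not alerts:
            return None
        first_alert = min(alerts)
        acks = [e.at for e in self.events
                if e.kind == "ack" and e.at >= first_alert]
        if not acks:
            return None
        return min(acks) - first_alert

    def people_involved(self) -> List[str]:
        # Everyone who acted during the incident, excluding the alert source.
        return sorted({e.actor for e in self.events if e.kind != "alert"})


def example_record() -> PostmortemRecord:
    # Made-up sample data, deliberately entered out of order.
    day = datetime(2015, 8, 19)
    return PostmortemRecord(
        title="example incident",
        events=[
            IncidentEvent(day.replace(hour=10, minute=6), "chat", "bob", "seeing errors"),
            IncidentEvent(day.replace(hour=10, minute=0), "alert", "monitoring", "disk full"),
            IncidentEvent(day.replace(hour=10, minute=9), "command", "alice", "df -h"),
            IncidentEvent(day.replace(hour=10, minute=4), "ack", "alice"),
        ],
        remediation_items=["add disk usage alert threshold (owner: alice)"],
    )
```

Calling `example_record().time_to_acknowledge()` on this sample data yields a four-minute `timedelta`. Numbers like that feed the next round of observing and orienting, which is the sense in which the postmortem closes the loop.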