Tips & Tricks to Reduce TTR for the Next Incident

j.hand
March 24, 2015

Resolving an incident can be a complex process that takes a lot of time and many people. According to the 2014 State of On-Call Report, most teams say it takes 10-30 minutes to resolve an incident, and on average 5 people are needed to help with the resolution.

But it doesn’t have to be that way. In this presentation, Jason Hand will present best practices and tips for surviving every stage of the firefight - from when an alert comes in to pulling reports after it’s over. Join us to see how we do it at VictorOps.

Transcript

  1. Jason Hand – DevOps Evangelist (@jasonhand)
     Tips & Tricks to Reduce TTR for the Next Incident
  2. Time to Resolution (TTR)
     • The total amount of time taken to resolve an incident
     • MTTR – Mean Time To Resolution*
       – summary over time
       – measurement used to describe the most "typical" value in a set of values
       – the lower the better
     *Resolve = Repair = Recover
  3. Alerting
     • A “zero time” alerting platform to find people instantly can only really affect average TTR by a very small percentage
     • Notify on-call members
  4. Victor’s Tips
     “Include useful content & context in the alerts.”
     “Use custom notifications to distinguish critical alerts.”
  5. Victor’s Tips
     “Get the right alerts to the right people through routing.”
     “Establish a single source of truth for all activities of an incident.”
  6. Investigation
     • Log in
     • Check the logs
     • Analyze metrics
     • Review wikis
     • Discuss w/ team
  7. Victor’s Tips
     “Collaborate & Share.”
     “Connect with the right resources and team members.”
  8. Resolution
     • Self-documenting what teams do to solve the problem
     • Bidirectional integration with your favorite chat client and the VictorOps timeline
     • Team members performing system actions to fix the problem(s)
  9. “Conduct (blameless) post-mortems.”
     “Be vocal & share what is taking place.”
     “Provide quick access to accurate metrics & runbooks.”
     “Collaborate & Share.”
     “Connect with the right resources and team members.”
     “Get the right alerts to the right people through routing.”
     “Establish a single source of truth for all activities of an incident.”
     “Include useful content & context in the alerts.”
     “Use custom notifications to distinguish critical alerts.”
  10. Jason Hand – DevOps Evangelist (@jasonhand)
      Tips & Tricks to Reduce TTR for the Next Incident
      Thank You
      [email protected]
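
Slide 2 defines TTR as the total time taken to resolve an incident and MTTR as a summary of those times. As a rough illustration of that arithmetic (not something shown in the deck), the sketch below computes both from a list of opened/resolved timestamps. The function names and the three sample incidents are made up for the example.

```python
from datetime import datetime, timedelta
from statistics import mean


def time_to_resolution(opened: datetime, resolved: datetime) -> timedelta:
    """TTR: total time between an incident opening and being resolved."""
    if resolved < opened:
        raise ValueError("incident resolved before it was opened")
    return resolved - opened


def mean_time_to_resolution(incidents) -> timedelta:
    """MTTR: arithmetic mean of the TTR values for a set of incidents."""
    seconds = [time_to_resolution(o, r).total_seconds() for o, r in incidents]
    if not seconds:
        raise ValueError("need at least one incident to compute MTTR")
    return timedelta(seconds=mean(seconds))


# Hypothetical sample data (opened, resolved); not taken from the deck.
incidents = [
    (datetime(2015, 3, 1, 9, 0), datetime(2015, 3, 1, 9, 12)),     # 12 minutes
    (datetime(2015, 3, 8, 22, 30), datetime(2015, 3, 8, 22, 48)),  # 18 minutes
    (datetime(2015, 3, 15, 3, 5), datetime(2015, 3, 15, 3, 35)),   # 30 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 0:20:00
```

Tracking this number over successive reporting periods is one way to see whether it is trending lower, which slide 2 names as the goal.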