
Lightning Talk - Tips & Tricks to Reducing MTTR


Lightning talk given at Austin DevOpsDays 2015 on simple tricks for reducing MTTR during incident management

j.hand

May 05, 2015


Transcript

  1. Time to Resolution (TTR)
     • The total amount of time taken to resolve an incident
     • MTTR – Mean Time To Resolution*
       – summary over time
       – measurement used to describe the most "typical" value in a set of values
       – the lower the better
     * Resolve = Repair = Recover
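As a rough illustration of the definition on slide 1, MTTR can be computed as the arithmetic mean of per-incident TTR values. The sketch below assumes each incident is recorded as a hypothetical (opened, resolved) timestamp pair. The function names and the sample data are invented for illustration and do not come from the talk.

```python
from datetime import datetime, timedelta


def time_to_resolution(opened: datetime, resolved: datetime) -> timedelta:
    """TTR: total time taken to resolve a single incident."""
    return resolved - opened


def mean_time_to_resolution(incidents) -> timedelta:
    """MTTR: arithmetic mean of TTR over a collection of incidents."""
    ttrs = [time_to_resolution(opened, resolved) for opened, resolved in incidents]
    if not ttrs:
        raise ValueError("need at least one resolved incident")
    return sum(ttrs, timedelta()) / len(ttrs)


# Hypothetical incident records (opened, resolved); not data from the talk.
incidents = [
    (datetime(2015, 5, 1, 9, 0), datetime(2015, 5, 1, 9, 30)),    # 30 minutes
    (datetime(2015, 5, 2, 14, 0), datetime(2015, 5, 2, 15, 30)),  # 90 minutes
    (datetime(2015, 5, 3, 22, 0), datetime(2015, 5, 3, 23, 0)),   # 60 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

A mean is sensitive to outliers, so a single very long incident can dominate the figure; tracking the median alongside it is one possible complement.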
  2. Alerting
     Notify on-call members
     The smallest portion of time is spent “being alerted to the problem”
  3. Victor’s Tips
     “Include useful content & context in the alerts.”
     “Use custom notifications to distinguish critical alerts.”
  4. Triage
     Gain an initial understanding of what’s happening and assign degrees of urgency to incidents
     “I know there’s a problem but I have no idea who or what is affected.”
     Timeline (single source of truth), Intelligent Routing
  5. Victor’s Tips
     “Get the right alerts to the right people through routing.”
     “Establish a single source of truth for all activities of an incident.”
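One way to read the two tips on slide 5 is as a lookup from alert attributes to an on-call team, with every routing decision appended to a shared timeline. The following is a minimal sketch under that assumption. The `Alert` fields, the routing table, the team names, and the `timeline` list are all hypothetical and do not represent any product's API.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    severity: str
    message: str


# Hypothetical routing table: (service, severity) -> team to notify.
ROUTES = {
    ("database", "critical"): "dba-oncall",
    ("database", "warning"): "dba-queue",
    ("web", "critical"): "web-oncall",
}
DEFAULT_ROUTE = "ops-triage"  # hypothetical fallback when no rule matches

# A shared, append-only log standing in for the "single source of truth".
timeline = []


def route(alert: Alert) -> str:
    """Pick the team for an alert, falling back to the default route."""
    return ROUTES.get((alert.service, alert.severity), DEFAULT_ROUTE)


def dispatch(alert: Alert) -> str:
    """Route an alert and record the decision on the shared timeline."""
    team = route(alert)
    timeline.append(
        f"{alert.severity.upper()} {alert.service}: {alert.message} -> {team}"
    )
    return team


print(dispatch(Alert("database", "critical", "replication lag")))  # dba-oncall
```

Keeping the routing rules in data rather than in branching code makes them easy to review and change, which is the main reason for the table-driven shape here.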
  6. Investigation
     • Log in
     • Check the logs
     • Analyze metrics
     • Review wikis
     • Discuss w/ team
     The largest share of TTR, a full 40%
  7. Identification
     “Everything will be better if I fix this one thing.”
     Know where to find your runbooks, checklists, wikis, or steps to work through a problem
  8. Resolution
     Self-documenting what teams do to solve the problem
     Team members performing system actions to fix the problem(s)
     #ChatOps
     Communicate! Communicate! Communicate!
  9. Tips & Tricks to Reduce TTR for the Next Incident
     Summary
     “Connect with the right resources and team members.”
     “Get the right alerts to the right people through routing.”
     “Establish a single source of truth for all activities of an incident.”
     “Include useful content & context in the alerts.”
     “Use custom notifications to distinguish critical alerts.”
  10. Tips & Tricks to Reduce TTR for the Next Incident
      Summary
      “Conduct (blameless) post-mortems.”
      “Be vocal & share what is taking place.”
      “Provide quick access to accurate metrics & runbooks.”
      “Collaborate & Share.”
  11. Tips & Tricks to Reduce TTR for the Next Incident
      Jason Hand – DevOps Evangelist
      @jasonhand
      Thank You
      [email protected]