Don't Burn Out the Night (SREcon 2016)

It’s easy for engineers to overcommit, especially to On Call. Let’s build an On Call service that keeps people fresh and in it for the long haul.

In this session, Dave Dash will share lessons learned from Mozilla, Pinterest, and operations consulting on implementing a humane On Call service.

Dave Dash

April 07, 2016

Transcript

  1. DON’T BURN OUT THE NIGHT DAVE DASH

  2. DAVE DASH ▸ Software engineer: del.icio.us, Mozilla ▸ Formerly early operations engineer at Pinterest ▸ Ops Consulting

  3. DON’T BURN OUT THE NIGHT DAVE DASH

  4. LET’S MAKE ON CALL GREAT AGAIN

  5. ON CALL AS A SYSTEM

  6. THE SYSTEM IS PEOPLE!

  7. ON CALL AS A SYSTEM

  8. PUT SOMEONE IN CHARGE

  9. TRACK METRICS

  10. FIX THINGS

  11. DON’T EVER CHANGE

  18. GOOD INTENTIONS DON’T SCALE ▸ puppet running on 1000s of machines ▸ response times for web are slow ▸ we have enough servers ▸ the graphs… most CPUs are pegged at 100% ▸ 20% are idle ▸ NGINX isn’t running

  24. GOOD INTENTIONS DON’T SCALE ▸ NGINX was restarted on these machines (OOM?) ▸ NGINX had a bad config ▸ New config came from puppet ▸ Last puppet checkin? 6:30pm ▸ Nodes started slowly failing after people went home

  31. GOOD INTENTIONS DON’T SCALE ▸ Possible solutions ▸ More tests for puppet ▸ Better puppet code reviews ▸ Better alerting on puppet ▸ logical solution ▸ Stop running puppet at night

  32. DON’T EVER CHANGE

  33. CURB HEROISM

  34. SOCIAL NORMS

  35. MARKET NORMS

  36. None
  37. None
  38. MARKET NORMS

  39. TRAINING

  40. ON CALL Operations

  41. None
  42. None
  43. TRAINING

  44. SCHEDULES

  45. EMPATHY

  46. PROACTIVE

  54. TL;DR ▸ On call is a real service, give it an owner ▸ Don’t do things at night unless you have to ▸ Try paying people ▸ Set expectations about on call and on callers ▸ Have a primary and a secondary ▸ Keep noisy shifts small ▸ Have empathy

  55. LET’S MAKE ON CALL GREAT AGAIN

  56. THANKS DD@DAVEDASH.COM TWITTER EMAIL
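
The "Stop running puppet at night" fix from the incident story, and the TL;DR point "Don’t do things at night unless you have to", could be enforced with a small change-window check that automation consults before applying changes. The sketch below is only an illustration and is not taken from the talk: the function name `in_change_window` and the weekday 10:00 to 16:00 window are assumptions chosen for the example, and how the check is wired into a real puppet setup is left open.

```python
from datetime import datetime, time

# Hypothetical change window (an assumption for this sketch):
# weekdays only, 10:00 to 16:00 local time, so that people are
# still around when a bad change starts to take effect.
WINDOW_START = time(10, 0)
WINDOW_END = time(16, 0)


def in_change_window(now: datetime,
                     start: time = WINDOW_START,
                     end: time = WINDOW_END) -> bool:
    """Return True if `now` is a weekday and falls inside [start, end)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and start <= now.time() < end


if __name__ == "__main__":
    # A wrapper script could exit early here instead of printing.
    if in_change_window(datetime.now()):
        print("inside change window: ok to apply config changes")
    else:
        print("outside change window: skipping automated changes")
```

With this window, a change checked in at 6:30pm like the one in the story would not roll out until the next working day, when someone is present to notice a failing NGINX config.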