Don't Burn Out the Night (SREcon 2016)

Dave Dash
April 07, 2016

It’s easy for engineers to overcommit, especially when it comes to On Call. Let’s build an On Call service that keeps people fresh and in it for the long haul.

In this session, Dave Dash will share lessons learned from Mozilla, Pinterest, and operations consulting on implementing a humane On Call service.

Transcript

  1. DAVE DASH ▸ Software engineer: del.icio.us, Mozilla ▸ Formerly early operations engineer at Pinterest ▸ Ops Consulting
  2. GOOD INTENTIONS DON’T SCALE ▸ puppet running on 1000s of machines ▸ response times for web are slow
  3. GOOD INTENTIONS DON’T SCALE ▸ puppet running on 1000s of machines ▸ response times for web are slow ▸ we have enough servers
  4. GOOD INTENTIONS DON’T SCALE ▸ puppet running on 1000s of machines ▸ response times for web are slow ▸ we have enough servers ▸ the graphs… most CPUs are pegged at 100%
  5. GOOD INTENTIONS DON’T SCALE ▸ puppet running on 1000s of machines ▸ response times for web are slow ▸ we have enough servers ▸ the graphs… most CPUs are pegged at 100% ▸ 20% are idle
  6. GOOD INTENTIONS DON’T SCALE ▸ puppet running on 1000s of machines ▸ response times for web are slow ▸ we have enough servers ▸ the graphs… most CPUs are pegged at 100% ▸ 20% are idle ▸ NGINX isn’t running
  7. GOOD INTENTIONS DON’T SCALE ▸ NGINX was restarted on these machines (OOM?) ▸ NGINX had a bad config
  8. GOOD INTENTIONS DON’T SCALE ▸ NGINX was restarted on these machines (OOM?) ▸ NGINX had a bad config ▸ New config came from puppet
  9. GOOD INTENTIONS DON’T SCALE ▸ NGINX was restarted on these machines (OOM?) ▸ NGINX had a bad config ▸ New config came from puppet ▸ Last puppet checkin? 6:30pm
  10. GOOD INTENTIONS DON’T SCALE ▸ NGINX was restarted on these machines (OOM?) ▸ NGINX had a bad config ▸ New config came from puppet ▸ Last puppet checkin? 6:30pm ▸ Nodes started slowly failing after people went home
  11. GOOD INTENTIONS DON’T SCALE ▸ Possible solutions ▸ More tests for puppet ▸ Better puppet code reviews
  12. GOOD INTENTIONS DON’T SCALE ▸ Possible solutions ▸ More tests for puppet ▸ Better puppet code reviews ▸ Better alerting on puppet
  13. GOOD INTENTIONS DON’T SCALE ▸ Possible solutions ▸ More tests for puppet ▸ Better puppet code reviews ▸ Better alerting on puppet ▸ logical solution
  14. GOOD INTENTIONS DON’T SCALE ▸ Possible solutions ▸ More tests for puppet ▸ Better puppet code reviews ▸ Better alerting on puppet ▸ logical solution ▸ Stop running puppet at night
  15. TL;DR ▸ On call is a real service, give it an owner ▸ Don’t do things at night unless you have to
  16. TL;DR ▸ On call is a real service, give it an owner ▸ Don’t do things at night unless you have to ▸ Try paying people
  17. TL;DR ▸ On call is a real service, give it an owner ▸ Don’t do things at night unless you have to ▸ Try paying people ▸ Set expectations about on call and on callers
  18. TL;DR ▸ On call is a real service, give it an owner ▸ Don’t do things at night unless you have to ▸ Try paying people ▸ Set expectations about on call and on callers ▸ Have a primary and a secondary
  19. TL;DR ▸ On call is a real service, give it an owner ▸ Don’t do things at night unless you have to ▸ Try paying people ▸ Set expectations about on call and on callers ▸ Have a primary and a secondary ▸ Keep noisy shifts small
  20. TL;DR ▸ On call is a real service, give it an owner ▸ Don’t do things at night unless you have to ▸ Try paying people ▸ Set expectations about on call and on callers ▸ Have a primary and a secondary ▸ Keep noisy shifts small ▸ Have empathy
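
The slides' "logical solution" (stop running puppet at night) and the TL;DR item "Don't do things at night unless you have to" can be illustrated with a small change-window check. This is only a sketch under stated assumptions, not the setup described in the talk: the window hours, the weekday-only rule, and the names `in_change_window` and `should_apply_config` are all hypothetical.

```python
from datetime import datetime, time

# Hypothetical change window: automated config runs are allowed only
# while people are around to notice a bad change. Hours are assumptions.
WINDOW_START = time(9, 0)
WINDOW_END = time(17, 0)


def in_change_window(now: datetime) -> bool:
    """Return True if `now` is a weekday inside the change window."""
    is_weekday = now.weekday() < 5  # Monday=0 .. Friday=4
    return is_weekday and WINDOW_START <= now.time() < WINDOW_END


def should_apply_config(now: datetime, emergency: bool = False) -> bool:
    """Allow a run inside the window, or when explicitly marked urgent."""
    return emergency or in_change_window(now)


if __name__ == "__main__":
    # An arbitrary Thursday at 6:30pm, the time of day of the last
    # check-in mentioned in the slides.
    evening = datetime(2016, 4, 7, 18, 30)
    print(should_apply_config(evening))                   # False
    print(should_apply_config(evening, emergency=True))   # True
```

A scheduler or wrapper script could consult `should_apply_config` before starting a configuration run, so a bad change lands while someone is still at their desk. The `emergency` flag keeps the "unless you have to" escape hatch explicit.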