Don't Burn Out the Night (SREcon 2016)

Dave Dash
April 07, 2016

It’s easy for engineers to overcommit, especially when it comes to On Call. Let’s build an On Call service that keeps people fresh and in it for the long haul.

In this session, Dave Dash will share lessons learned from Mozilla, Pinterest, and operations consulting on implementing a humane On Call service.

Transcript

  1. DON’T BURN OUT THE NIGHT
    DAVE DASH

  2. DAVE DASH
    ▸ Software engineer: del.icio.us, Mozilla
    ▸ Formerly early operations engineer at Pinterest
    ▸ Ops Consulting

  3. DON’T BURN OUT THE NIGHT
    DAVE DASH

  4. LET’S MAKE ON CALL GREAT AGAIN

  5. ON CALL AS A SYSTEM

  6. THE SYSTEM IS PEOPLE!

  7. ON CALL AS A SYSTEM

  8. PUT SOMEONE IN CHARGE

  9. TRACK METRICS

  10. FIX THINGS

  11. DON’T EVER CHANGE

  18. GOOD INTENTIONS DON’T SCALE
    ▸ puppet running on 1000s of machines
    ▸ response times for web are slow
    ▸ we have enough servers
    ▸ the graphs… most CPUs are pegged at 100%
    ▸ 20% are idle
    ▸ NGINX isn’t running

  24. GOOD INTENTIONS DON’T SCALE
    ▸ NGINX was restarted on these machines (OOM?)
    ▸ NGINX had a bad config
    ▸ New config came from puppet
    ▸ Last puppet checkin? 6:30pm
    ▸ Nodes started slowly failing after people went home

  31. GOOD INTENTIONS DON’T SCALE
    ▸ Possible solutions
    ▸ More tests for puppet
    ▸ Better puppet code reviews
    ▸ Better alerting on puppet
    ▸ logical solution
    ▸ Stop running puppet at night
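
The "stop running puppet at night" idea above can be sketched as a simple time-window guard that a scheduled config run consults before doing anything. This is a minimal illustration, not the setup described in the talk: the window hours, the weekday-only rule, and the function names `in_change_window` and `should_apply_config` are all assumptions.

```python
from datetime import datetime, time

# Hypothetical change window: weekdays, 10:00 to 16:00 local time.
# These hours are an assumption for illustration, not from the talk.
WINDOW_START = time(10, 0)
WINDOW_END = time(16, 0)


def in_change_window(now: datetime) -> bool:
    """Return True if automated config changes may be applied at `now`."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and WINDOW_START <= now.time() < WINDOW_END


def should_apply_config(now: datetime) -> str:
    """Decide whether a scheduled config run should proceed or be skipped."""
    return "apply" if in_change_window(now) else "skip"


if __name__ == "__main__":
    print(should_apply_config(datetime.now()))
```

With a guard like this, a change checked in at 6:30pm would wait until people are back at their desks, so a bad config fails in front of the people who can fix it rather than paging someone overnight.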

  32. DON’T EVER CHANGE

  33. CURB HEROISM

  34. SOCIAL NORMS

  35. MARKET NORMS

  38. MARKET NORMS

  39. TRAINING

  40. ON CALL
    Operations

  43. TRAINING

  44. SCHEDULES

  45. EMPATHY

  46. PROACTIVE

  54. TL;DR
    ▸ On call is a real service, give it an owner
    ▸ Don’t do things at night unless you have to
    ▸ Try paying people
    ▸ Set expectations about on call and on callers
    ▸ Have a primary and a secondary
    ▸ Keep noisy shifts small
    ▸ Have empathy
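
The "have a primary and a secondary" point above can be sketched as a small rotation builder. This is one possible scheme under stated assumptions, not the schedule used in the talk: the function name `rotation`, the example names, and the rule that each shift's secondary becomes the next shift's primary are all hypothetical.

```python
def rotation(people, shifts):
    """Return `shifts` (primary, secondary) pairs from a list of names.

    Assumed rule: each shift's secondary is the next shift's primary,
    so nobody is primary twice in a row when there are at least two people.
    """
    if len(people) < 2:
        raise ValueError("need at least two people for a primary and a secondary")
    n = len(people)
    return [(people[i % n], people[(i + 1) % n]) for i in range(shifts)]


if __name__ == "__main__":
    for primary, secondary in rotation(["ana", "ben", "cy"], 4):
        print(f"primary={primary} secondary={secondary}")
```

Shadowing as secondary before taking primary gives the incoming person context on whatever was noisy during the previous shift.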

  55. LET’S MAKE ON CALL GREAT AGAIN

  56. THANKS
    [email protected]
    TWITTER
    EMAIL