This is the presentation I gave at the London Underground meetup on on-call fundamentals and some good practices that will help teams that are considering putting their engineers on-call.
Francis. (2015, August 4). How does being 'on-call' impact employee fatigue?. ScienceDaily. Retrieved April 5, 2018 from www.sciencedaily.com/releases/2015/08/150804074052.htm
one or more alerts - Could have been prevented - Different levels of priority Incident A warning about something happening - Main informational unit of an incident - Does not have to be negative - Does not have to notify people - Different levels of priority
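A minimal sketch of how the two terms on this slide could relate as data, assuming an incident groups one or more alerts. The `Priority` scale, the field names, and the rule that an incident takes the highest priority among its alerts are all hypothetical illustrations, not something the slides specify.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Priority(IntEnum):
    # Hypothetical scale; the slide only says priorities differ.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Alert:
    message: str
    priority: Priority
    negative: bool = True   # an alert does not have to be negative
    notify: bool = False    # an alert does not have to notify people


@dataclass
class Incident:
    title: str
    alerts: List[Alert] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumption: an incident always carries at least one alert.
        if not self.alerts:
            raise ValueError("an incident needs at least one alert")

    @property
    def priority(self) -> Priority:
        # Assumption: the incident inherits its highest alert priority.
        return max(alert.priority for alert in self.alerts)

    def alerts_to_notify(self) -> List[Alert]:
        # Only alerts flagged with notify=True should page someone.
        return [alert for alert in self.alerts if alert.notify]
```

For example, an incident built from a low-priority informational alert and a high-priority paging alert would report `Priority.HIGH`, and only the second alert would be returned by `alerts_to_notify()`.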
or Operations engineers - Everyone involved in developing and operating software - Disclaimer: these can vary with what your company does, the level of abstraction at which it uses hardware or software, and the size of the company.
or SysAdmins are on-call because they “run” those services - Diagram: Alert received, SysAdmin/Ops Engineer, SysAdmin/Ops Engineer (if lucky, there is a senior backup)
development teams - Often means low-quality code that is less observable and fails more often in production - Increased MTTR: Ops doesn’t know the internals - Burn-out for Ops people: they are on-call too often - Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png
established services - Diagram: SRE (Site Reliability Engineers), Developers, SRE - Google requires development teams to run their own services if those systems aren’t stable.
their services, but have SREs working alongside them (usually “embedded” within the team) Airbnb discovered that having a separate operations team “creates a divide and simply doesn’t scale.” SRE (Site Reliability Engineers)
and operational tasks associated with their services. Amazon’s cultural emphasis on “ownership”: you don’t “own” the code you write, Amazon says, unless you run and maintain it, too.
key - Determine whether your people will have additional development duties during on-call - If not, what will they do when no alerts are coming their way? - If they worked on an incident all night, can they take the next day off or come in late? - Put it in writing and make it available to everyone: don’t just say it, mean it!
with you during your shift - Call for help whenever you are not sure what to do - Determine the severity of the incident - If necessary, bring the responsible team / additional team members together and hand over command to the incident commander - Hand over the incident properly if your on-call shift is over
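The escalation and hand-over steps on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Severity` scale, the `MAJOR` escalation threshold, and every name below are hypothetical, and a real team would define its own rules.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Severity(IntEnum):
    # Hypothetical scale; pick levels that match your own policy.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class IncidentState:
    summary: str
    severity: Severity
    on_call: str      # engineer currently on shift
    commander: str    # whoever is in command of the incident
    responders: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


def escalate_if_needed(state: IncidentState, incident_commander: str,
                       extra_responders: List[str],
                       threshold: Severity = Severity.MAJOR) -> IncidentState:
    """Bring more people in and hand over command when severity is high."""
    if state.severity >= threshold:
        for person in extra_responders:
            if person not in state.responders:
                state.responders.append(person)
        state.log.append(
            f"command handed from {state.commander} to {incident_commander}")
        state.commander = incident_commander
    return state


def hand_over(state: IncidentState, next_on_call: str,
              status_note: str) -> IncidentState:
    """Record a written hand-over when the on-call shift ends."""
    state.log.append(f"hand-over to {next_on_call}: {status_note}")
    # If the outgoing engineer was still in command, command moves too.
    if state.commander == state.on_call:
        state.commander = next_on_call
    state.on_call = next_on_call
    return state
```

Keeping a `log` entry for each transfer mirrors the advice to hand over properly: the next person can read what happened instead of relying on memory.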