On-Call Fundamentals and Good Practices

This is the presentation I gave at the London Underground meetup on on-call fundamentals and some good practices that will help teams that are considering putting their engineers on-call.

Serhat Can

April 24, 2018

Transcript

  1. About me • Ex-Software Engineer and Technical Evangelist at • Still on-call • Co-organizer ◦ DevOpsDays İstanbul ◦ DevOps Turkey Meetup ◦ Serverless Turkey Meetup • @srhtcn on Twitter
  2. 1 in 5 employees in the EU are on-call. Source: Taylor & Francis. (2015, August 4). How does being 'on-call' impact employee fatigue? ScienceDaily. Retrieved April 5, 2018 from www.sciencedaily.com/releases/2015/08/150804074052.htm
  3. On-call in IT - NOC (Network Operations Center) - Operations engineers, DevOps, SRE (whatever you call them) - Developers - Customer success - Sales - Finance
  4. The impact of downtime & performance degradation - Direct revenue loss - Unhappy users - Loss of credibility (*The Ponemon Institute & Emerson Network Power)
  5. Incident: An event that needs to be addressed - Combines one or more alerts - Could have been prevented - Different levels of priority. Alert: A warning about something happening - Main informational unit of an incident - Does not have to be negative - Does not have to notify people - Different levels of priority
  6. Incident > Alert > Notification - An incident contains one or more alerts - Alerts go out as notifications: SMS, mobile push, email, voice - Tools create alerts
  7. On-call types - Outsourced or dedicated on-call teams - SysAdmins or Operations engineers - Everyone involved in developing and operating software. Disclaimer: these can change based on what your company does, the level of abstraction at which it uses hardware or software, and the company size.
  8. Outsourced or dedicated on-call. Advantages: - Enforce SLAs - Specialized task-force. Disadvantages: - Prone to human error - No knowledge of the internals of the system - Increased MTTR - Hard to measure, hence hard to improve
  9. SysAdmins & Operations Engineers - Pre-DevOps - Only Operations engineers or SysAdmins are on-call because they “run” those services - Alert received by a SysAdmin/Ops Engineer; if lucky, there is a senior SysAdmin/Ops Engineer as backup
  10. Problems with putting “only” Ops on-call - Lack of ownership for development teams: often means low-quality code, less observability, and more failures in production - Increased MTTR: Ops doesn’t know the internals - Burn-out for Ops people: they are on-call too often. Image source: https://pre00.deviantart.net/b620/th/pre/f/2015/144/8/1/sad_pikachu_by_bekkistevenson-d8ujru5.png
  11. On-call at Google - For new (not so reliable) services: Developers - For established services: SRE (Site Reliability Engineers) - Google requires development teams to run their own services if those systems aren’t stable.
  12. On-call at Airbnb, Pinterest, New Relic - Developers are on-call for their services, but have SREs (Site Reliability Engineers) working alongside them (usually “embedded” within the team) - Airbnb discovered that having a separate operations team “creates a divide and simply doesn’t scale.”
  13. On-call at Datadog, DigitalOcean, and Dropbox - Operations and Developers - DigitalOcean: development teams are on-call for their services, while operations teams are on-call for the interactions between the services.
  14. On-call at AWS - Developers are responsible for all development and operational tasks associated with their services - Amazon’s cultural emphasis on “ownership”: you don’t “own” the code you write, Amazon says, unless you run and maintain it, too.
  15. Be clear on what you expect from on-call - Transparency is key - Determine whether your people will have additional development duties during on-call - If not, what will they do when no alerts are coming their way? - If they worked on an incident the whole night, can they take the next day off or come in late? - Make it written and available to everyone - don’t just say it, mean it!
  16. Important responsibilities of on-call - Be reachable and have your laptop with you during your shift - Call for help whenever you are not sure what to do - Determine the severity of the incident - If necessary, bring the responsible team / additional team members together and hand over the command to the incident commander - Hand over the incident properly if your on-call shift is over
  17. KPIs of on-call - Number of alerts received - Priority levels of those alerts - Alerts per status (open, acked, closed) - MTTA: Mean Time to Acknowledge - MTTR: Mean Time to Resolve
  18. Minimum requirements • Reliable alerting • Multiple notification channels • Personal notification preferences • Alert fatigue prevention • Advanced collaboration/communication tools • Monitoring • Metrics and dashboards
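
The KPIs listed on slide 17 can be derived from a handful of timestamps per alert. The following is a minimal sketch, not taken from the deck: the `Alert` record, its field names, the `on_call_kpis` helper, and the "P1"/"P3" priority labels are all hypothetical, and it assumes MTTA is measured from alert creation to acknowledgement and MTTR from creation to close. Alerts that were never acknowledged or closed are left out of the respective mean.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative only."""
    priority: str                          # e.g. "P1" (assumed labels)
    created_at: datetime
    acked_at: Optional[datetime] = None    # None until someone acknowledges
    closed_at: Optional[datetime] = None   # None until the alert is resolved

    @property
    def status(self) -> str:
        # Map timestamps onto the three statuses named on the slide.
        if self.closed_at is not None:
            return "closed"
        if self.acked_at is not None:
            return "acked"
        return "open"


def _mean(deltas) -> Optional[timedelta]:
    """Average a sequence of timedeltas; None when there is nothing to average."""
    deltas = list(deltas)
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


def on_call_kpis(alerts):
    """Compute the slide-17 KPIs for a list of alerts."""
    return {
        "total": len(alerts),
        "by_priority": Counter(a.priority for a in alerts),
        "by_status": Counter(a.status for a in alerts),
        # Assumption: both means are measured from alert creation.
        "mtta": _mean(a.acked_at - a.created_at for a in alerts if a.acked_at),
        "mttr": _mean(a.closed_at - a.created_at for a in alerts if a.closed_at),
    }


# Made-up sample data for illustration.
t0 = datetime(2018, 4, 24, 2, 0)
alerts = [
    Alert("P1", t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=32)),
    Alert("P3", t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=60)),
    Alert("P3", t0),  # still open, excluded from MTTA and MTTR
]
kpis = on_call_kpis(alerts)
print(kpis["mtta"], kpis["mttr"])
```

With this sample data the two acknowledged alerts average to an MTTA of 4 minutes and an MTTR of 46 minutes. Whether open alerts should be excluded or counted up to "now" is a design choice; excluding them keeps the numbers stable but can hide alerts that nobody ever picked up, which is why the per-status counts are reported alongside the means.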