DevOpsDaysPortugal 2019 - Pedro Torres - Unicorn on-call

Come and hear the story behind a unicorn’s on-call program. Almost three years ago we didn’t have a structured program to support on-call engineers… and as the engineering team scaled it became clear that we needed something in place. We went from zero to having software engineers on-call, and now we even have a 2.0 version of the program. Join me on a trip through time to get a sense of what it was like to create an on-call program from scratch, and of the pitfalls we faced.

DevOpsDaysPortugal

June 04, 2019

Transcript

  1. Hi there! I’m Pedro! • Engineering Director @ • Impact-driven person • Passionate about People, Technology, and Products • Agile, Lean and DevOps aficionado • 10+ years of experience running engineering teams
  2. On-call :: Definition (of a person) able to be contacted in order to provide a professional service if necessary, but not formally on duty. ‘The team is on call 24 hours-a-day, and is trained in resuscitation techniques and how to use life-saving defibrillators.’ ‘If you work in a global organization, you might be on call 24 hours a day for troubleshooting or consulting.’ ‘You have to get up in the middle of the night if you're on call.’
  3. Tool Age • Tons of alarms • False positives (Broken windows theory https://en.wikipedia.org/wiki/Broken_windows_theory) • MTTA not tracked • MTTR “over 9000” • All systems were on-call (Because none was… so all of them were)
  4. Tool Age • (Not so) Cool Stats • Alarms triggered: 2356 • Days with alarms triggered: 279 • MTTA: 456 seconds / ~7.6 minutes • MTTR: 1029 seconds / ~17 minutes • PMs written: 75 • Uptime: 99.919%
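The slides do not say how MTTA and MTTR were measured. As a minimal sketch, assuming both are plain averages taken from the moment an alarm triggers, they could be computed from incident records like this. The `Incident` record and its field names are hypothetical, and the sample timestamps are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions, not from the talk.
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta_seconds(incidents):
    """Mean time to acknowledge: average of (acknowledged - triggered)."""
    return mean((i.acknowledged_at - i.triggered_at).total_seconds() for i in incidents)


def mttr_seconds(incidents):
    """Mean time to resolve: average of (resolved - triggered).

    Assumption: the clock starts at the trigger, not at the acknowledgement.
    """
    return mean((i.resolved_at - i.triggered_at).total_seconds() for i in incidents)


# Made-up sample data: acknowledged after 2 and 4 minutes, resolved after 10 and 20.
incidents = [
    Incident(datetime(2019, 1, 1, 3, 0), datetime(2019, 1, 1, 3, 2), datetime(2019, 1, 1, 3, 10)),
    Incident(datetime(2019, 1, 2, 14, 0), datetime(2019, 1, 2, 14, 4), datetime(2019, 1, 2, 14, 20)),
]
print(mtta_seconds(incidents))  # 180.0
print(mttr_seconds(incidents))  # 900.0
```

Some teams measure MTTR from the acknowledgement instead; whichever definition is chosen, it needs to stay fixed for year-over-year comparisons to mean anything.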
  5. Bronze Age • We evaluated 3 scenarios: “Primary / Secondary”, “Just primary” and “Primary / Secondary (SRE)” • SRE team covering own rota (infra one) -> We rebranded the Ops team to SRE team • Development teams with rotas (dedicated to their systems) • One engineer per rota (no secondaries) • Engineers on-call (eat your own dog food: you develop it… you maintain it in PROD!)
  6. Bronze Age • Tools: One hotspot per rota (no smartphones so that we don’t make people carry two devices) + VictorOps App
  7. Bronze Age • One-week rotas (four rotas in total) • The rotas start / end every Tuesday (i.e. End-of-Sprint day), aligning the rota calendar with the sprint calendar
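The Tuesday-to-Tuesday handover can be sketched as a small schedule generator. This is an illustration only: the engineer names and the round-robin order are assumptions, and the talk does not describe how shifts were actually assigned.

```python
from datetime import date, timedelta
from itertools import cycle

TUESDAY = 1  # date.weekday(): Monday is 0, Tuesday is 1


def next_tuesday(day):
    """Return `day` itself if it is a Tuesday, otherwise the following Tuesday."""
    return day + timedelta(days=(TUESDAY - day.weekday()) % 7)


def weekly_shifts(engineers, start, weeks):
    """Yield (engineer, shift_start, shift_end) for consecutive one-week shifts.

    Engineers are assigned round-robin (an assumption for this sketch).
    """
    shift_start = next_tuesday(start)
    for engineer, _ in zip(cycle(engineers), range(weeks)):
        shift_end = shift_start + timedelta(weeks=1)
        yield engineer, shift_start, shift_end
        shift_start = shift_end


# Hypothetical rota with three volunteers, planned from a Wednesday.
for engineer, begins, ends in weekly_shifts(["ana", "bruno", "carla"], date(2019, 6, 5), 4):
    print(engineer, begins.isoformat(), ends.isoformat())
```

Each shift ends on the same day the next one begins, so the handover always falls on the end-of-sprint day.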
  8. Bronze Age • Only critical systems covered by the program (defined by Engineering and agreed with stakeholders (e.g. Product, Customer Services, Support))
  9. Bronze Age • Incident commander defined - The Incident Commander (IC) holds the high-level state about the incident. They structure the incident response task force, assigning responsibilities according to need and priority
  10. Bronze Age • Alarms fine-tuned • Defined time to Ack under 5 minutes • Redefined thresholds • Distinguished Alarms from Notifications: The alarm requires immediate action. The notification can wait for the next day or so • Cleaned up alarms from non-production environments
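The rules on this slide amount to a small routing decision plus an acknowledgement target. A minimal sketch, assuming each incoming alert carries its environment and a flag saying whether it needs immediate action; the `Alert` fields and the `route` labels are hypothetical and not taken from any real alerting tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ACK_TARGET = timedelta(minutes=5)  # "time to Ack under 5 minutes" from the slide


@dataclass
class Alert:
    # Hypothetical fields for illustration; not the schema of any real tool.
    environment: str               # e.g. "production", "staging"
    needs_immediate_action: bool


def route(alert):
    """Return "drop", "notification" or "alarm" for an incoming alert."""
    if alert.environment != "production":
        return "drop"              # non-production alarms were cleaned up
    if alert.needs_immediate_action:
        return "alarm"             # pages the on-call engineer
    return "notification"          # can wait for the next day or so


def ack_within_target(triggered_at, acknowledged_at):
    """True if the alarm was acknowledged in under five minutes."""
    return acknowledged_at - triggered_at < ACK_TARGET
```

Keeping the routing rule this explicit makes it easy to review which alerts are allowed to wake someone up.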
  11. Bronze Age • Volunteer-based, not compulsory (Yeah… we ran into “trouble” and I went on-call because of that: eat your own dog food… lead by example… I took 4 consecutive weeks on-call)
  12. Bronze Age • Acacio’s list when joining the program (origin: internal meetup with Acacio Cruz -> Google SRE and co-author of the Google SRE book)
  13. Bronze Age • Little time to work on the resiliency of systems (hard to prioritize and hard to complete action points from PMs during sprints)
  14. Bronze Age • On-call procedure • Updating the company’s status page • Keeping the organization/stakeholders informed of the incident status every 5 minutes
  15. Bronze Age • Performance reviews completely disassociated from the on-call program (no one gets a worse review for not participating in the program)
  16. Bronze Age • Although we have offices in different time zones, we didn’t use a “follow the sun” strategy (lack of engineers in the US)
  17. Bronze Age • P0s are all hands on deck and we are “entitled” to call all engineers that can help • Panic button on Slack with Zapier integration
  18. Bronze Age • Cool Stats • Alarms triggered: 1583 (-33%) • Days with alarms triggered: 292 (+5%) • MTTA: 41 seconds (-91%) • MTTR: 424 seconds / ~7 minutes (-59%) • PMs written: 123 (+64%) • Uptime: 99.983% (+0.064%)
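The deltas in parentheses can be checked against the Tool Age figures, and the uptime percentages can be translated into minutes of downtime. The sketch below assumes each set of stats covers a 365-day period, which the slides do not state; the helper names are illustrative.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


def yearly_downtime_minutes(uptime_percent, days=365):
    """Minutes of downtime implied by an uptime percentage over `days` days."""
    return (100 - uptime_percent) / 100 * days * 24 * 60


# Tool Age -> Bronze Age, using the numbers from the slides.
print(round(percent_change(2356, 1583)))  # -33  (alarms triggered)
print(round(percent_change(456, 41)))     # -91  (MTTA)
print(round(percent_change(1029, 424)))   # -59  (MTTR)
print(round(percent_change(75, 123)))     # 64   (PMs written)

# Downtime implied by each uptime figure, assuming a full year.
print(round(yearly_downtime_minutes(99.919)))  # 426 minutes, roughly 7 hours
print(round(yearly_downtime_minutes(99.983)))  # 89 minutes
```

Note that the uptime delta on the slide (+0.064%) is a difference in percentage points, not a relative change like the other figures.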
  19. Iron Age • (Really) Cool Stats • Alarms triggered: 454 (-14%) • Days with alarms triggered: 100 (+3%) • MTTA: 32 seconds (-21%) • MTTR: 374 seconds / ~6 minutes (-12%) • PMs written: 34 (-17%) • Uptime: 99.996% (+0.013%)
  20. Final thoughts • Although the engineers are being paid to be on-call… don’t forget that they are doing us a favor!
  21. Final thoughts • The Google SRE book is a great inspiration (and reading the entire book is a herculean task… 552 pages!)
  22. Final thoughts • Burnout is a real thing… it affects performance and churn… but most importantly… health!
  23. Final thoughts • Don’t make rushed decisions because you are getting too many alerts (e.g. turning off alarms)
  24. Final thoughts • Take advantage of the business hours (when you have the entire engineering team at the office) to tackle issues that might come up during out-of-business hours (when you “only” have the on-call engineers available)
  25. Final thoughts • Being on-call doesn’t mean that you need to save the world. We don’t need “Rambos”… so play it safe, stick to the playbooks and don’t make risky decisions under stress
  26. Final thoughts • Don’t hesitate to jump into a (video) call to coordinate the incident resolution (usually Slack is not enough) – sync vs async comms
  27. Final thoughts • Don’t forget to keep the stakeholders in the loop (we are in the heat zone… but they are suffering on the sidelines… and they need to know what is happening)
  28. Final thoughts • Action items from (blameless) post-mortems should be tracked to ensure that they are executed
  29. Final thoughts • Don’t fall into the wishful thinking game: if you believe/suspect that an alarm is triggered by something harmless that you “can’t control” (e.g. network glitch)… be ready to prove that… otherwise don’t stop investigating the root cause
  30. Final thoughts • Always write PMs (for PEs and PIs) and bear in mind that you should have public versions of the PM (sooner or later your customers will ask for them)