Incident Management - What we have learnt

gestassy
November 18, 2019

Incident management is a very stressful and complex moment for an SRE; handling incidents correctly requires rigorous organisation.

Feedback on what we have learnt and what we experience daily at Synthesio.

We're going to talk about time management, fatigue, blameless culture, prioritization, postmortems...

This is a non-technical presentation, but it is aimed at engineers and technicians, mostly devops / SREs, who can benefit from our feedback to improve their daily organisation.

Transcript

  1. 2.

    Whoami
    Guillaume Estassy, SRE Director @ Synthesio, Paris
    12 years of oncall, 5 years of management of oncall people
    @gestassy
  2. 5.

    SRE: Site Reliability Engineer
    Engineer services for safe and rapid changes, balanced with risks of unavailability, to optimize overall end-user satisfaction.
  3. 6.

    SLI SLO SLA
    SLI: Indicator: internal metric. Identify all key technical business metrics.
    SLO: Objective: internal objective. Get an agreement across Tech and Product.
    SLA: Agreement: client-facing. Get an agreement with Sales, Marketing, Support.
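
To make the three terms concrete, here is a minimal sketch in Python. It assumes an availability-style SLI (ratio of good requests to total requests); the function names and the target values are hypothetical and only illustrate an internal SLO that is stricter than the client-facing SLA.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured indicator, here the ratio of good requests."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_target(sli: float, target: float) -> bool:
    """True when the measured indicator reaches the given target."""
    return sli >= target


# Hypothetical targets: the internal objective is stricter than the
# client-facing agreement, so the team reacts before clients are affected.
SLO_TARGET = 0.999  # internal objective
SLA_TARGET = 0.995  # client-facing agreement

sli = availability_sli(good_requests=9985, total_requests=10000)
slo_ok = meets_target(sli, SLO_TARGET)  # internal objective missed
sla_ok = meets_target(sli, SLA_TARGET)  # client-facing agreement still met
```

In this example the SLO is missed while the SLA still holds, which is the situation where the team can act before any contractual impact.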
  4. 7.

    Observability
    How to succeed in Reliability
    The word comes from control theory, first theorized, roughly, by James Clerk Maxwell in 1868.
    "Control theory in control systems engineering is a subfield of mathematics that deals with the control of continuously operating dynamical systems in engineered processes and machines. The objective is to develop a control model for controlling such systems using a control action in an optimum manner without delay or overshoot and ensuring control stability."*
    Observability comes with controllability, for retroaction (feedback) on the system to ensure stability.
    * Source: https://en.wikipedia.org/wiki/Control_theory
  5. 8.

    Observability
    How to succeed in Reliability
    • Logs
    • Metrics
    • Tracing
    • Visualization
    • Alerting
    Switch on the light
    Set causality relations between items
    Prioritize information
    Be factual, prove connections between symptoms and causes
  6. 9.

    Incident
    It’s an important situation that requires immediate human intervention, with a lot of unknowns: managing an unexpected event, occurring at an unexpected time, whose resolution can take an unexpected duration.
  7. 10.

    Incident
    Risks
    • Business: client churn
    • Technical: never ever lose data
    • Human: fatigue is a real matter
    => Very stressful situation
  8. 12.

    Incident management
    Objective
    • Maximize productivity of the team
      ◦ Maximize engineering work
      ◦ Lower the proportion of operational/toil work, including oncall
  9. 13.

    Incident management
    Objectives for oncall people
    • Sleep, eat, have a social life as much as anyone not oncall
      ◦ Have a stable platform
      ◦ Stress level has to be lowered as much as we can during oncall operations
      ◦ Recovery time from stress and fatigue has to be part of the work organization
  10. 14.

    Incident management
    Human
    • Human processing is very slow (around 60 bits/sec*), expensive, and limited
    • Context switching can kill up to 40% (80% in the worst cases) of productive time**
    • Humans can become very inefficient, even counterproductive, under stress
    • So we have to take care of our most complex systems: humans
    • During oncall, it is essential to limit unnecessary human processing and context switching
    * https://worldmentalcalculation.com/2019/06/30/fastest-possible-processing-speed-of-the-human-brain/
    ** https://www.apa.org/research/action/multitask and https://blog.rescuetime.com/context-switching/
  11. 15.

    Incident management
    Empathy
    • Managing an oncall team means managing part of their private lives: when they can’t go to the cinema, when they risk being woken up
    • Interactions within the team have to be sane and clean
      ◦ At some point these interactions will occur during personal time (override, escalation, collaboration)
    This can only work by developing empathy
  12. 16.

    Incident management
    Steps
    - Receive an alert
    - Evaluate/Prioritize
    - Communicate
    - Manage Time
    - Manage Fatigue
    - Quick fix
    - Debrief
    - Report
    - Improve
    (Timeline labels: incident time, standard business hours)
  13. 17.

    Incident management
    Receive Alert
    • Silence / snooze known alerts and further incoming related alerts, to avoid interruptions and limit noise.
  14. 18.

    Incident management
    Evaluate / prioritize
    • Do I really have to wake up (yes, too late) and resolve this now?
    • If I’m already working on another alert, which one should I prioritize?
    • What’s the business impact / risk?
    • Is data at risk?
    • Should I escalate?
  15. 19.

    Incident management
    Communicate
    • Several levels of communication:
      ◦ Public / client-facing
      ◦ Internal / company
      ◦ Internal / tech team
      ◦ External / provider
    • Easier when the organization has a specific team (support) for communication to clients, so SRE can concentrate on internal communication
  16. 20.

    Incident management
    Communicate
    • Communicate as broadly as you can
      “Issue detected on a component so this function is slower than usual and we’ll have to estimate if data has to be restored, the team is on it, no ETA, next status in 30min”
    • The message should contain:
      ◦ what you know once a first impact estimation has been done, but also the things you don’t know.
      ◦ an estimated time of resolution, or a set recurrence for the messages.
    • It is not easy to communicate bad news, but it is very important as it buys you uninterrupted time to fix the problem: your well-educated colleagues will wait till your next message without interrupting you with questions
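
The message contents listed on this slide can be sketched as a small Python helper. This is only an illustration: the function name, its parameters, and the output layout are hypothetical, and timestamps are assumed to be in UTC. It always states what is known and unknown, and falls back to announcing the next status time when no ETA is available.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_status_message(known: str, unknown: str, now: datetime,
                         eta: Optional[datetime] = None,
                         next_update_minutes: int = 30) -> str:
    """Build a status update: what we know, what we don't, ETA or next update."""
    parts = [f"[{now:%Y-%m-%d %H:%M} UTC]",
             f"Known: {known}.",
             f"Unknown: {unknown}."]
    if eta is not None:
        parts.append(f"Estimated resolution: {eta:%H:%M} UTC.")
    else:
        # No ETA: commit to a recurrence instead, so colleagues can wait.
        next_update = now + timedelta(minutes=next_update_minutes)
        parts.append(f"No ETA, next status at {next_update:%H:%M} UTC.")
    return " ".join(parts)


message = build_status_message(
    known="issue detected on a component, this function is slower than usual",
    unknown="whether data has to be restored",
    now=datetime(2019, 11, 18, 2, 0, tzinfo=timezone.utc),
)
```

Committing to a next status time is what buys the uninterrupted time mentioned above, so the helper never produces a message without either an ETA or a next update.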
  17. 21.

    Incident management
    Communicate
    • Log everything you’re doing:
      ◦ Actions (command lines)
      ◦ Thoughts, doubts, ideas, attempts, first interpretations
      ◦ Links / screenshots of the graphs you observe (i.e. give your sources)
      ◦ If you post a link, always add a comment.
      ◦ A log must include a timestamp
    • A Slack channel is fine, in order to share this for handover or for a later post-mortem
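
As an illustration of these logging rules, here is a minimal Python sketch. The class names and the entry kinds are hypothetical; the only rules it encodes are the ones from the slide: every entry carries a timestamp, and a link is never logged without a comment.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    timestamp: datetime
    kind: str                   # e.g. "action", "thought", "link" (hypothetical kinds)
    text: str                   # the command, the idea, or the comment on a link
    link: Optional[str] = None  # graph or screenshot URL, if any

    def render(self) -> str:
        line = f"{self.timestamp:%H:%M:%S} [{self.kind}] {self.text}"
        return f"{line} ({self.link})" if self.link else line


@dataclass
class IncidentLog:
    entries: List[LogEntry] = field(default_factory=list)

    def add(self, kind: str, text: str, link: Optional[str] = None,
            now: Optional[datetime] = None) -> LogEntry:
        # A bare link without a comment is rejected.
        if not text.strip():
            raise ValueError("every entry, including a link, needs a comment")
        entry = LogEntry(now or datetime.now(timezone.utc), kind, text, link)
        self.entries.append(entry)
        return entry
```

Rendered entries can then be pasted into the incident channel, which keeps the handover and the later post-mortem readable.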
  18. 22.

    Incident management
    Communicate
    • Especially on Slack: write sentences! Subject + verb + complements
    • Writing is needed as an asynchronous means of communication, even in chat mode.
    • We need background and context, for instance for the rest of the team who will read back through what happened during the night
  19. 23.

    Incident management
    Time Management
    • SLAs don’t matter during an incident, besides estimating criticality / escalation needs.
    • It’s too late to wonder if we meet SLAs; now time has to be used in best-effort mode.
    • What’s important: how you progress in implementing a fix as quickly as possible with current means.
    • “As quickly as possible”: are you progressing? Don’t repeat the same actions.
      “Insanity is doing the same thing over and over again and expecting different results.” (not really from Einstein) (valid in science, not in training/learning conditions)
  20. 24.

    Incident management
    Time Management
    • Timebox your work, set a duration for each iteration:
      ◦ 5min to 3h, depending on impact/criticality and ease of solving. Determine what a good timebox can be before the incident.
    • When an iteration ends, stop the operational work and re-evaluate the incident from the beginning: evaluate, communicate, manage time, manage fatigue, imagine another quick fix.
  21. 25.

    Incident management
    Fatigue Management
    Don’t work if it’s not absolutely necessary. Don’t lose sleep time.
    Escalation
    • Escalate if you are blocked for 2 timeboxes: no progress
    • Escalate if you have a doubt you can’t remove by yourself. Sometimes even for the simplest operations.
    • Evaluate your capacity to work at least 2 more timeboxes.
    • If too tired, escalate
      ◦ If there is nobody to escalate to, notify your management and get some rest.
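
The escalation rules on this slide can be written down as a small decision function. This is a sketch, not a tool we claim to run: the `OncallState` fields and the returned reason strings are hypothetical, and the fatigue check relies on an honest self-assessment by the oncall person.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class OncallState:
    timeboxes_without_progress: int    # consecutive timeboxes with no progress
    has_unresolved_doubt: bool         # a doubt you cannot remove by yourself
    can_work_two_more_timeboxes: bool  # self-assessed capacity (fatigue)


def should_escalate(state: OncallState) -> Tuple[bool, str]:
    """Apply the escalation rules at the end of each timebox."""
    if state.timeboxes_without_progress >= 2:
        return True, "blocked for 2 timeboxes"
    if state.has_unresolved_doubt:
        return True, "unresolved doubt"
    if not state.can_work_two_more_timeboxes:
        return True, "too tired"
    return False, "keep going until the next timebox review"
```

Running this check at every timebox boundary, rather than continuously, matches the idea of stopping operational work and re-evaluating the incident from the beginning.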
  22. 26.

    Incident management
    Fatigue Management
    Recovery
    • Take time to recover from the previous night, try to sleep.
    • Most of the time, the adrenaline from the incident makes you wake up quite early. But then make sure to take your afternoon off, there’s always an aftereffect.
    Shifts
    • Allow temporary overrides and don’t judge the reason why the oncall person needs this override.
  23. 27.

    Incident management
    Quick fix
    • A quick fix has to be quick; it doesn’t have to be clean, just reliable enough to hold until the team can implement a better solution.
    • We don’t have time to lose:
      ◦ Human fatigue costs
      ◦ SLA costs
      ◦ Engineering costs (doing oncall is not doing engineering)
    • Set causality relations between items
    • Make decisions. If in doubt or hesitating, escalate (it’s a decision too)
  24. 28.

    Incident management
    PostMortem
    This moment is decisive: this is where you decide if the incident weakens or strengthens the platform and the organization
    • Quick fix: buys time: service is up, people can sleep, clients can access the service, pressure is low, allowing you to concentrate on long-term solutions
    • Long-term fix: automation, time gained
    • Repeated quick fix: time lost
  25. 29.

    Incident management
    PostMortem
    There is always a long-term fix, find it! Then ensure its implementation is inserted in the backlog with the correct priority
    • Quickly after the incident, put the people involved in a room, write things down, look for improvements.
    • 5W’s technique to describe the incident.
    • Each incident needs its postmortem, for improvement and also for reporting
    • Log how much time has been spent on the incident
  26. 30.

    Incident management
    PostMortem Culture
    • Mindset: blameless culture: addressing work issues without attacking people.
    • Accept failure, assume people are doing the best they can
    • So you can concentrate on being factual, and look for future improvement
    • Instead of “I should have communicated more, I’m sorry”
    • Say: “We haven’t communicated enough. Next time we’ll communicate more regularly”.
    • As a consequence, accountability comes naturally
  27. 31.

    Incident management
    Reporting
    • We have to report on how difficult oncall is,
      ◦ To be able to justify recruitment, engineering time on the roadmap, empathy about fatigue, the importance of SRE principles for devs
    • Split the platform into business categories to assign post-mortems to, to spot where the time is spent during oncall.
    • It creates a way to prioritize what can be done next for platform stability.
    • When doing the monthly report, analyze incidents and ensure they are solved from one month to the next.
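
A report of this kind can start from something as simple as the following Python sketch, which sums the time logged in post-mortems per business category. The function name, the input shape (category, minutes spent), and the category names in the example are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def oncall_minutes_by_category(
        postmortems: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the minutes spent on incidents per category, largest first."""
    totals: Dict[str, int] = defaultdict(int)
    for category, minutes in postmortems:
        totals[category] += minutes
    # Sorting by time spent shows where stability work would pay off first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up example data, only to show the call.
report = oncall_minutes_by_category([
    ("search", 90),
    ("ingestion", 30),
    ("search", 45),
    ("api", 60),
])
```

Comparing this output from one month to the next is one way to check that incidents in a category are actually being solved.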
  28. 32.

    Incident management
    What we’ve learnt
    • Timebox to ensure progress in resolution
    • Communicate to enable teamwork to solve complex issues
    • Be blameless to address work problems and improve