Incident Management - What we have learnt

Incident Management What we’ve learnt at SRE team

Whoami
Guillaume Estassy, SRE Director @ Synthesio, Paris
• 12 years of oncall
• 5 years of management of oncall people
@gestassy

Synthesio

SRE: Site Reliability Engineer
Engineer services for safe and rapid changes, balanced against the risk of unavailability, to optimize overall end-user satisfaction.

SLI, SLO, SLA
• SLI: Indicator: internal metric. Identify all key technical and business metrics.
• SLO: Objective: internal objective. Get an agreement across all of Tech and Product.
• SLA: Agreement: client-facing. Get an agreement with Sales, Marketing and Support.
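To make the distinction between indicator and objective concrete, here is a minimal sketch. It assumes an availability-style SLI computed as successful requests over total requests; the function names and the numbers in the usage note are illustrative and do not come from the talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Indicator: the measured ratio of successful requests (assumed metric)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """Objective: the internal target the indicator is compared against."""
    return sli >= slo_target
```

For example, 999 good requests out of 1000 give an SLI of 0.999, which meets a hypothetical internal SLO of 0.995. An SLA would typically be a looser, client-facing commitment built on top of such objectives.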

Observability - How to succeed in Reliability
The word comes from Control theory, first theorized around 1868 by James Clerk Maxwell.
"Control theory in control systems engineering is a subfield of mathematics that deals with the control of continuously operating dynamical systems in engineered processes and machines. The objective is to develop a control model for controlling such systems using a control action in an optimum manner without delay or overshoot and ensuring control stability."*
Observability comes with Controllability, for retroaction on the system to ensure stability.
* source: https://en.wikipedia.org/wiki/Control_theory

Observability - How to succeed in Reliability
• Logs
• Metrics
• Tracing
• Visualization
• Alerting
Switch on the light
Set causality relations between items
Prioritize information
Be factual, prove connections between symptoms and causes

Incident
An important situation that requires immediate human intervention, with a lot of unknowns: you manage an unexpected event occurring at an unexpected time, and the resolution can take an unexpected duration.

Incident - Risks
• Business: client churn
• Technical: never, ever lose data
• Human: fatigue is a real concern
=> Very stressful situation

Incident - Oncall stress
Jaime Woo, DigitalOcean, src: https://www.usenix.org/node/218910

Incident management - Objective
• Maximize productivity of the team
  ◦ Maximize engineering work
  ◦ Lower the proportion of operational/toil work, including oncall

Incident management - Objectives for oncall people
• Sleep, eat and have a social life as much as anyone who is not oncall
  ◦ Have a stable platform
  ◦ Stress has to be lowered as much as we can during oncall operations
  ◦ Recovery time from stress and fatigue has to be part of the work organization

Incident management - Human
• Human processing is very slow (around 60 bits/sec*), expensive, and limited
• Context switching can kill up to 40% (80% in the worst cases) of productive time**
• Humans can become very inefficient, even counterproductive, under stress
• So we have to take care of our most complex systems: humans
• During oncall, it is essential to limit unnecessary human processing and context switching
* https://worldmentalcalculation.com/2019/06/30/fastest-possible-processing-speed-of-the-human-brain/
** https://www.apa.org/research/action/multitask and https://blog.rescuetime.com/context-switching/

Incident management - Empathy
• Managing an oncall team means managing a part of their private lives: when they can't go to the cinema, when they risk being woken up
• Interaction within the team has to be sane and clean
  ◦ at some point these interactions will occur during personal time (override, escalation, collaboration)
This can work only by developing empathy

Incident management - Steps
- Receive an alert
- Evaluate/Prioritize
- Communicate
- Manage Time
- Manage Fatigue
- Quick fix
- Debrief
- Report
- Improve
(Timeline labels on the slide: incident time; standard business hours)

Incident management - Receive Alert
• Silence / snooze known alerts and further incoming related alerts, to avoid interruptions and limit noise.

Incident management - Evaluate / prioritize
• Do I really have to wake up (yes, too late) and resolve this now?
• If I am already working on another alert, which one should I prioritize?
• What's the business impact / risk?
• Is data at risk?
• Should I escalate?

Incident management - Communicate
• Several levels of communication:
  ◦ Public / client-facing
  ◦ Internal / company
  ◦ Internal / tech team
  ◦ External / provider
• Easier when the organization has a specific team (support) for communication to clients, so SRE can concentrate on internal communication

Incident management - Communicate
• Communicate as broadly as you can
  “Issue detected on a component, so this function is slower than usual and we'll have to estimate whether data has to be restored. The team is on it, no ETA, next status in 30 min.”
• The message should contain:
  ◦ what you know once a first impact estimation has been done, but also the things you don't know.
  ◦ an estimated time of resolution, or else a set recurrence for the messages.
  ◦ It's not easy to communicate bad news, but it's very important because it buys you uninterrupted time to fix the problem: your well-educated colleagues will wait until your next message without interrupting you with questions
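The message contents listed above can be captured in a small template. This is a minimal sketch, not a tool shown in the talk: the `StatusUpdate` class, its field names and the 30-minute default are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StatusUpdate:
    """Hypothetical container for one incident status message."""
    known: List[str]                                   # what you know so far
    unknown: List[str] = field(default_factory=list)   # what you don't know yet
    eta: Optional[str] = None                          # None means "no ETA"
    next_update_minutes: int = 30                      # recurrence of the messages

    def render(self) -> str:
        parts = ["Known: " + "; ".join(self.known) + "."]
        if self.unknown:
            parts.append("Not yet known: " + "; ".join(self.unknown) + ".")
        parts.append("ETA: " + (self.eta if self.eta else "none yet") + ".")
        parts.append(f"Next status in {self.next_update_minutes} min.")
        return " ".join(parts)
```

Stating the unknowns and the time of the next update explicitly is what buys the uninterrupted time: readers know when to expect more, so they have no reason to ask.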

Incident management - Communicate
• Log everything you're doing:
  ◦ Actions (command lines)
  ◦ Thoughts, doubts, ideas, attempts, first interpretations
  ◦ Links / screenshots of the graphs you observe (i.e. give your sources)
  ◦ If you post a link, always add a comment.
  ◦ A log must include a timestamp
• A Slack channel is fine, so that the log can be shared for handover or for a later post-mortem
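As an illustration of these logging rules, here is a minimal sketch of a helper that formats one log line. The `log_entry` function and its entry kinds are hypothetical; it only enforces two of the rules above: every entry carries a timestamp, and a link is never posted without a comment.

```python
from datetime import datetime, timezone
from typing import Optional


def log_entry(kind: str, text: str, link: Optional[str] = None,
              now: Optional[datetime] = None) -> str:
    """Format one incident log line, e.g. kind='action' or kind='doubt'."""
    if link is not None and not text.strip():
        raise ValueError("a link must always come with a comment")
    # Normalize to UTC so entries from different people sort consistently.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    line = f"[{moment.strftime('%Y-%m-%d %H:%M:%SZ')}] {kind.upper()}: {text}"
    if link:
        line += f" ({link})"
    return line
```

The resulting lines can be pasted into the incident channel as they are, which keeps the handover and the later post-mortem factual.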

Incident management - Communicate
• Especially on Slack: write sentences! Subject + verb + complements
• Writing is needed as an asynchronous way of communicating, even in a chat.
• We need background and context, for instance for the rest of the team, who will read back through what happened during the night

Incident management - Time Management
• The SLA doesn't matter during an incident, except for estimating criticality and escalation needs.
• It's too late to wonder whether we meet SLAs; time now has to be used in best-effort mode.
• What's important: how you progress in implementing a fix as quickly as possible with the current means.
• “As quickly as possible”: are you making progress? Don't repeat the same actions.
“Insanity is doing the same thing over and over again and expecting different results.” (not really from Einstein) (valid in science, not in training/learning conditions)

Incident management - Time Management
• Timebox your work, set a duration for each iteration:
  ◦ 5 min to 3 h, depending on impact/criticality and ease of resolution. Determine what a good timebox is before the incident.
• When an iteration ends, stop the operational work and re-evaluate the incident from the beginning: evaluate, communicate, manage time, manage fatigue, imagine another quick fix.
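The timebox idea can be sketched as a small deadline object plus the checklist to run when it expires. This is an assumption-laden illustration, not the team's tooling: the `Timebox` class and the `REEVALUATION_STEPS` constant are hypothetical names, and the clock is injectable only so the sketch can be exercised without waiting.

```python
import time
from typing import Callable, List

# Steps to go through again when an iteration ends, as listed on the slide.
REEVALUATION_STEPS: List[str] = [
    "evaluate",
    "communicate",
    "manage time",
    "manage fatigue",
    "imagine another quick fix",
]


class Timebox:
    """Hypothetical deadline tracker for one iteration of incident work."""

    def __init__(self, minutes: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if minutes <= 0:
            raise ValueError("a timebox needs a positive duration")
        self._clock = clock
        self._deadline = clock() + minutes * 60

    def expired(self) -> bool:
        return self._clock() >= self._deadline

    def remaining_seconds(self) -> float:
        return max(0.0, self._deadline - self._clock())
```

When `expired()` returns true, the operational work stops and the steps in `REEVALUATION_STEPS` are walked through before a new timebox is started.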

Incident management - Fatigue Management
Don't work if it's not absolutely necessary. Don't lose sleep time.
Escalation
• Escalate if you have been blocked for 2 timeboxes: no progress
• Escalate if you have a doubt you can't resolve by yourself. Sometimes even for the simplest operations.
• Evaluate your capacity to work at least 2 more timeboxes.
• If you are too tired, escalate
  ◦ If there is nobody to escalate to, notify your management and get some rest.
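The escalation rules above amount to a short decision function. The sketch below is only one way to write them down; the function name, parameter names and reason strings are hypothetical.

```python
from typing import List


def escalation_reasons(blocked_timeboxes: int,
                       unresolved_doubt: bool,
                       can_work_two_more_timeboxes: bool) -> List[str]:
    """Return every reason to escalate; an empty list means keep going."""
    reasons: List[str] = []
    if blocked_timeboxes >= 2:
        reasons.append("no progress for 2 timeboxes")
    if unresolved_doubt:
        reasons.append("doubt that cannot be resolved alone")
    if not can_work_two_more_timeboxes:
        reasons.append("too tired to work 2 more timeboxes")
    return reasons
```

Returning the reasons rather than a bare boolean gives the oncall person a ready-made sentence for the escalation message.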

Incident management - Fatigue Management
Recovery
• Take time to recover from the previous night, try to sleep.
• Most of the time, the adrenaline from the incident helps you wake up quite early. But then make sure to take your afternoon off: there's always an aftereffect.
Shifts
• Allow temporary overrides, and don't judge the reason why the oncall person needs the override.

Incident management - Quick fix
• A quick fix has to be quick. It doesn't have to be clean, just reliable enough to hold until the team can implement a better solution.
• We don't have time to lose:
  ◦ Human fatigue costs
  ◦ SLA costs
  ◦ Engineering costs (doing oncall is not doing engineering)
• Set causality relations between items
• Make decisions. If in doubt or hesitating, escalate (it's a decision too)

Incident management - PostMortem
This moment is decisive: this is where you decide whether the incident weakens or strengthens the platform and the organization
• Quick fix: buys time. The service is up, people can sleep, clients can access the service, pressure is low, which allows you to concentrate on long-term solutions
• Long-term fix: automation, time gained
• Repeated quick fix: time lost

Incident management - PostMortem
There is always a long-term fix, find it! Then ensure its implementation is added to the backlog with the correct priority
• Soon after the incident, put the people involved in a room, write things down, look for improvements.
• Use the 5 Ws technique to describe the incident.
• Each incident needs its postmortem, for improvement and also for reporting
• Log how much time has been spent on the incident

Incident management - PostMortem Culture
• Mindset: blameless culture, addressing work issues without attacking people.
• Accept failure, assume people are doing the best they can
• So you can concentrate on being factual and look for future improvements
• Instead of “I should have communicated more, I'm sorry”
• Say: “We haven't communicated enough. Next time we'll communicate more regularly.”
• As a consequence, accountability comes naturally

Incident management - Reporting
• We have to report on how difficult the oncall is
  ◦ To be able to defend recruitment, engineering time on the roadmap, empathy about fatigue, and the importance of SRE principles for devs
• Split the platform into business categories and assign post-mortems to them, to spot where the time is spent during oncall.
• It creates a way to prioritize what can be done next for platform stability.
• When doing the monthly report, analyze the incidents and ensure they are solved from one month to the next.
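Spotting where oncall time goes can be as simple as summing the time logged in each post-mortem per business category. The sketch below assumes each incident is recorded as a (category, minutes spent) pair; the function name and the category names in the usage note are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def minutes_per_category(incidents: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the time spent per business category, largest total first."""
    totals: Dict[str, int] = defaultdict(int)
    for category, minutes in incidents:
        totals[category] += minutes
    # Sort so the category costing the most oncall time comes first.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

For example, `minutes_per_category([("ingestion", 90), ("search", 30), ("ingestion", 45)])` puts the hypothetical "ingestion" category first with 135 minutes, which is the kind of ordering a monthly report can use to prioritize stability work.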

Incident management - What we've learnt
• Timebox to ensure progress in resolution
• Communicate to enable teamwork to solve complex issues
• Be blameless to address work problems and improve

Thank you! Questions?
