Incident management is a very stressful and complex moment for an SRE; handling incidents correctly requires rigorous organisation.
This is feedback on what we have learnt and what we experience daily at Synthesio.
We're going to talk about time management, fatigue, blameless culture, prioritization, postmortems...
This is a non-technical presentation, but it is aimed at engineers and technicians, mostly DevOps and SREs, who can benefit from our feedback to improve their daily organisation.
What we’ve learnt in the SRE team
SRE Director @ Synthesio Paris
12 years of oncall
5 years of managing oncall people
SRE: Site Reliability Engineer
Engineer services for safe and rapid changes,
balanced with risks of unavailability, to
optimize end user overall satisfaction.
SLI SLO SLA
SLI: Indicator: internal metric. Identify all key technical business metrics
SLO: Objective: internal objective. Get an agreement on all Tech and product
SLA: Agreement: client facing. Get an agreement with Sales, Marketing, Support
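To make the three terms concrete, here is a minimal sketch of how a measured SLI can be compared with an internal SLO and a client-facing SLA. The availability metric, the request counts and the 99.9% / 99.5% targets are assumptions chosen for illustration; they are not figures from this talk.

```python
from dataclasses import dataclass


@dataclass
class Availability:
    """Availability SLI computed from request counts (hypothetical metric)."""
    good_requests: int
    total_requests: int

    @property
    def sli(self) -> float:
        # With no traffic there is nothing to violate, so report 100%.
        if self.total_requests == 0:
            return 1.0
        return self.good_requests / self.total_requests


def check_targets(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI with the internal SLO and the client-facing SLA."""
    if sli < sla:
        return "SLA breached"
    if sli < slo:
        return "SLO missed, SLA still met"
    return "within SLO"


# Illustrative numbers only: one month of traffic, SLO stricter than SLA.
month = Availability(good_requests=999_200, total_requests=1_000_000)
status = check_targets(month.sli, slo=0.999, sla=0.995)
print(f"SLI={month.sli:.4%} -> {status}")
```

Keeping the SLO stricter than the SLA leaves a margin: the team can react to a missed objective before the client-facing agreement is at risk.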
How to succeed in Reliability
The word comes from control theory, first theorized by James Clerk Maxwell around 1868.
Control theory in control systems engineering is a subfield of mathematics that deals with the
control of continuously operating dynamical systems in engineered processes and machines.
The objective is to develop a control model for controlling such systems using a control action in
an optimum manner without delay or overshoot and ensuring control stability.*
Observability comes with Controllability
for retroaction on the system to ensure
* source https://en.wikipedia.org/wiki/Control_theory
How to succeed in Reliability
Switch on the light
Set causality relations between items
Be factual, prove connections between symptoms and causes
It’s an important situation that
requires immediate human
intervention, with a lot of
Manage an unexpected event
occurring at an unexpected time,
and resolution can take an
● Business-wise: client churn
● Technically: never ever lose data
● On the human side: fatigue is a real concern
=> Very stressful situation
Jaime Woo, DigitalOcean, src: https://www.usenix.org/node/218910
● Maximize productivity of the team
○ Maximize engineering work
○ Lower proportion of operational/Toil work, including oncall
Objectives for oncall people
● Sleep, eat, have a social life as much as anyone not oncall
○ Have a stable platform
○ Stress level has to be lowered as much as we can
during oncall operations
○ Recovery time from stress and fatigue has to be part
of the work organization
● Human processing is very slow, around 60 bits/sec*, expensive, and limited
● Context switching can consume up to 40% (80% in the worst cases) of productive time **
● Humans can become very inefficient, even counterproductive, under stress
● So we have to take care of our most complex systems, humans
● During oncall, it is essential to limit unnecessary human processing and
** https://www.apa.org/research/action/multitask and https://blog.rescuetime.com/context-switching/
● Managing an oncall team means managing a part of their private lives: when they can’t
go to the cinema, when they risk being woken up
● Interactions within the team have to be healthy and clean,
○ at some point these interactions will occur during personal time (override,
This can work only by developing empathy
- Receive an alert
- Manage Time
- Manage Fatigue
- Quick fix
std business hours
● Silence / snooze known alerts and further incoming related
alerts, to avoid interruptions and limit noise.
Evaluate / prioritize
● Do I really have to wake up (yes, too late) and resolve
● If I already work on another alert, which one to
● What’s the business impact / risk ?
● Is data at risk ?
● Should I escalate ?
● Several levels of communication:
○ Public / client-facing
○ Internal / company
○ Internal / tech team
○ External / Provider
● Easier when the organization has a specific team (support) for communication to
clients, so SREs can concentrate on internal communication
● Communicate as broadly as you can: “Issue detected on a component, so this
function is slower than usual and we’ll have to estimate whether data has to be
restored. The team is on it, no ETA, next status in 30 min.”
● The message should contain:
○ what you know once a first impact estimate has been made, but also
the things you don’t know.
○ an estimated time of resolution, or a set recurrence for the messages.
○ It is not easy to communicate bad news, but it is very important as it buys you
uninterrupted time to fix the problem: your well-educated colleagues will
wait until your next message without interrupting you with questions
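The message contents listed above can be sketched as a small template. This is only an illustration: the `StatusUpdate` class and its field names are hypothetical, and the 30-minute default simply mirrors the example message quoted earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StatusUpdate:
    """Hypothetical template for an incident status message."""
    known: List[str]                                   # facts established so far
    unknown: List[str] = field(default_factory=list)   # open questions, stated openly
    eta: Optional[str] = None                          # None means "no ETA yet"
    next_update_minutes: int = 30                      # recurrence of the messages

    def render(self) -> str:
        lines = ["What we know: " + "; ".join(self.known)]
        if self.unknown:
            lines.append("What we don't know yet: " + "; ".join(self.unknown))
        lines.append(f"ETA: {self.eta}" if self.eta else "ETA: none yet")
        lines.append(f"Next status in {self.next_update_minutes} min")
        return "\n".join(lines)


update = StatusUpdate(
    known=["an issue was detected on a component, this function is slower than usual"],
    unknown=["whether data has to be restored"],
)
print(update.render())
```

Always ending with the time of the next status is what buys the uninterrupted time: readers know when to expect news and have no reason to ask.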
● Log everything you’re doing:
○ Actions (command lines)
○ Thoughts, doubts, ideas, attempts, first interpretations
○ Links / screenshots of the graphs you observe (i.e. give your sources)
○ If you post a link, always add a comment.
○ A log must include a timestamp
● A Slack channel is fine, so that this can be shared for handover or for a later post-mortem
● Especially on Slack: write sentences! Subject + verb + complements
● Writing is needed as an asynchronous way of communication, even on a
● We need background and context, for instance for the rest of the
team that will read back through what happened during the night
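As an illustration of the logging rules above (a timestamp on every entry, a comment with every link), here is a minimal sketch of a log-line formatter. The `log_entry` helper, its entry kinds and the output format are assumptions, not a tool described in this talk; the resulting line could be pasted into a Slack channel.

```python
from datetime import datetime, timezone
from typing import Optional


def log_entry(kind: str, text: str, link: Optional[str] = None,
              now: Optional[datetime] = None) -> str:
    """Format one incident log line: timestamp, kind of entry, sentence, optional link."""
    if link and not text.strip():
        # Rule from the list above: a link never goes without a comment.
        raise ValueError("a link must always come with a comment")
    # Normalise to UTC so entries from different people line up in the post-mortem.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%SZ")
    line = f"[{stamp}] {kind.upper()}: {text.strip()}"
    if link:
        line += f" ({link})"
    return line


print(log_entry("action", "I restarted the worker because its queue was stuck."))
print(log_entry("doubt", "I am not sure the restart is enough, the queue may fill again."))
```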
● SLA doesn’t matter during an incident, apart from estimating
● It’s too late to wonder if we meet SLAs; now time has to be used in the best
● What’s important: how you progress in implementing a fix as quickly as
possible with the current means.
● “As quickly as possible”: are you making progress? Don’t repeat the same actions.
“Insanity is doing the same thing over and over again and expecting different
results.” (not really from Einstein) (valid in science, not in training/learning conditions)
● Timebox your work: set a duration for each iteration:
○ 5 min to 3 h, depending on impact/criticality and ease of resolution.
Determine what a good timebox is before the incident.
● When an iteration ends, stop the operational work and re-evaluate the
incident from the beginning: evaluate, communicate, manage time, manage
fatigue, imagine another quick fix.
Don’t work if it’s not absolutely necessary. Don’t lose sleep time
● Escalate if you are blocked for 2 timeboxes: no progress
● Escalate if you have a doubt you can’t resolve by yourself. Even for simplest
● Evaluate your capacity to work at least 2 more timeboxes.
● If too tired, Escalate
○ If there is nobody to escalate to, notify your management and get some rest.
● Take time to recover from the previous night, try to sleep.
● Most of the time, the adrenaline from the incident helps you wake up
quite early. But then make sure to take your afternoon off, there’s always an
● Allow temporary overrides and don’t judge the reason why the oncall
person needs this override.
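The timebox and escalation rules above can be written down as a simple decision helper. This is a sketch under assumptions: the `OncallState` fields and the order in which the rules are checked are illustrative choices, not a procedure prescribed in this talk.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class OncallState:
    """Hypothetical snapshot taken at the end of a timebox."""
    timeboxes_without_progress: int     # consecutive timeboxes with no progress
    has_unresolved_doubt: bool          # a doubt you cannot remove by yourself
    can_work_two_more_timeboxes: bool   # honest self-assessment of fatigue


def should_escalate(state: OncallState) -> Tuple[bool, str]:
    """Return (escalate?, reason) following the rules listed above."""
    if not state.can_work_two_more_timeboxes:
        return True, "too tired to work two more timeboxes"
    if state.timeboxes_without_progress >= 2:
        return True, "blocked for two timeboxes without progress"
    if state.has_unresolved_doubt:
        return True, "a doubt I cannot remove by myself"
    return False, "keep working and re-evaluate at the end of the timebox"


decision, reason = should_escalate(
    OncallState(timeboxes_without_progress=2,
                has_unresolved_doubt=False,
                can_work_two_more_timeboxes=True)
)
print(decision, reason)
```

Fatigue is checked first in this sketch because a tired person is the least able to judge the other two conditions reliably.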
● A quick fix has to be quick; it doesn’t have to be clean, just reliable enough to
hold until the team can implement a better solution.
● We don’t have time to lose:
○ Human fatigue costs
○ SLA costs
○ Engineering costs (doing oncall is not doing engineering)
● Set causality relations between items
● Make decisions. If in doubt or hesitating, escalate (it’s a decision too)
This moment is decisive: this is where you decide whether the incident weakens or
strengthens the platform and the organization
● Quick fix: buys time. The service is up, people can sleep, clients can access the
service, and pressure is low, which allows the team to concentrate on long-term solutions
● Long-term fix: automation, time gained
● Repeated quick fix: time lost
There is always a long-term fix: find it! Then ensure its implementation is
added to the backlog with the correct priority
● Soon after the incident, put the people involved in a room, write things
down, and look for improvements.
● Use the 5 Ws technique to describe the incident.
● Each incident needs its postmortem, for improvement and also for reporting
● Log how much time has been spent on the incident
● Mindset : Blameless culture: addressing work issues without attacking
● Accept failure, assume people are doing the best they can
● That way you can concentrate on being factual and look for future improvements
● Instead of “I should have communicated more, I’m sorry”
● Say: “We haven’t communicated enough. Next time we’ll communicate
● As a consequence, accountability comes naturally
● We have to report on how difficult oncall is,
○ To be able to justify hiring, engineering time on the roadmap,
empathy about fatigue, and the importance of SRE principles for devs
● Split the platform into business categories and assign post-mortems to them, to spot
where the time is spent during oncall.
● It creates a way to prioritize what can be done next for platform stability.
● When doing the monthly report, analyze incidents and ensure they are solved from
one month to the next.
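The reporting idea above, time spent per business category and compared month over month, can be sketched in a few lines. The record layout, the category names ("ingestion", "search") and all hours are made up for illustration; they are not data from this talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (month, business category, hours spent on the incident); hypothetical layout.
Postmortem = Tuple[str, str, float]


def hours_by_category(postmortems: Iterable[Postmortem], month: str) -> Dict[str, float]:
    """Sum the oncall hours logged in post-mortems, per business category, for one month."""
    totals: Dict[str, float] = defaultdict(float)
    for pm_month, category, hours in postmortems:
        if pm_month == month:
            totals[category] += hours
    # Most expensive category first: a natural candidate for the next stability work.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


postmortems = [
    ("2024-01", "ingestion", 6.0),
    ("2024-01", "search", 1.5),
    ("2024-01", "ingestion", 2.5),
    ("2024-02", "ingestion", 1.0),
]
january = hours_by_category(postmortems, "2024-01")
february = hours_by_category(postmortems, "2024-02")
print(january)
print(february)
```

Comparing the two months shows whether the categories that cost the most time are actually getting better, which is the point of the monthly review.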
What we’ve learnt
● Timebox to ensure progress in resolution
● Communicate to enable teamwork to solve complex issues
● Be blameless to address work problems and improve
Thank you!