Incident Management - What we have learnt

gestassy
November 18, 2019

Incident management is a very stressful and complex moment for an SRE; handling incidents correctly requires rigorous organisation.

Feedback on what we have learnt and what we experience daily at Synthesio.

We're going to talk about time management, fatigue, blameless culture, prioritization, postmortems ...

This is a non-technical presentation aimed at engineers and technicians, mostly DevOps / SRE, who can benefit from our feedback to improve their daily organisation.

Transcript

  1. Incident Management
    What we’ve learnt at SRE team

  2. Whoami
    Guillaume Estassy
    SRE Director @ Synthesio Paris
    12 years of oncall
    5 years of management of oncall people
    @gestassy

  3. Synthesio

  4. Synthesio

  5. SRE: Site Reliability Engineer
    Engineer services for safe and rapid changes,
    balanced with risks of unavailability, to
    optimize end user overall satisfaction.

  6. SLI SLO SLA
    SLI: Indicator: internal metric. Identify all key technical business metrics
    SLO: Objective: internal objective. Get an agreement across all of Tech and Product
    SLA: Agreement: client facing. Get an agreement with Sales, Marketing, Support

  7. Observability
    How to succeed in Reliability
    The word comes from control theory, first theorized (roughly speaking) by James Clerk Maxwell in 1868.
    Control theory in control systems engineering is a subfield of mathematics that deals with the
    control of continuously operating dynamical systems in engineered processes and machines.
    The objective is to develop a control model for controlling such systems using a control action in
    an optimum manner without delay or overshoot and ensuring control stability.*
    Observability comes with Controllability:
    feedback on the system to ensure
    stability
    * source https://en.wikipedia.org/wiki/Control_theory

  8. Observability
    How to succeed in Reliability
    ● Logs
    ● Metrics
    ● Tracing
    ● Visualization
    ● Alerting
    Switch on the light
    Establish causal relationships between items
    Prioritize information
    Be factual, prove connections between symptoms and causes

  9. Incident
    It’s an important situation that
    requires immediate human
    intervention, with a lot of
    unknowns:
    managing an unexpected event,
    occurring at an unexpected time,
    whose resolution can take an
    unexpected duration

  10. Incident
    Risks
    ● Business: client churn
    ● Technical: never ever lose data
    ● Human: fatigue is a real concern
    => Very stressful situation

  11. Incident
    Oncall stress
    Jaime Woo, DigitalOcean, src: https://www.usenix.org/node/218910

  12. Incident management
    Objective
    ● Maximize productivity of the team
    ○ Maximize engineering work
    ○ Lower the proportion of operational/toil work, including oncall

  13. Incident management
    Objectives for oncall people
    ● Sleep, eat and have a social life as much as anyone not oncall
    ○ Have a stable platform
    ○ Stress levels have to be lowered as much as we can
    during oncall operations
    ○ Recovery time from stress and fatigue has to be part
    of the work organization

  14. Incident management
    Human
    ● Human processing is very slow (around 60 bits/sec*), expensive, and limited
    ● Context switching can kill up to 40% (80% in the worst cases) of productive time **
    ● Humans can become very inefficient, even counterproductive, under stress
    ● So we have to take care of our most complex systems: humans
    ● During oncall, it is essential to limit unnecessary human processing and
    context switching
    * https://worldmentalcalculation.com/2019/06/30/fastest-possible-processing-speed-of-the-human-brain/
    ** https://www.apa.org/research/action/multitask and https://blog.rescuetime.com/context-switching/

  15. Incident management
    Empathy
    ● Managing an oncall team means managing a part of their private lives: when they can’t
    go to the cinema, when they risk being woken up
    ● Interactions within the team have to be healthy and clean,
    ○ at some point these interactions will occur during personal time (override,
    escalation, collaboration)
    This can only work by developing empathy

  16. Incident management
    Steps
    - Receive an alert
    - Evaluate/Prioritize
    - Communicate
    - Manage Time
    - Manage Fatigue
    - Quick fix
    - Debrief
    - Report
    - Improve
    Incident time
    std business hours

  17. Incident management
    Receive Alert
    ● Silence / snooze known alerts and further incoming related
    alerts, to avoid interruptions and limit noise.

  18. Incident management
    Evaluate / prioritize
    ● Do I really have to wake up (yes, too late) and resolve
    this now?
    ● If I am already working on another alert, which one should I
    prioritize?
    ● What’s the business impact / risk?
    ● Is data at risk?
    ● Should I escalate?

  19. Incident management
    Communicate
    ● Several levels of communication:
    ○ Public / client-facing
    ○ Internal / company
    ○ Internal / tech team
    ○ External / provider
    ● Easier when the organization has a specific team (support) for communication with
    clients, so SRE can concentrate on internal communication

  20. Incident management
    Communicate
    ● Communicate as broadly as you can: “Issue detected on a component, so this
    function is slower than usual and we’ll have to estimate whether data has to be
    restored. The team is on it, no ETA, next status in 30 min.”
    ● The message should contain:
    ○ what you know once a first impact estimation has been done, but also
    the things you don’t know.
    ○ an estimated time of resolution, or else a set recurrence for the messages.
    ○ It is not easy to communicate bad news, but it is very important as it buys you
    uninterrupted time to fix the problem: your well-educated colleagues will
    wait until your next message without interrupting you with questions
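
As an illustration only, the message structure above (what is known, what is unknown, an ETA or a next-update time) could be sketched as a small helper. The function name `status_message`, its parameters and the output layout are hypothetical and not taken from the talk.

```python
from datetime import datetime, timedelta, timezone


def status_message(known, unknown, next_update_minutes, eta=None, now=None):
    """Build a status update: known facts, unknowns, ETA, next update time.

    All names and the layout are illustrative assumptions, not Synthesio's format.
    """
    # Normalize to UTC so the "UTC" label in the message is always accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    next_update = now + timedelta(minutes=next_update_minutes)
    parts = [
        f"Known: {known}",
        f"Unknown: {unknown}",
        f"ETA: {eta if eta else 'none yet'}",
        f"Next status at {next_update:%H:%M} UTC",
    ]
    return " | ".join(parts)


if __name__ == "__main__":
    print(status_message(
        known="search is slower than usual",
        unknown="whether data must be restored",
        next_update_minutes=30,
    ))
```

Stating the next update time explicitly is what lets colleagues wait instead of asking, even when there is no ETA.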

  21. Incident management
    Communicate
    ● Log everything you’re doing:
    ○ Actions (command lines)
    ○ Thoughts, doubts, ideas, attempts, first interpretations
    ○ Links / screenshots of the graphs you observe (i.e. give your sources)
    ○ If you post a link, always add a comment.
    ○ A log must include a timestamp
    ● A Slack channel is fine, so that this can be shared for handover or for a later post-mortem
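
A minimal sketch of the logging rules above, assuming a plain one-line text format: every entry carries a timestamp, and a link is refused unless it comes with a comment. The helper `log_entry` and its format are hypothetical, not a tool described in the talk.

```python
from datetime import datetime, timezone


def log_entry(kind, text, link=None, now=None):
    """Format one incident-log line: UTC timestamp, entry kind, text, optional link.

    The name, parameters and layout are illustrative assumptions.
    """
    # Rule from the slide: a link must never be posted without a comment.
    if link is not None and not text.strip():
        raise ValueError("a link must always come with a comment")
    # Normalize to UTC so the trailing "Z" is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%SZ")
    line = f"[{stamp}] {kind.upper()}: {text}"
    if link is not None:
        line += f" ({link})"
    return line


if __name__ == "__main__":
    print(log_entry("action", "restarted the indexer service"))
    print(log_entry("doubt", "not sure the queue backlog is a cause or a symptom"))
```

The resulting lines can be pasted into the incident channel as-is, which keeps the timeline usable for handover and for the post-mortem.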

  22. Incident management
    Communicate
    ● Especially on Slack: write sentences! Subject + verb + complements
    ● Writing is needed as an asynchronous means of communication, even in
    chat mode.
    ● We need background and context. For instance, for the rest of the
    team who will read back through what happened during the night

  23. Incident management
    Time Management
    ● The SLA doesn’t matter during an incident, other than for estimating
    criticality / escalation needs.
    ● It’s too late to wonder whether we meet SLAs; time now has to be used in
    best-effort mode.
    ● What’s important: how you progress in implementing a fix as quickly as
    possible with the current means.
    ● “As quickly as possible”: Are you progressing? Don’t repeat the same actions.
    “Insanity is doing the same thing over and over again and expecting different
    results.” (not really from Einstein) (valid in science, not in training/learning conditions)

  24. Incident management
    Time Management
    ● Timebox your work, set a duration for each iteration:
    ○ 5 min to 3 h, depending on impact/criticality and ease of solving.
    Determine what a good timebox is before the incident.
    ● When an iteration ends, stop the operational work and re-evaluate the
    incident from the beginning: evaluate, communicate, manage time, manage
    fatigue, imagine another quick fix.

  25. Incident management
    Fatigue Management
    Don’t work if it’s not absolutely necessary. Don’t lose sleep time.
    Escalation
    ● Escalate if you have been blocked for 2 timeboxes: no progress
    ● Escalate if you have a doubt you can’t resolve by yourself. Sometimes even for the
    simplest operations.
    ● Evaluate your capacity to work at least 2 more timeboxes.
    ● If too tired, escalate
    ○ If there is nobody to escalate to, notify your management and get some rest.
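
The escalation rules above can be read as a small decision function, evaluated at the end of each timebox. This is only a sketch: the function name, parameter names and return strings are hypothetical, and applying the "nobody to escalate to" fallback to every escalation reason (the slide states it only for fatigue) is an assumption.

```python
def next_step(timeboxes_without_progress, has_unresolved_doubt,
              can_work_two_more_timeboxes, someone_to_escalate_to):
    """Decide what to do at the end of a timebox, following the slide's rules.

    Names and return values are illustrative assumptions.
    """
    should_escalate = (
        timeboxes_without_progress >= 2      # blocked for 2 timeboxes
        or has_unresolved_doubt              # a doubt you can't resolve alone
        or not can_work_two_more_timeboxes   # too tired to keep going
    )
    if not should_escalate:
        return "continue"
    if someone_to_escalate_to:
        return "escalate"
    # Assumption: the fallback is applied whatever the reason for escalating.
    return "notify management and rest"


if __name__ == "__main__":
    print(next_step(2, False, True, True))    # blocked for two timeboxes
    print(next_step(0, False, False, False))  # too tired, nobody available
```

Escalation is treated here as a normal outcome of the check, not as a failure, which matches the idea that deciding to escalate is itself a decision.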

  26. Incident management
    Fatigue Management
    Recovery
    ● Take time to recover from the previous night, try to sleep.
    ● Most of the time, the adrenaline from the incident helps you wake up
    quite early. But then make sure to take your afternoon off; there’s always an
    aftereffect.
    Shifts
    ● Allow temporary overrides and don’t judge the reason why the oncall
    person needs the override.

  27. Incident management
    Quick fix
    ● A quick fix has to be quick. It doesn’t have to be clean, just reliable enough to
    hold until the team can implement a better solution.
    ● We don’t have time to lose:
    ○ Human fatigue costs
    ○ SLA costs
    ○ Engineering costs (doing oncall is not doing engineering)
    ● Establish causal relationships between items
    ● Take decisions. If in doubt or hesitating, escalate (that’s a decision too)

  28. Incident management
    PostMortem
    This moment is decisive: this is where you decide whether the incident weakens or
    strengthens the platform and the organization
    ● Quick fix: buys time. Service is up, people can sleep, clients can access the
    service, pressure is low, which allows you to concentrate on long-term solutions
    ● Long-term fix: automation, time gained
    ● Repeated quick fix: time lost

  29. Incident management
    PostMortem
    There is always a long-term fix, find it! Then ensure its implementation is
    added to the backlog with the correct priority
    ● Soon after the incident, put the people involved in a room, write things
    down, look for improvements.
    ● Use the 5 Ws technique to describe the incident.
    ● Each incident needs its postmortem, for improvement and also for reporting
    ● Log how much time has been spent on the incident

  30. Incident management
    PostMortem Culture
    ● Mindset: blameless culture. Address work issues without attacking
    people.
    ● Accept failure, assume people are doing the best they can
    ● So you can concentrate on being factual and look for future improvements
    ● Instead of “I should have communicated more, I’m sorry”
    ● Say: “We haven’t communicated enough. Next time we’ll communicate
    more regularly”.
    ● As a consequence, accountability comes naturally

  31. Incident management
    Reporting
    ● We have to report on how difficult oncall is,
    ○ To be able to justify recruitment, engineering time on the roadmap,
    empathy about fatigue, the importance of SRE principles for devs
    ● Split the platform into business categories and assign post-mortems to them, to spot
    where the time is spent during oncall.
    ● This creates a way to prioritize what can be done next for platform stability.
    ● When doing the monthly report, analyze and ensure incidents are solved from
    one month to the next.

  32. Incident management
    What we’ve learnt
    ● Timebox to ensure progress in resolution
    ● Communicate to enable teamwork to solve complex issues
    ● Be blameless to address work problems and improve

  33. Thank you !
    Questions ?
