Helping operations top-heavy teams the smart way

Michael
October 29, 2018

All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success.

We will look at Code Yellow, the term we use for this process of “righting the ship”, and discuss how to identify teams that are struggling. Drawing on three separate experiences, we will examine some of the root causes, the steps that were taken, and how the engineering organization as a whole supports the process.

Transcript

  1. Helping operations top-heavy teams the smart way
    Jeff Weiner
    Chief Executive Officer
    Michael Kehoe
    Staff Site Reliability Engineer
    Todd Palino
    Sr Staff Site Reliability Engineer

  2. This Is The Only Slide You May Need a Picture Of
    slideshare.net/ToddPalino slideshare.net/MichaelKehoe3

  3. Michael Kehoe
    $ WHOAMI
    • Staff Site Reliability Engineer @ LinkedIn
    • Production-SRE Team
    • Funny accent = Australian + 4 years American
    • Former Network Engineer at the University of Queensland

  4. Todd Palino
    $ WHOAMI
    • Senior Staff SRE @ LinkedIn
    • Capacity Engineering Team
    • Co-Author of Kafka: The Definitive Guide
    • Late of VeriSign Infrastructure Engineering

  5. Code Yellow
    When Operations Isn’t Perfect
    https://devops.com/code-yellow-when-operations-isnt-perfect/

  6. This talk is not
    • How to quickly erase all your technical debt
    • How to change your engineering culture

  7. This talk is
    • How to identify team anti-patterns
    • How to work through high toil
    • How to create sustainable workloads

  8. Today’s agenda
    1 Background
    2 Scenario 1: Traffic-SRE
    3 Scenario 2: Kafka-SRE
    4 Building A Formula For Success
    5 Key Learnings
    6 Q&A

  9. Background

  10. Personal Experience in the past 19 months
    ASSISTANCE RENDERED
    • Traffic-SRE: Technical Debt / Resource Allocation
    • Voyager-SRE: Technical Debt
    • Capacity War-room
    • Espresso-SRE: Reliability
    • Kafka-SRE: Capacity and Alert Fatigue

  11. Scenario 1: Traffic-SRE

  12. Problem Statement
    Technical Debt
    • Written documentation needed improvement
    • Deployment infrastructure needed investment
    • Alert Fatigue
    Traffic-SRE

  13. Problem Statement
    Resource Allocations
    • Backlog of work for clients
    • Staff shortage

  14. Scenario 2: Kafka

  16. Problem Statement
    Capacity Planning
    • Multi-tenant Infrastructure
    • No resource controls
    • Unclear resource ownership
    • Ad-hoc capacity planning
    • Sudden 100% increase in traffic

  17. Problem Statement
    Alert Fatigue
    • Multiple applications overutilized
    • No time for proactive work
    • Most alerts non-actionable
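
One way to put a number on "most alerts non-actionable" is to tag each alert with whether the on-call engineer actually had to do something, then track the ratio over time. The following is a minimal sketch under that assumption; the `Alert` type, the function names, and the sample data are all hypothetical and are not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str          # hypothetical alert identifier
    actionable: bool   # did the on-call engineer have to act on it?


def non_actionable_ratio(alerts):
    """Return the fraction of alerts that required no action (0.0 if empty)."""
    if not alerts:
        return 0.0
    noise = sum(1 for a in alerts if not a.actionable)
    return noise / len(alerts)


def noisiest_alerts(alerts, top=3):
    """Return the alert names that fired most often without needing action."""
    counts = Counter(a.name for a in alerts if not a.actionable)
    return counts.most_common(top)


# Made-up sample data, for illustration only.
sample = [
    Alert("disk_usage_warn", False),
    Alert("disk_usage_warn", False),
    Alert("broker_offline", True),
    Alert("consumer_lag", False),
]
print(non_actionable_ratio(sample))    # 0.75
print(noisiest_alerts(sample, top=1))  # [('disk_usage_warn', 2)]
```

A ranking like `noisiest_alerts` gives a team a concrete place to start when tuning or deleting alerts.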

  18. Building a formula for success

  19. Code Yellow

  20. Building a formula for success
    • Problem Statement: Define the areas that need attacking
    • Communication & Partnerships: Communicate expectations with clients & partners
    • Exit Criteria: Define success criteria
    • Resource Acquisition: Get the help that you require
    • Planning: Plan for short-term & long-term

  21. Define the areas that need attacking
    Problem Statement
    • Admit there is a problem
    • Measure the problem
    • Understand the problem
    • Determine underlying causes that need to be fixed
    Building a formula for success

  22. Define success criteria
    Exit Criteria
    • Define concrete goals
    • Define concrete success criteria
    • Measure via an operational metric
    • Measure via a project being completed
    • Define timelines for completion
    Building a formula for success
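
The exit criteria described on this slide (concrete goals, measured by an operational metric or a completed project, each with a timeline) could be tracked with something as small as the sketch below. This is an illustration only: `ExitCriterion`, `code_yellow_status`, the thresholds, and the dates are hypothetical, and a completed project is modelled here as a metric that moves from 0 to 1.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExitCriterion:
    description: str
    target: float          # threshold the metric must reach
    current: float         # latest measured value
    lower_is_better: bool  # True for metrics such as pages per week
    due: date              # agreed timeline for completion

    def met(self):
        """Return True once the current value satisfies the target."""
        if self.lower_is_better:
            return self.current <= self.target
        return self.current >= self.target


def code_yellow_status(criteria, today):
    """Summarise progress as 'exit', 'overdue', or 'in progress'."""
    if all(c.met() for c in criteria):
        return "exit"
    if any(not c.met() and today > c.due for c in criteria):
        return "overdue"
    return "in progress"


# Made-up criteria, for illustration only.
criteria = [
    ExitCriterion("Pages per on-call week", 10, 25, True, date(2018, 12, 1)),
    ExitCriterion("Deployment tooling project done", 1, 1, False, date(2018, 11, 15)),
]
print(code_yellow_status(criteria, date(2018, 11, 1)))  # in progress
```

Keeping the due date next to each criterion makes a slipped timeline visible in the same report that goes to stakeholders.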

  23. Get the help you require
    Resource Acquisition
    • Ask other teams for help
    • Get dedicated engineers / project managers / other roles as required
    • Set an exit date for resources
    Building a formula for success

  24. Plan for the short-term & long-term
    Planning
    • Plan out short-term work
    • Plan out longer-term projects
    • Do they need to be rescheduled?
    • Prioritize work that will reduce toil & burnout (Automation + Measurement)
    Building a formula for success

  25. Communicate expectations with clients & partners
    Communication & Partnerships
    • Communicate problem statement & exit criteria
    • Send regular progress updates
    • Ensure that stakeholders understand delays & expected outcomes
    Building a formula for success

  26. Key Learnings

  27. Key Learnings
    • Measure: Measure toil / overhead
    • Prioritize: Prioritize efforts to remove overhead / toil
    • Communicate: Communicate with partners & teams

  28. Q&A
