Helping operations top-heavy teams the smart way (SF Reliability Engineering Meetup May 2018)

SRE teams sometimes go through periods of staff burnout, technical debt, or poor reliability. As SREs, we're programmed to keep fighting through the issues, when sometimes it's best to step back, assess the situation, and ask for help to put the team back on a successful path. This talk discusses three separate experiences where teams needed extra help to stabilize their services and on-call. We'll cover how to identify struggling teams, get the right assistance, and build a strategy for the team to succeed.

Michael

May 16, 2018

Transcript

  1. Helping operations top-heavy teams
    the smart way
    (Lessons from my experience being loaned out to SRE teams)
    Michael Kehoe
    Staff Site Reliability Engineer


  2. Michael Kehoe
    $ WHOAMI
    • Staff Site Reliability Engineer @ LinkedIn
    • Production-SRE Team
    • Funny accent = Australian + 4 years American
    • Former Network Engineer at the University of Queensland


  3. Production-SRE Team @ LinkedIn
    $ WHOAMI
    • Disaster Recovery – Planning & Automation
    • Incident Response – Process & Automation
    • Visibility Engineering – Making use of operational data
    • Reliability Principles – Defining best practice & automating it


  4. This talk is not
    • How to quickly erase all your technical debt
    • How to change your engineering culture


  5. This talk is
    • How to identify team anti-patterns
    • How to work through high-toil
    • How to create sustainable workloads


  6. Today’s agenda
    1 Background
    2 Scenario 1: Resource Allocation
    3 Scenario 2: Technical Debt
    4 Scenario 3: High Toil
    5 Building A Formula For Success
    6 Key Learnings
    7 Q&A


  7. Background


  8. Personal Experience in the past 15 months
    ASSISTANCE RENDERED
    • Traffic-SRE: Resource Allocation
    • Voyager-SRE: Technical Debt
    • Capacity War-room
    • Espresso-SRE: Reliability


  9. Scenario 1: Resource Allocation


  10. Problem Statement
    Resource Allocations
    • Lack of written documentation
    • Backlog of work for clients
    • Alert Fatigue


  11. Scenario 2: Technical Debt


  12. Problem Statement
    Technical Debt
    • New frontend service
    • Understanding performance is
    complicated
    • Management of dependent services
    was difficult


  13. Scenario 3: High toil


  14. Problem Statement
    High Toil
    • Large multi-tenant/ multi-cluster
    database team
    • Lack of maturity in team-specific
    automation
    • Alert Fatigue


  15. Building a formula for success


  16. Code Yellow


  17. Building a formula for success
    • Problem Statement: Define the areas that need attacking
    • Communication & Partnerships: Communicate expectations with clients & partners
    • Exit Criteria: Define success criteria
    • Resource Acquisition: Get the help that you require
    • Planning: Plan for short-term & long-term


  18. Define the areas that need attacking
    Problem Statement
    • Admit there is a problem
    • Measure the problem
    • Understand the problem
    • Determine underlying causes that need to be fixed
    Building a formula for success


  19. Define success criteria
    Exit Criteria
    • Define concrete goals
    • Define concrete success criteria
    • Measure via an operational metric
    • Measure via a project being
    completed
    • Define timelines for completion
    Building a formula for success
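
The exit-criteria bullets above can be sketched in code. This is only an illustration under stated assumptions, not something from the talk: the `ExitCriterion` class, its fields, the metric names, and all numbers are hypothetical. It shows one way to pair each goal with an operational metric, a target, and a deadline, so that "done" can be checked rather than argued about.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    """One concrete, measurable goal with a deadline (hypothetical structure)."""
    name: str
    target: float                 # threshold the metric must reach
    current: float                # latest measured value
    deadline: date                # timeline for completion
    lower_is_better: bool = True  # e.g. page counts should go down

    def met(self) -> bool:
        if self.lower_is_better:
            return self.current <= self.target
        return self.current >= self.target

    def overdue(self, today: date) -> bool:
        # A criterion is overdue only if it is still unmet after its deadline.
        return not self.met() and today > self.deadline


def can_exit(criteria: list) -> bool:
    """The engagement can end only when criteria exist and all are met."""
    return bool(criteria) and all(c.met() for c in criteria)


# Made-up example values, for illustration only.
criteria = [
    ExitCriterion("pages per on-call week", target=10, current=14,
                  deadline=date(2018, 6, 30)),
    ExitCriterion("runbook coverage (%)", target=90, current=95,
                  deadline=date(2018, 6, 30), lower_is_better=False),
]

for c in criteria:
    print(f"{c.name}: met={c.met()}")
print("can exit:", can_exit(criteria))
```

A project-completion criterion could be modeled the same way, for example with a target of 1 and a current value of 0 or 1.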


  20. Get the help you require
    Resource Acquisition
    • Ask other teams for help
    • Get dedicated engineers/ project
    managers/ other roles as required
    • Set exit-date for resources
    Building a formula for success


  21. Plan for the short-term & long-term
    Planning
    • Plan out short-term work
    • Plan out longer-term projects
    • Do they need to be rescheduled?
    • Prioritize work that will reduce toil &
    burnout (Automation +
    Measurement)
    Building a formula for success


  22. Communicate expectations with clients
    & partners
    Communication
    & Partnerships
    • Communicate problem statement &
    exit criteria
    • Send regular progress updates
    • Ensure that stakeholders understand
    delays & expected outcomes
    Building a formula for success


  23. When Operations Isn’t Perfect
    Code Yellow
    https://devops.com/code-yellow-when-operations-isnt-perfect/


  24. Key Learnings


  25. Key Learnings
    • Measure: Measure toil/overhead
    • Prioritize: Prioritize efforts to remove overhead/toil
    • Communicate: Communicate with partners & teams
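
As a rough illustration of the "measure" learning, the sketch below computes what share of a team's logged hours went to toil. The talk does not prescribe a method; the `toil_fraction` function, the category labels, and the sample hours are all assumptions made for this example.

```python
from collections import defaultdict


def toil_fraction(entries):
    """Return the share of logged hours spent on toil.

    `entries` is an iterable of (category, hours) pairs. Only the
    category labelled "toil" counts as toil (a hypothetical convention).
    """
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    total_hours = sum(totals.values())
    if total_hours == 0:
        return 0.0  # nothing logged, so nothing to report
    return totals.get("toil", 0.0) / total_hours


# Made-up time log for one engineer-week, for illustration only.
week = [
    ("toil", 18.0),      # manual restarts, ticket churn
    ("project", 12.0),   # automation work
    ("toil", 6.0),       # alert triage
    ("meetings", 4.0),
]

print(f"toil share: {toil_fraction(week):.0%}")
```

Tracking this number week over week gives the prioritization and communication steps something concrete to point at.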


  26. Q&A

