Helping operations top-heavy teams the smart way (SF Reliability Engineering Meetup May 2018)

SRE teams sometimes go through periods of staff burnout, technical debt, or poor reliability. As SREs, we’re programmed to keep fighting through the issues, when sometimes it’s best to step back, assess the situation, and ask for help to put the team back on a successful path. This talk discusses three separate experiences where teams needed extra help to stabilize their services and on-call. We’ll discuss how to identify struggling teams, get the right assistance, and build a strategy for the team to succeed.

Michael

May 16, 2018

Transcript

  1. Helping operations top-heavy teams the smart way (Lessons from my experience being loaned out to SRE teams) • Michael Kehoe, Staff Site Reliability Engineer
  2. $ WHOAMI: Michael Kehoe • Staff Site Reliability Engineer @ LinkedIn • Production-SRE Team • Funny accent = Australian + 4 years American • Former Network Engineer at the University of Queensland
  3. $ WHOAMI: Production-SRE Team @ LinkedIn • Disaster Recovery: Planning & Automation • Incident Response: Process & Automation • Visibility Engineering: Making use of operational data • Reliability Principles: Defining best practice & automating it
  4. This talk is not: • How to quickly erase all your technical debt • How to change your engineering culture
  5. This talk is: • How to identify team anti-patterns • How to work through high toil • How to create sustainable workloads
  6. Today’s agenda: (1) Background • (2) Scenario 1: Resource Allocation • (3) Scenario 2: Technical Debt • (4) Scenario 3: High Toil • (5) Building a Formula for Success • (6) Key Learnings • (7) Q&A
  7. Assistance rendered (personal experience in the past 15 months): • Traffic-SRE: Resource Allocation • Voyager-SRE: Technical Debt • Capacity War-room • Espresso-SRE: Reliability
  8. Problem Statement: Technical Debt • New frontend service • Understanding performance is complicated • Management of dependent services was difficult
  9. Problem Statement: High Toil • Large multi-tenant/multi-cluster database team • Lack of maturity in team-specific automation • Alert fatigue
  10. Building a formula for success • Problem Statement: Define the areas that need attacking • Communication & Partnerships: Communicate expectations with clients & partners • Exit Criteria: Define success criteria • Resource Acquisition: Get the help that you require • Planning: Plan for short-term & long-term
  11. Building a formula for success: Problem Statement (Define the areas that need attacking) • Admit there is a problem • Measure the problem • Understand the problem • Determine underlying causes that need to be fixed
  12. Building a formula for success: Exit Criteria (Define success criteria) • Define concrete goals • Define concrete success criteria • Measure via an operational metric • Measure via a project being completed • Define timelines for completion
  13. Building a formula for success: Resource Acquisition (Get the help you require) • Ask other teams for help • Get dedicated engineers, project managers, or other roles as required • Set an exit date for resources
  14. Building a formula for success: Planning (Plan for the short-term & long-term) • Plan out short-term work • Plan out longer-term projects • Do they need to be rescheduled? • Prioritize work that will reduce toil & burnout (Automation + Measurement)
  15. Building a formula for success: Communication & Partnerships (Communicate expectations with clients & partners) • Communicate problem statement & exit criteria • Send regular progress updates • Ensure that stakeholders understand delays & expected outcomes
  16. Key Learnings • Measure: Measure toil/overhead • Prioritize: Prioritize efforts to remove overhead/toil • Communicate: Communicate with partners & teams
  17. Q&A
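
The slides recommend measuring toil and tying exit criteria to an operational metric, but do not show how. The following is a minimal sketch of one way that could look, not the speaker's method: the `WorkLog` record, the hour categories, and the 50% target are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkLog:
    """One engineer's logged hours for a reporting period (hypothetical schema)."""
    engineer: str
    toil_hours: float     # interrupts, manual tickets, alert handling
    project_hours: float  # automation and other planned engineering work


def toil_ratio(logs):
    """Return the fraction of logged hours spent on toil (0.0 if nothing is logged)."""
    toil = sum(log.toil_hours for log in logs)
    total = toil + sum(log.project_hours for log in logs)
    return toil / total if total else 0.0


def exit_criterion_met(logs, target=0.5):
    """Check a hypothetical exit criterion: team toil ratio at or below `target`."""
    return toil_ratio(logs) <= target


# Example week for a two-person team (made-up numbers).
week = [
    WorkLog("engineer-a", toil_hours=30, project_hours=10),
    WorkLog("engineer-b", toil_hours=20, project_hours=20),
]
print(f"toil ratio: {toil_ratio(week):.3f}")
print("exit criterion met:", exit_criterion_met(week))
```

Tracking the same ratio week over week gives the "regular progress updates" a concrete number, and the target can be written directly into the exit criteria agreed with partners.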