Helping operations top-heavy teams the smart way (SF Reliability Engineering Meetup May 2018)

Helping operations top-heavy teams the smart way
(Lessons from my experience being loaned out to SRE teams)
Michael Kehoe, Staff Site Reliability Engineer

$ WHOAMI: Michael Kehoe
• Staff Site Reliability Engineer @ LinkedIn
• Production-SRE Team
• Funny accent = Australian + 4 years American
• Former Network Engineer at the University of Queensland

$ WHOAMI: Production-SRE Team @ LinkedIn
• Disaster Recovery – Planning & Automation
• Incident Response – Process & Automation
• Visibility Engineering – Making use of operational data
• Reliability Principles – Defining best practice & automating it

This talk is not
• How to quickly erase all your technical debt
• How to change your engineering culture

This talk is
• How to identify team anti-patterns
• How to work through high-toil
• How to create sustainable workloads

Today’s agenda
1 Background
2 Scenario 1: Resource Allocation
3 Scenario 2: Technical Debt
4 Scenario 3: High Toil
5 Building A Formula For Success
6 Key Learnings
7 Q&A

Background

Personal Experience in the past 15 months
ASSISTANCE RENDERED
• Traffic-SRE: Resource Allocation
• Voyager-SRE: Technical Debt
• Capacity War-room
• Espresso-SRE: Reliability

Scenario 1: Resource Allocation

Problem Statement: Resource Allocations
• Lack of written documentation
• Backlog of work for clients
• Alert Fatigue

Scenario 2: Technical Debt

Problem Statement: Technical Debt
• New frontend service
• Understanding performance is complicated
• Management of dependent services was difficult

Scenario 3: High toil

Problem Statement: High Toil
• Large multi-tenant/multi-cluster database team
• Lack of maturity in team-specific automation
• Alert Fatigue

Building a formula for success

Code Yellow

Building a formula for success
• Problem Statement: Define the areas that need attacking
• Communication & Partnerships: Communicate expectations with clients & partners
• Exit Criteria: Define success criteria
• Resource Acquisition: Get the help that you require
• Planning: Plan for short-term & long-term

Problem Statement: Define the areas that need attacking
• Admit there is a problem
• Measure the problem
• Understand the problem
• Determine underlying causes that need to be fixed

Exit Criteria: Define success criteria
• Define concrete goals
• Define concrete success criteria
• Measure via an operational metric
• Measure via a project being completed
• Define timelines for completion
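As an illustration of the exit criteria above, here is a minimal sketch of how concrete goals, an operational metric, and a timeline might be written down and checked. It is not from the talk: the `ExitCriterion` class, the `exit_report` helper, the metric names, the thresholds, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ExitCriterion:
    """One concrete, measurable goal for ending the engagement (hypothetical)."""
    name: str
    target: float            # value the metric must reach
    current: float           # latest measured value
    deadline: date           # timeline for completion
    lower_is_better: bool = True

    def met(self) -> bool:
        # Compare in the direction that counts as an improvement.
        if self.lower_is_better:
            return self.current <= self.target
        return self.current >= self.target


def exit_report(criteria, today):
    """Return (all_met, overdue_names) for a list of criteria."""
    unmet = [c for c in criteria if not c.met()]
    overdue = [c.name for c in unmet if today > c.deadline]
    return (not unmet, overdue)


criteria = [
    # Operational metric: pages per on-call shift (illustrative numbers).
    ExitCriterion("pages_per_shift", target=5, current=12,
                  deadline=date(2018, 6, 30)),
    # Project completion expressed as a fraction done; higher is better.
    ExitCriterion("runbook_coverage", target=1.0, current=1.0,
                  deadline=date(2018, 5, 31), lower_is_better=False),
]

all_met, overdue = exit_report(criteria, today=date(2018, 7, 15))
print(all_met, overdue)
```

In this example the project-style criterion is met while the operational metric is still above its target and past its deadline, so the engagement would not yet meet its exit criteria.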

Resource Acquisition: Get the help you require
• Ask other teams for help
• Get dedicated engineers/project managers/other roles as required
• Set exit-date for resources

Planning: Plan for the short-term & long-term
• Plan out short-term work
• Plan out longer-term projects
• Do they need to be rescheduled?
• Prioritize work that will reduce toil & burnout (Automation + Measurement)

Communication & Partnerships: Communicate expectations with clients & partners
• Communicate problem statement & exit criteria
• Send regular progress updates
• Ensure that stakeholders understand delays & expected outcomes

Code Yellow: When Operations Isn’t Perfect
https://devops.com/code-yellow-when-operations-isnt-perfect/

Key Learnings

Key Learnings
• Measure: Measure toil/overhead
• Prioritize: Prioritize efforts to remove overhead/toil
• Communicate: Communicate with partners & teams
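The "measure" learning above can be made concrete with a very small sketch that totals logged hours by category and reports the share spent on toil. The talk does not describe a specific method; the `work_log` entries, the category names, and both helper functions are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical work log: (category, hours). Categories and numbers are illustrative.
work_log = [
    ("toil", 6.0),
    ("project", 10.0),
    ("toil", 8.0),
    ("overhead", 4.0),
    ("project", 12.0),
]


def hours_by_category(entries):
    """Sum the logged hours for each category."""
    totals = defaultdict(float)
    for category, hours in entries:
        totals[category] += hours
    return dict(totals)


def toil_fraction(entries):
    """Share of all logged hours that were spent on toil (0.0 if nothing logged)."""
    totals = hours_by_category(entries)
    total = sum(totals.values())
    return totals.get("toil", 0.0) / total if total else 0.0


print(hours_by_category(work_log))
print(f"toil share: {toil_fraction(work_log):.0%}")
```

Tracking this number over time gives a simple baseline for deciding which toil-reduction work to prioritize and for reporting progress to partners.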

Q&A