Doing SRE the right way

Slide 1

Slide 1 text

Doing SRE the Right Way (1/3)
Grooming organizational structures, teams, capacity and culture
Piyush Verma, @realmeson10

Slide 2

Slide 2 text

FAQs
1. What is SRE?
2. Vs DevOps?
3. How much does reliability cost?
4. How long before I am reliable?
5. Isn’t it too early to automate?
6. What does an SRE code?
7. Is Reliability achieved before/after release?
8. What is the right SRE organization structure?
... ask in comments

Slide 3

Slide 3 text

Failure Cycle of Physical Products

Slide 4

Slide 4 text

Failure Cycle of Software Products

Slide 5

Slide 5 text

Feature Upgrade vs Stability

Slide 6

Slide 6 text

Some approaches to build reliability
● Ignorance
● Throw more engineers at the problem
● Recovery Blocks
● N-Version Programming

Slide 7

Slide 7 text

Instead, what we need is...
Self-checking software [Michael Lyu]
Observability
Comprehensive metrics
And, Control Policies that adhere to Production

Slide 8

Slide 8 text

Software Operations should be treated … Like a Product

Slide 9

Slide 9 text

Software Products vs Software Operations?
Product
• Product Manager
• Customer feedback
• Product metrics
• Google analytics
• Conversion funnels
Operations
• One failure at a time
• Ad-hoc increments
• On-call software
• Re-write
• Kubernetes
• Data driven decisions
• Prioritize your failures
• Put a value to the loss

Slide 10

Slide 10 text

What do you see?

Slide 11

Slide 11 text

Viewing from an SRE lens:
● 35% have not scaled *enough* yet.
● 50% have to wait before they scale.
● 25% are paying a person to do this.
● 35% are about to hit this problem at some point.

Slide 12

Slide 12 text

Data We Gather (1)

Slide 13

Slide 13 text

Data We Produce (2)

Slide 14

Slide 14 text

Site Reliability Engineering Where do we start?

Slide 15

Slide 15 text

Reliability (Quality → Metric)

Slide 16

Slide 16 text

Robust vs Reliable
- Should be up?
- Should be up AND not serve errors?
- Should be up AND not serve errors AND serve correct data?
- Should be up AND not serve errors AND serve correct data AND within no time?
- Should be up AND not serve errors AND serve correct data AND within no time AND serve unlimited TPS?
- Should be up AND not serve errors AND serve correct data AND within no time AND serve unlimited TPS AND concurrently?

Slide 17

Slide 17 text

Reliability is not a buffet lunch.

Slide 18

Slide 18 text

There is no bug-free software.
There is no 100% reliable software.

Slide 19

Slide 19 text

Step 1 Service Level Indicators

Slide 20

Slide 20 text

Step2 Service Level Objectives

Slide 21

Slide 21 text

Service Level Objectives Reliability Release Velocity

Slide 22

Slide 22 text

How to set SLOs
• Measurable
• Customer-oriented
• Challenging
• Unambiguous
• All Stakeholders participate

Slide 23

Slide 23 text

How to set SLOs
- Should be up through the day. Downtime allowed 3:00 to 3:15 AM
- Only 1% of requests should have status code 4xx
- Only 0.05% of /deep-health POST/GET can mismatch
- Only 1% of requests should be slower than 100ms
- Numbers should hold up to 100 TPS
- Numbers should hold up to 10 concurrent requests
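Objectives like these are only useful if they can be checked against measured data. A minimal sketch of such a check follows, assuming each request is recorded with a status code and a latency. The `Request` record, the function names, and the reading of "status code 4xx" as the 400 to 499 range are illustrative assumptions, not something the slides define.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one served request."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def error_ratio(requests):
    """Fraction of requests with a status code in the 4xx range (assumed reading)."""
    if not requests:
        raise ValueError("no requests to evaluate")
    errors = sum(1 for r in requests if 400 <= r.status < 500)
    return errors / len(requests)


def slow_ratio(requests, threshold_ms=100.0):
    """Fraction of requests slower than the latency threshold."""
    if not requests:
        raise ValueError("no requests to evaluate")
    slow = sum(1 for r in requests if r.latency_ms > threshold_ms)
    return slow / len(requests)


def meets_slo(requests, max_error_ratio=0.01, max_slow_ratio=0.01):
    """True when both the error objective and the latency objective hold."""
    return (error_ratio(requests) <= max_error_ratio
            and slow_ratio(requests) <= max_slow_ratio)


# Made-up sample: 1000 requests, 5 client errors, 8 slow responses.
sample = ([Request(200, 20.0)] * 987
          + [Request(404, 20.0)] * 5
          + [Request(200, 250.0)] * 8)
print(meets_slo(sample))
```

The thresholds default to the 1% figures from the slide; the remaining objectives (downtime window, TPS, concurrency) would need their own indicators.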

Slide 24

Slide 24 text

SLOs detailed
An SLO of 99% for a service with 1 million requests in a month allows for 10,000 failures.
If the current failure counter == 9,000 and there is still a fortnight left: all hands towards stability.
If the current failure counter == 500: release as often as you want.
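The arithmetic behind this slide can be sketched in a few lines. The function names and the decision rule (compare the share of budget consumed with the share of the window elapsed) are assumptions made for illustration; the slide only gives the two example outcomes.

```python
def error_budget(slo_percent, total_requests):
    """Number of failed requests the SLO tolerates in the window."""
    return round(total_requests * (100 - slo_percent) / 100)


def budget_consumed(failures, budget):
    """Fraction of the error budget already used."""
    return failures / budget


def decide(failures, budget, window_elapsed):
    """Hypothetical policy: stabilize when the budget burns faster than time passes.

    window_elapsed is the fraction of the SLO window that has gone by (0.0 to 1.0).
    """
    if budget_consumed(failures, budget) > window_elapsed:
        return "stabilize"
    return "release"


budget = error_budget(99, 1_000_000)   # 10,000 allowed failures
print(budget)
print(decide(9_000, budget, 0.5))      # a fortnight left in the month
print(decide(500, budget, 0.5))
```

With half the month gone, 9,000 failures means 90% of the budget is spent, so the sketch returns "stabilize"; 500 failures is 5% of the budget, so it returns "release".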

Slide 25

Slide 25 text

SLOs detailed
• Choosing a time window
• Choosing aggregations
• Window lengths

Slide 26

Slide 26 text

You can only do so much, manually. Enter Site Reliability Engineering.

Slide 27

Slide 27 text

So what does an SRE do?
• When an outage happens, minimize the downtime
• Where else is this happening?
• Prevent repetition

Slide 28

Slide 28 text

The SRE Recipe
• Observability - 2 tbsp
• Control - 2 tbsp
• Automation - 1 tbsp
• Root Cause Analysis - 3 tbsp
• Cross-org collaboration - 3 tbsp
• Guard Rails/Frameworks - 4 tbsp

Slide 29

Slide 29 text

Aren’t K8s and cloud already automated?
• If a human does a thing 3x over
• 3 different results
• 3 different bugs
• 1 highly demotivated employee

Slide 30

Slide 30 text

When/Do I need that much automation? Servers can scale, people cannot.

Slide 31

Slide 31 text

Right time to focus on SRE? when($ spend < $ lost)
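The condition on this slide can be spelled out as a small comparison. This is only a sketch: the function name, its parameters, and every number in the usage lines are hypothetical, and a real estimate of loss would include more than hourly revenue.

```python
def should_invest_in_sre(annual_sre_spend, downtime_hours_per_year, revenue_lost_per_hour):
    """Return True when the money lost to downtime exceeds the money spent on SRE."""
    annual_loss = downtime_hours_per_year * revenue_lost_per_hour
    return annual_sre_spend < annual_loss


# Made-up figures for illustration only.
print(should_invest_in_sre(annual_sre_spend=200_000,
                           downtime_hours_per_year=40,
                           revenue_lost_per_hour=10_000))
```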

Slide 32

Slide 32 text

December 1, 1913 12 hours → 150 minutes

Slide 33

Slide 33 text

But that’s exactly how they sold DevOps!
DevOps is the goal. SRE is a way to get there.
Functional Programming : Software Development :: SRE : DevOps

Slide 34

Slide 34 text

So, what did SRE change?
• Feature development is owned by Product Developers
• Quality is owned by QA / Release Managers
• Who owned the Uptime?
• Especially when choosing between Stability vs Feature Release
• Uptime didn’t have owners
• Who owns Capacity Plan and spend?
• More felt over the last 2 decades
• Shops remain open 24/7
• Consumption is 24/7

Slide 35

Slide 35 text

SRE brings to software development what the assembly line brought to manufacturing. Over 60% of failures were bad deployments and 38% of issues were repeated. SRE practice helped cut down operations cost by 60% within a year.

Slide 36

Slide 36 text

Reliability is not a buffet lunch. Go à la carte.

Slide 37

Slide 37 text

Tiered SRE approach
Tier 0: Sporadic work, no dedicated SRE staffing. Put observability in place.
Tier 1: Projects, some dedicated SRE time. Aim to unify dataset across services to build common improvements.
Tier 2: Onboard one service and on-call time. Set up on-call and escalation matrix. See how teams and services adopt.
Tier 3: Onboard other services. Reuse SRE tooling and practices. Go org-wide.

Slide 38

Slide 38 text

SRE team structures 1SRE Common Tooling Embedded Outsourced

Slide 39

Slide 39 text

SRE Maturity Model
9x
Beginner
● Canary/Rolling
● Coverage Validations
● SLI Collection
● On-Call
● Runbooks
Fault Tolerance
● 9x +
● Tolerance Proof
● SLO
● Disaster Recovery
● Rollback
99
Automated
● 99 +
● Chaos Engg
● Capacity Plans
● Pattern based re-architect.
99.9
Automatic
● 99.9 +
● Failure Mode effect analysis
● Cost Analysis
● If this then That
● Auto Healing
99.99
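To make the steps between the maturity levels concrete, the sketch below converts an availability target into the downtime it permits. It assumes the 99, 99.9 and 99.99 markers are availability percentages and uses a 30-day month; the function name and the window length are illustrative choices.

```python
from fractions import Fraction


def allowed_downtime_minutes(availability_percent: str, days: int = 30) -> float:
    """Minutes of downtime permitted in the window for a given availability target.

    The target is passed as a string (for example "99.9") so that Fraction can
    parse it exactly and avoid floating-point noise.
    """
    total_minutes = days * 24 * 60
    unavailable = 1 - Fraction(availability_percent) / 100
    return float(total_minutes * unavailable)


for target in ("99", "99.9", "99.99"):
    print(target, allowed_downtime_minutes(target))
```

Each additional nine cuts the permitted downtime by a factor of ten, from 432 minutes a month at 99% down to 4.32 minutes at 99.99%.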

Slide 40

Slide 40 text

Key Skills of an SRE
● Withstand Boredom
● Attention to detail
● *nix
● Programming
● Shell scripting
● On-call experience (real or assisted)
● Statistics

Slide 41

Slide 41 text

These skills will rob my bank. Not an SRE

Slide 42

Slide 42 text

How do you train SREs?
Based on organizational size and maturity
• Train on Job
• Self-Learn
• Throw them in the Ocean

Slide 43

Slide 43 text

Things to take away
● There is a cost to reliability
● 100% reliability is a myth
● Reliability can (and should) be achieved incrementally
● Reliability is improved in Dev sprints
● Reliability needs participation from every stakeholder
● Reliability is a function. Someone happened to call it SRE.

Slide 44

Slide 44 text

Thank you [email protected] @realmeson10 last9.io