Doing SRE the right way

Site Reliability Engineering (SRE) is an expensive proposition for small and large organizations alike. You can mitigate the costs by:

Setting up the right organizational structure for SRE
Hiring for the right team composition
Building the SRE culture

Piyush Verma, co-founder of Last9 Inc., shares practical insights drawn from both successful and not-so-successful experiences.


Piyush Verma

June 16, 2020


  1. Doing SRE the Right Way (1/3): Grooming organizational structures, teams, capacity and culture. Piyush Verma @realmeson10
  2. FAQs: 1. What is SRE? 2. Vs DevOps? 3. How much does reliability cost? 4. How long before I am reliable? 5. Isn’t it too early to automate? 6. What does an SRE code? 7. Is Reliability achieved before/after release? 8. What is the right SRE organization structure? ... ask in comments
  3. Failure Cycle of Physical Products

  4. Failure Cycle of Software Products

  5. Feature Upgrade vs Stability

  6. Some approaches to build reliability • Ignorance • Throw more engineers at the problem • Recovery Blocks • N-Version Programming
  7. Instead, what we need is... • Self-checking software [Michael Lyu] • Observability • Comprehensive metrics • And, Control Policies that adhere to Production
  8. Software Operations should be treated … Like a Product

  9. Software Products vs Software Operations? Product: • Product Manager • Customer feedback • Product metrics • Google Analytics • Conversion funnels. Operations: • One failure at a time • Ad-hoc increments • On-call software • Re-write • Kubernetes • Data-driven decisions • Prioritize your failures • Put a value to the loss
  10. What do you see?

  11. Viewing from an SRE lens: • 35% have not scaled *enough* yet. • 50% have to wait before they scale • 25% are paying a person to do this • 35% are about to hit this problem at some point
  12. Data We Gather (1)

  13. Data We Produce (2)

  14. Site Reliability Engineering Where do we start?

  15. Reliability (Quality → Metric)

  16. Robust vs Reliable - Should be up? - Should be up AND not serve errors? - Should be up AND not serve errors AND serve correct data? - Should be up AND not serve errors AND serve correct data AND within no time? - Should be up AND not serve errors AND serve correct data AND within no time AND serve unlimited TPS? - Should be up AND not serve errors AND serve correct data AND within no time AND serve unlimited TPS AND concurrently?
  17. Reliability is not a buffet lunch.

  18. There is no bug-free software There is no 100% reliable

  19. Step 1 Service Level Indicators

  20. Step2 Service Level Objectives

  21. Service Level Objectives Reliability Release Velocity

  22. How to set SLOs • Measurable • Customer-oriented • Challenging • Unambiguous • All Stakeholders participate
  23. How to set SLOs - Should be up through the day. Downtime allowed 3:00 to 3:15 AM - Only 1% of requests should have status code 4xx - Only 0.05% of /deep-health POST/GET can mismatch - Only 1% of requests should be slower than 100ms - Numbers should hold up to 100 TPS - Numbers should hold up to 10 concurrent requests
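The percentage targets on this slide can be written down as data and checked against measured counts. The following is a minimal sketch, not the author's tooling: the `Slo` class, the SLO names, and the `evaluate` helper are all hypothetical, and mapping each bullet to a "maximum bad fraction" is one interpretation of the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str                 # hypothetical label for the objective
    max_bad_fraction: float   # largest share of requests allowed to violate it


# Targets taken from the slide; the names are illustrative only.
SLOS = [
    Slo("error_responses", 0.01),         # only 1% of requests may be errors
    Slo("deep_health_mismatch", 0.0005),  # only 0.05% of deep-health checks may mismatch
    Slo("slower_than_100ms", 0.01),       # only 1% of requests may exceed 100 ms
]


def evaluate(slos, bad_counts, total_requests):
    """Return {slo name: True if the objective is met for this window}."""
    results = {}
    for slo in slos:
        bad = bad_counts.get(slo.name, 0)
        results[slo.name] = bad <= slo.max_bad_fraction * total_requests
    return results


if __name__ == "__main__":
    measured = {"error_responses": 80, "deep_health_mismatch": 9, "slower_than_100ms": 150}
    for name, ok in evaluate(SLOS, measured, total_requests=10_000).items():
        print(f"{name}: {'met' if ok else 'violated'}")
```

Keeping each objective as a single number is what makes it measurable and unambiguous in the sense of the previous slide.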
  24. SLOs detailed: An SLO of 99% for a service with 1 million requests in a month allows for 10,000 failures. If the current failure count is 9,000 and there is still a fortnight left: all hands towards stability. If the current failure count is 500, release as often as you want.
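The arithmetic on this slide is an error budget. A small sketch of it follows; the function names are hypothetical, and the decision rule (compare the share of budget burned with the share of the month elapsed) is an assumption added for illustration, since the slide gives only the two example cases.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    return round((1 - slo_target) * total_requests)


def release_posture(slo_target, total_requests, failures_so_far, elapsed_fraction):
    """Suggest 'stabilize' or 'release'.

    Assumed rule (not from the slide): if the budget is being consumed
    faster than the window is elapsing, favour stability work.
    """
    budget = error_budget(slo_target, total_requests)
    if budget <= 0:
        raise ValueError("SLO target leaves no error budget")
    burned_fraction = failures_so_far / budget
    return "stabilize" if burned_fraction > elapsed_fraction else "release"


if __name__ == "__main__":
    print(error_budget(0.99, 1_000_000))                 # budget for the slide's example
    print(release_posture(0.99, 1_000_000, 9_000, 0.5))  # a fortnight left, 9,000 failures
    print(release_posture(0.99, 1_000_000, 500, 0.5))    # a fortnight left, 500 failures
```

With the slide's numbers, 9,000 failures halfway through the month has used 90% of the budget in 50% of the time, while 500 failures has used 5%.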
  25. SLOs detailed • Choosing a time window • Choosing aggregations • Window lengths
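Why the window and the aggregation matter can be shown with a few lines of code. This is a sketch under assumptions: the input is a hypothetical list of per-day `(good, total)` request counts, and the two aggregations shown (pooling counts vs. averaging daily ratios) are common choices, not ones named in the talk.

```python
from collections import deque


def rolling_sli(daily_counts, window_days):
    """Success ratio over the trailing window, one value per day.

    Counts are pooled across the window, so busy days weigh more.
    """
    window = deque(maxlen=window_days)
    ratios = []
    for good, total in daily_counts:
        window.append((good, total))
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        ratios.append(good_sum / total_sum if total_sum else None)
    return ratios


def mean_of_daily_ratios(daily_counts):
    """Alternative aggregation: every day weighs the same, regardless of traffic."""
    ratios = [g / t for g, t in daily_counts if t]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    days = [(5, 10), (990, 990)]          # a quiet bad day, then a busy good day
    print(rolling_sli(days, window_days=2)[-1])  # pooled over both days
    print(mean_of_daily_ratios(days))            # average of the two daily ratios
```

The same two days give a very different picture depending on the aggregation, which is why the choice needs to be agreed on before the SLO is set.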
  26. You can only do so much, manually. Enter Site Reliability

  27. So what does an SRE do? • When an outage happens, minimize the downtime • Where else is this happening? • Prevent repetition
  28. The SRE Recipe • Observability - 2 tbsp • Control - 2 tbsp • Automation - 1 tbsp • Root Cause Analysis - 3 tbsp • Cross-org collaboration - 3 tbsp • Guard Rails/Frameworks - 4 tbsp
  29. Aren’t K8s and cloud already automated? If a human does a thing 3x over: • 3 different results • 3 different bugs • 1 highly demotivated employee
  30. When/Do I need that much automation? Servers can scale, people

  31. Right time to focus on SRE? when($ spend < $

  32. December 1, 1913 12 hours → 150 minutes

  33. But that’s exactly how they sold DevOps! DevOps is the goal. SRE is a way to get there. Functional Programming : Software Development :: SRE : DevOps
  34. So, what did SRE change? • Feature development is owned by Product Developers • Quality is owned by QA / Release Managers • Who owned the Uptime? • Especially when choosing between Stability vs Feature Release • Uptime didn’t have owners • Who owns Capacity Plan and spend? • More felt over the last 2 decades • Shops remain open 24/7 • Consumption is 24/7
  35. SRE brings to software development what the assembly line brought to manufacturing. Over 60% of failures were bad deployments and 38% of issues were repeats. SRE practice helped cut down operations cost by 60% within a year.
  36. Reliability is not a buffet lunch. Go A-la-carte

  37. Tiered SRE approach • Tier 0 (sporadic work, no dedicated SRE staffing): Put observability in place • Tier 1 (projects, some dedicated SRE time): Aim to unify the dataset across services to build common improvements • Tier 2 (onboard one service and on-call time): Set up on-call and an escalation matrix. See how teams and services adopt • Tier 3 (onboard other services, reuse SRE tooling and practices): Go org-wide
  38. SRE team structures 1SRE Common Tooling Embedded Outsourced

  39. SRE Maturity Model • Beginner (9x): Canary/Rolling • Coverage Validations • SLI Collection • On-Call • Runbooks • Fault Tolerance (99): 9x + Tolerance Proof • SLO • Disaster Recovery • Rollback • Automated (99.9): 99 + Chaos Engg • Capacity Plans • Pattern-based re-architecture • Automatic (99.99): 99.9 + Failure Mode Effect Analysis • Cost Analysis • If This Then That • Auto Healing
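To put the availability figures on this slide into perspective, each target can be converted into the downtime it permits. This is a rough sketch: the function name is hypothetical and a 30-day month is assumed as the measurement period.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted over the period at the given availability."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> about {minutes:.2f} minutes per 30 days")
```

Each extra nine cuts the allowance by a factor of ten: roughly 432 minutes at 99%, 43 minutes at 99.9%, and a little over 4 minutes at 99.99%.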
  40. Key Skills of an SRE • Withstand Boredom • Attention to detail • *nix • Programming • Shell scripting • On-call experience (real or assisted) • Statistics
  41. These skills will rob my bank. Not an SRE

  42. How do you train SREs? Based on organizational size and maturity • Train on Job • Self-Learn • Throw them in the Ocean SRE
  43. Things to take away • There is a cost to reliability • 100% reliability is a myth • Reliability can (and should) be achieved incrementally • Reliability is improved in Dev sprints • Reliability needs participation from every stakeholder • Reliability is a function. Someone happened to call it SRE.
  44. Thank you piyush@last9.io @realmeson10 last9.io