Doing SRE the right way

Site Reliability Engineering (SRE) is an expensive proposition for small and large organizations alike. You can mitigate the costs by:

Setting up the right organizational structure for SRE
Hiring for the right team composition
Building the SRE culture

Piyush Verma, co-founder of Last9 Inc, shares practical insights and wisdom accumulated from successful and not-so-successful experiences.


Piyush Verma

June 16, 2020


  3. Failure Cycle of Physical Products

  4. Failure Cycle of Software Products

  5. Feature Upgrade vs Stability

  6. Some approaches to build reliability
    • Ignorance
    • Throw more engineers at the problem
    • Recovery Blocks
    • N-Version Programming
  7. Instead, what we need is...
    Self-checking software [Michael Lyu]
    • Observability
    • Comprehensive metrics
    • And, Control Policies that adhere to Production
  8. Software Operations should be treated … Like a Product

  9. Software Products vs Software Operations?
    Product
    • Product Manager
    • Customer feedback
    • Product metrics
    • Google analytics
    • Conversion funnels
    Operations
    • One failure at a time
    • Ad-hoc increments
    • On-call software
    • Re-write
    • Kubernetes
    • Data driven decisions
    • Prioritize your failures
    • Put a value to the loss
  10. What do you see?

  11. Viewing from an SRE lens:
    • 35% have not scaled *enough* yet.
    • 50% have to wait before they scale
    • 25% are paying a person to do this
    • 35% are about to hit this problem at some point
  12. Data We Gather (1

  13. Data We Produce (2

  14. Site Reliability Engineering Where do we start?

  15. Reliability (Quality → Metric)

  16. Robust vs Reliable
    - Should be up?
    - Should be up AND not serve errors?
    - Should be up AND not serve errors AND serve correct data?
    - Should be up AND not serve errors AND serve correct data AND within no time?
    - Should be up AND not serve errors AND serve correct data AND within no time AND serve unlimited TPS?
    - Should be up AND not serve errors AND serve correct data AND within no time AND serve unlimited TPS AND concurrently?
  17. Reliability is not a buffet lunch.

  18. There is no bug-free software There is no 100% reliable

  19. Step 1 Service Level Indicators

  20. Step 2 Service Level Objectives

  21. Service Level Objectives Reliability Release Velocity

  22. How to set SLOs
    • Measurable
    • Customer-oriented
    • Challenging
    • Unambiguous
    • All Stakeholders participate
  23. How to set SLOs
    - Should be up through the day. Downtime allowed 3 - 3:15 AM
    - Only 1% of requests should have status code  4xx
    - Only 0.05% of /deep-health POST  GET can mismatch
    - Only 1% of requests should be slower than 100ms
    - Numbers should hold up to 100 TPS
    - Numbers should hold up to 10 concurrent requests
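
Targets like the 4xx and latency objectives on this slide can be checked mechanically once the indicators are measured. The sketch below is only an illustration: the `Request` record, the `evaluate_slos` helper, and the dictionary keys are hypothetical names, and the two thresholds are copied from the slide (at most 1% of requests with a 4xx status, at most 1% slower than 100ms).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape)."""
    status: int
    latency_ms: float


def fraction(requests: List[Request], predicate: Callable[[Request], bool]) -> float:
    """Share of requests matching the predicate; 0.0 for an empty sample."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if predicate(r)) / len(requests)


# Objectives taken from the slide: each indicator must stay at or below 1%.
OBJECTIVES: Dict[str, float] = {
    "client_error_rate": 0.01,  # share of requests with a 4xx status
    "slow_rate": 0.01,          # share of requests slower than 100 ms
}


def evaluate_slos(requests: List[Request]) -> Dict[str, bool]:
    """Compute each SLI and report whether it meets its objective."""
    slis = {
        "client_error_rate": fraction(requests, lambda r: 400 <= r.status < 500),
        "slow_rate": fraction(requests, lambda r: r.latency_ms > 100),
    }
    return {name: slis[name] <= OBJECTIVES[name] for name in OBJECTIVES}


if __name__ == "__main__":
    sample = (
        [Request(200, 20.0)] * 980
        + [Request(404, 20.0)] * 5
        + [Request(200, 250.0)] * 15
    )
    print(evaluate_slos(sample))
```

In this made-up sample, 0.5% of requests return a 4xx, so that objective holds, while 1.5% are slower than 100ms, so the latency objective is missed.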
  24. SLOs detailed
    An SLO of 99% for a service with 1 million requests in a month allows for 10,000 failures.
    If the current failure counter == 9000 and there is still a fortnight left: all hands towards stability.
    If the current failure counter == 500, release as often as you want.
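
The arithmetic on this slide is an error budget calculation, and a small sketch makes the decision rule explicit. The function names are hypothetical, and the rule used here (stabilize when the budget is being consumed faster than the window is elapsing) is an assumption chosen to match the two cases on the slide, not a policy stated in the talk.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the window."""
    return round((1 - slo) * total_requests)


def release_policy(failures: int, budget: int, window_elapsed: float) -> str:
    """Compare budget consumption with time elapsed (both as fractions 0..1).

    Assumed rule: if a larger share of the budget than of the window is
    already used up, focus on stability; otherwise keep releasing.
    """
    consumed = failures / budget
    return "stabilize" if consumed > window_elapsed else "release"


if __name__ == "__main__":
    budget = error_budget(slo=0.99, total_requests=1_000_000)
    elapsed = 16 / 30  # roughly a fortnight left in a 30-day month
    print(budget)
    print(release_policy(9000, budget, elapsed))
    print(release_policy(500, budget, elapsed))
```

With 9000 failures, 90% of the budget is gone with only about half the month elapsed, so the sketch returns "stabilize"; with 500 failures, only 5% is gone and it returns "release".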
  25. SLOs detailed
    • Choosing a time window
    • Choosing aggregations
    • Window lengths
  26. You can only do so much, manually. Enter Site Reliability

  27. So what does an SRE do?
    • When an outage happens, minimize the downtime
    • Where else is this happening?
    • Prevent repetition
  28. The SRE Recipe
    • Observability - 2 tbsp
    • Control - 2 tbsp
    • Automation - 1 tbsp
    • Root Cause Analysis - 3 tbsp
    • Cross-org collaboration - 3 tbsp
    • Guard Rails/Frameworks - 4 tbsp
  29. Isn’t K8s and cloud already automated?
    If a human does a thing 3x over:
    • 3 different results
    • 3 different bugs
    • 1 highly demotivated employee
  30. When/Do I need that much automation? Servers can scale, people

  31. Right time to focus on SRE? when($ spend < $

  32. December 1, 1913 12 hours → 150 minutes

  33. But that’s exactly how they sold DevOps!
    DevOps is the goal. SRE is a way to get there.
    Functional Programming : Software Development :: SRE : DevOps
  34. So, what did SRE change?
    • Feature development is owned by Product Developers
    • Quality is owned by QA / Release Managers
    • Who owned the Uptime?
    • Especially when choosing between Stability vs Feature Release
    • Uptime didn’t have owners
    • Who owns Capacity Plan and spend?
    • More felt over the last 2 decades
    • Shops remain open 24/7
    • Consumption is 24/7
  35. SRE brings to software development what the assembly line brought to manufacturing.
    Over 60% of failures were bad deployments and 38% of issues were repeated.
    SRE practice helped cut down operations cost by 60% within a year.
  36. Reliability is not a buffet lunch. Go A-la-carte

  37. Tiered SRE approach
    • Sporadic work, no dedicated SRE staffing (Tier 0): Put observability in place
    • Projects, some dedicated SRE time (Tier 1): Aim to unify dataset across services to build common improvements.
    • Onboard one service and on-call time (Tier 2): Set up on-call and escalation matrix. See how teams and services adopt
    • Onboard other services. Reuse SRE tooling and practices (Tier 3): Go org-wide
  38. SRE team structures 1SRE Common Tooling Embedded Outsourced

  39. SRE Maturity Model
    Beginner (9x)
    • Canary/Rolling
    • Coverage Validations
    • SLI Collection
    • On-Call
    • Runbooks
    Fault Tolerance (99)
    • 9x +
    • Tolerance Proof
    • SLO
    • Disaster Recovery
    • Rollback
    Automated (99.9)
    • 99 +
    • Chaos Engg
    • Capacity Plans
    • Pattern based re-architect.
    Automatic (99.99)
    • 99.9 +
    • Failure Mode effect analysis
    • Cost Analysis
    • If this then That
    • Auto Healing
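
To make the availability figures in the maturity model (99, 99.9, 99.99) concrete, they can be converted into allowed downtime. This is a minimal sketch; the function name and the 30-day month are assumptions, not part of the talk.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} minutes per 30 days")
```

Over a 30-day month, 99% allows about 432 minutes of downtime, 99.9% about 43 minutes, and 99.99% a little over 4 minutes, which is why each additional nine calls for more automation than the one before.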
  40. Key Skills of an SRE
    • Withstand Boredom
    • Attention towards detail
    • *nix
    • Programming
    • Shell scripting
    • On-call experience (real or assisted)
    • Statistics
  41. These skills will rob my bank. Not an SRE

  42. How do you train SREs
    Based on organizational size and maturity
    • Train on Job
    • Self-Learn
    • Throw them in the Ocean SRE
  43. Things to take away
    • There is a cost to reliability
    • 100% reliability is a myth
    • Reliability can (and should) be achieved incrementally
    • Reliability is improved in Dev sprints
    • Reliability needs participation from every stakeholder
    • Reliability is a function. Someone happened to call it SRE.
  44. Thank you piyush@last9.io @realmeson10 last9.io