Doing SRE the right way

Site Reliability Engineering (SRE) is an expensive proposition for small and large organizations alike. You can mitigate the costs by:

Setting up the right organizational structure for SRE
Hiring for the right team composition
Building the SRE culture

Piyush Verma, co-founder of Last9 Inc., shares practical insights drawn from both successful and not-so-successful experiences.

Piyush Verma

June 16, 2020

Transcript

  1. Doing SRE the Right Way ⅓
    Grooming organizational structures, teams,
    capacity and culture
    Piyush Verma
    @realmeson10

  2. FAQs
    1. What is SRE?
    2. SRE vs DevOps?
    3. How much does reliability cost?
    4. How long before I am reliable?
    5. Isn’t it too early to automate?
    6. What does an SRE code?
    7. Is reliability achieved before or after release?
    8. What is the right SRE organization structure?
    ... ask in comments

  3. Failure Cycle of Physical Products

  4. Failure Cycle of Software Products

  5. Feature Upgrade vs Stability

  6. Some approaches to
    build reliability
    ● Ignorance
    ● Throw more engineers at the problem
    ● Recovery Blocks
    ● NVersion Programming

  7. Instead, what we need is...
    Self-checking software [Michael Lyu]
    ● Observability: comprehensive metrics
    ● And control: policies that adhere to production

  8. Software Operations
    should be treated
    … Like a Product

  9. Software Products vs Software Operations?
    Product
    • Product Manager
    • Customer feedback
    • Product metrics
    • Google analytics
    • Conversion funnels
    Operations
    • One failure at a time
    • Ad-hoc increments
    • On-call software
    • Re-write
    • Kubernetes
    • Data driven decisions
    • Prioritize your failures
    • Put a value to the loss

  10. What do you see?

  11. Viewing from an SRE lens:
    ● 35% have not scaled *enough* yet
    ● 50% have to wait before they scale
    ● 25% are paying a person to do this
    ● 35% are about to hit this problem at some point

  12. Data We Gather (1)

  13. Data We Produce (2)

  14. Site Reliability
    Engineering
    Where do we start?

  15. Reliability (Quality → Metric)

  16. Robust vs Reliable
    - Should be up?
    - Should be up AND not serve errors?
    - Should be up AND not serve errors AND serve correct data?
    - Should be up AND not serve errors AND serve correct data AND
    within no time?
    - Should be up AND not serve errors AND serve correct data AND
    within no time AND serve unlimited TPS?
    - Should be up AND not serve errors AND serve correct data AND
    within no time AND serve unlimited TPS AND concurrently?

  17. Reliability is not a
    buffet lunch.

  18. There is no bug-free
    software
    There is no 100%
    reliable software

  19. Step 1 Service Level Indicators

  20. Step 2 Service Level Objectives

  21. Service Level Objectives
    Reliability
    Release Velocity

  22. How to set SLOs
    • Measurable
    • Customer-oriented
    • Challenging
    • Unambiguous
    • All Stakeholders participate

  23. How to set SLOs
    - Should be up through the day. Downtime allowed 3 - 3:15 AM
    - Only 1% of requests should have status code  4xx
    - Only 0.05% of /deep-health POST  GET can mismatch
    - Only 1% of requests should be slower than 100 ms
    - Numbers should hold up to 100 TPS
    - Numbers should hold up to 10 concurrent requests
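
Targets like these can be written down as data and checked against observed traffic. The following is a minimal Python sketch of that idea, not a production checker. The names `Request`, `fraction` and `evaluate` are hypothetical, only the error-rate and latency targets are covered, and counting any status of 400 or above as a failure is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code of the response
    latency_ms: float  # observed latency in milliseconds


def fraction(requests, predicate):
    """Share of requests for which predicate is true (0.0 when there is no traffic)."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if predicate(r)) / len(requests)


def evaluate(requests, max_error_ratio=0.01, max_slow_ratio=0.01, slow_ms=100.0):
    """Compare two SLIs against their targets.

    Assumption: a status code of 400 or above counts as a failed request.
    """
    error_ratio = fraction(requests, lambda r: r.status >= 400)
    slow_ratio = fraction(requests, lambda r: r.latency_ms > slow_ms)
    return {
        "errors_within_slo": error_ratio <= max_error_ratio,
        "latency_within_slo": slow_ratio <= max_slow_ratio,
    }
```

Keeping the thresholds as parameters makes it easy for all stakeholders to review and renegotiate them without touching the measurement logic.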

  24. SLOs detailed
    An SLO of 99% for a service with 1 million requests in a month allows for
    10,000 failures.
    If the current failure count is 9,000 and there is still a fortnight left:
    all hands towards stability.
    If the current failure count is 500, release as often as you want.
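
The arithmetic above is an error budget calculation, sketched here in Python. The function names are hypothetical, and the decision rule in `release_posture` (stabilize when the budget is being consumed faster than the window elapses) is one possible assumption, not a rule stated in the talk.

```python
def error_budget(slo, total_requests):
    """Number of failed requests the SLO tolerates over the window."""
    return round((1 - slo) * total_requests)


def remaining_budget(slo, total_requests, failures_so_far):
    """Failures still allowed before the SLO is breached."""
    return error_budget(slo, total_requests) - failures_so_far


def release_posture(failures_so_far, budget, window_elapsed):
    """Assumed rule: compare budget consumed with the fraction of the window elapsed.

    window_elapsed is a value between 0.0 and 1.0.
    """
    if budget <= 0:
        return "stabilize"
    consumed = failures_so_far / budget
    return "stabilize" if consumed > window_elapsed else "release"


budget = error_budget(0.99, 1_000_000)        # 10000
print(release_posture(9000, budget, 0.5))     # stabilize
print(release_posture(500, budget, 0.5))      # release
```

With a fortnight left in a monthly window, roughly half the window has elapsed, so 9,000 failures (90% of the budget) calls for stability work while 500 failures (5%) leaves room to release.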

  25. SLOs detailed
    • Choosing a time window
    • Choosing aggregations
    • Window lengths
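
The choice of window and aggregation changes the number you report. As an illustration only, the hypothetical `RollingSli` class below keeps per-day counts in a rolling window and exposes two aggregations that can disagree on the same data; the daily bucket size is an assumption.

```python
from collections import deque


class RollingSli:
    """Success-ratio SLI over the last `window_days` daily buckets."""

    def __init__(self, window_days):
        # Each bucket is (good_requests, total_requests); old days fall off automatically.
        self.buckets = deque(maxlen=window_days)

    def add_day(self, good, total):
        self.buckets.append((good, total))

    def ratio_of_sums(self):
        """Weights every request equally, so busy days dominate."""
        good = sum(g for g, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return good / total if total else 1.0

    def mean_of_ratios(self):
        """Weights every day equally, so a bad quiet day counts as much as a busy one."""
        ratios = [g / t for g, t in self.buckets if t]
        return sum(ratios) / len(ratios) if ratios else 1.0
```

For example, a quiet day at 90/100 followed by a busy day at 1000/1000 gives about 0.99 by ratio of sums but 0.95 by mean of ratios, which is why the aggregation needs to be agreed on up front.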

  26. You can only do so
    much, manually.
    Enter Site Reliability Engineering.

  27. So what does an SRE do?
    • When an outage happens, minimize the downtime
    • Where else is this happening?
    • Prevent repetition

  28. The SRE Recipe
    • Observability - 2 tbsp
    • Control - 2 tbsp
    • Automation - 1 tbsp
    • Root Cause Analysis - 3 tbsp
    • Cross-org collaboration - 3 tbsp
    • Guard Rails/Frameworks - 4 tbsp

  29. Aren’t K8s and cloud already automated?
    • If a human does a thing 3x over, you get:
    • 3 different results
    • 3 different bugs
    • 1 highly demotivated employee

  30. When/Do I need that
    much automation?
    Servers can scale, people
    cannot.

  31. Right time to focus on SRE?
    when($ spend < $ lost)
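
The condition on this slide can be read as a simple comparison, sketched below in Python. Both function names are hypothetical, and estimating the loss as downtime hours multiplied by revenue per hour is a simplifying assumption; real losses may also include churn, SLA penalties and engineering time.

```python
def expected_downtime_loss(downtime_hours, revenue_per_hour):
    """Assumed loss model: revenue that would have been earned during downtime."""
    return downtime_hours * revenue_per_hour


def should_invest_in_sre(sre_spend, money_lost):
    """The slide's rule: invest once the spend is smaller than the loss it prevents."""
    return sre_spend < money_lost
```

For instance, 10 hours of downtime at 5,000 per hour is an estimated loss of 50,000, so an SRE spend of 30,000 over the same period would pass this test.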

  32. December 1, 1913
    12 hours → 150 minutes

  33. But that’s exactly how they sold DevOps!
    DevOps is the goal.
    SRE is a way to get there.
    Functional Programming: Software Development :: SRE : DevOps

  34. So, what did SRE change?
    • Feature development is owned by Product Developers
    • Quality is owned by QA / Release Managers
    • Who owned the Uptime?
    • Especially when choosing between Stability vs Feature Release
    • Uptime didn’t have owners
    • Who owns Capacity Plan and spend?
    • More felt over the last 2 decades
    • Shops remain open 247
    • Consumption is 247

  35. SRE brings to software development
    what the assembly line brought to
    manufacturing.
    Over 60% of failures were bad deployments and 38% of issues were
    repeated.
    SRE practice helped cut down operations cost by 60% within a year.

  36. Reliability is not a
    buffet lunch.
    Go A-la-carte

  37. Tiered SRE approach
    Sporadic work, no dedicated SRE staffing (Tier 0)
    Put observability in place.
    Projects, some dedicated SRE time (Tier 1)
    Aim to unify datasets across services to build common improvements.
    Onboard one service and on-call time (Tier 2)
    Set up on-call and an escalation matrix. See how teams and services adopt it.
    Onboard other services. Reuse SRE tooling and practices (Tier 3)
    Go org-wide.

  38. SRE team structures
    1 SRE
    Common Tooling
    Embedded
    Outsourced

  39. SRE Maturity Model
    Beginner (9x)
    ● Canary/Rolling
    ● Coverage Validations
    ● SLI Collection
    ● On-Call
    ● Runbooks
    Fault Tolerance (99)
    ● 9x +
    ● Tolerance Proof
    ● SLO
    ● Disaster Recovery
    ● Rollback
    Automated (99.9)
    ● 99 +
    ● Chaos Engineering
    ● Capacity Plans
    ● Pattern-based re-architecture
    Automatic (99.99)
    ● 99.9 +
    ● Failure Mode Effect Analysis
    ● Cost Analysis
    ● If this then that
    ● Auto Healing

  40. Key Skills of an SRE
    ● Withstand Boredom
    ● Attention to detail
    ● *nix
    ● Programming
    ● Shell scripting
    ● On-call experience (real or assisted)
    ● Statistics

  41. These skills will rob my
    bank.
    Not an SRE

  42. How do you train SREs?
    Based on organizational size and maturity
    • Train on the job
    • Self-learn
    • Throw them in the ocean
    SRE

  43. Things to take away
    ● There is a cost to reliability
    ● 100% reliability is a myth
    ● Reliability can (and should) be achieved incrementally
    ● Reliability is improved in Dev sprints
    ● Reliability needs participation from every stakeholder
    ● Reliability is a function. Someone happened to call it SRE.

  44. Thank you
    [email protected]
    @realmeson10
    last9.io
