Doing SRE the right way

Site Reliability Engineering (SRE) is an expensive proposition for small and large organizations alike. You can mitigate the costs by:

Setting up the right organizational structure for SRE
Hiring for the right team composition
Building the SRE culture

Piyush Verma, co-founder of Last9 Inc., shares practical insights drawn from both successful and not-so-successful experiences.

Piyush Verma

June 16, 2020

Transcript

  1. Doing SRE the Right Way ⅓
    Grooming organizational structures, teams,
    capacity and culture
    Piyush Verma
    @realmeson10

  2. FAQs
    1. What is SRE?
    2. Vs DevOps?
    3. How much does reliability cost?
    4. How long before I am reliable?
    5. Isn’t it too early to automate?
    6. What does an SRE code?
    7. Is Reliability achieved before/after release?
    8. What is the right SRE organization structure?
    ... ask in comments

  3. Failure Cycle of Physical Products

  4. Failure Cycle of Software Products

  5. Feature Upgrade vs Stability

  6. Some approaches to
    build reliability
    ● Ignorance
    ● Throw more engineers at the problem
    ● Recovery Blocks
    ● NVersion Programming

  7. Instead, what we need is...
    Self-checking software [Michael Lyu]
    Observability: comprehensive metrics
    And Control: policies that adhere to Production
  8. Software Operations
    should be treated
    … Like a Product

  9. Software Products vs Software Operations?
    Product
    • Product Manager
    • Customer feedback
    • Product metrics
    • Google analytics
    • Conversion funnels
    Operations
    • One failure at a time
    • Ad-hoc increments
    • On-call software
    • Re-write
    • Kubernetes
    • Data driven decisions
    • Prioritize your failures
    • Put a value to the loss

  10. What do you see?

  11. Viewing from an SRE lens:
    ● 35% have not scaled *enough* yet.
    ● 50% have to wait before they scale
    ● 25% are paying a person to do this
    ● 35% are about to hit this problem at
    some point

  12. Data We Gather (1)

  13. Data We Produce (2)

  14. Site Reliability
    Engineering
    Where do we start?

  15. Reliability (Quality → Metric)

  16. Robust vs Reliable
    - Should be up?
    - Should be up AND not serve errors?
    - Should be up AND not serve errors AND serve correct data?
    - Should be up AND not serve errors AND serve correct data AND
    within no time?
    - Should be up AND not serve errors AND serve correct data AND
    within no time AND serve unlimited TPS?
    - Should be up AND not serve errors AND serve correct data AND
    within no time AND serve unlimited TPS AND concurrently?

  17. Reliability is not a
    buffet lunch.

  18. There is no bug-free
    software
    There is no 100%
    reliable software

  19. Step 1 Service Level Indicators

  20. Step 2 Service Level Objectives

  21. Service Level Objectives
    Reliability
    Release Velocity

  22. How to set SLOs
    • Measurable
    • Customer-oriented
    • Challenging
    • Unambiguous
    • All Stakeholders participate

  23. How to set SLOs
    - Should be up through the day. Downtime allowed 3:00-3:15 AM
    - Only 1% requests should have status code  4xx
    - Only 0.05% of /deep-health POST  GET can mismatch
    - Only 1% requests should be slower than 100ms
    - Numbers should hold up to 100 TPS
    - Numbers should hold up to 10 concurrent requests
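
One way to make objectives like these measurable is to compare counted SLIs against fixed thresholds. The sketch below is illustrative only: `WindowStats`, `slo_violations`, and the two constants are hypothetical names, and treating "errors" as any request with an error status is an assumption, since the slide does not spell out the exact status-code rule.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Hypothetical request counts collected as SLIs for one time window."""
    total: int   # all requests in the window
    errors: int  # requests that returned an error status (assumed definition)
    slow: int    # requests slower than the 100 ms latency threshold


# Thresholds taken from the slide: at most 1% errors, at most 1% slow requests.
MAX_ERROR_RATIO = 0.01
MAX_SLOW_RATIO = 0.01


def slo_violations(stats: WindowStats) -> list[str]:
    """Return the names of the objectives that this window fails to meet."""
    if stats.total == 0:
        return []  # no traffic in the window, so nothing to judge
    violations = []
    if stats.errors / stats.total > MAX_ERROR_RATIO:
        violations.append("error-rate")
    if stats.slow / stats.total > MAX_SLOW_RATIO:
        violations.append("latency")
    return violations
```

For example, a window with 1,000 requests, 5 errors, and 20 slow responses meets the error objective but misses the latency one.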

  24. SLOs detailed
    An SLO of 99% for a service with 1 million requests in a month allows for
    10,000 failures.
    If the current failure count is 9,000 and there is still a fortnight left:
    all hands towards stability.
    If the current failure count is 500, release as often as you want.
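
The arithmetic on this slide is an error budget calculation, and a small sketch can make it concrete. The function names and the `freeze_at` threshold (80% of the budget consumed) are assumptions chosen for illustration; the slide gives only the two example failure counts, not a cut-off.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the window."""
    return round((1 - slo) * total_requests)


def release_posture(failures: int, budget: int, freeze_at: float = 0.8) -> str:
    """Return 'stabilize' once the consumed share of the budget reaches
    freeze_at (a hypothetical threshold), otherwise 'release'."""
    if budget <= 0:
        return "stabilize"  # no budget at all means every failure counts
    consumed = failures / budget
    return "stabilize" if consumed >= freeze_at else "release"


budget = error_budget(slo=0.99, total_requests=1_000_000)  # 10,000 failures
posture_late = release_posture(failures=9_000, budget=budget)
posture_early = release_posture(failures=500, budget=budget)
```

With the slide's numbers, 9,000 failures consume 90% of the budget and call for stability work, while 500 failures consume 5% and leave room to keep releasing.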

  25. SLOs detailed
    • Choosing a time window
    • Choosing aggregations
    • Window lengths
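
The choice of window and aggregation changes the number you end up reporting. The sketch below, with hypothetical names throughout, keeps daily counts in a fixed-length rolling window and contrasts two aggregations: the ratio over all requests in the window versus the mean of per-day ratios.

```python
from collections import deque


class RollingWindow:
    """Keeps (total, failed) request counts for the most recent `days` days."""

    def __init__(self, days: int) -> None:
        # deque with maxlen drops the oldest day automatically
        self.buckets: deque = deque(maxlen=days)

    def add_day(self, total: int, failed: int) -> None:
        self.buckets.append((total, failed))

    def success_ratio(self) -> float:
        """Request-weighted aggregation: good requests / all requests."""
        total = sum(t for t, _ in self.buckets)
        failed = sum(f for _, f in self.buckets)
        return 1.0 if total == 0 else (total - failed) / total

    def mean_of_daily_ratios(self) -> float:
        """Day-weighted aggregation: every day counts equally, whatever its traffic."""
        ratios = [(t - f) / t for t, f in self.buckets if t > 0]
        return 1.0 if not ratios else sum(ratios) / len(ratios)


window = RollingWindow(days=2)
window.add_day(total=1000, failed=10)  # busy day, 99% success
window.add_day(total=10, failed=5)     # quiet day, 50% success
```

Here the request-weighted ratio is about 98.5%, while the mean of daily ratios is 74.5%: the same data, with a very different message depending on the aggregation.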

  26. You can only do so
    much, manually.
    Enter Site Reliability Engineering.

  27. So what does an SRE do?
    • When an outage happens, minimize the downtime
    • Where else is this happening?
    • Prevent repetition

  28. The SRE Recipe
    • Observability - 2 tbsp
    • Control - 2 tbsp
    • Automation - 1 tbsp
    • Root Cause Analysis - 3 tbsp
    • Cross-org collaboration - 3 tbsp
    • Guard Rails/Frameworks - 4 tbsp

  29. Aren’t K8s and cloud already automated?
    • If a human does a thing 3x over
    • 3 different results
    • 3 different bugs
    • 1 highly demotivated employee

  30. When/Do I need that
    much automation?
    Servers can scale, people
    cannot.

  31. Right time to focus on SRE?
    when($ spend < $ lost)

  32. December 1, 1913
    12 hours → 150 minutes

  33. But that’s exactly how they sold DevOps!
    DevOps is the goal.
    SRE is a way to get there.
    Functional Programming: Software Development :: SRE : DevOps

  34. So, what did SRE change?
    • Feature development is owned by Product Developers
    • Quality is owned by QA / Release Managers
    • Who owned the Uptime?
    • Especially when choosing between Stability vs Feature Release
    • Uptime didn’t have owners
    • Who owns Capacity Plan and spend?
    • More felt over the last 2 decades
    • Shops remain open 247
    • Consumption is 247

  35. SRE brings to software development
    what the assembly line brought to
    manufacturing.
    Over 60% of failures were bad deployments and 38% of issues were
    repeated.
    SRE practice helped cut down operations cost by 60% within a year.

  36. Reliability is not a
    buffet lunch.
    Go A-la-carte

  37. Tiered SRE approach
    Sporadic work, no dedicated SRE staffing (Tier 0)
    Put observability in place
    Projects, some dedicated SRE time (Tier 1)
    Aim to unify dataset across services to build common improvements.
    Onboard one service and on-call time (Tier 2)
    Set up on-call and escalation matrix. See how teams and services adopt
    Onboard other services. Reuse SRE tooling and practices (Tier 3)
    Go org-wide

  38. SRE team structures
    1SRE Common
    Tooling
    Embedded
    Outsourced

  39. SRE Maturity Model
    Beginner (9x)
    ● Canary/Rolling
    ● Coverage Validations
    ● SLI Collection
    ● On-Call
    ● Runbooks
    Fault Tolerance (99)
    ● 9x +
    ● Tolerance Proof
    ● SLO
    ● Disaster Recovery
    ● Rollback
    Automated (99.9)
    ● 99 +
    ● Chaos Engineering
    ● Capacity Plans
    ● Pattern-based re-architect
    Automatic (99.99)
    ● 99.9 +
    ● Failure mode and effects analysis
    ● Cost Analysis
    ● If This Then That
    ● Auto Healing
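
To get a feel for what the availability figures on this slide (99, 99.9, 99.99) mean in practice, it helps to convert them into allowed downtime. The sketch below assumes a 30-day period, which is a convention chosen here and not something stated in the deck; the function name is hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes; the 30-day period is an assumption


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Minutes of downtime a given availability target leaves over the period."""
    return period_minutes * (100 - availability_percent) / 100


# Downtime allowance for each numeric level named in the maturity model
allowances = {level: allowed_downtime_minutes(level) for level in (99, 99.9, 99.99)}
```

Over 30 days this works out to roughly 432 minutes at 99%, about 43 minutes at 99.9%, and a little over 4 minutes at 99.99%, which is why each extra nine demands more automation than the previous level.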

  40. Key Skills of an SRE
    ● Withstand Boredom
    ● Attention to detail
    ● *nix
    ● Programming
    ● Shell scripting
    ● On-call experience (real or assisted)
    ● Statistics

  41. These skills will rob my
    bank.
    Not an SRE

  42. How do you train SREs?
    Based on organizational size and maturity
    • Train on Job
    • Self-Learn
    • Throw them in the Ocean
    SRE

  43. Things to take away
    ● There is a cost to reliability
    ● 100% reliability is a myth
    ● Reliability can (and should) be achieved incrementally
    ● Reliability is improved in Dev sprints
    ● Reliability needs participation from every stakeholder
    ● Reliability is a function. Someone happened to call it SRE.

  44. Thank you
    @realmeson10
    last9.io
