How to SRE: Getting Started with Site Reliability Engineering

This talk is a practical introduction for getting started with SRE in
your organisation. From the origins of SRE at Google in 2003, this
talk covers the key principles: Service Level Objectives, error
budgets, shared responsibility and blamelessness.

Florian Rathgeber

January 28, 2020

Transcript

  1. Getting Started with Site Reliability Engineering
    Florian Rathgeber (@frathgeber)
    Site Reliability Engineer
    Google Cloud

  2. Florian
    Site Reliability Engineer
    Google Cloud
    SRE for ~2 years
    ● On the Cloud Console SRE team
    ● Spend most of my time on
    SLOs
    Previous life
    ● Computational Scientist @
    Imperial College
    ● Data Engineer @ ECMWF
    Co-founded PyData London

  3. Software engineering as a
    discipline focuses on designing
    and building rather than
    operating and maintaining,
    despite estimates that 40% [1] to
    90% [2] of the total costs are
    incurred after launch.
    [1] Glass, R. (2002). Facts and Fallacies of Software
    Engineering, Addison-Wesley Professional; p. 115.
    [2] Dehaghani, S. M. H., & Hajrahimi, N. (2013). Which Factors
    Affect Software Projects Maintenance Cost More? Acta
    Informatica Medica, 21(1), 63–66.
    http://doi.org/10.5455/AIM.2012.21.63-66
    Software's long-term cost
    Image: Pixabay License. No attribution required.

  4. Incentives aren't aligned.
    Developers
    Agility
    Operators
    Stability

  5. DevOps
    is a set of practices,
    guidelines and culture
    designed to break down
    silos in IT development,
    operations, architecture,
    networking and security.
    class SRE implements DevOps
    Site Reliability
    Engineering
    is a set of practices we've
    found to work, some
    beliefs that animate those
    practices, and a job role.

  6. Reducing product lifecycle friction
    Concept Business Development Operations Market
    Agile
    solves this
    DevOps
    solves this

  7. ● Originated at Google in 2003
    ● Framework for operating large scale systems reliably
    ● "SRE is what happens when you ask a software
    engineer to design an operations function"
    ● Focuses on running systems in production
    What is Site Reliability Engineering?

  8. Site Reliability Engineering Principles
    1 SRE needs Service Level Objectives (SLOs), with
    consequences.
    2 SREs must have time to make tomorrow better
    than today.
    3 SRE teams have the ability to regulate their
    workload.
    4 Failure is an opportunity to improve.

  9. Product lifecycle
    Concept Business Development Operations Market
    Site Reliability
    Engineering
    solves this problem
    Business Process

  10. But getting started
    can feel daunting...
    Image: CC0 license: https://pxhere.com/en/photo/739800

  11. Service Level Objectives

  12. ● Goal for how well the system should operate
    ● Tracks the customer experience
    ○ SLOs met = happy customers
    ○ Unhappy customers = SLOs not met
    What is a Service Level Objective?

  13. ● 99.99% of HTTP requests per month succeed with 200 OK
    ● 90% of HTTP requests returned in under 300ms
    ● 99% of log entries processed in under 5 minutes
    Example SLOs
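
SLOs like these can be checked by counting "good" events against total events. The following is a minimal sketch, assuming request records with a status code and a latency; the `Request` type, the helper names, and the sample data are hypothetical and not part of the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond, in milliseconds


def fraction_good(requests: List[Request], is_good: Callable[[Request], bool]) -> float:
    """Return the fraction of requests for which is_good(request) is true."""
    if not requests:
        return 1.0  # assumption: no traffic counts as no violation
    return sum(1 for r in requests if is_good(r)) / len(requests)


def meets_slo(requests: List[Request], is_good: Callable[[Request], bool], target: float) -> bool:
    """True if the measured fraction of good requests reaches the SLO target."""
    return fraction_good(requests, is_good) >= target


# Hypothetical sample: 10 requests, one server error, two slow responses.
sample = (
    [Request(200, 120.0)] * 7
    + [Request(200, 450.0), Request(200, 310.0), Request(500, 90.0)]
)

availability = fraction_good(sample, lambda r: r.status == 200)
fast = fraction_good(sample, lambda r: r.latency_ms < 300)
latency_slo_met = meets_slo(sample, lambda r: r.latency_ms < 300, 0.90)
```

In this sample, 9 of 10 requests succeed and 8 of 10 return in under 300ms, so the 90% latency SLO from the slide would be missed.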

  14. ● Service Level Agreements = contractual guarantees
    ● SLAs met != happy customers
    But What About SLAs?

  15. ● You could implement SLOs today for your
    application, but SLOs are only a foundation.
    ● You need consequences.
    What Next?

  16. Error Budget Policy

  17. How Reliable Do You Want To Be?
    The Bosses of the Senate (1889): Public Domain

  18. How Reliable Do You Want To Be?
    More!
    The Bosses of the Senate (1889): Public Domain

  19. “Anything that
    can go wrong
    will go wrong.”
    Murphy's Law
    Public Domain Image

  20. “Anything that can go
    wrong, will…”
    Finagle's Law of
    Dynamic Negatives
    Public Domain Image

  21. Public Domain Image
    “Anything that can go
    wrong, will…
    ...at the worst possible
    moment.”
    Finagle's Law of
    Dynamic Negatives

  22. 100% is the wrong reliability
    target for basically everything.
    Benjamin Treynor Sloss
    Vice President of 24x7 Engineering, Google

  23. Reliability
    Engineering Time
    Development Velocity
    Cost
    SRE is About Balance
    williamcho Pixabay License

  24. So we introduce a budget
    Image Source: Florent Darrault CC BY-SA 2.0
    Public Domain Image

  25. ● Gap between perfect reliability and our SLO.
    ● This is a budget to be spent.
    ● Given an uptime SLO of 99.9%, after a 20-minute
    outage you still have about 23 minutes of budget
    remaining for a 30-day month!
    Error Budgets
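
The arithmetic behind the 23-minute figure can be sketched as follows, assuming a 30-day month (the slide does not state the window length); the function names are illustrative only.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total allowed downtime, in minutes, for an uptime SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def remaining_budget_minutes(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Budget left after the given downtime; negative means the budget is overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)            # 0.1% of 43,200 minutes
remaining = remaining_budget_minutes(0.999, 20)
print(f"{budget:.1f} minutes total, {remaining:.1f} minutes remaining")
```

A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so a 20-minute outage leaves roughly 23.2 minutes.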

  26. ● What you agree to do when the application exceeds
    its error budget.
    ● This is not "pay $$$".
    ● Must be something that will visibly improve
    reliability.
    Error Budget Policy

  27. Until the application is again meeting its SLO and has
    some Error Budget:
    ● "No new feature launches allowed."
    ● "Sprint planning may only pull Postmortem Action
    Items from the backlog."
    ● "Software Development Team must meet with SRE
    Team daily to outline their improvements."
    Error Budget Policy Examples
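
One way to make a policy like the first example enforceable is to encode it as a check that release tooling can call. This is a sketch under the assumption that the remaining budget is already measured in minutes; `Action` and `policy_action` are hypothetical names, and a real policy is an agreement between teams rather than just code.

```python
from enum import Enum


class Action(Enum):
    NORMAL_OPERATIONS = "feature launches allowed"
    FREEZE_FEATURE_LAUNCHES = "no new feature launches allowed"


def policy_action(remaining_budget_minutes: float) -> Action:
    """Map the remaining error budget to the agreed consequence.

    Launches resume only once the application has some budget again,
    so a remaining budget of exactly zero still means a freeze.
    """
    if remaining_budget_minutes > 0:
        return Action.NORMAL_OPERATIONS
    return Action.FREEZE_FEATURE_LAUNCHES
```

For example, `policy_action(23.2)` allows launches, while `policy_action(-5.0)` returns the freeze.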

  28. SRE needs Service
    Level Objectives with
    Consequences.
    SRE Principle #1

  29. ● Even without hiring a single SRE, you can have an
    Error Budget Policy.
    ● An error budget is a lever you can use to keep your
    customers from experiencing pain and sadness.
    ● You can implement this today: measure, account
    and act.
    SRE Principle #1

  30. Making Tomorrow Better Than Today

  31. ● SLOs and Error Budgets are the first step.
    ● The next step is staffing an SRE role...
    ● ...endowed with real responsibility.
    Making Tomorrow Better Than
    Today

  32. ● Defines and refines Service Level Objectives.
    ● Enacts the Error Budget Policy when necessary.
    ● Makes sure that the application meets the
    reliability expectations of its users.
    Your First SRE

  33. ● A bounded part of the role.
    ● Recommend that less than 50% of the
    workload be operations.
    Toil
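
The 50% guideline is easy to track if the team logs time by category. A minimal sketch, assuming a weekly log of hours; the category names and which of them count as operational work are hypothetical choices, not definitions from the talk.

```python
from typing import Dict, Set


def operations_fraction(hours_by_category: Dict[str, float], ops_categories: Set[str]) -> float:
    """Return the share of logged hours spent on operational work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(h for c, h in hours_by_category.items() if c in ops_categories)
    return ops / total


# Hypothetical week for one SRE: 40 hours in total.
week = {
    "on-call": 8,
    "tickets": 6,
    "manual releases": 4,
    "automation project": 16,
    "design review": 6,
}
OPS = {"on-call", "tickets", "manual releases"}

share = operations_fraction(week, OPS)  # 18 of 40 hours
within_guideline = share < 0.5
```

Here 18 of 40 hours are operational, which stays under the recommended limit.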

  34. ● Consulting on System Architecture and Design
    ● Authoring and iterating on Monitoring
    ● Automating repetitive work
    ● Coordinating implementation of Postmortem
    Action Items
    Project Work

  35. SREs have time to
    make tomorrow better
    than today.
    SRE Principle #2

  36. SRE Principle #2
    ● An SRE’s job is not to suffer under operational load,
    but to make each day brighter.
    ● "Brighter" might mean different things: it depends on
    what your SREs find most useful to do.
    ● Less toil, more meaningful system improvements.

  37. Shared Responsibility Model

  38. Dumping all
    production
    services on
    an SRE team
    cannot work.
    Photo By: Air Force Tech. Sgt. Jorge Intriago (Public Domain)

  39. An overloaded
    team doesn’t
    have time to
    make tomorrow
    better than
    today.
    Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog

  40. Implementing a
    mechanism to give
    back pressure to
    dev partners
    provides balance.
    Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog

  41. ● Give 5% of the operational work to the developers.
    ● Track SRE team project work.
    ○ Not completing projects? → Something’s wrong.
    ● Analyse and on-board new systems only if they can be
    operated safely.
    ● If every problem has to be escalated to its developer:
    why is SRE carrying the pager?
    Regulating Workload

  42. Without
    leadership
    buy-in, SRE
    cannot work.
    Leadership Buy-in
    Image Credit: geralt Pixabay License

  43. ● When applications miss their SLOs and run out of
    Error Budget, it puts additional load on the SRE team.
    You need to either:
    ○ Devote more company resources to addressing
    reliability concerns
    ○ Loosen the SLO
    Leadership Buy-in

  44. ● Fixing a product after launch is always more
    expensive.
    ● SRE teams can and should consult up-front on
    designs:
    ○ Architecting resilient systems
    ○ Maintaining consistency means fewer SREs can
    support more products
    Reliability & Consistency Up Front

  45. Three places SRE teams can benefit from Automation:
    1. To eliminate their toil: Don't do things over and over!
    2. To do capacity planning: Auto-scaling instead of manual
    forecasting!
    3. To fix issues automatically: If you can write the fix in a
    playbook, you can make the computer do it!
    Automation
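
The third point, turning a playbook into code, might look like the sketch below. Everything here is hypothetical: the alert names, the remediation functions, and the `PLAYBOOK` table are placeholders, and real remediations would call your own orchestration tooling instead of returning strings.

```python
from typing import Callable, Dict

Remediation = Callable[[str], str]


def restart_service(target: str) -> str:
    # Placeholder for a real restart call.
    return f"restarted {target}"


def clear_temp_files(target: str) -> str:
    # Placeholder for a real cleanup job.
    return f"cleared temp files on {target}"


# Each playbook entry maps an alert to the fix a human would otherwise apply by hand.
PLAYBOOK: Dict[str, Remediation] = {
    "service_unresponsive": restart_service,
    "disk_almost_full": clear_temp_files,
}


def handle_alert(alert_name: str, target: str) -> str:
    """Apply the automated fix if the playbook has one, otherwise escalate."""
    fix = PLAYBOOK.get(alert_name)
    if fix is None:
        return f"no automated fix for {alert_name}; paging a human"
    return fix(target)
```

Alerts without a playbook entry still reach a person, so automation only replaces the steps that are already well understood.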

  46. SRE teams have the
    ability to regulate
    their workload.
    SRE Principle #3

  47. SRE Principle #3
    ● Teams need to be able to prioritise and do the work.
    ● Each new system to maintain has a human cost.
    ● Must be able to push back on unreliable practices
    and systems.

  48. A Culture of Blamelessness

  49. I'm extremely angry right now. People
    should lose their jobs if this was an error.
    --Hawaii State Representative Matt Lopresti
    (in reference to the 2018 Hawaii nuclear alert false alarm)
    Recognize the Antipattern
    Source: “How Hawaii Could Have Sent A False Nuclear Alarm”, Wired, Lapowsky,
    January 13, 2018
    https://www.wired.com/story/hawaii-nuclear-missile-alert-false-explanation/

  50. ● by setting SLOs less than 100%
    ● by modeling blamelessness at all levels
    ● by stamping out blame wherever it is found
    ● by celebrating cases of “I made a mistake” that
    lead to outages being resolved faster
    Embrace Failure

  51. ● You’ve already paid the price in an outage.
    ● Write a blameless postmortem.
    ● Make postmortems widely available so others can
    learn, too.
    Learn from Failure

  52. “Human”
    errors are
    really systems
    problems.

  53. ● The root cause of an outage is never a person.
    ● Ask “why” for as many iterations as it takes to
    identify system-related causes.
    ● Prioritize system fixes that support people to make
    the right choices.
    Keep Asking Why

  54. Failure is an
    opportunity to
    improve.
    SRE Principle #4

  55. Failure is an
    opportunity to
    improve.
    Not an excuse to brandish pitchforks
    SRE Principle #4

  56. SRE Principle #4
    ● Failure happens. There is no way around it.
    ● Stop pointing fingers.
    ● Embrace failure to improve MTTD and MTTR.
    ● Proactively addressing failure → more robust systems.
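
MTTD (mean time to detect) and MTTR (mean time to repair or recover) can be computed from incident timestamps. This sketch assumes both are measured from the start of the incident, which is one common convention but not the only one; the `Incident` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime   # when the problem began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mean_delta(deltas: List[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    return mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    return mean_delta([i.resolved - i.started for i in incidents])


t0 = datetime(2020, 1, 1, 12, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=15), t0 + timedelta(minutes=60)),
]
```

For the two sample incidents, `mttd(incidents)` is 10 minutes and `mttr(incidents)` is 45 minutes.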

  57. Site Reliability Engineering Principles
    1 SRE needs Service Level Objectives, with
    consequences.
    2 SREs must have time to make tomorrow better
    than today.
    3 SRE teams have the ability to regulate their
    workload.
    4 Failure is an opportunity to improve.

  58. Cover images used with permission. These books can be found on shop.oreilly.com
    The full text of the Google SRE books is available at www.google.com/sre
