How to SRE: Getting Started with Site Reliability Engineering

This talk is a practical introduction to getting started with SRE in
your organisation. Starting from the origins of SRE at Google in 2003,
it covers the key principles: Service Level Objectives, error
budgets, shared responsibility and blamelessness.

Florian Rathgeber

January 28, 2020


  1. Florian

    Site Reliability Engineer, Google Cloud
    SRE for ~2 years
    • On the Cloud Console SRE team
    • Spend most of my time on SLOs
    Previous life
    • Computational Scientist @ Imperial College
    • Data Engineer @ ECMWF
    Co-founded PyData London
  2. Software's long-term cost

    Software engineering as a discipline focuses on designing and building rather than operating and maintaining, despite estimates that 40% [1] to 90% [2] of the total costs are incurred after launch.

    [1] Glass, R. (2002). Facts and Fallacies of Software Engineering, Addison-Wesley Professional; p. 115.
    [2] Dehaghani, S. M. H., & Hajrahimi, N. (2013). Which Factors Affect Software Projects Maintenance Cost More? Acta Informatica Medica, 21(1), 63–66. http://doi.org/10.5455/AIM.2012.21.63-66

    Image: Pixabay License. No attribution required.
  3. DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.

    class SRE implements DevOps

    Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.
  4. What is Site Reliability Engineering?

    • Originated at Google in 2003
    • Framework for operating large scale systems reliably
    • "SRE is what happens when you ask a software engineer to design an operations function"
    • Focuses on running systems in production
  5. Site Reliability Engineering Principles

    1. SRE needs Service Level Objectives (SLOs), with consequences.
    2. SREs must have time to make tomorrow better than today.
    3. SRE teams have the ability to regulate their workload.
    4. Failure is an opportunity to improve.
  6. What is a Service Level Objective?

    • Goal for how well the system should operate
    • Tracks the customer experience
      ◦ SLOs met = happy customers
      ◦ Unhappy customers = SLOs not met
  7. Example SLOs

    • 99.99% of HTTP requests per month succeed with 200 OK
    • 90% of HTTP requests returned in under 300ms
    • 99% of log entries processed in under 5 minutes
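
As a rough sketch of how targets like the ones on this slide could be checked, the Python below computes an availability SLI and a latency SLI from a list of request records and compares each with its SLO target. The `Request` record, the helper names and the sample data are all assumptions made for illustration; a real system would read these numbers from its monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record: HTTP status code and latency in milliseconds."""
    status: int
    latency_ms: float


def availability_sli(requests):
    """Fraction of requests that succeeded with 200 OK."""
    return sum(r.status == 200 for r in requests) / len(requests)


def latency_sli(requests, threshold_ms=300.0):
    """Fraction of requests returned in under threshold_ms."""
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli, target):
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= target


# Made-up sample: 10,000 requests, 2 failures, 500 slow responses.
sample = (
    [Request(200, 120.0)] * 9_498
    + [Request(200, 450.0)] * 500
    + [Request(500, 80.0)] * 2
)

availability = availability_sli(sample)  # 9,998 of 10,000 succeeded
latency = latency_sli(sample)            # 9,500 of 10,000 were under 300 ms

print("99.99% availability SLO met:", meets_slo(availability, 0.9999))
print("90% latency SLO met:", meets_slo(latency, 0.90))
```

With this sample the latency SLO is met while the 99.99% availability SLO is missed: two failures in 10,000 requests are already too many at that target.
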
  8. What Next?

    • You could implement SLOs today for your application, but SLOs are only a foundation.
    • You need consequences.
  9. How Reliable Do You Want To Be?

    The Bosses of the Senate (1889): Public Domain
  10. How Reliable Do You Want To Be? More!

    The Bosses of the Senate (1889): Public Domain
  11. “Anything that can go wrong, will… at the worst possible moment.”

    Finagle's Law of Dynamic Negatives
    Public Domain Image
  12. “100% is the wrong reliability target for basically everything.”

    Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google
  13. Error Budgets

    • Gap between perfect reliability and our SLO.
    • This is a budget to be spent.
    • Given an uptime SLO of 99.9%, after a 20 minute outage you still have 23 minutes of budget remaining for the month!
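
The arithmetic behind the 99.9% example can be sketched in a few lines of Python. This assumes a 30-day month (43,200 minutes); the function names are illustrative, not part of any library.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an uptime SLO over the window.

    Assumes a 30-day month by default.
    """
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo, downtime_minutes, window_days=30):
    """Error budget left after the given minutes of downtime."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)             # about 43.2 minutes
remaining = remaining_budget_minutes(0.999, 20)  # about 23.2 minutes

print(f"Monthly budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

A 99.9% SLO leaves 0.1% of 43,200 minutes, roughly 43 minutes, as budget; a 20 minute outage spends a little under half of it.
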
  14. Error Budget Policy

    • What you agree to do when the application exceeds its error budget.
    • This is not "pay $$$".
    • Must be something that will visibly improve reliability.
  15. Error Budget Policy Examples

    Until the application is again meeting its SLO and has some Error Budget:
    • "No new feature launches allowed."
    • "Sprint planning may only pull Postmortem Action Items from the backlog."
    • "Software Development Team must meet with SRE Team daily to outline their improvements."
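
One way to make a policy like "No new feature launches allowed" enforceable is to encode its condition as a gate, for example in a release checklist or pipeline. The sketch below is a minimal illustration under that assumption; `launches_allowed` and its parameters are hypothetical names, not something the talk prescribes.

```python
def launches_allowed(sli, slo, budget_remaining_minutes):
    """Hypothetical launch gate for an error budget policy.

    Feature launches are allowed only while the application is meeting
    its SLO and still has some error budget left.
    """
    return sli >= slo and budget_remaining_minutes > 0


# Meeting the SLO with budget to spare: launches may proceed.
print(launches_allowed(sli=0.9995, slo=0.999, budget_remaining_minutes=23.2))

# Budget exhausted: the policy applies until the budget recovers.
print(launches_allowed(sli=0.9995, slo=0.999, budget_remaining_minutes=0.0))
```
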
  16. SRE Principle #1

    • Even without hiring a single SRE, you can have an Error Budget Policy.
    • An error budget is a lever you can use to keep your customers from experiencing pain and sadness.
    • You can implement this today: measure, account and act.
  17. Making Tomorrow Better Than Today

    • SLOs and Error Budgets are the first step.
    • The next step is staffing an SRE role...
    • ...endowed with real responsibility.
  18. Your First SRE

    • Defines and refines Service Level Objectives.
    • Enacts the Error Budget Policy when necessary.
    • Makes sure that the application meets the reliability expectations of its users.
  19. Toil

    • A bounded part of the role.
    • Recommend that less than 50% of the workload be operations.
  20. Project Work

    • Consulting on System Architecture and Design
    • Authoring and iterating on Monitoring
    • Automating repetitive work
    • Coordinating implementation of Postmortem Action Items
  21. SRE Principle #2

    • An SRE’s job is not to suffer under operational load, but to make each day brighter.
    • "Brighter" might mean different things: it depends on what your SREs find most useful to do.
    • Less toil, more meaningful system improvements.
  22. Dumping all production services on an SRE team cannot work.

    Photo By: Air Force Tech. Sgt. Jorge Intriago (Public Domain)
  23. An overloaded team doesn’t have time to make tomorrow better than today.

    Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog
  24. Implementing a mechanism to give back pressure to dev partners provides balance.

    Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog
  25. Regulating Workload

    • Give 5% of the operational work to the developers.
    • Track SRE team project work.
      ◦ Not completing projects? → Something’s wrong.
    • Analyse and on-board new systems only if they can be operated safely.
    • If every problem has to be escalated to its developer: why is SRE carrying the pager?
  26. Leadership Buy-in

    • When applications miss their SLOs and run out of Error Budget, it puts additional load on the SRE team. You need to either:
      ◦ Devote more company resources to addressing reliability concerns
      ◦ Loosen the SLO
  27. Reliability & Consistency Up Front

    • Fixing a product after launch is always more expensive.
    • SRE teams can and should consult up-front on designs:
      ◦ Architecting resilient systems
      ◦ Maintaining consistency means fewer SREs can support more products
  28. Automation

    Three places SRE teams can benefit from Automation:
    1. To eliminate their toil: Don't do things over and over!
    2. To do capacity planning: Auto-scaling instead of manual forecasting!
    3. To fix issues automatically: If you can write the fix in a playbook, you can make the computer do it!
  29. SRE Principle #3

    • Teams need to be able to prioritise and do the work.
    • Each new system to maintain has a human cost.
    • Must be able to push back on unreliable practices and systems.
  30. Recognize the Antipattern

    “I'm extremely angry right now. People should lose their jobs if this was an error.”
    Hawaii State Representative Matt LoPresti (in reference to the 2018 Hawaii nuclear alert false alarm)

    Source: “How Hawaii Could Have Sent A False Nuclear Alarm”, Wired, Lapowsky, January 13, 2018 https://www.wired.com/story/hawaii-nuclear-missile-alert-false-explanation/
  31. Embrace Failure

    • by setting SLOs less than 100%
    • by modeling blamelessness at all levels
    • by stamping out blame wherever it is found
    • by celebrating cases of “I made a mistake” that lead to outages being resolved faster
  32. Learn from Failure

    • You’ve already paid the price in an outage.
    • Write a blameless postmortem.
    • Make postmortems widely available so others can learn, too.
  33. Keep Asking Why

    • The root cause of an outage is never a person.
    • Ask “why” for as many iterations as it takes to identify system-related causes.
    • Prioritize system fixes that support people to make the right choices.
  34. SRE Principle #4

    Failure is an opportunity to improve. Not an excuse to brandish pitchforks.
  35. SRE Principle #4

    • Failure happens. There is no way around it.
    • Stop pointing fingers.
    • Embrace failure to improve MTTD and MTTR.
    • Proactively addressing failure → more robust systems.
  36. Site Reliability Engineering Principles

    1. SRE needs Service Level Objectives, with consequences.
    2. SREs must have time to make tomorrow better than today.
    3. SRE teams have the ability to regulate their workload.
    4. Failure is an opportunity to improve.
  37. Cover images used with permission. These books can be found on shop.oreilly.com.

    The full text of the Google SRE Books is available at www.google.com/sre