How to SRE: Getting Started with Site Reliability Engineering

Slide 1

Slide 1 text

Getting Started with Site Reliability Engineering Florian Rathgeber (@frathgeber) Site Reliability Engineer Google Cloud

Slide 2

Slide 2 text

Florian Site Reliability Engineer Google Cloud SRE for ~2 years ● On the Cloud Console SRE team ● Spend most of my time on SLOs Previous life ● Computational Scientist @ Imperial College ● Data Engineer @ ECMWF Co-founded PyData London

Slide 3

Slide 3 text

Software engineering as a discipline focuses on designing and building rather than operating and maintaining, despite estimates that 40% [1] to 90% [2] of the total costs are incurred after launch.
[1] Glass, R. (2002). Facts and Fallacies of Software Engineering, Addison-Wesley Professional; p. 115.
[2] Dehaghani, S. M. H., & Hajrahimi, N. (2013). Which Factors Affect Software Projects Maintenance Cost More? Acta Informatica Medica, 21(1), 63–66. http://doi.org/10.5455/AIM.2012.21.63-66
Software's long-term cost
Image: Pixabay License. No attribution required.

Slide 4

Slide 4 text

Incentives aren't aligned. Developers Agility Operators Stability

Slide 5

Slide 5 text

DevOps is a set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
class SRE implements DevOps
Site Reliability Engineering is a set of practices we've found to work, some beliefs that animate those practices, and a job role.

Slide 6

Slide 6 text

Reducing product lifecycle friction
[Diagram: Concept → Business → Development → Operations → Market, with the callouts "Agile solves this" and "DevOps solves this"]

Slide 7

Slide 7 text

● Originated at Google in 2003 ● Framework for operating large scale systems reliably ● "SRE is what happens when you ask a software engineer to design an operations function" ● Focuses on running systems in production What is Site Reliability Engineering?

Slide 8

Slide 8 text

Site Reliability Engineering Principles 1 SRE needs Service Level Objectives (SLOs), with consequences. 2 SREs must have time to make tomorrow better than today. 3 SRE teams have the ability to regulate their workload. 4 Failure is an opportunity to improve.

Slide 9

Slide 9 text

Product lifecycle
[Diagram: Concept → Business → Development → Operations → Market, with the callout "Site Reliability Engineering solves this problem" and the label "Business Process"]

Slide 10

Slide 10 text

But getting started can feel daunting... Image: CC0 license: https://pxhere.com/en/photo/739800

Slide 11

Slide 11 text

Service Level Objectives

Slide 12

Slide 12 text

● Goal for how well the system should operate ● Tracks the customer experience ○ SLOs met = happy customers ○ Unhappy customers = SLOs not met What is a Service Level Objective?

Slide 13

Slide 13 text

● 99.99% of HTTP requests per month succeed with 200 OK ● 90% of HTTP requests returned in under 300ms ● 99% of log entries processed in under 5 minutes Example SLOs
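
SLOs like these can be checked mechanically once good and total events are counted. The following is a minimal sketch, not a prescribed implementation: the names Slo, compliance and is_met are hypothetical, the request counts are made-up illustration data, and treating a window with no traffic as compliant is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target fraction of good events (hypothetical helper type)."""
    name: str
    target: float  # e.g. 0.9999 for "99.99% of requests succeed"


def compliance(good: int, total: int) -> float:
    """Measured fraction of good events over the window."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as compliant
    return good / total


def is_met(slo: Slo, good: int, total: int) -> bool:
    """True when the measured fraction reaches the SLO target."""
    return compliance(good, total) >= slo.target


# Illustrative counts only; real numbers come from your monitoring.
availability = Slo("HTTP requests succeed with 200 OK", 0.9999)
latency = Slo("HTTP requests returned in under 300ms", 0.90)

print(is_met(availability, good=999_950, total=1_000_000))  # 99.995% measured
print(is_met(latency, good=880_000, total=1_000_000))       # 88% measured
```

With these sample counts the availability SLO is met and the latency SLO is missed.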

Slide 14

Slide 14 text

● Service Level Agreements = contractual guarantees ● SLAs met != happy customers But What About SLAs?

Slide 15

Slide 15 text

● You could implement SLOs today for your application, but SLOs are only a foundation. ● You need consequences. What Next?

Slide 16

Slide 16 text

Error Budget Policy

Slide 17

Slide 17 text

How Reliable Do You Want To Be? The Bosses of the Senate (1889): Public Domain

Slide 18

Slide 18 text

How Reliable Do You Want To Be? More! The Bosses of the Senate (1889): Public Domain

Slide 19

Slide 19 text

“Anything that can go wrong will go wrong.” Murphy's Law Public Domain Image

Slide 20

Slide 20 text

“Anything that can go wrong, will…” Finagle's Law of Dynamic Negatives Public Domain Image

Slide 21

Slide 21 text

“Anything that can go wrong, will… ...at the worst possible moment.” Finagle's Law of Dynamic Negatives Public Domain Image

Slide 22

Slide 22 text

“100% is the wrong reliability target for basically everything.” Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google

Slide 23

Slide 23 text

SRE is About Balance
[Image of a balance, labelled: Reliability, Engineering Time, Development Velocity, Cost]
Image: williamcho, Pixabay License

Slide 24

Slide 24 text

So we introduce a budget Image Source: Florent Darrault CC BY-SA 2.0 Public Domain Image

Slide 25

Slide 25 text

● Gap between perfect reliability and our SLO. ● This is a budget to be spent. ● Given an uptime SLO of 99.9%, a 30-day month allows about 43 minutes of downtime, so after a 20-minute outage you still have 23 minutes of budget remaining for the month! Error Budgets
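
The arithmetic behind that example can be sketched in a few lines. This assumes a 30-day month as the measurement window; the function names are hypothetical, chosen only for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes: the gap between 100% and the SLO."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return (1.0 - slo) * window_minutes


def remaining_budget_minutes(slo: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after subtracting downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999)
remaining = remaining_budget_minutes(0.999, downtime_minutes=20)
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# budget: 43.2 min, remaining: 23.2 min
```

A negative result from remaining_budget_minutes means the budget for the window is exhausted.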

Slide 26

Slide 26 text

● What you agree to do when the application exceeds its error budget. ● This is not "pay $$$". ● Must be something that will visibly improve reliability. Error Budget Policy

Slide 27

Slide 27 text

Until the application is again meeting its SLO and has some Error Budget: ● "No new feature launches allowed." ● "Sprint planning may only pull Postmortem Action Items from the backlog." ● "Software Development Team must meet with SRE Team daily to outline their improvements" Error Budget Policy Examples
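
One way to make such a policy hard to ignore is to encode it as a check that release tooling consults. The sketch below is an illustration under stated assumptions: the Action enum and policy_action function are hypothetical names, and the two outcomes simply paraphrase the first two example policies above.

```python
from enum import Enum


class Action(Enum):
    """Outcomes of the hypothetical policy check."""
    SHIP_FEATURES = "new feature launches allowed"
    RELIABILITY_ONLY = "only postmortem action items may be pulled from the backlog"


def policy_action(budget_remaining_minutes: float) -> Action:
    """Allow launches only while some error budget is left."""
    if budget_remaining_minutes > 0:
        return Action.SHIP_FEATURES
    return Action.RELIABILITY_ONLY


print(policy_action(23.2).value)   # budget left: launches allowed
print(policy_action(-5.0).value)   # budget overspent: reliability work only
```

A real policy would be agreed between the development and SRE teams; the code only enforces what was agreed.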

Slide 28

Slide 28 text

SRE needs Service Level Objectives with Consequences. SRE Principle #1

Slide 29

Slide 29 text

● Even without hiring a single SRE, you can have an Error Budget Policy. ● An error budget is a lever you can use to keep your customers from experiencing pain and sadness. ● You can implement this today: measure, account and act. SRE Principle #1

Slide 30

Slide 30 text

Making Tomorrow Better Than Today

Slide 31

Slide 31 text

● SLOs and Error Budgets are the first step. ● The next step is staffing an SRE role... ● ...endowed with real responsibility. Making Tomorrow Better Than Today

Slide 32

Slide 32 text

● Defines and refines Service Level Objectives. ● Enacts the Error Budget Policy when necessary. ● Makes sure that the application meets the reliability expectations of its users. Your First SRE

Slide 33

Slide 33 text

● A bounded part of the role. ● Recommend that less than 50% of the workload be operations. Toil
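
The 50% guideline is only useful if the split is measured. Here is a minimal sketch, assuming time is tracked per work category; the category names, the hours and the toil_fraction helper are all hypothetical.

```python
def toil_fraction(hours_by_category: dict[str, float],
                  toil_categories: set[str]) -> float:
    """Share of tracked hours spent on operational (toil) categories."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0  # assumption: nothing tracked means no toil recorded
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


# Illustrative week for one SRE; real data would come from time tracking.
week = {"on-call": 12, "tickets": 10, "automation project": 14, "design review": 4}
fraction = toil_fraction(week, toil_categories={"on-call", "tickets"})

print(f"toil: {fraction:.0%}")
if fraction >= 0.5:
    print("operational load is at or above the recommended bound")
```

In this sample week 22 of 40 hours are operational, so the bound is exceeded.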

Slide 34

Slide 34 text

● Consulting on System Architecture and Design ● Authoring and iterating on Monitoring ● Automating repetitive work ● Coordinating implementation of Postmortem Action Items Project Work

Slide 35

Slide 35 text

SREs have time to make tomorrow better than today. SRE Principle #2

Slide 36

Slide 36 text

SRE Principle #2 ● An SRE’s job is not to suffer under operational load, but to make each day brighter. ● "Brighter" might mean different things: it depends on what your SREs find most useful to do. ● Less toil, more meaningful system improvements.

Slide 37

Slide 37 text

Shared Responsibility Model

Slide 38

Slide 38 text

Dumping all production services on an SRE team cannot work. Photo By: Air Force Tech. Sgt. Jorge Intriago (Public Domain)

Slide 39

Slide 39 text

An overloaded team doesn’t have time to make tomorrow better than today. Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog

Slide 40

Slide 40 text

Implementing a mechanism to give back pressure to dev partners provides balance. Used with permission of the image owner Jennifer Petoff, Sidewalk Safari Blog

Slide 41

Slide 41 text

● Give 5% of the operational work to the developers. ● Track SRE team project work. ○ Not completing projects? → Something’s wrong. ● Analyse and on-board new systems only if they can be operated safely. ● If every problem has to be escalated to its developer: why is SRE carrying the pager? Regulating Workload

Slide 42

Slide 42 text

Without leadership buy-in, SRE cannot work. Leadership Buy-in Image Credit: geralt Pixabay License

Slide 43

Slide 43 text

● When applications miss their SLOs and run out of Error Budget, it puts additional load on the SRE team. You need to either: ○ Devote more company resources to addressing reliability concerns ○ Loosen the SLO Leadership Buy-in

Slide 44

Slide 44 text

● Fixing a product after launch is always more expensive. ● SRE teams can and should consult up-front on designs: ○ Architecting resilient systems ○ Maintaining consistency means fewer SREs can support more products Reliability & Consistency Up Front

Slide 45

Slide 45 text

Three places SRE teams can benefit from Automation: 1. To eliminate their toil: Don't do things over and over! 2. To do capacity planning: Auto-scaling instead of manual forecasting! 3. To fix issues automatically: If you can write the fix in a playbook, you can make the computer do it! Automation

Slide 46

Slide 46 text

SRE teams have the ability to regulate their workload. SRE Principle #3

Slide 47

Slide 47 text

SRE Principle #3 ● Teams need to be able to prioritise and do the work. ● Each new system to maintain has a human cost. ● Must be able to push back on unreliable practices and systems.

Slide 48

Slide 48 text

A Culture of Blamelessness

Slide 49

Slide 49 text

“I'm extremely angry right now. People should lose their jobs if this was an error.” --Hawaii State Representative Matt Lopresti (in reference to the 2018 Hawaii nuclear alert false alarm) Recognize the Antipattern Source: “How Hawaii Could Have Sent A False Nuclear Alarm”, Wired, Lapowsky, January 13, 2018 https://www.wired.com/story/hawaii-nuclear-missile-alert-false-explanation/

Slide 50

Slide 50 text

Embrace Failure ● by setting SLOs less than 100% ● by modeling blamelessness at all levels ● by stamping out blame wherever it is found ● by celebrating cases of “I made a mistake” that lead to outages being resolved faster

Slide 51

Slide 51 text

● You’ve already paid the price in an outage. ● Write a blameless postmortem. ● Make postmortems widely available so others can learn, too. Learn from Failure

Slide 52

Slide 52 text

“Human” errors are really systems problems.

Slide 53

Slide 53 text

● The root cause of an outage is never a person. ● Ask “why” for as many iterations as it takes to identify system-related causes. ● Prioritize system fixes that support people to make the right choices. Keep Asking Why

Slide 54

Slide 54 text

Failure is an opportunity to improve. SRE Principle #4

Slide 55

Slide 55 text

Failure is an opportunity to improve. Not an excuse to brandish pitchforks SRE Principle #4

Slide 56

Slide 56 text

SRE Principle #4 ● Failure happens. There is no way around it. ● Stop pointing fingers. ● Embrace failure to improve MTTD and MTTR. ● Proactively addressing failure → more robust systems.
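
MTTD and MTTR can be tracked from incident timestamps recorded in postmortems. The sketch below assumes MTTD is measured from incident start to detection and MTTR from incident start to resolution; definitions vary between organisations, and the Incident type and the sample incidents are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Timestamps taken from a postmortem (hypothetical record shape)."""
    started: datetime
    detected: datetime
    resolved: datetime


def mttd_minutes(incidents: list[Incident]) -> float:
    """Mean time to detect: start of impact until detection."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to resolve: start of impact until resolution (assumed definition)."""
    return mean((i.resolved - i.started).total_seconds() / 60 for i in incidents)


# Illustrative data only.
incidents = [
    Incident(datetime(2019, 1, 7, 10, 0), datetime(2019, 1, 7, 10, 5),
             datetime(2019, 1, 7, 10, 45)),
    Incident(datetime(2019, 2, 3, 14, 0), datetime(2019, 2, 3, 14, 15),
             datetime(2019, 2, 3, 15, 15)),
]

print(mttd_minutes(incidents))  # 10.0
print(mttr_minutes(incidents))  # 60.0
```

Watching these two numbers over time shows whether postmortem action items are actually paying off.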

Slide 57

Slide 57 text

Site Reliability Engineering Principles 1 SRE needs Service Level Objectives, with consequences. 2 SREs must have time to make tomorrow better than today. 3 SRE teams have the ability to regulate their workload. 4 Failure is an opportunity to improve.

Slide 58

Slide 58 text

Cover images used with permission. These books can be found on shop.oreilly.com. The full text of the Google SRE Books is available at www.google.com/sre