Site Reliability Engineering

by Buzzvil

Slide 1

Slide 1 text

Site Reliability Engineering (SRE)

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

What is SRE?
- Keep the site up
  - whatever it takes
- Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services with an ever-watchful eye on their availability, latency, performance, and capacity.
- in a “software engineering” way
  - more scalable, more reliable, more efficient
- … almost the same as DevOps

Slide 4

Slide 4 text

class SRE implements DevOps… says Google
Five key areas of DevOps
- reduce organizational silos
- accept failure as normal
- implement gradual change
- leverage tooling & automation
- measure everything

Slide 5

Slide 5 text

Developers vs (Sys)Operators

Slide 6

Slide 6 text

SRE team in Google
- 50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers.
- The others are candidates who were very close to the Google Software Engineering qualifications, and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.
- Team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work
- Google places a 50% cap on the aggregate "ops" work for all SREs
- We want systems that are automatic, not just automated.

Slide 7

Slide 7 text

Main key principles of SRE

Slide 8

Slide 8 text

Error Budget
- Starting from measuring the reliability of the system
- Naive approach: Availability = uptime / total time
- Better approach: Availability = normal interactions / total interactions
- In the case of a web service: Availability = successful requests / total requests
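The two availability formulas on this slide can be compared with a minimal sketch. The function names and the sample figures are hypothetical, chosen only for illustration; they are not taken from the talk.

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """Naive approach: fraction of wall-clock time the service was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return uptime_seconds / total_seconds


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Better approach: fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


if __name__ == "__main__":
    # Hypothetical month: 30 days, with 2,592 seconds (43.2 minutes) of downtime.
    month_seconds = 30 * 24 * 60 * 60
    downtime_seconds = 2_592
    print(time_based_availability(month_seconds - downtime_seconds, month_seconds))

    # Hypothetical traffic: 1,000,000 requests, 500 of which failed.
    print(request_based_availability(999_500, 1_000_000))
```

The request-based figure weights an outage by how many users it actually affected, which the time-based figure cannot do.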

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

Error Budget
- How much error are we willing to accept while releasing new software that could have bugs?
- Error budget = 100% - target level of availability
- Actually a “budget” we can spend every month
- Benefits
  - Risk management by the dev team.
  - Aligns incentives and emphasizes joint ownership between SRE and product development.
  - Makes it easier to decide the rate of releases and to discuss it effectively.
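As a rough illustration of the budget arithmetic, the sketch below turns an availability target into a number of allowed failed requests and reports how much of the budget is left. The function names, the request-based accounting, and the sample figures are assumptions made for this example, not something the slides prescribe.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 100% minus the target availability."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the budget tolerates in the period."""
    return error_budget(slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent (negative once it is overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = allowed_failures(slo_target, total_requests)
    return (allowed - failed_requests) / allowed


if __name__ == "__main__":
    # Hypothetical month: 99.9% target, 1,000,000 requests, 250 failures so far.
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"budget remaining: {remaining:.0%}")
    # One possible policy (an assumption): keep releasing while budget remains.
    print("releases allowed:", remaining > 0)
```

Under these assumed numbers the budget is about 1,000 failed requests for the month, so 250 failures leave roughly three quarters of it to spend on further releases.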

Slide 11

Slide 11 text

Service Level Terminology
- SLI (service level indicator)
- SLO (service level objective): a target value or range of values for a service level that is measured by an SLI
- SLA (service level agreement)
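The relationship between an SLI and an SLO can be sketched as a measurement checked against a target. Everything named here (the `SLO` class, `latency_sli`, the 200 ms threshold, the 95% target) is hypothetical and only illustrates the idea; the SLA is left out because it is an agreement rather than a computation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SLO:
    """A target for an SLI: the indicator should be at least `target`."""
    name: str
    target: float

    def is_met(self, sli_value: float) -> bool:
        return sli_value >= self.target


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of requests served at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


if __name__ == "__main__":
    # Hypothetical samples and target, for illustration only.
    samples = [80.0, 120.0, 95.0, 300.0, 110.0]
    slo = SLO(name="95% of requests under 200 ms", target=0.95)
    sli = latency_sli(samples, threshold_ms=200.0)
    print(slo.name, "->", sli, "met" if slo.is_met(sli) else "missed")
```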

Slide 12

Slide 12 text

Indicators in Practice
What Do You and Your Users Care About?
- User-facing serving systems: availability, latency, and throughput
- Storage systems: latency, availability, and durability
- Big data systems: throughput and end-to-end latency
- also, correctness

Slide 13

Slide 13 text

Simplicity
- System Stability Versus Agility
- The Virtue of Boring
  - “Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code.”
- Remove unnecessary code
  - every new line of code written is a liability.
- The "Negative Lines of Code" Metric
  - "software bloat"
  - A smaller project is easier to understand, easier to test, and frequently has fewer defects.
- Minimal APIs
  - "perfection is finally attained not when there is no longer more to add, but when there is no longer anything to take away" by Antoine de Saint-Exupéry
- Modularity (in terms of the design of distributed systems)
- Release Simplicity

Slide 14

Slide 14 text

Practices of SRE

Slide 15

Slide 15 text

What SREs do
- Monitoring
  - + alerts, tickets, logging
- Emergency Response
- Change Management
  - progressive rollouts, detecting problems, rolling back
- Demand Forecasting and Capacity Planning
- Provisioning
  - combines both change management and capacity planning

Slide 16

Slide 16 text

Emergency Response
- Things break; that’s life.
- First of all, don’t panic!
- You aren’t alone, and the sky isn’t falling
- Pull in more people

Slide 17

Slide 17 text

Test-Induced Emergency
- Google has adopted a proactive approach to disaster and emergency testing
- SREs break our systems, watch how they fail, and make changes to improve reliability and prevent the failures from recurring
- The goal is to identify weaknesses or hidden dependencies and document follow-up actions to rectify the flaws we uncover

Slide 18

Slide 18 text

Postmortem: Learning from Failure
Primary goals
- the incident is documented
- all contributing root cause(s) are well understood
- effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence
Should be a “blameless” postmortem

Slide 19

Slide 19 text

Postmortem: Learning from Failure
Common postmortem triggers
- User-visible downtime or degradation beyond a certain threshold
- Data loss of any kind
- On-call engineer intervention (release rollback, rerouting of traffic, etc.)
- A resolution time above some threshold
- A monitoring failure (which usually implies manual incident discovery)
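The trigger list above could be encoded as a simple check that reports which triggers an incident hits. The `Incident` fields and both threshold values are hypothetical; the slides only say "a certain threshold", so each team would pick its own numbers.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; the slides do not specify values.
MAX_USER_VISIBLE_MINUTES = 5.0
MAX_RESOLUTION_MINUTES = 60.0


@dataclass
class Incident:
    user_visible_minutes: float   # user-visible downtime or degradation
    data_lost: bool               # data loss of any kind
    oncall_intervened: bool       # rollback, traffic rerouting, etc.
    resolution_minutes: float     # time from detection to resolution
    detected_by_monitoring: bool  # False implies manual discovery


def postmortem_triggers(incident: Incident) -> List[str]:
    """Return the names of all postmortem triggers this incident hits."""
    triggers = []
    if incident.user_visible_minutes > MAX_USER_VISIBLE_MINUTES:
        triggers.append("user-visible downtime or degradation")
    if incident.data_lost:
        triggers.append("data loss")
    if incident.oncall_intervened:
        triggers.append("on-call engineer intervention")
    if incident.resolution_minutes > MAX_RESOLUTION_MINUTES:
        triggers.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        triggers.append("monitoring failure")
    return triggers


if __name__ == "__main__":
    incident = Incident(
        user_visible_minutes=0.0,
        data_lost=False,
        oncall_intervened=True,
        resolution_minutes=30.0,
        detected_by_monitoring=True,
    )
    hits = postmortem_triggers(incident)
    print("postmortem required:", bool(hits), hits)
```

In this sketch any single trigger is enough to require a postmortem, which is one reasonable reading of the list.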

Slide 20

Slide 20 text

Postmortem Culture in Google
- Postmortem of the month
- Google+ postmortem group
- Postmortem reading clubs
- Wheel of Misfortune
  - Disaster Role Playing Game
  - The formula is straightforward and bears some resemblance to a tabletop RPG (Role Playing Game): the "game master" (GM) picks two team members to be primary and secondary on-call; these two SREs join the GM at the front of the room. An incoming page is announced, and the on-call team responds with what they would do to mitigate and investigate the outage.

Slide 21

Slide 21 text

Example of Incident State / Postmortem Document
https://landing.google.com/sre/sre-book/chapters/incident-document/
https://landing.google.com/sre/sre-book/chapters/postmortem/

Slide 22

Slide 22 text

Thank you