Site Reliability Engineering

Site Reliability Engineering (SRE)

What is SRE?
- Keep the site up, whatever it takes
- “Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services with an ever-watchful eye on their availability, latency, performance, and capacity.”
- Done in a “software engineering” way: more scalable, more reliable, more efficient
- … almost the same as DevOps

class SRE implements DevOps… says Google
Five key areas of DevOps
- reduce organizational silos
- accept failure as normal
- implement gradual change
- leverage tooling & automation
- measure everything

Developers vs (Sys)Operators

SRE team at Google
- 50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers.
- The others are candidates who were very close to the Google Software Engineering qualifications, and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.
- A team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work
- Google places a 50% cap on the aggregate "ops" work for all SREs
- We want systems that are automatic, not just automated.

Main key principles of SRE

Error Budget
- Start by measuring the reliability of the system
- Naive approach: Availability = uptime / total time
- Better approach: Availability = normal interactions / total interactions
- In the case of a web service: successful requests / total requests
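The two availability definitions above can be sketched in Python. This is a minimal illustration, not a production metric: the function names are made up for this example, and treating a window with zero requests as fully available is an assumption the deck does not state.

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """Naive approach: fraction of wall-clock time the service was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return uptime_seconds / total_seconds


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Better approach: fraction of requests that succeeded."""
    if total_requests < 0 or successful_requests < 0:
        raise ValueError("request counts must be non-negative")
    if total_requests == 0:
        # Assumption: no traffic in the window counts as fully available.
        return 1.0
    return successful_requests / total_requests


if __name__ == "__main__":
    # 999,000 successful requests out of 1,000,000 in the window.
    print(request_based_availability(999_000, 1_000_000))
```

The request-based form weights an outage by how many users it actually affected, which the time-based form cannot do.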

Error Budget
- How much error are we willing to accept while releasing new software that could have bugs?
- Error budget = 100% - target level of availability
- Effectively a “budget” we can spend every month
- Benefits
  - Risk management by the dev team
  - Aligns incentives and emphasizes joint ownership between SRE and product development
  - Makes it easier to decide the rate of releases and to discuss it effectively
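As a rough worked example of "error budget = 100% - target availability", the sketch below converts an availability target into allowed downtime and into the share of the budget still unspent. The function names, the 30-day window, and the request counts are assumptions chosen for illustration.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction, e.g. a 0.999 target leaves about 0.001."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be between 0 and 1")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime the budget permits over the window (time-based view)."""
    return error_budget(slo_target) * window_days * 24 * 60


def remaining_budget(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    budget_in_requests = error_budget(slo_target) * total_requests
    if budget_in_requests == 0:
        raise ValueError("a 100% target or an empty window leaves no budget")
    return 1.0 - failed_requests / budget_in_requests


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(allowed_downtime_minutes(0.999))
    # 250 failures out of 1,000,000 requests leaves about 75% of the budget.
    print(remaining_budget(0.999, 1_000_000, 250))
```

A team could gate releases on `remaining_budget(...) > 0`: while budget remains, releases proceed; once it is spent, the focus shifts to reliability work.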

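Service Level Terminology
- SLI (service level indicator): a carefully defined quantitative measure of some aspect of the level of service that is provided
- SLO (service level objective): a target value or range of values for a service level that is measured by an SLI
- SLA (service level agreement): an explicit or implicit contract with users that includes consequences of meeting (or missing) the SLOs it contains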
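To make the relationship between an SLI and an SLO concrete, here is a small sketch. The `SLO` class, the `latency_sli` helper, the 100 ms threshold, and the sample latencies are all hypothetical and only illustrate "an SLO is a target for a value measured by an SLI".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SLO:
    """A target for an SLI: the measured value must be at least `target`."""
    name: str
    target: float

    def is_met(self, sli_value: float) -> bool:
        return sli_value >= self.target


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """SLI: fraction of requests served within `threshold_ms`."""
    if not latencies_ms:
        # Assumption: an empty sample is treated as meeting the threshold.
        return 1.0
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


if __name__ == "__main__":
    slo = SLO(name="requests under 100 ms", target=0.9)
    measured = latency_sli([50, 80, 120, 400], threshold_ms=100)
    print(measured, slo.is_met(measured))
```

An SLA would sit on top of this: it attaches consequences (for example, refunds) to missing the SLO, which is a business decision rather than something code computes.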

Indicators in Practice
What Do You and Your Users Care About?
- User-facing serving systems: availability, latency, and throughput
- Storage systems: latency, availability, and durability
- Big data systems: throughput and end-to-end latency
- Also: correctness

Simplicity
- System Stability Versus Agility
- The Virtue of Boring
  - “Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code.”
- Remove unnecessary code: every new line of code written is a liability.
- The "Negative Lines of Code" Metric
  - "software bloat"
  - A smaller project is easier to understand, easier to test, and frequently has fewer defects.
- Minimal APIs
  - "perfection is finally attained not when there is no longer more to add, but when there is no longer anything to take away" (Antoine de Saint-Exupéry)
- Modularity (in terms of the design of distributed systems)
- Release Simplicity

Practices of SRE

What SREs do
- Monitoring
  - plus alerts, tickets, logging
- Emergency Response
- Change Management
  - progressive rollouts, detecting problems, rolling back
- Demand Forecasting and Capacity Planning
- Provisioning
  - combines both change management and capacity planning
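The change-management loop (progressive rollout, detect problems, roll back) can be sketched as below. This is a toy model under stated assumptions: `progressive_rollout`, the stage percentages, and the `is_healthy` callback are hypothetical, and a real system would check monitoring signals rather than call a local function.

```python
from typing import Callable, List, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[int],
    is_healthy: Callable[[int], bool],
) -> Tuple[bool, List[int]]:
    """Expose the release to a growing percentage of traffic.

    Returns (succeeded, stages_reached). If a health check fails at any
    stage, the rollout stops there and the caller is expected to roll back.
    """
    reached: List[int] = []
    for percent in stages:
        reached.append(percent)
        if not is_healthy(percent):
            # Problem detected: stop so the change can be rolled back.
            return False, reached
    return True, reached


if __name__ == "__main__":
    # Hypothetical health check that starts failing at 50% of traffic.
    ok, reached = progressive_rollout([1, 10, 50, 100], lambda p: p < 50)
    print("rolled out" if ok else "roll back", reached)
```

Stopping at the first unhealthy stage limits how many users see a bad release, which is the point of rolling out gradually.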

Emergency Response
- Things break; that’s life.
- First of all, don’t panic!
- You aren’t alone, and the sky isn’t falling
- Pull in more people

Test-Induced Emergency
- Google has adopted a proactive approach to disaster and emergency testing
- SREs break our systems, watch how they fail, and make changes to improve reliability and prevent the failures from recurring
- The goal is to identify weaknesses or hidden dependencies and to document follow-up actions that rectify the flaws we uncover

Postmortem: Learning from Failure
Primary goals
- the incident is documented
- all contributing root cause(s) are well understood
- effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence
Should be a “blameless” postmortem

Postmortem: Learning from Failure
Common postmortem triggers
- User-visible downtime or degradation beyond a certain threshold
- Data loss of any kind
- On-call engineer intervention (release rollback, rerouting of traffic, etc.)
- A resolution time above some threshold
- A monitoring failure (which usually implies manual incident discovery)
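The trigger list above maps naturally onto a simple check. The sketch below is only an illustration: the `Incident` fields and the default thresholds (5 minutes of downtime, 60 minutes to resolve) are assumptions, since the deck says "a certain threshold" without giving values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    user_visible_downtime_minutes: float = 0.0
    data_lost: bool = False
    on_call_intervened: bool = False
    resolution_minutes: float = 0.0
    detected_by_monitoring: bool = True


def postmortem_triggers(
    incident: Incident,
    downtime_threshold_minutes: float = 5.0,     # assumed value
    resolution_threshold_minutes: float = 60.0,  # assumed value
) -> List[str]:
    """Return the reasons this incident warrants a postmortem (empty if none)."""
    reasons: List[str] = []
    if incident.user_visible_downtime_minutes > downtime_threshold_minutes:
        reasons.append("user-visible downtime beyond threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.on_call_intervened:
        reasons.append("on-call engineer intervention")
    if incident.resolution_minutes > resolution_threshold_minutes:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        reasons.append("monitoring failure")
    return reasons


if __name__ == "__main__":
    incident = Incident(user_visible_downtime_minutes=12, on_call_intervened=True)
    print(postmortem_triggers(incident))
```

Returning the list of reasons, rather than a bare boolean, gives the postmortem author a starting point for the document.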

Postmortem Culture at Google
- Postmortem of the month
- Google+ postmortem group
- Postmortem reading clubs
- Wheel of Misfortune
  - Disaster Role Playing Game
  - The formula is straightforward and bears some resemblance to a tabletop RPG (Role Playing Game): the "game master" (GM) picks two team members to be primary and secondary on-call; these two SREs join the GM at the front of the room. An incoming page is announced, and the on-call team responds with what they would do to mitigate and investigate the outage.

Example of Incident State / Postmortem Document
https://landing.google.com/sre/sre-book/chapters/incident-document/
https://landing.google.com/sre/sre-book/chapters/postmortem/

Thank you


Buzzvil
