
Site Reliability Engineering

Buzzvil
November 21, 2018

By Dio

Transcript

  1. What is SRE? - Keep the site up - whatever

    it takes - Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services with an ever-watchful eye on their availability, latency, performance, and capacity. - in a “software engineering” way - more scalable, more reliable, more efficient - … almost the same as DevOps
  2. class SRE implements DevOps… says Google Five key areas of

    DevOps - reduce organization silos - accept failure as normal - implement gradual change - leverage tooling & automation - measure everything
  3. SRE team in Google - 50–60% are Google Software Engineers,

    or more precisely, people who have been hired via the standard procedure for Google Software Engineers. - The others are candidates who were very close to the Google Software Engineering qualifications, and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek. - Team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work - Google places a 50% cap on the aggregate "ops" work for all SREs - We want systems that are automatic, not just automated.
  4. Error Budget - Starting by measuring the reliability of the

    system - Naive approach: Availability = uptime / total time - Better approach: Availability = normal interactions / total interactions - In the case of a web service: successful requests / total requests
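The "better approach" above can be sketched in a few lines; the request counts here are purely illustrative:

```python
# Availability measured as the ratio of successful requests to total
# requests, per the "better approach" above. Numbers are illustrative.
def availability(successful_requests, total_requests):
    if total_requests == 0:
        return 1.0  # no traffic: treat the service as fully available
    return successful_requests / total_requests

print(availability(999_421, 1_000_000))  # 0.999421, i.e. ~99.94% available
```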
  5. Error Budget - How many errors we are willing to

    accept while releasing new software that could have bugs - Error budget = 100% - target level of availability - Actually a “budget” we can spend every month - Benefits - Risk management by the dev team - Aligns incentives and emphasizes joint ownership between SRE and product development - Makes it easier to decide the release rate and to discuss it effectively
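The budget arithmetic above is simple enough to show directly; the SLO target and traffic volume below are illustrative, not prescriptive:

```python
# Error budget = 100% - target level of availability. With a 99.9%
# monthly availability SLO, 0.1% of requests may fail before the
# budget is exhausted. Numbers are illustrative.
def error_budget(slo_target, total_requests):
    return (1.0 - slo_target) * total_requests

budget = error_budget(0.999, 10_000_000)
print(round(budget))  # ~10,000 failed requests allowed this month
```

Once the budget for the month is spent, releases slow down or stop until reliability recovers, which is what aligns the incentives of SRE and product development.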
  6. Service Level Terminology - SLI (service level indicator): a quantitative measure of some aspect of the level of service provided - SLO

    (service level objective): a target value or range of values for a service level that is measured by an SLI - SLA (service level agreement): a contract with users that includes consequences of meeting or missing the SLOs
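The relationship between the three terms can be sketched minimally: the SLI is the measured value, and the SLO is the target it is compared against (values below are illustrative):

```python
# Minimal sketch of the SLI/SLO relationship: an SLI is a measurement,
# an SLO is the target that measurement must meet.
def meets_slo(sli_value, slo_target):
    return sli_value >= slo_target

measured_availability = 0.9995  # SLI: e.g. measured over the last 30 days
target_availability = 0.999     # SLO: the agreed target
print(meets_slo(measured_availability, target_availability))  # True
```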
  7. Indicators in Practice What Do You and Your Users Care

    About? - User-facing serving systems: availability, latency, and throughput - Storage systems: latency, availability, and durability - Big data systems: throughput and end-to-end latency - also, correctness
  8. Simplicity - System Stability Versus Agility - The Virtue of

    Boring - “Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code.” - Remove unnecessary code - every new line of code written is a liability - The "Negative Lines of Code" Metric - "software bloat" - A smaller project is easier to understand, easier to test, and frequently has fewer defects. - Minimal APIs - "perfection is finally attained not when there is no longer more to add, but when there is no longer anything to take away" - Antoine de Saint-Exupéry - Modularity (in terms of the design of distributed systems) - Release Simplicity
  9. What SREs do - Monitoring - + alerts, tickets, logging

    - Emergency Response - Change Management - progressive rollouts, detecting problems, rolling back - Demand Forecasting and Capacity Planning - Provisioning - combines both change management and capacity planning
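The change-management pattern above (progressive rollout, detect problems, roll back) can be sketched as a small control loop; the stage percentages and error threshold here are illustrative assumptions, not a real rollout system:

```python
# A hedged sketch of a progressive rollout: traffic shifts to the new
# version in stages, rolling back if the observed error rate at any
# stage exceeds a threshold. Stages and threshold are illustrative.
ROLLOUT_STAGES = [1, 10, 50, 100]  # percent of traffic on the new version
ERROR_THRESHOLD = 0.001            # abort if error rate exceeds 0.1%

def progressive_rollout(observed_error_rate):
    """Return (stages completed, outcome). observed_error_rate is a
    callable taking the current traffic percentage (a monitoring stub)."""
    completed = []
    for stage in ROLLOUT_STAGES:
        if observed_error_rate(stage) > ERROR_THRESHOLD:
            return completed, "rolled back"
        completed.append(stage)
    return completed, "fully rolled out"

# Example: the new version stays healthy at every stage.
stages, outcome = progressive_rollout(lambda pct: 0.0002)
print(stages, outcome)  # [1, 10, 50, 100] fully rolled out
```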
  10. Emergency Response - Things break; that’s life. - First of

    all, Don’t Panic! - You aren’t alone, and the sky isn’t falling - Pull in more people
  11. Test-Induced Emergency - Google has adopted a proactive approach to

    disaster and emergency testing - SREs break our systems, watch how they fail, and make changes to improve reliability and prevent the failures from recurring - To identify some weaknesses or hidden dependencies and document follow-up actions to rectify the flaws we uncover
  12. Postmortem: Learning from Failure Primary Goal - the incident is

    documented - all contributing root cause(s) are well understood - effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence Should be a “blameless” postmortem
  13. Postmortem: Learning from Failure Common postmortem triggers - User-visible downtime

    or degradation beyond a certain threshold - Data loss of any kind - On-call engineer intervention (release rollback, rerouting of traffic, etc.) - A resolution time above some threshold - A monitoring failure (which usually implies manual incident discovery)
  14. Postmortem Culture in Google - Postmortem of the month -

    Google+ postmortem group - Postmortem reading clubs - Wheel of Misfortune - Disaster Role Playing Game - The formula is straightforward and bears some resemblance to a tabletop RPG (Role Playing Game): the "game master" (GM) picks two team members to be primary and secondary on-call; these two SREs join the GM at the front of the room. An incoming page is announced, and the on-call team responds with what they would do to mitigate and investigate the outage.