Site Reliability Engineering

Buzzvil
November 21, 2018

By Dio

Transcript

  1. What is SRE?
    - Keep the site up, whatever it takes
    - "Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services with an ever-watchful eye on their availability, latency, performance, and capacity."
    - Done in a “software engineering” way: more scalable, more reliable, more efficient
    - … almost the same as DevOps
  2. class SRE implements DevOps… says Google
    - Five key areas of DevOps
      - reduce organizational silos
      - accept failure as normal
      - implement gradual change
      - leverage tooling & automation
      - measure everything
  3. SRE team in Google
    - 50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers.
    - The others are candidates who were very close to the Google Software Engineering qualifications, and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.
    - Team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work
    - Google places a 50% cap on the aggregate "ops" work for all SREs
    - We want systems that are automatic, not just automated.
  4. Error Budget
    - Starting from measuring the reliability of the system
    - Naive approach: Availability = uptime / total time
    - Better approach: Availability = normal interactions / total interactions
    - In the case of a web service: Availability = successful requests / total requests
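The two availability definitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from the deck: the function names and the sample numbers are assumptions chosen for the example.

```python
def time_based_availability(uptime_seconds: float, total_seconds: float) -> float:
    """Naive approach: fraction of wall-clock time the service was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return uptime_seconds / total_seconds


def request_based_availability(successful_requests: int, total_requests: int) -> float:
    """Better approach for a web service: fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


if __name__ == "__main__":
    # Hypothetical 30-day month with 43.2 minutes of downtime.
    total = 30 * 24 * 60 * 60
    downtime = 43.2 * 60
    print(f"time-based:    {time_based_availability(total - downtime, total):.4%}")

    # Hypothetical traffic: 1,000,000 requests, 500 of them failed.
    print(f"request-based: {request_based_availability(999_500, 1_000_000):.4%}")
```

The request-based form weights an outage by how much traffic it actually affected, which is why it is usually the more useful measure for a service with uneven load.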
  5. Error Budget
    - How much error are we willing to accept while releasing new software that could have bugs?
    - Error budget = 100% - target level of availability
    - Actually a “budget” we can spend every month
    - Benefits
      - Risk management by the dev team.
      - Aligns incentives and emphasizes joint ownership between SRE and product development.
      - Makes it easier to decide on the rate of releases and to discuss it effectively.
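As a rough sketch of the arithmetic behind the error budget formula, the Python below turns an availability target into a number of tolerable failed requests and tracks how much of the budget is left. The function names, the 99.9% target, and the request counts are illustrative assumptions, not values from the deck.

```python
def error_budget(slo_target: float) -> float:
    """Error budget = 100% - target availability, expressed as a fraction."""
    if not 0.0 <= slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return 1.0 - slo_target


def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the budget tolerates in the period."""
    return error_budget(slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; negative means it is overspent."""
    allowed = allowed_failures(slo_target, total_requests)
    if allowed == 0:
        raise ValueError("a 100% target (or zero traffic) leaves no budget to spend")
    return 1.0 - failed_requests / allowed


if __name__ == "__main__":
    # Hypothetical month: 99.9% target, 1,000,000 requests, 250 failures so far.
    print(f"allowed failures: {allowed_failures(0.999, 1_000_000):.0f}")
    print(f"budget remaining: {budget_remaining(0.999, 1_000_000, 250):.0%}")
```

A team could use the remaining fraction as the input to a release decision: plenty of budget left suggests room for faster releases, while an overspent budget argues for slowing down.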
  6. Service Level Terminology
    - SLI (service level indicator)
    - SLO (service level objective): a target value or range of values for a service level that is measured by an SLI
    - SLA (service level agreement)
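One way to picture the relationship between an SLI and an SLO is as a measured number checked against a target value or range. The sketch below assumes a hypothetical `SLO` class with optional lower and upper bounds; the indicator names and thresholds are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    """A target value or range for a service level measured by one SLI."""
    sli_name: str
    lower: Optional[float] = None  # minimum acceptable SLI value, if any
    upper: Optional[float] = None  # maximum acceptable SLI value, if any

    def is_met(self, measured: float) -> bool:
        """Return True when the measured SLI value falls inside the target range."""
        if self.lower is not None and measured < self.lower:
            return False
        if self.upper is not None and measured > self.upper:
            return False
        return True


if __name__ == "__main__":
    # Hypothetical objectives: at least 99.9% availability, p99 latency at most 300 ms.
    availability_slo = SLO("availability", lower=0.999)
    latency_slo = SLO("p99_latency_ms", upper=300.0)

    print(availability_slo.is_met(0.9995))  # measured SLI is above the lower bound
    print(latency_slo.is_met(420.0))        # measured SLI exceeds the upper bound
```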
  7. Indicators in Practice: What Do You and Your Users Care About?
    - User-facing serving systems: availability, latency, and throughput
    - Storage systems: latency, availability, and durability
    - Big data systems: throughput and end-to-end latency
    - Also, correctness
  8. Simplicity
    - System Stability Versus Agility
    - The Virtue of Boring
      - “Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code.”
    - Remove unnecessary code: every new line of code written is a liability.
    - The "Negative Lines of Code" Metric
      - "software bloat"
      - A smaller project is easier to understand, easier to test, and frequently has fewer defects.
    - Minimal APIs
      - "perfection is finally attained not when there is no longer more to add, but when there is no longer anything to take away" (Antoine de Saint-Exupéry)
    - Modularity (in terms of the design of distributed systems)
    - Release Simplicity
  9. What SREs do
    - Monitoring, plus alerts, tickets, and logging
    - Emergency Response
    - Change Management: progressive rollouts, detecting problems, rolling back
    - Demand Forecasting and Capacity Planning
    - Provisioning: combines both change management and capacity planning
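The change management loop named above (progressive rollout, problem detection, rollback) might be sketched as follows. This is a toy model under stated assumptions: the stage fractions, the error-rate threshold, and the `error_rate_at` callback are all hypothetical stand-ins for a real deployment system and its monitoring.

```python
from typing import Callable, Sequence, Tuple


def progressive_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float,
) -> Tuple[str, float]:
    """Widen a release stage by stage and stop as soon as a problem is detected.

    stages: fractions of traffic to expose, in increasing order.
    error_rate_at: hypothetical probe returning the observed error rate at a stage.
    Returns ("completed", last_fraction) or ("rolled_back", failing_fraction).
    """
    deployed = 0.0
    for fraction in stages:
        deployed = fraction
        if error_rate_at(fraction) > max_error_rate:
            # Problem detected: stop widening and report the failing stage.
            return ("rolled_back", fraction)
    return ("completed", deployed)


if __name__ == "__main__":
    stages = [0.01, 0.10, 0.50, 1.00]

    # Hypothetical probe: the new release misbehaves once it sees half the traffic.
    def flaky(fraction: float) -> float:
        return 0.02 if fraction >= 0.5 else 0.0005

    print(progressive_rollout(stages, lambda f: 0.0005, max_error_rate=0.001))
    print(progressive_rollout(stages, flaky, max_error_rate=0.001))
```

Exposing a small fraction of traffic first limits how many users a bad release can affect before it is caught.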
  10. Emergency Response
    - Things break; that’s life.
    - First of all, Don’t Panic!
    - You aren’t alone, and the sky isn’t falling
    - Pull in more people
  11. Test-Induced Emergency
    - Google has adopted a proactive approach to disaster and emergency testing
    - SREs break our systems, watch how they fail, and make changes to improve reliability and prevent the failures from recurring
    - The goal is to identify weaknesses or hidden dependencies and document follow-up actions to rectify the flaws we uncover
  12. Postmortem: Learning from Failure
    - Primary goals
      - the incident is documented
      - all contributing root cause(s) are well understood
      - effective preventive actions are put in place to reduce the likelihood and/or impact of recurrence
    - Should be a “blameless” postmortem
  13. Postmortem: Learning from Failure
    - Common postmortem triggers
      - User-visible downtime or degradation beyond a certain threshold
      - Data loss of any kind
      - On-call engineer intervention (release rollback, rerouting of traffic, etc.)
      - A resolution time above some threshold
      - A monitoring failure (which usually implies manual incident discovery)
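The trigger list above maps naturally onto a small check that reports which triggers an incident hits. The sketch below is only an illustration: the `Incident` fields and both threshold constants are hypothetical, since the deck says only "a certain threshold" and "some threshold".

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds; each team would pick its own values.
DOWNTIME_THRESHOLD_MINUTES = 5.0
RESOLUTION_THRESHOLD_MINUTES = 60.0


@dataclass
class Incident:
    user_visible_downtime_minutes: float = 0.0
    data_lost: bool = False
    oncall_intervened: bool = False      # e.g. release rollback, traffic rerouting
    resolution_minutes: float = 0.0
    detected_by_monitoring: bool = True  # False implies manual discovery


def postmortem_triggers(incident: Incident) -> List[str]:
    """Return the reasons this incident warrants a postmortem (empty if none)."""
    reasons: List[str] = []
    if incident.user_visible_downtime_minutes > DOWNTIME_THRESHOLD_MINUTES:
        reasons.append("user-visible downtime beyond threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.oncall_intervened:
        reasons.append("on-call engineer intervention")
    if incident.resolution_minutes > RESOLUTION_THRESHOLD_MINUTES:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        reasons.append("monitoring failure")
    return reasons


if __name__ == "__main__":
    incident = Incident(user_visible_downtime_minutes=12, oncall_intervened=True)
    print(postmortem_triggers(incident))
```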
  14. Postmortem Culture in Google
    - Postmortem of the month
    - Google+ postmortem group
    - Postmortem reading clubs
    - Wheel of Misfortune: a disaster role-playing game
      - The formula is straightforward and bears some resemblance to a tabletop RPG (Role Playing Game): the "game master" (GM) picks two team members to be primary and secondary on-call; these two SREs join the GM at the front of the room. An incoming page is announced, and the on-call team responds with what they would do to mitigate and investigate the outage.