Slide 1

Site Reliability Engineering
operations as software engineering
Gorka López de Torre Querejazu
Senior Consultant | ThoughtWorks
@gorkaio

Slide 2

Gorka López de Torre Querejazu
Senior Consultant | ThoughtWorks
https://gorka.io
@gorkaio
● Software Engineer
● MSc Information Technology Management
● GDG Vitoria Co-Organizer

Slide 3

What is SRE?

Slide 4

Software Lifecycle
What is SRE?

Slide 5

DevOpsInterface
A set of practices, guidelines and culture designed to break down silos in IT development, operations, architecture, networking and security.
5 key areas:
● Reduce organisational silos
● Accept failure as normal
● Implement gradual changes
● Leverage tooling and automation
● Measure everything
What is SRE?

Slide 6

Class SRE implements DevOpsInterface
A set of practices we’ve found to work, some beliefs that animate those practices, and a job role.
5 key areas:
● Reduce organisational silos: share ownership
● Accept failure as normal: Error Budgets & blameless postmortems
● Implement gradual changes: CI/CD, feature toggles... reduce the cost of failure
● Leverage tooling and automation: automate common cases
● Measure everything: measure toil and reliability
What is SRE?

Slide 7

SRE approach to operations
● Data driven decision making
● Treat operations like a software engineering problem:
  ○ Hire people motivated and capable of writing automation
  ○ Use software to accomplish tasks normally done by sysadmins
  ○ Design more reliable and operable service architectures from the start
What is SRE?

Slide 8

What do SRE teams do?
SRE develops solutions to design, build and run large-scale systems scalably, reliably and efficiently.
SRE guides system architecture by operating at the intersection of software development and systems engineering.
SRE is a job function, a mindset and a set of engineering approaches to running better production systems.
We approach our job with a spirit of constructive pessimism: we hope for the best, but plan for the worst. Hope is not a strategy.
What is SRE?

Slide 9

SRE Practices

Slide 10

Areas of practice
● Monitoring & Alerting
● Capacity Planning
● Change Management
● Emergency Response
● Culture
SRE Practices

Slide 11

Monitoring & Alerting
Monitoring: automate the recording of system metrics.
Alerting: trigger notifications when conditions are detected.
● Page: immediate human response required.
● Ticket: a human needs to take action, but not immediately.
Only involve humans when the SLO is threatened (see the sketch below).
Humans should never watch dashboards, read log files, and so on to determine whether the system is OK.
SRE Practices
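As a rough illustration of the page/ticket split, here is a minimal Python sketch that only involves a human when the error budget (introduced later in this deck) is burning fast enough to threaten the SLO. The function name and burn-rate thresholds are illustrative assumptions, not part of the original talk or any specific tool.

# Minimal sketch: route alerts based on whether the SLO is threatened.
# Burn rate = observed bad ratio / allowed bad ratio; thresholds are illustrative.
def route_alert(sli: float, slo_target: float) -> str:
    error_budget = 1 - slo_target
    burn_rate = (1 - sli) / error_budget
    if burn_rate >= 10:      # budget gone within days: wake someone up
        return "page"
    if burn_rate >= 2:       # budget gone before the window ends: needs action
        return "ticket"
    return "no alert"        # healthy: no human involved

print(route_alert(sli=0.96, slo_target=0.995))   # burn rate 8  -> ticket
print(route_alert(sli=0.90, slo_target=0.995))   # burn rate 20 -> page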

Slide 12

Capacity Planning
Plan for organic growth: increased product adoption and usage by customers.
Determine inorganic growth: sudden jumps in demand due to feature launches, marketing campaigns,...
Correlate raw resources to service capacity: make sure you have enough spare capacity to meet your reliability goals.
Optimize utilisation: capacity can be expensive! (and so can downtime)
SRE Practices

Slide 13

Change Management
Progressive rollouts: CI/CD, Feature Toggles,...
Quickly and accurately detect problems: monitoring and alerting.
Roll back changes safely and quickly when problems arise: rolling back should be as easy as pushing a button.
Changes imply risk: ~70% of outages are due to changes in live systems.
Remove humans from the loop:
● Reduce errors
● Reduce fatigue
● Improve velocity
Spend Error Budget to increase velocity: the goal is not zero outages, but maximum velocity within the error budget.
SRE Practices

Slide 14

Emergency Response
Thresholds: define incident & postmortem criteria
● User-visible downtime or degradation
● Data loss
● Significant on-call engineer intervention (i.e. rollback)
● Resolution time above threshold
Postmortem
● Document the incident
● Understand contributing root causes
● Plan preventive actions to reduce the likelihood and/or impact of recurrence
Things break. That’s a fact.
● Keep your runbooks updated.
● Don’t panic!
● Mitigate, troubleshoot, fix.
● Overwhelmed? Pull in more people (freehunting).
Automate incident & postmortem criteria (see the sketch below): SLO breaches causing fast error budget burn should easily be spotted as incidents.
Writing a postmortem is not punishment: postmortems are a chance to improve system reliability.
SRE Practices
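To make "automate incident & postmortem criteria" concrete, here is a minimal sketch that applies the criteria listed above to an event record. The Event fields and the 30-minute resolution threshold are hypothetical, not taken from the deck.

# Minimal sketch: automatically decide whether an event meets the
# incident / postmortem criteria above (field names and threshold are illustrative).
from dataclasses import dataclass

@dataclass
class Event:
    user_visible: bool
    data_loss: bool
    required_rollback: bool
    resolution_minutes: int

def needs_postmortem(e: Event, resolution_threshold_min: int = 30) -> bool:
    return (
        e.user_visible
        or e.data_loss
        or e.required_rollback
        or e.resolution_minutes > resolution_threshold_min
    )

print(needs_postmortem(Event(False, False, True, 12)))  # True: a rollback was needed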

Slide 15

Culture
Blamelessness
● Focus on contributing causes, not teams or individuals
● Borrow the Retrospective Prime Directive from Agile
● Human errors are system problems
● If a culture of finger pointing prevails, people will not bring issues to light for fear of punishment
Reduce toil work
● Manual
● Repetitive
● Automatable
● Tactical
● Without enduring value
● Grows with service growth
Toil management
● You can’t automate everything
● If it can be automated, it probably should be automated
● If you do enough ops work, you’ll know what to automate
Team skills
● Good software engineers, good systems engineers
● Try to get a 50:50 mix
● Everyone should be able to code
Management skills
● Avoid operational burden; keep the team healthy
● Blamelessness & agility
● Product thinking
● Ops work should be around 50%
● The other 50% should be development:
  ○ Automation
  ○ Improvement
  ○ Toil reduction
SRE Practices

Slide 16

Service Level Objectives

Slide 17

What are we trying to fix?
● Understand impact without in-depth service knowledge.
● Focus on things that matter.
● Reduce alert fatigue.
● Set reliability expectations.
● Have clear and shared criteria of good/bad.
● Balance reliability and development velocity.
But we already have dashboards and alarms...?
● Can anyone in the company understand if your service is healthy, just by looking at your dashboard?
● Do your users know the reliability they can expect?
● Does it matter if a tree falls in a deserted forest?
Does this replace our current dashboards?
● No. SLO dashboards talk about system reliability and the impact it has on users: the symptoms you are facing. Debugging dashboards help you find the root cause.
When do we tackle tech debt, improve performance...?
● When it affects reliability.
SLOs

Slide 18

Indicators, Objectives, Agreements
SLOs

Slide 19

It’s all about the user experience
SLOs

Slide 20

SLI · Service Level Indicator
What is it?
A quantifiable measurement of reliability for a specific service capability, often aggregated to form rates, averages or percentiles.
SLI = good / valid (see the sketch below)
Examples
● Ratio of home page GET requests served faster than a threshold.
● Ratio of home page GET requests served successfully.
What is “good”?
● Depends on what you are trying to measure.
● GET requests to an existing HTTP endpoint:
  ○ Are all non-5xx responses “good”?
What is “valid”?
● Depends on what you are trying to measure.
● GET requests to an existing HTTP endpoint:
  ○ Are non-authenticated requests “valid”?
Why use a ratio?
● All values lie between 0 (everything KO) and 1 (everything OK).
● Easier to take advantage of tooling.
SLOs
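A minimal sketch of the good/valid ratio, assuming a simple in-memory list of request records. The Request fields, the "/hello" path and the 100 ms threshold are illustrative assumptions, not part of the deck.

# Minimal sketch of an SLI as a good/valid ratio over hypothetical request records.
from dataclasses import dataclass

@dataclass
class Request:
    path: str
    status: int
    latency_ms: float
    authenticated: bool

def latency_sli(requests, path="/hello", threshold_ms=100):
    """% of valid requests served faster than the threshold."""
    valid = [r for r in requests if r.path == path and r.authenticated]
    if not valid:
        return None  # no valid events: the SLI is undefined, not 0 or 1
    good = [r for r in valid if r.status < 500 and r.latency_ms <= threshold_ms]
    return len(good) / len(valid)

requests = [
    Request("/hello", 200, 80, True),
    Request("/hello", 200, 140, True),
    Request("/hello", 503, 30, True),
    Request("/hello", 200, 60, False),  # not authenticated: excluded from "valid"
]
print(latency_sli(requests))  # 1 good out of 3 valid -> 0.333...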

Slide 21

SLI · Service Level Indicator
● Request/Response
  ○ Availability: % valid requests served successfully
  ○ Latency: % valid requests served faster than threshold
  ○ Quality: % valid requests served without degrading quality
● Data processing
  ○ Coverage: % valid data processed successfully
  ○ Freshness: % valid data updated more recently than threshold
  ○ Correctness: % valid data producing correct output
● Storage
  ○ Durability: % written data that can be successfully re-read
Isn’t it tricky?
● A lot. Consider the impact of 404 responses on latency.
Some categories are harder than others
● Availability/Latency are usually the easiest ones.
● Correctness can be particularly hard.
Start small, iterate and fine-tune
● Start with easy-to-define SLIs with good ROI.
SLOs

Slide 22

SLI · Service Level Indicator
Specification (WHAT?)
Ratio of home page GET requests served faster than a threshold.
Implementation (HOW?)
● An SLI specification can have multiple SLI implementations.
● Each implementation has advantages and drawbacks:
  ○ Implementation feasibility
  ○ Accuracy
  ○ Cost/Effort
SLI Specification:
● % “/hello” GET requests served faster than a threshold.
SLI Implementations:
● % “/hello” GET requests served faster than 100ms, measured at the load balancer.
● % “/hello” GET requests served faster than 250ms, measured at the client browser.
Where do those numbers come from?
● Target thresholds reflect our past experience or knowledge about user happiness thresholds.
● They should be reasonable indicators of user happiness.
● Product people must be involved in the SLI/SLO definitions.
What are the trade-offs?
● Accuracy degrades the farther we are from the user.
● Cost/Effort is usually lower at the more internal levels.
● Some levels might obscure problems further down.
● Telemetry at the client level might have legal implications.
● We can’t do much about carrier network reliability or coverage.
● ...
SLOs

Slide 23

SLO · Service Level Objective
What is it?
A target for SLIs aggregated over a rolling time window.
SLO = sum(SLI met) / window >= target (see the sketch below)
SLI:
● % “/hello” GET requests served faster than 100ms, measured at the load balancer.
SLO:
● 99.5% “/hello” GET requests served faster than 250ms, measured at the client browser, in a rolling window of 30 days.
Where do those objectives come from?
● Target objectives reflect our past experience or knowledge about user happiness thresholds, balanced with what is realistically achievable within effort/cost.
● Product people must be involved in the SLI/SLO definitions.
100% is the wrong reliability target for basically everything
● Effort and cost grow exponentially.
● Most users won’t notice a difference from 99.9% to 100%.
● Be as reliable as needed, but no more.
SLOs
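A minimal sketch of evaluating an SLO over a rolling 30-day window, assuming per-day (good, valid) counts are available; the counts and the 99.5% target are hypothetical.

# Minimal sketch: SLO attainment as the aggregate good/valid ratio
# over a rolling 30-day window.
from collections import deque

WINDOW_DAYS = 30
TARGET = 0.995  # 99.5%

# (good_count, valid_count) per day, most recent last; hypothetical numbers
daily_counts = deque(maxlen=WINDOW_DAYS)
daily_counts.extend([(99_700, 100_000)] * 30)

def slo_attainment(counts):
    good = sum(g for g, _ in counts)
    valid = sum(v for _, v in counts)
    return good / valid if valid else None

attainment = slo_attainment(daily_counts)
print(f"attainment={attainment:.4%}, met={attainment >= TARGET}")
# attainment=99.7000%, met=True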

Slide 24

SLA · Service Level Agreement
What is it?
An agreement with our users, generally in the form of a contract, which details the level of reliability that we are committed to deliver and the consequences of failing to meet that agreement.
SLI:
● % “/hello” GET requests served faster than 100ms, measured at the load balancer.
SLO:
● 99.5% “/hello” GET requests served faster than 250ms, measured at the client browser, in a rolling window of 30 days.
SLA:
● 99% “/hello” GET requests served faster than 250ms, measured at the client browser, in a rolling window of 30 days. In the event we do not meet this commitment, you will be eligible to receive a Service Credit.
SLO >>> SLA
● Make your SLOs more restrictive than your SLAs!
SLOs

Slide 25

Error Budget

Slide 26

Error Budget
What is it?
● Control mechanism for diverting attention to reliability as needed.
● Ratio of failure time agreed to be acceptable without consequences.
● An opportunity to innovate, increase velocity and take risks.
ErrorBudget = 1 - SLO (see the sketch below)
Take risks!
● The goal is not zero outages, but maximum velocity within error budget.
Be as reliable as needed, but no more
● It might be desirable to forcibly deplete our error budget, ensuring our users do not depend on a higher level of reliability than the one we committed to.
What if our users complain? What if they don’t?
● We probably failed to set the right objective, and will need to reevaluate the velocity/reliability balance.
Error Budget
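A minimal sketch of turning the formula into a "budget remaining" number, using the same hypothetical counts and 99.5% target as the SLO sketch above.

# Minimal sketch: remaining error budget over a rolling window,
# derived from the same (good, valid) counts used for the SLO.
TARGET = 0.995            # SLO target
good, valid = 99_700, 100_000

error_budget = 1 - TARGET                 # allowed bad ratio: 0.5%
bad_ratio = 1 - good / valid              # observed bad ratio: 0.3%
budget_remaining = 1 - bad_ratio / error_budget

print(f"error budget remaining: {budget_remaining:.0%}")  # 40%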

Slide 27

Error Budget
Allowed unreliability window per reliability level:
Reliability level    Per year        Per quarter      Per 30 days
90%                  36.5 days       9 days           3 days
95%                  18.25 days      4.5 days         1.5 days
99%                  3.65 days       21.6 hours       7.2 hours
99.5%                1.83 days       10.8 hours       3.6 hours
99.9%                8.76 hours      2.16 hours       43.2 minutes
99.95%               4.38 hours      1.08 hours       21.6 minutes
99.99%               52.6 minutes    12.96 minutes    4.32 minutes
99.999%              5.26 minutes    1.30 minutes     25.9 seconds
Error Budget
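All of these windows come from the same arithmetic: (1 - target) multiplied by the window length. A minimal sketch of the derivation; the printed values match two of the rows above.

# Minimal sketch: how the allowed-unreliability windows are derived.
def allowed_downtime(slo: float, window_days: float) -> float:
    """Hours of unreliability allowed by an SLO over a window."""
    return (1 - slo) * window_days * 24

for slo in (0.999, 0.9999):
    print(f"{slo:.3%}: {allowed_downtime(slo, 365):.2f} h/year, "
          f"{allowed_downtime(slo, 30) * 60:.1f} min/30 days")
# 99.900%: 8.76 h/year, 43.2 min/30 days
# 99.990%: 0.88 h/year, 4.3 min/30 days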

Slide 28

Error Budget Policy
What is it?
A team agreement on how we are going to react to Error Budget consumption or depletion.
SLO miss policy:
● “We must work on reliability if...”
● “We may continue to work on non-reliability features if...”
Outage policy (see the sketch below):
● “If a single class of outage consumes more than 20% of error budget over a cycle, we must have an objective to address the issues in the following OKR cycle.”
Escalation policy:
● “In the event of disagreement, the issue should be escalated to the Head of Technology to make a decision.”
Team agreement
● The Error Budget Policy must be a team agreement between all parties (EM, Product, Engineers,...)
Constructive Policy
● The policy is not intended to serve as a punishment for missing SLOs.
● Halting change is undesirable. This policy gives teams permission to focus exclusively on reliability when data indicates that it is more important than other product features.
Iterative policy
● Review every few months.
Error Budget
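A minimal sketch of checking the outage policy quoted above, assuming burned error-budget minutes can be attributed to an outage class; the class names and numbers are hypothetical.

# Minimal sketch: flag outage classes that consumed more than 20% of the
# error budget in a cycle (all numbers are hypothetical).
TARGET = 0.995
window_minutes = 30 * 24 * 60
budget_minutes = (1 - TARGET) * window_minutes   # 216 min of unreliability allowed

# minutes of budget burned per outage class during this cycle
burned_by_class = {"bad deploy": 95, "dependency outage": 30, "capacity": 10}

for outage_class, burned in burned_by_class.items():
    share = burned / budget_minutes
    if share > 0.20:
        print(f"{outage_class}: {share:.0%} of error budget "
              "-> add a reliability objective in the next OKR cycle")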

Slide 29

A small teaser

Slide 30

SLO Dashboard
Datadog SLO Dashboards
● Metric/Monitor based
● Automatic Error Budget calculation
● Integration with 3rd party tools
Small Teaser

Slide 31

Alerting
Opsgenie on-call schedules
● Automatic on-call schedules
● Advanced notification configurations
● Allows escalation policies
● Integration with other tools
Small Teaser

Slide 32

Incident Management & Postmortems
Incident Tickets & Postmortem docs
● API available for integrations
● Create new incidents and postmortems:
  ○ Jira incident ticket
  ○ Blank postmortem from template
  ○ Incident Slack channel
  ○ Pull in relevant people
Small Teaser

Slide 33

Other integrations
Integration examples
● Pending PRs send Slack notifications to the team owning the repo (inc. dependabot)
● Toil work calculation and ops effectiveness
● Jenkins pipelines send Slack notifications to the team owning the project
● Pingdom checks configured through automatic Jira ticket processing
● AWS cost tracking is updated per team/project in a Jupyter Notebook
● ...
Small Teaser

Slide 34

Glue everything together

---
Service:
  product: quizfoo
  name: FooService
  criticality: 3
  description: "Foo Service does Foo things"
  slos:
    - name: "Home page availability"
      description: "Ratio of time home-page was available"
      target: 99
      slis:
        - good: "sum:aws.elb.httpcode_elb_2xx{name:fooservice}.as_count()"
          valid: "sum:aws.elb.request_count{name:fooservice}.as_count()"
  team: foo-fighters
...

The best way to write a book is to write the first line
● Service information resides in a YAML file in the repository:
  ○ Service name
  ○ Service description
  ○ Service criticality
  ○ Owning team
  ○ SLO definitions
  ○ ...
● Service documentation and runbook reside in the repository in a parsable format (Markdown).
● Lambdas parse information from there and generate (see the sketch below):
  ○ Service catalog entry
  ○ Service documentation
  ○ Service runbook
  ○ SLO dashboards
  ○ ...
Small Teaser
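A minimal sketch of the glue idea, not the actual Lambdas from the talk: read the service YAML above and emit a catalog entry plus an SLO summary per definition. The file name, handler shape and output structure are assumptions.

# Minimal sketch of a Lambda-style handler that parses the service YAML
# and produces catalog and SLO summaries (shapes are illustrative).
import yaml  # PyYAML

def handler(event=None, context=None):
    with open("service.yaml") as f:
        spec = yaml.safe_load(f)["Service"]

    catalog_entry = {
        "name": spec["name"],
        "product": spec["product"],
        "criticality": spec["criticality"],
        "team": spec["team"],
    }

    slo_summaries = [
        {
            "slo": slo["name"],
            "target": slo["target"],
            "error_budget_pct": 100 - slo["target"],
            "queries": slo["slis"],
        }
        for slo in spec.get("slos", [])
    ]
    return {"catalog": catalog_entry, "slos": slo_summaries}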

Slide 35

Where do I start?

Slide 36

Do these four things first
1. Hire people who write software
2. Start with Service Level Objectives
3. Ensure parity of respect
4. Provide a feedback loop for self-regulation
Take one step at a time
● Choose one service to run according to the SRE model.
● Empower the team with strong executive sponsorship and support.
● Culture and psychological safety are critical.
● Measure SLOs and team health.
● Incremental progress frees time for more progress.
● Spread the techniques and knowledge once you have a solid case study within your company.
Remember that...
● Automation and engineering for operability enable scaling systems without scaling organisations.
● Tension between product development and operations doesn’t need to exist.
● Error budgets provide measurement and flexibility to deliver both reliability and product velocity.
Where to start

Slide 37

How do we define our first SLOs?
1. Choose one service.
2. Understand your users.
3. For one type of user and their main capability:
   a. What guarantees would they like to have?
   b. What guarantees do they think they have but don’t?
   c. What makes them happy/upset?
4. Choose one thing to measure and how to measure it (SLI).
5. Set a feasible objective based on your experience or an aspirational goal (SLO).
6. Ensure the resulting error budget is accepted as normal.
7. Define actions to take if you fail to deliver the desired level of reliability (EB Policy).
8. Iterate, fine-tune and extend.
What are “users”?
● Any person, or other system, that uses your service.
What guarantees do they think they have?
● Users tend to have unrealistic assumptions (100% availability).
● Think about what would be acceptable, not ideal.
How do we choose an objective?
● You may use historical data to get started (see the sketch below). You will have the chance to iterate and fine-tune the objective later.
Where to start
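A minimal sketch of the "use historical data to get started" advice: pick the highest standard target your measured SLI already meets, so the first error budget is accepted as normal. The candidate targets and the historical value are hypothetical.

# Minimal sketch: suggest an initial SLO target from historical SLI performance,
# rounding down to a standard target so the first objective is achievable.
HISTORICAL_SLI = 0.9987   # hypothetical attainment over the last quarter
CANDIDATE_TARGETS = [0.90, 0.95, 0.99, 0.995, 0.999, 0.9995, 0.9999]

def suggest_target(historical: float) -> float:
    """Pick the highest standard target the service already meets."""
    achievable = [t for t in CANDIDATE_TARGETS if historical >= t]
    return max(achievable) if achievable else CANDIDATE_TARGETS[0]

print(suggest_target(HISTORICAL_SLI))  # 0.995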

Slide 38

Further reference
https://google.com/sre
Where to start

Slide 39

Questions?

Slide 40

Thank You!
Gorka López de Torre Querejazu
Senior Consultant | ThoughtWorks
@gorkaio