[2017.05 Meetup #13] [TALK #2] Ricardo Amaro - SRE

Site Reliability Engineering enables agility and stability. SREs use software engineering to automate themselves out of the job. My advice, if you want to implement this change in your company, is to start with action items, alter your training and hiring, implement error budgets, do blameless postmortems, and reduce toil.

Ricardo Amaro is currently an SRE in one of the largest companies related to the world of free software.

DevOps Lisbon

May 15, 2017

Transcript

  1. Who am I? @DevOps @ricardoamaro Portugal Lisbon Drupal Community Software

    Freedom +9 years Drupal 90’s Linux Adopter 6 years at Acquia Site Reliability Engineer https://drupal.org/user/666176
  2. About Acquia
    Metrics
    ○ Acquia Cloud:
      ◦ # of Instances (18,000+)
      ◦ # of Production Sites (54,000+)
      ◦ # of API Calls (3,000+ per sec)
      ◦ # of Availability Zones (20+)
      ◦ # of Regions (8)
    What do we do? We host the biggest Drupal sites in the world.
  3. We will talk about
    A brief summary inspired by Google’s S.R.E. book
    ○ What is S.R.E.?
    ○ Tenets of S.R.E.
    ○ Reliability & Toil
    ○ Error budget - keeping the Service Level Objective (SLO)
    ○ Development & Operations
    ○ Monitoring and Being On-Call
    ○ Postmortem culture - Learning from failure
  4. Site Reliability Engineering
    ➔ Term coined at Google in 2003, when Ben Treynor was hired to run “production” and ended up “applying software engineering to an operations function”
    ➔ Motivation: “as a software engineer, how would I want to invest my time to accomplish a set of repetitive tasks?”
  5. Site Reliability Engineering
    SREs are engineers who...
    ➔ Apply the principles of computer science and engineering to design and develop large, distributed computing systems.
    ➔ Write software for those systems alongside product developers.
    ➔ Build all additional pieces those systems need, like backups and load balancing.
    ➔ Reuse old solutions for new problems.
  6. DevOps & S.R.E.
    DevOps is a term, coined around 2008, for a practice that encompasses automation of manual tasks, continuous integration and continuous delivery. It applies to a wide audience of companies, whereas SRE might be considered a subset of DevOps that possesses additional skill sets.
    Source: https://en.wikipedia.org/wiki/Site_reliability_engineering
  7. And tools?! What about SRE tools?
    The SRE & DevOps cultures are open-minded. For us, Open Source & Free Software are of course the best options, because:
    - We can study how the program works, and change it so it does what we need.
    - They give the best agility for dealing with deployments, code fixes, debugging, etc.
    - We can give our changes back to the DevOps community.
    SRE is more about Culture and Process
  8. Tenets of SRE
    1. Ensuring a Durable Focus on Engineering
    2. Pursuing Maximum Change Velocity
    3. Monitoring
    4. Emergency Response
    5. Change Management
    6. Demand Forecasting and Capacity Planning
    7. Provisioning
    8. Efficiency and Performance
  9. What is the most important feature of a product?
    The latest feature, or that the product works?
  10. The 80’s: Waterfall software delivery model
    Operations @customer
    ➔ *Provisioning
    ➔ *Installing
    ➔ *Upgrading
    ➔ *Maintaining
    ➔ *Backups/Restore
    ➔ *Scaling
    Source: wikipedia
  11. ...and then came the web...
    • Software as a Service
    • Platform as a Service
    • Cloud computing
    • ...
    ➔ Operations overhead not on the customer side
    ➔ Features could now be delivered faster
    ➔ Customer feedback important for product improvements
    (Diagram labels: Product Development, Ship Features, Operations, Users)
  12. Conflicting incentives
    Dev objectives:
    ➔ Ship new features
    ➔ Launch new products
    Ops objectives:
    ➔ Reliability & Availability
    ➔ Provision & Scale
  13. The problem: Toil* (*exhausting labour)
    ➔ Manual
    ➔ Repetitive
    ➔ Automatable
    ➔ Tactical (unplanned work)
    ➔ No enduring value
    ➔ O(n) with service growth
    (not just “work I don’t like to do.”)
  14. As your business succeeds, workload tends to infinity
    • Cap Ops workload, because if you are successful and your business grows you need to reduce errors and toil. Put a 50% cap on Ops work and leave most of the SRE team’s time for writing code and reducing toil.
    (Chart: O(n) workload/toil over time; x axis: time, y axis: customers/traffic)
  15. Why is less Toil better?
    ➔ Keep operational work (i.e., toil) below 50% of each SRE’s time
    ➔ More than 50% of each SRE’s time is spent on:
      ◆ Engineering project work to reduce toil
      ◆ Adding service features - improving reliability, performance, utilization
    ➔ Improves career planning for the SRE
    ➔ Improves morale in the organization and your business
    ➔ An SRE team can easily devolve into an Ops team if the 50% target is broken
    S.R.E. - A modern solution: not bad...
  16. S.R.E. - A modern solution
    DEV + OPS
    ➔ This conflict is not inevitable
    ➔ The solution is: Error Budgets!
    ➔ Everyone agrees on an Error Budget (as we will explain next)
    ➔ SRE only prevents releases or launches if the Error Budget is exceeded.
  17. What is an Error Budget?
    The business or the product establishes Service Level Objectives (SLOs) for the system, based on Service Level Indicators (SLIs) such as error rate, availability or latency.
    Error Budget Example: A 99.9% availability SLO means that the service can be unavailable 0.1% of the time, which is the error budget.
    100% - 99.9% = 0.1%
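The arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the 30-day measurement window are assumptions, not something the talk specifies.

```python
def error_budget(slo_percent: float) -> float:
    """Return the error budget as a fraction of the window (99.9 -> about 0.001)."""
    if not 0 < slo_percent <= 100:
        raise ValueError("SLO must be a percentage in (0, 100]")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability the budget allows over an assumed window."""
    return error_budget(slo_percent) * window_days * 24 * 60


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9), 1))
```

Expressing the budget in minutes per window makes it easier to discuss with product teams than a bare percentage.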
  18. Use and maintain the Error Budget
    ➔ Error budget can be spent on anything: launching features, velocity, etc.
    ➔ Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors.
    ➔ The goal of the SRE team isn’t “zero outages”: SRE and product devs are incentive-aligned to spend the error budget to get maximum feature velocity.
    ➔ Out of budget? No problem. Do more testing between releases.
    ➔ 100% is the wrong reliability target for basically everything.
    ➔ Set a goal that acknowledges the trade-off and leaves an error budget
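One way to picture "spending" the budget is a small release gate. The sketch below assumes the SLI is a request success rate measured over some window; `remaining_budget` and `may_release` are hypothetical names, not part of any tool mentioned in the talk.

```python
def remaining_budget(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Failures still allowed in the current window (negative when overspent)."""
    allowed_failures = (100.0 - slo_percent) / 100.0 * total_requests
    return allowed_failures - failed_requests


def may_release(slo_percent: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical gate: block launches only once the budget is exceeded."""
    return remaining_budget(slo_percent, total_requests, failed_requests) >= 0


if __name__ == "__main__":
    # With a 99.9% SLO and 1,000,000 requests, about 1,000 failures are allowed.
    print(may_release(99.9, 1_000_000, 400))    # budget left, launches continue
    print(may_release(99.9, 1_000_000, 1_500))  # budget overspent, launches pause
```

The gate depends only on measured data, which is the point of the slide: the decision to pause launches comes from the numbers, not from negotiation.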
  19. Error Budget: a self-regulating mechanism
    ➔ This gives developers an incentive that drives them to value stability (not just change)
    ➔ And gives SREs control that drives them to permit change (not just stability)
    ➔ It forces decisions based on metrics: not politics, not feelings, just data
  20. How are Development & Operations teams organized?
    ➔ Development and SRE teams share a single staffing pool
      ◆ If all is reliable, Devs are rewarded with teammates
      ◆ If Ops is overloaded, SREs are contracted to support code
    Now tell me… Why should I hire you?
  21. Development & Operations
    Systems, code… Are you able to cook also?
    ➔ SREs are developer/sys-admin hybrids
      ◆ They perform more Dev work as things become stable
  22. Highly motivated and effective teamwork
    ➔ SREs can only spend up to 50% of their time on ops work
    ➔ If operational load exceeds 50%, the ops work overflows to Dev
    ➔ Allow SREs to move to other projects
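The 50% cap and the overflow rule can be written down as a small calculation. This is a sketch under assumed units (hours per week); the function name and the default cap parameter are illustrative only.

```python
def ops_overflow_hours(ops_hours: float, total_hours: float, cap: float = 0.5) -> float:
    """Hours of ops work above the cap, which would overflow to the Dev team."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, ops_hours - cap * total_hours)


if __name__ == "__main__":
    # 30 ops hours in a 40-hour week is 10 hours over the 50% cap.
    print(ops_overflow_hours(30, 40))
    # 15 ops hours stays under the cap, so nothing overflows.
    print(ops_overflow_hours(15, 40))
```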
  23. Monitoring and taking action
    ➔ Three valid kinds of monitoring output
      ◆ Alerts: a human needs to take action immediately
        • If you get a huge volume of critical email alerts, disable them and stick with paging
      ◆ Tickets: a human needs to take action eventually
        • On-call engineers can actually accomplish work when they aren’t being kept up by pages at all hours. Ultimately, temporarily backing off on alerts will allow you to make faster progress toward a better service
      ◆ Logging: no action needed
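The three outputs map naturally onto a routing decision. The sketch below is a minimal illustration; the `Output` enum and the two boolean inputs are assumptions chosen to mirror the slide, not the interface of any particular monitoring system.

```python
from enum import Enum


class Output(Enum):
    ALERT = "page a human now"
    TICKET = "queue for a human to handle eventually"
    LOG = "record only, no action needed"


def classify(needs_human: bool, urgent: bool) -> Output:
    """Route a monitoring signal to one of the three valid outputs."""
    if needs_human and urgent:
        return Output.ALERT
    if needs_human:
        return Output.TICKET
    return Output.LOG


if __name__ == "__main__":
    print(classify(needs_human=True, urgent=True).name)
    print(classify(needs_human=True, urgent=False).name)
    print(classify(needs_human=False, urgent=False).name)
```

Anything that does not fit one of these three branches is a candidate for removal from the pipeline.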
  24. Being On-Call
    ➔ Maximum of 2 events per 8–12 hour on-call shift
    ➔ Handle the event accurately and quickly, clean up and restore normal service
    ➔ Conduct postmortems
    ➔ If more than 2 events occur regularly per on-call shift, problems can’t be investigated
    ➔ Pager fatigue also won’t improve with scale
    ➔ If on-call engineers receive fewer than one event per shift, keeping them on point is a waste of their time
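The two thresholds on this slide can be checked against recent shift history. The sketch assumes that "regularly" can be approximated by the average number of events per shift; the function name and the labels it returns are hypothetical.

```python
from typing import Sequence


def shift_load(events_per_shift: Sequence[int]) -> str:
    """Compare average events per on-call shift with the 1 and 2 event thresholds."""
    if not events_per_shift:
        raise ValueError("need at least one shift of data")
    average = sum(events_per_shift) / len(events_per_shift)
    if average > 2:
        return "overloaded"  # too many events to investigate properly
    if average < 1:
        return "underused"   # on-call time is mostly wasted
    return "healthy"


if __name__ == "__main__":
    print(shift_load([3, 4, 3]))
    print(shift_load([0, 0, 1]))
    print(shift_load([1, 2, 2]))
```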
  25. Pager fatigue: a serious problem to be addressed
    ➔ An engineer can only react with urgency a few times a day before they get fatigued
    ➔ Every page should be actionable
    ➔ Every page response should require intelligence
    ➔ Pages should be about a new problem or an event that hasn’t been seen before
  26. Find and eliminate all root causes
    Root Cause Analysis: The Core of Problem Solving and Corrective Action, by Duke Okes
    https://www.amazon.com/Root-Cause-Analysis-Problem-Corrective/dp/0873897641
  27. Playbooks/Runbooks
    ➔ When humans are really necessary, thinking through and recording the best practices ahead of time in a playbook or runbook yields roughly a 3x improvement in Mean Time To Repair (MTTR)
    ➔ SREs write and rely on on-call playbooks/runbooks
    Example: http://docs.ansible.com/ansible/playbooks_intro.html
  28. Monitoring Conclusion
    A healthy monitoring and alerting pipeline should be simple and easy to reason about
    What do I do with this?
    ➔ Always try to have a high-level stack overview
    ➔ Even so, performance analysis of services like databases often must be performed on the system itself
    ➔ A dashboard might also be paired with a log, in order to analyze historical correlations rapidly
  29. Postmortems?! What are Postmortems?
    ➔ A document written for ALL significant incidents
    ➔ Non-paged incidents are even more valuable: they reveal monitoring gaps
    ➔ Explain what happened in detail
    ➔ Find all root causes of the event
    ➔ Assign actions to correct the problem or improve how it is addressed next time
  30. Postmortems Are Blameless!
    ➔ Use a blame-free postmortem culture, with the goal of exposing faults
      ◆ Apply engineering to fix these faults
      ◆ Don’t just try to avoid or minimize them
  31. SERIOUSLY: BLAMELESS!
    The Field Guide to Understanding Human Error, by Sidney Dekker
    https://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265
  32. How to achieve S.R.E.: Treynor’s action items
    ➔ Hire only coders
    ➔ Have Service Level Objectives (SLOs) for your service
    ➔ Measure and report performance against SLOs
    ➔ Use Error Budgets and gate launches on them
    ➔ Have a common staffing pool for SRE and DEV
    ➔ Excess Ops work overflows to the DEV team
    ➔ Cap SRE operational load at 50% and share 5% with the DEV team
    ➔ On-call teams of at least 8 or 6 people in rotation, per product
    ➔ Maximum of 2 events per on-call shift
    ➔ Write postmortems for every event
    ➔ They should be BLAMELESS and focus on process and technology, not people
  33. The S.R.E. Google Book and more resources
    • https://g.co/SREBook
    • There is now #SRE on @hangops Slack. https://t.co/btPgSGkGNz to join.