[2020.01 Meetup] [TALK 2] SRE is pretty much what you make of it - Luís Rodrigues

A personal overview of a journey through SRE principles, ideas and best practices, and how they are actually working across several teams. Presented by an old Linux sysadmin guy.

DevOps Lisbon

January 13, 2020

Transcript

  1. Agenda ▸ What in the world is SRE? ▸ SRE and DevOps ▸ SRE work at OLX Group
  2. Who is this guy: Luis Rodrigues ▸ OLX SRE for 3 years ▸ Former freelance jack of all trades ▸ Failed punk, ex-geologist wannabe ▸ Opinions are my own! But based on a lot of other people's ▸ You can find me at @luisvegeta
  3. Sysadmin life before SRE: Things break. Break again. And again. Sysadmins overloaded. Constant firefighting. Waiting in ticket queues for everything. Everyone is busy, but it doesn’t get any better. Everything takes too long, costs too much and breaks too often!
  4. And everything changes! Things break. Break again. And again. SREs overloaded. Constant firefighting. Waiting in ticket queues for everything. Everyone is busy, but it doesn’t get any better. Everything takes too long, costs too much and breaks too often!
  5. “Changing job titles or adding individual skills doesn’t make systems administrators SREs.” Damon Edwards, Co-Founder of Rundeck Inc
  6. Google created SRE: “In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.” Ben Treynor, VP of Engineering at Google
  7. And its principles
    ▸ Embracing Risk: Assessment, management and error budgets
    ▸ Service Level Objectives: Service indicators from objectives and agreements
    ▸ Toil: Eliminate repetitive work that scales with service growth
    ▸ Monitoring: Reliability comes from a good ability to observe
    ▸ Release Engineering: Changes provoke most of the outages
    ▸ Automation: Automate as much as possible
    ▸ Simplicity: As simple as possible, no simpler
  8. And Google engineers (literally) wrote the book on it ▸ SLIs, SLOs, SLAs, What?! ▸ https://sre.xyz
  9. Site Reliability Principles ▸ SRE needs Service Level Objectives, with consequences ▸ SREs have time to make tomorrow better than today ▸ SRE teams have the ability to regulate their workload
  10. SRE needs SLOs, with consequences
    ▸ Service-Level Objective (SLO): A target value or range for a service level measured by an SLI. Examples: your service is up enough; your HTTP server responds with success often and fast enough
    ▸ Error Budget: Represents the amount of failure we expect to actually have
    ▸ Service-Level Indicator (SLI): A quantitative measure of some aspect of the level of service that is provided. “A quantifiable measure of service reliability”. Examples: request latency, throughput, availability, error rate
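The relationship between an SLI, an SLO and an error budget on this slide can be sketched in a few lines of Python. This is a minimal illustration, not something taken from the deck: the function names, the request counts and the 99.9% target are all assumptions chosen for the example, and the SLI used is plain request availability.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed (assumed convention)
    return good_requests / total_requests


def allowed_failures(slo_target: float, total_requests: int) -> float:
    """Error budget: how many failed requests the SLO tolerates in the window."""
    return (1.0 - slo_target) * total_requests


def budget_consumed(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    budget = allowed_failures(slo_target, total_requests)
    if budget == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return float("inf") if failed_requests else 0.0
    return failed_requests / budget


# Hypothetical window: 1,000,000 requests, 500 failures, 99.9% availability SLO.
total, failed, slo = 1_000_000, 500, 0.999
sli = availability_sli(total - failed, total)
print(f"SLI: {sli:.4%}, SLO met: {sli >= slo}")
print(f"Error budget consumed: {budget_consumed(slo, total, failed):.1%}")
```

The "with consequences" part is a team policy rather than arithmetic: for instance, a team might decide to slow down releases once the consumed fraction approaches 1.0.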
  11. SREs have time to make tomorrow better than today: SRE teams need to be able to both run your systems and make them better. They can’t be buried in operational work.
    [Diagram labels: Toil; Engineering work (reduce toil, improve the business); E.W. (no capacity to improve the business); Toil (no capacity to reduce toil)]
  12. SRE teams have the ability to regulate their workload: Prioritize giving the most mission-critical systems to your SREs. Teams need space to flourish and grow. Share the responsibility of running services with the rest of the dev team.
  13. SRE vs DevOps?
    ▸ SRE (Reliability): Operations, Incident management, Post Mortems, Monitoring/Alerting, Capacity planning
    ▸ DevOps (Delivery Speed): Delivery, Release automation, Environment builds, Config management, Infrastructure as code
  14. SRE and DevOps: teams org model
    [Org chart labels: SRE team; Cross-functional team #1 to #4; Development team #1 to #4; SRE team; Squad #1 to #4]
    ▸ Clear handoff requirements
    ▸ Error budget consequences
  15. We already do DevOps! Can we start doing SRE? Forty-six percent of the principles in the book work out of the box. Fifty percent of the principles are good advice. There’s a small number (4%) that you should not execute. Source: Forrester research blog post
  16. Part 3: SRE work at OLX Group ▸ What we do, did and what we are planning to do
  17. SRE team and development packs
    ▸ Head of Infrastructure
    ▸ HUB #1, HUB #2, HUB #3: each with SRE leads and Pack #1, Pack #2, Pack #3, Pack #N
  18. SRE principles at OLX
    ▸ Automation: Automate as much as possible. Automate everything!
    ▸ Everything is ephemeral: Servers will die on you, network will fail. Maybe even DNS.
    ▸ Infrastructure as code: Everything must be declared, pull request approvals, etc.
    ▸ Monitoring: Everything is monitored. RED metrics for all services
    ▸ Incident management: SREs and Devs on call for all services. All user-impacting incidents require a postmortem and action points
    ▸ Alerting: Smart alerts through PagerDuty and Slack
  19. Automation: Atlantis ▸ Terraform Pull Request Automation ▸ Make Terraform changes visible to your team. Enable all engineers to collaborate on Terraform. Standardize your Terraform workflows. ▸ https://www.runatlantis.io
  20. Monitoring: guidelines
    ▸ USE Method: Utilization, Saturation, Errors
    ▸ RED Method: Requests, Errors, Duration
    ▸ Four Golden Signals: Latency, traffic, errors, and saturation
    ▸ RED Method
      ▹ Rate: the number of requests, per second, your services are serving
      ▹ Errors: the number of failed requests per second
      ▹ Duration: distributions of the amount of time each request takes
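As a rough illustration of the three RED numbers defined on this slide, the sketch below computes them from a batch of request records in Python. Everything here is assumed for the example: the `Request` record, the `red_metrics` helper, the nearest-rank percentile and the sample data are hypothetical, and a real service would normally export counters and histograms to its monitoring system instead of computing these by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    failed: bool        # whether the request ended in an error


def percentile(values, pct):
    """Nearest-rank percentile; returns None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_metrics(requests, window_seconds):
    """Rate, Errors and Duration for requests observed during the window."""
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": sum(1 for r in requests if r.failed) / window_seconds,
        "duration_p50_ms": percentile(durations, 50),
        "duration_p95_ms": percentile(durations, 95),
    }


# Hypothetical sample: four requests seen in a two-second window.
sample = [
    Request(10, False),
    Request(20, False),
    Request(30, True),
    Request(40, False),
]
print(red_metrics(sample, window_seconds=2))
```

Duration is reported as percentiles rather than an average because the slide asks for a distribution; the choice of p50 and p95 is only one common option.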
  21. Incident Management ▸ Classification: P1, P2 or bug ▸ Incident triggering ▹ Monitoring ▹ Slack bot ▸ Incident handling ▹ On-call or dev teams ▹ First Responders Team (FRT) ▹ War rooms ▸ Blameless Postmortems
  22. Simplicity: less is more ▸ Stack level ▹ From datacenter to managed K8s ▹ Removed and simplified several layers ▸ Application level ▹ Adoption of managed services ▸ Monitoring level ▹ RED method ▹ Unification of tools