[2020.01 Meetup] [TALK 2] SRE is pretty much what you make of it - Luís Rodrigues

Slide 1

Slide 1 text

SRE is pretty much what you make of it DevOps meetup Lisbon

Slide 2

Slide 2 text

Agenda
▸ What in the world is SRE?
▸ SRE and DevOps
▸ SRE work at OLX Group

Slide 3

Slide 3 text

Who is this guy
Luis Rodrigues
▸ OLX SRE for 3 years
▸ Former freelance jack of all trades
▸ Failed punk, ex-geologist wannabe
▸ Opinions are my own! But based on a lot of other people’s
▸ You can find me at @luisvegeta

Slide 4

Slide 4 text

Sysadmin life before SRE
Things break. Break again. And again.
Sysadmins overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Everyone is busy, but it doesn’t get any better.
Everything takes too long, costs too much and breaks too often!

Slide 5

Slide 5 text

Then the company decides to implement SRE

Slide 6

Slide 6 text

And everything changes!
Things break. Break again. And again.
SRE overloaded. Constant firefighting.
Waiting in ticket queues for everything.
Everyone is busy, but it doesn’t get any better.
Everything takes too long, costs too much and breaks too often!

Slide 7

Slide 7 text

“Changing job titles or adding individual skills doesn’t make systems administrators SREs.”
Damon Edwards, Co-Founder of Rundeck Inc

Slide 8

Slide 8 text

1 What in the world is SRE? Principles, ideas and sources of information

Slide 9

Slide 9 text

Google created SRE
“In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”
Ben Treynor, VP of Engineering at Google

Slide 10

Slide 10 text

And its principles
▸ Embracing Risk: Assessment, management and error budgets
▸ Service Level Objectives: Service indicators from objectives and agreements
▸ Toil: Eliminate repetitive work that scales with service growth
▸ Monitoring: Reliability comes from a good ability to observe
▸ Release Engineering: Changes provoke most of the outages
▸ Automation: Automate as much as possible
▸ Simplicity: As simple as possible, no simpler

Slide 11

Slide 11 text

And Google engineers (literally) wrote the book on it
SLIs, SLOs, SLAs, What?!
https://sre.xyz

Slide 12

Slide 12 text

Site Reliability Principles
▸ SRE needs Service Level Objectives, with consequences
▸ SREs have time to make tomorrow better than today
▸ SRE teams have the ability to regulate their workload

Slide 13

Slide 13 text

SRE needs SLOs, with consequences
Service-Level Indicator (SLI)
A quantitative measure of some aspect of the level of service that is provided. “A quantifiable measure of service reliability”
Examples: request latency, throughput, availability, error rate
Service-Level Objective (SLO)
A target value or range for a service level measured by an SLI.
- Your service is up enough.
- Your HTTP server responds with success often and fast enough.
Error Budget
Represents the amount of failure we expect to actually have.
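A minimal sketch of how the three definitions on this slide fit together, assuming an availability SLI computed from request counts over one window. This is not from the talk: the `SloReport` and `evaluate_slo` names and the 99.9% target in the usage lines are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float             # measured success ratio for the window, 0.0 to 1.0
    error_budget: int      # failed requests the SLO tolerates in the window
    budget_remaining: int  # negative once the budget is overspent
    slo_met: bool


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare an availability SLI against an SLO target for one window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")

    # SLI: the share of requests that succeeded.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the share of requests the SLO allows to fail, as a count.
    error_budget = round(total_requests * (1.0 - slo_target))
    budget_remaining = error_budget - failed_requests
    return SloReport(sli, error_budget, budget_remaining, budget_remaining >= 0)


if __name__ == "__main__":
    # Hypothetical numbers: one million requests, 400 failures, 99.9% target.
    print(evaluate_slo(1_000_000, 400, 0.999))
```

The "consequences" part of the slide title would hang off `budget_remaining`: what a team does once it goes negative is a policy decision that the code above deliberately leaves out.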

Slide 14

Slide 14 text

SREs have time to make tomorrow better than today
SRE teams need to be able to both run your systems and make them better. They can’t be buried in operational work.
[Diagram: two splits of a team's time between toil and engineering work (E.W.)]
▸ Toil alongside engineering work: reduce toil, improve the business
▸ Toil crowding out E.W.: no capacity to reduce toil, no capacity to improve the business
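One way to make "buried in operational work" measurable is to track the share of a team's hours that goes to toil. The sketch below is an illustration only: the function names are hypothetical, and the default 50% ceiling is an assumption borrowed from the Google SRE book, not a figure given on this slide.

```python
def toil_share(toil_hours: float, engineering_hours: float) -> float:
    """Return the fraction of tracked hours that went to toil."""
    total = toil_hours + engineering_hours
    if toil_hours < 0 or engineering_hours < 0 or total <= 0:
        raise ValueError("hours must be non-negative and sum to more than zero")
    return toil_hours / total


def has_capacity_to_improve(
    toil_hours: float,
    engineering_hours: float,
    max_toil_share: float = 0.5,  # assumed ceiling, not stated in the talk
) -> bool:
    """True while toil stays at or below the chosen ceiling."""
    return toil_share(toil_hours, engineering_hours) <= max_toil_share
```

A team logging 10 hours of toil against 30 hours of engineering work sits at a 25% toil share. A team at 35 hours against 5 is in the second state of the diagram.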

Slide 15

Slide 15 text

SRE teams have the ability to regulate their workload
▸ Prioritize giving the most mission-critical systems to your SREs.
▸ Teams need space to flourish and grow.
▸ Share the responsibility of running services with the rest of the dev team.

Slide 16

Slide 16 text

2 SRE and DevOps class SRE implements DevOps

Slide 17

Slide 17 text

SRE vs DevOps?
SRE (Reliability)
- Operations
- Incident management
- Post Mortems
- Monitoring/Alerting
- Capacity planning
DevOps (Delivery Speed)
- Delivery
- Release automation
- Environment builds
- Config management
- Infrastructure as code

Slide 18

Slide 18 text

SRE and DevOps: teams org model
[Diagram labels]
▸ SRE team; Cross-functional teams #1 to #4
▸ Development teams #1 to #4
▸ SRE team; Squads #1 to #4
▸ Clear handoff requirements
▸ Error budget consequences

Slide 19

Slide 19 text

We already do DevOps! Can we start doing SRE?
▸ Forty-six percent of the principles in the book work out of the box
▸ Fifty percent of the principles are good advice
▸ There’s a small number, 4%, that you should not execute.
Source: Forrester research blog post

Slide 20

Slide 20 text

3 SRE work at OLX Group What we do, did and what we are planning to do

Slide 21

Slide 21 text

SRE team and development packs
[Diagram: org chart]
Head of Infrastructure
▸ HUB #1: SRE leads, Pack #1, Pack #2, Pack #3, Pack #N
▸ HUB #2: SRE leads, Pack #1, Pack #2, Pack #3, Pack #N
▸ HUB #3: SRE leads, Pack #1, Pack #2, Pack #3, Pack #N

Slide 22

Slide 22 text

SRE principles at OLX
▸ Automation: Automate as much as possible. Automate everything!
▸ Everything is ephemeral: Servers will die on you, network will fail. Maybe even DNS.
▸ Infrastructure as code: Everything must be declared, pull request approvals, etc.
▸ Monitoring: Everything is monitored. RED metrics for all services.
▸ Incident management: SREs and Devs on call for all services. All user-impacting incidents require a postmortem and action points.
▸ Alerting: Smart alerts through PagerDuty and Slack.

Slide 23

Slide 23 text

Automation: Atlantis
Terraform Pull Request Automation
▸ Make Terraform changes visible to your team.
▸ Enable all engineers to collaborate on Terraform.
▸ Standardize your Terraform workflows.
https://www.runatlantis.io

Slide 24

Slide 24 text

Monitoring: guidelines
- USE Method: Utilization, Saturation, Errors
- RED Method: Requests, Errors, Duration
- Four Golden Signals: Latency, traffic, errors, and saturation
RED Method
- Rate: the number of requests, per second, your services are serving.
- Errors: the number of failed requests per second.
- Duration: distributions of the amount of time each request takes.
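The three RED definitions on this slide can be sketched as a summary over one observation window. This is an illustration under stated assumptions, not OLX's tooling: the `Request` record, the `red_metrics` function and the choice of p50/p99 as the reported percentiles are all hypothetical.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    duration_ms: float
    failed: bool


def red_metrics(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize one observation window using the RED method."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    durations = sorted(r.duration_ms for r in requests)
    failures = sum(1 for r in requests if r.failed)

    # Duration is a distribution, so report percentiles rather than a mean.
    if len(durations) >= 2:
        cuts = quantiles(durations, n=100, method="inclusive")
        p50, p99 = cuts[49], cuts[98]
    else:
        p50 = p99 = durations[0] if durations else 0.0

    return {
        "rate_per_s": len(requests) / window_seconds,  # Rate
        "errors_per_s": failures / window_seconds,     # Errors
        "duration_p50_ms": p50,                        # Duration
        "duration_p99_ms": p99,
    }
```

In production these numbers would normally come from a metrics system rather than from raw request records held in memory. The sketch only shows what each of the three letters measures.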

Slide 25

Slide 25 text

Incident Management
▸ Classification: P1, P2 or bug
▸ Incident triggering
  ▹ Monitoring
  ▹ Slack bot
▸ Incident handling
  ▹ On call or dev teams
  ▹ First Responders Team (FRT)
  ▹ War rooms
▸ Blameless Postmortems

Slide 26

Slide 26 text

Simplicity: less is more
▸ Stack level
  ▹ From datacenter to managed K8s
  ▹ Removed and simplified several layers
▸ Application level
  ▹ Adoption of managed services
▸ Monitoring level
  ▹ RED method
  ▹ Unification of tools

Slide 27

Slide 27 text

THANKS!
Any questions?
You can find me at:
▸ @luisvegeta
▸ [email protected]