SRE: Mitos e Lendas (SRE: Myths and Legends)

https://www.youtube.com/watch?v=fsTpRx8Pt-k - Love DevOps? Wait until you meet SRE (Nick Wright - Atlassian)

KEY PRINCIPLES
• SLI - Service Level Indicator
• SLO - Service Level Objective
• SLA - Service Level Agreement
• Error Budget

SLI - SERVICE LEVEL INDICATOR
• How well is your service doing?
• Is its performance acceptable or not?
• Metrics
• Perceived quality from the end user's perspective
• Common SLIs: Availability, Throughput, Latency, etc.

SLI EXAMPLE: AVAILABILITY
• If we were operating a service that was offline for maintenance for about 53 minutes over the course of the last year, we could claim that the service had 99.99% availability for the same period.
• Availability Table: https://landing.google.com/sre/sre-book/chapters/availability-table/
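The arithmetic behind the 53-minute figure can be checked with a minimal sketch. The function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def availability_pct(downtime_minutes: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def allowed_downtime_minutes(target_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime that a given availability target tolerates over the period."""
    return period_minutes * (1 - target_pct / 100.0)


# 53 minutes of downtime in a year rounds to 99.99% availability.
print(round(availability_pct(53), 2))
# A 99.99% target tolerates roughly 52.56 minutes of downtime per year.
print(round(allowed_downtime_minutes(99.99), 2))
```

Strictly speaking, 53 minutes is slightly over the 52.56 minutes that 99.99% allows, which is why the slide says "about".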

SLO - SERVICE LEVEL OBJECTIVE
• For a certain window of time, this is my target
• How well am I doing against it?
• Generally consists of 3 parts:
  - Description of the thing that we are measuring (SLI)
  - Expected service level expressed as a percentage
  - Period over which the measurement takes place

SLO EXAMPLE
• The system's uptime, when measured over a period of a single month, must be at least 99%
• The response time for 95% of service requests to X, when measured over a period of a year, must not exceed 100 ms
• The CPU utilisation for the database, when measured over a period of a day, must be in the range [40%-70%]
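The three-part shape of an SLO (SLI, target percentage, measurement window) can be sketched as a small data structure. The `SLO` class, the `pct_within` helper, and the sample latencies below are hypothetical and used only for illustration; they are not from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    sli: str           # description of the thing being measured
    target_pct: float  # expected service level, as a percentage
    window: str        # period over which the measurement takes place

    def is_met(self, measured_pct: float) -> bool:
        """True when the measured level reaches the target."""
        return measured_pct >= self.target_pct


def pct_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Percentage of requests whose latency does not exceed the threshold."""
    if not latencies_ms:
        raise ValueError("no measurements in the window")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return 100.0 * good / len(latencies_ms)


uptime_slo = SLO("system uptime", 99.0, "1 month")
latency_slo = SLO("requests to X answered within 100 ms", 95.0, "1 year")

# Made-up sample: 8 of these 10 requests finish within 100 ms.
sample_ms = [20, 35, 50, 80, 90, 95, 99, 100, 120, 300]
measured = pct_within(sample_ms, 100)

print(measured, latency_slo.is_met(measured))
print(uptime_slo.is_met(99.5))
```

Keeping the window as part of the definition matters: the same measured percentage can meet a monthly objective and miss a yearly one.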

SLA - SERVICE LEVEL AGREEMENT
• Contract between a service provider and one or more service consumers.
• States what you are willing to do (refund money, for example) if you fail to meet the objectives.
• The SLA outlines a set of SLOs that have to be met and the consequences of both meeting and failing to meet them

SLA EXAMPLE
• "Teams will resolve reported issues with Product X within 24 hours." But that same SLA doesn't spell out what happens if the client takes 24 hours to send answers or screenshots to help your team diagnose the problem. Does it mean the team's 24-hour window has been eaten up by client slow-downs, or does the clock start and stop based on when clients respond?
• PRO-TIP: tech should be involved in the creation of SLAs
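The two readings of the 24-hour clause give different answers for the same ticket, as this minimal sketch shows. The function names and the timestamps are made up for illustration, and the client wait intervals are assumed not to overlap and to fall inside the ticket's lifetime.

```python
from datetime import datetime, timedelta

SLA_LIMIT = timedelta(hours=24)


def elapsed_wall_clock(opened: datetime, resolved: datetime) -> timedelta:
    """Reading 1: the clock never stops."""
    return resolved - opened


def elapsed_excluding_waits(opened: datetime, resolved: datetime,
                            client_waits: list[tuple[datetime, datetime]]) -> timedelta:
    """Reading 2: the clock pauses while the team waits on the client."""
    waiting = sum((end - start for start, end in client_waits), timedelta())
    return (resolved - opened) - waiting


opened = datetime(2024, 1, 1, 9, 0)
resolved = datetime(2024, 1, 2, 15, 0)  # 30 hours after opening
waits = [(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 0, 0))]  # 12 hours

print(elapsed_wall_clock(opened, resolved) <= SLA_LIMIT)              # breached
print(elapsed_excluding_waits(opened, resolved, waits) <= SLA_LIMIT)  # met
```

Whichever reading is chosen, writing it into the SLA removes the ambiguity before a dispute arises.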

ERROR BUDGET
• How to measure the quality of the service? Availability = (Nº of GOOD interactions) / (Nº of TOTAL interactions). POs and SREs define an availability target
• Error Budget = (100% reliability) - (SLO). Example: an SLO of 99.9% leaves about 43 min of error budget per month
• Developers can manage the risk themselves: they decide how to spend their error budget
• Developer teams become self-policing
• Shared responsibility for system uptime: infrastructure failures eat into the developers' error budget
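The two formulas on this slide can be written out as a short sketch. The function names, the 30-day month, and the request counts are assumptions made for illustration.

```python
def availability_pct(good: int, total: int) -> float:
    """Availability = good interactions / total interactions, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * good / total


def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Error budget = 100% - SLO, expressed as minutes over the window."""
    return window_days * 24 * 60 * (100.0 - slo_pct) / 100.0


def budget_consumed(slo_pct: float, good: int, total: int) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    allowed_failures = total * (100.0 - slo_pct) / 100.0
    return (total - good) / allowed_failures


# A 99.9% SLO over a 30-day month leaves about 43.2 minutes of budget.
print(round(error_budget_minutes(99.9), 1))

# Made-up traffic: 1,000,000 requests, 500 of them failed.
print(round(availability_pct(999_500, 1_000_000), 2))
print(round(budget_consumed(99.9, 999_500, 1_000_000), 2))
```

With these numbers the service sits at 99.95% availability and has spent about half of its budget.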

ERROR BUDGET
• As long as there is error budget remaining, developers can ship new features to improve the overall quality of the product
• Ops engineers can focus more heavily on long-term reliability projects, such as database maintenance and process automation
• But when the error budget begins running low, developers will need to slow down or freeze feature work, and work closely with the ops team to restabilize the system before any SLAs or SLOs are violated.
• Error budgets act as a quantifiable method for aligning the work and goals of Developers and Ops engineers.
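The ship / slow down / freeze behaviour described above could be encoded as a simple policy function. This is only a sketch: the `release_policy` name and the 25% "running low" threshold are hypothetical, and each team would pick its own cut-off.

```python
def release_policy(budget_remaining: float, slow_down_below: float = 0.25) -> str:
    """Map the remaining error budget (as a fraction of the total) to an action.

    The 25% default for `slow_down_below` is an illustrative assumption,
    not a value taken from the talk.
    """
    if budget_remaining <= 0:
        return "freeze"      # budget exhausted: stop feature work, restabilize
    if budget_remaining < slow_down_below:
        return "slow down"   # budget running low: reduce release cadence
    return "ship"            # budget available: keep shipping features


for remaining in (0.8, 0.1, 0.0):
    print(remaining, release_policy(remaining))
```

Because the input is a single number both teams agree on, the decision to slow down stops being a negotiation between developers and ops.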

"SRE is about finding the ideal CADENCE for our service or software. It is the right balance between innovation (shipping new features) and stability." - Andre Almar

JOB TITLES
• Site Reliability Engineer
• Production Engineer
• Infrastructure Engineer
• Systems Engineer
• DevOps Engineer
• Cloud Operations Engineer
• Cloud Engineer
• Operations Engineer
• DevOps Analyst

SREs are very focused on efficiency, automation, and reducing costs: taking manual and repetitive tasks and automating them. There is an emphasis on not reinventing the wheel. For example, there is one way to do monitoring, so people can just use the monitoring solution and go and do other stuff.

https://github.com/andrealmar/sre-university

KEY AREAS
• Software engineering
• Distributed systems design
• Operating systems
• Networking
• Databases
• Security
• Reliability best practices
• Troubleshooting
• Customer support

SRE COMPATIBILITY QUIZ
1. Do you like thinking about large-scale problems that have a lot of moving parts?
2. Do you like thinking about how to make large systems more reliable?
3. Are you okay with working on software that will likely never be overtly seen by an external user?
4. Do you enjoy looking at a terminal for long periods of time?
5. Do you enjoy the process of diagnosing and fixing a problem? If yes, what if the diagnosis involves system-level problems that you cannot always see?
6. Do you enjoy thinking about system information (e.g. disk space, CPU, OS, kernel, etc.) and system-level functionality (e.g. ssh, proc, cron, swaps, etc.)?
7. Are you comfortable with the idea of being "on-call", where you are likely to be in a high-stakes scenario in which something needs to be fixed?
8. Are you able to stay calm under pressure?
9. Do you approach problems in a logical, process-oriented way?
10. Are you comfortable attempting a problem that has never been solved before?
11. Are you someone who thinks about how you can make things better?

[email protected], @andrealmar_ on Twitter, @andrealmar on Instagram, andrealmar.com
This presentation is available at: https://github.com/andrealmar/talks

Andre Almar
