What is an SRE? Where do they live? What do they eat? In this talk you will learn more about this unique creature...
MYTHS AND LEGENDS
https://www.youtube.com/watch?v=fsTpRx8Pt-k - Love DevOps? Wait until you meet SRE (Nick Wright - Atlassian)
● SLI - Service Level Indicator
● SLO - Service Level Objective
● SLA - Service Level Agreement
● Error Budget
● How well is your service doing?
● Is its performance acceptable or not?
● Perceived quality from the end user's perspective
● Common SLIs: availability, throughput, latency, etc.
SLI - SERVICE LEVEL INDICATOR
● If we were operating a service that was offline for
maintenance for about 53 minutes over the course of the
last year, we could claim that the service had 99.99%
availability for the same period.
● Availability Table:
SLI EXAMPLE: AVAILABILITY
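The arithmetic behind the 53-minute example can be sketched in a few lines of Python. This is only an illustration, not the table from the original slide: the function name `allowed_downtime_minutes` and the list of targets are assumptions, and the year is taken as 365 days.

```python
# Minimal sketch: downtime allowed per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime permitted in the period."""
    return period_minutes * (1 - availability_pct / 100)


# Illustrative targets only (hypothetical choice of "nines").
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min/year")
```

At 99.99% the result is roughly 52.6 minutes per year, which matches the "about 53 minutes" quoted above.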
● For a given window of time, this is my target
● How well am I doing against it?
● Generally consists of 3 parts:
- Description of the thing that we are measuring (SLI)
- Expected service level expressed as a percentage
- Period where the measurement takes place
SLO - SERVICE LEVEL OBJECTIVE
● The system’s uptime when measured in a period of a
single month must be at least 99%
● The response time for 95% of service requests to X,
when measured in a period of a year, must not exceed
● The CPU utilisation for the database, when measured in a
period of a day, must be in the range [40%-70%]
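The three parts of an SLO (what is measured, the expected level, and the measurement period) map naturally onto a small record type. The sketch below is one possible representation, assuming a "higher is better" indicator such as uptime; the class name `SLO`, its fields, and the measured values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    sli: str           # description of the thing being measured
    target_pct: float  # expected service level, as a percentage
    window: str        # period over which the measurement takes place

    def is_met(self, measured_pct: float) -> bool:
        """True when the measured level reaches the target (higher is better)."""
        return measured_pct >= self.target_pct


# The first example above: uptime over a single month must be at least 99%.
uptime_slo = SLO(sli="system uptime", target_pct=99.0, window="1 month")

print(uptime_slo.is_met(99.4))  # measured value is made up for illustration
print(uptime_slo.is_met(98.7))  # measured value is made up for illustration
```

A range-based objective such as the CPU utilisation example would need a lower and an upper bound instead of a single target, which this sketch does not cover.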
● Contract between a service provider and one or more service
● Willing to do (refund money for example) if you are failing to meet the
● The SLA outlines a set of SLOs that have to be met and the
consequences for both meeting and failing to meet them
SLA - SERVICE LEVEL AGREEMENT
● Teams will resolve reported issues with Product X within 24 hours.
But that same SLA doesn’t spell out what happens if the client takes 24 hours to
send answers or screenshots to help your team diagnose the problem.
Does it mean the team’s 24-hour window has been eaten up by client slow-downs, or does
the clock start and stop based on when clients respond?
● PRO-TIP: tech should be involved in the creation of SLAs
● How to measure the quality of the service?
Availability = (Nº of GOOD interactions) / (Nº of TOTAL interactions)
POs and SREs define an availability target
● Error Budget = (100% reliability) - (SLO)
Example: a 99.9% SLO = about 43 min of error budget a month
● Developers can manage the risk themselves: decide how to spend their
● Developer teams become self-policing
● Shared responsibility for system uptime: infrastructure failures eat
into the devs' error budget
● As long as there is error budget remaining, developers can ship new
features to improve the overall quality of the product
● Ops engineers can focus more heavily on long-term reliability projects,
such as database maintenance and process automation
● But when the error budget begins running low, developers will need to slow
down or freeze feature work — and work closely with the ops team to
restabilize the system before any SLAs or SLOs are violated.
● Error budgets act as a quantifiable method for aligning the work and goals
of developers and ops engineers.
"SRE is about finding the ideal CADENCE for our
service or software. It is the right
balance between innovation (shipping
new features) and stability." - Andre
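The error budget arithmetic and the ship-or-freeze policy described above can be sketched as follows. This is a rough illustration under stated assumptions: availability is the ratio of good to total interactions, a month is 30 days, and the `freeze_below` threshold (10% of the budget left) is a hypothetical policy value, not something the talk prescribes.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes, assuming a 30-day month


def availability_pct(good: int, total: int) -> float:
    """Availability = good interactions / total interactions, as a percentage."""
    return 100.0 if total == 0 else 100.0 * good / total


def error_budget_minutes(slo_pct: float,
                         period_minutes: int = MINUTES_PER_MONTH) -> float:
    """Error budget = (100% reliability) - (SLO), expressed in minutes."""
    return period_minutes * (100.0 - slo_pct) / 100.0


def budget_remaining(slo_pct: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be strictly between 0 and 100")
    allowed_bad = total * (100.0 - slo_pct) / 100.0
    if allowed_bad == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1.0 - (total - good) / allowed_bad


def release_decision(remaining: float, freeze_below: float = 0.1) -> str:
    """Hypothetical policy: freeze feature work when little budget is left."""
    return "ship features" if remaining > freeze_below else "freeze and stabilize"


remaining = budget_remaining(99.9, good=999_500, total=1_000_000)
print(f"budget: {error_budget_minutes(99.9):.1f} min/month")
print(f"remaining: {remaining:.0%} -> {release_decision(remaining)}")
```

In practice, the threshold and the action taken when the budget runs low would be agreed between the product, development, and ops teams rather than hard-coded.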
● Site Reliability Engineer
● Production Engineer
● Infrastructure Engineer
● Systems Engineer
● DevOps Engineer
● Cloud Operations Engineer
● Cloud Engineer
● Operations Engineer
● DevOps Analyst
SREs are very focused on efficiency, automation, and reducing costs—taking
manual and repetitive tasks and automating them. There is an emphasis on not
reinventing the wheel. For example, there is one way to do monitoring. People can
just use the monitoring solution and go and do other stuff.
● Software engineering
● Distributed systems design
● Operating systems
● Reliability best practices
● Customer support
SRE Compatibility Quiz
1. Do you like thinking about large scale problems that have a lot of moving parts?
2. Do you like thinking about how to make large systems more reliable?
3. Are you okay with working on software that will likely never be overtly seen by an external user?
4. Do you enjoy looking at a terminal for large amounts of time?
5. Do you enjoy the process of diagnosing and fixing a problem? If yes, what if the diagnosis involves
system level problems that you cannot always see?
6. Do you enjoy thinking about system information (e.g. disk space, CPU, OS, kernel, etc.) and system level
functionality (e.g. ssh, proc, cron, swaps, etc.)?
7. Are you comfortable with the idea of being “on-call”, in which you are likely to be in a high-stakes scenario
where something needs to be fixed?
8. Are you able to stay calm under pressure?
9. Do you approach problems in a logical, process-oriented way?
10. Are you comfortable attempting a problem that has never been solved before?
11. Are you someone who thinks about how you can make things better?
@andrealmar_ on Twitter,
@andrealmar on Instagram
This presentation is available at: