Upgrade to Pro — share decks privately, control downloads, hide ads and more …

SRE: Mitos e Lendas

Andre Almar
November 10, 2020

SRE: Mitos e Lendas

O que é SRE? Onde vivem? O que comem? Nessa talk você vai aprender mais sobre esse ser único....

Andre Almar

November 10, 2020
Tweet

More Decks by Andre Almar

Other Decks in Technology

Transcript

  1. SRE:
    MITOS E LENDAS

    View Slide

  2. View Slide

  3. https://www.youtube.com/watch?v=fsTpRx8Pt-k - Love DevOps? Wait until you meet SRE (Nick Wright - Atlassian)

    View Slide

  4. ● SLI - Service Level Indicator
    ● SLO - Service Level Objective
    ● SLA - Service Level Agreement
    ● Error Budget
    KEY PRINCIPLES

    View Slide

  5. ● How well your service is doing?
    ● Performance it's acceptable or not?
    ● Metrics
    ● Perceived quality from the end-users perspective
    ● Common SLI's: Availability, Throughput, Latency, etc
    SLI - SERVICE LEVEL INDICATOR

    View Slide

  6. ● If we were operating a service that was offline for
    maintenance for about 53 minutes over the course of the
    last year, we could claim that the service had 99.99%
    availability for the same period.
    ● Availability Table:
    https://landing.google.com/sre/sre-book/chapters/availability-table/
    SLI EXAMPLE: AVAILABILITY

    View Slide

  7. ● At a certain window of time this is my target
    ● How well I'm doing against it?
    ● Generally consists of 3 parts:
    - Description of the thing that we are measuring (SLI)
    - Expected service level expressed as a percentage
    - Period where the measurement takes place
    SLO - SERVICE LEVEL OBJECTIVE

    View Slide

  8. ● The system’s uptime when measured in a period of a
    single month must be at least 99%
    ● The response time for 95% of service requests to X,
    when measured in a period of a year, must not exceed
    100ms
    ● The cpu utilisation for the database, when measured in a
    period of a day, must be in the range [40%-70%]
    SLO EXAMPLE

    View Slide

  9. ● Contract between a service provider and one or more service
    consumers.
    ● Willing to do (refund money for example) if you are failing to meet the
    objectives.
    ● The SLA outlines a set of SLOs that have to be met and the
    consequences for both meeting and failing to meet them
    SLA - SERVICE LEVEL AGREEMENT

    View Slide

  10. ● Teams will resolve reported issues with Product X within 24 hours.
    But that same SLA doesn’t spell out what happens if the client takes 24 hours to
    send answers or screenshots to help your team diagnose the problem.
    Does it mean the team’s 24-hour window been eaten up by client slow-downs or does
    the clock start and stop based on when clients respond?
    ● PRO-TIP: tech should be involved in the creation of SLAs
    SLA EXAMPLE

    View Slide

  11. ● How to measure the quality of the service?
    Availability = (Nº of GOOD interactions) - (Nº of TOTAL interactions)
    POs and SREs define an availability target
    ● Error Budget = (100% reliability) - (SLO)
    Example: 99.9% of SLO = 43min error budget a month
    ● Developers can manage the risk themselves: decide how to spend their
    error budget
    ● Developer teams become self-policing
    ● Shared responsibility for system uptime: infrastructure failures eat
    into the devs error budget
    ERROR BUDGET

    View Slide

  12. ● As long as there is error budget remaining, developers can ship new
    features to improve the overall quality of the product
    ● Ops engineers can focus more heavily on long-term reliability projects,
    such as database maintenance and process automation
    ● But when the error budget begins running low, developers will need to slow
    down or freeze feature work — and work closely with the ops team to
    restabilize the system before any SLAs or SLOs are violated.
    ● Error budgets act as a quantifiable method for aligning the work and goals
    of Developers and Ops engineers.
    ERROR BUDGET

    View Slide

  13. "SRE é sobre achar a CADÊNCIA ideal para o
    nosso serviço ou software. É o balanço
    correto entre a inovação (lançamento de
    novas features) e estabilidade." - Andre
    Almar

    View Slide

  14. ● Site Reliability Engineer
    ● Production Engineer
    ● Infrastructure Engineer
    ● Systems Engineer
    ● DevOps Engineer
    ● Cloud Operations Engineer
    ● Cloud Engineer
    ● Operations Engineer
    ● Analista de DevOps
    JOB TITLES

    View Slide

  15. View Slide

  16. SREs are very focused on efficiency, automation, and reducing costs—taking
    manual and repetitive tasks and automating them. There is an emphasis on not
    reinventing the wheel. For example, there is one way to do monitoring. People can
    just use the monitoring solution and go and do other stuff.

    View Slide

  17. View Slide

  18. View Slide

  19. https://github.com/andrealmar/sre-university

    View Slide

  20. ● Software engineering
    ● Distributed systems design
    ● Operating systems
    ● Networking
    ● Databases
    ● Security
    ● Reliability best practices
    ● Troubleshooting
    ● Customer support
    KEY AREAS

    View Slide

  21. SRE Compatibility Quiz
    1. Do you like thinking about large scale problems that have a lot of moving parts?
    2. Do you like thinking about how to make large systems more reliable?
    3. Are you okay with working on software that will likely never be overtly seen by an external user?
    4. Do you enjoy looking at a terminal for large amounts of time?
    5. Do you enjoy the process of diagnosing and fixing a problem? If yes, what if the diagnosis involves
    system level problems that you cannot always see?
    6. Do you enjoy thinking about system information (e.g. disk space, cpu, os, kernel, etc.) and system level
    functionality (e.g. ssh, proc, cron, swaps, etc.)?
    7. Are you comfortable with the idea of being “on-call” in which you are likely to be in high-stakes scenario
    where something needs to be fixed?
    8. Are you able to stay calm under pressure?
    9. Do you approach problems in a logical, process-oriented way?
    10. Are you comfortable attempting a problem that has never been solved before?
    11. Are you someone who thinks about how you can make things better?

    View Slide

  22. View Slide

  23. [email protected],
    @andrealmar_ on Twitter,
    @andrealmar on Instagram
    andrealmar.com
    This presentation is available at:
    https://github.com/andrealmar/talks

    View Slide