SLO Review

2020-01-25 SRE NEXT 2020
https://sre-next.dev/schedule/#c4

Takeshi Kondo

January 25, 2020

Transcript

  1. SLO Review
    Takeshi Kondo / @chaspy
    2020/01/25
    SRE NEXT 2020 #srenext #srenextC

  2. Service Level Objectives

  3. Questions

    ✋Do you know the meaning of SLO?


    ✋Do you define SLO for your service?


    ✋Do you have an Error Budget Policy for your service?

  4. Target
    • People who want to know SLI/SLO

    • People who want to know how to use SLI/SLO

    • People who want to keep both the reliability and the agility of product development

  5. Site Reliability Engineering: Measuring and Managing Reliability
    https://www.coursera.org/learn/site-reliability-engineering-slos

  6. tl;dr
    • It is worth defining and reviewing SLI / SLO

    • But the SLI / SLO is not perfect from the beginning

    • Reduce cognitive load and introduce it to the team gradually

  7. Agenda
    • Learn SLO

    • What / Why / Where

    • Case Study in Quipper

    • Takeaways

    • Provide Recommended SLIs

    • Make the configuration as code

    • Have a steep learning curve

  8. Agenda
    • Learn SLO

    • What / Why / Where

    • Case Study in Quipper

    • Takeaways

    • Provide Recommended SLIs

    • Make the configuration as code

    • Have a steep learning curve

  9. What
    • SLI / Service Level Indicators

    • A quantifiable measure of service reliability

    • e.g. HTTP success rate, response time

    • SLO / Service Level Objectives

    • Set a reliability target for an SLI

    • 99%, 99.9%, 99.99%…

    • Error Budget

    • An SLO implies an acceptable level of unreliability
    • This is a budget that can be allocated
    The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0

  10. SLI should be related to user happiness


    SLI (%) = Good Events / Valid Events

  11. SLI should be related to user happiness


    SLI (%) = (http 2xx status count) / (http 2xx status count + 5xx status count)

  12. SLO is a reliability target for an SLI


    SLI (%) = (http 2xx status count) / (http 2xx status count + 5xx status count)
    SLO: 99.9%

  13. SLO is a reliability target for an SLI


    SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
    SLO: 99.9%
    Present: 99.95%

  14. We can accept Errors as Error Budget


    SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
    SLO: 99.9%
    Present: 99.95%
    Error Budget: we can accept 5 more 5xx errors

  15. We can accept Errors as Error Budget


    SLI (%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
    SLO: 99.9%
    Present: 99.95%
    Error Budget: we can accept 5 more 5xx errors
    Event based SLO
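
The event-based arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `event_sli` and `remaining_bad_events` are made up for this example, and the budget calculation assumes the number of good events stays fixed while errors accumulate.

```python
import math


def event_sli(good: int, bad: int) -> float:
    """Event-based SLI as a fraction: good events / valid events."""
    valid = good + bad
    if valid == 0:
        raise ValueError("no valid events in the window")
    return good / valid


def remaining_bad_events(good: int, bad: int, slo: float) -> int:
    """Bad events that still fit before the SLI falls below the SLO.

    Assumes the good-event count stays fixed while errors accumulate.
    """
    # good / (good + max_bad) >= slo  =>  max_bad <= good * (1 - slo) / slo
    # The small epsilon guards against float rounding just below an integer.
    max_bad = math.floor(good * (1 - slo) / slo + 1e-9)
    return max(0, max_bad - bad)


sli = event_sli(good=10000, bad=5)
left = remaining_bad_events(good=10000, bad=5, slo=0.999)
print(f"SLI: {sli:.2%}, errors still affordable: {left}")
```

With the slide's numbers (10000 2xx responses, 5 5xx responses, SLO 99.9%), this gives an SLI of 99.95% and room for 5 more 5xx errors.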

  16. We can accept Errors as Error Budget


    SLI (%) = (time during which the 95th-percentile response time in the last 1 minute is < 100 msec) / (entire time window)
    SLO: 99.9%
    Present: 99.95%

  17. We can accept Errors as Error Budget


    SLI (%) = (time during which the 95th-percentile response time in the last 1 minute is < 100 msec) / (entire time window)
    SLO: 99.9%
    Present: 99.95%
    Time window: 7 days
    Error Budget: only about 10 minutes in 7 days

  18. We can accept Errors as Error Budget


    SLI (%) = (time during which the 95th-percentile response time in the last 1 minute is < 100 msec) / (entire time window)
    SLO: 99.9%
    Present: 99.95%
    Time window: 7 days
    Error Budget: only about 10 minutes in 7 days
    Monitor based SLO
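
The "10 minutes in 7 days" figure follows directly from the SLO and the window length. The sketch below shows that calculation, plus one way to compute a time-based SLI from per-minute monitor results. The names `time_error_budget` and `monitor_sli` are hypothetical, and it assumes the monitor is evaluated once per minute.

```python
from datetime import timedelta
from typing import Sequence


def time_error_budget(window: timedelta, slo: float) -> timedelta:
    """Time the monitor may be in a bad state within the window."""
    return window * (1 - slo)


def monitor_sli(minute_ok: Sequence[bool]) -> float:
    """Fraction of evaluated minutes in which the monitor was healthy."""
    if not minute_ok:
        raise ValueError("no monitor evaluations in the window")
    return sum(minute_ok) / len(minute_ok)


budget = time_error_budget(timedelta(days=7), slo=0.999)
print(f"Error budget: {budget.total_seconds() / 60:.2f} minutes")

# Assumed sample data: 7 days of one-minute evaluations, 5 of them bad.
week = [True] * 10075 + [False] * 5
print(f"SLI: {monitor_sli(week):.2%}")
```

A 7-day window has 10080 minutes, so a 99.9% target leaves roughly 10 minutes of budget, which is why a single slow deploy can consume most of it.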

  19. Why
    • Fact-based decision making

    • Team can develop with a balance between reliability and agility

    • Especially important in the microservices architecture

  20. Team can develop with a balance between reliability and agility

    Reliability (Ops): "Keep the reliability"
    Agility (Dev): "Let’s release new features!"
    SLO: the balance point between the two

  21. Especially important in the microservices architecture
    ServiceA
    ServiceB
    ServiceC
    Success Rate 99.9%
    Success Rate 99%
    Success Rate 99%
    Reliability depends on
    other services
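
To make the dependency effect concrete, here is a minimal sketch using the three success rates on this slide. It assumes a request succeeds only if every service in the chain succeeds and that failures are independent; neither assumption is stated on the slide, and `serial_success_rate` is a name chosen for this example.

```python
import math
from typing import Iterable


def serial_success_rate(rates: Iterable[float]) -> float:
    """Success rate of a request that needs every listed service to succeed.

    Assumes independent failures, so the individual rates multiply.
    """
    return math.prod(rates)


combined = serial_success_rate([0.999, 0.99, 0.99])
print(f"End-to-end success rate: {combined:.2%}")
```

Under these assumptions the end-to-end rate is about 97.9%, lower than any single service's rate, which is why each service needs its own measured SLO.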

  22. Where
    Synthetics Client
    Frontend
    CDN LoadBalancer Application DataStore
    Many options, Trade-off

  23. Where
    Synthetics Client
    Frontend
    CDN LoadBalancer Application DataStore
    Many options, Trade-off
    Some requests might
    not reach the apps
    Need more
    engineering effort to
    generate E2E tests

  24. In Quipper
    Synthetics Client
    Frontend
    CDN LoadBalancer Application DataStore
    Send everything to Datadog

  25. Agenda
    • Learn SLO

    • What / Why / Where

    • Case Study in Quipper

    • Takeaways

    • Provide Recommended SLIs

    • Make the configuration as code

    • Have a steep learning curve

  26. Self-Contained
    “Encourage development teams to be self-contained so that each team can make products
    more comprehensively, proactively, and efficiently.”

  27. SRE Mission for 2020 / Self-Contained
    • Product Team can develop by themselves

    • No need to ask SREs

    • We SREs provide the process

    • Design Doc

    • Production Readiness Check

    • Delegate Infrastructure Management (Terraform)

    • SLI/SLO

  28. Timeline
    2019 2020
    Migrated to Kubernetes
    Define the Ownership
    Production Readiness Checklist
    SLO review by myself
    Set Error Budget Policy
    Jun.
    Mar. Mar.
    Sep.
    SRE NEXT
    SLO review with Devs

  29. Timeline
    2019 2020
    Migrated to Kubernetes
    Define the Ownership
    Production Readiness Checklist
    SLO review by myself
    SLO review with Devs
    Jun.
    Mar. Mar.
    Sep.
    SRE NEXT
    Set Error Budget Policy

  30. Timeline
    2019 2020
    Migrated to Kubernetes
    Define the Ownership
    Production Readiness Checklist
    SLO review by myself
    SLO review with Devs
    Jun.
    Mar. Mar.
    Sep.
    SRE NEXT
    Why do we need such steps?
    Set Error Budget Policy

  31. Why do we need such steps?
    • Are the SLIs/SLOs we defined appropriate?

    • If not, Error Budget Policy won’t work well

    • Can the product team start the process itself?

    • If not, we need some scaffolding, preparation, and training

  32. Case Study in Quipper
    • Define the Ownership

    • SLO review by myself

    • SLO review with Devs

    • Set Error Budget Policy

  33. Case Study in Quipper
    • Define the Ownership

    • SLO review by myself

    • SLO review with Devs

    • Set Error Budget Policy

  34. Know your systems and organizations
    • 2 Products

    • 4 Branches

    • 97 Kubernetes Deployments

    • 84 Developers (including 6 SREs)

    • 48 subdomains
    Where is the Ownership?

  35. Define the Owner

  36. Define the Owner
    Services / Teams
    Japan 7 Global 8
    Philippines 3
    Indonesia 4
    Shared 1

  37. Define Service Owner In Design Doc for new service

  38. Case Study in Quipper
    • Define the Ownership

    • SLO review by myself

    • SLO review with Devs

    • Set Error Budget Policy

  39. SLO review by myself
    • Establish SLO Review process

    • How to set SLO?

    • How to monitor SLO?

    • What action do we take on an SLO violation?

    • How to investigate?

    • Improve SLI / SLO accuracy

    • How to decide when to revise?

  40. How to set and monitor SLO?

  41. How to set and monitor SLO?
    • Unfortunately, there is no alerting or recording system

    • Use a Slack reminder and record the results in a GitHub Issue

  42. How to set and monitor SLO?

  43. Availability Table
    https://landing.google.com/sre/sre-book/chapters/availability-table/
    Too many errors
    Target too high
    Start with this!
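
The entries in an availability table like the one linked above can be recomputed from the target percentage and the period length. The sketch below is a generic calculation, not the table itself; `allowed_downtime` is a hypothetical helper, and the targets and periods are chosen only for illustration.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted in the period at the given availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period * ((100 - availability_percent) / 100)


for target in (99.0, 99.9, 99.99):
    per_month = allowed_downtime(target, timedelta(days=30))
    per_year = allowed_downtime(target, timedelta(days=365))
    print(f"{target}%: {per_month} per 30 days, {per_year} per year")
```

Printing a few rows this way makes it easier to judge which target is realistic before committing to one.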

  44. Realized that “SLO Review” is a good habit
    • Good habit?

    • Like Pair-Programming or Unit Test

    • Why?

    • Motivate to get metrics

    • No burnout, feel relief

    • Aware of the factors that hinder reliability

    • Platform Outage

    • Push notification

    • Resource Capacity

    • Rolling Update

  45. Case Study in Quipper
    • Define the Ownership

    • SLO review by myself

    • SLO review with Devs

    • Set Error Budget Policy

  46. Many Problems…
    • Noisy metrics caused by the DoS detector

    • Developing SLIs

    • Send http path tag for shared service

    • No available metrics for microservices SLIs

  47. DoS Detector: Rate limiting by Reverse Proxy

  48. DoS Detector: Rate limiting by Reverse Proxy
    If a large number of requests are made from the same client
    in a short time, it returns 503

  49. SLI should be related to user happiness


    SLI (%) = (http 2xx status count) / (http 2xx status count + 5xx status count)

  50. Noisy metrics caused by the DoS detector
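
Because the rate limiter answers with 503, its responses land in the 5xx count of the success-rate SLI even though the application itself was healthy. One possible way to reduce that noise, sketched below, is to drop rate-limited responses from the valid events. This is an assumption for illustration, not the fix described in the talk: the `rate_limited` flag is hypothetical and presumes the proxy can mark the responses it generates.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Response:
    status: int
    rate_limited: bool = False  # hypothetical flag set by the reverse proxy


def success_rate(responses: Iterable[Response], exclude_rate_limited: bool) -> float:
    """2xx / (2xx + 5xx), optionally ignoring rate-limited responses."""
    good = bad = 0
    for r in responses:
        if exclude_rate_limited and r.rate_limited:
            continue  # not counted as a valid event
        if 200 <= r.status < 300:
            good += 1
        elif 500 <= r.status < 600:
            bad += 1
    if good + bad == 0:
        raise ValueError("no valid events")
    return good / (good + bad)


# Assumed sample traffic: 990 successes and 10 rate-limited 503s.
traffic = [Response(200)] * 990 + [Response(503, rate_limited=True)] * 10
print(success_rate(traffic, exclude_rate_limited=False))
print(success_rate(traffic, exclude_rate_limited=True))
```

Whether rate-limited requests should count against the SLI is a product decision; the sketch only shows how much the choice can move the number.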

  51. Send http path tag for shared service
    Coaching Team uses
    example.quipper.com/coaching
    School Team uses
    example.quipper.com/school
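
When several teams share one service under different paths, a tag derived from the path lets each team filter its own SLI. The sketch below maps the first path segment to a team tag. The mapping, the tag values, and the `team_tag` function are all hypothetical, and the URLs are expected to include a scheme.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from first path segment to owning team.
TEAM_BY_PREFIX = {
    "coaching": "coaching-team",
    "school": "school-team",
}


def team_tag(url: str) -> str:
    """Return the owning team's tag for a request URL, or 'unowned'."""
    path = urlsplit(url).path
    first_segment = path.strip("/").split("/", 1)[0]
    return TEAM_BY_PREFIX.get(first_segment, "unowned")


print(team_tag("https://example.quipper.com/coaching/lessons"))
print(team_tag("https://example.quipper.com/school"))
```

Tagging by the first segment only keeps the number of distinct tag values small, which matters for metrics systems that charge or slow down with high tag cardinality.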

  52. Send http path tag for shared service

  53. Send http path tag for shared service

  54. No available metrics for microservices SLIs

  55. No available metrics for microservices SLIs
    ServiceA
    ServiceB
    ServiceC
    GET http://serviceb
    GET http://servicec

  56. No available metrics for microservices SLIs
    ServiceA
    ServiceB
    ServiceC
    GET http://serviceb
    GET http://servicec
    Side-car container

  57. Case Study in Quipper
    • Define the Ownership

    • SLO review by myself

    • SLO review with Devs

    • Set Error Budget Policy

    • To be continued…

  58. Agenda
    • Learn SLO

    • What / Why / Where

    • Case Study in Quipper

    • Takeaways

    • Provide Recommended SLIs

    • Make the configuration as code

    • Have a steep learning curve

  59. Provide Standardized / Recommended SLIs
    • Ideally, the Product Team should set its own SLIs, but…

    • Start with default first

  60. SLI menu
    • Availability

    • http success rate

    • Latency

    • upstream response time < x msec
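
The latency item on this menu can be read as an event-based SLI: the share of requests whose upstream response time is below the threshold. A minimal sketch follows, assuming raw per-request timings are available; `latency_sli` and the sample values are made up for illustration.

```python
from typing import Sequence


def latency_sli(response_times_msec: Sequence[float], threshold_msec: float) -> float:
    """Fraction of requests answered faster than the threshold."""
    if not response_times_msec:
        raise ValueError("no requests in the window")
    fast = sum(1 for t in response_times_msec if t < threshold_msec)
    return fast / len(response_times_msec)


samples = [20.0, 50.0, 80.0, 120.0]  # assumed sample timings in msec
print(f"Latency SLI: {latency_sli(samples, threshold_msec=100):.0%}")
```

The threshold x is the part each team would tune; the formula itself can stay the same across services.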

  61. Make the configuration as code

  62. Make the configuration as code
    Developers can easily change it
    with a pull request
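
One way to picture "configuration as code" is a small, validated list of SLO definitions kept in the repository, so that changing a target is a one-line diff in a pull request. The sketch below uses plain Python for illustration; the `SloDefinition` class, its fields, and the service name are hypothetical, and the talk does not say which tool or format was actually used.

```python
from dataclasses import dataclass

ALLOWED_SLIS = ("availability", "latency")


@dataclass(frozen=True)
class SloDefinition:
    service: str
    sli: str               # one of ALLOWED_SLIS
    target_percent: float  # e.g. 99.9
    window_days: int       # e.g. 7

    def __post_init__(self) -> None:
        # Validation runs on construction, so a bad edit fails in review or CI.
        if self.sli not in ALLOWED_SLIS:
            raise ValueError(f"unknown SLI: {self.sli}")
        if not 0 < self.target_percent < 100:
            raise ValueError("target must be strictly between 0 and 100")
        if self.window_days <= 0:
            raise ValueError("window must be positive")


# Hypothetical definitions; editing a target here is a reviewable change.
SLOS = [
    SloDefinition("example-service", "availability", 99.9, 7),
    SloDefinition("example-service", "latency", 99.0, 7),
]
```

Rejecting a 100% target at construction time encodes the idea that an SLO always leaves some error budget.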

  63. Have a steep learning curve

  64. Good Documentation

  65. Work together

  66. Agenda
    • Learn SLO

    • What / Why / Where

    • Case Study in Quipper

    • Takeaways

    • Provide Recommended SLIs

    • Make the configuration as code

    • Have a steep learning curve

  67. Summary
    • It is worth defining and reviewing SLI / SLO

    • But the SLI / SLO is not perfect from the beginning

    • Reduce cognitive load and introduce it to the team gradually

  68. Thank You!
    Takeshi Kondo
    Site Reliability Engineer at Quipper
    chaspy
    chaspy_
    SRE Lounge / Terraform-jp
