SLO Review

2020-01-25 SRE NEXT 2020
https://sre-next.dev/schedule/#c4

Takeshi Kondo

January 25, 2020

Transcript

  1. Questions • ✋Do you know the meaning of SLO? • ✋Do you define SLO for your service? • ✋Do you have an Error Budget Policy for your service?
  2. Target • People who want to know SLI/SLO • People who want to know how to use SLI/SLO • People who want to keep the reliability and agility of product development
  3. tl;dr • It is worth defining and reviewing SLI / SLO • But the SLI / SLO is not perfect from the beginning • Reduce cognitive load and introduce gradually to team
  4. Agenda • Learn SLO • What / Why / Where • Case Study in Quipper • Takeaways • Provide Recommended SLIs • Make the configuration as code • Have a steep learning curve
  5. Agenda • Learn SLO • What / Why / Where • Case Study in Quipper • Takeaways • Provide Recommended SLIs • Make the configuration as code • Have a steep learning curve
  6. What • SLI / Service Level Indicators • A quantifiable measure of service reliability • e.g. HTTP success rate, response time • SLO / Service Level Objectives • Set a reliability target for an SLI • 99%, 99.9%, 99.99%… • Error Budget • An SLO implies an acceptable level of unreliability • This is a budget that can be allocated • Source: The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0
  7. SLI should be related to user happiness • SLI(%) = Good Events / Valid Events × 100
  8. SLI should be related to user happiness • SLI(%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count) × 100
  9. SLO is a reliability target for an SLI • SLO: 99.9% • SLI(%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count) × 100
  10. SLO is a reliability target for an SLI • SLO: 99.9% • Present: 99.95% • SLI(%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100
  11. We can accept Errors as Error Budget • SLO: 99.9% • Present: 99.95% • SLI(%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100 • Error Budget: we can accept 5 more 5xx errors
  12. We can accept Errors as Error Budget • SLO: 99.9% • Present: 99.95% • SLI(%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count)) × 100 • Error Budget: we can accept 5 more 5xx errors • Event-based SLO
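The event-based arithmetic on the slides above can be checked with a minimal Python sketch. The function names `event_sli` and `remaining_error_budget` are hypothetical and not from the deck, and the budget calculation assumes the number of good events stays fixed while more errors arrive.

```python
import math


def event_sli(good: int, bad: int) -> float:
    """Return the event-based SLI as a percentage of valid events."""
    valid = good + bad
    if valid == 0:
        raise ValueError("no valid events in the window")
    return 100.0 * good / valid


def remaining_error_budget(good: int, bad: int, slo_percent: float) -> int:
    """Return how many more bad events fit before the SLI drops below the SLO.

    Assumption: the count of good events does not change.
    A negative result means the budget is already exhausted.
    """
    slo = slo_percent / 100.0
    # bad_max / (good + bad_max) <= 1 - slo  =>  bad_max <= good * (1 - slo) / slo
    max_bad = math.floor(good * (1.0 - slo) / slo)
    return max_bad - bad


if __name__ == "__main__":
    print(round(event_sli(10000, 5), 2))            # present SLI in percent
    print(remaining_error_budget(10000, 5, 99.9))   # errors still affordable
```

With the slide's numbers (10000 2xx responses, 5 5xx responses, SLO 99.9%), at most 10 errors are tolerable for the same number of good events, which leaves room for 5 more.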
  13. We can accept Errors as Error Budget • SLO: 99.9% • Present: 99.95% • SLI(%) = time windows in which the 95th-percentile response time over the last 1 minute is < 100 ms / all time windows × 100
  14. We can accept Errors as Error Budget • SLO: 99.9% • Present: 99.95% • SLI(%) = time windows in which the 95th-percentile response time over the last 1 minute is < 100 ms / all time windows × 100 • Period: 7 days • Error Budget is only 10 minutes in 7 days
  15. We can accept Errors as Error Budget • SLO: 99.9% • Present: 99.95% • SLI(%) = time windows in which the 95th-percentile response time over the last 1 minute is < 100 ms / all time windows × 100 • Period: 7 days • Error Budget is only 10 minutes in 7 days • Monitor-based SLO
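The "10 minutes in 7 days" figure follows directly from the SLO: 0.1% of 7 days is about 10.08 minutes. A small sketch of the monitor-based calculation is below; `window_sli` and `time_error_budget` are hypothetical names, and the per-minute pass/fail values are assumed to come from whatever monitor evaluates the 95th-percentile threshold.

```python
from datetime import timedelta
from typing import Sequence


def window_sli(window_ok: Sequence[bool]) -> float:
    """Return the percentage of time windows in which the monitor was healthy."""
    if not window_ok:
        raise ValueError("no time windows to evaluate")
    return 100.0 * sum(window_ok) / len(window_ok)


def time_error_budget(period: timedelta, slo_percent: float) -> timedelta:
    """Return how much unhealthy time the SLO tolerates over the period."""
    return period * (1.0 - slo_percent / 100.0)


if __name__ == "__main__":
    budget = time_error_budget(timedelta(days=7), 99.9)
    print(budget.total_seconds() / 60)  # allowed unhealthy minutes in 7 days
```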
  16. Why • Fact-based decision making • Team can develop with a balance between reliability and agility • Especially important in the microservices architecture
  17. Team can develop with a balance between reliability and agility • Reliability (Ops): "Keep the reliability" • Agility (Dev): "Let’s release new feature!" • The SLO sits between the two
  18. Especially important in the microservices architecture • ServiceA: Success Rate 99.9% • ServiceB: Success Rate 99% • ServiceC: Success Rate 99% • Reliability depends on other services
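To illustrate why reliability depends on other services, here is a rough sketch that multiplies the success rates shown on the slide. It assumes that every user request must pass through all three services and that their failures are independent; the slide does not state the exact topology, so treat the result as an illustration rather than Quipper's actual figure. The function name `serial_success_rate` is hypothetical.

```python
from math import prod
from typing import Iterable


def serial_success_rate(rates: Iterable[float]) -> float:
    """Return the combined success rate when every service must succeed.

    Assumption: failures are independent and each request touches every service.
    Rates are fractions in the range 0.0 to 1.0.
    """
    return prod(rates)


if __name__ == "__main__":
    combined = serial_success_rate([0.999, 0.99, 0.99])
    print(f"{combined:.4%}")  # lower than any single service's success rate
```

Under these assumptions the combined success rate is roughly 97.9%, lower than any individual service, which is why each service in the chain needs its own SLO.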
  19. Where • Measurement points: Synthetics, Client, Frontend, CDN, LoadBalancer, Application, DataStore • Many options, each with trade-offs • Some requests might not reach the apps • Need more engineering effort to generate E2E tests
  20. Agenda • Learn SLO • What / Why / Where • Case Study in Quipper • Takeaways • Provide Recommended SLIs • Make the configuration as code • Have a steep learning curve
  21. Self-Contained • “Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”
  22. SRE Mission for 2020 / Self-Contained • Product Team can develop by themselves • No need to ask SREs • We SREs provide the process • Design Doc • Production Readiness Check • Delegate Infrastructure Management (Terraform) • SLI/SLO
  23. Timeline (2019 to 2020) • Milestones: Migrated to Kubernetes, Define the Ownership, Production Readiness Checklist, SLO review by myself, Set Error Budget Policy, SLO review with Devs • Date markers: Jun., Mar., Mar., Sep., SRE NEXT
  24. Timeline (2019 to 2020) • Milestones: Migrated to Kubernetes, Define the Ownership, Production Readiness Checklist, SLO review by myself, SLO review with Devs, Set Error Budget Policy • Date markers: Jun., Mar., Mar., Sep., SRE NEXT
  25. Timeline (2019 to 2020) • Milestones: Migrated to Kubernetes, Define the Ownership, Production Readiness Checklist, SLO review by myself, SLO review with Devs, Set Error Budget Policy • Date markers: Jun., Mar., Mar., Sep., SRE NEXT • Why do we need such steps?
  26. Why do we need such steps? • Are the SLIs/SLOs we defined appropriate? • If not, the Error Budget Policy won’t work well • Can the product team start the process itself? • If not, we need some scaffolding, preparation, and training
  27. Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy
  28. Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy
  29. Know your systems and organizations • 2 Products • 4 Branches • 97 Kubernetes Deployments • 84 Developers (including 6 SREs) • 48 subdomains • Where is the Ownership?
  30. Define the Owner • Services / Teams: Japan 7 • Global 8 • Philippines 3 • Indonesia 4 • Shared 1
  31. Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy
  32. SLO review by myself • Establish SLO Review process • How to set SLO? • How to monitor SLO? • What action to take on an SLO violation? • How to investigate? • Improve SLI / SLO accuracy • How to decide when to revise?
  33. How to set and monitor SLO? • Unfortunately, there is no alerting or recording system • Use a Slack reminder and record the results in a GitHub Issue
  34. Realized that “SLO Review” is a good habit • Good habit? • Like Pair-Programming or Unit Test • Why? • Motivates us to get metrics • No burnout, a feeling of relief • Become aware of the factors that hinder reliability • Platform Outage • Push notification • Resource Capacity • Rolling Update
  35. Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy
  36. Many Problems… • Noisy metrics caused by the DoS detector • Developing SLIs • Send HTTP path tag for shared service • No available metrics for microservices SLIs
  37. DoS Detector: Rate limiting by Reverse Proxy • If a large number of requests are made from the same client in a short time, it returns 503
  38. SLI should be related to user happiness • SLI(%) = HTTP 2xx status count / (HTTP 2xx status count + 5xx status count) × 100
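The deck does not say how the noise from the DoS detector was removed. One possible approach, sketched below purely as an assumption, is to drop throttled responses from the valid events so that the proxy's 503s do not count against the SLI. The `Response` type and its `rate_limited` flag are hypothetical; they stand in for whatever tag the reverse proxy could attach to throttled requests.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Response:
    status: int
    rate_limited: bool = False  # hypothetical tag set by the reverse proxy


def sli_excluding_rate_limited(responses: Iterable[Response]) -> float:
    """Return the 2xx / (2xx + 5xx) SLI in percent, ignoring throttled requests."""
    good = bad = 0
    for r in responses:
        if r.rate_limited:
            continue  # a 503 from the DoS detector is not a service failure
        if 200 <= r.status < 300:
            good += 1
        elif 500 <= r.status < 600:
            bad += 1
    valid = good + bad
    if valid == 0:
        raise ValueError("no valid events in the window")
    return 100.0 * good / valid


if __name__ == "__main__":
    sample = (
        [Response(200)] * 999
        + [Response(500)]
        + [Response(503, rate_limited=True)] * 50
    )
    print(sli_excluding_rate_limited(sample))
```

In the sample, the 50 throttled 503s are ignored, so only the single genuine 500 counts against the 999 successful responses.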
  39. Send HTTP path tag for shared service • Coaching Team uses example.quipper.com/coaching • School Team uses example.quipper.com/school
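As a sketch of why the path tag matters, the snippet below groups requests to a shared service by path prefix and computes one SLI per owning team. The `/coaching` and `/school` prefixes come from the slide; the mapping table, the function names, and the sample sub-paths are hypothetical and not taken from Quipper's setup.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Path prefixes from the slide, mapped to the owning team (mapping is illustrative).
TEAM_BY_PATH_PREFIX = {"/coaching": "coaching", "/school": "school"}


def owning_team(path: str) -> str:
    """Return the team that owns a request path, or "unknown"."""
    for prefix, team in TEAM_BY_PATH_PREFIX.items():
        if path == prefix or path.startswith(prefix + "/"):
            return team
    return "unknown"


def sli_by_team(requests: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the 2xx / (2xx + 5xx) SLI in percent for each owning team."""
    good: Counter = Counter()
    valid: Counter = Counter()
    for path, status in requests:
        team = owning_team(path)
        if 200 <= status < 300:
            good[team] += 1
            valid[team] += 1
        elif 500 <= status < 600:
            valid[team] += 1
    return {team: 100.0 * good[team] / valid[team] for team in valid}


if __name__ == "__main__":
    sample = [("/coaching/a", 200)] * 3 + [("/coaching/b", 500)] + [("/school", 200)] * 2
    print(sli_by_team(sample))
```

Without the path tag, all six sample requests would fall into one shared SLI and neither team could tell whose endpoint produced the error.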
  40. No available metrics for microservices SLIs • ServiceA calls ServiceB (GET http://serviceb) and ServiceC (GET http://servicec) • Side-car container
  41. Case Study in Quipper • Define the Ownership • SLO review by myself • SLO review with Devs • Set Error Budget Policy • To be continued…
  42. Agenda • Learn SLO • What / Why / Where • Case Study in Quipper • Takeaways • Provide Recommended SLIs • Make the configuration as code • Have a steep learning curve
  43. Provide Standardized / Recommended SLIs • Ideally, it is better for the Product Team to set SLIs, but… • Start with defaults first
  44. Agenda • Learn SLO • What / Why / Where • Case Study in Quipper • Takeaways • Provide Recommended SLIs • Make the configuration as code • Have a steep learning curve
  45. Summary • It is worth defining and reviewing SLI / SLO • But the SLI / SLO is not perfect from the beginning • Reduce cognitive load and introduce gradually to team