SLO Review

2020-01-25 SRE NEXT 2020
https://sre-next.dev/schedule/#c4

Takeshi Kondo

January 25, 2020

Transcript

  1. SLO Review Takeshi Kondo / @chaspy 2020/01/25 SRE NEXT 2020

    #srenext #srenextC
  2. Service Level Objectives

  3. Questions
    • ✋ Do you know the meaning of SLO?
    • ✋ Do you define SLOs for your service?
    • ✋ Do you have an Error Budget Policy for your service?
  4. Target
    • People who want to learn about SLI/SLO
    • People who want to know how to use SLI/SLO
    • People who want to keep both the reliability and the agility of product development
  5. Site Reliability Engineering: Measuring and Managing Reliability https://www.coursera.org/learn/site-reliability-engineering-slos

  6. tl;dr
    • It is worth defining and reviewing SLI / SLO
    • But the SLI / SLO is not perfect from the beginning
    • Reduce cognitive load and introduce it gradually to the team
  7. Agenda
    • Learn SLO
      • What / Why / Where
    • Case Study in Quipper
    • Takeaways
      • Provide Recommended SLIs
      • Make the configuration as code
      • Have a steep learning curve
  8. Agenda
    • Learn SLO
      • What / Why / Where
    • Case Study in Quipper
    • Takeaways
      • Provide Recommended SLIs
      • Make the configuration as code
      • Have a steep learning curve
  9. What
    • SLI / Service Level Indicators
      • A quantifiable measure of service reliability
      • e.g. HTTP success rate, response time
    • SLO / Service Level Objectives
      • Set a reliability target for an SLI
      • 99%, 99.9%, 99.99%…
    • Error Budget
      • An SLO implies an acceptable level of unreliability
      • This is a budget that can be allocated
    Source: The Art of SLOs – Slides / https://docs.google.com/presentation/d/1qcQ6alG_qUg3qWf733ZsDnTggwzqe4PZICrFXZ1zQZs/edit#slide=id.g75945b48fe_0_0
  10. SLI should be related to user happiness
    SLI(%) = Good Events / Valid Events
  11. SLI should be related to user happiness
    SLI(%) = (http 2xx status count) / (http 2xx status count + 5xx status count)
  12. SLO is a reliability target for an SLI
    SLI(%) = (http 2xx status count) / (http 2xx status count + 5xx status count)
    SLO: 99.9%
  13. SLO is a reliability target for an SLI
    SLI(%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
    SLO: 99.9% • Present: 99.95%
  14. We can accept Errors as Error Budget
    SLI(%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
    SLO: 99.9% • Present: 99.95%
    Error Budget: we can accept 5 more 5xx errors
  15. We can accept Errors as Error Budget
    SLI(%) = 10000 (2xx count) / (10000 (2xx count) + 5 (5xx count))
    SLO: 99.9% • Present: 99.95%
    Error Budget: we can accept 5 more 5xx errors
    This is an event-based SLO
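The arithmetic on the slides above can be checked with a small sketch. The function names are hypothetical and not from the talk, and treating an empty period as 100% is an assumed convention.

```python
def event_based_sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events over valid events."""
    if valid_events == 0:
        # Assumption: with no valid events, report 100% rather than divide by zero.
        return 100.0
    return 100.0 * good_events / valid_events


def remaining_error_budget(good_events: int, bad_events: int, slo_percent: float) -> float:
    """Return how many more bad events the SLO still tolerates (may be negative)."""
    valid_events = good_events + bad_events
    allowed_bad_events = valid_events * (1 - slo_percent / 100)
    return allowed_bad_events - bad_events


# Numbers from the slides: 10000 responses with 2xx, 5 with 5xx, SLO 99.9%.
sli = event_based_sli(good_events=10000, valid_events=10005)   # roughly 99.95
budget = remaining_error_budget(10000, 5, slo_percent=99.9)    # roughly 5 more errors
```

With 10005 valid events, a 99.9% target tolerates about 10 errors in total, so 5 observed errors leave about 5 in the budget, which matches the slide.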
  16. We can accept Errors as Error Budget
    SLI(%) = (time in which the 95th percentile response time over the last 1 minute is < 100 msec) / (entire time window)
    SLO: 99.9% • Present: 99.95%
  17. We can accept Errors as Error Budget
    SLI(%) = (time in which the 95th percentile response time over the last 1 minute is < 100 msec) / (entire time window)
    SLO: 99.9% • Present: 99.95%
    Time window: 7 days
    Error Budget is only 10 minutes in 7 days
  18. We can accept Errors as Error Budget
    SLI(%) = (time in which the 95th percentile response time over the last 1 minute is < 100 msec) / (entire time window)
    SLO: 99.9% • Present: 99.95%
    Time window: 7 days
    Error Budget is only 10 minutes in 7 days
    This is a monitor-based SLO
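The "10 minutes in 7 days" figure follows directly from the target and the window. A minimal sketch, with hypothetical function names, that approximates the time-based SLI by sampling one 95th percentile value per 1-minute window:

```python
from datetime import timedelta
from typing import Sequence


def time_based_error_budget(window: timedelta, slo_percent: float) -> timedelta:
    """Return how much of the window may be 'bad' while still meeting the SLO."""
    return window * (1 - slo_percent / 100)


def window_based_sli(p95_ms_per_minute: Sequence[float], threshold_ms: float = 100.0) -> float:
    """Return the percentage of 1-minute windows whose p95 latency is under the threshold.

    Assumption: one p95 sample per minute stands in for the monitor's state.
    """
    if not p95_ms_per_minute:
        return 100.0
    good = sum(1 for p95 in p95_ms_per_minute if p95 < threshold_ms)
    return 100.0 * good / len(p95_ms_per_minute)


budget = time_based_error_budget(timedelta(days=7), slo_percent=99.9)
budget_minutes = budget.total_seconds() / 60  # about 10 minutes
```

Seven days is 10080 minutes, so a 99.9% target leaves about 10.08 minutes in which the latency condition may fail.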
  19. Why
    • Fact-based decision making
    • Team can develop with a balance between reliability and agility
    • Especially important in the microservices architecture
  20. Team can develop with a balance between reliability and agility
    • Reliability (Ops): “Keep the reliability”
    • Agility (Dev): “Let’s release a new feature!”
    • SLO
  21. Especially important in the microservices architecture
    • ServiceA: Success Rate 99.9%
    • ServiceB: Success Rate 99%
    • ServiceC: Success Rate 99%
    Reliability depends on other services
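To see why dependencies matter, the sketch below multiplies the three success rates from the slide. It assumes every request passes through all three services in series and that their failures are independent; neither assumption is stated in the talk, and the function name is hypothetical.

```python
from math import prod
from typing import Sequence


def serial_success_rate(success_rates: Sequence[float]) -> float:
    """Return the end-to-end success rate of services called in series.

    Assumption: failures are independent, so the individual rates multiply.
    """
    return prod(success_rates)


# ServiceA 99.9%, ServiceB 99%, ServiceC 99% (values from the slide).
end_to_end = serial_success_rate([0.999, 0.99, 0.99])  # a little under 98%
```

Under these assumptions the user-facing success rate is lower than that of any single service, so an SLO for one service cannot be judged in isolation.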
  22. Where
    • Synthetics / Client / Frontend / CDN / LoadBalancer / Application / DataStore
    • Many options, with trade-offs
  23. Where
    • Synthetics / Client / Frontend / CDN / LoadBalancer / Application / DataStore
    • Many options, with trade-offs
    • Some requests might not reach the apps
    • Need more engineering effort to generate E2E tests
  24. In Quipper
    • Synthetics / Client / Frontend / CDN / LoadBalancer / Application / DataStore
    • Send everything to Datadog
  25. Agenda
    • Learn SLO
      • What / Why / Where
    • Case Study in Quipper
    • Takeaways
      • Provide Recommended SLIs
      • Make the configuration as code
      • Have a steep learning curve
  26. Self-Contained
    “Encourage development teams to be self-contained so that each team can make products more comprehensively, proactively, and efficiently.”
  27. SRE Mission for 2020 / Self-Contained
    • Product Teams can develop by themselves
      • No need to ask SREs
    • SREs provide the process
      • Design Doc
      • Production Readiness Check
      • Delegated Infrastructure Management (Terraform)
      • SLI/SLO
  28. Timeline (2019 to 2020)
    • Milestones: Migrated to Kubernetes • Define the Ownership • Production Readiness Checklist • SLO review by myself • Set Error Budget Policy • SLO review with Devs
    • Date labels: Jun. • Mar. • Mar. • Sep. • SRE NEXT
  29. Timeline (2019 to 2020)
    • Milestones: Migrated to Kubernetes • Define the Ownership • Production Readiness Checklist • SLO review by myself • SLO review with Devs • Set Error Budget Policy
    • Date labels: Jun. • Mar. • Mar. • Sep. • SRE NEXT
  30. Timeline (2019 to 2020)
    • Milestones: Migrated to Kubernetes • Define the Ownership • Production Readiness Checklist • SLO review by myself • SLO review with Devs • Set Error Budget Policy
    • Date labels: Jun. • Mar. • Mar. • Sep. • SRE NEXT
    • Why do we need such steps?
  31. Why do we need such steps?
    • Are the SLIs/SLOs we defined appropriate?
      • If not, the Error Budget Policy won’t work well
    • Can the product team start the process by itself?
      • If not, it needs some scaffolding, preparation, and training
  32. Case Study in Quipper
    • Define the Ownership
    • SLO review by myself
    • SLO review with Devs
    • Set Error Budget Policy
  33. Case Study in Quipper
    • Define the Ownership
    • SLO review by myself
    • SLO review with Devs
    • Set Error Budget Policy
  34. Know your systems and organizations
    • 2 Products
    • 4 Branches
    • 97 Kubernetes Deployments
    • 84 Developers (including 6 SREs)
    • 48 subdomains
    Where is the Ownership?
  35. Define the Owner

  36. Define the Owner
    Services / Teams
    • Japan: 7
    • Global: 8
    • Philippines: 3
    • Indonesia: 4
    • Shared: 1
  37. Define Service Owner In Design Doc for new service

  38. Case Study in Quipper
    • Define the Ownership
    • SLO review by myself
    • SLO review with Devs
    • Set Error Budget Policy
  39. SLO review by myself
    • Establish the SLO Review process
      • How to set SLOs?
      • How to monitor SLOs?
      • What action to take on an SLO violation?
      • How to investigate?
    • Improve SLI / SLO accuracy
      • How to decide on revisions?
  40. How to set and monitor SLO?

  41. How to set and monitor SLO?
    • Unfortunately, there is no alerting or recording system
    • Use a Slack reminder and record on a GitHub Issue
  42. How to set and monitor SLO?

  43. Availability Table https://landing.google.com/sre/sre-book/chapters/availability-table/
    • Too many errors
    • Target too high
    • Start with this!
  44. Realized that “SLO Review” is a good habit
    • Good habit?
      • Like Pair-Programming or Unit Tests
    • Why?
      • Motivates us to collect metrics
      • No burnout, a feeling of relief
      • Makes us aware of the factors that hinder reliability
        • Platform Outage
        • Push notification
        • Resource Capacity
        • Rolling Update
  45. Case Study in Quipper
    • Define the Ownership
    • SLO review by myself
    • SLO review with Devs
    • Set Error Budget Policy
  46. Many Problems…
    • Noisy metrics caused by the DoS detector
    • Developing SLIs
    • Send http path tag for shared service
    • No available metrics for microservices SLIs
  47. DoS Detector: Rate limiting by Reverse Proxy

  48. DoS Detector: Rate limiting by Reverse Proxy
    If a large number of requests are made from the same client in a short time, it returns 503
  49. SLI should be related to user happiness
    SLI(%) = (http 2xx status count) / (http 2xx status count + 5xx status count)
  50. Noisy metrics caused by the DoS detector
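Because the rate limiter answers with 503, those responses land in the 5xx count of the SLI even though they do not reflect a service failure. The talk does not say how this was resolved; the sketch below shows one possible approach, assuming the reverse proxy can mark rate-limited responses. The `rate_limited` flag and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Response:
    status: int
    rate_limited: bool = False  # hypothetical flag set by the reverse proxy


def success_rate_excluding_rate_limited(responses: Iterable[Response]) -> float:
    """Return the 2xx / (2xx + 5xx) percentage, skipping rate-limited responses."""
    good = bad = 0
    for response in responses:
        if response.rate_limited:
            continue  # not a valid event for this SLI
        if 200 <= response.status < 300:
            good += 1
        elif 500 <= response.status < 600:
            bad += 1
    valid = good + bad
    return 100.0 * good / valid if valid else 100.0


traffic = (
    [Response(200)] * 998
    + [Response(500)]
    + [Response(503, rate_limited=True)] * 50
)
filtered_sli = success_rate_excluding_rate_limited(traffic)  # roughly 99.9
```

Counting the 50 rate-limited responses as errors would drop the same traffic to roughly 95%, which is the kind of noise the slide refers to.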

  51. Send http path tag for shared service
    • Coaching Team uses example.quipper.com/coaching
    • School Team uses example.quipper.com/school
  52. Send http path tag for shared service

  53. Send http path tag for shared service
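When several teams share one host, a tag derived from the request path lets each team filter metrics down to its own traffic. A minimal sketch, assuming the first path segment identifies the owner; the mapping, the tag values, and the function name are hypothetical and not taken from the talk.

```python
# Hypothetical mapping from the first path segment to an owning team tag.
PATH_PREFIX_TO_TEAM = {
    "coaching": "coaching",
    "school": "school",
}


def team_tag_for_path(request_path: str) -> str:
    """Return the team tag for a request path such as '/coaching/lessons'."""
    first_segment = request_path.strip("/").split("/", 1)[0]
    return PATH_PREFIX_TO_TEAM.get(first_segment, "unknown")
```

A request to `/coaching/lessons` would be tagged `coaching`, and anything unrecognized falls back to `unknown` so that unowned traffic stays visible.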

  54. No available metrics for microservices SLIs

  55. No available metrics for microservices SLIs
    ServiceA → GET http://serviceb → ServiceB → GET http://servicec → ServiceC
  56. No available metrics for microservices SLIs
    ServiceA → GET http://serviceb → ServiceB → GET http://servicec → ServiceC
    Side-car container
  57. Case Study in Quipper
    • Define the Ownership
    • SLO review by myself
    • SLO review with Devs
    • Set Error Budget Policy
    • To be continued…
  58. Agenda
    • Learn SLO
      • What / Why / Where
    • Case Study in Quipper
    • Takeaways
      • Provide Recommended SLIs
      • Make the configuration as code
      • Have a steep learning curve
  59. Provide Standardized / Recommended SLIs
    • Ideally, the Product Team would set its own SLIs, but…
    • Start with defaults first
  60. SLI menu
    • Availability
      • http success rate
    • Latency
      • upstream response time < x msec
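The latency entry in this menu can be computed the same way as the availability SLI: requests faster than the threshold are the good events. A minimal sketch with a hypothetical function name; the threshold `x` is left as a parameter because the slide does not fix a value.

```python
from typing import Sequence


def latency_sli(response_times_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the percentage of requests with response time under the threshold."""
    if not response_times_ms:
        # Assumption: no requests means nothing violated the target.
        return 100.0
    fast = sum(1 for t in response_times_ms if t < threshold_ms)
    return 100.0 * fast / len(response_times_ms)


# Three of these four requests finish in under 100 msec.
sample_sli = latency_sli([50.0, 80.0, 120.0, 99.0], threshold_ms=100.0)
```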
  61. Make the configuration as code

  62. Make the configuration as code
    Developers can easily change it by pull request
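What "configuration as code" could look like is sketched below: SLO definitions kept as reviewable data in a repository, validated on load, and rendered to JSON for whatever tool applies them. This is not the setup used at Quipper; the schema, field names, and example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class SloDefinition:
    service: str
    owner_team: str
    sli: str  # e.g. an entry from a recommended SLI menu
    target_percent: float
    window_days: int

    def __post_init__(self) -> None:
        # Reject targets that leave no error budget or make no sense.
        if not 0 < self.target_percent < 100:
            raise ValueError("target_percent must be between 0 and 100, exclusive")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


# Hypothetical definition; a pull request would edit entries like this one.
SLOS = [
    SloDefinition(
        service="example-api",
        owner_team="example-team",
        sli="http_success_rate",
        target_percent=99.9,
        window_days=7,
    ),
]


def render(slos: Iterable[SloDefinition]) -> str:
    """Serialize the definitions to JSON for a downstream tool to consume."""
    return json.dumps([asdict(slo) for slo in slos], indent=2, sort_keys=True)
```

Keeping the definitions in one validated file means a change to a target shows up as a small diff that both developers and SREs can review.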
  63. Have a steep learning curve

  64. Good Documentation

  65. Work together

  66. Agenda
    • Learn SLO
      • What / Why / Where
    • Case Study in Quipper
    • Takeaways
      • Provide Recommended SLIs
      • Make the configuration as code
      • Have a steep learning curve
  67. Summary
    • It is worth defining and reviewing SLI / SLO
    • But the SLI / SLO is not perfect from the beginning
    • Reduce cognitive load and introduce it gradually to the team
  68. Thank You!
    • Takeshi Kondo, Site Reliability Engineer at Quipper
    • chaspy / chaspy_
    • SRE Lounge
    • Terraform-jp