Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Getting Started with SLO

Buzzvil
March 10, 2021

Getting Started with SLO

By Liam

Buzzvil

March 10, 2021
Tweet

More Decks by Buzzvil

Other Decks in Programming

Transcript

  1. Copyright ⓒ All Right Reserved by Buzzvil Overwhelming Alerts Throughout

    the workday, we receive tons of threshold-based alerts. CPU utilization alerts, high # of DB connections, even forecast & anomaly detections. Are these all worth checking out?
  2. Copyright ⓒ All Right Reserved by Buzzvil • Can we

    measure if our service is working? • Can we get notified when it’s not working? What defines “working”? Does this mean something’s broken? 🤔
  3. Copyright ⓒ All Right Reserved by Buzzvil Thinking of our

    users • End Users ◦ (Buzzvil’s own app or SDK users) ◦ No problem with ad participation ◦ Reward is given as expected • Advertisers / Agencies ◦ Ad performance(ROAS, CVR, CTR, …) ◦ Budget is well-spent • Publishers ◦ Monetizing their apps ◦ Worried about UX What defines “working”? ↖Not working! Obviously..
  4. Copyright ⓒ All Right Reserved by Buzzvil Can we calculate

    revenue loss when ad allocation is down? Our Cost of Downtime 💸💸 • Unable to serve ad request ◦ Direct revenue lost ◦ Cost of communication ◦ We might have to compensate our advertisers for the disruption(service credit, refund, ..) • Publishers will complain ◦ Cost of communication ◦ Potential risk to partnership itself • User happiness is decreased ◦ Increased CS ◦ Unhappy users might drop out
  5. Copyright ⓒ All Right Reserved by Buzzvil Can we calculate

    revenue loss when ad allocation is down? Our Cost of Downtime 💸💸 • Unable to serve ad request ◦ Direct revenue lost ◦ Cost of communication ◦ We might have to compensate our advertisers for the disruption(service credit, refund, ..) • Publishers will complain ◦ Cost of communication ◦ Potential risk to partnership itself • User happiness is decreased ◦ Increased CS ◦ Unhappy users might drop out Lot of hidden cost in addition to direct revenue lost(operational cost, losing business opportunity, …) Direct threat to success of our product!
  6. Copyright ⓒ All Right Reserved by Buzzvil • a measurement

    that is determined over a metric, or a piece of data representing some property of a service • most often useful if it can result in a binary “good” or “bad” outcome for each event • Some examples: ◦ RPC is successful ◦ p95(95% of request) latency is below 500ms SLI(Service Level Indicator) The Reliability Stack The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo
  7. Copyright ⓒ All Right Reserved by Buzzvil SLO(Service Level Objective)

    The Reliability Stack The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo • “proper level of reliability” targeted by the service • 99%, 99.9%, 99.95%, 99.99%, ... (good + bad) events good events Availability = total minutes good minutes Availability = 😔 😄 SLO SLI
  8. Copyright ⓒ All Right Reserved by Buzzvil Error Budget The

    Reliability Stack The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo • Product & Engineering defines SLO targets • [100% - SLO target]: “budget of unreliability” • We can have control loop for utilizing budget SLO Target Yearly allowed downtime Monthly allowed downtime 99.99% uptime 52 min, 35 sec 4 min, 23 sec 99.95% uptime 4 hours, 22 min, 48 sec 21 min, 54 sec 99.9% uptime 8 hours, 45 min, 57 sec 43 min, 50 sec 99.5% uptime 43 hours, 49 min, 45 sec 3 hours, 39 min 99% uptime 87 hours, 39 min 7 hours, 18 min 😔 😄 SLO SLI Current Budget
  9. Copyright ⓒ All Right Reserved by Buzzvil • Availability ◦

    Proportion of valid requests processed successfully. • Latency ◦ Proportion of valid requests served faster than a threshold. Typical RPC SLIs(HTTP JSON API, gRPC, …) Request/Response
  10. Copyright ⓒ All Right Reserved by Buzzvil • Freshness ◦

    The proportion of valid data updated more recently than a threshold. • Coverage ◦ The proportion of valid data processed successfully. Typical SLIs for data processing jobs(i.e. Report Generator, ML Training Pipeline, ...) Data Processing
  11. Copyright ⓒ All Right Reserved by Buzzvil Simple is the

    best Choosing SLIs Pick 1~3 SLIs per user story
  12. Copyright ⓒ All Right Reserved by Buzzvil SLI/SLO should evolve

    over time Choosing SLIs • We can’t pick perfect SLI/SLO from the beginning. ◦ Pick simple and typical SLIs first ◦ For choosing threshold, we can always refer to historical metric ◦ After running for some amount of time, we’ll get feedback from variety of sources(users, other teams, business, ...) • Merge user stories when possible(except for some critical paths) • Don’t spend too much time choosing threshold. Group user stories into some buckets. ◦ i.e. RPC latency ▪ interactive - 500ms ▪ background - 5s ▪ write - 1.5s
  13. Copyright ⓒ All Right Reserved by Buzzvil Example - Pixelsvc

    • SDK ◦ Percentage of HTTP GET requests for /buzzvil-pixel.js with 200~399 status measured at CDN • Track Event ◦ Percentage of HTTP GET requests for /track with 200~399 status measured at load balancer | APM • Data Processing ◦ Percentage of tracked events successfully processed within 1 hour measured at datastore
  14. Copyright ⓒ All Right Reserved by Buzzvil • 레이턴시가 500ms

    정도를 유지하던 서비스가 있는데, 합리적인 범위인지 판단을 해야함. 만약 유저가 느끼기에 평소에 500ms 이내에 꾸준히 반응했으면 어느날 1s의 latency가 생기면 성능이 저하되었다고 느낄 것. ◦ 시간이 지남에 따라 부하 증가(디비 병목이거나 사용량이 증가해서 서버 saturation이 높아졌을 때) 로 인해 latency가 점진적으로 증가하면 우리는 모니터링만 보고서는 이를 알아채기가 어렵다. ◦ 500 ms가 어떻게 정해졌건, SLI화 해두면 error budget이 감소할 것이기에 우린 알아챌 수 있음. x Product Teams
  15. Copyright ⓒ All Right Reserved by Buzzvil Imagine without SLI..

    Product Teams • What if latency of a service gradually increases over time? ◦ Bad autoscaling configuration ◦ User acquisition ◦ Due to increased datastore load • At some day users will notice the degradation.(1s -> 3s)
  16. Copyright ⓒ All Right Reserved by Buzzvil Imagine without SLI..

    Product Teams Ops Product / Engineering 🛑 Reliability first! 🚀 Velocity first! 💦 friction...
  17. Copyright ⓒ All Right Reserved by Buzzvil Utilizing Error Budget

    99.9% SLI 😄 99.95% 99% 🚨 Danger zone. Freeze, and improve reliability 🔬 Do experiments, Feature release, ... 󰳕 Monitoring, Some of maintenance
  18. Copyright ⓒ All Right Reserved by Buzzvil Let’s push every

    events to datadog and create SLO metrics! • AWS metrics, Sentry exceptions, We’ve got nice tools
  19. Copyright ⓒ All Right Reserved by Buzzvil • Design Doc

    - needs some improvements • Production readiness checklist ◦ CI/CD ◦ helm chart, templates, pipeline with some of nice defaults to start with ◦ Graceful shutdown ◦ Monitoring(APM, Sentry) • Infrastructure management(RDS, Dynamodb, …) - with help from DevOps • Define SLI/SLO Some useful practice we already have Product Teams
  20. Copyright ⓒ All Right Reserved by Buzzvil • Error budget에

    영향을 미친 요인은 무엇인지 ◦ Datastore의 로드 ◦ 잘못된 autoscaling 설정 ◦ 리소스 할당이 부족 ◦ Lack of graceful termination → deployment might consume error budget! ◦ Spike pattern in services(push notification) ▪ 여기서 잠깐, 우리는 퍼블리셔의 프로모션을 미리 알고 대응할 수 없다는 제약이 있다. • 사업단과 논의하여 대규모 프로모션에 대한 정보를 미리 습득할 수 있어야 함 ▪ 제품팀에서 알고있는 릴리즈 일정이 있다면 이는 미리 수요 예측에 반영되어 있어야 한다. • Product Teams
  21. Copyright ⓒ All Right Reserved by Buzzvil • Google SRE

    Book • Implementing Service Level Objectives • https://www.slideshare.net/Pivotal/six-simple-steps-to-service-level-obje ctives-slos • Solving reliability fears with service level objectives (Google) Useful Resources