Getting Started with SLO

Copyright ⓒ All Right Reserved by Buzzvil Liam Hwang 2021.03.10
Getting Started with SLO

Copyright ⓒ All Right Reserved by Buzzvil Overwhelming Alerts Throughout
the workday, we receive tons of threshold-based alerts. CPU utilization alerts, high # of DB connections, even forecast & anomaly detections. Are these all worth checking out?

Copyright ⓒ All Right Reserved by Buzzvil • Can we
measure if our service is working? • Can we get notiﬁed when it’s not working? What deﬁnes “working”? Does this mean something’s broken? 🤔

Copyright ⓒ All Right Reserved by Buzzvil Thinking of our
users • End Users ◦ (Buzzvil’s own app or SDK users) ◦ No problem with ad participation ◦ Reward is given as expected • Advertisers / Agencies ◦ Ad performance(ROAS, CVR, CTR, …) ◦ Budget is well-spent • Publishers ◦ Monetizing their apps ◦ Worried about UX What deﬁnes “working”? ↖Not working! Obviously..

Copyright ⓒ All Right Reserved by Buzzvil Can we calculate
revenue loss when ad allocation is down? Our Cost of Downtime 💸💸 • Unable to serve ad request ◦ Direct revenue lost ◦ Cost of communication ◦ We might have to compensate our advertisers for the disruption(service credit, refund, ..) • Publishers will complain ◦ Cost of communication ◦ Potential risk to partnership itself • User happiness is decreased ◦ Increased CS ◦ Unhappy users might drop out

Copyright ⓒ All Right Reserved by Buzzvil Can we calculate
revenue loss when ad allocation is down? Our Cost of Downtime 💸💸 • Unable to serve ad request ◦ Direct revenue lost ◦ Cost of communication ◦ We might have to compensate our advertisers for the disruption(service credit, refund, ..) • Publishers will complain ◦ Cost of communication ◦ Potential risk to partnership itself • User happiness is decreased ◦ Increased CS ◦ Unhappy users might drop out Lot of hidden cost in addition to direct revenue lost(operational cost, losing business opportunity, …) Direct threat to success of our product!

Copyright ⓒ All Right Reserved by Buzzvil • a measurement
that is determined over a metric, or a piece of data representing some property of a service • most often useful if it can result in a binary “good” or “bad” outcome for each event • Some examples: ◦ RPC is successful ◦ p95(95% of request) latency is below 500ms SLI(Service Level Indicator) The Reliability Stack The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo

Copyright ⓒ All Right Reserved by Buzzvil SLO(Service Level Objective)
The Reliability Stack The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo • “proper level of reliability” targeted by the service • 99%, 99.9%, 99.95%, 99.99%, ... (good + bad) events good events Availability = total minutes good minutes Availability = 😔 😄 SLO SLI

Copyright ⓒ All Right Reserved by Buzzvil Error Budget The
Reliability Stack The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo • Product & Engineering deﬁnes SLO targets • [100% - SLO target]: “budget of unreliability” • We can have control loop for utilizing budget SLO Target Yearly allowed downtime Monthly allowed downtime 99.99% uptime 52 min, 35 sec 4 min, 23 sec 99.95% uptime 4 hours, 22 min, 48 sec 21 min, 54 sec 99.9% uptime 8 hours, 45 min, 57 sec 43 min, 50 sec 99.5% uptime 43 hours, 49 min, 45 sec 3 hours, 39 min 99% uptime 87 hours, 39 min 7 hours, 18 min 😔 😄 SLO SLI Current Budget

Copyright ⓒ All Right Reserved by Buzzvil • Availability ◦
Proportion of valid requests processed successfully. • Latency ◦ Proportion of valid requests served faster than a threshold. Typical RPC SLIs(HTTP JSON API, gRPC, …) Request/Response

Copyright ⓒ All Right Reserved by Buzzvil • Freshness ◦
The proportion of valid data updated more recently than a threshold. • Coverage ◦ The proportion of valid data processed successfully. Typical SLIs for data processing jobs(i.e. Report Generator, ML Training Pipeline, ...) Data Processing

Copyright ⓒ All Right Reserved by Buzzvil Simple is the
best Choosing SLIs Pick 1~3 SLIs per user story

Copyright ⓒ All Right Reserved by Buzzvil SLI/SLO should evolve
over time Choosing SLIs • We can’t pick perfect SLI/SLO from the beginning. ◦ Pick simple and typical SLIs ﬁrst ◦ For choosing threshold, we can always refer to historical metric ◦ After running for some amount of time, we’ll get feedback from variety of sources(users, other teams, business, ...) • Merge user stories when possible(except for some critical paths) • Don’t spend too much time choosing threshold. Group user stories into some buckets. ◦ i.e. RPC latency ▪ interactive - 500ms ▪ background - 5s ▪ write - 1.5s

Copyright ⓒ All Right Reserved by Buzzvil Example - Pixelsvc
• SDK ◦ Percentage of HTTP GET requests for /buzzvil-pixel.js with 200~399 status measured at CDN • Track Event ◦ Percentage of HTTP GET requests for /track with 200~399 status measured at load balancer | APM • Data Processing ◦ Percentage of tracked events successfully processed within 1 hour measured at datastore

Copyright ⓒ All Right Reserved by Buzzvil • 레이턴시가 500ms
정도를 유지하던 서비스가 있는데, 합리적인 범위인지 판단을 해야함. 만약 유저가 느끼기에 평소에 500ms 이내에 꾸준히 반응했으면 어느날 1s의 latency가 생기면 성능이 저하되었다고 느낄 것. ◦ 시간이 지남에 따라 부하 증가(디비 병목이거나 사용량이 증가해서 서버 saturation이 높아졌을 때) 로 인해 latency가 점진적으로 증가하면 우리는 모니터링만 보고서는 이를 알아채기가 어렵다. ◦ 500 ms가 어떻게 정해졌건, SLI화 해두면 error budget이 감소할 것이기에 우린 알아챌 수 있음. x Product Teams

Copyright ⓒ All Right Reserved by Buzzvil Imagine without SLI..
Product Teams • What if latency of a service gradually increases over time? ◦ Bad autoscaling configuration ◦ User acquisition ◦ Due to increased datastore load • At some day users will notice the degradation.(1s -> 3s)

Copyright ⓒ All Right Reserved by Buzzvil Utilizing Error Budget
99.9% SLI 😄 99.95% 99% 🚨 Danger zone. Freeze, and improve reliability 🔬 Do experiments, Feature release, ... 󰳕 Monitoring, Some of maintenance

Copyright ⓒ All Right Reserved by Buzzvil • Design Doc
- needs some improvements • Production readiness checklist ◦ CI/CD ◦ helm chart, templates, pipeline with some of nice defaults to start with ◦ Graceful shutdown ◦ Monitoring(APM, Sentry) • Infrastructure management(RDS, Dynamodb, …) - with help from DevOps • Deﬁne SLI/SLO Some useful practice we already have Product Teams

Copyright ⓒ All Right Reserved by Buzzvil • Error budget에
영향을 미친 요인은 무엇인지 ◦ Datastore의 로드 ◦ 잘못된 autoscaling 설정 ◦ 리소스 할당이 부족 ◦ Lack of graceful termination → deployment might consume error budget! ◦ Spike pattern in services(push notification) ▪ 여기서 잠깐, 우리는 퍼블리셔의 프로모션을 미리 알고 대응할 수 없다는 제약이 있다. • 사업단과 논의하여 대규모 프로모션에 대한 정보를 미리 습득할 수 있어야 함 ▪ 제품팀에서 알고있는 릴리즈 일정이 있다면 이는 미리 수요 예측에 반영되어 있어야 한다. • Product Teams

Copyright ⓒ All Right Reserved by Buzzvil • Google SRE
Book • Implementing Service Level Objectives • https://www.slideshare.net/Pivotal/six-simple-steps-to-service-level-obje ctives-slos • Solving reliability fears with service level objectives (Google) Useful Resources

Getting Started with SLO

Getting Started with SLO

Buzzvil

More Decks by Buzzvil

Other Decks in Programming

Featured

Transcript

Copyright ⓒ All Right Reserved by Buzzvil Liam Hwang 2021.03.10

Copyright ⓒ All Right Reserved by Buzzvil Overwhelming Alerts Throughout

Copyright ⓒ All Right Reserved by Buzzvil • Can we

Copyright ⓒ All Right Reserved by Buzzvil Thinking of our

Copyright ⓒ All Right Reserved by Buzzvil Can we calculate

Copyright ⓒ All Right Reserved by Buzzvil Can we calculate

Copyright ⓒ All Right Reserved by Buzzvil • a measurement

Copyright ⓒ All Right Reserved by Buzzvil SLO(Service Level Objective)

Copyright ⓒ All Right Reserved by Buzzvil Error Budget The

Copyright ⓒ All Right Reserved by Buzzvil • Availability ◦

Copyright ⓒ All Right Reserved by Buzzvil • Freshness ◦

Copyright ⓒ All Right Reserved by Buzzvil Simple is the

Copyright ⓒ All Right Reserved by Buzzvil SLI/SLO should evolve

Copyright ⓒ All Right Reserved by Buzzvil Example - Pixelsvc

Copyright ⓒ All Right Reserved by Buzzvil • 레이턴시가 500ms

Copyright ⓒ All Right Reserved by Buzzvil Imagine without SLI..

Copyright ⓒ All Right Reserved by Buzzvil Imagine without SLI..

Copyright ⓒ All Right Reserved by Buzzvil Utilizing Error Budget

Copyright ⓒ All Right Reserved by Buzzvil Let’s push every

Copyright ⓒ All Right Reserved by Buzzvil • Design Doc

Copyright ⓒ All Right Reserved by Buzzvil • Error budget에

Copyright ⓒ All Right Reserved by Buzzvil • Google SRE

Copyright ⓒ All Right Reserved by Buzzvil Thank you