Getting Started with SLO

by Buzzvil

Slide 1

Slide 2

Copyright ⓒ All Right Reserved by Buzzvil

Overwhelming Alerts

Throughout the workday, we receive tons of threshold-based alerts. CPU utilization alerts, high # of DB connections, even forecast & anomaly detections. Are these all worth checking out?

Slide 3

Slide 4

Thinking of our users

What defines “working”?

● End Users
  ○ (Buzzvil’s own app or SDK users)
  ○ No problem with ad participation
  ○ Reward is given as expected
● Advertisers / Agencies
  ○ Ad performance (ROAS, CVR, CTR, …)
  ○ Budget is well-spent
● Publishers
  ○ Monetizing their apps
  ○ Worried about UX

↖ Not working! Obviously..

Slide 5

Our Cost of Downtime 💸💸

Can we calculate revenue loss when ad allocation is down?

● Unable to serve ad request
  ○ Direct revenue lost
  ○ Cost of communication
  ○ We might have to compensate our advertisers for the disruption (service credit, refund, ...)
● Publishers will complain
  ○ Cost of communication
  ○ Potential risk to partnership itself
● User happiness is decreased
  ○ Increased CS
  ○ Unhappy users might drop out

Slide 6

Slide 7

The Reliability Stack

SLI (Service Level Indicator)

● A measurement that is determined over a metric, or a piece of data representing some property of a service
● Most often useful if it can result in a binary “good” or “bad” outcome for each event
● Some examples:
  ○ RPC is successful
  ○ p95 latency (95% of requests) is below 500ms

The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo
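
The binary good/bad idea above can be sketched in a few lines of Python. This is only an illustration: the `Request` record, its field names, and the sample data are hypothetical, and the 500ms threshold from the slide is applied per request here rather than to a p95 aggregate.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Request:
    """Hypothetical record of one RPC; the field names are illustrative."""
    succeeded: bool
    latency_ms: float


def is_available(req: Request) -> bool:
    # Availability-style SLI: the event is good if the RPC succeeded.
    return req.succeeded


def is_fast(req: Request, threshold_ms: float = 500.0) -> bool:
    # Latency-style SLI: the event is good if it finished under the threshold.
    return req.latency_ms < threshold_ms


def sli(events: Iterable[Request], is_good: Callable[[Request], bool]) -> float:
    """Fraction of events classified as good (between 0.0 and 1.0)."""
    events = list(events)
    if not events:
        raise ValueError("cannot compute an SLI over zero events")
    return sum(1 for e in events if is_good(e)) / len(events)


# Made-up sample data for the sketch.
sample = [
    Request(succeeded=True, latency_ms=120.0),
    Request(succeeded=True, latency_ms=640.0),
    Request(succeeded=False, latency_ms=90.0),
    Request(succeeded=True, latency_ms=700.0),
]
availability_sli = sli(sample, is_available)  # 3 of 4 requests succeeded
latency_sli = sli(sample, is_fast)            # 2 of 4 requests were under 500ms
```

Keeping the good/bad predicate separate from the ratio means the same `sli` helper can serve any indicator that reduces to a yes/no answer per event.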

Slide 8

The Reliability Stack

SLO (Service Level Objective)

● “Proper level of reliability” targeted by the service
● 99%, 99.9%, 99.95%, 99.99%, ...

Availability = good events / (good + bad) events
Availability = good minutes / total minutes

(Diagram: SLO and SLI marked on a scale from 😔 to 😄)

Slide 9

The Reliability Stack

Error Budget

● Product & Engineering define SLO targets
● [100% - SLO target]: “budget of unreliability”
● We can have a control loop for utilizing the budget

SLO Target       Yearly allowed downtime    Monthly allowed downtime
99.99% uptime    52 min, 35 sec             4 min, 23 sec
99.95% uptime    4 hours, 22 min, 48 sec    21 min, 54 sec
99.9% uptime     8 hours, 45 min, 57 sec    43 min, 50 sec
99.5% uptime     43 hours, 49 min, 45 sec   3 hours, 39 min
99% uptime       87 hours, 39 min           7 hours, 18 min

(Diagram: SLO, SLI, and the current budget marked on a scale from 😔 to 😄)
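
The "[100% - SLO target]" arithmetic behind the table can be sketched as below. The function names and the 30-day window are assumptions made for this example; the table appears to use an average calendar month and year, so its figures differ from a fixed 30-day or 365-day window by a small amount.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime budget for a window: (100% - SLO target) of its length."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    budget_fraction = (100.0 - slo_percent) / 100.0
    return timedelta(seconds=window.total_seconds() * budget_fraction)


def budget_remaining(slo_percent: float, window: timedelta,
                     downtime_so_far: timedelta) -> float:
    """Share of the budget still unspent; negative means the SLO is missed."""
    return 1.0 - downtime_so_far / allowed_downtime(slo_percent, window)


month = timedelta(days=30)  # assumed window length for this sketch
monthly_budget = allowed_downtime(99.9, month)  # 0.1% of 30 days
remaining = budget_remaining(99.9, month, timedelta(minutes=10, seconds=48))
```

A `remaining` value near 1.0 means most of the budget is unspent, while a value at or below 0 means the window's SLO has already been missed.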

Slide 10

Request/Response

Typical RPC SLIs (HTTP JSON API, gRPC, …)

● Availability
  ○ Proportion of valid requests processed successfully.
● Latency
  ○ Proportion of valid requests served faster than a threshold.

Slide 11

Data Processing

Typical SLIs for data processing jobs (e.g. Report Generator, ML Training Pipeline, ...)

● Freshness
  ○ The proportion of valid data updated more recently than a threshold.
● Coverage
  ○ The proportion of valid data processed successfully.

Slide 12

Slide 13

Choosing SLIs

SLI/SLO should evolve over time

● We can’t pick a perfect SLI/SLO from the beginning.
  ○ Pick simple and typical SLIs first
  ○ For choosing a threshold, we can always refer to historical metrics
  ○ After running for some time, we’ll get feedback from a variety of sources (users, other teams, business, ...)
● Merge user stories when possible (except for some critical paths)
● Don’t spend too much time choosing thresholds. Group user stories into a few buckets.
  ○ e.g. RPC latency
    ■ interactive - 500ms
    ■ background - 5s
    ■ write - 1.5s
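
One way to picture the bucket approach is a lookup from user story to bucket to threshold. In this sketch the thresholds come from the slide, while the endpoint names and the mapping itself are hypothetical.

```python
# Thresholds per bucket, taken from the slide (in milliseconds).
LATENCY_THRESHOLD_MS = {"interactive": 500, "background": 5000, "write": 1500}

# Hypothetical mapping of user stories (endpoints) to buckets.
ENDPOINT_BUCKET = {
    "GET /feed": "interactive",
    "POST /events": "write",
    "POST /reports/rebuild": "background",
}


def is_within_threshold(endpoint: str, latency_ms: float) -> bool:
    """True if the request met the latency threshold of its bucket."""
    bucket = ENDPOINT_BUCKET[endpoint]
    return latency_ms <= LATENCY_THRESHOLD_MS[bucket]
```

With only three thresholds to maintain, a new endpoint needs a bucket assignment rather than a threshold discussion of its own.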

Slide 14

Example - Pixelsvc

● SDK
  ○ Percentage of HTTP GET requests for /buzzvil-pixel.js with 200~399 status, measured at CDN
● Track Event
  ○ Percentage of HTTP GET requests for /track with 200~399 status, measured at load balancer | APM
● Data Processing
  ○ Percentage of tracked events successfully processed within 1 hour, measured at datastore
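
As a rough sketch, the Pixelsvc indicators above reduce to two small ratio calculations. The function names, the input shapes, and the sample values are assumptions for illustration; in practice the numbers would come from CDN, load balancer, or datastore metrics rather than in-memory lists.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence, Tuple


def status_sli(statuses: Sequence[int]) -> float:
    """Share of responses with a 200~399 status (expects a non-empty input)."""
    good = sum(1 for s in statuses if 200 <= s <= 399)
    return good / len(statuses)


def processing_sli(events: Sequence[Tuple[datetime, Optional[datetime]]],
                   deadline: timedelta = timedelta(hours=1)) -> float:
    """Share of (tracked_at, processed_at) pairs processed within the deadline.

    processed_at is None for events that were never processed.
    """
    good = sum(
        1 for tracked_at, processed_at in events
        if processed_at is not None and processed_at - tracked_at <= deadline
    )
    return good / len(events)


# Made-up sample data for the sketch.
track_sli = status_sli([200, 204, 302, 404, 500])

t0 = datetime(2024, 1, 1, 12, 0)
pipeline_sli = processing_sli([
    (t0, t0 + timedelta(minutes=20)),   # processed in time
    (t0, t0 + timedelta(minutes=90)),   # processed too late
    (t0, None),                         # never processed
    (t0, t0 + timedelta(minutes=60)),   # exactly at the deadline, counted as good
])
```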

Slide 15

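Product Teams

● Suppose a service has been holding its latency at around 500ms, and we need to judge whether that is a reasonable range. If users are used to the service consistently responding within 500ms, they will feel that performance has degraded when latency one day reaches 1s.
  ○ If latency increases gradually over time because of growing load (a DB bottleneck, or higher server saturation from increased usage), it is hard for us to notice just by looking at monitoring.
  ○ However the 500ms was chosen, once it is turned into an SLI we can notice, because the error budget will decrease.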
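
A toy simulation of this gradual drift, using synthetic data only: the 500ms threshold is from the slide, while the 95% target, the weekly grouping, and the latency values are assumptions chosen to keep the arithmetic easy to follow.

```python
from statistics import mean

THRESHOLD_MS = 500   # latency threshold from the slide
SLO_TARGET = 0.95    # assumed target, for illustration only


def latency_sli(latencies_ms):
    """Share of requests that finished under the threshold."""
    return sum(1 for ms in latencies_ms if ms < THRESHOLD_MS) / len(latencies_ms)


# Synthetic data: 20 requests per week, each week 15ms slower than the last.
baseline = [300 + 10 * i for i in range(20)]  # 300ms .. 490ms
weeks = [[ms + 15 * week for ms in baseline] for week in range(4)]

for week, latencies in enumerate(weeks):
    sli = latency_sli(latencies)
    status = "ok" if sli >= SLO_TARGET else "SLO missed"
    print(f"week {week}: mean={mean(latencies):.0f}ms sli={sli:.2f} {status}")

# First week in which the SLI falls below the target.
first_miss = next(w for w, l in enumerate(weeks) if latency_sli(l) < SLO_TARGET)
```

In this data the mean latency stays below 500ms in every week, so an alert on the average would stay silent, while the SLI drops below the assumed target from week 2 onward.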

Slide 16

Product Teams

Imagine without SLI..

● What if the latency of a service gradually increases over time?
  ○ Bad autoscaling configuration
  ○ User acquisition
  ○ Increased datastore load
● Some day, users will notice the degradation (1s -> 3s).

Slide 17

Slide 18

Utilizing Error Budget

(Diagram: the SLI on a scale up to 😄, with marks at 99.95%, 99.9%, and 99%)

● 🚨 Danger zone. Freeze, and improve reliability
● 🔬 Do experiments, feature release, ...
● Monitoring, some maintenance
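
The zones can be read as a simple policy function. The transcript does not say which percentage separates which zone, so the mapping below is an assumption: 99.9% is treated as the SLO target, 99.95% as the level above which there is budget to spare, and anything under the target as the danger zone.

```python
def error_budget_action(sli_percent: float,
                        slo_percent: float = 99.9,
                        comfortable_percent: float = 99.95) -> str:
    """Map the current SLI to one of the three zones (assumed boundaries)."""
    if sli_percent >= comfortable_percent:
        # Plenty of budget left: spend it on change.
        return "do experiments, release features"
    if sli_percent >= slo_percent:
        # Still meeting the target, but with little room to spare.
        return "monitoring, some maintenance"
    # Below the target: the budget is gone.
    return "freeze and improve reliability"
```

For example, `error_budget_action(99.92)` falls in the middle zone under these assumed boundaries.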

Slide 19

Slide 20

Product Teams

Some useful practices we already have

● Design Doc - needs some improvements
● Production readiness checklist
  ○ CI/CD
  ○ Helm chart, templates, pipeline with some nice defaults to start with
  ○ Graceful shutdown
  ○ Monitoring (APM, Sentry)
● Infrastructure management (RDS, DynamoDB, …) - with help from DevOps
● Define SLI/SLO

Slide 21

Product Teams

● What factors affected the error budget?
  ○ Datastore load
  ○ Incorrect autoscaling configuration
  ○ Insufficient resource allocation
  ○ Lack of graceful termination → deployment might consume error budget!
  ○ Spike pattern in services (push notification)
    ■ Note that we have a constraint here: we cannot know about publishers' promotions in advance and prepare for them.
      ● We should work with the business side so that we can get information about large-scale promotions ahead of time.
    ■ If the product team knows of a release schedule, it should already be reflected in the demand forecast.

Slide 22

Useful Resources

● Google SRE Book
● Implementing Service Level Objectives
● https://www.slideshare.net/Pivotal/six-simple-steps-to-service-level-objectives-slos
● Solving reliability fears with service level objectives (Google)