Slide 1

Slide 1 text

Getting Started with SLO
Liam Hwang
2021.03.10
Copyright ⓒ All Rights Reserved by Buzzvil

Slide 2

Slide 2 text

Overwhelming Alerts

Throughout the workday, we receive tons of threshold-based alerts: CPU utilization alerts, high numbers of DB connections, even forecast and anomaly detections.
Are these all worth checking out?

Slide 3

Slide 3 text

What defines “working”?

● Can we measure if our service is working?
● Can we get notified when it’s not working?

Does this mean something’s broken? 🤔

Slide 4

Slide 4 text

Thinking of our users

● End Users
  ○ (Buzzvil’s own app or SDK users)
  ○ No problem with ad participation
  ○ Reward is given as expected
● Advertisers / Agencies
  ○ Ad performance (ROAS, CVR, CTR, …)
  ○ Budget is well spent
● Publishers
  ○ Monetizing their apps
  ○ Worried about UX

What defines “working”?
↖ Not working! Obviously..

Slide 5

Slide 5 text

Our Cost of Downtime 💸💸

Can we calculate revenue loss when ad allocation is down?

● Unable to serve ad requests
  ○ Direct revenue loss
  ○ Cost of communication
  ○ We might have to compensate our advertisers for the disruption (service credit, refund, ...)
● Publishers will complain
  ○ Cost of communication
  ○ Potential risk to the partnership itself
● User happiness decreases
  ○ Increased CS
  ○ Unhappy users might drop out

Slide 6

Slide 6 text

Our Cost of Downtime 💸💸

Can we calculate revenue loss when ad allocation is down?

● Unable to serve ad requests
  ○ Direct revenue loss
  ○ Cost of communication
  ○ We might have to compensate our advertisers for the disruption (service credit, refund, ...)
● Publishers will complain
  ○ Cost of communication
  ○ Potential risk to the partnership itself
● User happiness decreases
  ○ Increased CS
  ○ Unhappy users might drop out

A lot of hidden costs come on top of direct revenue loss (operational cost, lost business opportunities, …)
Direct threat to the success of our product!

Slide 7

Slide 7 text

The Reliability Stack
SLI (Service Level Indicator)

● A measurement that is determined over a metric, or a piece of data representing some property of a service
● Most often useful if it can result in a binary “good” or “bad” outcome for each event
● Some examples:
  ○ RPC is successful
  ○ p95 latency (95% of requests) is below 500ms

Figure: The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo

Slide 8

Slide 8 text

The Reliability Stack
SLO (Service Level Objective)

● “Proper level of reliability” targeted by the service
● 99%, 99.9%, 99.95%, 99.99%, ...

Availability = good events / (good + bad) events
Availability = good minutes / total minutes

[Diagram labels: 😔, 😄, SLO, SLI]
Figure: The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo
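The event-based availability ratio on this slide can be sketched in a few lines. This is only an illustration: the function names and the choice to treat "no traffic" as meeting the objective are assumptions, not something the slides specify.

```python
def availability(good: int, bad: int) -> float:
    """Return good events divided by (good + bad) events."""
    total = good + bad
    if total == 0:
        # Assumption: a window with no events counts as fully available.
        return 1.0
    return good / total


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical counts for one measurement window.
sli = availability(good=99_950, bad=50)
print(f"SLI = {sli:.4%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

The same function works for the time-based form by passing good minutes and bad minutes instead of event counts.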

Slide 9

Slide 9 text

The Reliability Stack
Error Budget

● Product & Engineering define SLO targets
● [100% - SLO target]: “budget of unreliability”
● We can have a control loop for utilizing the budget

SLO Target       Yearly allowed downtime     Monthly allowed downtime
99.99% uptime    52 min, 35 sec              4 min, 23 sec
99.95% uptime    4 hours, 22 min, 48 sec     21 min, 54 sec
99.9% uptime     8 hours, 45 min, 57 sec     43 min, 50 sec
99.5% uptime     43 hours, 49 min, 45 sec    3 hours, 39 min
99% uptime       87 hours, 39 min            7 hours, 18 min

[Diagram labels: 😔, 😄, SLO, SLI, Current Budget]
Figure: The basic building blocks of the Reliability Stack. “Implementing Service Level Objectives”, Alex Hidalgo
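The allowed downtime for a target follows directly from [100% - SLO target] multiplied by the window length. A minimal sketch, assuming an average Gregorian year of 365.2425 days and a month of one twelfth of that; the slides do not say which year length their figures use, so results can differ from the listed values by some seconds.

```python
from datetime import timedelta

# Assumption: average Gregorian year; a month is one twelfth of it.
YEAR = timedelta(days=365.2425)
MONTH = YEAR / 12


def allowed_downtime(slo_target: float, window: timedelta) -> timedelta:
    """Return the error budget for a window, expressed as downtime."""
    budget_fraction = 1.0 - slo_target
    return timedelta(seconds=window.total_seconds() * budget_fraction)


for target in (0.9999, 0.9995, 0.999, 0.995, 0.99):
    yearly = allowed_downtime(target, YEAR)
    monthly = allowed_downtime(target, MONTH)
    print(f"{target:.2%} uptime: yearly {yearly}, monthly {monthly}")
```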

Slide 10

Slide 10 text

Request/Response
Typical RPC SLIs (HTTP JSON API, gRPC, …)

● Availability
  ○ Proportion of valid requests processed successfully.
● Latency
  ○ Proportion of valid requests served faster than a threshold.
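Both RPC SLIs are ratios over the same set of valid requests. The sketch below assumes a hypothetical `Request` record and, as one possible definition, treats 4xx responses as invalid (client errors) and anything below 500 as successful; real services may draw those lines differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to serve the request


def valid_requests(requests: Sequence[Request]) -> list:
    # Assumption: client errors (4xx) are excluded from the SLI.
    return [r for r in requests if not 400 <= r.status < 500]


def availability_sli(requests: Sequence[Request]) -> float:
    """Proportion of valid requests processed successfully."""
    valid = valid_requests(requests)
    if not valid:
        return 1.0
    return sum(1 for r in valid if r.status < 500) / len(valid)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Proportion of valid requests served within the threshold."""
    valid = valid_requests(requests)
    if not valid:
        return 1.0
    return sum(1 for r in valid if r.latency_ms <= threshold_ms) / len(valid)


# Hypothetical sample traffic.
sample = [
    Request(200, 120), Request(200, 480), Request(404, 30),
    Request(500, 900), Request(200, 650),
]
print(availability_sli(sample), latency_sli(sample, threshold_ms=500))
```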

Slide 11

Slide 11 text

Data Processing
Typical SLIs for data processing jobs (e.g. Report Generator, ML Training Pipeline, ...)

● Freshness
  ○ The proportion of valid data updated more recently than a threshold.
● Coverage
  ○ The proportion of valid data processed successfully.
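Freshness and coverage can be computed the same way as the request/response SLIs, as a proportion of valid data. A minimal sketch; the function names, the one-hour maximum age, and the sample timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Sequence


def freshness_sli(last_updated: Sequence[datetime], now: datetime,
                  max_age: timedelta) -> float:
    """Proportion of records updated more recently than max_age ago."""
    if not last_updated:
        return 1.0
    fresh = sum(1 for ts in last_updated if now - ts <= max_age)
    return fresh / len(last_updated)


def coverage_sli(processed_ok: int, valid_total: int) -> float:
    """Proportion of valid records that were processed successfully."""
    if valid_total == 0:
        return 1.0
    return processed_ok / valid_total


# Hypothetical data: three reports with different update times.
now = datetime(2021, 3, 10, 12, 0)
updates = [now - timedelta(minutes=10), now - timedelta(hours=2),
           now - timedelta(minutes=30)]
print(freshness_sli(updates, now, max_age=timedelta(hours=1)))
print(coverage_sli(processed_ok=980, valid_total=1000))
```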

Slide 12

Slide 12 text

Choosing SLIs
Simple is best

Pick 1~3 SLIs per user story

Slide 13

Slide 13 text

Choosing SLIs
SLI/SLO should evolve over time

● We can’t pick a perfect SLI/SLO from the beginning.
  ○ Pick simple and typical SLIs first
  ○ For choosing a threshold, we can always refer to historical metrics
  ○ After running for some time, we’ll get feedback from a variety of sources (users, other teams, business, ...)
● Merge user stories when possible (except for some critical paths)
● Don’t spend too much time choosing thresholds. Group user stories into a few buckets.
  ○ e.g. RPC latency
    ■ interactive - 500ms
    ■ background - 5s
    ■ write - 1.5s
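One way to sanity-check a bucket threshold against historical metrics is to replay past latencies and see what the SLI would have been. The sketch below uses the three bucket values from the slide; the sample latencies, function names, and the nearest-rank percentile method are assumptions for illustration.

```python
import math
from typing import Sequence

# Bucket thresholds from the slide, in milliseconds.
LATENCY_BUCKETS_MS = {"interactive": 500, "background": 5000, "write": 1500}


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of the samples (pct between 1 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def historical_compliance(samples_ms: Sequence[float], bucket: str) -> float:
    """Proportion of past requests that were within the bucket threshold."""
    threshold = LATENCY_BUCKETS_MS[bucket]
    return sum(1 for ms in samples_ms if ms <= threshold) / len(samples_ms)


# Hypothetical historical latencies for an interactive endpoint.
history = [120, 200, 340, 450, 480, 510, 700, 90, 150, 300]
print("p95:", percentile(history, 95))
print("within 500ms:", historical_compliance(history, "interactive"))
```

If the replayed proportion is far below the intended SLO target, either the bucket or the target is probably unrealistic as a starting point.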

Slide 14

Slide 14 text

Example - Pixelsvc

● SDK
  ○ Percentage of HTTP GET requests for /buzzvil-pixel.js with 200~399 status, measured at the CDN
● Track Event
  ○ Percentage of HTTP GET requests for /track with 200~399 status, measured at the load balancer | APM
● Data Processing
  ○ Percentage of tracked events successfully processed within 1 hour, measured at the datastore
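The two request-based Pixelsvc SLIs amount to filtering access logs by method and path, then counting 200~399 responses. A rough sketch, assuming a hypothetical `AccessLog` record; the actual log shape at the CDN or load balancer is not described in the slides.

```python
from typing import NamedTuple, Sequence


class AccessLog(NamedTuple):
    method: str
    path: str
    status: int


def success_percentage(logs: Sequence[AccessLog], path: str) -> float:
    """Percentage of GET requests for `path` answered with 200~399."""
    matching = [log for log in logs if log.method == "GET" and log.path == path]
    if not matching:
        return 100.0  # assumption: no traffic does not count against the SLI
    good = sum(1 for log in matching if 200 <= log.status <= 399)
    return 100.0 * good / len(matching)


# Hypothetical log lines.
logs = [
    AccessLog("GET", "/track", 200),
    AccessLog("GET", "/track", 302),
    AccessLog("GET", "/track", 503),
    AccessLog("GET", "/track", 204),
    AccessLog("POST", "/track", 200),
    AccessLog("GET", "/buzzvil-pixel.js", 200),
]
print(success_percentage(logs, "/track"))
print(success_percentage(logs, "/buzzvil-pixel.js"))
```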

Slide 15

Slide 15 text

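Product Teams

● Suppose a service has held its latency at around 500ms; we need to judge whether that is a reasonable range. If users are used to it consistently responding within 500ms, then the day latency reaches 1s they will feel that performance has degraded.
  ○ If latency increases gradually over time because of growing load (a DB bottleneck, or higher server saturation from increased usage), it is hard for us to notice just by looking at monitoring.
  ○ However the 500ms was decided, once it is turned into an SLI the error budget will shrink, so we can notice.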

Slide 16

Slide 16 text

Product Teams
Imagine without SLI..

● What if the latency of a service gradually increases over time?
  ○ Bad autoscaling configuration
  ○ User acquisition
  ○ Increased datastore load
● Someday users will notice the degradation (1s -> 3s).
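A small simulation shows how a latency SLI surfaces this kind of gradual drift through the error budget. Everything here is hypothetical: the day-0 latencies, the 1% daily growth, the 500ms threshold, and the 95% target are chosen only to make the effect visible.

```python
def error_budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, below 0.0 is overspent."""
    allowed_bad = total * (1.0 - slo_target)
    if allowed_bad <= 0:
        raise ValueError("need total > 0 and an SLO target below 100%")
    return 1.0 - (total - good) / allowed_bad


def simulate_latency_drift(days: int, growth_per_day: float,
                           threshold_ms: float = 500.0,
                           slo_target: float = 0.95) -> list:
    """Return the remaining budget after each day of slowly growing latency."""
    # Hypothetical day-0 traffic: 100 requests between 100ms and 496ms.
    base_latencies_ms = list(range(100, 500, 4))
    good = total = 0
    remaining = []
    for day in range(days):
        factor = (1.0 + growth_per_day) ** day
        good += sum(1 for ms in base_latencies_ms if ms * factor <= threshold_ms)
        total += len(base_latencies_ms)
        remaining.append(error_budget_remaining(good, total, slo_target))
    return remaining


for day, left in enumerate(simulate_latency_drift(days=30, growth_per_day=0.01)):
    print(f"day {day:2d}: {left:7.1%} of error budget left")
```

No single day looks alarming on a latency dashboard, yet the cumulative budget keeps shrinking, which is the signal a threshold alert would miss.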

Slide 17

Slide 17 text

Product Teams
Imagine without SLI..

Ops: 🛑 Reliability first!
Product / Engineering: 🚀 Velocity first!
💦 friction...

Slide 18

Slide 18 text

Utilizing Error Budget

[Gauge labels: 99%, 99.9%, 99.95%, SLI 😄]
🚨 Danger zone: freeze, and improve reliability
🔬 Do experiments, feature releases, ...
Monitoring, some maintenance
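The three zones can be read as a simple policy function over the current SLI. The mapping below is one interpretation of the gauge, not something the slide states outright: it assumes 99.9% is the SLO target and 99.95% is the level above which there is budget to spare.

```python
def budget_policy(sli_percent: float,
                  slo_percent: float = 99.9,
                  comfortable_percent: float = 99.95) -> str:
    """Map the current SLI to an action, under the assumed thresholds."""
    if sli_percent < slo_percent:
        # Budget exhausted: danger zone.
        return "freeze and improve reliability"
    if sli_percent < comfortable_percent:
        # Meeting the SLO, but with little budget to spare.
        return "monitoring and some maintenance"
    # Plenty of budget left to spend.
    return "experiments and feature releases"


for sli in (99.0, 99.92, 99.97):
    print(f"SLI {sli}% -> {budget_policy(sli)}")
```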

Slide 19

Slide 19 text

We’ve got nice tools

Let’s push every event to Datadog and create SLO metrics!
● AWS metrics, Sentry exceptions

Slide 20

Slide 20 text

Product Teams
Some useful practices we already have

● Design Doc - needs some improvements
● Production readiness checklist
  ○ CI/CD
  ○ Helm chart, templates, pipeline with some nice defaults to start with
  ○ Graceful shutdown
  ○ Monitoring (APM, Sentry)
● Infrastructure management (RDS, DynamoDB, …) - with help from DevOps
● Define SLI/SLO

Slide 21

Slide 21 text

Product Teams

● What factors affected the error budget?
  ○ Datastore load
  ○ Incorrect autoscaling configuration
  ○ Insufficient resource allocation
  ○ Lack of graceful termination → deployment might consume error budget!
  ○ Spike pattern in services (push notification)
    ■ Note: we have the constraint that we cannot know about publishers’ promotions in advance and prepare for them.
      ● We should work with the business team so that we can get information about large promotions ahead of time.
    ■ If the product team knows of a release schedule, it should already be reflected in the demand forecast.

Slide 22

Slide 22 text

Useful Resources

● Google SRE Book
● Implementing Service Level Objectives
● https://www.slideshare.net/Pivotal/six-simple-steps-to-service-level-objectives-slos
● Solving reliability fears with service level objectives (Google)

Slide 23

Slide 23 text

Thank you