『スタディサプリ』における SLI/SLO の継続的改善 / Continuous improvement of SLI/SLO at StudySapuri

Takeshi Kondo

May 16, 2023

Transcript

 1. Continuous improvement of SLI/SLO at StudySapuri
  Takeshi Kondo / @chaspy
  2023/05/13
  SLOconf Tokyo 2023

 2. Who am I
  chaspy chaspy_
  Engineering Manager

  Site Reliability and Web Application Development

  at Recruit Co., Ltd.
  Takeshi Kondo
  https://chaspy.me


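 3. SRE NEXT 2020 & 2022
  • 2020
  • A case study of trying to introduce SLI/SLO into an organization that had no vocabulary for them
  • 2022
  • What happened after SLI/SLO were introduced
  • What I concluded is needed to advance Site Reliability Engineering across the whole organization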

 4. SRE & Web Application Development
  Timeline: 2018, 2019, 2020, 2021, 2022, 2023
  • Joined Quipper
  • SRE NEXT
  • Worked hard to introduce SLOs to the organization
  • Also joined the web development team as Engineering Manager
  • SRE NEXT
  • Became Engineering Manager
  • SLOconf Tokyo ✨

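 5. SRE & Web Application Development
  Timeline: 2018, 2019, 2020, 2021, 2022, 2023
  • Joined Quipper
  • SRE NEXT
  • Worked hard to introduce SLOs to the organization
  • SRE NEXT
  • Became Engineering Manager
  • Also joined the web development team as Engineering Manager
  • SLOconf Tokyo ✨
  Today I will speak from a developer's point of view!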

 6. Today's key message
  Once you have decided on your SLIs/SLOs, keep revisiting them continuously.

 7. I will talk about what happened when we set them once and never revisited them 😅

 8. Continuous improvement of SLI/SLO at StudySapuri
  ...is something we are about to start doing: that is what this talk is about
  Takeshi Kondo / @chaspy
  2023/05/13
  SLOconf Tokyo 2023

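 9. Outline
  • Self-introduction
  • About the StudySapuri Junior High School Course
  • What are SLIs/SLOs for?
  • Current state and challenges of service operations
  • What we actually did to address the challenges
  • Summary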

 11. Background: product introduction - StudySapuri

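 13. Relaunched in February 2022
  • Everything other than the user platform was developed as new microservices over two years
  • One year has passed since release, and we are still enhancing it continuously
  https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html
  Highlights of the relaunch!
  • Personalized learning support through "This Week's Missions" and a repeated-practice feature
  • Greatly expanded exercise volume and range of difficulty
  • New courses arriving one after another, including the "Regular Test Preparation Course"
  • Completely redesigned learning screens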

 14. Note: "tara" is the code name of this relaunch project; it recently became public in an interview https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/
  (Architecture diagram; label: pre-existing services, including the user platform)

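 16. Why SLI/SLO?
  • To decide on a fact basis whether to spend time on feature development or on non-functional work
  • While error budget remains, you do not deal with every individual error
  • This helps avoid burnout
  • While budget remains, you can take risks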
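
The error budget idea above can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the talk: the function name, the return shape, and the example request counts are all assumptions chosen for illustration.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Compute the error budget for one SLO window.

    slo_target is a fraction (0.999 means 99.9%). The names and the
    return shape are illustrative assumptions, not from the talk.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # Failures the SLO tolerates in this window.
    allowed = total_requests * (1.0 - slo_target)
    # Budget still unspent; negative means the SLO has been violated.
    remaining = allowed - failed_requests
    return {
        "allowed_failures": allowed,
        "remaining": remaining,
        "exhausted": remaining < 0,
    }


if __name__ == "__main__":
    budget = error_budget(0.999, 1_000_000, 400)
    print(budget)
```

With a 99.9% target and one million requests, roughly 1,000 failures are tolerated; as long as `exhausted` is false, the team can keep shipping features instead of chasing individual errors.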

 18. SLIs/SLOs: 8 in total
  (Architecture diagram; labels: Reverse Proxy (Nginx); pre-existing services, including the user platform; ②, ③)
  • (a) Availability and (b) Latency
  • Based on HTTP metrics
  • Both kinds (a/b) at each of the following 4 points
  • ① api-gateway
  • ② api-gateway -> main
  • ③ api-gateway -> content
  • ④ main -> content
  • SLO
  • Availability: 99.9%
  • Latency: 95th percentile < 1000 msec
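
As a rough illustration of how the two SLIs above could be evaluated against their objectives, here is a small sketch. It assumes that only HTTP 5xx responses count against availability and uses the nearest-rank method for the 95th percentile; neither choice is stated in the slides, and the `Request` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def availability(requests: List[Request]) -> float:
    """Fraction of requests that did not end in a 5xx (assumed definition)."""
    ok = sum(1 for r in requests if r.status < 500)
    return ok / len(requests)


def p95_latency(requests: List[Request]) -> float:
    """95th percentile latency using the nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def meets_slo(requests: List[Request],
              availability_target: float = 0.999,
              latency_threshold_ms: float = 1000.0) -> bool:
    """Check both objectives from the slide: 99.9% and p95 < 1000 msec."""
    return (availability(requests) >= availability_target
            and p95_latency(requests) < latency_threshold_ms)


if __name__ == "__main__":
    sample = [Request(200, float(i)) for i in range(1, 101)]
    print(availability(sample), p95_latency(sample), meets_slo(sample))
```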

 19. Why Envoy?
  • At the time, there was no way to obtain metrics for traffic between microservices
  • Not a service mesh with a control plane: just plain Envoy added as a sidecar container

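 20. DevSupport: a daily rotating duty that handles routine operations
  • Check Slack notifications and investigate the causes
  • Sentry Exception, SLO Alert, GCP Pub/Sub Dead Letter
  • Escalate anything that needs manual handling to the responsible team
  • First-line handling of CS (Customer Support) inquiries
  • First-line handling of mentions addressed to the whole team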

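 21. The problem we had: No SLO Alert
  • Not a single SLO alert has fired since release
  • The volume of Sentry exceptions does not seem to be reflected in the SLIs
  • What is going on?
  • At the very least, the fact that we look at Sentry exceptions one by one means we are not making use of the error budget concept
  • Is the SLO looser than our own expectations?
  • Are the SLIs configured incorrectly?
  • We investigated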

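 23. Hypothesis: Is something wrong with the Envoy metrics (SLIs ②③④)?
  • Yes
  • Some of the exceptions were DNS name resolution failures
  • In other words, they never got as far as an HTTP request
  • So naturally they are not counted in envoy.cluster.upstream_rq_2xx
  • Pattern: name resolution fails during communication ②
  • It would be fine if SLI ① captured this, but...?
  Normal communication flow:
  • The tara-api-gateway container resolves the name in http://tara-main (this is the step that failed)
  • It then communicates via http://tara-main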
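
The gap described above, where failures that happen before any HTTP request is sent never reach an HTTP-status-based SLI, can be reproduced in a few lines. This is a simplified simulation under stated assumptions: `send_with_sli`, the counter names, and the stand-in `resolve`/`send` callables are hypothetical and do not model Envoy itself.

```python
import socket
from collections import Counter
from typing import Callable, Optional


def send_with_sli(counters: Counter,
                  resolve: Callable[[], str],
                  send: Callable[[str], int]) -> Optional[int]:
    """Resolve the upstream name, send the request, and record the outcome."""
    counters["attempts"] += 1
    try:
        address = resolve()
    except socket.gaierror:
        # No HTTP request was ever sent, so no HTTP status exists to count.
        counters["failed_before_request"] += 1
        return None
    status = send(address)
    counters[f"http_{status // 100}xx"] += 1
    return status


def availability_http_only(c: Counter) -> float:
    """SLI built only from HTTP status counters (blind to DNS failures)."""
    total = sum(v for k, v in c.items() if k.startswith("http_"))
    return (total - c["http_5xx"]) / total if total else 1.0


def availability_all_attempts(c: Counter) -> float:
    """SLI that also counts attempts that failed before the request."""
    if not c["attempts"]:
        return 1.0
    good = c["attempts"] - c["failed_before_request"] - c["http_5xx"]
    return good / c["attempts"]


def failing_resolve() -> str:
    raise socket.gaierror("simulated name resolution failure")


if __name__ == "__main__":
    counters = Counter()
    for _ in range(8):
        send_with_sli(counters, lambda: "10.0.0.1", lambda addr: 200)
    for _ in range(2):
        send_with_sli(counters, failing_resolve, lambda addr: 200)
    print(availability_http_only(counters), availability_all_attempts(counters))
```

In this simulation the HTTP-only SLI reports full availability even though two of ten attempts failed, which mirrors why no alert fired.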

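 24. Hypothesis 2: Is something wrong with the Reverse Proxy metrics (SLI ①)?
  • Yes
  • When a GraphQL request failed partway through, the HTTP response was still 200 😱
  • At release we had decided that partial failures would be returned as 500, but that was not actually being done
  Normal communication flow:
  • The client accesses the StudySapuri junior high course site over HTTPS and reaches the Reverse Proxy
  • The Reverse Proxy proxies to tara-api-gateway (②)
  • tara-api-gateway communicates with tara-main (②) <- the error occurs here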

 25. Fix 1: Return HTTP 500 on a GraphQL error
  • GraphQL itself is not concerned with HTTP
  • The behavior depends on the GraphQL server library
  • Some practices standardize on always returning response status 200
  • Clients look at the errors field of the response for errors anyway, so this causes no problem
  A colleague fixed it in no time 🙏
  Special Thanks @Quramy
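
A minimal sketch of the rule in Fix 1, assuming the response body follows the usual GraphQL shape with an optional top-level `errors` list. The function name is hypothetical, and a real server would apply this inside whatever hook its GraphQL library provides, which the slides do not specify.

```python
from typing import Any, Dict


def http_status_for_graphql(body: Dict[str, Any]) -> int:
    """Pick the HTTP status for a GraphQL response body.

    Returns 500 whenever the body carries a non-empty "errors" list,
    including partial failures where "data" is also present, so that
    HTTP-based SLIs can see the failure.
    """
    return 500 if body.get("errors") else 200


if __name__ == "__main__":
    ok = {"data": {"lesson": {"id": "1"}}}
    partial = {"data": {"lesson": None}, "errors": [{"message": "upstream failed"}]}
    print(http_status_for_graphql(ok), http_status_for_graphql(partial))
```

Clients keep reading the `errors` field as before; only the status code seen by the proxy and the metrics pipeline changes.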

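 26. Fix 2: Drop Envoy and use Datadog APM metrics
  • To reduce the troubleshooting difficulty that comes from complexity
  • It is not that the Envoy metrics themselves were at fault
  • There were many operational issues, and we were getting no benefit beyond metrics collection
  • A circuit breaker was configured, but it almost never tripped
  • Keeping up with Envoy version upgrades (we were not managing to)
  • Controlling the startup and shutdown order of sidecar containers in a Pod (errors occur unless the app waits for envoy)
  • How to patch when using Rollouts (resources once ended up inverted and caused an incident)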

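 27. Side note: Datadog APM has quite a few quirks (1)
  • By default, the resource tag of the http client APM plugin is the HTTP method
  • To use it as a per-destination SLI, the hostname is needed
  • Handled separately for Node and Ruby
  • Agreed within the organization on a naming convention for the http-client resource tag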

 28. Side note: Datadog APM has quite a few quirks (2)
  • trace.http.request.errors does not cover HTTP 5xx
  • Conversely, 4xx is covered
  • You need to use trace.http.request.hits.by_http_status instead
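
Once hit counts are available per HTTP status, an availability SLI can be derived without relying on a generic error counter. The sketch below works on a plain mapping of status code to count; it does not call any Datadog API, and treating only 5xx as failures is an assumption.

```python
from typing import Mapping


def availability_from_status_counts(hits_by_status: Mapping[int, int]) -> float:
    """Availability = share of hits whose status is not 5xx.

    hits_by_status maps an HTTP status code to its hit count, e.g. values
    exported from a per-status hits metric (the export step is not shown).
    """
    total = sum(hits_by_status.values())
    if total == 0:
        return 1.0  # no traffic: treat the window as fully available
    server_errors = sum(
        count for status, count in hits_by_status.items() if 500 <= status <= 599
    )
    return (total - server_errors) / total


if __name__ == "__main__":
    counts = {200: 9990, 404: 5, 500: 3, 503: 2}
    print(availability_from_status_counts(counts))
```

Here 4xx responses count as served requests and only the five 5xx hits reduce availability, which is the opposite of the behavior described for the error counter above.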

 29. We revised the SLOs
  (Architecture diagram; labels: Reverse Proxy (Nginx); pre-existing services, including the user platform; ②, ③)
  • (a) Availability and (b) Latency
  • Based on HTTP metrics
  • Both kinds (a/b) at each of the following points
  • ① api-gateway
  • ② api-gateway -> main
  • ③ api-gateway -> content
  • ④ main -> content
  • 🆕 ⑤ api-gateway -> requests to the user platform
  • SLO
  • Availability: 99.9%
  • Latency: 95th percentile < 1000 msec
  • -> now 100 to 500 msec, set per service with its current state taken into account
  Abolished the per-microservice SLIs/SLOs
  • The benefit of splitting the SLIs did not justify the cost of managing multiple SLIs/SLOs
  Added SLIs/SLOs for the user platform
  • The shared SLI for the user platform had been using Envoy metrics. Because Envoy was removed, we added SLIs/SLOs based on Datadog APM metrics

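 30. Revising DevSupport
  • For Sentry exceptions, ignore everything that is not caused by application code
  • Documented the basic response policy for when an SLO alert arrives
  • Items that could not be handled on the day are handled by the team once every two weeks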

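 32. What had been happening
  • The SLIs/SLOs that were set once at release had never been revisited
  • The SLOs were delivering no value at all
  • We revised both the SLIs and the SLOs, and decided to keep revisiting them from now on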

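 33. What should we have done?
  • Developer's perspective
  • Periodically inspect whether the SLOs are really delivering value
  • A duty rotation is fine, but it is also important to occasionally take time for everyone to look together
  • With a daily rotation, there is little incentive to investigate deeply
  • SRE's perspective
  • Build a mechanism for development teams to revisit their SLIs/SLOs periodically
  • Build a mechanism that automatically generates SLIs/SLOs for each workload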
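
One possible shape for the last idea, generating SLI/SLO definitions per workload, is sketched below. Everything here is hypothetical: the `SloSpec` type, the default latency threshold, and the output format are assumptions for illustration, not the mechanism the team built or planned.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class SloSpec:
    workload: str
    sli: str        # "availability" or "latency_p95"
    objective: str  # human-readable objective


def generate_slos(workloads: Iterable[str],
                  latency_ms: Optional[Dict[str, int]] = None,
                  default_latency_ms: int = 500,
                  availability_pct: float = 99.9) -> List[SloSpec]:
    """Emit one availability and one latency SLO per workload.

    latency_ms holds per-workload overrides; workloads without an
    override fall back to default_latency_ms (an assumed default).
    """
    overrides = latency_ms or {}
    specs: List[SloSpec] = []
    for name in workloads:
        threshold = overrides.get(name, default_latency_ms)
        specs.append(SloSpec(name, "availability", f">= {availability_pct}%"))
        specs.append(SloSpec(name, "latency_p95", f"< {threshold} msec"))
    return specs


if __name__ == "__main__":
    for spec in generate_slos(["api-gateway", "main"], {"main": 100}):
        print(spec)
```

Generating definitions from a single list keeps every workload covered by default, while the override map leaves room for the per-service thresholds that a periodic review would adjust.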

 34. Today's key message
  Once you have decided on your SLIs/SLOs, keep revisiting them continuously.

 35. Thank you!
  chaspy chaspy_
  Engineering Manager

  Site Reliability and Web Application Development

  at Recruit Co., Ltd.
  Takeshi Kondo
  https://chaspy.me
