Slide 1

Slide 1 text

Continuous Improvement of SLI/SLO at "StudySapuri" Takeshi Kondo / @chaspy 2023/05/13 SLOconf Tokyo 2023

Slide 2

Slide 2 text

Who am I chaspy chaspy_ Engineering Manager Site Reliability and Web Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me

Slide 3

Slide 3 text

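SRE NEXT 2020 & 2022 • 2020 • A case study of trying to introduce SLI/SLO into an organization where the terms did not yet exist • 2022 • What happened after introducing SLI/SLO • Thoughts on what it takes to advance Site Reliability Engineering across the whole organization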

Slide 4

Slide 4 text

SRE & Web Application Development 2018 2019 2020 2021 2022 2023 • Joined Quipper • SRE NEXT • Working hard to introduce SLOs to the organization • Also joined the web development team as Engineering Manager • SRE NEXT • Became Engineering Manager • SLOconf Tokyo ✨

Slide 5

Slide 5 text

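SRE & Web Application Development 2018 2019 2020 2021 2022 2023 • Joined Quipper • SRE NEXT • Working hard to introduce SLOs to the organization • SRE NEXT • Became Engineering Manager • Also joined the web development team as Engineering Manager • Today I will speak from a developer's perspective! • SLOconf Tokyo ✨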

Slide 6

Slide 6 text

Today's message: once you have decided on SLI/SLO, keep reviewing them continuously

Slide 7

Slide 7 text

This is the story of what happened when we set them once and never reviewed them 😅

Slide 8

Slide 8 text

Continuous Improvement of SLI/SLO at "StudySapuri": a talk about how we are going to start doing it from now on Takeshi Kondo / @chaspy 2023/05/13 SLOconf Tokyo 2023

Slide 9

Slide 9 text

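Outline • Self-introduction • About the "StudySapuri Junior High School Course" • What are SLI/SLO for? • Current state and challenges of service operations • What we actually did to address the challenges • Summary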

Slide 10

Slide 10 text

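Outline • Self-introduction • About the "StudySapuri Junior High School Course" • What are SLI/SLO for? • Current state and challenges of service operations • What we actually did to address the challenges • Summary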

Slide 11

Slide 11 text

Background: product introduction - StudySapuri

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

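Renewed in February 2022 • Everything except the user platform was developed as new microservices over two years • One year has passed since the release, and we are still enhancing it continuously https://www.recruit.co.jp/newsroom/pressrelease/2022/0131_9881.html Renewal highlights! • Individualized learning support through "This Week's Missions" and a repeated-practice feature • Greatly expanded practice volume and range of difficulty • New courses arriving one after another, including the "Regular Test Preparation Course" • Completely redesigned learning screens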

Slide 14

Slide 14 text

Note: "tara" is the code name of this renewal project; it recently became public in an interview https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ Diagram label: the pre-existing services, including the user platform

Slide 15

Slide 15 text

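Outline • Self-introduction • About the "StudySapuri Junior High School Course" • What are SLI/SLO for? • Current state and challenges of service operations • What we actually did to address the challenges • Summary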

Slide 16

Slide 16 text

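Why SLI/SLO? • To decide, on a fact basis, whether to spend time on feature development or on non-functional development • While error budget remains, do not deal with every single error • This avoids burnout • While budget remains, you can take risks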
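The error budget idea on this slide can be sketched in a few lines. This is a minimal illustration, not the team's tooling: the function names and the simple "budget left means feature work" rule are assumptions made for the example.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


def next_focus(remaining: float) -> str:
    """Hypothetical decision rule: spend time on features while budget remains."""
    return "feature work" if remaining > 0 else "reliability work"


# 99.9% target over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.2f}", next_focus(remaining))
```

With 250 failures out of an allowance of about 1,000, roughly three quarters of the budget is left, so individual errors do not need to be chased one by one.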

Slide 17

Slide 17 text

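Outline • Self-introduction • About the "StudySapuri Junior High School Course" • What are SLI/SLO for? • Current state and challenges of service operations • What we actually did to address the challenges • Summary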

Slide 18

Slide 18 text

Note: "tara" is the code name of this renewal project; it recently became public in an interview https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ Diagram labels: the pre-existing services, including the user platform; Reverse Proxy (Nginx); measurement points ① ② ③ ④ • 8 SLIs/SLOs in total • (a) Availability and (b) Latency • Uses http metrics • Both kinds (a/b) at each of the following 4 points • ① api-gateway • ② api-gateway -> main • ③ api-gateway -> content • ④ main -> content • SLO • Availability: 99.9% • Latency: 95th percentile < 1000 msec
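As a rough sketch of what the two SLIs on this slide measure, the following computes availability and 95th percentile latency from a list of request records and checks them against the stated targets. The `Request` type, the nearest-rank percentile method, and treating every non-5xx response as "good" are assumptions for illustration; the slide does not say how the real SLIs classify responses.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def availability(requests: list[Request]) -> float:
    """Fraction of requests that did not end in a 5xx (assumed definition of 'good')."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def p95_latency_ms(requests: list[Request]) -> float:
    """95th percentile latency using the nearest-rank method."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def meets_slo(requests: list[Request],
              availability_target: float = 0.999,
              latency_limit_ms: float = 1000.0) -> bool:
    """Check both targets from the slide: 99.9% availability, p95 < 1000 msec."""
    return (availability(requests) >= availability_target
            and p95_latency_ms(requests) < latency_limit_ms)
```

Each of the four measurement points would get its own pair of checks, which is where the total of eight SLIs/SLOs comes from.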

Slide 19

Slide 19 text

Why Envoy? • At the time, there was no way to collect metrics between microservices • Not a Service Mesh with a Control Plane; we simply ran plain Envoy as a sidecar container

Slide 20

Slide 20 text

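DevSupport: a daily rotating duty that handles routine operational work • Check Slack notifications and investigate causes • Sentry Exception, SLO Alert, GCP Pub/Sub Dead Letter • Escalate anything that needs manual handling to the relevant team • First responder for CS (Customer Support) inquiries • First responder for mentions addressed to the whole group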

Slide 21

Slide 21 text

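The problem we had: No SLO Alert • Not once since the release has an SLO Alert fired • The volume of Sentry Exceptions does not seem to be reflected in the SLIs • What is going on? • At the very least, as long as we look at Sentry Exceptions one by one, we are not making use of the Error Budget concept • Is the SLO looser than our expectations? • Is the SLI configured incorrectly? • We investigated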

Slide 22

Slide 22 text

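Outline • Self-introduction • About the "StudySapuri Junior High School Course" • What are SLI/SLO for? • Current state and challenges of service operations • What we actually did to address the challenges • Summary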

Slide 23

Slide 23 text

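Hypothesis: is something wrong with the Envoy metrics (SLI ②③④)? • Yes • Some of the Exceptions were failures in DNS name resolution • In other words, they never became an http request • It is no surprise that these are not counted in envoy.cluster.upstream_rq_2xx • The pattern where name resolution fails during communication ② • It would be fine if SLI ① captured this, but...? Normal communication: the tara-api-gateway container resolves the name for http://tara-main (this is the step that failed), then communicates with http://tara-main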
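The blind spot described on this slide can be reproduced with a toy simulation: a call that fails at name resolution never becomes a request, so a request-based SLI never sees it. Everything here is hypothetical (the function names, the host set, and the plain counters standing in for proxy statistics); it only illustrates the counting logic.

```python
from collections import Counter


def call_upstream(host, resolvable_hosts, proxy_metrics, app_errors):
    """Simulate one outbound call; the proxy counts only requests that are actually sent."""
    if host not in resolvable_hosts:
        # Name resolution failed: the application sees an exception,
        # but no http request exists for the proxy to count.
        app_errors.append(f"could not resolve {host}")
        return None
    proxy_metrics["upstream_rq_total"] += 1
    proxy_metrics["upstream_rq_2xx"] += 1
    return 200


def availability_sli(proxy_metrics):
    """Request-based availability: 2xx responses divided by requests sent."""
    total = proxy_metrics["upstream_rq_total"]
    return 1.0 if total == 0 else proxy_metrics["upstream_rq_2xx"] / total


metrics, errors = Counter(), []
for host in ["tara-main", "tara-main", "unresolvable-host"]:
    call_upstream(host, {"tara-main"}, metrics, errors)
print(availability_sli(metrics), len(errors))
```

The SLI stays at 100% while the application still records an error, which matches the gap between Sentry Exceptions and the silent SLO Alert.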

Slide 24

Slide 24 text

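Hypothesis 2: is something wrong with the Reverse Proxy metrics (SLI ①)? • Yes • When a GraphQL request failed partway through, the http response was still 200 😱 • At release we had decided to return 500 for partial failures, but that was not what was happening Normal communication: the Client accesses the junior learn site on studysapuri.jp over https and reaches the Reverse Proxy; the Reverse Proxy proxies to tara-api-gateway ②; tara-api-gateway communicates with tara-main ② (the error occurs here)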

Slide 25

Slide 25 text

Fix 1: return http 500 when there is a GraphQL Error • GraphQL itself does not care about http • The behavior depends on the GraphQL server library • Some practices standardize on Response status 200 • Clients look at errors in the Response to detect an Error, so this causes no problem A colleague fixed it in no time 🙏 Special Thanks @Quramy
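The rule in Fix 1 fits in one small function. This sketch assumes the response body is a dict with the standard GraphQL `data` and `errors` keys; the function name is hypothetical, and the actual change was made in the team's GraphQL server, whose code the slides do not show.

```python
def http_status_for(graphql_response: dict) -> int:
    """Return 500 if the GraphQL response carries any errors, otherwise 200."""
    errors = graphql_response.get("errors")
    return 500 if errors else 200


# A partial failure: some data was resolved, but one field raised an error.
partial_failure = {
    "data": {"lesson": None},
    "errors": [{"message": "upstream unavailable", "path": ["lesson"]}],
}
print(http_status_for(partial_failure))
```

Because the status code now reflects the failure, an http-based availability SLI at the reverse proxy can count it, while clients keep reading `errors` from the body as before.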

Slide 26

Slide 26 text

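Fix 2: drop Envoy and use Datadog APM metrics • To reduce how hard troubleshooting is due to complexity • This is not because the Envoy metrics themselves were faulty • There were many operational burdens, and we gained no benefit beyond collecting metrics • A Circuit Breaker was in place, but it almost never triggered • Keeping up with Envoy version upgrades (we were not) • Controlling the start and stop order of sidecar containers in a Pod (errors occur unless you wait for envoy) • How to apply a Patch when using Rollouts (a Resource inversion once caused an incident)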

Slide 27

Slide 27 text

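Side note: Datadog APM has quite a few quirks (1) • By default, the resource tag of the http client APM Plugin is the http method • To use it as a per-destination SLI, the hostname is needed • Handled separately for Node and Ruby • Agreed on a naming convention for the http-client resource tag within the organization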
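A naming convention of this kind could look like the sketch below. The "METHOD hostname" format and the `resource_name` helper are purely hypothetical: the slides state only that a hostname is needed and that a convention was agreed, not what it is or how it was wired into the Node and Ruby tracers.

```python
from urllib.parse import urlsplit


def resource_name(method: str, url: str) -> str:
    """Build a per-destination resource name such as 'POST tara-main' (hypothetical format)."""
    host = urlsplit(url).hostname or "unknown-host"
    return f"{method.upper()} {host}"


print(resource_name("post", "http://tara-main/graphql"))
```

With the destination in the tag, requests to different upstream services can be separated into different SLIs instead of being lumped together by http method.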

Slide 28

Slide 28 text

Side note: Datadog APM has quite a few quirks (2) • trace.http.request.errors does not cover http 5xx • Conversely, it does cover 4xx • You need to use trace.http.request.hits.by_http_status
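Once hit counts are broken down by status code, an availability SLI that treats only 5xx as failures is a short calculation. The sketch works on a plain dict of counts and makes no Datadog API calls; the function name and the input shape are assumptions for illustration.

```python
def availability_from_status_counts(hits_by_status: dict) -> float:
    """Availability where only 5xx responses count against the SLI; 4xx count as good."""
    total = sum(hits_by_status.values())
    if total == 0:
        return 1.0  # no traffic, nothing failed
    server_errors = sum(n for status, n in hits_by_status.items()
                        if 500 <= status <= 599)
    return (total - server_errors) / total


counts = {200: 9990, 404: 5, 500: 5}
print(availability_from_status_counts(counts))
```

Counting by status this way sidesteps the quirk on this slide, where the errors metric includes 4xx but misses 5xx.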

Slide 29

Slide 29 text

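Note: "tara" is the code name of this renewal project; it recently became public in an interview https://brand.studysapuri.jp/career/interview/article/Saori_Suzuki/ Diagram labels: the pre-existing services, including the user platform; Reverse Proxy (Nginx); measurement points ① ② ③ ④ ⑤ • Reviewed the SLOs • (a) Availability and (b) Latency • Uses http metrics • Both kinds (a/b) at each of the following 4 points • ① api-gateway • ② api-gateway -> main • ③ api-gateway -> content • ④ main -> content • 🆕 ⑤ api-gateway -> requests to the user platform • SLO • Availability: 99.9% • Latency: 95th percentile < 1000 msec • -> adjusted per service to 100~500 msec, taking current performance into account Callout: abolished the per-microservice SLI/SLO, because the benefit of splitting the SLIs did not justify the cost of managing multiple SLIs/SLOs Callout: added SLI/SLO for the user platform. The shared SLI for the user platform had been using envoy metrics; since envoy was removed, we added SLI/SLO based on Datadog APM metrics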

Slide 30

Slide 30 text

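DevSupport review • For Sentry Exceptions, Ignore everything that is not caused by application code • Documented the basic response policy for when an SLO Alert arrives • Items that could not be handled on the day are addressed by the team once every two weeks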

Slide 31

Slide 31 text

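Outline • Self-introduction • About the "StudySapuri Junior High School Course" • What are SLI/SLO for? • Current state and challenges of service operations • What we actually did to address the challenges • Summary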

Slide 32

Slide 32 text

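What was happening • The SLI/SLO set once at release had never been reviewed • The SLOs were delivering no value at all • We reviewed both the SLIs and the SLOs, and decided to keep reviewing them continuously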

Slide 33

Slide 33 text

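What should we have done? • Developer perspective • Regularly inspect whether the SLOs are really delivering value • A duty rotation is fine, but it is also important to occasionally set aside time for everyone to look together • With a daily rotation, there is no incentive to dig deep • SRE perspective • Build a mechanism for development teams to review SLI/SLO regularly • Build a mechanism that automatically generates SLI/SLO per workload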
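The last bullet, generating SLI/SLO per workload, might start from something as simple as the sketch below. The dict layout, field names, and the `default_slos` helper are all hypothetical; the defaults (99.9% availability, 95th percentile latency) follow the targets quoted earlier in the deck, with the latency threshold left adjustable per service.

```python
def default_slos(workload: str, latency_threshold_ms: int = 1000) -> list:
    """Generate a default availability and latency SLO definition for one workload."""
    return [
        {"name": f"{workload}-availability",
         "type": "availability",
         "target": 0.999},
        {"name": f"{workload}-latency",
         "type": "latency",
         "percentile": 95,
         "threshold_ms": latency_threshold_ms},
    ]


# One definition set per workload, with a tighter latency threshold for one of them.
definitions = default_slos("api-gateway") + default_slos("main", latency_threshold_ms=500)
print([d["name"] for d in definitions])
```

Generated defaults are only a starting point; the talk's main message still applies, and each definition should be revisited regularly against what the team actually observes.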

Slide 34

Slide 34 text

Today's message: once you have decided on SLI/SLO, keep reviewing them continuously

Slide 35

Slide 35 text

Thank you! chaspy chaspy_ Engineering Manager Site Reliability and Web Application Development at Recruit Co., Ltd. Takeshi Kondo https://chaspy.me