Observability — Extending Into Incident Response

Slides from my talk at Observability Conference Tokyo 2025.
https://o11ycon.jp/

Narimichi Takamura

October 27, 2025

Transcript

  2. Topotal, Inc. (とぽたる)
    • https://topotal.com
    • A startup with SRE as its core focus
    • Operates two businesses
      • SRE as a Service
      • SaaS for SRE (Waroom)
    • Sponsor of this event
    • We are demoing our SaaS at the booth, so please stop by!
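  3. SRE as a Service
    • https://sre-as-a-service.com
    • A technical support service specializing in SRE
    • Examples of support
      • Introducing SLIs/SLOs and improving how they are operated
      • Designing and implementing observability
      • Improving incident management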
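  7. Session overview
    • Improved incident response (IR) is often cited as an example of the benefits of observability (o11y)
    • IR feels like it has improved, but showing that effect quantitatively is difficult
    => As a builder of an IR SaaS and as an SRE, I will talk about practices for improving IR quantitatively (practical TTX metrics).
    => Toward the end, I will also dig into the theme of raising the observability of the IR process (rather than of software).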
  8. Agenda
    1. Motivation
    2. Problems with MTTR
    3. Defining practical TTX metrics
    4. Putting TTX metrics to use
    5. Applying o11y to the incident response domain
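  9. What is MTTR (mean time to recovery)?
    • The average time from when a failure occurs until it is repaired or recovered
    • Short for Mean Time To Recovery (Repair, Resolve, Restore)
    • How to calculate it [1]
      • MTTR = total repair time / number of failures
    [1] "MTTR(平均復旧時間)とは?計算方法とMTBFとの故障率・稼働率における関係" (What is MTTR (mean time to recovery)? How to calculate it, and how it relates to MTBF in failure rate and availability)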
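As a minimal worked example of the formula above, the sketch below computes MTTR for a hypothetical list of repair times in minutes. The function name `mttr` and the sample values are illustrative assumptions, not data from the talk.

```python
def mttr(repair_minutes):
    """MTTR = total repair time / number of failures."""
    if not repair_minutes:
        raise ValueError("at least one incident is required")
    return sum(repair_minutes) / len(repair_minutes)


# Hypothetical repair times in minutes (made up for illustration).
repair_minutes = [30, 45, 60, 25]
print(mttr(repair_minutes))  # 160 / 4 = 40.0
```

Because MTTR is a plain arithmetic mean, a single very long incident can move it substantially.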
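  11. Verifying the usefulness of MTTR
    1. Randomly split the incident dataset [2] into two
    2. Reduce the time to repair (TTR) in one of the datasets by 10%
    3. Calculate the MTTR (mean time to repair) of each dataset
    4. Take the difference in MTTR between the datasets
      • diff = MTTR(unmodified) - MTTR(modified)
    5. Calculate the MTTR reduction ratio (%)
      • reduction ratio = diff / MTTR(unmodified)
    6. Repeat steps 1 to 4 100,000 times
    [2] Unveiling the black box with observability stack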
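The procedure above can be sketched as a small Monte Carlo simulation. This is only an illustration under stated assumptions: the talk uses a real incident dataset and 100,000 iterations, whereas this sketch generates synthetic log-normal TTR values and defaults to 1,000 trials. The function name `simulate_mttr_reduction` and all parameters are hypothetical.

```python
import random
import statistics


def simulate_mttr_reduction(ttrs, improvement=0.10, trials=1000, seed=0):
    """Return the observed MTTR reduction ratio for each trial."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        # Step 1: randomly split the dataset into two halves.
        shuffled = list(ttrs)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        unmodified = shuffled[:half]
        # Step 2: shorten every TTR in the other half by `improvement`.
        modified = [t * (1 - improvement) for t in shuffled[half:]]
        # Step 3: compute the MTTR of each half.
        mttr_unmodified = statistics.mean(unmodified)
        mttr_modified = statistics.mean(modified)
        # Steps 4 and 5: difference and reduction ratio.
        diff = mttr_unmodified - mttr_modified
        ratios.append(diff / mttr_unmodified)
    return ratios


# Assumption: synthetic, heavily skewed TTRs stand in for real incident data.
data_rng = random.Random(42)
ttrs = [data_rng.lognormvariate(4.0, 1.2) for _ in range(200)]

ratios = sorted(simulate_mttr_reduction(ttrs))
low = ratios[int(0.05 * len(ratios))]
high = ratios[int(0.95 * len(ratios))]
print(f"median observed reduction: {statistics.median(ratios):.1%}")
print(f"5th to 95th percentile: {low:.1%} to {high:.1%}")
```

If MTTR tracked the improvement reliably, the observed ratios would cluster tightly around 10%. The wider the spread between the two percentiles, the less a single MTTR comparison says about a real change.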
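  13. The argument of "Incident Metrics in SRE"
    • What the simulation showed
      • Incident durations vary so widely that improvements are hard to see in MTTR
      • ex. "MTTR improved 10% year over year!" may only mean there were fewer prolonged incidents
      • Note: MTTR would be valuable if incidents of exactly the same volume and recovery time occurred every year (not realistic)
    • Conclusion
      • MTTR is not useful as a metric for evaluating improvement
      • MTTR is vulnerable to skew in the distribution, and incident data varies wildly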
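  19. Examples of metrics and improvement measures
    TTX | Issue | Improvement measure
    TTDetect (detection) | It takes a long time from occurrence to detection | Improve monitoring
    TTEngage (team formation) | It takes a long time to assemble the response team | Clarify shifts and roles; introduce an on-call system
    TTInvestigate (investigation) | It takes a long time to isolate the fault | Prepare runbook dashboards
    TTFix (repair) | It takes a long time to repair the fault | Speed up rollbacks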
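To make the table above concrete, here is a minimal sketch of deriving the four TTX metrics from incident timestamps. The field names and the assumption that each phase ends exactly where the next begins are hypothetical; the transcript does not define the phase boundaries or a schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    # Hypothetical field names; not a schema from the talk.
    occurred_at: datetime      # failure starts
    detected_at: datetime      # alert fires or report arrives
    engaged_at: datetime       # response team is assembled
    investigated_at: datetime  # fault is isolated
    fixed_at: datetime         # repair is complete


def ttx_metrics(t: IncidentTimeline) -> Dict[str, timedelta]:
    """Split the total time to repair into consecutive TTX phases."""
    return {
        "TTDetect": t.detected_at - t.occurred_at,
        "TTEngage": t.engaged_at - t.detected_at,
        "TTInvestigate": t.investigated_at - t.engaged_at,
        "TTFix": t.fixed_at - t.investigated_at,
    }


# Made-up timestamps for illustration only.
timeline = IncidentTimeline(
    occurred_at=datetime(2025, 10, 27, 10, 0),
    detected_at=datetime(2025, 10, 27, 10, 12),
    engaged_at=datetime(2025, 10, 27, 10, 20),
    investigated_at=datetime(2025, 10, 27, 11, 5),
    fixed_at=datetime(2025, 10, 27, 11, 15),
)

metrics = ttx_metrics(timeline)
slowest = max(metrics, key=metrics.get)
for name, duration in metrics.items():
    print(f"{name}: {duration}")
print("slowest phase:", slowest)
```

Looking at the slowest phase per incident, rather than one aggregate number, points to which row of the table to act on.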
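  23. A lingering sense of one-sidedness
    • The TTX metrics introduced so far all merely decompose TTR
      • In other words, they focus only on shortening system recovery time
    • From an SRE perspective, service reliability matters
      • ex. Are there learnings? Will recurrence be prevented?
    • From a product operations perspective, customer reliability matters
      • ex. Is customer communication being handled adequately?
    => Raising observability requires metrics that cover the response process from more angles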
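  24. Applying TTX metrics: expanding the scope of observation
    Extend the scope of observation to incident response as a whole, and define metrics that serve as improvement indicators
    Metric name | Purpose
    Incident Response Metrics | Identify issues in the recovery response itself and track its improvement
    Customer Reliability Metrics | Identify issues in customer communication and track its improvement
    Learning Metrics | Track the activities leading up to organizational learning
    Improvement Metrics | Analyze how far root-cause countermeasures have been implemented
    => This talk presents an example of Customer Reliability Metrics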
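  26. Observing the flow and dependencies of the response process
    • Collection
      • Create a span for each incident status
      • Manage everything from detection to recovery as a single trace
      • Break spans down by action and integrate them
    • Example uses
      • Visualize the processing performed between status transitions and the time it took
      • Identify the step that became the bottleneck in the response ✨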
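The idea of one span per incident status inside a single trace can be sketched with plain data classes. This is an assumption-laden illustration, not the speaker's implementation: the `Span` class, the status names, and the incident ID are all hypothetical, and no tracing library is used.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class Span:
    trace_id: str    # one trace per incident
    name: str        # incident status the span covers
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def spans_from_transitions(
    incident_id: str, transitions: List[Tuple[str, datetime]]
) -> List[Span]:
    """Turn ordered (status, entered_at) pairs into one span per status.

    The final pair only marks the end of the trace and opens no span.
    """
    spans = []
    for (status, start), (_, end) in zip(transitions, transitions[1:]):
        spans.append(Span(incident_id, status, start, end))
    return spans


# Hypothetical status names and timestamps, for illustration only.
transitions = [
    ("detected", datetime(2025, 10, 27, 10, 0)),
    ("investigating", datetime(2025, 10, 27, 10, 5)),
    ("fixing", datetime(2025, 10, 27, 10, 50)),
    ("resolved", datetime(2025, 10, 27, 11, 0)),
]

spans = spans_from_transitions("incident-123", transitions)
bottleneck = max(spans, key=lambda s: s.duration)
for span in spans:
    print(f"{span.name}: {span.duration}")
print("bottleneck:", bottleneck.name)
```

Per-action child spans could be attached to each status span in the same way, which is one way to read "break spans down by action and integrate them".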
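  27. How to capture events as the response moves across tools
    • Recovery work often cuts across a variety of tools
      • ex. PagerDuty → Slack → Datadog → AWS → GitHub...
    • Today, only the responders know what was done for a given incident
    • Having responders save MELT (metrics, events, logs, traces) by hand is unrealistic
    → In a world where response is AI-based, far more information becomes capturable!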