Observability — Extending Into Incident Response
by
Narimichi Takamura
Link
Embed
Share
Beginning
This slide
Copy link URL
Copy link URL
Copy iframe embed code
Copy iframe embed code
Copy javascript embed code
Copy javascript embed code
Share
Tweet
Share
Tweet
Slide 1
Slide 1 text
No content
Slide 2
Slide 2 text
2
Slide 3
Slide 3 text
גࣜձࣾTopotalʢͱΆͨΔʣ • h#ps:/ /topotal.com • SREΛओ࣠ʹͨ͠ελʔτΞοϓ • 2ࣄۀΛӡӦ • SRE as a Service • SaaS for SREʢWaroomʣ • ຊΠϕϯτͷεϙϯαʔ • ϒʔεͰSaaSͷσϞΛͬͯΔͷ Ͱɺͥͻཱ͓ͪدΓ͍ͩ͘͞ʂ 3
Slide 4
Slide 4 text
SRE as a Service • h#ps:/ /sre-as-a-service.com • SREʹಛԽٕͨ͠ज़ࢧԉαʔϏε • ࢧԉͷྫ • SLI/SLOͷಋೖɾӡ༻վળ • ΦϒβʔόϏϦςΟͷઃܭɾ࣮ • ΠϯγσϯτϚωδϝϯτͷվળ 4
Slide 5
Slide 5 text
WaroomʢΘΔʔΉʣ • h#ps:/ /waroom.com • ৫తʹΠϯγσϯτରԠΛߦ͏ͨΊ ͷSaaS • Slack ϕʔεͷରԠʹ߹ΘͤͯࣗಈԽɾ লྗԽ͕Ͱ͖Δ 5
Slide 6
Slide 6 text
6
Slide 7
Slide 7 text
7
Slide 8
Slide 8 text
8
Slide 9
Slide 9 text
ηογϣϯ֓ཁ • ΦϒβʔόϏϦςΟʢo11yʣʹΑΔվળޮՌͷྫͱͯ͠ɺΠϯγσϯ τϨεϙϯεʢIRʣͷվળ͕ڍ͛ΒΕΔ • ମײͰվળͯͦ͠͏͕ͩɺͦͷޮՌΛఆྔతʹࣔ͢͜ͱΉ͔͍ͣ͠ => IR SaaSͷ࡞Γख / SRE ͱͯ͠ɺIRΛఆྔతʹվળ͢ΔͨΊͷϓϥΫς Οεʢ࣮ફతͳTTXϝτϦΫεʣ ʹ͍͓ͭͯ͠·͢ɻ => ऴ൫ͰʢιϑτΣΞͰͳ͘ʣIRϓϩηεͷՄ؍ଌੑΛߴΊΔ ͱ ͍͏ςʔϚʹ౿ΈࠐΜͰ͓͠·͢ɻ 9
Slide 10
Slide 10 text
ຊެԋͷλʔήοτ • o11y ͷվળޮՌΛఆྔతʹࣔ͢ϓϥΫςΟεʹڵຯ͕͋Δํ • IR ͷՄࢹԽʹڵຯ͕͋Δํ • ʮo11y Λ IR ͷྖҬ֦ு͢Δ͜ͱʯʹڵຯ͕͋Δํ 10
Slide 11
Slide 11 text
ΞδΣϯμ 1. Ϟνϕʔγϣϯ 2. MTTRͷ 3. ࣮ફతͳ TTX ϝτϦΫεͷఆٛ 4. TTX ϝτϦΫεͷ׆༻ 5. o11y ΛΠϯγσϯτϨεϙϯεͷྖҬద༻͢Δ 11
Slide 12
Slide 12 text
1. Ϟνϕʔγϣϯ 12
Slide 13
Slide 13 text
͍: ͦͷԾઆຊͳͷ͔ 1. γεςϜͷՄ؍ଌੑΛվળ͢Δ 2. ෳࡶͳγεςϜͷ෦ঢ়ଶΛਪଌɾѲͰ͖ΔΑ͏ʹͳΔ 3. ൃੜ࣌ʹݪҼಛఆ͕ਝʹͳΓ෮چ͕࣌ؒ͘ͳΔ ← ί Ϩ 13
Slide 14
Slide 14 text
Γ͍ͨ͜ͱ2͚ͭͩ • Where: Ͳ͜ ͕վળͨ͠ͷ͔ • How much: Ͳͷఔ վળͨ͠ͷ͔ 14
Slide 15
Slide 15 text
ΦϒβʔόϏϦςΟʹΑͬͯߦΘΕͨ ΠϯγσϯτରԠͷվળޮՌΛ ఆྔతʹදݱ͍ͨ͠ 15
Slide 16
Slide 16 text
෮چ࣌ؒͷॖʹޮՌ͕͋Δͣ → MTTR Λଌఆ͢Ε͍͍ͷͰʁ 16
Slide 17
Slide 17 text
2. MTTRͷ 17
Slide 18
Slide 18 text
MTTRʢฏۉ෮چ࣌ؒʣ ͱ • ো͕ൃੜ͔ͯ͠Βम෮·ͨ෮چ͢ Δ·Ͱͷฏۉ࣌ؒͷ͜ͱ • Mean Time To Recovery(Repair, Resolve, Restore)ͷུ • ࢉग़ํ๏1 • MTTR = ૯मཧ࣌ؒ / ނোճ 1 MTTRʢฏۉ෮چ࣌ؒʣͱʁܭࢉํ๏ͱMTBFͱͷނোɾՔಇʹ ͓͚Δؔ 18
Slide 19
Slide 19 text
19
Slide 20
Slide 20 text
SREs should move away from defaul/ng to the assump/on that MTTX can be useful. 20
Slide 21
Slide 21 text
MTTRͷ༗ޮੑͷݕূ • Ծઆ • MTTR͕༗ޮͳࢦඪͳΒɺTTRΛॖ͢ΕMTTRॖ͞Ε Δͣ 21
Slide 22
Slide 22 text
MTTRͷ༗ޮੑͷݕূ 1. Πϯγσϯτͷσʔληοτ2ΛϥϯμϜʹ2ׂ͢Δ 2. ยํͷσʔληοτͷम෮࣌ؒ(TTR)Λ10%ݮΒ͢ 3. ֤σʔληοτͷMTTR(ฏۉम෮࣌ؒ)Λܭࢉ͢Δ 4. σʔληοτؒͷMTTRͷࠩΛऔΔ • diff = MTTR(unmodified)- MTTR(modified) 5. MTTRͷॖׂ߹(%)Λࢉग़͢Δ • = diff/MTTR(unmodified) 6. 1ʙ4Λ10ສճ܁Γฦ͢ 2 Unveiling the black box with observability stack 22
Slide 23
Slide 23 text
23
Slide 24
Slide 24 text
݁Ռ: MTTR͕10%Ҏ্վળ͢Δͷ50ʙ60% 24
Slide 25
Slide 25 text
֤ΠϯγσϯτΛվળͯ͠MTTR͕վળ͠ͳ͍ཧ༝ • MTTRͷΈʹऑ͍ • ҰํͰɺΠϯγσϯτσʔλ"Β͖ͭ"͕ܹ͍͠ 25
Slide 26
Slide 26 text
Πϯγσϯτσʔλͷಛ3 • େ͔ͳΓૣ͘ऩଋ͢Δ • Ұ෦൵ࢂͳΠϯγσϯτʹͳΔ • → ແ࡞ҝʹσʔληοτΛׂ͢Δ ͱɺ൵ࢂͳΠϯγσϯτͷภΓ͕MTTR ͷࢉग़ʹେ͖ͳӨڹΛٴ΅͢ • ex. ෮چʹ5000ஹ͔͔࣌ؒΔΠϯγσ ϯτͷৼΓ͚ઌ͕ͲͪΒʹͳΔ͔Ͱ MTTRͷվળ۩߹େ෯ʹมΘΔ 3 The VOID Report 26
Slide 27
Slide 27 text
ࢀߟ: म෮࣌ؒΛมߋͤͣʹγϛϡϨʔγϣϯͨ݁͠Ռ → վળ׆ಈͷ༗ແʹ͔͔ΘΒͣɺMTTRσʔληοτ࣍ୈͰվળ or ѱԽ͢Δ 27
Slide 28
Slide 28 text
Incident Metrics in SRE ͷओு • γϛϡϨʔγϣϯ͔ΒΘ͔ͬͨ͜ͱ • ΠϯγσϯτނোظؒͷΒ͖͕ͭେ͖͍ͨΊɺվળ͕ MTTR ʹө͞ΕͮΒ͍ • ex. ʮࡢൺMTTR10%վળʂʯظԽͨ͠Πϯγσϯτ͕গͳ͔͚͔ͬͨͩ • ※ ຖ·ͬͨ͘ಉ͡ྔɾ෮چ࣌ؒͷΠϯγσϯτ͕ى͖ΔͳΒՁ͕͋Δ(ϜϦ) • ݁ • MTTR վળͷධՁࢦඪͱͯ͠ʹཱͨͳ͍ • MTTRͷΈʹऑ͘ɺΠϯγσϯτσʔλΒ͖͕ܹ͍͔ͭ͠Β 28
Slide 29
Slide 29 text
ͳʹ͕ͩͬͨͷʁ ֤ཁૉͳ͍ • Πϯγσϯτظؒͷมಈੑ͕ߴ͍͜ͱ • MTTRΛͳΜΒ͔ͷࢦඪʹ͢Δ͜ͱ • ࢦඪΛͱʹվળͷՌΛ֬ೝ͢Δ͜ͱ → తͱࢦඪ͕טΈ߹͍ͬͯͳ͍͜ͱ͕ 29
Slide 30
Slide 30 text
σʔλੳʢԾઆݕূܕʣͷྲྀΕ 30
Slide 31
Slide 31 text
MTTRΛࢦඪʹ͢Δͱ͖ͷࢥߟͷྲྀΕ 31
Slide 32
Slide 32 text
ى͖͍ͯͨ͜ͱ: ԾઆݕূϩδοΫͷෆ߹ 32
Slide 33
Slide 33 text
ղܾࡦ: վળՕॴΛ໌Β͔ʹ͠ɺมಈੑΛ͑Δ 33
Slide 34
Slide 34 text
ղܾࡦ: վળՕॴΛ໌Β͔ʹ͠ɺมಈੑΛ͑Δ 34
Slide 35
Slide 35 text
͜͜·Ͱͷ·ͱΊ • MTTR(෮چ࣌ؒ)σʔλมಈੑ͕ߴ͍ͨΊվળࢦඪʹෆద • վળՕॴΛ໌֬Խ͠ɺΑΓࡉ͔͍ TTX ϝτϦΫεΛར༻͢Δ͜ ͱͰɺมಈੑΛ͑Δ͜ͱ͕Մೳ → TTRΑΓࡉ͔͍ϝτϦΫεͷधཁ͕ग़ͯ͘Δ 35
Slide 36
Slide 36 text
3. ࣮ફతͳ TTX ϝτϦΫε 36
Slide 37
Slide 37 text
Waroom͕ߟ͑Δ࣮ફతͳϝτϦΫεͱ • ཏతͰ͋Δ • ཻ͕ࡉ͔͍ • ऩू͕ݱ࣮తͰ͋Δ 37
Slide 38
Slide 38 text
ͲΜͳTTXϝτϦΫεΛ ऩू͢ΔͱΑ͍ͩΖ͏͔ 38
Slide 39
Slide 39 text
39
Slide 40
Slide 40 text
TTXϝτϦΫεͷ՝ײ • ੈͷதʹࣄྫ͍͔ͭ͋͘Δ͕ɺఆٛ౷Ұ͞Ε͍ͯͳ͍ • ࣄྫಉ࢜ΛΈ߹ΘͤΑ͏ͱͯ͠ɺॏෳෆ͕ੜ͡Δ • → ஶ໊ͳจݙΛϕʔεʹɺࡉ͔͘ɺཏతͳఆٛΛࢦ͢ 40
Slide 41
Slide 41 text
TTXϝτϦΫεఆٛͷྲྀΕ 1. ϕετϓϥΫςΟεΛֶͿ 2. ΠϯγσϯτεςʔλεΛఆٛ͢Δ 3. ΠϯγσϯτϚΠϧετʔϯ(εςʔλεͷڥ)Λఆٛ͢Δ 4. TTXϝτϦΫεΛఆٛ͢Δ 41
Slide 42
Slide 42 text
ϕετϓϥΫςΟεΛֶͿ 42
Slide 43
Slide 43 text
ΠϯγσϯτεςʔλεΛͬ͘͟Γఆٛ͢Δ 43
Slide 44
Slide 44 text
44
Slide 45
Slide 45 text
45
Slide 46
Slide 46 text
ϚΠϧετʔϯΛͱʹ TTXʹམͱ͠ࠐΉ 46
Slide 47
Slide 47 text
47
Slide 48
Slide 48 text
ϝτϦΫεऩू͍ͨΜ • ࡉ͔ͳϝτϦΫεΛఆٛ͢ΔͱɺϚΠϧετʔϯΛ͑Δ͝ͱ ʹλΠϜελϯϓΛه͢Δඞཁ͕͋Δ • ରԠதʹ͍͍ͪͪਓ͕ؒଧࠁ͢Δͷඇݱ࣮త • → Waroom ͰSlack BotͰࣗಈऩू͍ͯ͠·͢ 48
Slide 49
Slide 49 text
ରԠதͷΠϕϯτΛτϦΨʔʹࣗಈऩू͢Δྫ ϚΠϧετʔϯ ରԠதͷΠϕϯτ Detectedʢݕʣ Ξϥʔτൃੜ௨ Acknowledgedʢೝʣ νϟϯωϧ࡞ɺΠϯγσϯτىථ Iden.fiedʢղܾࡦͷಛఆʣ RunbookͷϑΣʔζ͚ʢPrecheck ͱResolu.onʣ Recoveredʢ෮چʣ SlackͷΓͱΓ͔ΒAI͕அ͢Δ 49
Slide 50
Slide 50 text
4. TTXϝτϦΫεͷ׆༻ 50
Slide 51
Slide 51 text
ϝτϦΫεΛޮՌతʹ͏ͨΊʹ ੳͷతͱϝτϦΫεͷಛΛ߹ͤ͞Δ 51
Slide 52
Slide 52 text
52
Slide 53
Slide 53 text
ϝτϦΫεͱվળࢪࡦͷྫ TTX ՝ վળࢪࡦ TTDetectʢݕʣ ൃੜ͔ͯ͠Βݕ·Ͱʹ࣌ ͕͔͔ؒΔ ϞχλϦϯάͷվળ TTEngageʢνʔϜߏʣ ରԠνʔϜΛߏஙʹ͕࣌ؒ ͔͔Δ γϑτׂͷ໌֬ԽɺΦ ϯίʔϧ੍ͷಋೖ TTInves-gateʢௐࠪʣ োΓ͚ʹ͕͔͔࣌ؒ Δ RunbookͷμογϡϘʔυͷ උ TTFixʢम෮ʣ োͷम෮ʹ͕͔͔࣌ؒΔ ϩʔϧόοΫͷߴԽ 53
Slide 54
Slide 54 text
54
Slide 55
Slide 55 text
യવͱͨ͠ԾઆΛͱʹɺ͔Β՝Λݟ͚ͭΔ Ծઆ ৽ͨʹൃݟͨ͠՝ͷྫ ڞ௨ͷڥͳͷͰɺ৫ͷ֤ TTXͷҰఆͷͣ αʔϏενʔϜʹΑͬͯύϑ ΥʔϚϯε͕ҟͳΔ ֤TTXఆʹ͍ۙͣ ʢex. TTAͳΒ10Ҏ͘Β ͍ʣ ʢ࣮ʣணख͕શମతʹ͍ɺ ղܾࡦͷಛఆ͕શମతʹ͍ 55
Slide 56
Slide 56 text
56
Slide 57
Slide 57 text
57
Slide 58
Slide 58 text
5. o11y ΛΠϯγσϯτϨεϙϯεʹద༻͢ Δ 58
Slide 59
Slide 59 text
o11yΛIRద༻͢Δ2 • ΠϯγσϯτϨεϙϯεͷ෦ߏͷ Մ؍ଌੑΛ͞ΒʹߴΊΔ • TTXͷఆٛʹΑͬͯɺϝτϦΫεͳ Μͱͳ͘ಋೖࡁΈ • ϝτϦΫεɺϩάɺτϨʔεͷϓϥΫ ςΟεΛ׆༻͢Δ͜ͱͰվળͰ͖ͳ͍ ͩΖ͏͔ 2 Unveiling the black box with observability stack 59
Slide 60
Slide 60 text
Metrics 60
Slide 61
Slide 61 text
ബͬ͢ΒͱΔ"ยखམͪ"ײ • հͨ͠TTXϝτϦΫεɺ͍ͣΕTTRΛղ͚ͨͩ͠ • ͭ·ΓɺγεςϜ෮چ࣌ؒͷॖ ʹ͚ͩয͕͍ͨͬͯΔ • SREࢹͰ αʔϏεͷ৴པੑ ͷ؍͕ॏཁ • ex. ֶͼ͋Δ͔ɺ࠶ൃࢭ͞ΕΔ͔ • ϓϩμΫτӡӦࢹͰ ސ٬ͷ৴པੑ ͷ؍͕ॏཁ • ex. ސ٬ରԠेʹߦΘΕ͍ͯΔ͔ => Մ؍ଌੑΛߴΊΔʹɺΑΓଟ֯తͳରԠϓϩηεͷϝτϦΫε͕ඞཁ 61
Slide 62
Slide 62 text
γεςϜ෮چରԠͱฒߦ͍ͯͬͯ͠Δ͜ͱ • ސ٬ͷઆ໌ɾࣄͷڞ༗ • Πϯγσϯτͷใࠂɾੳ • ࠜຊରࡦͷݕ౼ɾ࣮ࢪ => ݱঢ়ͩͱɺ্هͷ׆ಈͷ؍ଌείʔϓ֎ʹͳ͍ͬͯΔ 62
Slide 63
Slide 63 text
TTXϝτϦΫεͷԠ༻: ؍ଌൣғͷ֦େ ؍ଌൣғΛΠϯγσϯτରԠશମʹ֦ு͠ɺվળࢦඪͱͳΔϝτϦΫεΛఆٛ͢Δ ϝτϦΫε໊ త Incident Response Metrics ७ਮͳ෮چରԠͷ՝ಛఆɾվળࢦඪ Customer Reliability Metrics ސ٬ରԠͷ՝ಛఆɾվળࢦඪ Learning Metrics ৫ֶ͕ͼΛಘΔ·Ͱͷ׆ಈͷτϥοΩϯά Improvement Metrics ࠜຊରࡦͷ࣮ࢪঢ়گͷੳ => ࠓճɺCustomer Reliability Metrrics ͷྫΛհ 63
Slide 64
Slide 64 text
64
Slide 65
Slide 65 text
Log 65
Slide 66
Slide 66 text
ରԠதͷΠϕϯτΛه͢Δ • ऩू • ୭͕ɾ͍ͭɾͲͷίϚϯυɾͲͷ அΛߦ͔ͬͨΛߏԽϩάԽ • ex. νϟοτɺεςʔλεมߋɺ֎෦ πʔϧʹΑΔΠϕϯτ࿈ܞ • ׆༻ྫ • λΠϜϥΠϯੜεςʔλεϖʔ δΛࣗಈੜ ! 66
Slide 67
Slide 67 text
WaroomͷཪଆͰ४උ͕ਐߦத... 67
Slide 68
Slide 68 text
Trace 68
Slide 69
Slide 69 text
ରԠϓϩηεͷྲྀΕɺґଘؔ Λ؍ଌ͢Δ • ऩू • Πϯγσϯτεςʔλε୯ҐͰεύϯԽ • ݕʙ෮چ·ͰΛ1ຊͷτϨʔεͱͯ͠ཧ • ΞΫγϣϯ͝ͱʹࡉԽͯ͠౷߹ • ׆༻ྫ • εςʔλεҠߦؒͰߦΘΕͨॲཧͱॴཁ࣌ ؒΛՄࢹԽ ! • ରԠͷϘτϧωοΫʹͳͬͨఔΛಛఆ ✨ 69
Slide 70
Slide 70 text
πʔϧ͕ԣஅ͢ΔதͰΠϕϯτΛͲ͏औಘ͢Δ͔ • ෮چରԠ࣌ʹ֤छπʔϧΛԣஅతʹར༻͢Δ͜ͱ͕ଟ͍ • ex. PagerDuty → Slack → Datadog → AWS → GitHub... • ݱঢ়ɺ୯ҰΠϯγσϯτͷͨΊʹߦͬͨ͜ͱΛ͍ͬͯΔͷରԠ ऀͷΈ • ରԠऀ͕खಈͰMELTΛอଘ͢Δͷඇݱ࣮త → AIϕʔεͰରԠΛ͢ΔੈքઢͰɺΑΓଟ͘ͷใ͕औಘՄೳʹʂ 70
Slide 71
Slide 71 text
AIϕʔεͷΠϯγσϯτϨε ϙϯε • AI͕ࣗવݴޠͰୡ͞Εͨ༰Λͱ ʹɺMCPαʔόʔ֎෦πʔϧͱ࿈ܞ ͠ͳ͕Β͞·͟·ͳૢ࡞Λߦ͏ • → ৗʹWaroomΛܦ༝ͯ͠ΞΫγϣϯ ͕ߦΘΕΔΑ͏ʹͳΓɺࡉ͔ͳΠϕϯ τΛࣗಈతʹอଘͰ͖Δ 71
Slide 72
Slide 72 text
·ͱΊ 1. վળࢦඪͱͯ͠MTTRཱͨͳ͍ 2. ϝτϦΫε׆༻ɺతʙσʔλੳʹࢸΔ·Ͱͷ߹ੑ͕ॏཁ 3. มಈੑΛ͑ΔͨΊʹɺ͍ͷ۩ମԽͱϝτϦΫεͷࡉԽ͕ॏ ཁ 4. TTXϝτϦΫεͷఆٛաఔͱ׆༻ํ๏ 5. o11yͷϓϥΫςΟεΛ࣋ͪࠐΉ͜ͱͰɺΑΓแׅతͳ؍ଌʹۙͮ͘ 72
Slide 73
Slide 73 text
͍͞͝ʹ • ϝτϦΫεͷࣗಈऩूͷΈΛ࡞Δ ͷ͍ͨΜ • ͞ΒʹɺՄࢹԽج൫ͷߏங͍ͨΜ • ͞ΒʹɺϝτϦΫεΛΧςΰϦϥϕ ϧͰ෦நग़͢Δͷ͍ͨΜ • → ͥͻ Waroom Λ͝׆༻͍ͩ͘͞ • ڵຯ͕༙͍ͨํ Topotal ͷϒʔε ʂ 73
Slide 74
Slide 74 text
͋Γ͕ͱ͏͍͟͝·ͨ͠