Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

2

Slide 3

Slide 3 text

Topotal, Inc. (read "Topotaru")
• https://topotal.com
• A startup with SRE as its main focus
• Runs two businesses
  • SRE as a Service
  • SaaS for SRE (Waroom)
• Sponsor of this event
  • We are demoing our SaaS at our booth, so please stop by!

Slide 4

Slide 4 text

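SRE as a Service
• https://sre-as-a-service.com
• A technical support service specialized in SRE
• Examples of support
  • Introducing SLIs/SLOs and improving how they are operated
  • Designing and implementing observability
  • Improving incident management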

Slide 5

Slide 5 text

Waroom (read "Warūmu")
• https://waroom.com
• A SaaS for handling incident response as an organization
• Automates and reduces effort in step with Slack-based response

Slide 6

Slide 6 text

6

Slide 7

Slide 7 text

7

Slide 8

Slide 8 text

8

Slide 9

Slide 9 text

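Session overview
• Improved incident response (IR) is often cited as an example of what observability (o11y) delivers
• It feels as if things have improved, but showing that effect quantitatively is hard
=> As the builder of an IR SaaS and as an SRE, I will talk about practices for improving IR quantitatively (practical TTX metrics).
=> Toward the end, I will also go into the theme of raising the observability of the IR process (rather than of the software).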

Slide 10

Slide 10 text

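Who this talk is for
• Those interested in practices for showing the improvement effect of o11y quantitatively
• Those interested in visualizing IR
• Those interested in "extending o11y into the IR domain"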

Slide 11

Slide 11 text

Agenda
1. Motivation
2. Problems with MTTR
3. Defining practical TTX metrics
4. Using TTX metrics
5. Applying o11y to the incident response domain

Slide 12

Slide 12 text

1. Motivation

Slide 13

Slide 13 text

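Question: is that hypothesis really true?
1. Improve the observability of the system
2. It becomes possible to infer and understand the internal state of a complex system
3. When a problem occurs, the cause is identified faster and recovery time gets shorter ← this one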

Slide 14

Slide 14 text

There are only two things we want to know
• Where: where did things improve?
• How much: by how much did they improve?

Slide 15

Slide 15 text

We want to express quantitatively how much observability has improved incident response

Slide 16

Slide 16 text

It should help shorten recovery time → so why not just measure MTTR?

Slide 17

Slide 17 text

2. Problems with MTTR

Slide 18

Slide 18 text

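What is MTTR (mean time to recovery)?
• The average time from when a failure occurs until it is repaired or recovered
• Short for Mean Time To Recovery (also Repair, Resolve, Restore)
• How it is calculated [1]
  • MTTR = total repair time / number of failures
[1] "What is MTTR (mean time to recovery)? How to calculate it, and how it relates to MTBF in terms of failure rate and availability"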
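The formula above can be sketched in a few lines. This is only an illustration: the function name `mttr` and the three repair times are made up for the example and do not come from the talk.

```python
from datetime import timedelta


def mttr(repair_times: list[timedelta]) -> timedelta:
    """MTTR = total repair time / number of failures."""
    if not repair_times:
        raise ValueError("at least one incident is required")
    total = sum(repair_times, timedelta())
    return total / len(repair_times)


# Hypothetical repair times for three incidents (assumed values).
incidents = [timedelta(minutes=30), timedelta(minutes=45), timedelta(hours=4)]
print(mttr(incidents))
```

Note how the single four-hour incident pulls the average well above the two shorter ones, which is the property the following slides examine.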

Slide 19

Slide 19 text

19

Slide 20

Slide 20 text

SREs should move away from defaulting to the assumption that MTTX can be useful.

Slide 21

Slide 21 text

Testing whether MTTR is a valid metric
• Hypothesis
  • If MTTR is a valid metric, shortening TTR should shorten MTTR as well

Slide 22

Slide 22 text

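Testing whether MTTR is a valid metric
1. Randomly split the incident dataset [2] into two
2. Reduce the time to repair (TTR) in one of the datasets by 10%
3. Calculate the MTTR (mean time to repair) of each dataset
4. Take the difference in MTTR between the datasets
  • diff = MTTR(unmodified) - MTTR(modified)
5. Calculate the MTTR reduction (%)
  • = diff / MTTR(unmodified)
6. Repeat steps 1 to 4 100,000 times
[2] Unveiling the black box with observability stack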
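The procedure above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code behind the talk's results: the incident durations are synthetic (drawn from a log-normal distribution chosen only because it is skewed), the function name `simulate_mttr_reduction` is hypothetical, and the default trial count is lowered to 10,000 to keep the run short.

```python
import random
import statistics


def simulate_mttr_reduction(ttrs, reduction=0.10, trials=10_000, seed=0):
    """Return the relative MTTR improvement observed in each trial."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        # Step 1: randomly split the dataset into two halves.
        shuffled = ttrs[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        unmodified = shuffled[:half]
        # Step 2: shorten every TTR in the other half by `reduction`.
        modified = [t * (1 - reduction) for t in shuffled[half:]]
        # Step 3: MTTR of each half.
        mttr_unmodified = statistics.fmean(unmodified)
        mttr_modified = statistics.fmean(modified)
        # Steps 4 and 5: difference and relative reduction.
        diff = mttr_unmodified - mttr_modified
        ratios.append(diff / mttr_unmodified)
    return ratios


# Synthetic, skewed incident durations in minutes (an assumption, not real data).
data_rng = random.Random(42)
ttrs = [data_rng.lognormvariate(4.0, 1.5) for _ in range(200)]

ratios = simulate_mttr_reduction(ttrs)
share = sum(r >= 0.10 for r in ratios) / len(ratios)
print(f"trials showing >= 10% MTTR improvement: {share:.0%}")
```

With identical durations every trial would show exactly a 10% improvement; the spread only appears once the durations are skewed.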

Slide 23

Slide 23 text

23

Slide 24

Slide 24 text

Result: MTTR improves by 10% or more in only about 50 to 60% of runs

Slide 25

Slide 25 text

Why MTTR does not improve even when every incident is improved
• MTTR is not robust to skew in the distribution
• Meanwhile, incident data shows severe "variance"

Slide 26

Slide 26 text

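Characteristics of incident data [3]
• Most incidents are resolved fairly quickly
• A few turn into disastrous incidents
• → When the dataset is split at random, the uneven distribution of disastrous incidents has a large effect on the calculated MTTR
  • e.g. the degree of MTTR improvement changes drastically depending on which half receives an incident that takes 5,000 trillion hours to recover
[3] The VOID Report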
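A small worked example may make the effect of one extreme incident concrete. All numbers here are invented for illustration: ten ordinary 20-minute incidents, one 6,000-minute outlier, and a 10% reduction applied to the "modified" half.

```python
import statistics

TYPICAL = 20.0      # minutes, assumed duration of an ordinary incident
OUTLIER = 6000.0    # minutes, assumed duration of one disastrous incident
REDUCTION = 0.10    # the modified half has every TTR shortened by 10%


def improvement(unmodified, modified_raw):
    """Relative MTTR improvement of the modified half over the unmodified half."""
    modified = [t * (1 - REDUCTION) for t in modified_raw]
    mttr_unmodified = statistics.fmean(unmodified)
    return (mttr_unmodified - statistics.fmean(modified)) / mttr_unmodified


# Case A: the outlier lands in the unmodified half.
outlier_in_unmodified = improvement([TYPICAL] * 5 + [OUTLIER], [TYPICAL] * 5)

# Case B: the outlier lands in the half that was actually improved.
outlier_in_modified = improvement([TYPICAL] * 5, [TYPICAL] * 5 + [OUTLIER])

print(f"outlier in unmodified half: {outlier_in_unmodified:+.1%}")
print(f"outlier in modified half:   {outlier_in_modified:+.1%}")
```

The same 10% improvement looks like a huge win in one case and a huge regression in the other, purely because of where the outlier fell.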

Slide 27

Slide 27 text

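Reference: result of running the simulation without changing the repair times
→ Whether or not any improvement work is done, MTTR improves or worsens depending on the dataset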

Slide 28

Slide 28 text

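What "Incident Metrics in SRE" argues
• What the simulation showed
  • Because incident durations vary so widely, improvements are hard to see in MTTR
  • e.g. "MTTR improved 10% year over year!" may only mean there were fewer prolonged incidents
  • Note: MTTR would be meaningful if exactly the same number of incidents with the same recovery times occurred every year (impossible)
• Conclusion
  • MTTR is not useful as a metric for evaluating improvement
  • Because MTTR is not robust to skew, and incident data varies widely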

Slide 29

Slide 29 text

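So what was the problem?
Each element on its own is fine
• Incident durations being highly variable
• Using MTTR as some kind of metric
• Checking the results of improvement work against a metric
→ The problem is that the purpose and the metric do not fit together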

Slide 30

Slide 30 text

The flow of (hypothesis-testing) data analysis

Slide 31

Slide 31 text

The line of thinking when MTTR is used as the metric

Slide 32

Slide 32 text

What was happening: an inconsistency in the hypothesis-testing logic

Slide 33

Slide 33 text

Solution: make clear where to improve, and reduce variability

Slide 34

Slide 34 text

Solution: make clear where to improve, and reduce variability

Slide 35

Slide 35 text

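Summary so far
• MTTR (recovery time) is unsuitable as an improvement metric because the data is highly variable
• By clarifying where to improve and using finer-grained TTX metrics, variability can be reduced
→ This creates demand for metrics finer-grained than TTR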

Slide 36

Slide 36 text

3. Practical TTX metrics

Slide 37

Slide 37 text

What Waroom considers practical metrics
• Comprehensive
• Fine-grained
• Realistic to collect

Slide 38

Slide 38 text

What TTX metrics should we collect?

Slide 39

Slide 39 text

39

Slide 40

Slide 40 text

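Concerns about TTX metrics
• There are several published examples, but the definitions are not unified
• Trying to combine examples leads to overlaps and gaps
• → Aim for a fine-grained, comprehensive definition based on well-known literature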

Slide 41

Slide 41 text

Flow for defining TTX metrics
1. Learn best practices
2. Define incident statuses
3. Define incident milestones (the boundaries between statuses)
4. Define TTX metrics
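Steps 3 and 4 can be sketched as follows: once each milestone has a timestamp, a TTX metric is simply the time between two adjacent milestones. The milestone names, their order, and the `<from>_to_<to>` naming scheme below are assumptions made for this illustration, not Waroom's actual definitions.

```python
from datetime import datetime, timedelta
from enum import Enum


class Milestone(Enum):
    # Hypothetical milestones, ordered by their position in the response.
    OCCURRED = 1
    DETECTED = 2
    ACKNOWLEDGED = 3
    IDENTIFIED = 4
    RECOVERED = 5


def ttx_metrics(timestamps: dict[Milestone, datetime]) -> dict[str, timedelta]:
    """Derive one TTX metric per pair of adjacent recorded milestones."""
    ordered = sorted(timestamps, key=lambda m: m.value)
    metrics = {}
    for prev, curr in zip(ordered, ordered[1:]):
        name = f"{prev.name.lower()}_to_{curr.name.lower()}"
        metrics[name] = timestamps[curr] - timestamps[prev]
    return metrics


# Example incident with made-up timestamps.
t0 = datetime(2025, 1, 1, 10, 0)
example = {
    Milestone.OCCURRED: t0,
    Milestone.DETECTED: t0 + timedelta(minutes=5),
    Milestone.ACKNOWLEDGED: t0 + timedelta(minutes=8),
    Milestone.IDENTIFIED: t0 + timedelta(minutes=40),
    Milestone.RECOVERED: t0 + timedelta(minutes=65),
}
for name, duration in ttx_metrics(example).items():
    print(name, duration)
```

Because every metric is derived from adjacent milestones, the metrics neither overlap nor leave gaps, and they sum to the total time from occurrence to recovery.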

Slide 42

Slide 42 text

Learn best practices

Slide 43

Slide 43 text

Roughly define incident statuses

Slide 44

Slide 44 text

44

Slide 45

Slide 45 text

45

Slide 46

Slide 46 text

Turn the milestones into TTX metrics

Slide 47

Slide 47 text

47

Slide 48

Slide 48 text

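Collecting metrics is hard
• Once fine-grained metrics are defined, a timestamp has to be recorded every time a milestone is passed
• Having a human stamp each one by hand during a response is unrealistic
• → Waroom collects them automatically with a Slack bot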

Slide 49

Slide 49 text

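Example of automatic collection triggered by events during the response
Milestone | Event during the response
Detected | Alert notification
Acknowledged | Channel created, incident filed
Identified (solution identified) | Runbook phases (Precheck and Resolution)
Recovered | AI judges from the Slack conversation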
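The table above maps events to milestones; a recorder along those lines might look like the sketch below. Everything concrete here is hypothetical: the event type strings, the `MilestoneRecorder` class, and in particular the assumption that entering the Runbook's Resolution phase is what marks "identified". It does not describe how Waroom's Slack bot is implemented.

```python
from datetime import datetime, timedelta

# Hypothetical event types mapped to the milestone they imply.
EVENT_TO_MILESTONE = {
    "alert_fired": "detected",
    "channel_created": "acknowledged",
    "incident_filed": "acknowledged",
    "runbook_resolution_started": "identified",  # assumed trigger
    "recovery_judged": "recovered",
}


class MilestoneRecorder:
    """Records the first time each milestone is reached."""

    def __init__(self):
        self.timestamps: dict[str, datetime] = {}

    def on_event(self, event_type: str, at: datetime):
        milestone = EVENT_TO_MILESTONE.get(event_type)
        if milestone is None:
            return None  # events without a milestone are ignored here
        # Keep the earliest timestamp if several events map to one milestone.
        self.timestamps.setdefault(milestone, at)
        return milestone


t0 = datetime(2025, 1, 1, 10, 0)
recorder = MilestoneRecorder()
recorder.on_event("alert_fired", t0)
recorder.on_event("channel_created", t0 + timedelta(minutes=2))
recorder.on_event("incident_filed", t0 + timedelta(minutes=3))
recorder.on_event("chat_message", t0 + timedelta(minutes=4))
print(recorder.timestamps)
```

The responder never has to stamp a time by hand; the timestamps fall out of actions they were going to take anyway.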

Slide 50

Slide 50 text

4. Using TTX metrics

Slide 51

Slide 51 text

To use metrics effectively, align the purpose of the analysis with the characteristics of the metric

Slide 52

Slide 52 text

52

Slide 53

Slide 53 text

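Examples of metrics and improvement measures
TTX | Issue | Improvement measure
TTDetect (detection) | It takes a long time from occurrence to detection | Improve monitoring
TTEngage (team formation) | It takes a long time to assemble the response team | Clarify shifts and roles, introduce an on-call system
TTInvestigate (investigation) | It takes a long time to isolate the failure | Maintain Runbook dashboards
TTFix (repair) | It takes a long time to repair the failure | Speed up rollbacks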

Slide 54

Slide 54 text

54

Slide 55

Slide 55 text

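Starting from a vague hypothesis, find issues from the trends
Hypothesis | Example of a newly discovered issue
The environment is shared, so the trend of each TTX should be uniform across the organization | Performance differs by service and by team
Each TTX should be close to its expected value (e.g. within about 10 minutes for TTA) | (In fact) getting started is slow overall, and identifying the solution is slow overall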
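Checking hypotheses like these amounts to grouping a TTX by team or service and comparing each group with the expected value. The sketch below assumes made-up TTA samples and team names, and it summarizes each group with the median purely as one possible choice; the talk does not prescribe a statistic.

```python
import statistics
from collections import defaultdict

# Hypothetical TTA samples in minutes, tagged with the owning team.
records = [
    {"team": "payments", "tta_minutes": 4},
    {"team": "payments", "tta_minutes": 6},
    {"team": "payments", "tta_minutes": 8},
    {"team": "search", "tta_minutes": 12},
    {"team": "search", "tta_minutes": 15},
    {"team": "search", "tta_minutes": 30},
]


def median_by_team(rows, key):
    """Group one TTX by team and summarize each group with its median."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["team"]].append(row[key])
    return {team: statistics.median(values) for team, values in grouped.items()}


def teams_over(expected, medians):
    """Teams whose typical value exceeds the expected value."""
    return sorted(team for team, value in medians.items() if value > expected)


medians = median_by_team(records, "tta_minutes")
slow_teams = teams_over(10, medians)  # expected value: about 10 minutes
print(medians, slow_teams)
```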

Slide 56

Slide 56 text

56

Slide 57

Slide 57 text

57

Slide 58

Slide 58 text

5. Applying o11y to incident response

Slide 59

Slide 59 text

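Applying o11y to IR [2]
• Further raise the observability of the internal structure of incident response
• With TTX defined, metrics are already more or less in place
• Could the practices around metrics, logs, and traces be used to improve it?
[2] Unveiling the black box with observability stack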

Slide 60

Slide 60 text

Metrics 60

Slide 61

Slide 61 text

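A faint, lingering sense of one-sidedness
• The TTX metrics introduced so far all merely decompose TTR
  • In other words, they focus only on shortening system recovery time
• From an SRE viewpoint, service reliability matters
  • e.g. Is anything learned? Is recurrence prevented?
• From a product operations viewpoint, customer reliability matters
  • e.g. Are customers being dealt with adequately?
=> Raising observability requires metrics that cover the response process from more angles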

Slide 62

Slide 62 text

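What happens in parallel with system recovery
• Explaining to customers and sharing what is happening
• Reporting on and analyzing the incident
• Considering and implementing permanent fixes
=> At present, observing the activities above is out of scope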

Slide 63

Slide 63 text

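Applying TTX metrics: widening the scope of observation
Extend the scope of observation to the whole of incident response, and define metrics that serve as improvement indicators
Metric name | Purpose
Incident Response Metrics | Identifying issues in, and tracking improvement of, the recovery response itself
Customer Reliability Metrics | Identifying issues in, and tracking improvement of, the customer response
Learning Metrics | Tracking the activities up to the point where the organization gains learnings
Improvement Metrics | Analyzing how far permanent fixes have been implemented
=> This time, an example of Customer Reliability Metrics is introduced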

Slide 64

Slide 64 text

64

Slide 65

Slide 65 text

Log 65

Slide 66

Slide 66 text

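Record events during the response
• Collection
  • Record as structured logs who did what, when, with which command, and based on which decision
  • e.g. chat, status changes, events relayed from external tools
• Example uses
  • Automatically generate timelines and status pages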
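One possible shape for such structured logs is one JSON object per event, from which a timeline can be rebuilt by sorting on the timestamp. The field names (`timestamp`, `actor`, `kind`, `detail`) and the sample events are assumptions for this sketch, not Waroom's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ResponseEvent:
    timestamp: str  # ISO 8601 in UTC, so string order equals time order
    actor: str      # who
    kind: str       # e.g. "chat", "status_change", "external_tool"
    detail: str     # which command or decision


def to_log_line(event: ResponseEvent) -> str:
    """Serialize one event as a single JSON log line."""
    return json.dumps(asdict(event), sort_keys=True)


def build_timeline(log_lines: list[str]) -> list[str]:
    """Rebuild a human-readable timeline from the structured log."""
    events = [json.loads(line) for line in log_lines]
    events.sort(key=lambda e: e["timestamp"])
    return [
        f'{e["timestamp"]} [{e["kind"]}] {e["actor"]}: {e["detail"]}'
        for e in events
    ]


# Made-up events, deliberately out of order.
log_lines = [
    to_log_line(ResponseEvent("2025-01-01T10:08:00Z", "alice", "status_change", "acknowledged")),
    to_log_line(ResponseEvent("2025-01-01T10:05:00Z", "monitoring", "external_tool", "alert fired")),
    to_log_line(ResponseEvent("2025-01-01T10:20:00Z", "bob", "chat", "rolling back the last deploy")),
]
timeline = build_timeline(log_lines)
print("\n".join(timeline))
```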

Slide 67

Slide 67 text

Behind the scenes at Waroom, preparation is underway...

Slide 68

Slide 68 text

Trace 68

Slide 69

Slide 69 text

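Observe the flow and dependencies of the response process
• Collection
  • Create one span per incident status
  • Manage detection through recovery as a single trace
  • Subdivide by action and integrate
• Example uses
  • Visualize the work done between status transitions and how long it took
  • Identify the step that became the bottleneck of the response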
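As a rough illustration of "one span per status, one trace per incident", the sketch below turns a list of status transitions into spans and picks the longest one as the bottleneck. It uses plain dataclasses rather than any tracing library, and the status names and timestamps are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Span:
    name: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def spans_from_transitions(transitions: list[tuple[str, datetime]]) -> list[Span]:
    """Each status becomes a span lasting until the next status is entered."""
    return [
        Span(status, start, next_start)
        for (status, start), (_, next_start) in zip(transitions, transitions[1:])
    ]


def bottleneck(spans: list[Span]) -> Span:
    """The span that took the longest within the trace."""
    return max(spans, key=lambda s: s.duration)


# One incident = one trace, from detection to recovery (assumed timestamps).
t0 = datetime(2025, 1, 1, 10, 0)
transitions = [
    ("detected", t0),
    ("acknowledged", t0 + timedelta(minutes=3)),
    ("identified", t0 + timedelta(minutes=40)),
    ("recovered", t0 + timedelta(minutes=55)),
]
trace = spans_from_transitions(transitions)
slowest = bottleneck(trace)
print(slowest.name, slowest.duration)
```

Individual actions could be added as child spans in the same way, which is the "subdivide by action" step.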

Slide 70

Slide 70 text

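How to capture events when the work spans many tools
• Recovery work often cuts across a variety of tools
  • e.g. PagerDuty → Slack → Datadog → AWS → GitHub...
• At present, only the responder knows everything that was done for a single incident
• Having responders save MELT by hand is unrealistic
→ In a world where the response is AI-based, far more information becomes obtainable!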

Slide 71

Slide 71 text

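AI-based incident response
• Based on what it is told in natural language, the AI performs various operations while working with MCP servers and external tools
• → Actions always go through Waroom, so fine-grained events can be saved automatically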

Slide 72

Slide 72 text

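Summary
1. MTTR is not useful as an improvement metric
2. When using metrics, consistency from the purpose through to the data analysis is what matters
3. To reduce variability, make the question concrete and subdivide the metrics
4. The process of defining TTX metrics and how to use them
5. Bringing in o11y practices gets us closer to more comprehensive observation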

Slide 73

Slide 73 text

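In closing
• Building a mechanism that collects metrics automatically is hard
• On top of that, building a visualization platform is hard
• On top of that, extracting subsets of metrics by category or label is hard too
• → Please make use of Waroom
• If this caught your interest, come to the Topotal booth!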

Slide 74

Slide 74 text

Thank you very much