Slide 1

Slide 1 text

Creating "Awesome Change" in SmartNews! ~ How the special task force "ACT" took on the mission to halve incidents ~ April 15, 2025

Slide 2

Slide 2 text

Who am I? Ikuo Suyama • Staff Engineer • Ads Backend Expert • Nov. 2020~ SmartNews, Inc. • Interest: Fishing, Camping, Gunpla, Anime

Slide 3

Slide 3 text

Today I'm going to talk about incidents (outages)!

Slide 4

Slide 4 text

Imagine for a moment... You arrive at the office in the morning and are told, "OK, starting today, please reduce incidents." Where would you start? That is today's story.

Slide 5

Slide 5 text

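What I'll talk about today
1. Experience and insights gained on the front line of incident response
2. Analysis of incidents, and the countermeasures we took based on it
3. Measures for getting a unified process adopted across the organization
Disclaimer 1: Heavily context-dependent, an N=1 anecdote! Six months of experience on a special task team fighting incidents.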

Slide 6

Slide 6 text

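What I won't (can't) talk about today
1. Collaboration between Dev teams and Ops teams → this talk assumes Dev teams handle both operations and incident response
2. Applying SRE / DevOps "best practices" → this is strictly wisdom and experience gained in the field
Disclaimer 2: I am neither an SRE professional nor a DevOps professional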

Slide 7

Slide 7 text

Agenda
1. Launch: Assemble! Special forces "ACT"!
2. First moves: "Get our hands dirty"!
3. Breakthrough: Cutting incidents in half!?
4. Return: Remaining challenges and what comes next

Slide 8

Slide 8 text

01 Launch: Assemble! Special forces "ACT"!!

Slide 9

Slide 9 text

1-1. How it began
It was back in September...
CTO: "There are too many incidents! We're forming a task force to halve them! It's the Awesome Change Team, "ACT"!"

Slide 10

Slide 10 text

1-1. How it began
Me: "...Isn't that because we are shipping a huge volume of changes on your instructions...?"

Slide 11

Slide 11 text

1-1. How it began
Me: "...Isn't that because we are pushing through a huge volume of changes on your instructions...?"
Hold on a minute!

Slide 12

Slide 12 text

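1-1. How it began
Hold on a minute!
• CTO) Are there really "many" incidents?
  • How do we even define "many incidents"?
• Ikuo) Are there really that many changes?
  • Are changes even the reason there are many incidents?
  • Changes to what, exactly?
At this point both sides were going on gut feeling, with no evidence.
* That said, a senior engineer's nose should not be underestimated.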

Slide 13

Slide 13 text

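1-2. Assembling the strongest team
With a six-month deadline and top priority, the "strongest team" was convened!
The upside of top-down: a project commissioned directly by the head of engineering.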

Slide 14

Slide 14 text

1-2. Assembling the strongest team
Aces gather from every division...!
Ads / News / Ranking / Push Notification / Core System (Infra) / Mobile / SmartView (Article)

Slide 15

Slide 15 text

1-2. Assembling the strongest team
Aces gather from every division...!
Divisions: Ads / News / Ranking / Push Notification / Core System (Infra) / Mobile / SmartView (Article)
Members: Ads: Ikuo • News & Push: D • Ranking: R • CoreSystem: T • Mobile: M • SmartView: T • VPoE (Manager): K, reporting to the CTO
* For the sake of the story, please allow me to call myself an ace as well.

Slide 16

Slide 16 text

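1-2. Assembling the strongest team
Aces gather from every division...!
Divisions: Ads / News / Ranking / Push Notification / Core System (Infra) / Mobile / SmartView (Article)
Members: Ads: Ikuo • News & Push: D • Ranking: R • CoreSystem: T • Mobile: M • SmartView: T • VPoE (Manager): K, reporting to the CTO
SRE skill level: "chotto dekiru" (can do it a little)
* For the sake of the story, please allow me to call myself an ace as well.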

Slide 17

Slide 17 text

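1-2. Assembling the strongest team
The downside of top-down
Intense pressure: "we absolutely must deliver results within six months"...
Pulling ace-class engineers out of every team shows the organization's resolve and how important this is... and at the same time it leaves no room for excuses.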

Slide 18

Slide 18 text

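1-3. Giving the team a direction
Taking on an ambiguous problem with no clear answer
• "Reduce incidents": a problem space that looks simple but is vast
• Where do we start?
• What is the actual problem, and which countermeasures are effective?
• Are there even that many incidents in the first pl(snip)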

Slide 19

Slide 19 text

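1-3. Giving the team a direction
Setting a clear goal
• What "Awesome Change" means
  • Reduce critical incidents
  • Install SRE best practices in the organization
• KPIs to improve:
  • Mean Time Between Failure (MTBF) / Change Failure Rate (CFR) = number of incidents
  • Mean Time to Recover (MTTR) = incident resolution time
Putting "why are we here?" into words! Our VPoE did a great job of this.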

Slide 20

Slide 20 text

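1-3. Giving the team a direction
Setting clear priorities
• P0: Support incident handling
• P1: Close out critical incident action items that have not been dealt with
• P2: Fundamental system improvements that prevent incidents from occurring
What to do right in front of us was clear!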

Slide 21

Slide 21 text

02 First moves: "Get our hands dirty!"

Slide 22

Slide 22 text

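2-1. P0: Supporting incident handling
Get our Hands Dirty: step into every incident!
• When an incident happens, some ACT member's PagerDuty goes off
• In the end, every ACT member gets invited to wherever the incident is happening
• If it is in your home domain, you join the firefighting
• If it is not, you volunteer for status updates, securing the people needed, liaising with the business side, and so on
It's tough!!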

Slide 23

Slide 23 text

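2-1. P0: Supporting incident handling
Anti-pattern: Pager Monkey
"SRE handles all on-call, which means they are now pager monkeys whose job is to follow a script in the early hours of the morning when the service goes down." (from SREをはじめよう, the Japanese edition of Becoming SRE, chapter on SRE culture)
Naturally a bad practice. Stepping into every incident does not stop incidents from happening...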

Slide 24

Slide 24 text

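2-1. P0: Supporting incident handling
An anti-pattern... but it was not all bad
• "Incident = ACT" became the first thing people thought of: "ACT steps into incidents", "ACT helps us when there is an incident", "ACT is doing its job!!"
• And we built up a reserve of trust!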

Slide 25

Slide 25 text

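2-2. P1: Closing out action items
The forgotten action items
There were certainly items that had never been handled and had simply been forgotten.
• There was already a culture of writing incident reports
  • Action items to prevent recurrence were recorded in them too
  • Wonderful!!
• But the action items were not being managed
  • Owner, due date, completion status
  • !!??!??!
• On top of that, the report format differed from division to division
  • In fact, it differed from author to author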

Slide 26

Slide 26 text

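2-2. P1: Closing out action items
Making a list of the forgotten ones
• We wanted to register each action item (AI) as a JIRA ticket and track its state!
  • Ideally with reminders as well
• But every report used a completely different format...
• What now? "Help me, ChatGPT..." "No can do."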

Slide 27

Slide 27 text

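2-2. P1: Closing out action items
Get our Hands Dirty: organizing the data by hand
"I know! Let's manually migrate every action item from the past year of incident reports into a Notion Database!"
* Once it is in a Database, the data can be fetched through the API, so we can do whatever we need with it.
Lesson 1: Keep data in a machine-processable format wherever possible!!
Lesson 2: Do not shy away from unglamorous methods if they serve the goal!!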

Slide 28

Slide 28 text

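2-2. P1: Closing out action items
Get our Hands Dirty: relentlessly working through the unfinished items
"This one still isn't done! Close it out!!!"
Lesson 3: Everyone is busy. When you ask and nothing happens, it is not that they won't do it; they can't. Do the work with your own hands!!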

Slide 29

Slide 29 text

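2-3. P2: Fundamental improvement: introducing a unified incident response process
Why was there no company-wide unified process in the first place?
• Each division had its own way and style of responding to incidents
• There had been an earlier attempt to create a company-wide protocol
  • "Incident Response Framework: IRF"
  • But it had not taken hold and was not being used
  • It only took one particular division's requirements into account!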

Slide 30

Slide 30 text

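2-3. P2: Fundamental improvement: introducing a unified incident response process
How do we build a company-wide unified process and framework?
• IRF itself was well designed as a process
• Using it as the base, we aimed for:
  • A procedure that can be unified company-wide ... by pouring in domain knowledge and experience from each ace
  • Something lightweight
    • Complicated procedures get in the way in the middle of an incident!
• We also studied frameworks published by other companies and adopted their good parts
  • e.g. PagerDuty Incident Response
This was possible precisely because the team covered every area.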

Slide 31

Slide 31 text

2-3. P2: Fundamental improvement: introducing a unified incident response process
IRF 2.0 Contents
1. Role, Playbook
2. Severity Definition
3. Workflow
4. Communication Guideline
5. Incident Report Template, Postmortem
I'll introduce the important parts. Please see the slides for details!

Slide 32

Slide 32 text

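2-3. P2: Fundamental improvement: introducing a unified incident response process
IRF 2.0: Role, Playbook
• On-Call Engineer
  • The engineer who receives the on-call page. Triages the alert and, if necessary, escalates to the IC and starts IRF (declares an incident).
• Incident Commander (IC)
  • The person who takes command of the incident response. Gathers the people needed and organizes information. May also take on external communication (CL). Usually a Tech Lead or Engineering Manager.
  • Responsible not for the actual firefighting but for organizing information and the situation, and for making decisions.
• Responder
  • Does the actual firefighting (rollbacks, configuration changes).
• Communication Lead (CL)
  • Handles communication with external stakeholders (here, anyone who is not an engineer).
Separating the responsibilities of the IC and the Responder is the key point.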

Slide 33

Slide 33 text

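2-3. P2: Fundamental improvement: introducing a unified incident response process
IRF 2.0: Severity Definition
The IC decides the severity provisionally when declaring the incident. The final rating is settled in the postmortem.
• 🔥 SEV-1
  • A core UX feature, such as reading news, is completely down
• 🧨 SEV-2
  • A core UX feature is partially down, or a sub UX feature is completely down
• 🕯 SEV-3
  • A sub UX feature is partially down
It is essential to have a rough idea of the SEV at the time of the initial response. (Severe incidents should be resolved sooner.)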
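The severity table above can be read as a small decision rule. The following is a minimal sketch of that rule, not the team's tooling; the function name and the `feature_tier` / `outage` string values are assumptions introduced here for illustration.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


def provisional_severity(feature_tier: str, outage: str) -> Severity:
    """Return a provisional SEV level.

    feature_tier: "core" or "sub" (hypothetical labels for core/sub UX features).
    outage: "full" or "partial" (hypothetical labels for complete/partial stop).
    """
    if feature_tier not in ("core", "sub"):
        raise ValueError(f"unknown feature tier: {feature_tier!r}")
    if outage not in ("full", "partial"):
        raise ValueError(f"unknown outage type: {outage!r}")

    # Core feature completely down -> SEV-1
    if feature_tier == "core" and outage == "full":
        return Severity.SEV1
    # Core partially down, or sub feature completely down -> SEV-2
    if feature_tier == "core" or outage == "full":
        return Severity.SEV2
    # Sub feature partially down -> SEV-3
    return Severity.SEV3
```

Because the value is only provisional, a helper like this would at most suggest a starting point for the IC; the postmortem still decides the final rating.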

Slide 34

Slide 34 text

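2-3. P2: Fundamental improvement: introducing a unified incident response process
IRF 2.0: Workflow
The flow from the start of an incident to its end. 🩸 : Bleeding, a status in which damage is still ongoing
1. 🩸 Occurrence
  • The problematic event occurs. Triggers include deployments and configuration changes
2. 🩸 Detection
  • The on-call engineer has detected the problem, for example through an alert. Triage begins
3. 🩸 Declaration
  • The incident is "declared". Following IRF, work to stop the bleeding begins under the IC's command
  • Necessary external communication starts at the same time. Updates continue for as long as the bleeding lasts
4. ❤🩹 Mitigation
  • The immediate cause is removed, for example by rolling back the change, and the damage stops spreading
5. Resolution
  • The permanent fix is complete: bug fixes, data correction, and so on. The bleeding has fully stopped
6. Postmortem
  • Based on the incident report, the root cause is investigated and recurrence-prevention measures are considered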

Slide 35

Slide 35 text

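2-3. P2: Fundamental improvement: introducing a unified incident response process
IRF 2.0: Communication Guideline
Defines where communication happens (Slack channels)
• #incident
  • Status announcements to everyone, and communication with external stakeholders
• #incident-irf-[incidentId]-[title]
  • Technical communication for solving the problem. All related information and discussion is gathered here
  • When needed, a WAR ROOM (online, Google Meet) is set up for everyone to join
Having the discussion and information in a single channel is convenient when generating the report.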
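A naming rule like `#incident-irf-[incidentId]-[title]` is easy to apply consistently if a helper builds the name. The sketch below is one possible way to do that; the function name, the integer ID, and the 80-character limit (Slack's channel-name limit as I understand it) are assumptions, not details from the talk.

```python
import re


def incident_channel_name(incident_id: int, title: str, max_len: int = 80) -> str:
    """Build a channel name of the form incident-irf-[incidentId]-[title]."""
    # Lowercase the title and collapse every run of other characters into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    name = f"incident-irf-{incident_id}-{slug}"
    # Truncate to the assumed length limit and avoid a dangling hyphen.
    return name[:max_len].rstrip("-")
```

For example, `incident_channel_name(123, "Ads API 5xx spike!")` gives `incident-irf-123-ads-api-5xx-spike`. Titles written entirely in non-ASCII characters would collapse to just the ID prefix, so a real implementation might need a different fallback.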

Slide 36

Slide 36 text

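2-3. P2: Fundamental improvement: introducing a unified incident response process
IRF 2.0: Incident Report Template & Postmortem
Applying a company-wide unified format
• Summary
• Impact
• Direct Cause, Mitigation
• Root Cause Analysis (5-whys)
  • It is important to analyze the direct cause and the root cause separately!
  • Action items are set against the root cause to prevent recurrence
• Action Items
• Timeline
  • In a machine-processable format!!!!
Templates that differed by division were unified (important), and postmortems were centralized.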

Slide 37

Slide 37 text

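2-3. P2: Fundamental improvement: introducing a unified incident response process
We built it, but how do we get it adopted?
"Here is the wonderful IRF 2.0. It's great, so everyone please read it and follow it."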

Slide 38

Slide 38 text

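2-3. P2: Fundamental improvement: introducing a unified incident response process
We built it, but how do we get it adopted?
"Here is the wonderful IRF 2.0. It's great, so everyone please read it and follow it."
...is NOT how it works!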

Slide 39

Slide 39 text

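2-3. P2: Fundamental improvement: introducing a unified incident response process
Get our Hands Dirty: step into every incident and apply the IRF 2.0 process by force
"Hello there, I'm the IRF guy. OK, I'll be the Incident Commander! Everyone else, please concentrate on the firefighting!!"
Lesson 4: In an emergency nobody has the headroom to learn a new protocol! Practice is the only way!!
Lesson 5: Use it yourselves first! Keep the feedback loop turning!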

Slide 40

Slide 40 text

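2-3. P2: Fundamental improvement: introducing a unified incident response process
The three great virtues of a programmer: laziness
"Just use the /incident Slack command. I'll take care of everything else."
Automated: creating the incident ticket, creating the dedicated channel, and posting links to the relevant playbooks.
The order matters: automate what has already been shown to work!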

Slide 41

Slide 41 text

2-4. P2: Fundamental improvement: seeing incidents at higher resolution
An important realization
CTO: "Nice work! So, have incidents gone down? What happened to MTTR?"
Me: "Umm..."

Slide 42

Slide 42 text

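2-4. P2: Fundamental improvement: seeing incidents at higher resolution
An important realization: we are not tracking our KPIs!
"Actually, how many incidents did we handle this month?" "And last month...?" "How long did each one take to resolve?"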

Slide 43

Slide 43 text

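2-4. P2: Fundamental improvement: seeing incidents at higher resolution
An important realization: we are not tracking our KPIs!
"Actually, how many incidents did we handle this month?" "And last month...?" "How long did each one take to resolve?"
We know nothing!!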

Slide 44

Slide 44 text

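2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Let's look at the data: what is needed
Data → Collection → Visualization
"Do this, then this, and there you go."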

Slide 45

Slide 45 text

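2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Let's look at the data: what is needed
Data → Collection → Visualization
The data definition is the crux!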

Slide 46

Slide 46 text

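2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Data definition: modeling an incident
• Attributes of an Incident
  • Title
  • Status
    • State machine (described later)
  • Severity
    • SEV 1~3 (IRF 2.0)
  • Direct Cause
    • Described later
  • Direct Cause System
    • Groups of components, organized into meaningful units of microservices
  • Direct Cause Workload
    • Online Service, Offline Pipeline, …
Define enums wherever possible. With free-text input the cardinality gets too high and no meaningful analysis is possible.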
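To illustrate why enums help, here is a minimal sketch of an incident record in which the analyzable attributes are restricted to fixed values. The class names and the `DirectCause` members are hypothetical; the transcript does not include the real value lists, and Status is left out for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


class Workload(Enum):
    ONLINE_SERVICE = "Online Service"
    OFFLINE_PIPELINE = "Offline Pipeline"


class DirectCause(Enum):
    # Hypothetical members; the real list is not shown in the transcript.
    INSUFFICIENT_TESTING = "Insufficient testing"
    CONFIG_CHANGE = "Config change"
    OTHER = "Other"


@dataclass(frozen=True)
class Incident:
    title: str                 # free text is fine for the title
    severity: Severity
    direct_cause: DirectCause
    direct_cause_system: str   # in practice this would also be an enum of component groups
    workload: Workload


def count_by_cause(incidents: list[Incident]) -> Counter:
    """Group incidents by direct cause; this only works because the values are low-cardinality."""
    return Counter(incident.direct_cause for incident in incidents)
```

Constructing an enum from unknown free text, such as `DirectCause("oops, typo")`, raises `ValueError`, which is the property that keeps the data analyzable.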

Slide 47

Slide 47 text

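2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Incident modeling: Direct Cause
The cause of the incident. Defined so that analysis of it leads to something actionable.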

Slide 48

Slide 48 text

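2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Incident modeling: Status (State Machine)
Describes the state transitions an incident goes through. Once the states are defined and organized as a state machine, the time metrics we want can be defined as the time between state transitions!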
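A state machine of this kind can be sketched as a table of allowed transitions. The state names below follow the statuses used elsewhere in the deck (Occurred, Detected, Declared, Mitigated, Resolved); the strictly linear order and the `advance` helper are assumptions, since the diagram itself is not part of the transcript.

```python
from enum import Enum


class Status(Enum):
    OCCURRED = "Occurred"
    DETECTED = "Detected"
    DECLARED = "Declared"
    MITIGATED = "Mitigated"
    RESOLVED = "Resolved"


# Assumed linear flow: each state has exactly one legal successor.
NEXT_STATUS = {
    Status.OCCURRED: Status.DETECTED,
    Status.DETECTED: Status.DECLARED,
    Status.DECLARED: Status.MITIGATED,
    Status.MITIGATED: Status.RESOLVED,
}


def advance(current: Status, target: Status) -> Status:
    """Move to `target` if the transition is allowed, otherwise raise ValueError."""
    if NEXT_STATUS.get(current) is not target:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target
```

Recording a timestamp each time `advance` succeeds would yield exactly the per-transition times that the metrics are built from.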

Slide 49

Slide 49 text

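2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Data collection: incident reports
The incident report template was updated to include the data items defined earlier
• The necessary items became required attributes
• A Notion Database for entering the event timeline was added
  • People record the time at which each state changed
  • In a machine-processable format!!!!!
If the data definition is solid, the source can be flexible (assuming, of course, that the data is trustworthy).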

Slide 50

Slide 50 text

2-4. P2: Fundamental improvement: seeing incidents at higher resolution
The rest is easy: "this, then this, and there you go."
ChatGPT did it overnight.

Slide 51

Slide 51 text

2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Incident dashboard: visualizing the key metrics
All good!

Slide 52

Slide 52 text

2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Incident dashboard: visualizing the key metrics
All good! ...Hold on a minute!

Slide 53

Slide 53 text

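2-4. P2: Fundamental improvement: seeing incidents at higher resolution
What do we do about past data?
Reports from before IRF 2.0 (the company-wide unified format)
• A different format for each division
• Timelines written as bullet points, missing data, …
"If we wait about three months, enough data will have accumulated, right?"
"We only have six months!!"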

Slide 54

Slide 54 text

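2-4. P2: Fundamental improvement: seeing incidents at higher resolution
What do we do about past data? Get our Hands Dirty, of course!
"I know! Let's manually migrate the past year of incident reports to the new format, all of them!!"
Re) Lesson 2: Do not shy away from unglamorous methods if they serve the goal!!
The whole team split up the work and completed the migration little by little over one week.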

Slide 55

Slide 55 text

2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Incident dashboard: visualizing the key metrics
All good! All good!

Slide 56

Slide 56 text

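2-4. P2: Fundamental improvement: seeing incidents at higher resolution
Observing the key metrics: observing MTTR and breaking it down
1. Occurred → 2. Detected → 3. Declared → 4. Mitigated → 5. Resolved
Time To Detect / Time To Mitigate / Time To Resolve
We could now see where the time was going and where we currently stood!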
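Given one timestamp per state, the three durations fall out by subtraction. This is a minimal sketch assuming that Time To Detect runs from Occurred to Detected, Time To Mitigate from Detected to Mitigated, and Time To Resolve from Mitigated to Resolved; the function and key names are introduced here for illustration.

```python
from datetime import datetime, timedelta


def phase_durations(timeline: dict[str, datetime]) -> dict[str, timedelta]:
    """Split one incident's timeline into the three phases of MTTR."""
    return {
        "time_to_detect": timeline["Detected"] - timeline["Occurred"],
        "time_to_mitigate": timeline["Mitigated"] - timeline["Detected"],
        "time_to_resolve": timeline["Resolved"] - timeline["Mitigated"],
    }


# Hypothetical timeline for a single incident.
example = {
    "Occurred": datetime(2025, 1, 10, 10, 0),
    "Detected": datetime(2025, 1, 10, 10, 12),
    "Declared": datetime(2025, 1, 10, 10, 15),
    "Mitigated": datetime(2025, 1, 10, 10, 40),
    "Resolved": datetime(2025, 1, 10, 13, 0),
}
durations = phase_durations(example)
```

Averaging each key across incidents would give the mean values (MTTD, MTTM, and so on) that a dashboard can plot over time.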

Slide 57

Slide 57 text

03 Breakthrough: Cutting incidents in half!?

Slide 58

Slide 58 text

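3-1. What does "reducing incidents" mean?
Is it really incidents that we want to reduce? 🤔
What we really want to reduce is the impact of incidents, e.g. Revenue, Reputation, Effectiveness... right? Revenue loss in particular...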

Slide 59

Slide 59 text

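3-1. What does "reducing incidents" mean?
Estimating the impact of incidents
Impact ≈ Σ (over the number of incidents) [ (MTTD + MTTM) × Severity Factor ]
For a toC, advertising-based business, this largely determines the revenue impact.
• MTTD + MTTM (the time it took to stop the bleeding): we want it as short as possible
  • Comparatively easy to improve and easy to get started on
• Severity Factor (how badly the incident affects users): we especially want to reduce large incidents
  • But this is hard to control
• Number of incidents: we want as few as possible
  • Requires medium- to long-term work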
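The impact estimate above is a weighted sum, which the following minimal sketch spells out. It applies the formula per incident (time to detect plus time to mitigate, multiplied by a severity factor); the function name, the tuple layout, and the numeric severity factors are hypothetical and only illustrate the arithmetic.

```python
def estimated_impact(incidents: list[tuple[float, float, float]]) -> float:
    """Sum (time_to_detect + time_to_mitigate) * severity_factor over all incidents.

    Each tuple is (time_to_detect_minutes, time_to_mitigate_minutes, severity_factor).
    """
    return sum((ttd + ttm) * factor for ttd, ttm, factor in incidents)


# Hypothetical example: one severe incident and one minor incident.
impact = estimated_impact([
    (10.0, 20.0, 3.0),  # 30 minutes of bleeding, weighted 3x
    (5.0, 5.0, 1.0),    # 10 minutes of bleeding, weighted 1x
])
```

The three levers in the list map directly onto the code: shorter times shrink each term, lower severity shrinks the weight, and fewer incidents shrink the number of terms.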

Slide 60

Slide 60 text

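3-1. What does "reducing incidents" mean?
Estimating the impact of incidents
Impact ≈ Σ (over the number of incidents) [ (MTTD + MTTM) × Severity Factor ]
• MTTD + MTTM (the time it took to stop the bleeding): we want it as short as possible
  • Comparatively easy to improve and easy to get started on
  • This is where we start
• Severity Factor (how badly the incident affects users): we especially want to reduce large incidents
  • But this is hard to control
• Number of incidents: we want as few as possible
  • Requires medium- to long-term work
This also matches the KPIs set when ACT was formed! But after several months of experience, we saw the problem at a higher resolution.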

Slide 61

Slide 61 text

3-2. Approaching incident resolution time
How do we reduce MTTR?
Lesson 6: If you step into incidents with overwhelmingly strong aces, they get resolved faster (maybe?)
"Seriously?"

Slide 62

Slide 62 text

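3-2. Approaching incident resolution time
Raising the resolution of MTTR: the state machine
1. Occurred → 2. Detected → 3. Declared → 4. Mitigated → 5. Resolved
Each phase differs in importance and in the right countermeasure!
• Time To Detect: the time it takes to detect the problem. Mainly the domain of alerting.
• Time To Mitigate: the time from detection to stopping the bleeding. The most critical phase, but also the easiest to work on. The domain of IRF.
• Time To Resolve: the time from stopping the bleeding through the permanent fix, data correction, and other follow-up work. The bleeding has already stopped, so accuracy matters more than speed.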

Slide 63

Slide 63 text

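3-2. Approaching incident resolution time
Approach to MTTD (Mean Time To Detect): getting alerts in order
• "Let's add a new alert for every problem we failed to detect or detected late" does not work well
  • "over-monitoring is a harder problem to solve than under-monitoring." (Site Reliability Engineering: How Google Runs Production Systems)
  • Too many false-positive alerts, so real problems get buried
• Shift to alerting based on SLO + Error Budget
  • The ideal: if you get paged, it is an incident
  • This cannot be done overnight!
A remaining challenge. I'll come back to it in chapter 4!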

Slide 64

Slide 64 text

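3-2. Approaching incident resolution time
Approach to MTTM (Mean Time To Mitigate): the unified framework, IRF 2.0
• Making the criteria for an incident explicit
• Unifying the response flow and the communication guideline
• Separating Responder and Commander
  • Plus training, with the aces leading by example
Deploying the aces and getting IRF 2.0 adopted had an immediate effect!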

Slide 65

Slide 65 text

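3-3. Approaching the number of incidents
How do we reduce incidents themselves?
Lesson 7: Even if you step into incidents with overwhelmingly strong aces, the number of incidents does not go down!!
"Oh no..."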

Slide 66

Slide 66 text

3-3. Approaching the number of incidents
Going after the bottlenecks
"But we have data!! Come forth, incident dashboard!"

Slide 67

Slide 67 text

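3-3. Approaching the number of incidents
Going after the bottlenecks
We began to see where, and for what reasons, failures are likely to happen. Backed by data, we address each cause in turn!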

Slide 68

Slide 68 text

3-3. Approaching the number of incidents
Going after the bottlenecks
Incident cause #1: "Insufficient testing"

Slide 69

Slide 69 text

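3-3. Approaching the number of incidents
Approach 1 to insufficient testing: testing before going to production?
In the postmortem...
• Why is it released to production without being tested?
  • Because it can only be tested in production
• Why can it only be tested in production?
  • Lack of data, and inadequate test environments such as staging
• ...
OK! Let's get the staging environment in order!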

Slide 70

Slide 70 text

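3-3. Approaching the number of incidents
Approach 1 to insufficient testing: getting the staging environment in order
Still halfway there: ten times harder than imagined!
• There is an enormous number of components
• Requirements and usage differ by division: News, Ads, Infra
  • Ads is toB and directly tied to money! Solid and strict
  • News is toC; speed of feature delivery comes first!
"We'll set up staging for everything for now" is impossible and seems pointless. We are working on Ads first, where demand is comparatively high.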

Slide 71

Slide 71 text

3-3. Approaching the number of incidents
Approach 2 to insufficient testing: what about unit tests?
• Why was it not caught by unit tests?
  • Because there are no unit tests
• ... 😭
OK! Let's collect test coverage!

Slide 72

Slide 72 text

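3-3. Approaching the number of incidents
Approach 2 to insufficient testing: analyzing unit tests
• We charged into the systems and components that were not collecting test coverage and mass-produced PRs that report coverage
• We plotted unit test coverage against the number of failures for each system
(Scatter plot: # of Incidents vs. Ave. Coverage)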

Slide 73

Slide 73 text

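3-3. Approaching the number of incidents
Approach 2 to insufficient testing: analyzing unit tests
• Does test coverage correlate with failures? → A correlation did show up.
• But whether raising test coverage reduces failures is unknown (correlation is not causation)
• Still, looking system by system and team by team with domain knowledge, the places with low coverage and many incidents do seem to have reasons behind them
  • Unit tests are hard to write there, there is no culture of writing unit tests, and so on
OK! Let's charge into the places with few unit tests and write unit tests like mad!!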

Slide 74

Slide 74 text

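3-3. Approaching the number of incidents
Approach 2 to insufficient testing: building up unit tests
Get our Hands Dirty: add unit tests to everything in sight
2. Use Sonarqube to find files with many lines and low coverage
3. Write unit tests like mad with the help of an LLM
4. Repeat until the component as a whole is above 50%
We did three or four components, but it was a drop in the ocean. We had assumed that once there were samples, others would carry on...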

Slide 75

Slide 75 text

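3-3. Approaching the number of incidents
Approach 2 to insufficient testing: building up unit tests
• What about generating them automatically with an LLM?
  • For now the accuracy is not good enough
• Fundamentally, teams need the habit of writing unit tests continuously
  • But there is no incentive or shared value for doing so
  • People are chased by deadlines and cannot spare time for unit tests (!)
We need to work on the organization and its culture! A remaining challenge. Continued in chapter 4...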

Slide 76

Slide 76 text

3-3. Approaching the number of incidents
Going after the bottlenecks
Incident cause #2: "Configuration changes"

Slide 77

Slide 77 text

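3-3. Approaching the number of incidents
Approach to configuration changes
• The main kinds of "configuration change"
  • Mechanisms that control application behavior online
  • A/B tests, Feature Flags
• Both have in-house platforms, but they are complex
• Problems occurred frequently: A/B tests applied to unintended users, or incorrect settings
OK! Let's get A/B testing and feature flags in order!!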

Slide 78

Slide 78 text

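3-3. Approaching the number of incidents
Approach to configuration changes
• Bulk deletion of feature flags that are no longer used (now the default)
• Establishing criteria for when to use feature flags
• Stronger validation
  • (It had been possible to enter settings that cause parse errors...)
We also worked with the A/B test platform team and pushed through major improvements, including usability!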
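Rejecting unparseable settings at input time is the simplest form of the validation mentioned above. The sketch below assumes the setting is submitted as a JSON string with a `rollout_percent` field; both the JSON format and the field name are assumptions made here for illustration, since the talk does not describe the platform's actual configuration format.

```python
import json


def validate_flag_config(raw: str) -> dict:
    """Parse and validate a flag setting before it is accepted; raise ValueError otherwise."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Catch parse errors at input time instead of at runtime in production.
        raise ValueError(f"config is not valid JSON: {exc}") from exc

    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")

    rollout = config.get("rollout_percent")  # hypothetical field
    is_number = isinstance(rollout, (int, float)) and not isinstance(rollout, bool)
    if not is_number or not 0 <= rollout <= 100:
        raise ValueError("rollout_percent must be a number between 0 and 100")

    return config
```

The design choice is simply to fail early: a setting that cannot be parsed or is out of range never reaches the running application.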

Slide 79

Slide 79 text

3-3. Approaching the number of incidents
Going after the bottlenecks
Incident cause #3: "Offline Batch" ...or rather, Flink

Slide 80

Slide 80 text

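3-3. Approaching the number of incidents
Approach to Offline Batch
• Many offline batch jobs are actually Flink stream-processing jobs
  • Server → Kafka → Flink → Scylla, ClickHouse, …
• They run on a platform developed in-house by a dedicated team
• Application teams have few Flink experts, so problems with performance and with restarts occurred frequently
OK! Let's get the Flink platform in order!!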

Slide 81

Slide 81 text

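3-3. Approaching the number of incidents
Approach to Offline Batch
• Improving the platform itself
  • UI improvements, automated deployment, …
• Spreading best practices
  • Documentation
  • Publishing a template project that includes tests
• Sending refactoring PRs directly to each component
  • Implementing best practices and tests
We asked the platform team for help and carried out platform improvements and documentation work!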

Slide 82

Slide 82 text

3-4. Results: were incidents cut in "half"!?
ACT era vs. Before ACT
• Number of incidents... +32%, an increase!!
• MTTR... -48%, a decrease!! Halved!!!

Slide 83

Slide 83 text

3-4. Results: were incidents cut in "half"!?
ACT era vs. Before ACT
• Number of incidents... +32%, an increase!!
• MTTR... -48%, a decrease!! Halved!!!
Hold on a minute!

Slide 84

Slide 84 text

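3-4. Results: were incidents cut in "half"!?
Incidents went up...?
• Seasonality? December was far higher than any other month
  • Accidents from changes rushed in before the holidays?
• A side effect of IRF 2.0 adoption?
  • Defining what an incident is made detection more sensitive
  • Maslow's hammer: "If all you have is IRF, everything looks like an incident"
• The trend has been downward since January
Continuous improvement work is needed.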

Slide 85

Slide 85 text

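3-4. Results: were incidents cut in "half"!?
MTTR, on the other hand, was halved!
• A dramatic improvement in MTTMitigate in particular
  • We attribute this to IRF 2.0
• MTTDetect, however, showed no major improvement
  • Detection is a challenge for the future. We will work on improving alerts
A solid sense that the improvements are working!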

Slide 86

Slide 86 text

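3-4. Results: were incidents cut in "half"!?
Overall assessment
The severity mix did not change much, so at the impact level there was a slight decrease!
Incidents were not halved... but the challenges are clear and the foundation for improvement is in place! Let's keep going.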

Slide 87

Slide 87 text

04 Return: Remaining challenges and what comes next

Slide 88

Slide 88 text

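4-1. Remaining challenges: risk management, alert improvement
Again: we want to reduce incidents. But...
We had thought of nothing but reducing incidents, yet... incidents can never be brought to zero.
"Oh no..."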

Slide 89

Slide 89 text

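4-1. Remaining challenges: risk management, alert improvement
Do we even want to (or can we) bring incidents to zero? Realistically, neither...
• How would we minimize incidents?
  • Stop releasing?
    • → a slow death 😇
  • Pour in unlimited cost (resources, time)?
    • The resources invested and the incident rate are (probably) correlated
    • So: keep testing until it can be considered 100% safe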

Slide 90

Slide 90 text

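4-1. Remaining challenges: risk management, alert improvement
We want to balance deadlines against quality, and cost against incidents
• But the right balance differs for each system and project
  • The speed and release frequency required
  • The cost that can be invested
  • The risk that can be tolerated (≒ number of incidents, change failure rate)
• Examples:
  • Ads is toB and directly tied to money! Solid and strict
  • News is toC; speed of feature delivery comes first!
= We want to quantify risk tolerance and keep incidents under control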

Slide 91

Slide 91 text

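4-1. Remaining challenges: risk management, alert improvement
Making risk tolerance explicit: SLO and Error Budget
• SLO = Service Level Objective ~ how much error is allowed?
  • e.g. 99.9% available -> 0.1% is tolerated
  • Put an Objective on an SLI (Indicator) that reflects real harm to UX
• Error Budget = how much tolerable error is still left
  • Error Budget remaining = we can step on the accelerator
    • Even somewhat reckless releases are acceptable
  • Error Budget exhausted = UX has been damaged to an unacceptable level
    • No more risk may be taken. Slow down
Ref: Implementing SLOs (Google SRE)
An Error Budget lets us express risk tolerance. In theory it looks good.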
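The arithmetic behind an error budget is short enough to show directly. This is a minimal sketch for a request-based availability SLO; the function name and the request counts are hypothetical.

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still left (negative once it is overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # With a 99.9% SLO, 0.1% of requests are allowed to fail.
    allowed_failures = (1 - slo) * total_requests
    return (allowed_failures - failed_requests) / allowed_failures


# Hypothetical month: 1,000,000 requests, 250 of them failed, SLO of 99.9%.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
# About 1,000 failures were allowed and 250 were used, so roughly 75% of the budget is left.
```

A remaining fraction well above zero corresponds to "we can step on the accelerator"; a value at or below zero corresponds to "slow down".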

Slide 92

Slide 92 text

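4-1. Remaining challenges: risk management, alert improvement
Improving alerts: alert = incident
OK! Let's get SLOs in order!!
• Alert on the speed at which the Error Budget is consumed, the Burn Rate
  • Alert on rapid Error Budget consumption
  • If left alone, the budget runs out, which means the SLO is violated
    • → There is real harm to UX!
    • → It must not be left alone! = an incident
Ref: Alerting on SLOs
Looks good.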
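Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period. The sketch below is a simplified single-window check; the function names are hypothetical, and the default threshold of 14.4 is the example value the Google SRE workbook uses for a one-hour window on a 30-day SLO, treated here as an assumption rather than a recommendation.

```python
def burn_rate(slo: float, total_requests: int, failed_requests: int) -> float:
    """Observed error rate in a window divided by the error rate the SLO tolerates."""
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / (1 - slo)


def should_page(slo: float, total_requests: int, failed_requests: int,
                threshold: float = 14.4) -> bool:
    """Page when the budget is being consumed faster than the threshold allows."""
    return burn_rate(slo, total_requests, failed_requests) >= threshold
```

For example, with a 99.9% SLO, 200 failures in 10,000 requests is a burn rate of about 20 and would page, while 5 failures is a burn rate of about 0.5 and would not. Production setups usually combine several windows, which this sketch omits.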

Slide 93

Slide 93 text

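4-2. Remaining challenges: approaching organization and culture
We tried introducing SLOs in some places, but...
• Defining and implementing effective SLOs is not easy
  • Even when asked, the business side and PdMs do not have the answers
• The time to implement them cannot be found in the first place
  • When there isn't even time for testing...
• Even if they do get implemented, they are meaningless unless they are honored
There is no silver bullet...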

Slide 94

Slide 94 text

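4-2. Remaining challenges: approaching organization and culture
What does it take to make SLOs work?
• Understanding and cooperation across the whole organization are indispensable
  • Not just engineers: the business side and PdMs need to be involved too
• We need to work on the culture
  • Ultimately it comes down to what we value
  • Can we believe in the value of balancing cost and risk through SLOs, and act on it?
We want to install SLOs, and by extension SRE, in our engineering culture.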

Slide 95

Slide 95 text

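4-2. Remaining challenges: approaching organization and culture
What does it take to make SLOs work?
• How to approach it
  • BottomUp:
    • Evangelizing to and training stakeholders such as Eng, Biz, and PdM
  • TopDown:
    • Support and direction from senior leadership
SRE and DevOps are cultures, and they are not built in a day. It takes steady, continuous, hands-on effort.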

Slide 96

Slide 96 text

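4-3. What comes next...
ACT's fixed six-month term was coming to an end
• Challenge: install SRE in SmartNews's engineering culture
  • Implementing and honoring SLOs
  • And more...
    • Improving observability
    • Collecting, monitoring, and honoring DORA Metrics, and so on
→ Continuous work is needed
How do we keep up the momentum and keep tackling the remaining challenges after ACT disbands?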

Slide 97

Slide 97 text

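4-3. What comes next...
How should ACT disband? Proposal: "Distributed SRE Team"
After the ex-ACTors return to their teams, they continue doing SRE work with X% of their time.
It seemed like a good idea, but... "I see...?"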

Slide 98

Slide 98 text

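4-3. What comes next...
How should ACT disband? The team's answer
• Rejected!
  • Not everyone wants SRE to be their full-time job
  • "Allocate X% of your time" has never once worked
• Conclusion
  • The ex-ACTors will keep promoting SRE and lending a hand, but a dedicated SRE team will be set up, even if that takes time
The team discussed it and decided together. Much was left undone, but we have no regrets!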

Slide 99

Slide 99 text

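4-3. What comes next...
Our Awesome Change!
Our six-month (tough!!) term as ACT is over. Honestly, I don't know whether we created an Awesome Change, but I do feel that we took a first step on the long journey called SRE!
And thank you to the teammates who fought through these six months!!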

Slide 100

Slide 100 text

4-3. What comes next...
Your Awesome Change! Let's do this.
The conference ends, you go back to your company, and you are told, "OK, starting today, please reduce incidents." Where would you start?

Slide 101

Slide 101 text

Our fight has only just begun!! Look forward to what the ex-ACTors do next!!

Slide 102

Slide 102 text

Thank you for Your Kind Attention!

Slide 103

Slide 103 text

References
• Becoming SRE (Japanese edition: SREをはじめよう)
• Site Reliability Engineering: How Google Runs Production Systems (Japanese edition: SRE サイトリライアビリティエンジニアリング)
• SRE Google Workbook
• Effective DevOps (Japanese edition)
• Fearless Change (Japanese edition)