Slide 1

Slide 1 text

What to Write in a Runbook, and How to Route Alerts? 2023/09/29 SRE NEXT 2023 Sohei Iwahori (GREE, Inc.)

Slide 2

Slide 2 text

Agenda » Introduction » Building out runbooks » What should be written? » Runbook implementation » Establishing guidelines for adding alerts » Recap

Slide 3

Slide 3 text

Introduction

Slide 4

Slide 4 text

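Evolution of the infrastructure environment at GREE » From on-premises to cloud (from around 2014) » From VMs to containers (from around 2019) » Today all of these environments coexist » On-premises / VM-based (EC2 on AWS) / container-based (EKS/GKE)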

Slide 5

Slide 5 text

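Characteristics of game workloads » The server side mainly provides APIs » There is little API integration between services (only exchanges with shared functions such as billing) » Environments are split per title and, in some cases, further per overseas service within a title » As a result, there are dozens of environments running similar but slightly different workloads » We want alert rules to be shared, so they are configured across environments, with some parts configured individually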
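
The last point on this slide (shared alert rules with some per-environment settings) can be pictured as a base rule set plus small overrides. The following is a minimal sketch only; the rule names, thresholds, and environment names are hypothetical and do not come from the talk.

```python
from copy import deepcopy

# Hypothetical shared rules applied to every environment.
COMMON_RULES = {
    "HighCPU": {"threshold": 0.9, "for_minutes": 10, "enabled": True},
    "DiskFull": {"threshold": 0.85, "for_minutes": 5, "enabled": True},
}

# Hypothetical per-environment overrides; only the differing fields are listed.
OVERRIDES = {
    "title-a-jp": {"HighCPU": {"threshold": 0.95}},
    "title-a-global": {"DiskFull": {"enabled": False}},
}


def rules_for(env: str) -> dict:
    """Return the common rules with the environment's overrides merged in."""
    rules = deepcopy(COMMON_RULES)  # never mutate the shared definition
    for name, fields in OVERRIDES.get(env, {}).items():
        rules.setdefault(name, {}).update(fields)
    return rules
```

Keeping overrides as small deltas makes it easy to see where an environment departs from the shared definition.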

Slide 6

Slide 6 text

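Organizational structure of Dev and Ops » The basic setup is a few infrastructure members attached to each business unit's development team » Infrastructure exists as a shared support department; within it, specialized teams called units provide shared platforms (such as monitoring) and support for middleware (RDBMS, KVS, etc.) » An operations team within Infrastructure handles first response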

Slide 7

Slide 7 text

Organizational structure of Dev and Ops

Slide 8

Slide 8 text

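Typical incident response flow » PagerDuty/Slack notifies that an incident has occurred » The operations team performs first response » They carry out routine actions based on the incident response procedure document » If the issue is not resolved, they escalate to the infrastructure engineer attached to the product or to a unit (specialized team) » In EKS/GKE environments, notifications increasingly go directly to the responsible team without passing through the operations team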
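
The routing described on this slide can be expressed as a small decision function. This is an illustrative sketch, not the actual routing configuration: the platform labels, the `ops-team` name, and the `Alert` fields are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    platform: str    # assumed labels: "onprem", "ec2", "eks", "gke"
    owner_team: str  # team responsible for the affected workload


# Container platforms notify the responsible team directly.
CONTAINER_PLATFORMS = {"eks", "gke"}


def first_responder(alert: Alert) -> str:
    """Pick who is notified first for an alert."""
    if alert.platform in CONTAINER_PLATFORMS:
        return alert.owner_team
    # On-premises and VM-based environments go through the operations team,
    # which escalates if the routine procedure does not resolve the issue.
    return "ops-team"
```

For example, `first_responder(Alert("PodCrashLoop", "gke", "title-b-infra"))` returns the owning team, while the same alert on `"ec2"` goes to the operations team first.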

Slide 9

Slide 9 text

Typical incident response flow (escalation)

Slide 10

Slide 10 text

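Problems with the "incident response procedure document" » When an alert is escalated, the responding team has no information to refer to » Escalation targets are assumed to have some knowledge, but people naturally come and go » Searchability problems » Search is title-based over content written on a single Confluence page » Difficulty of reviewing the alerts themselves » Because the document focuses only on the response method (How), it is unclear why the alert exists (Why)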

Slide 11

Slide 11 text

Searching the incident response procedures (illustration)

Slide 12

Slide 12 text

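Problems that tend to arise from "I want to add an alert" » Vague requirements » "A problem occurred, so I want to set up an alert" » The notification method, the expected action, and the timing remain vague » Asymmetry » When rules are added across environments, the person who sets the alert and the person who receives it are not the same » One day you suddenly receive an unfamiliar alert » Loss of context » As time passes, the context in which the alert was added is lost » "There must have been some important reason..."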

Slide 13

Slide 13 text

Actions to solve these problems » Build out new runbooks whose main target is the person an alert is escalated to » Create a flow and guidelines for adding alerts » We will look at each in turn

Slide 14

Slide 14 text

Building out runbooks

Slide 15

Slide 15 text

What should be written?

Slide 16

Slide 16 text

Various opinions on runbooks (1/3) "practiced on-call engineer armed with a playbook works much better." — Site Reliability Engineering / Chapter 1 - Introduction. Takeaway: playbooks (runbooks) are recommended

Slide 17

Slide 17 text

Various opinions on runbooks (2/3) "Just like new code, new alerts should be thoroughly and thoughtfully reviewed. Each alert should have a corresponding playbook entry." — SRE Workbook / Chapter 8 - On-Call. Takeaway: every alert should have a corresponding playbook (runbook) entry

Slide 18

Slide 18 text

Various opinions on runbooks (3/3) "The assertion that time spent creating runbooks is largely wasted may seem a bit harsh at first." — Observability Engineering / 8. Analyzing Events to Achieve Observability. Takeaway: time spent on runbooks may be wasted

Slide 19

Slide 19 text

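Considering the conditions under which runbooks work effectively » An environment where knowledge can accumulate » There are multiple similar workloads » The underlying technology changes slowly » Runbooks are a poor fit for environments where alerts are always novel (non-reproducible) » Those probably need an observability-engineering-style approach » White-box-based alerts are used, alone or in combination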

Slide 20

Slide 20 text

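Opinions on what should be written in a runbook » Aim for short-term resolution, or give hints that also cover long-term handling? » Step-by-step procedures, or sharing of overall knowledge? » For black-box-based alerts, isn't it hard to present a procedure in the first place?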

Slide 21

Slide 21 text

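Runbook orientation at GREE » Long-lasting information rather than short-term fixes with a short shelf life » Emphasis on sharing background » Information that leads to an engineer's judgment and action » Responding should require an engineer's intelligence » If the solution is a routine command, it can simply be automated » Present context and hints that lead to a resolution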

Slide 22

Slide 22 text

Runbook implementation

Slide 23

Slide 23 text

Runbook template

Slide 24

Slide 24 text

Runbook sample

Slide 25

Slide 25 text

Runbook sample

Slide 26

Slide 26 text

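Implementation » Managed in a git repository » Integrated into the alert notification system » Supplies context at the moment it is needed » Considers a "ubiquitous documentation" style approach [1] » Created from a simple template » [1] Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline, Tom Limoncelli / SREcon20 Americas.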

Slide 27

Slide 27 text

Runbook publishing flow

Slide 28

Slide 28 text

Integration into the alert notification system

Slide 29

Slide 29 text

If no runbook exists, prompt the recipient to create one
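
The integration described on the preceding slides (attach the runbook to the notification, or prompt for one when it is missing) might look roughly like the sketch below. It assumes one Markdown file per alert name in a local checkout of the runbook repository; the URLs, the file-naming scheme, and the function name are hypothetical and are not taken from the talk.

```python
from pathlib import Path

# Hypothetical locations; substitute the real repository URLs.
RUNBOOK_BASE_URL = "https://git.example.com/infra/runbooks/blob/main"
NEW_RUNBOOK_URL = "https://git.example.com/infra/runbooks/new/main"


def runbook_note(alert_name: str, repo_dir: Path) -> str:
    """Build the line appended to an alert notification.

    Assumes each runbook is stored as `<alert_name>.md` at the top of the
    repository checkout in `repo_dir`.
    """
    path = repo_dir / f"{alert_name}.md"
    if path.is_file():
        return f"Runbook: {RUNBOOK_BASE_URL}/{path.name}"
    # No runbook yet: nudge the recipient to write one from the template.
    return (
        f"No runbook for {alert_name} yet. "
        f"Please create one from the template: {NEW_RUNBOOK_URL}"
    )
```

Putting the prompt in the notification itself means the gap is noticed at the moment someone actually needs the context.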

Slide 30

Slide 30 text

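Results and remaining issues » We reduced the number of occasions on which someone receives a "completely unknown alert" » Psychological safety » Support for what onboarding does not cover » Things we still see as issues » Measuring the effectiveness of the documents themselves » Triggering periodic maintenance » For example, a mechanism that notifies the author after a fixed period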
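
The periodic-maintenance idea on this slide (notify the author after a fixed period) could be sketched as follows. This is an assumption-laden illustration: the 180-day period, the record fields, and the function name are invented for the example, and in practice the dates would come from the repository history.

```python
from datetime import date, timedelta


def stale_runbooks(runbooks, today, max_age_days=180):
    """Return (path, author) pairs for runbooks not updated within the period.

    `runbooks` is an iterable of dicts with hypothetical keys
    "path", "author", and "last_updated" (a datetime.date).
    """
    cutoff = today - timedelta(days=max_age_days)
    return [
        (rb["path"], rb["author"])
        for rb in runbooks
        if rb["last_updated"] < cutoff
    ]
```

A scheduled job could pass the result to whatever notification channel the team already uses.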

Slide 31

Slide 31 text

Establishing guidelines for adding alerts

Slide 32

Slide 32 text

Guidelines for channel selection and cross-environment alerts

Slide 33

Slide 33 text

Channel selection guideline

Slide 34

Slide 34 text

Guideline for adding cross-environment alerts

Slide 35

Slide 35 text

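Building the guidelines into the alert-addition flow » Make them part of the addition flow » Anyone who wants to add an alert is naturally led to think about the following » Intended channel = when should this be responded to? » Scope = over what range should it apply?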
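
One way to make an addition flow force these questions is to require the answers as fields of the request and check them at review time. The sketch below is hypothetical throughout: the channel names, scope values, and field names are assumptions, since the actual guideline contents are shown only as images in the slides.

```python
from dataclasses import dataclass

# Hypothetical channels, each implying when a response is expected.
CHANNELS = {
    "page": "respond immediately",
    "chat": "respond during working hours",
    "ticket": "handle as planned work",
}
# Hypothetical scopes for where the rule applies.
SCOPES = {"all-environments", "single-environment"}


@dataclass
class AlertProposal:
    name: str
    channel: str          # intended channel = when to respond
    scope: str            # range over which the rule applies
    expected_action: str  # what the recipient is supposed to do


def review_problems(p: AlertProposal) -> list[str]:
    """List what is still undecided in a proposal; empty means ready."""
    problems = []
    if p.channel not in CHANNELS:
        problems.append(f"unknown channel: {p.channel!r}")
    if p.scope not in SCOPES:
        problems.append(f"unknown scope: {p.scope!r}")
    if not p.expected_action.strip():
        problems.append("expected action is not described")
    return problems
```

A proposal that cannot name its channel, scope, and expected action is a signal that the alert itself may not be needed yet.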

Slide 36

Slide 36 text

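Effects of establishing the guidelines » The action can be settled at the time the alert is configured » At what timing » Through which notification channel » In what way it is carried out » Thinking through the action also lets us reconsider whether the alert is needed at all » Reduced alert fatigue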

Slide 37

Slide 37 text

Recap

Slide 38

Slide 38 text

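Recap » Conditions for runbooks to work well » Knowledge can accumulate / white-box-based notifications are used » Conveying background and context over the long term is important » A guide used when configuring alerts is effective for thinking concretely about the response action » Whether and in what form to implement this is best chosen to match the organization's own problems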

Slide 39

Slide 39 text

Thank you for listening

Slide 40

Slide 40 text

who? » Sohei Iwahori (@egmc) » GREE, Inc. » Infrastructure / Monitoring Unit Leader » Game infrastructure and monitoring

Slide 41

Slide 41 text

Appendix » Low Context DevOps: Improving SRE Team Culture through Defaults, Documentation, and Discipline » https://www.usenix.org/conference/srecon20americas/presentation/limoncelli » GitLab On-call Run Books » https://gitlab.com/gitlab-com/runbooks » Dashboards and Runbooks: Scrapbooking for Engineers » https://www.usenix.org/conference/srecon22apac/presentation/douch

Slide 42

Slide 42 text

Appendix » On-call alert anti-patterns » https://dasalog.hatenablog.jp/entry/2022/05/23/141749