Slide 1

Slide 1 text

Practices for Making Alerts Actionable 2020/01/25 SRE NEXT Sohei Iwahori (GREE, Inc.)

Slide 2

Slide 2 text

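Agenda » Current state of cloud monitoring and the background to the increase in alerts » background story » The increase in unactionable ("wait-and-see") alerts and its impact » How do we make alerts actionable? » Periodic monitoring » Addressing "why unactionable?" » Recap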

Slide 3

Slide 3 text

Current state of cloud monitoring and the background to the increase in alerts

Slide 4

Slide 4 text

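background story (1/2) » Starting around 2015, we carried out a large-scale migration from on-premises to the cloud (AWS) » Existing titles that could be migrated were moved off on-premises, and new titles are basically built in the cloud » A new monitoring system was built alongside the migration » This talk covers the periodic monitoring and improvement of alert handling that we carried out around 2018, by which time most titles were running in the cloud, together with our earlier efforts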

Slide 5

Slide 5 text

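background story (2/2) » Scale of the current environment » Organized basically per service (mostly per game title; when regions differ, a service is generally provided from separate accounts) » 50+ production AWS accounts » 2000+ virtual hosts in total » One monitoring set exists per VPC (roughly, per account) » A single environment has at most a few hundred hosts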

Slide 6

Slide 6 text

The monitoring system for the cloud environment

Slide 7

Slide 7 text

Current monitoring architecture for the cloud environment

Slide 8

Slide 8 text

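Monitoring the cloud environment (1/2) » Prometheus + Grafana » gangi (in-house exporter) » Lets us reuse the ganglia plugins that collect our custom metrics on-premises » Fetches data from gmond, the ganglia agent, and exposes it so that Prometheus can pull it » fluentd (log monitoring + alert delivery)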

Slide 9

Slide 9 text

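Monitoring the cloud environment (2/2) » amaneko (in-house monitoring agent) » Runs various check plugins, executed from cron » yusura (in-house alert control system) » A worker that routes, suppresses, and summarizes the aggregated alerts » Pluggable alert notification mechanism (Slack, PagerDuty, JIRA, Amazon SNS, etc.)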

Slide 10

Slide 10 text

What happened as a result of the cloud migration?

Slide 11

Slide 11 text

⚡ An increase in alerts

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

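Why so many alerts? » Historically, alert rules had been configured in fine detail since before the migration » Alert rules basically carried over from the on-premises environment » Increase caused by transient network unreachability and similar events » Impact of auto scaling » Host counts are adjusted more aggressively » Increase caused by problems during scale-in and scale-out » Broadly, the rules inherited from on-premises detected many more events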

Slide 14

Slide 14 text

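Why unactionable? » For historical reasons, alerts were configured for the sake of "detection" » Alerts where you detect first and decide on the action afterwards » "Temporary threshold exceeded" events occur at high frequency » As a result, many alerts ended up handled by "wait and see" » Wait and see: treated as resolved because the event subsided without any recovery action being needed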

Slide 15

Slide 15 text

Why we should resolve? Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued. -- Site Reliability Engineering / Chapter 6 - Monitoring Distributed Systems » An alert should be a means of enabling concrete action that raises the service level

Slide 16

Slide 16 text

How do we make alerts actionable?

Slide 17

Slide 17 text

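Why can't we act on them? » There are too many on-call alerts in the first place » We only wait and see, or respond in a routine way » Action is needed, but it is not urgent » The alert was configured simply because someone wanted "detection" » The alert was received, but it is hard to judge what to do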

Slide 18

Slide 18 text

Periodic monitoring » Aggregate on-call alerts every month and post the result to Slack » PagerDuty incidents are aggregated in SumoLogic and notified » Every month, review how the top 10 should be handled
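The monthly ranking described above can be sketched roughly as follows. This is only an illustration of the counting step: the slide says the aggregation is done in SumoLogic from PagerDuty incidents, so the plain list of incident titles, the function names, and the report layout here are all assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_alerts(incident_titles: Iterable[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Count incidents per alert title and return the most frequent ones."""
    counts = Counter(title.strip() for title in incident_titles)
    return counts.most_common(limit)


def format_report(ranking: List[Tuple[str, int]]) -> str:
    """Render the ranking as plain text suitable for a chat message."""
    lines = [
        f"{rank}. {title}: {count}"
        for rank, (title, count) in enumerate(ranking, start=1)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical incident titles for one month.
    sample = ["disk usage high", "process down", "disk usage high"]
    print(format_report(top_alerts(sample)))
```

Keeping the output to a fixed top N gives the monthly review a bounded list to decide on.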

Slide 19

Slide 19 text

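Deciding how to respond » Does it need to be an on-call alert in the first place? » Set detection conditions and routing appropriately » Does a human need to handle it? » Consider automatic recovery » Are the criteria for responding clear? » Provide metrics that lead to action » Build mechanisms that make action easy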

Slide 20

Slide 20 text

Set detection conditions and routing appropriately

Slide 21

Slide 21 text

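Set detection conditions appropriately » Relax alert conditions that are too strict » Suppress temporary spikes and false positives » In Prometheus rules, evaluate over a duration using for x min » For check-style monitoring, require consecutive detections using occurrences, and so on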
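The consecutive-detection idea behind occurrences can be sketched as below. This is a minimal illustration and not the implementation used by amaneko or yusura; the class name `OccurrenceGate` and its interface are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict


class OccurrenceGate:
    """Signal an alert only when a check has failed `occurrences` times in a row."""

    def __init__(self, occurrences: int) -> None:
        if occurrences < 1:
            raise ValueError("occurrences must be >= 1")
        self.occurrences = occurrences
        self._streak: DefaultDict[str, int] = defaultdict(int)

    def observe(self, check_name: str, failed: bool) -> bool:
        """Record one check result; return True on the run that reaches the threshold."""
        if not failed:
            # Any success resets the streak, so isolated spikes never alert.
            self._streak[check_name] = 0
            return False
        self._streak[check_name] += 1
        # Fire once when the streak reaches the threshold, not on every later failure.
        return self._streak[check_name] == self.occurrences
```

With `OccurrenceGate(3)`, two failures followed by a success produce no alert, which is the spike suppression the slide describes. The `for` clause in a Prometheus rule plays a similar role, based on elapsed time rather than a count of check runs.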

Slide 22

Slide 22 text

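Route alerts appropriately » No immediate action is needed, but we want to know it is happening » Notify Slack only (watch the volume) » No immediate action is needed, but we do want to follow up in some way » Automatically file a JIRA ticket from the alert » Concrete action is needed right away » Notify PagerDuty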
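The three routing cases above can be written as a small lookup table. This is a sketch under assumptions: the `Urgency` names and channel strings are hypothetical labels, not yusura's actual configuration.

```python
from enum import Enum
from typing import Dict, List


class Urgency(Enum):
    AWARENESS = "awareness"  # no immediate action; we only want to know
    FOLLOW_UP = "follow_up"  # no immediate action; some handling is wanted later
    IMMEDIATE = "immediate"  # concrete action is needed right away


# Destination per urgency level, following the three cases on the slide.
ROUTES: Dict[Urgency, List[str]] = {
    Urgency.AWARENESS: ["slack"],
    Urgency.FOLLOW_UP: ["jira"],
    Urgency.IMMEDIATE: ["pagerduty"],
}


def route(urgency: Urgency) -> List[str]:
    """Return the notification channels for an alert of the given urgency."""
    return list(ROUTES[urgency])
```

Keeping the mapping as data rather than branching logic makes it easy to move an alert class to a different destination after a monthly review.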

Slide 23

Slide 23 text

Slack notification + JIRA ticket filing

Slide 24

Slide 24 text

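Aside: scheduled-event-notifier » Reuses the alert-to-ticket mechanism to automatically turn maintenance notifications for AWS resources that need handling into tickets, which the operations team then manages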

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Considering automatic recovery

Slide 27

Slide 27 text

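Automatic recovery at GREE » Alert Operator » An AWS Lambda function that runs routine recovery commands through AWS SSM according to the alert type » Attempts simple actions such as restarting a process » Receives the same alert information that is sent to humans, runs the corresponding command, and posts the result to chat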

Slide 28

Slide 28 text

Alert Operator execution example. For example, it runs commands like these:
    service xxx status
    sleep 10
    service xxx restart
    service xxx status
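The flow of mapping an alert type to these commands and sending them through SSM might look roughly like the following. This is a hedged sketch, not the actual Alert Operator: the `RECOVERY_SERVICES` table, the alert type string, and the function names are hypothetical. The client is expected to behave like a boto3 SSM client, whose `send_command` accepts `InstanceIds`, `DocumentName`, and `Parameters`; it is passed in as an argument so the logic can be exercised without AWS.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping from alert type to the service that should be restarted.
RECOVERY_SERVICES: Dict[str, str] = {"process_down_xxx": "xxx"}


def build_commands(service: str, wait_seconds: int = 10) -> List[str]:
    """Build the status / wait / restart / status sequence shown on the slide."""
    return [
        f"service {service} status",
        f"sleep {wait_seconds}",
        f"service {service} restart",
        f"service {service} status",
    ]


def run_recovery(ssm_client: Any, instance_id: str, alert_type: str) -> Optional[str]:
    """Send the recovery commands via SSM; return the command ID, or None if unmapped."""
    service = RECOVERY_SERVICES.get(alert_type)
    if service is None:
        # Unknown alert types are left for humans to handle.
        return None
    response = ssm_client.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": build_commands(service)},
    )
    return response["Command"]["CommandId"]
```

Checking the status before and after the restart leaves a record of whether the action changed anything, which can be included in the chat notification.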

Slide 29

Slide 29 text

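Practices for introducing automatic recovery » Start where it is easy to introduce » No side effects; the state must not become worse than before the command ran » If the automatic response fails, detect that with a separate rule » For example, separate the alert rules that trigger automatic response from the rules that humans watch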

Slide 30

Slide 30 text

Clarifying the criteria for responding

Slide 31

Slide 31 text

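SysLoad as GREE's standard metric » Internally, we have historically defined and used SysLoad as a common metric for server load » https://github.com/gree/sysload » A metric that visualizes Saturation from The Four Golden Signals » Anyone looking at it can tell that "if SysLoad is 100, some resource is saturated"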

Slide 32

Slide 32 text

Definition of SysLoad: sysload evaluates the following three elements. The maximum value of each is 100. » ALL CPU utilization » disk I/O utilization » CPU utilization in which interrupt from NIC is occurring » Shows that some server resource is saturated, that is, the server is under high load » Can be used as the first point of triage

Slide 33

Slide 33 text

Operating the SysLoad graph » The SysLoad composite graph has guide lines; when the value exceeds the red line (80%), it is clear that action should be taken
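As a rough illustration of how a single saturation figure and the 80% line could be combined, consider the sketch below. It assumes the composite value is the highest of the three elements, which matches the statement that a value of 100 means some resource is saturated; the function and parameter names are hypothetical and are not taken from the sysload repository.

```python
ACTION_THRESHOLD = 80.0  # the red guide line on the composite graph


def clamp_percent(value: float) -> float:
    """Clamp a utilization value into the 0..100 range."""
    return max(0.0, min(100.0, float(value)))


def sysload(cpu_util: float, disk_io_util: float, nic_irq_cpu_util: float) -> float:
    """Composite load: the highest of the three elements (assumed definition)."""
    return max(
        clamp_percent(cpu_util),
        clamp_percent(disk_io_util),
        clamp_percent(nic_irq_cpu_util),
    )


def needs_action(value: float, threshold: float = ACTION_THRESHOLD) -> bool:
    """True when the composite load is above the action line."""
    return value > threshold
```

For example, a host at 35% CPU but 92% disk I/O utilization yields a composite of 92.0 and is flagged for action, even though CPU alone looks healthy.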

Slide 34

Slide 34 text

Providing dashboards for each role

Slide 35

Slide 35 text

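Other mechanisms that support action » chatbot » As a way to replace manual procedures » For example, operations on AWS auto scaling » [wip] Provide a runbook together with the notification » Similar to what Cookpad has been working on » Being prepared on the assumption that it will be used alongside existing procedure documents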

Slide 36

Slide 36 text

Recap

Slide 37

Slide 37 text

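Recap » Measure your alerts » Consider appropriate routing according to the action that should be taken » Consider automating routine actions » Build in a mechanism that makes it easy to add alert notification routes » Prepare metrics, mechanisms, and procedures that make it easy to take action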

Slide 38

Slide 38 text

thank you for listening

Slide 39

Slide 39 text

who? » Sohei Iwahori (@egmc) » GREE, Inc. » Infrastructure / Monitoring Unit Leader » Game infrastructure and server monitoring, mainly on AWS

Slide 40

Slide 40 text

Appendix » Automatic Recovery in Practice (hiroaki.kobayashi) » https://www.slideshare.net/greetech/ss-140561840 » A metric I built about six years ago seems reasonably useful, so I am releasing it as OSS (takanori.sejima) » https://labs.gree.jp/blog/2018/12/17645/ » A talk about sysload, monitoring, and so on (tentative) (takanori.sejima) » https://www.slideshare.net/takanorisejima/sysload-133365308

Slide 41

Slide 41 text

Appendix » A highly available alert notification system built with SQS, ElastiCache, and Lambda (sohei.iwahori) » https://labs.gree.jp/blog/2017/05/16483/