practices-for-making-alerts-actionable.pdf

Sohei Iwahori
January 25, 2020

# Practices for Making Alerts Actionable
### 2020/01/25 SRE NEXT
#### Sohei Iwahori (GREE, Inc.)

## Agenda

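- Current state of cloud monitoring and the background to the growth in alerts
- background story
- The growth of unactionable ("wait-and-see") alerts and its impact
- How do we make alerts actionable?
- Periodic monitoring
- Addressing "why unactionable?"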
- Recap

Transcript

  1. Practices for Making Alerts Actionable, 2020/01/25, SRE NEXT, Sohei Iwahori (GREE, Inc.)
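  2. Agenda
      - Current state of cloud monitoring and the background to the growth in alerts
      - background story
      - The growth of unactionable ("wait-and-see") alerts and its impact
      - How do we make alerts actionable?
      - Periodic monitoring
      - Addressing "why unactionable?"
      - Recap
  3. Current state of cloud monitoring and the background to the growth in alerts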
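  4. background story (1/2)
      - Starting around 2015, we carried out a large-scale migration from on-premises to the cloud (AWS)
      - Existing titles that could be migrated were moved off on-premises, and new titles are basically built in the cloud
      - A new monitoring system was built alongside the migration
      - This talk mainly covers the periodic monitoring and improvement of alert handling that we carried out around 2018, by which time most titles were running in the cloud, together with earlier efforts
  5. background story (2/2)
      - Rough scale of the current environment
          - Basically split per service (mostly per game title; where a title spans regions, each region is generally served from a separate account)
          - 50+ production AWS accounts
          - 2000+ virtual hosts in total
          - One monitoring set per VPC (roughly per account)
          - Each environment has a few hundred hosts at most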
  6. Monitoring system for the cloud environment
  7. Current monitoring architecture for the cloud environment
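  8. Monitoring the cloud environment (1/2)
      - Prometheus + Grafana
      - gangi (in-house exporter)
          - Exists to reuse the ganglia plugins that collect custom metrics on-premises
          - Fetches data from gmond, the ganglia agent, and makes it available for Prometheus to pull
      - fluentd (log monitoring + alert delivery)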
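
The gmond-to-Prometheus bridging described for gangi can be sketched roughly as follows. This is not gangi's actual code: the sample XML, the metric names, the `ganglia_` prefix, and both function names are assumptions made for illustration, and the XML layout (CLUSTER, HOST, METRIC elements with NAME and VAL attributes) follows the ganglia XML format as commonly documented.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample of the XML a gmond agent serves. Host and metric
# names are illustrative and do not come from the slides.
SAMPLE_GMOND_XML = """
<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">
  <CLUSTER NAME="game-a" LOCALTIME="1579910400">
    <HOST NAME="web01" IP="10.0.0.11">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="custom.queue-depth" VAL="17" TYPE="uint32" UNITS="jobs"/>
      <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def sanitize_metric_name(name):
    """Replace characters that Prometheus metric names do not allow."""
    return re.sub(r"[^a-zA-Z0-9_:]", "_", name)


def gmond_xml_to_prometheus(xml_text, prefix="ganglia_"):
    """Convert gmond XML into Prometheus text exposition lines.

    Non-numeric metrics (for example string values) are skipped.
    """
    root = ET.fromstring(xml_text.strip())
    lines = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                try:
                    value = float(metric.get("VAL"))
                except (TypeError, ValueError):
                    continue  # not a number, so not exportable as a sample
                name = prefix + sanitize_metric_name(metric.get("NAME"))
                labels = 'cluster="{}",host="{}"'.format(
                    cluster.get("NAME"), host.get("NAME")
                )
                lines.append("{}{{{}}} {}".format(name, labels, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(gmond_xml_to_prometheus(SAMPLE_GMOND_XML), end="")
```

A real exporter would read the XML from the gmond socket and serve the resulting text over HTTP for Prometheus to scrape; both parts are left out here to keep the conversion step visible.
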
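  9. Monitoring the cloud environment (2/2)
      - amaneko (in-house monitoring agent)
          - Runs various check plugins, executed from cron
      - yusura (in-house alert control system)
          - Workers that route, suppress, and summarize the aggregated alerts
          - Pluggable alert notification mechanism (Slack, PagerDuty, JIRA, Amazon SNS, etc.)
  10. What happened as a result of the cloud migration?
  11. ⚡ More alerts
  12. (no text on this slide)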
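  13. Why so many alerts?
      - Historically, alert rules had been configured in fine detail since before the migration
      - The alert rules were basically carried over from the on-premises environment
      - Increase caused by temporary loss of network reachability and similar events
      - Impact of auto scaling
          - Host counts are adjusted aggressively
          - Increase caused by problems during scale-in and scale-out
      - In short, rules inherited from on-premises now detect many more events
  14. Why unactionable?
      - For historical reasons, alerts were configured for the sake of "detection"
          - Alerts where something is detected first and the action is decided afterwards
      - "Temporary threshold breaches" occur at high frequency
      - As a result, many alerts end up being handled by "wait and see"
          - Wait and see: the event subsided without any recovery action, so it is treated as resolved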
  15. Why we should resolve?
      - "Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued." -- Site Reliability Engineering / Chapter 6 - Monitoring Distributed Systems
      - Alerts should be a means of enabling concrete actions that raise the service level
  16. How do we make alerts actionable?
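  17. Why can't we take action?
      - There are simply too many on-call alerts
      - The response is only ever wait-and-see or a routine procedure
      - Action is needed, but it is not urgent
      - The alert was configured only because someone wanted "detection"
      - The alert was received, but it is hard to judge what to do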
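  18. Periodic monitoring
      - Aggregate on-call alerts every month and post the result to Slack
      - PagerDuty incidents are aggregated in SumoLogic and then notified
      - Every month, review how the top 10 should be handled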
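
The monthly aggregation step can be illustrated with a small sketch. It assumes incidents have already been exported as dictionaries with hypothetical `title` and `created_at` fields; it does not use the PagerDuty or SumoLogic APIs, and the function names are made up for this example.

```python
from collections import Counter
from datetime import datetime


def top_alerts(incidents, year, month, limit=10):
    """Count incidents per title for one month and return the top `limit`."""
    counts = Counter(
        incident["title"]
        for incident in incidents
        if (incident["created_at"].year, incident["created_at"].month)
        == (year, month)
    )
    return counts.most_common(limit)


def format_report(ranking, year, month):
    """Render the ranking as plain text suitable for a chat message."""
    lines = ["On-call alerts for {}-{:02d} (top {})".format(year, month, len(ranking))]
    for rank, (title, count) in enumerate(ranking, start=1):
        lines.append("{}. {}: {}".format(rank, title, count))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical incident records, for illustration only.
    sample = [
        {"title": "disk usage high", "created_at": datetime(2020, 1, 3)},
        {"title": "disk usage high", "created_at": datetime(2020, 1, 9)},
        {"title": "process down", "created_at": datetime(2020, 1, 5)},
    ]
    print(format_report(top_alerts(sample, 2020, 1), 2020, 1))
```

Grouping by title is only one possible key; grouping by alert rule or by service may give a more useful ranking, depending on how incidents are named.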

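  19. Deciding how to respond
      - Does it need to be an on-call alert in the first place?
          - Set trigger conditions and routing appropriately
      - Does a human need to handle it?
          - Consider automatic recovery
      - Are the criteria for responding clear?
          - Maintain metrics that lead to action
          - Build mechanisms that make action easy
  20. Set trigger conditions and routing appropriately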
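  21. Set trigger conditions appropriately
      - Relax alert conditions that are too strict
      - Suppress temporary spikes and false positives
          - Judge over a period with for x min in Prometheus rules
          - Judge by consecutive detections with occurrences for check-style monitoring, and so on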
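
Both suppression ideas on this slide come down to requiring that a condition persists before alerting. The sketch below shows that logic in isolation, under my own naming (`ConsecutiveTrigger`, `held_for`); it is neither Prometheus' implementation of `for` nor the in-house agent's occurrences handling.

```python
class ConsecutiveTrigger:
    """Fire only after `occurrences` consecutive failing checks."""

    def __init__(self, occurrences):
        if occurrences < 1:
            raise ValueError("occurrences must be at least 1")
        self.occurrences = occurrences
        self.streak = 0

    def observe(self, failing):
        """Record one check result and return True if the alert should fire."""
        self.streak = self.streak + 1 if failing else 0
        return self.streak >= self.occurrences


def held_for(samples, threshold, for_seconds):
    """Return True if the value has stayed above `threshold` for `for_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs in ascending
    time order. Any sample at or below the threshold resets the window.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


if __name__ == "__main__":
    trigger = ConsecutiveTrigger(occurrences=3)
    print([trigger.observe(r) for r in [True, True, False, True, True, True]])
    print(held_for([(0, 95), (60, 50), (120, 96), (180, 97)], 90, 300))
```

A single spike resets the streak or the window, so it never pages anyone; the trade-off is that a real outage is reported later by the length of the waiting period.
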
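  22. Route alerts appropriately
      - No immediate action is needed, but we want to know it is happening
          - Notify Slack only (watch the volume)
      - No immediate action is needed, but some follow-up should happen
          - File a JIRA ticket automatically from the alert
      - Concrete action is needed immediately, on the spot
          - Notify PagerDuty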
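
The three routing cases can be expressed as a small table from urgency to notification channel. The sketch below is an illustration only: the `Urgency` names, the `ROUTES` table, and the notifier callables are hypothetical and are not taken from yusura.

```python
from enum import Enum


class Urgency(Enum):
    AWARENESS = "awareness"   # no immediate action, but worth knowing about
    FOLLOW_UP = "follow_up"   # no immediate action, but needs some handling
    IMMEDIATE = "immediate"   # concrete action needed right now


# Assumed mapping that mirrors the three cases on the slide.
ROUTES = {
    Urgency.AWARENESS: ["slack"],
    Urgency.FOLLOW_UP: ["jira"],
    Urgency.IMMEDIATE: ["pagerduty"],
}


def route(alert, notifiers, routes=ROUTES):
    """Send `alert` to every channel configured for its urgency.

    `notifiers` maps a channel name to a callable taking the alert.
    Returns the list of channels that were notified.
    """
    delivered = []
    for channel in routes[alert["urgency"]]:
        notifiers[channel](alert)
        delivered.append(channel)
    return delivered


if __name__ == "__main__":
    # Stand-in notifiers that only print; real ones would call each service.
    printers = {name: (lambda a, n=name: print(n, a["name"])) for name in
                ("slack", "jira", "pagerduty")}
    route({"name": "disk_usage_warn", "urgency": Urgency.FOLLOW_UP}, printers)
```

Keeping the table separate from the delivery code means a rule can be moved from PagerDuty to Slack by changing data rather than logic.
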
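  23. Slack notification + JIRA ticket creation
  24. Aside: scheduled-event-notifier
      - Reuses the alert-to-ticket mechanism to turn maintenance notifications for AWS resources that need attention into tickets automatically, which the operations team then manages
  25. (no text on this slide)
  26. Considering automatic recovery
  27. Automatic recovery at GREE
      - Alert Operator
          - An AWS Lambda function that runs routine recovery commands through AWS SSM according to the alert type
          - Attempts simple actions such as restarting a process
          - Receives the same alert information that is sent to humans, runs the corresponding command, and posts the result to chat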
  28. Alert Operator execution example
      - For example, it runs commands like these, in order: service xxx status; sleep 10; service xxx restart; service xxx status
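
A handler in the spirit of Alert Operator might look like the sketch below. It is a guess at the shape, not GREE's code: the alert fields (`type`, `service`, `instance_id`), the `RECOVERY_COMMANDS` table, and the `notify` callable are hypothetical. The SSM call uses the boto3 `send_command` method with the `AWS-RunShellScript` document; the client is passed in as a parameter (in Lambda it would typically be `boto3.client("ssm")`), so the sketch itself does not import boto3.

```python
import re

# Assumed mapping from alert type to a command template. The sequence
# follows the example on the slide, with the service name filled in.
RECOVERY_COMMANDS = {
    "process_down": [
        "service {service} status",
        "sleep 10",
        "service {service} restart",
        "service {service} status",
    ],
}

SAFE_SERVICE_NAME = re.compile(r"[A-Za-z0-9_.@-]+")


def handle_alert(alert, ssm_client, notify):
    """Submit the recovery commands for `alert` and report to chat.

    Returns the SSM command ID, or None when no recovery is defined.
    """
    template = RECOVERY_COMMANDS.get(alert["type"])
    if template is None:
        notify("no automatic recovery defined for {}".format(alert["type"]))
        return None
    if not SAFE_SERVICE_NAME.fullmatch(alert["service"]):
        # The name is interpolated into a shell command, so reject
        # anything outside a conservative character set.
        raise ValueError("unexpected service name")
    commands = [line.format(service=alert["service"]) for line in template]
    response = ssm_client.send_command(
        InstanceIds=[alert["instance_id"]],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    command_id = response["Command"]["CommandId"]
    notify("submitted recovery for {} on {} (command {})".format(
        alert["service"], alert["instance_id"], command_id))
    return command_id
```

`send_command` is asynchronous, so this sketch only reports that the command was submitted. Posting the actual result to chat, as the slides describe, would need a follow-up step that fetches the command output.
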
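  29. Practices for introducing automatic recovery
      - Start where it is easy to introduce
          - No side effects, and the state does not get worse than before the command ran
      - When the automatic response fails, detect that with a separate rule
          - For example, separate the alert rules that trigger automatic responses from the rules that humans watch
  30. Clarifying the criteria for responding
  31. SysLoad as GREE's standard metric
      - Internally, SysLoad has historically been defined and used as a common metric for server load
          - https://github.com/gree/sysload
      - A metric that visualizes Saturation from The Four Golden Signals
      - Anyone can see that "when SysLoad is 100, some resource is saturated"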
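  32. Definition of SysLoad
      - sysload evaluates the following three elements. The maximum value of each is 100.
          - ALL CPU utilization
          - disk I/O utilization
          - CPU Utilization in which interrupt from NIC is occurring
      - Shows that one of the server resources is saturated, that is, the server is under high load
      - Can be used as the first point for isolating a problem
  33. Operating the SysLoad graph
      - The SysLoad composite graph has guide lines; when the value is above the red line (80%), it is clear that a response action should be taken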
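
The way SysLoad supports a response decision can be shown with a short sketch. One point here is an assumption: the slides list three elements capped at 100 and say that a value of 100 means some resource is saturated, which suggests the combined value is the highest of the three, but they do not state the formula. The function names and the example numbers are illustrative; the 80% red line comes from the slide.

```python
def sysload(cpu_util, disk_io_util, nic_irq_cpu_util):
    """Combine the three elements into one saturation value (0 to 100).

    Assumption: the combined value is the maximum of the elements, so
    that 100 means at least one resource is saturated.
    """
    elements = (cpu_util, disk_io_util, nic_irq_cpu_util)
    for value in elements:
        if not 0 <= value <= 100:
            raise ValueError("each element must be between 0 and 100")
    return max(elements)


def needs_action(sysload_value, red_line=80.0):
    """Return True when the value is above the red guide line."""
    return sysload_value > red_line


if __name__ == "__main__":
    value = sysload(cpu_util=35.0, disk_io_util=92.0, nic_irq_cpu_util=10.0)
    print(value, needs_action(value))
```

A single number with a fixed red line gives responders one clear threshold to act on, after which they can check which element is driving the value.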

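  34. Maintaining dashboards per role
  35. Other mechanisms that support taking action
      - chatbot
          - As a way to replace manual procedures
          - For example, operations on AWS Auto Scaling
      - [wip] Provide a runbook together with the notification
          - Similar to what Cookpad has been working on
          - Being prepared on the assumption that it is used alongside the existing procedure documents
  36. Recap
  37. Recap
      - Measure your alerts
      - Consider appropriate routing according to the action that should be taken
      - Consider automating routine actions
      - Build in a mechanism that makes it easy to add routes to alert notification
      - Prepare metrics, mechanisms, and procedures that make action easy to take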
  38. thank you for listening

  39. who?
      - Sohei Iwahori (@egmc)
      - GREE, Inc.
      - Infrastructure / Monitoring Unit Leader
      - Game infrastructure and server monitoring, mainly on AWS
  40. Appendix
      - "Automatic Recovery in Practice" (hiroaki.kobayashi)
          - https://www.slideshare.net/greetech/ss-140561840
      - "A metric I wrote about six years ago seems reasonably useful, so I'm releasing it as OSS" (takanori.sejima)
          - https://labs.gree.jp/blog/2018/12/17645/
      - "A talk about sysload, monitoring, and so on (tentative)" (takanori.sejima)
          - https://www.slideshare.net/takanorisejima/sysload-133365308
  41. Appendix
      - "A highly available alert notification system built with SQS, ElastiCache, and Lambda" (sohei.iwahori)
          - https://labs.gree.jp/blog/2017/05/16483/