Sohei Iwahori
January 25, 2020

# Practices for Making Alerts Actionable
### 2020/01/25 SRE NEXT
#### Sohei Iwahori (GREE, Inc.)

## Agenda

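- Current state of cloud monitoring and the background to the increase in alerts
- background story
- The growth of unactionable ("wait-and-see") alerts and its impact
- How do we make alerts actionable?
- Periodic monitoring
- Addressing "why unactionable?"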
- Recap

Transcript

  1. Practices for Making
    Alerts Actionable
    2020/01/25 SRE NEXT
    Sohei Iwahori (GREE, Inc.)

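  2. Agenda
    » Current state of cloud monitoring and the background to the increase in alerts
    » background story
    » The growth of unactionable ("wait-and-see") alerts and its impact
    » How do we make alerts actionable?
    » Periodic monitoring
    » Addressing "why unactionable?"
    » Recap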

  3. Current state of cloud monitoring and
    the background to the increase in alerts

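  4. background story (1/2)
    » Starting around 2015, we carried out a large-scale migration from on-premises to the cloud (AWS)
    » Existing titles that could be migrated were moved off on-premises, and
    new titles were built in the cloud by default
    » A new monitoring system was built alongside the migration
    » This talk covers the periodic monitoring and improvement of alert handling that we
    carried out around 2018, by which time most titles were running in the cloud,
    together with our earlier efforts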

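  5. background story (2/2)
    » Scale of the current environment
    » The basic unit is a service (mostly one per game title; when regions are split,
    each region is generally served from a separate account)
    » 50+ production AWS accounts
    » 2000+ virtual hosts in total
    » One monitoring set exists per VPC (roughly one per account)
    » Each environment holds at most a few hundred hosts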

  6. Monitoring system for the cloud environment

  7. Current monitoring architecture for the cloud environment

  8. Monitoring the cloud environment (1/2)
    » Prometheus + Grafana
    » gangi (in-house exporter)
      » Lets us reuse the ganglia plugins that collect our custom metrics
      on-premises
      » Fetches data from gmond, the ganglia agent, and makes it
      available for prometheus to pull
    » fluentd (log monitoring + alert delivery)
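
The slides do not show how gangi is implemented. As a rough illustration of the idea (read the XML that gmond serves and re-expose the numeric metrics in the Prometheus text format), here is a minimal Python sketch. The sample XML, the `ganglia_` prefix, the label names, and the function name are all assumptions made for this example, not details of the in-house exporter.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample of the XML a gmond agent serves; real output has more attributes.
SAMPLE_GMOND_XML = """
<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">
  <CLUSTER NAME="game-a">
    <HOST NAME="web01">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float"/>
      <METRIC NAME="os_name" VAL="Linux" TYPE="string"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def gmond_xml_to_prometheus(xml_text: str, prefix: str = "ganglia_") -> str:
    """Convert gmond-style XML into Prometheus text exposition lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                try:
                    value = float(metric.get("VAL", ""))
                except ValueError:
                    continue  # Prometheus samples must be numeric, so skip strings
                # Replace characters that are not valid in a Prometheus metric name.
                name = prefix + re.sub(r"[^a-zA-Z0-9_:]", "_", metric.get("NAME", "unknown"))
                labels = f'cluster="{cluster.get("NAME")}",host="{host.get("NAME")}"'
                lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


print(gmond_xml_to_prometheus(SAMPLE_GMOND_XML), end="")
```

In a real exporter this text would be served over HTTP on a `/metrics` endpoint so that Prometheus can scrape it.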

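  9. Monitoring the cloud environment (2/2)
    » amaneko (in-house monitoring agent)
      » Run from cron; executes the various check plugins
    » yusura (in-house alert control system)
      » Workers that route, suppress, and summarize the aggregated alerts
      » Pluggable alert notification mechanism (Slack, PagerDuty, JIRA,
      Amazon SNS, etc.)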

  10. What happened as a result of the cloud migration?

  11. Increase in alerts

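  13. Why so many alerts?
    » Historically, alert rules had been configured in fine detail since before the migration
    » The alert rules basically carried over from the on-premises environment
    » Increase caused by transient network unreachability and similar events
    » Impact of autoscaling
      » Host counts are adjusted more aggressively
      » Increase caused by problems during scale-in and scale-out
    » In short, rules inherited from on-premises now detect many more events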

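  14. Why unactionable?
    » For historical reasons, alerts were configured for the sake of "detection"
      » Alerts where you detect first and decide on an action afterwards
    » "Transient threshold breaches" occur at high frequency
    » As a result, many alerts end up being handled by "wait and see"
      » Wait and see: no recovery action is needed, and the alert is treated as resolved
      once the event subsides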

  15. Why we should resolve?
    Every time the pager goes off, I should be able to react with a sense of
    urgency. I can only react with a sense of urgency a few times a day
    before I become fatigued.
    -- Site Reliability Engineering / Chapter 6 - Monitoring
    Distributed Systems
    » An alert should be a means of enabling concrete action that raises
    the service level

  16. How do we make alerts
    actionable?

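  17. Why can't we act on an alert?
    » There are simply too many on-call alerts
    » The only response is to wait and see, or to apply a routine fix
    » Action is needed, but it is not urgent
    » The alert was configured because someone wanted "detection"
    » The alert was received, but it is hard to judge what to do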

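  18. Periodic monitoring
    » Aggregate on-call alerts every month
    and post the result to Slack
      » PagerDuty incidents are aggregated
      in SumoLogic and then posted
    » Each month, review the top 10 and
    decide how each should be handled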
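
The monthly review described above boils down to counting incidents per alert and ranking them. The sketch below shows that counting step only; the record shape (`title`, `created_at`) and the function name are hypothetical, and it does not call the PagerDuty or SumoLogic APIs that the actual setup relies on.

```python
from collections import Counter


def top_alerts(incidents, month, limit=10):
    """Return the `limit` most frequent alert titles for a month given as 'YYYY-MM'."""
    counts = Counter(
        incident["title"]
        for incident in incidents
        if incident["created_at"].startswith(month)
    )
    return counts.most_common(limit)


# Hypothetical incident records, e.g. exported from the paging tool.
incidents = [
    {"title": "disk usage high", "created_at": "2020-01-03T10:00:00Z"},
    {"title": "disk usage high", "created_at": "2020-01-09T02:30:00Z"},
    {"title": "process down", "created_at": "2020-01-11T18:45:00Z"},
    {"title": "process down", "created_at": "2019-12-30T07:00:00Z"},
]

for title, count in top_alerts(incidents, "2020-01"):
    print(f"{count:3d}  {title}")
```

Posting the ranked list to a chat channel is left out; the point is that a small, regular report gives the team a fixed agenda of alerts to reconsider.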

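  19. Deciding how to respond
    » Does it need to be an on-call alert at all?
      » Set firing conditions and routing appropriately
    » Does a human need to handle it?
      » Consider automated recovery
    » Are the criteria for responding clear?
      » Establish metrics that lead to action
      » Build mechanisms that make it easy to act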

  20. Set firing conditions and routing appropriately

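  21. Set firing conditions appropriately
    » Relax alert conditions that are too strict
    » Suppress transient spikes and false positives
      » In prometheus rules, require the condition to hold for a period with for x min
      » For check-style monitoring, require consecutive detections via occurrences, and so on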
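
To illustrate the consecutive-detection idea, here is a minimal sketch of an "occurrences" gate: the alert fires only once a check has failed a given number of times in a row, and any success resets the count. The class and method names are made up for this example and are not taken from the in-house agent. The `for` clause in a Prometheus alerting rule serves a similar purpose by requiring the condition to stay true for a duration before the alert fires.

```python
class OccurrenceGate:
    """Fire only after `occurrences` consecutive failed checks."""

    def __init__(self, occurrences: int) -> None:
        if occurrences < 1:
            raise ValueError("occurrences must be at least 1")
        self.occurrences = occurrences
        self.streak = 0  # number of consecutive failures seen so far

    def observe(self, check_failed: bool) -> bool:
        """Record one check result and return True if the alert should fire."""
        self.streak = self.streak + 1 if check_failed else 0
        return self.streak >= self.occurrences


gate = OccurrenceGate(occurrences=3)
# A short spike (two failures, then a success) never reaches the threshold.
history = [True, True, False, True, True, True]
print([gate.observe(failed) for failed in history])
```

The trade-off is detection latency: with a check interval of one minute and three occurrences, a real outage is reported roughly three minutes after it starts.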

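  22. Route alerts appropriately
    » No immediate action is needed, but we want to know it is happening
      » Notify Slack only (watch the volume)
    » No immediate action is needed, but we do want to follow up somehow
      » Automatically file a JIRA ticket from the alert
    » Concrete action is needed right away
      » Notify PagerDuty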
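
The three tiers above can be expressed as a small routing table. This is a sketch under assumptions: the `urgency` field, the tier names, and the destination strings are invented for illustration, and whether ticketed alerts also go to Slack is a guess. Sending alerts with an unknown tier to the pager is likewise a design choice of this example, made so that nothing is dropped silently.

```python
from enum import Enum
from typing import Dict, List


class Urgency(Enum):
    NOTICE = "notice"        # worth knowing about, no action needed now
    FOLLOW_UP = "follow_up"  # needs handling, but not immediately
    IMMEDIATE = "immediate"  # needs concrete action right away


# Hypothetical destinations; a real system would plug in notifier objects here.
ROUTES: Dict[Urgency, List[str]] = {
    Urgency.NOTICE: ["slack"],
    Urgency.FOLLOW_UP: ["slack", "jira"],
    Urgency.IMMEDIATE: ["pagerduty"],
}


def route(alert: dict) -> List[str]:
    """Return the destinations for an alert based on its urgency tier."""
    try:
        urgency = Urgency(alert.get("urgency"))
    except ValueError:
        urgency = Urgency.IMMEDIATE  # unknown or missing tier: page a human
    return ROUTES[urgency]


print(route({"name": "disk usage 70%", "urgency": "notice"}))
print(route({"name": "cert expires in 14 days", "urgency": "follow_up"}))
print(route({"name": "api down", "urgency": "immediate"}))
```

Keeping the table as data rather than code makes it cheap to move an alert from one tier to another after a monthly review.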

  23. slack notification + JIRA ticket creation

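  24. Aside: scheduled-event-
    notifier
    » Reuses the mechanism that files tickets
    from alerts to turn maintenance notices for
    AWS resources that need handling into
    tickets automatically, managed by the operations team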

  26. Considering automated recovery

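  27. Automated recovery at GREE
    » Alert Operator
      » An AWS Lambda Function that runs routine recovery commands through AWS SSM
      according to the alert type
      » Attempts simple actions such as restarting a process
      » Receives the same alert information that is sent to humans,
      runs the corresponding command, and posts the result to chat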

  28. Alert Operator execution example
    For example, it runs commands like these:
    service xxx status
    sleep 10
    service xxx restart
    service xxx status
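
The slides do not include the Alert Operator source, so the following is only a sketch of the flow they describe: look up a routine command list by alert type, run it through SSM, and report the outcome. The alert fields, the `process_down` type, and the `notify` callback are hypothetical. The `send_command` call follows the boto3 SSM client interface with the standard `AWS-RunShellScript` document; the client is passed in so the sketch can run against a stub.

```python
# Maps a (hypothetical) alert type to the routine commands from the slide.
RECOVERY_COMMANDS = {
    "process_down": [
        "service xxx status",
        "sleep 10",
        "service xxx restart",
        "service xxx status",
    ],
}


def handle_alert(alert, ssm_client, notify):
    """Run the recovery commands for an alert and report what was done.

    `ssm_client` is expected to behave like boto3.client("ssm");
    `notify` is any callable that posts a message to chat.
    Returns the SSM command ID, or None when no automated action exists.
    """
    commands = RECOVERY_COMMANDS.get(alert["type"])
    if commands is None:
        notify(f"no automated action for alert type {alert['type']}")
        return None
    response = ssm_client.send_command(
        InstanceIds=[alert["instance_id"]],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    command_id = response["Command"]["CommandId"]
    notify(f"started recovery {command_id} on {alert['instance_id']}")
    return command_id


class StubSSM:
    """Stand-in for the real client so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def send_command(self, **kwargs):
        self.calls.append(kwargs)
        return {"Command": {"CommandId": "cmd-123"}}


handle_alert({"type": "process_down", "instance_id": "i-0abc"}, StubSSM(), print)
```

A fuller version would also poll for the command result and include the output in the chat message, which is what lets a human verify that the restart helped.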

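  29. Practices for introducing automated recovery
    » Start where it is easy to introduce
      » No side effects; the state never ends up worse than before the command ran
    » When the automated response fails, detect that with a separate rule
      » For example, separate the alert rules that trigger automated responses from the
      rules that humans look at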

  30. Clarifying the criteria for responding

  31. SysLoad as GREE's standard metric
    » Internally, we have long defined and used SysLoad as a common metric
    for server load
      » https://github.com/gree/sysload
    » A metric that visualizes Saturation from The Four Golden
    Signals
    » Anyone looking at it can tell that "when SysLoad is 100, one of the resources
    is saturated"

  32. Definition of SysLoad
    sysload evaluates the following three elements. The maximum value of
    each is 100.
    ALL CPU utilization
    disk I/O utilization
    CPU Utilization in which interrupt from NIC is occurring
    » Shows that one of the server resources is saturated, that is, the server is under
    high load
    » Can be used as the first point of triage

  33. Operating with SysLoad graphs
    » The composite SysLoad graph has guide lines; when the value exceeds the red line (80%),
    it is clear that action should be taken
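
As a worked example of reading the metric, the sketch below assumes SysLoad is the largest of the three utilization figures, each capped at 100, which matches the statement that a value of 100 means some resource is saturated. The slides do not spell out the exact formula, so treat the `max` as an assumption and see the sysload repository for the real definition. The function and parameter names are invented here; the 80% threshold is the red line mentioned above.

```python
def sysload(cpu_util: float, disk_io_util: float, nic_irq_cpu_util: float) -> float:
    """Combine three utilization percentages into one saturation figure.

    Assumption: the combined value is the maximum of the three elements,
    each limited to the range 0..100.
    """
    elements = (cpu_util, disk_io_util, nic_irq_cpu_util)
    return max(min(max(value, 0.0), 100.0) for value in elements)


def needs_action(sysload_value: float, red_line: float = 80.0) -> bool:
    """True when the value is above the red guide line on the graph."""
    return sysload_value > red_line


value = sysload(cpu_util=35.0, disk_io_util=92.0, nic_irq_cpu_util=10.0)
print(value, needs_action(value))
```

Because the three inputs share one scale, a single threshold works for CPU-bound, disk-bound, and network-bound hosts, which is what makes the metric usable as a first triage point.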

  34. Maintaining per-role dashboards

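  35. Other mechanisms that support action
    » chatbot
      » As a replacement for manual procedures
      » For example, operations on AWS autoscaling
    » [wip] Provide a runbook together with the notification
      » Similar to what cookpad has worked on
      » Being prepared on the assumption that it is used alongside existing procedure documents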

  36. Recap

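  37. Recap
    » Measure your alerts
    » Decide on routing that matches the action each alert calls for
    » Consider automating routine actions
    » Build in a notification mechanism that makes it easy to add routes
    » Prepare metrics, mechanisms, and procedures that make it easy to act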

  38. thank you for listening

  39. who?
    » Sohei Iwahori (@egmc)
    » GREE, Inc.
    » Infrastructure / Monitoring Unit
    Leader
    » Game infrastructure and server monitoring,
    mainly on AWS

  40. Appendix
    » Automated Recovery in Practice (hiroaki.kobayashi)
      » https://www.slideshare.net/greetech/ss-140561840
    » A metric I built about six years ago seems reasonably useful, so I am releasing it
    as OSS (takanori.sejima)
      » https://labs.gree.jp/blog/2018/12/17645/
    » A talk about sysload, monitoring, and so on (tentative) (takanori.sejima)
      » https://www.slideshare.net/takanorisejima/sysload-133365308

  41. Appendix
    » Building a highly available alert notification system with SQS, ElastiCache, and Lambda
    (sohei.iwahori)
      » https://labs.gree.jp/blog/2017/05/16483/