practices-for-making-alerts-actionable.pdf

Sohei Iwahori
January 25, 2020

# Practices for Making Alerts Actionable
### 2020/01/25 SRE NEXT
#### Sohei Iwahori (GREE, Inc.)

## Agenda

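- The current state of cloud monitoring and the background to the growth in alerts
- background story
- The increase in unactionable ("wait-and-see") alerts and its impact
- How do we make alerts actionable?
- Periodic monitoring
- Addressing "why unactionable?"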
- Recap


Transcript

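  1. background story (1/2) » Around 2015 we began a large-scale migration from on-premises to the cloud (AWS) » Existing titles that could be migrated were moved off on-premises, and new titles are basically built in the cloud » A new monitoring system was built alongside the migration » This talk covers the periodic monitoring and improvement of alert handling that we carried out around 2018, by which time most titles were running in the cloud, together with our earlier efforts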
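  2. background story (2/2) » A sense of the scale of the current environment » Environments are basically split per service (mostly one per game title; when regions are separate, they are generally provided as separate accounts) » 50+ production AWS accounts » 2,000+ virtual hosts in total » One monitoring set exists per VPC (roughly, per account) » A single environment has a few hundred hosts at most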
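  3. Monitoring the cloud environment (1/2) » Prometheus + Grafana » gangi (in-house exporter) » Exists so that the ganglia plugins used on-premises to collect custom metrics can be reused » Fetches data from gmond, the ganglia agent, and makes it available for Prometheus to pull » fluentd (log monitoring + alert delivery)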
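  4. Why so many alerts? » Historically, alert rules were configured in fine detail even before the migration » The alert rules were basically carried over from the on-premises environment » More alerts caused by things like temporary network unreachability » Impact of autoscaling » Host counts are adjusted more aggressively » More alerts caused by problems during scale-in and scale-out » Roughly speaking, rules inherited from on-premises now detect many more events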
  5. Why we should resolve? » "Every time the pager goes off, I should be able to react with a sense of urgency. I can only react with a sense of urgency a few times a day before I become fatigued." -- Site Reliability Engineering / Chapter 6 - Monitoring Distributed Systems » Alerts should be a means of enabling concrete actions that raise the service level
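  6. Automated recovery at GREE » Alert Operator » An AWS Lambda function that runs routine recovery commands through AWS SSM according to the alert type » Attempts simple actions such as restarting a process » Receives the same alert information that is sent to humans, runs the corresponding command, and posts the result to chat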
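  7. SysLoad as GREE's standard indicator » Internally, SysLoad has long been defined and used as the common indicator of server load » https://github.com/gree/sysload » A metric that visualizes Saturation from The Four Golden Signals » Anyone looking at it can tell that "if SysLoad is 100, some resource is saturated"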
  8. Definition of SysLoad » sysload evaluates the following three elements. The maximum value of each is 100. » All CPU utilization » Disk I/O utilization » CPU utilization in which interrupts from the NIC are occurring » Shows that some server resource is saturated, that is, under high load » Can be used as the first triage point
  9. who? » Sohei Iwahori (@egmc) » GREE, Inc. » Infrastructure / Monitoring Unit Leader » Game infrastructure and server monitoring, mainly on AWS
  10. Appendix » Automated Recovery in Practice (hiroaki.kobayashi) » https://www.slideshare.net/greetech/ss-140561840 » A metric I built about six years ago seems reasonably useful, so I am releasing it as OSS (takanori.sejima) » https://labs.gree.jp/blog/2018/12/17645/ » A talk about sysload, monitoring and so on (tentative) (takanori.sejima) » https://www.slideshare.net/takanorisejima/sysload-133365308
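
Slide 3 describes gangi as an exporter that fetches data from gmond and makes it available for Prometheus to pull. The following is a minimal sketch of that idea, not gangi's actual implementation: it assumes gmond is serving its XML dump over TCP (port 8649 is the usual default), and the sample XML, the `ganglia_` prefix, and all function names are hypothetical choices made for illustration.

```python
import re
import socket
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample of the XML that gmond serves.
SAMPLE_GMOND_XML = """<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">
  <CLUSTER NAME="game-a">
    <HOST NAME="web01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS=""/>
      <METRIC NAME="custom.queue-depth" VAL="17" TYPE="uint32" UNITS="jobs"/>
      <METRIC NAME="os_name" VAL="Linux" TYPE="string" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the full XML dump that gmond writes when a client connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def sanitize(name):
    """Map a ganglia metric name onto the characters Prometheus allows."""
    return "ganglia_" + re.sub(r"[^a-zA-Z0-9_]", "_", name)


def escape_label(value):
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def gmond_xml_to_prometheus(xml_text):
    """Convert gmond XML into Prometheus text exposition lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                try:
                    value = float(metric.get("VAL", ""))
                except ValueError:
                    continue  # string metrics cannot be Prometheus samples
                lines.append('%s{cluster="%s",host="%s"} %s' % (
                    sanitize(metric.get("NAME", "")),
                    escape_label(cluster.get("NAME", "")),
                    escape_label(host.get("NAME", "")),
                    value,
                ))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(gmond_xml_to_prometheus(SAMPLE_GMOND_XML), end="")
```

A real exporter would serve this text over HTTP on a `/metrics` path so that Prometheus can scrape it; that part is left out here to keep the sketch short.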
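
Slide 6 describes Alert Operator as a Lambda function that picks a routine recovery command by alert type, runs it through AWS SSM, and reports the result to chat. The sketch below illustrates that flow under stated assumptions: the alert payload shape (`name`, `instance_id`), the `RECOVERY_COMMANDS` table, and the `notify` callback are all hypothetical, and the SSM client is passed in so the logic can be exercised without AWS. In a real Lambda it would presumably be `boto3.client("ssm")`, whose `send_command` call accepts the arguments used here.

```python
# Hypothetical mapping from alert name to a routine recovery command.
RECOVERY_COMMANDS = {
    "NginxDown": ["systemctl restart nginx"],
    "FluentdDown": ["systemctl restart td-agent"],
}


def handle_alert(alert, ssm_client, notify):
    """Try a simple recovery action for one alert and report what happened.

    alert      -- dict with hypothetical keys "name" and "instance_id"
    ssm_client -- object exposing send_command(), e.g. boto3.client("ssm")
    notify     -- callable that posts a message to chat (assumed)
    Returns the SSM command ID, or None when no recovery is defined.
    """
    commands = RECOVERY_COMMANDS.get(alert["name"])
    if commands is None:
        # No routine recovery is known: leave the alert to a human.
        notify("%s on %s: no automated recovery defined"
               % (alert["name"], alert["instance_id"]))
        return None

    response = ssm_client.send_command(
        InstanceIds=[alert["instance_id"]],
        DocumentName="AWS-RunShellScript",  # AWS-managed SSM document
        Parameters={"commands": commands},
        Comment="auto-recovery for %s" % alert["name"],
    )
    command_id = response["Command"]["CommandId"]
    notify("%s on %s: ran %s (SSM command %s)"
           % (alert["name"], alert["instance_id"], commands, command_id))
    return command_id
```

Keeping the mapping as plain data means a new recovery action is a one-line change, and anything without an entry still reaches a human, which matches the idea of only automating simple actions.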
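
Slides 7 and 8 say that SysLoad evaluates three elements, each capped at 100, and that a value of 100 means some resource is saturated. One way to read that, assumed here rather than taken from the sysload source, is that the score is the largest of the three utilizations. The function and parameter names below are hypothetical.

```python
def sysload(cpu_util, disk_io_util, nic_irq_cpu_util):
    """Return a 0-100 saturation score, ASSUMING it is the max of three inputs.

    cpu_util         -- utilization across all CPUs, in percent
    disk_io_util     -- disk I/O utilization, in percent
    nic_irq_cpu_util -- utilization of the CPU handling NIC interrupts, in percent
    """
    elements = (cpu_util, disk_io_util, nic_irq_cpu_util)
    # Clamp each element to the 0-100 range stated on the slide.
    clamped = [max(0.0, min(100.0, value)) for value in elements]
    return max(clamped)


def most_loaded_resource(cpu_util, disk_io_util, nic_irq_cpu_util):
    """Name the element driving the score, as a first triage hint."""
    named = {
        "cpu": cpu_util,
        "disk_io": disk_io_util,
        "nic_irq_cpu": nic_irq_cpu_util,
    }
    return max(named, key=named.get)
```

Under this reading, a host with idle CPUs but a saturated disk still reports 100, and `most_loaded_resource` points at where to look first.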