Introduction to Prometheus in Practice (Prometheus 実践入門) #hbstudy 79 / introduction-to-prometheus-practice

rrreeeyyy
November 21, 2017

I gave a talk about Prometheus at #hbstudy 79.

Transcript

  1. Introduction to Prometheus in Practice hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy )
  2. Agenda
    • About Prometheus
    • Making Prometheus redundant
    • Scaling strategies for Prometheus
    • Data retention in Prometheus
    • About Alertmanager
    • Making Alertmanager redundant
    • About exporters
    • Exporters that look useful for real-world monitoring
    • Managing rule files
  3. About Prometheus
    • Prometheus is an open-source monitoring tool
      • The latest version is currently 2.0.0 (released 11/8)
    • It is inspired by Borgmon, a monitoring tool used inside Google
      • Borgmon is described in detail in chapter 10 of the SRE book
    • Its main features are:
      • A pull-based architecture
      • A reasonably fast time-series database
      • Programmable time-series processing with PromQL
  4. (no text on this slide)

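  5. Why choose Prometheus?
    • It can cope with storing metrics at high resolution
    • The pull-based architecture can be operated with a relatively simple setup
    • Service discovery is well supported
    • PromQL is expressive and can compute a wide range of statistics
    • It has joined the CNCF and integrates with Kubernetes and similar tools
      • Being able to integrate with de facto standard tools matters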
  6. Configuring Prometheus
    • Installation is basically just dropping a binary in place
    • Monitoring targets are simply listed under scrape_configs
      • As a rule, use *_sd_configs so that targets can come and go
      • If no service discovery mechanism fits, file_sd_configs may work as a substitute
        • Write a file in the specified format and it is picked up without a reload
    • Dashboards are basically built with Grafana
      • Set the datasource to Prometheus and describe what to draw in PromQL
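The file-based discovery mentioned on slide 6 can be fed by any script that writes a target list. The following is a minimal Python sketch of such a writer, assuming the JSON layout read by `file_sd_configs` (a list of objects with `targets` and optional `labels`). The function name, host names, and `role` label are placeholders, not something taken from the talk.

```python
import json
import os
import tempfile


def write_file_sd_targets(path, groups):
    """Write target groups as a file_sd JSON file and return the payload.

    `groups` is an iterable of (targets, labels) pairs, for example
    (["app-001:9100"], {"role": "app"}).
    """
    payload = [
        {"targets": list(targets), "labels": dict(labels)}
        for targets, labels in groups
    ]
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file and rename it into place, so that a reader
    # never observes a half-written target list.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f, indent=2)
    os.replace(tmp_path, path)
    return payload
```

Calling `write_file_sd_targets("targets/node.json", [(["app-001:9100"], {"role": "app"})])` from a cron job or deploy hook would keep the file current. The path is hypothetical and has to match whatever the `files` pattern in the scrape configuration points at.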
  7. Configuration example: settings for EC2 instances
    - job_name: 'node'
      ec2_sd_configs:
        - region: ap-northeast-1
          port: 9100
      relabel_configs:
        - source_labels: [__meta_ec2_instance_state]
          regex: ^running$                        # keep only running instances
          action: keep
        - source_labels: [__meta_ec2_tag_Role]
          regex: ^(app|db)$                       # keep only instances whose Role tag is app or db
          action: keep
        - source_labels: [__meta_ec2_tag_Name]    # setting target_label makes the resulting
          target_label: instance                  # label available as a filter condition
        - source_labels: [__meta_ec2_tag_Role]    # in PromQL
          target_label: role
        - source_labels: [__meta_ec2_tag_Status]
          target_label: status
        - source_labels: [__meta_ec2_instance_type]
          target_label: instance_type
        - source_labels: [__meta_ec2_availability_zone]
          target_label: availability_zone
        - source_labels: [__meta_ec2_vpc_id]
          target_label: vpc_id
  8. (no text on this slide)

  9. (no text on this slide)

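  10. Making Prometheus redundant
    • Redundancy for Prometheus simply means running two servers [1]
      • Because it is pull-based, running two of them is all it takes
      • The two copies of the data can differ by at most the offset between their scrapes (bounded by scrape_interval)
      • In practice this is rarely a problem
    • In practice, put Nginx or similar in front so that if one server goes down the other is used
      • Point the datasource used by Grafana (or whatever draws the graphs) at the Nginx host
    [1] https://github.com/prometheus/prometheus/issues/1500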
  11. Scaling strategies for Prometheus
    • Up to a few million metrics, a single set should be able to handle the load
    • Beyond that, prepare one Prometheus set per data center or failure domain [2]
    • With multiple Prometheus servers, federation becomes an option
      • An upper-level server scrapes the /federate endpoint of the lower-level Prometheus servers
      • In most cases the lower-level servers aggregate data with recording rules before it is federated
      • Alternatively, split the datasources referenced from Grafana or similar
    • Cloudflare, for example, aggregates data per colocation site and then federates it [3]
    [2] https://www.robustperception.io/scaling-and-federating-prometheus/
    [3] https://promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf
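Federation, as described on slide 11, is an ordinary HTTP scrape of `/federate` with one `match[]` parameter per series selector. The Python sketch below only builds and decodes such a URL, to show what the upper-level server requests. The host name `prometheus-dc1` and the `job:` naming scheme for recording rules are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build the /federate URL for a lower-level Prometheus server.

    Each selector becomes its own match[] query parameter.
    """
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


def selectors_of(url):
    """Decode the match[] selectors back out of a federation URL."""
    return parse_qs(urlsplit(url).query).get("match[]", [])


# Hypothetical server; the selector picks only pre-aggregated series
# whose names start with "job:".
url = federate_url("http://prometheus-dc1:9090", ['{__name__=~"job:.*"}'])
```

Restricting the selectors to recorded, already aggregated series keeps the federated volume small, which matches the slide's advice to aggregate on the lower-level servers first.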
  12. Digression: recording rules
    • Prometheus lets you define recording rules [4]
    • A recording rule runs a PromQL expression you define at a fixed interval
      • The result can be stored as a time series under a different name
      • Useful for downsampling time series and for aggregating before federation
      • The interval is determined by the interval set in the rule group, or else by evaluation_interval
    • Values defined by a recording rule can also be used in alert rules
    [4] https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
  13. Digression: example of a recording rule and an alert rule
    groups:
      - name: mysql.rules
        rules:
          - record: mysql_slave_lag_seconds
            expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay
          - alert: MySQLReplicationLag
            expr: (mysql_slave_lag_seconds > 30) and ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0)
            for: 1m
            labels:
              severity: critical
            annotations:
              description: The mysql slave replication has fallen behind and is not recovering
              summary: MySQL slave replication is lagging
  14. Data retention in Prometheus
    • Prometheus is not well suited to keeping time-series data for very long periods [5]
      • This is an architectural constraint that comes with fast query processing
      • The default retention period for time-series data is 15 days
    • The recommended approach is to store data in separate long-term storage [6]
      • In practice Prometheus just sends protocol buffer data over HTTP
      • Remote storage implementations exist for InfluxDB, S3, Chronix, and others
      • It is configured with remote_read and remote_write in the Prometheus configuration
    • Alternatively, federate into a Prometheus server with a longer storage.tsdb.retention
    [5] http://techlife.cookpad.com/entry/timeseries-database-001
    [6] https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations
  15. About Alertmanager
    • Receives and handles the alerts sent by Prometheus [7]
      • It can group, route, and deduplicate alerts
      • Alerts can be searched and notifications silenced from the web UI or the amtool command
    • It actually works with senders other than Prometheus [8]
      • Prometheus merely POSTs JSON to the /api/v1/alerts endpoint
    [7] https://prometheus.io/docs/alerting/alertmanager/
    [8] https://prometheus.io/docs/alerting/clients/
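Since slide 15 notes that Alertmanager only needs JSON POSTed to its alerts endpoint, any program can raise an alert. The following Python sketch assumes the documented client payload (a JSON list of objects with `labels`, `annotations`, and `startsAt`) and the `/api/v1/alerts` path named on the slide. The function names are illustrative, and the path should be checked against the Alertmanager version in use.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(labels, annotations, starts_at=None):
    """Return one alert object in the shape the alerts endpoint accepts."""
    starts_at = starts_at or datetime.now(timezone.utc)
    return {
        "labels": dict(labels),
        "annotations": dict(annotations),
        # RFC 3339 timestamp, e.g. 2017-11-20T10:00:00+00:00
        "startsAt": starts_at.isoformat(),
    }


def post_alerts(alertmanager_url, alerts, path="/api/v1/alerts"):
    """POST a list of alerts as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        alertmanager_url.rstrip("/") + path,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A batch job could call `post_alerts("http://alertmanager-001:9093", [build_alert({"alertname": "BatchFailed", "severity": "critical"}, {"summary": "nightly batch failed"})])`, where the host and label values are hypothetical. The labels are what the routing configuration matches on, exactly as for alerts coming from Prometheus.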
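  16. Configuring Alertmanager
    • Installation is basically just dropping a binary in place
    • You describe notification rules, deduplication rules, and so on
    • The alert rules themselves are defined on the Prometheus side
      • The labels defined in the Prometheus rules are available in Alertmanager
      • Notification targets are basically chosen from label values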
  17. Example Prometheus alert rule configuration
    groups:
      - name: linux
        rules:
          - alert: InstanceDown      # the alert name; commonly used for grouping
            expr: up == 0            # the PromQL condition that acts as the alert threshold
            for: 1m                  # passed to Alertmanager once it has held for 1 minute
            labels:                  # these values are available on the Alertmanager side
              severity: CRITICAL
            annotations:             # annotations are used in notifications to Slack and so on
              description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
              summary: Instance {{ $labels.instance }} down
          - alert: CPUUtilization
            expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60
            for: 1m
            labels:
              severity: CRITICAL
            annotations:
              description: '{{ $labels.instance }} has had high CPU usage for more than 1 minute.'
              summary: Instance {{ $labels.instance }} cpu utilization is high
  18. Example Alertmanager configuration
    global:
      resolve_timeout: 5m
    route:
      receiver: 'sre-pagerduty'             # default receiver (the root route requires one)
      group_by: ['alertname', 'instance']   # labels that define a notification group
      group_wait: 30s                       # how long to wait for the initial deduplication
      group_interval: 5m                    # interval between notifications for a group:
                                            # wait 30 s, notify, then notify every 5 min if new alerts arrive
      repeat_interval: 1h                   # time before re-sending (notifies every 1h while unresolved)
      routes:                               # alert routing
        - match_re:                         # route on the labels set in the rules
            service: ^sre$
          receiver: 'sre-pagerduty'
    receivers:                              # who receives the alerts
      - name: 'sre-pagerduty'               # webhook, email, pagerduty, and others are available
        pagerduty_configs:
          - service_key: xxxxxxxxxxxxxxxxxxxxxxxx
    inhibit_rules:                          # suppression of redundant alerts
      - source_match:                       # if a critical alert with the same alert name
          severity: 'critical'              # and instance already exists,
        target_match:                       # the matching warning alert is suppressed
          severity: 'warning'
        equal: ['alertname', 'instance']
  19. Making Alertmanager redundant
    • Alertmanager can be made redundant with the -mesh options
      • Basically, pass -mesh.peer several times on every node, listing all nodes including itself
      • e.g. alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002
      • 001 and 002 then start talking to each other on TCP port 6783
    • List both Alertmanagers under targets in the alerting section of the Prometheus configuration
    • Internally, redundancy is implemented with weaveworks/mesh [9]
      • It uses a gossip protocol (membership) and satisfies AP in CAP terms
      • If the network is partitioned, for example, duplicate alerts are sent
    [9] https://github.com/weaveworks/mesh
  20. Configuration example: Prometheus settings for working with Alertmanager
    alerting:
      alertmanagers:
        - ec2_sd_configs:                  # service discovery also works
            - region: ap-northeast-1       # for Alertmanager itself
              port: 9093
          relabel_configs:
            - source_labels: [__meta_ec2_instance_state]
              regex: ^running$             # running instances
              action: keep
            - source_labels: [__meta_ec2_tag_Role]
              regex: ^alertmanager$        # instances whose Role tag is alertmanager
              action: keep
  21. About exporters
    • The servers that Prometheus pulls from are called exporters
    • There are many exporters for different purposes and metrics [10]
      • node_exporter: standard Linux metrics
      • mysqld_exporter: standard MySQL metrics
      • nginx_exporter: metrics from nginx_status
      • mtail: tails logs and converts them into metrics
      • snmp_exporter: converts SNMP values into metrics
    [10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations
  22. Writing your own exporter
    • Providing a simple HTTP endpoint is enough to make an exporter [11]
      • All you need is an endpoint that emits 'metrics_name value\n'
      • Application-specific metrics are easy to collect this way
      • As a rule, expose raw values from the exporter and aggregate on the Prometheus side
      • A protocol buffer format is also available
    • If no exporter exists you have to write one, but there is no language restriction and the format is simple, so it is not hard
      • We have actually built one that outputs AWS metrics using API Gateway + Lambda
    [11] https://prometheus.io/docs/instrumenting/exposition_formats/
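To make slide 22 concrete, here is a minimal exporter sketch in Python using only the standard library. It serves `name value` lines on `/metrics` in the text exposition format. The metric names, the `collect` function, and port 9999 are made up for illustration; a real exporter would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: value} as one 'name value' line per metric."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def collect():
    """Hypothetical application metrics; replace with real raw values."""
    return {"myapp_queue_length": 3, "myapp_requests_total": 1027}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve /metrics until interrupted; the port number is arbitrary."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running `serve()` and adding the host and port as a scrape target is all it takes. In line with the slide, the handler exposes raw counters and gauges and leaves rates and averages to PromQL.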
  23. What Prometheus's own metrics look like
    $ curl localhost:9090/metrics
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 5.9729e-05
    go_gc_duration_seconds{quantile="0.25"} 9.75e-05
    go_gc_duration_seconds{quantile="0.5"} 0.000117034
    go_gc_duration_seconds{quantile="0.75"} 0.000157237
    go_gc_duration_seconds{quantile="1"} 0.0067897
    go_gc_duration_seconds_sum 10.408703235
    go_gc_duration_seconds_count 33117
    # HELP go_goroutines Number of goroutines that currently exist.
    # TYPE go_goroutines gauge
    go_goroutines 54
  24. Digression: the exporter port explosion problem
    • As the Prometheus wiki [10] shows, each exporter uses one port
      • Running several exporters on one instance uses up a lot of ports
      • Configuring firewalls such as security groups every time is tedious
      • There is little need to open that many ports to Prometheus
    • Tools such as rrreeeyyy/exporter_proxy [12] solve this
      • A single port is exposed, and the metrics_path on the Prometheus side identifies the exporter
    [10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations
    [12] https://github.com/rrreeeyyy/exporter_proxy
  25. About PromQL
    • The query language used to process time-series data in Prometheus
      • It feels somewhat difficult at first, but once you are used to it, it is expressive and convenient
      • For getting started, the official documentation and, personally, the DigitalOcean tutorials were good [13] [14]
    • Alerting is also done with PromQL
      • You can write alert rules on statistically processed results
      • There are caveats, such as preferring rate() over irate() for alerting
    [13] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1
    [14] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2
  26. PromQL for calculating CPU usage
    • Per-host CPU usage collected by node_exporter can be written as follows [15]
      • Subtract the idle value from 100% and average per instance
      • The averaging is needed because node_cpu holds one value per CPU core
      • To turn it into an alert rule, change irate to rate and append a threshold such as > 60
    100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
    [15] https://www.robustperception.io/understanding-machine-cpu-usage/
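The arithmetic behind the CPU usage expression on slide 26 can be worked through outside PromQL. This Python sketch assumes two readings of each core's idle-seconds counter taken `seconds` apart and ignores counter resets, which the real `rate()` and `irate()` handle. The core names and numbers are invented.

```python
def per_second_rate(first, last, seconds):
    """Average per-second increase of a counter between two readings."""
    return (last - first) / seconds


def cpu_usage_percent(idle_samples, seconds):
    """Compute host CPU usage from per-core idle counters.

    `idle_samples` maps each core to (first, last) readings of its
    idle-seconds counter. An idle rate of 1.0 means the core was idle
    for the whole window.
    """
    rates = [
        per_second_rate(first, last, seconds)
        for first, last in idle_samples.values()
    ]
    idle_fraction = sum(rates) / len(rates)  # the "avg by (instance)" step
    return 100 - idle_fraction * 100


# Two cores over a 300 s window: idle for 240 s and 180 s respectively,
# so the idle rates are 0.8 and 0.6 and usage comes out at about 30%.
usage = cpu_usage_percent({"cpu0": (1000.0, 1240.0), "cpu1": (2000.0, 2180.0)}, 300)
```

Without the averaging step, each core would yield its own series, which is why the expression aggregates by `instance`.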
  27. Alert rule configuration for disk usage alerts [16]
    - name: node.rules
      rules:
        - alert: DiskWillFillIn4Hours
          expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
          for: 5m
          labels:
            severity: page
    • Linear regression functions such as predict_linear make it possible to alert when free disk space is predicted to reach 0 or below within 4 hours
    [16] https://www.robustperception.io/reduce-noise-from-disk-space-alerts/
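The idea behind `predict_linear` on slide 27 is a least-squares line fitted to recent samples and extended into the future. The Python sketch below illustrates that idea only and is not Prometheus's implementation. It treats the newest sample's timestamp as "now" and needs at least two samples with distinct timestamps; both are simplifying assumptions. The sample values are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Returns the predicted value `seconds_ahead` seconds after the newest
    sample.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    newest = max(t for t, _ in samples)
    return intercept + slope * (newest + seconds_ahead)


# Free space (in arbitrary units) falling by 150 every 30 minutes over one hour.
free_space = [(0, 1000.0), (1800, 850.0), (3600, 700.0)]
will_fill = predict_linear(free_space, 4 * 3600) < 0
```

With these numbers the fitted line loses 300 units per hour, so the four-hour prediction is well below zero and the condition holds, which corresponds to the rule on the slide firing once it has stayed true for the `for` duration.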
  28. Digression: managing rule files
    • Alert rules have to be managed in Prometheus
      • Rule files are written in plain YAML
    • Coming from tools such as Zabbix, the feature set feels somewhat lacking
      • I would like roles, templates, and macros...
    • Honestly, I would rather hear how everyone else manages them
      • A web UI (Promgen, perhaps?) seems like the strongest candidate right now
      • I do understand the view that you should keep it simple and avoid tricks
      • Writing rules in jsonnet, much like ksonnet for Kubernetes, seems like a plausible idea
  29. Summary
    • Covered what to consider when introducing Prometheus into production
      • Redundancy, scaling strategy, data retention, and so on
    • Covered what to consider when introducing Alertmanager into production
      • Redundancy, actual configuration, and so on
    • Covered writing your own exporter and PromQL