Slide 1

Slide 1 text

Prometheus: A Practical Introduction
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy )

Slide 2

Slide 2 text

Agenda
• About Prometheus
• Making Prometheus redundant
• Scaling strategy for Prometheus
• Data retention in Prometheus
• About Alertmanager
• Making Alertmanager redundant
• About exporters
• Exporters that are useful in real-world monitoring
• Managing rule files

Slide 3

Slide 3 text

About Prometheus
• Prometheus is an open-source monitoring tool
  • The latest version at the time of this talk is 2.0.0 (released 11/8)
• It is inspired by Borgmon, a monitoring tool used inside Google
  • Borgmon is described in detail in Chapter 10 of the SRE book
• Its main characteristics are:
  • A pull-based architecture
  • A reasonably fast time-series database
  • Programmable time-series processing with PromQL

Slide 4

Slide 4 text


Slide 5

Slide 5 text

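Why choose Prometheus?
• It can cope with storing metrics at high resolution
• The pull-based architecture can be operated with a relatively simple setup
• Service discovery support is extensive
• PromQL is expressive, so a wide range of statistics can be computed
• It has joined the CNCF and integrates with Kubernetes and similar tools
  • Being able to integrate with de facto standard tools is important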

Slide 6

Slide 6 text

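Configuring Prometheus
• Installation is basically just a matter of placing the binary
• Monitoring targets are simply listed under scrape_configs
  • As a rule, use *_sd_config so that targets follow instances being added and removed
  • When no matching service discovery exists, file_sd_config or similar may work as a substitute
    • Targets written to a file in the specified format are picked up without a reload
• Dashboards are generally built with Grafana
  • Set the datasource to Prometheus and write what to draw in PromQL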
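As a rough sketch of the file-based fallback mentioned above: the fragment below uses `file_sd_configs` in the Prometheus configuration and a separate targets file. The job name `custom`, the file path, the addresses, and the `role` label are all assumptions made up for illustration.

```yaml
# prometheus.yml (fragment). Job name and path are hypothetical.
scrape_configs:
  - job_name: 'custom'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml   # assumed location of the target files
---
# /etc/prometheus/targets/app.yml (a separate file, shown here as a second YAML document).
# Rewriting this file changes the scraped targets without reloading Prometheus.
- targets:
    - '10.0.0.10:9100'   # hypothetical node_exporter endpoints
    - '10.0.0.11:9100'
  labels:
    role: app
```

A small script or cron job that regenerates the targets file from an inventory source is usually enough to emulate a missing service discovery mechanism.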

Slide 7

Slide 7 text

Configuration example: settings for EC2 instances
- job_name: 'node'
  ec2_sd_configs:
    - region: ap-northeast-1
      port: 9100
  relabel_configs:
    - source_labels: [__meta_ec2_instance_state]
      regex: ^running$            # running instances only
      action: keep
    - source_labels: [__meta_ec2_tag_Role]
      regex: ^(app|db)$           # only instances whose Role tag is app or db
      action: keep
    # Setting target_label makes the resulting label usable
    # as a filter condition in PromQL.
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance
    - source_labels: [__meta_ec2_tag_Role]
      target_label: role
    - source_labels: [__meta_ec2_tag_Status]
      target_label: status
    - source_labels: [__meta_ec2_instance_type]
      target_label: instance_type
    - source_labels: [__meta_ec2_availability_zone]
      target_label: availability_zone
    - source_labels: [__meta_ec2_vpc_id]
      target_label: vpc_id

Slide 8

Slide 8 text


Slide 9

Slide 9 text


Slide 10

Slide 10 text

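Making Prometheus redundant
• Redundancy for Prometheus simply means running two servers [1]
  • Because it is pull-based, running two servers is enough to be redundant
  • The data on the two servers can differ by at most the offset between their scrape_interval timings
    • In practice this is rarely a problem
• In practice, put Nginx or similar in front so that if one server goes down the other one is used
  • Point the datasource used by Grafana (or whatever draws the graphs) at the Nginx host
[1] https://github.com/prometheus/prometheus/issues/1500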

Slide 11

Slide 11 text

Scaling strategy for Prometheus
• Up to a few million metrics, a single set of servers should be able to keep up
• As the number grows, prepare one set of Prometheus servers per DC or failure domain [2]
• With multiple Prometheus servers, federation becomes possible
  • The upper-level server scrapes the /federate endpoint of the lower-level Prometheus servers
  • In most cases the lower-level servers aggregate data with Record (recording rules) before it is federated
  • Alternatively, split the datasources referenced from Grafana or similar
• For example, CloudFlare aggregates data per colocation and then federates it [3]
[2] https://www.robustperception.io/scaling-and-federating-prometheus/
[3] https://promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf
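The federation step above might look roughly like the following scrape job on the upper-level Prometheus. The hostnames `prometheus-dc1` and `prometheus-dc2` and the `job:` naming prefix for aggregated series are assumptions for illustration; `metrics_path`, `honor_labels`, and the `match[]` parameter are standard scrape settings.

```yaml
# prometheus.yml on the upper-level (global) Prometheus. Hostnames are hypothetical.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true            # keep the labels exposed by the lower-level servers
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # assumed: only pull pre-aggregated series named job:...
    static_configs:
      - targets:
          - 'prometheus-dc1:9090'
          - 'prometheus-dc2:9090'
```

Restricting `match[]` to aggregated series keeps the upper-level server from re-ingesting every raw series of every lower-level server.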

Slide 12

Slide 12 text

Digression: about Record
• Prometheus lets you define recording rules [4]
  • A recording rule runs a predefined PromQL expression at a fixed interval
  • The result can be stored as a time series under a different name
  • Typical uses are sampling time series and aggregating data for federation
  • The execution interval is determined by interval in the rule group or by evaluation_interval
• Values defined with Record can also be used in alert rules
[4] https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
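To make the interval bullet concrete, here is a minimal sketch of a rule group that overrides the evaluation interval. The group name and the recorded series name are assumptions; the expression reuses the `node_cpu` metric that appears elsewhere in this deck.

```yaml
# Rule file (fragment). Names are hypothetical.
groups:
  - name: example.rules
    interval: 30s   # per-group setting; if omitted, global.evaluation_interval applies
    rules:
      # Store the per-instance average idle rate as a new, cheaper-to-query series.
      - record: instance:node_cpu_idle:avg_rate5m
        expr: avg by (instance) (rate(node_cpu{job="node",mode="idle"}[5m]))
```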

Slide 13

Slide 13 text

Digression: Record/Alert example
groups:
- name: mysql.rules
  rules:
  - record: mysql_slave_lag_seconds
    expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay
  - alert: MySQLReplicationLag
    expr: (mysql_slave_lag_seconds > 30) and ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0)
    for: 1m
    labels:
      severity: critical
    annotations:
      description: The mysql slave replication has fallen behind and is not recovering
      summary: MySQL slave replication is lagging

Slide 14

Slide 14 text

Data retention in Prometheus
• Prometheus is not well suited to storing time series for very long periods [5]
  • This is an architectural constraint that comes from optimizing for fast queries
  • The default retention period for time series is 15 days
• The recommended approach is to store data in separate storage, referred to as long-term storage [6]
  • In practice, the data simply arrives as protocol buffers over HTTP
  • Implementations exist that use InfluxDB, S3, or Chronix as remote storage
  • It is configured with remote_read and remote_write in the Prometheus configuration
• Alternatively, federate into a Prometheus that runs with a longer storage.tsdb.retention
[5] http://techlife.cookpad.com/entry/timeseries-database-001
[6] https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations
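A minimal sketch of the remote storage settings, assuming an adapter process that accepts writes and serves reads over HTTP. The hostname `remote-storage-adapter`, the port, and the `/write` and `/read` paths are assumptions and depend entirely on the adapter actually deployed.

```yaml
# prometheus.yml (fragment). The adapter URL is hypothetical.
remote_write:
  - url: 'http://remote-storage-adapter:9201/write'   # samples are pushed here
remote_read:
  - url: 'http://remote-storage-adapter:9201/read'    # queries can read older data back
```

Local storage keeps working as before; these settings only add a second destination and source for samples.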

Slide 15

Slide 15 text

About Alertmanager
• It receives alerts from Prometheus and handles them [7]
  • It can group, route, and deduplicate alerts
  • Alerts can be searched and notifications silenced from the web UI or the amtool command
• It actually works without Prometheus as well [8]
  • Clients simply POST JSON to the /api/v1/alerts endpoint
[7] https://prometheus.io/docs/alerting/alertmanager/
[8] https://prometheus.io/docs/alerting/clients/

Slide 16

Slide 16 text

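Configuring Alertmanager
• Installation is basically just a matter of placing the binary
• The configuration describes notification rules, deduplication rules, and so on
  • The alert rules themselves are defined on the Prometheus side
  • The labels defined in Prometheus rules are available in Alertmanager
  • As a rule, the notification destination is decided from label values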

Slide 17

Slide 17 text

Example Prometheus alert rule configuration
groups:
- name: linux
  rules:
  - alert: InstanceDown      # Alert name; commonly used for grouping and similar
    expr: up == 0            # PromQL expression that serves as the alert condition
    for: 1m                  # Passed to Alertmanager once the condition has held for 1 minute
    labels:                  # These values are available on the Alertmanager side
      severity: CRITICAL
    annotations:             # Annotations are used in notifications such as Slack messages
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
      summary: 'Instance {{ $labels.instance }} down'
  - alert: CPUUtilization
    expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60
    for: 1m
    labels:
      severity: CRITICAL
    annotations:
      description: '{{ $labels.instance }} has had high CPU usage for more than 1 minute.'
      summary: 'Instance {{ $labels.instance }} cpu utilization is high'

Slide 18

Slide 18 text

global:
  resolve_timeout: 5m
route:
  receiver: 'sre-pagerduty'              # Default receiver (the root route needs one)
  group_by: ['alertname', 'instance']    # Labels used to group alerts sent to a receiver
  group_wait: 30s                        # How long to wait initially so duplicates can be collapsed
  group_interval: 5m                     # Interval between notifications for a group
  # Wait 30 seconds, notify, then notify every 5 minutes if new alerts arrive
  repeat_interval: 1h                    # Time until a resend (every 1h while unresolved, even if nothing changes)
  routes:                                # Alert routing
  - match_re:                            # Routing can be written against labels set in the rules
      service: ^sre$
    receiver: 'sre-pagerduty'
receivers:                               # Destinations that receive alerts
- name: 'sre-pagerduty'                  # webhook, email, pagerduty, and others are available
  pagerduty_configs:
  - service_key: xxxxxxxxxxxxxxxxxxxxxxxx
inhibit_rules:                           # Alert suppression
# If a critical alert with the same alert name and instance already exists,
# the matching warning alert is suppressed and treated as merged into it.
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']

Slide 19

Slide 19 text

Making Alertmanager redundant
• Alertmanager can be made redundant with the -mesh options
  • As a rule, every node specifies -mesh.peer multiple times, including itself
    • ex.) alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002
  • 001 and 002 start communicating over TCP port 6783
  • List both Alertmanagers under targets in the alerting section of the Prometheus configuration
• Internally, weaveworks/mesh [9] is used to implement the redundancy
  • It uses a gossip protocol (membership) and satisfies AP in terms of CAP
  • In situations such as a network partition, alerts are sent in duplicate
[9] https://github.com/weaveworks/mesh
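For the targets bullet above, a static variant of the Prometheus side could look like this sketch. It reuses the hostnames from the `-mesh.peer` example and assumes Alertmanager listens on port 9093.

```yaml
# prometheus.yml (fragment). Port 9093 is assumed.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-001:9093'
            - 'alertmanager-002:9093'
```

Prometheus sends every alert to all listed Alertmanagers, and the mesh between them is what deduplicates the resulting notifications.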

Slide 20

Slide 20 text

Configuration example: Prometheus settings for working with Alertmanager
alerting:
  alertmanagers:
  # Service discovery also works for Alertmanager itself
  - ec2_sd_configs:
    - region: ap-northeast-1
      port: 9093
    relabel_configs:
    - source_labels: [__meta_ec2_instance_state]
      regex: ^running$           # running instances
      action: keep
    - source_labels: [__meta_ec2_tag_Role]
      regex: ^alertmanager$      # instances whose Role tag is alertmanager
      action: keep

Slide 21

Slide 21 text

About exporters
• The servers that Prometheus pulls from are called exporters
• There are many exporters for different purposes and metrics [10]
  • node_exporter: standard Linux metrics
  • mysqld_exporter: standard MySQL metrics
  • nginx_exporter: nginx_status metrics
  • mtail: tails logs and converts them into metrics
  • snmp_exporter: converts SNMP values into metrics
[10] https://github.com/prometheus/prometheus/wiki/Default-port-alloca

Slide 22

Slide 22 text

Writing your own exporter
• Providing a simple HTTP endpoint is all it takes to make an exporter [11]
  • An endpoint that emits 'metrics_name value\n' is enough
  • Application-specific metrics can be collected easily this way
  • As a rule, expose raw values on the exporter side and aggregate on the Prometheus side
  • A protocol buffer format also exists
• If an exporter does not exist you have to write one, but there is no language restriction and the format is simple, so it is not hard
  • We have actually built one that exposes metrics from inside AWS using API Gateway + Lambda
[11] https://prometheus.io/docs/instrumenting/exposition_formats/

Slide 23

Slide 23 text

Looking at the metrics exposed by Prometheus itself
$ curl localhost:9090/metrics
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.9729e-05
go_gc_duration_seconds{quantile="0.25"} 9.75e-05
go_gc_duration_seconds{quantile="0.5"} 0.000117034
go_gc_duration_seconds{quantile="0.75"} 0.000157237
go_gc_duration_seconds{quantile="1"} 0.0067897
go_gc_duration_seconds_sum 10.408703235
go_gc_duration_seconds_count 33117
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 54

Slide 24

Slide 24 text

Digression: the exporter port explosion problem
• As the Prometheus wiki [9] shows, each exporter uses its own port
  • Running several exporters on one instance uses up many ports
  • Configuring firewalls such as security groups (sg) every time is tedious
  • There is little need to expose that many ports to Prometheus
• Solve this with something like rrreeeyyy/exporter_proxy [12]
  • Use one specific port and distinguish exporters via metrics_path on the Prometheus side
[12] https://github.com/rrreeeyyy/exporter_proxy
[9] https://github.com/weaveworks/mesh
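On the Prometheus side, the metrics_path approach could be sketched as below. The host `app-001`, the proxy port `9099`, and the two paths are assumptions; they need to match whatever the proxy in front of the exporters is actually configured to serve.

```yaml
# prometheus.yml (fragment). Host, port, and paths are hypothetical.
scrape_configs:
  - job_name: 'node'
    metrics_path: /node_exporter/metrics     # assumed path that the proxy maps to node_exporter
    static_configs:
      - targets: ['app-001:9099']
  - job_name: 'mysqld'
    metrics_path: /mysqld_exporter/metrics   # assumed path that the proxy maps to mysqld_exporter
    static_configs:
      - targets: ['app-001:9099']
```

Both jobs reach the same port, so only one firewall rule per instance is needed no matter how many exporters run behind the proxy.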

Slide 25

Slide 25 text

About PromQL
• The query language used to process time-series data in Prometheus
  • It feels somewhat hard until you get used to it, but after that it is expressive and convenient
  • For getting started, the official documentation and (personally) the DigitalOcean tutorials were good [13][14]
• Alerting is also done with PromQL
  • Alert rules can be written on statistically processed results
  • There are caveats, such as preferring rate() over irate() for alerting
[13] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1
[14] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2

Slide 26

Slide 26 text

PromQL for computing CPU usage
• Per-host CPU usage collected by node_exporter can be written as follows [15]
  • Subtract the idle value from 100% and average per instance
    • This is needed because node_cpu holds one value per CPU core
  • To turn it into an alert rule, change irate to rate and append a threshold such as > 60
100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
[15] https://www.robustperception.io/understanding-machine-cpu-usage/

Slide 27

Slide 27 text

Alert rule configuration for disk usage alerts [16]
groups:
- name: node.rules
  rules:
  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
    for: 5m
    labels:
      severity: page
• Linear regression functions such as predict_linear are available, so you can alert when free disk space is predicted to reach 0 or below within 4 hours
[16] https://www.robustperception.io/reduce-noise-from-disk-space-alerts/

Slide 28

Slide 28 text

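Digression: managing rule files
• Alert rules have to be managed on the Prometheus side
  • Rule files are written in plain YAML
  • Coming from Zabbix and similar tools, the feature set feels somewhat lacking
    • I want things like Role, Template, and Macro...
• Honestly, I would rather like to know how everyone who uses it manages their rules
  • A web UI (Promgen, perhaps?) seems like the strongest option at the moment
  • I understand the view that you should keep it simple and avoid anything tricky
  • Writing rules in jsonnet, as ksonnet does for Kubernetes, seems like a plausible idea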

Slide 29

Slide 29 text

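Summary
• Explained what to consider when introducing Prometheus into production
  • Redundancy, scaling strategy, data retention, and so on
• Explained what to consider when introducing Alertmanager into production
  • Redundancy, actual configuration, and so on
• Explained how to write your own exporter and covered PromQL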