Practical Introduction to Prometheus #hbstudy 79 / introduction-to-prometheus-practice

rrreeeyyy
November 21, 2017

I gave a talk about Prometheus at #hbstudy 79.

Transcript

  1. Practical Introduction to Prometheus
    hbstudy#79 (2017/11/20) | Yoshikawa Ryota (@rrreeeyyy)

  2. Agenda
    • About Prometheus
    • Making Prometheus redundant
    • Scaling strategy for Prometheus
    • Data retention period in Prometheus
    • About Alertmanager
    • Making Alertmanager redundant
    • About exporters
    • Exporters likely to be useful in real-world monitoring
    • Managing rule files

  3. About Prometheus
    • Prometheus is an open-source monitoring tool
    • The latest version at the time of this talk is 2.0.0 (released 11/8)
    • It was inspired by Borgmon, a monitoring tool that exists inside Google
    • Borgmon is described in detail in chapter 10 of the SRE book
    • It has the following characteristics
    • A pull-based architecture
    • A reasonably fast time-series database
    • Programmable time-series processing with PromQL

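  5. Why choose Prometheus?
    • It can cope with storing metrics at high resolution
    • The pull-based architecture lets you operate it with a relatively simple setup
    • Service discovery support is extensive
    • PromQL is highly expressive and can compute a wide range of statistics
    • It has joined the CNCF and integrates with Kubernetes and similar tools
    • Being able to integrate with tools regarded as de facto standards is important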

  6. Configuring Prometheus
    • Installation is basically just a matter of placing the binary
    • You simply list the targets to monitor under scrape_configs
    • As a rule, use *_sd_config so that targets can be added and removed without manual edits
    • When no matching service discovery exists, file_sd_config or similar may work as a substitute
    • Write a file in the specified format and Prometheus reads it without a reload
    • As a rule, build dashboards and the like with Grafana
    • Set the data source to Prometheus and write what to plot in PromQL
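
The file-based service discovery mentioned above can be fed by any script that writes the target file. The following is a minimal sketch, assuming the JSON form of the file_sd format (a list of objects with "targets" and optional "labels"); the function name, host names, and label values are hypothetical.

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Write target groups in the file_sd JSON format, replacing the file atomically."""
    payload = [
        {"targets": sorted(group["targets"]), "labels": dict(group.get("labels", {}))}
        for group in groups
    ]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_path, path)
    return payload
```

A call such as write_file_sd("targets.json", [{"targets": ["app-001:9100"], "labels": {"role": "app"}}]) would produce a file that a file_sd_configs entry can point at. The rename at the end is a design choice that avoids Prometheus picking up a partially written file.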

  7. Configuration example: settings for EC2 instances
    - job_name: 'node'
      ec2_sd_configs:
        - region: ap-northeast-1
          port: 9100
      relabel_configs:
        - source_labels: [__meta_ec2_instance_state]
          regex: ^running$ # only instances in the running state
          action: keep
        - source_labels: [__meta_ec2_tag_Role]
          regex: ^(app|db)$ # only instances whose Role tag is app or db
          action: keep
        # Setting target_label makes the resulting label usable
        # as a filter condition in PromQL.
        - source_labels: [__meta_ec2_tag_Name]
          target_label: instance
        - source_labels: [__meta_ec2_tag_Role]
          target_label: role
        - source_labels: [__meta_ec2_tag_Status]
          target_label: status
        - source_labels: [__meta_ec2_instance_type]
          target_label: instance_type
        - source_labels: [__meta_ec2_availability_zone]
          target_label: availability_zone
        - source_labels: [__meta_ec2_vpc_id]
          target_label: vpc_id
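
To make the effect of the keep and target_label rules above concrete, here is a rough Python emulation of relabeling for a single target. It is a simplified sketch, not Prometheus's implementation: it handles only the keep action and the default replace action with the default regex, and the function name relabel is hypothetical.

```python
import re


def relabel(labels, rules):
    """Apply keep/replace rules to one target's labels; return None if it is dropped."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ";" (the default separator).
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # The regex is anchored at both ends, hence fullmatch.
        matched = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep":
            if matched is None:
                return None
        elif action == "replace" and matched is not None:
            # The default replacement is the first capture group.
            labels[rule["target_label"]] = matched.group(1)
    # Labels starting with "__" are discarded once relabeling is finished.
    return {k: v for k, v in labels.items() if not k.startswith("__")}
```

A stopped instance, or one whose Role tag is neither app nor db, yields None here, which corresponds to the target never being scraped.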

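  10. Making Prometheus redundant
    • Redundancy for Prometheus simply means running two servers [1]
    • Because it is pull-based, just running two servers gives you redundancy
    • The data on the two servers differs by at most the offset between their scrapes, bounded by scrape_interval
    • In practice this is rarely a problem
    • In practice, put Nginx or similar in front so that when one server goes down the other is used
    • Point the data source used by Grafana and other graphing tools at the Nginx host
    [1] https://github.com/prometheus/prometheus/issues/1500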

  11. Scaling strategy for Prometheus
    • Up to a few million metrics, a single set should be able to cope
    • As the number grows, prepare one Prometheus set per DC or failure domain [2]
    • Once you have multiple Prometheus sets, you can federate them
    • Scrape the /federate endpoint of the lower-level Prometheus servers
    • In most cases, aggregate the data with recording rules on the lower-level Prometheus before federating
    • Alternatively, split the data sources referenced from Grafana and similar tools
    • For example, Cloudflare aggregates data per colocation and federates it [3]
    [2] https://www.robustperception.io/scaling-and-federating-prometheus/
    [3] https://promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf
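
As a small illustration of what scraping /federate involves, the sketch below builds the request URL with match[] parameters selecting which series to pull. The host name and the selector (series produced by recording rules named job:...) are assumptions for illustration only.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


def selectors_in(url):
    """Recover the match[] selectors from a federate URL (handy for checking)."""
    return parse_qs(urlsplit(url).query).get("match[]", [])
```

Restricting match[] to pre-aggregated series keeps the upper-level Prometheus from pulling every raw series from the lower level, which is the point of aggregating before federating.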

  12. Digression: About recording rules
    • Prometheus lets you define recording rules [4]
    • A recording rule evaluates a given PromQL expression at a fixed interval
    • The result can be stored as a time series under a different name
    • They are used for sampling time series, for aggregation before federation, and so on
    • The evaluation interval is set by interval within the rule group or by evaluation_interval
    • Values defined with record can also be used in alert rules
    [4] https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/

  13. Digression: Record/Alert example
    groups:
      - name: mysql.rules
        rules:
          - record: mysql_slave_lag_seconds
            expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay
          - alert: MySQLReplicationLag
            expr: (mysql_slave_lag_seconds > 30) and
              ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0)
            for: 1m
            labels:
              severity: critical
            annotations:
              description: The mysql slave replication has fallen behind and is not recovering
              summary: MySQL slave replication is lagging

  14. Data retention period in Prometheus
    • Prometheus is not well suited to storing time-series data for very long periods [5]
    • This is an architectural constraint that comes from achieving fast query processing
    • The default retention period for time-series data is 15 days
    • The recommended approach is to store data in separate long-term storage [6]
    • In practice, Prometheus just sends protocol buffer data over HTTP
    • Implementations exist that use InfluxDB, S3, or Chronix as remote storage
    • Configure it with remote_read and remote_write in the Prometheus configuration
    • Alternatively, federate into a Prometheus with a longer storage.tsdb.retention
    [5] http://techlife.cookpad.com/entry/timeseries-database-001
    [6] https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations

  15. About Alertmanager
    • It receives alerts from Prometheus and handles them [7]
    • It can group, route, and deduplicate alerts
    • Searching alerts and silencing notifications can be done from the Web UI or the amtool command
    • It actually works without Prometheus too [8]
    • Prometheus is only POSTing JSON to the /api/v1/alerts endpoint
    [7] https://prometheus.io/docs/alerting/alertmanager/
    [8] https://prometheus.io/docs/alerting/clients/
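
Since Alertmanager only expects JSON on an HTTP endpoint, any client can push alerts to it. The sketch below builds (but does not send) such a request using the /api/v1/alerts path named in the slide; newer Alertmanager releases use an /api/v2/alerts path, so check the version you run. The function name, alert name, and label values are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert_request(alertmanager_url, alertname, labels, annotations):
    """Build the HTTP request that pushes one alert, without sending it."""
    alert = {
        "labels": {"alertname": alertname, **labels},
        "annotations": annotations,
        # RFC 3339 timestamp marking when the alert started firing.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    body = json.dumps([alert]).encode("utf-8")  # the endpoint takes a JSON array
    return urllib.request.Request(
        alertmanager_url.rstrip("/") + "/api/v1/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would deliver the alert. A client that wants the alert to stay active has to keep re-sending it, which is what Prometheus itself does.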

  16. Configuring Alertmanager
    • Installation is basically just a matter of placing the binary
    • You describe notification rules, deduplication rules, and so on
    • The alert rules themselves are defined on the Prometheus side
    • The various labels defined in Prometheus rules are available
    • Basically, the notification destination is decided from label values

  17. Example Prometheus alert rule configuration
    groups:
      - name: linux
        rules:
          - alert: InstanceDown # alert name; commonly used for grouping and the like
            expr: up == 0 # the PromQL expression that acts as the alert threshold
            for: 1m # handed to Alertmanager once the condition has held for 1 minute
            labels: # these values are available on the Alertmanager side
              severity: CRITICAL
            annotations: # annotations are used when notifying via Slack and similar
              description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
              summary: Instance {{ $labels.instance }} down
          - alert: CPUUtilization
            expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60
            for: 1m
            labels:
              severity: CRITICAL
            annotations:
              description: '{{ $labels.instance }} has had high CPU usage for more than 1 minute.'
              summary: Instance {{ $labels.instance }} cpu utilization is high

  18.
    global:
      resolve_timeout: 5m
    route:
      receiver: 'sre-pagerduty' # default receiver; the root route requires one
      group_by: ['alertname', 'instance'] # labels used to group alerts into one notification
      group_wait: 30s # how long to wait at first so that duplicates can be collapsed
      group_interval: 5m # interval between notifications for a group
      # wait 30 seconds, notify, then notify every 5 minutes if new alerts have arrived
      repeat_interval: 1h # time until a resend (while unresolved, notifies every 1h even if nothing changes)
      routes: # alert routing settings
        - match_re: # routing can be written against labels set in the rules
            service: ^sre$
          receiver: 'sre-pagerduty'
    receivers: # settings for what receives the alerts
      - name: 'sre-pagerduty' # webhook, email, pagerduty, and others are available
        pagerduty_configs:
          - service_key: xxxxxxxxxxxxxxxxxxxxxxxx
    inhibit_rules: # settings for suppressing redundant alerts
      - source_match: # when a critical alert with the same alert name
          severity: 'critical' # and instance name already exists,
        target_match: # the warning one is folded into it
          severity: 'warning'
        equal: ['alertname', 'instance']
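
The inhibit_rules logic above can be read as a simple predicate over label sets. The following sketch models alerts as plain label dictionaries and checks one rule with exact-match source_match and target_match; it is an illustration of the idea, not Alertmanager's code, and the function name is hypothetical.

```python
def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` is muted by some firing alert under one inhibit rule."""
    # The rule only applies to alerts that match target_match.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in firing_alerts:
        if source is target:
            continue
        # A source alert must match source_match ...
        if any(source.get(k) != v for k, v in rule["source_match"].items()):
            continue
        # ... and agree with the target on every label listed in `equal`.
        if all(source.get(k) == target.get(k) for k in rule["equal"]):
            return True
    return False
```

With the rule from the slide, a warning alert for db-001 is muted while a critical alert with the same alertname and instance is firing, but a warning for db-002 still notifies.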

  19. Making Alertmanager redundant
    • Alertmanager can be made redundant with the -mesh options
    • Basically, on every node, specify -mesh.peer multiple times, including the node itself
    • ex.) alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002
    • 001 and 002 start talking to each other on TCP port 6783
    • List both Alertmanagers under targets in the alerting section of the Prometheus configuration
    • Internally, redundancy is implemented with weaveworks/mesh [9]
    • It uses a gossip protocol (membership) and satisfies AP in terms of CAP
    • In cases such as a network partition, alerts are sent in duplicate
    [9] https://github.com/weaveworks/mesh

  20. Configuration example: Prometheus settings for working with Alertmanager
    alerting:
      alertmanagers:
        - ec2_sd_configs: # Alertmanager itself can also be
            - region: ap-northeast-1 # found through service discovery
              port: 9093
          relabel_configs:
            - source_labels: [__meta_ec2_instance_state]
              regex: ^running$ # instances in the running state
              action: keep
            - source_labels: [__meta_ec2_tag_Role]
              regex: ^alertmanager$ # instances whose Role tag is alertmanager
              action: keep

  21. About exporters
    • The server that Prometheus pulls from is called an exporter
    • There are many exporters for different purposes and for the metrics you want to collect [10]
    • node_exporter: standard Linux metrics
    • mysqld_exporter: standard MySQL metrics
    • nginx_exporter: nginx_status metrics
    • mtail: tails logs and converts them into metrics
    • snmp_exporter: converts SNMP values into metrics
    [10] https://github.com/prometheus/prometheus/wiki/Default-port-alloca

  22. Writing your own exporter
    • Just providing a simple HTTP endpoint is enough to make an exporter [11]
    • All you need is an endpoint that emits 'metrics_name value\n'
    • Application-specific metrics and the like are also easy to collect
    • As a rule, expose raw values on the exporter side and aggregate on the Prometheus side
    • Alternatively, there is also a protocol buffer format
    • If no exporter exists you have to write one, but there is no language restriction and the format is simple, so it is not hard
    • In practice, we have built things such as one that outputs AWS metrics using API Gateway + Lambda
    [11] https://prometheus.io/docs/instrumenting/exposition_formats/
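
To show how little an exporter needs, here is a minimal sketch using only the Python standard library. It serves the 'metrics_name value\n' lines described above at /metrics; the metric names, the collect function, and port 8000 are hypothetical choices for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: value} as text exposition lines, one 'name value' per line."""
    return "".join(f"{name} {value}\n" for name, value in sorted(metrics.items()))


def collect():
    # Hypothetical application metrics; a real exporter would read live values.
    return {"myapp_queue_length": 3, "myapp_requests_total": 1027}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the exporter; blocks until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and then running curl localhost:8000/metrics should return the two lines from collect(). The sketch exposes raw values and leaves rates and aggregation to PromQL, in line with the advice above.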

  23. Looking at Prometheus's own metrics
    $ curl localhost:9090/metrics
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 5.9729e-05
    go_gc_duration_seconds{quantile="0.25"} 9.75e-05
    go_gc_duration_seconds{quantile="0.5"} 0.000117034
    go_gc_duration_seconds{quantile="0.75"} 0.000157237
    go_gc_duration_seconds{quantile="1"} 0.0067897
    go_gc_duration_seconds_sum 10.408703235
    go_gc_duration_seconds_count 33117
    # HELP go_goroutines Number of goroutines that currently exist.
    # TYPE go_goroutines gauge
    go_goroutines 54

  24. Digression: The exporter port explosion problem
    • As the Prometheus wiki [9] shows, each exporter uses one port
    • Putting multiple exporters on one instance uses up many ports
    • Configuring firewalls such as security groups every time is tedious
    • There is little need to expose that many ports to Prometheus
    • Solve this with something like rrreeeyyy/exporter_proxy [12]
    • Use one specific port and distinguish exporters by metrics_path on the Prometheus side
    [9] https://github.com/weaveworks/mesh
    [12] https://github.com/rrreeeyyy/exporter_proxy

  25. About PromQL
    • The query language used to process time-series data in Prometheus
    • It feels somewhat difficult until you get used to it, but then it is expressive and convenient
    • For getting started, the official documentation and, personally, the DigitalOcean tutorials were good [13] [14]
    • Alerting is also done with PromQL
    • You can write alert rules on statistically processed results
    • There are caveats, such as preferring rate() over irate() for alerting
    [13] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1
    [14] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2

  26. PromQL for calculating CPU usage
    • Per-host CPU usage collected by node_exporter can be written as follows [15]
    • Subtract the idle value from 100% and take the average grouped by instance
    • This is because node_cpu holds a value for each CPU core
    • To turn it into an alert rule, change irate to rate and append a threshold such as > 60
    100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
    [15] https://www.robustperception.io/understanding-machine-cpu-usage/
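
The arithmetic behind that expression can be sketched in a few lines. This simplified version takes two readings of the per-core idle counters (in seconds) and ignores what rate() and irate() additionally handle, such as counter resets; the function name and the sample numbers in the note are hypothetical.

```python
def cpu_usage_percent(idle_before, idle_after, elapsed_seconds):
    """Host CPU usage in percent from per-core idle counters (seconds)."""
    # Per-core idle rate: seconds spent idle per second of wall time (0.0 to 1.0).
    rates = [
        (idle_after[core] - idle_before[core]) / elapsed_seconds
        for core in idle_before
    ]
    # Average across cores, as `avg by (instance)` does, then invert.
    return 100 - (sum(rates) / len(rates)) * 100
```

For two cores whose idle counters advance by 30 and 60 seconds over a 60-second window, the idle rates are 0.5 and 1.0, the average is 0.75, and the host usage comes out at 25%. Without the per-instance average you would get one value per core rather than one per host.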

  27. Alert rule configuration for disk usage alerts [16]
    - name: node.rules
      rules:
        - alert: DiskWillFillIn4Hours
          expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
          for: 5m
          labels:
            severity: page
    • Because linear regression functions such as predict_linear are available, you can alert when the free disk space is predicted to reach 0 or below within 4 hours
    [16] https://www.robustperception.io/reduce-noise-from-disk-space-alerts/
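
The idea behind predict_linear is ordinary least-squares regression followed by extrapolation. The sketch below illustrates that idea on (timestamp, value) pairs; it assumes the evaluation time equals the timestamp of the last sample and is not Prometheus's implementation. The function name and sample values are hypothetical.

```python
def predict_linear_sketch(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and extrapolate.

    Needs at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)
```

For free space falling by 1 byte per second, for example samples (0, 10000), (1800, 8200), (3600, 6400), the prediction 4 hours past the last sample is negative, so a rule with `< 0` would fire. Alerting on the trend rather than on a fixed percentage is what reduces noise from disks that are full but stable.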

  28. Digression: Managing rule files
    • Alert rules have to be managed in Prometheus
    • Rule files are written in simple YAML
    • Coming from Zabbix and similar tools, the feature set feels somewhat lacking
    • I want roles, templates, and macros...
    • Honestly, I would rather like to know how those of you who use it manage them
    • A web UI (Promgen, perhaps?) seems to be the leading option at the moment
    • I understand the view that you should avoid tricky things and keep it simple
    • Writing rules in jsonnet, like ksonnet does for Kubernetes, seems like a plausible idea

  29. Summary
    • I explained what to consider when introducing Prometheus into production
    • Redundancy, scaling strategy, data retention period, and so on
    • I explained what to consider when introducing Alertmanager into production
    • Redundancy, actual configuration, and so on
    • I explained how to write your own exporter and covered PromQL