Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Prometheus 実践入門 #hbstudy 79 / introduction-to-p...
Search
rrreeeyyy
November 21, 2017
Technology
16
3.3k
Prometheus 実践入門 #hbstudy 79 / introduction-to-prometheus-practice
#hbstudy 79 で Prometheus の話をしました
rrreeeyyy
November 21, 2017
Tweet
Share
More Decks by rrreeeyyy
See All by rrreeeyyy
An Efficient Incident Response Training with AI / SRE NEXT 2024 Sponsor Session
rrreeeyyy
1
3.2k
カンファレンスから見る SRE トレンド 2024 / SRE Trends from Conferences in 2024 #SRE_Findy
rrreeeyyy
4
2.1k
信頼性の育て方 / mackerel-meetup-15
rrreeeyyy
9
2.4k
SRE の歩き方・進め方 / sre-walk-through-procedure
rrreeeyyy
0
8.5k
「信頼性」を保ちつつ大規模サービスをリニューアルする / cookpad-tech-kitchen-service-embedded-sres
rrreeeyyy
11
12k
Cookpad and Prometheus
rrreeeyyy
6
20k
SRE-Lounge-8-Cookpad-Microservice-Architecture-Overview
rrreeeyyy
5
5.2k
A survey of anomaly detection methodologies for web system
rrreeeyyy
5
1.2k
エンジニアリングをちゃんとやる あるいは 人類の平和 について / wsa02-rrreeeyyy
rrreeeyyy
14
3k
Other Decks in Technology
See All in Technology
小規模に始めるデータメッシュとデータガバナンスの実践
kimujun
2
260
TinyMLの技術動向
kyotomon
2
260
AWS re:Inventを徹底的に楽しむためのTips / Tips for thoroughly enjoying AWS re:Invent
yuj1osm
0
180
バイセルにおけるAI活用の取り組みについて紹介します/Generative AI at BuySell Technologies
kyuns
1
200
ガチ勢によるPipeCD運用大全〜滑らかなCI/CDを添えて〜 / ai-pipecd-encyclopedia
cyberagentdevelopers
PRO
2
140
CAMERA-Suite: 広告文生成のための評価スイート / ai-camera-suite
cyberagentdevelopers
PRO
3
230
わたしとトラックポイント / TrackPoint tips
masahirokawahara
1
200
来年もre:Invent2024 に行きたいあなたへ - “集中”と“つながり”で楽しむ -
ny7760
0
110
What's in a Postgres major release? An analysis of contributions in the v17 timeframe | Claire Giordano | PGConf EU 2024
clairegiordano
1
680
dbt-coreで実現するCore DataMartsのデータモデリング〜dbt編〜 / Core DataMarts Modeling with dbt-core
i125
3
1.2k
Vueで Webコンポーネントを作って Reactで使う / 20241030-cloudsign-vuefes_after_night
bengo4com
3
180
現地でMeet Upをやる場合の注意点〜反省点を添えて〜
shotashiratori
0
160
Featured
See All Featured
Fireside Chat
paigeccino
32
3k
The Straight Up "How To Draw Better" Workshop
denniskardys
232
140k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
46
2.1k
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
27
4.1k
Fashionably flexible responsive web design (full day workshop)
malarkey
404
65k
Learning to Love Humans: Emotional Interface Design
aarron
272
40k
Optimizing for Happiness
mojombo
376
69k
What's in a price? How to price your products and services
michaelherold
243
11k
Automating Front-end Workflow
addyosmani
1365
200k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
9
670
We Have a Design System, Now What?
morganepeng
50
7.2k
The Illustrated Children's Guide to Kubernetes
chrisshort
48
48k
Transcript
Prometheus ࣮ફೖ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy )
1
Agenda • Prometheus ʹ͍ͭͯ • Prometheus ͷԽʹ͍ͭͯ • Prometheus ͷεέʔϧઓུʹ͍ͭͯ
• Prometheus ͷσʔλอ࣋ظؒʹ͍ͭͯ • Alertmanager ʹ͍ͭͯ • Alertmanager ͷԽʹ͍ͭͯ • Exporter ʹ͍ͭͯ • ࣮ࡍͷࢹͰ͑ͦ͏ͳ Exporter ʹ͍ͭͯ • Rule ϑΝΠϧͷཧʹ͍ͭͯ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 2
Prometheus ʹ͍ͭͯ • Prometheus OSS ͷϞχλϦϯάπʔϧ • ݱࡏͷ࠷৽όʔδϣϯ 2.0.0
(11/8 ϦϦʔε) • Google ʹଘࡏ͍ͯ͠Δ Borgmon ͱ͍͏ϞχλϦϯάπʔϧʹΠϯεύΠΞ͞Ε͍ͯΔ • Borgmon ʹ͍ͭͯ SRE ຊ 10 ষΛಡΉͱৄ͘͠ॻ͍ͯ͋Δ • ࣍ͷΑ͏ͳಛ͕͋Δ • Pull ܕͷΞʔΩςΫνϟ • ͦΕͳΓʹߴͳ࣌ܥྻσʔλϕʔε • PromQL ʹΑΔϓϩάϥϚϒϧͳ࣌ܥྻσʔλॲཧ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 3
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 4
ͳͥ Prometheus Λબ͢Δͷ͔ • ߴ͍ղೳͰͷϝτϦΫεͷอଘʹ͑ΒΕΔ • Pull ܕͷΞʔΩςΫνϟͰൺֱత୯७ͳߏͰӡ༻Ͱ͖Δ • Service
Discovery ͕ॆ࣮͍ͯ͠Δ • PromQL ͷදݱྗ͕ߴ༷͘ʑͳ౷ܭ͕औΕΔ • CNCF ೖΓΛՌͨ͠ Kubernetes ͷ࿈ܞՄೳ • σϑΝΫτͱ͞ΕΔπʔϧͱͷ࿈ܞ͕Մೳͳͷॏཁ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 5
Prometheus ͷઃఆʹ͍ͭͯ • ΠϯετʔϧجຊతʹόΠφϦΛஔ͚ͩ͘ • ࢹ͢ΔରΛ scrape_configs Ͱॻ͍͍͚ͯͩ͘ • جຊతʹ૿ݮʹରԠͰ͖ΔΑ͏ʹ
*_sd_config Λ͏Α͏ʹ͢Δ • ରԠ͢Δ sd ͕ͳ͍࣌ file_sd_config ͰସͰ͖ΔՄೳੑ͕͋Δ • ࢦఆͷϑΥʔϚοτͰϑΝΠϧʹॻ͖ࠐΜͰஔ͘ͱ reload ແ͠ͰಡΜͰ͘ΕΔ • μογϡϘʔυͳͲجຊతʹ Grafana Λͬͯ࡞ΔΑ͏ʹ͢Δ • Datasource Λ Prometheus ʹͯ͠ඳը͢ΔରΛ PromQL Ͱॻ͚Δ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 6
ઃఆྫ: EC2 ͷΠϯελϯε༻ͷઃఆ - job_name: 'node' ec2_sd_configs: - region: ap-northeast-1
port: 9100 relabel_configs: - source_labels: [__meta_ec2_instance_state] regex: ^running$ # running ͷ͚ͩ action: keep - source_labels: [__meta_ec2_tag_Role] regex: ^(app|db)$ # Role λάʹ app, db ͕͍͍ͭͯΔͷ͚ͩ action: keep - source_labels: [__meta_ec2_tag_Name] # target_label Λࢦఆ͓ͯ͘͠ͱɺ target_label: instance # PromQL ͰͷߜΓࠐΈ݅ͱͯ͠ɺ - source_labels: [__meta_ec2_tag_Role] # ઃఆͨ͠ϥϕϧΛར༻Ͱ͖ΔΑ͏ʹͳΔ target_label: role - source_labels: [__meta_ec2_tag_Status] target_label: status - source_labels: [__meta_ec2_instance_type] target_label: instance_type - source_labels: [__meta_ec2_availability_zone] target_label: availability_zone - source_labels: [__meta_ec2_vpc_id] target_label: vpc_id hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 7
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 8
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 9
Prometheus ͷԽʹ͍ͭͯ • Prometheus ͷԽ୯७ʹαʔόΛ 2 ىಈ͢Δ͚ͩ 1 • Pull
ܕͳͷͰ 2 ىಈ͓͚ͯͩ͘͠ͰԽʹͳΔ • σʔλ࠷େͰ scrape_interval ͕ͣΕ͚ͨͩͣΕΔ • ݱ࣮ʹʹͳΔ͜ͱগͳ͍ • ࣮ࡍʹϑϩϯτʹ Nginx Λઃஔͯ͠ยํ͕མͪͨΒ͏ยํ͕ࢀর͞ΕΔΑ͏ʹ͢Δ • άϥϑͷඳըʹ͏ Grafana ͕ࢀর͢ΔσʔλιʔεΛ Nginx ͷϗετʹઃఆ͢Δ 1 h$ps:/ /github.com/prometheus/prometheus/issues/1500 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 10
Prometheus ͷεέʔϧઓུʹ͍ͭͯ • ϝτϦΫε͕ඦສ͙Β͍·Ͱ 1 ηοτͰेࡹ͚Δͣ • ૿͖͑ͯͨ߹ DC োυϝΠϯຖʹ
1 ηοτͣͭ Prometheus Λ༻ҙ͢Δ 2 • ෳͷ Prometheus Λ༻ҙͨ͠߹ϑΣσϨʔγϣϯΛߦ͏͜ͱ͕ग़དྷΔ • ԼҐͷ Prometheus ͷ /federate ΤϯυϙΠϯτΛεΫϨΠϓ͢Δ • େମͷ߹ԼҐͷ Prometheus Ͱ Record Λ͍σʔλΛू্ͨ͠ͰϑΣσϨʔγϣϯ͢Δ • ͘͠ Grafana Ͱࢀর͢ΔσʔλιʔεΛ͚ΔͳͲ͕ߟ͑ΒΕΔ • ྫ͑ CloudFlare ͰίϩέʔγϣϯຖʹσʔλΛूͯ͠ϑΣσϨʔγϣϯ͍ͯ͠Δ 3 3 h$ps:/ /promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf 2 h$ps:/ /www.robustpercep2on.io/scaling-and-federa2ng-prometheus/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 11
Digression: Record ʹ͍ͭͯ • Prometheus Ͱ Recording rule ͱ͍͏ͷΛఆٛग़དྷΔ 4
• Recording rule ఆٛͨ͠ PromQL ΛҰఆִؒͰ࣮ߦͰ͖Δ • ࣮ߦ݁ՌΛผͷ໊લͷ࣌ܥྻσʔλͱͯ͠อଘ͢Δ͜ͱ͕ग़དྷΔ • ࣌ܥྻσʔλͷαϯϓϦϯάϑΣσϨʔγϣϯ࣌ͷूʹ͏ • ࣮ߦִؒ Rule ͷ interval ͔ evaluation_interval Ͱܾఆ͞ΕΔ • Record Ͱఆٛͨ͠ Alert rule Ͱར༻Մೳ 4 h$ps:/ /prometheus.io/docs/prometheus/latest/configura8on/recording_rules/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 12
Digression: Record/Alert ͷྫ groups: - name: mysql.rules rules: - record:
mysql_slave_lag_seconds expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay - alert: MySQLReplicationLag expr: (mysql_slave_lag_seconds > 30) and ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0) for: 1m labels: severity: critical annotations: description: The mysql slave replication has fallen behind and is not recovering summary: MySQL slave replication is lagging hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 13
Prometheus ͷσʔλอ࣋ظؒʹ͍ͭͯ • Prometheus ࣌ܥྻσʔλΛظؒอଘ͢Δͷʹ͋·Γద͍ͯ͠ͳ͍ 5 • ߴͳΫΤϦॲཧΛ࣮ݱ͢ΔͨΊͷΞʔΩςΫνϟ্ͷ੍ • σϑΥϧτͰͷ࣌ܥྻσʔλͷอ࣋ظؒ
15 ؒ • Long-term storage ͱ͍͏ผͷετϨʔδʹσʔλΛอଘ͢Δํ͕ࣜਪ͞Ε͍ͯΔ 6 • ࣮ࡍʹ HTTP Ͱ protocol buffer ͷσʔλ͕ඈΜͰདྷΔ͚ͩ • InfluxDB S3 Chronix Λ remote storage ͱ͢Δ࣮͕ଘࡏ͍ͯ͠Δ • Prometheus ͷઃఆͷ remote_read remote_write Ͱઃఆ͢Δ • ͘͠ storage.tsdb.retention Λͨ͘͠ prometheus ʹ federaFon ͤ͞ΔͳͲ 6 h$ps:/ /prometheus.io/docs/prometheus/latest/storage/#remote-storage-integra9ons 5 h$p:/ /techlife.cookpad.com/entry/7meseries-database-001 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 14
Alertmanager ʹ͍ͭͯ • Prometheus ͷ alert Λड͚औΓϋϯυϧͯ͘͠ΕΔͷ 7 • Ξϥʔτͷάϧʔϐϯά,
ϧʔςΟϯά, ॏෳഉআ͕ग़དྷΔ • Ξϥʔτͷݕࡧɾ௨ͷࢭͳͲ͕ WebUI / amtool ίϚϯυ͔ΒՄೳ • Prometheus Ͱͳ࣮ͯ͘ಈ͘ 8 • /api/v1/alerts ΤϯυϙΠϯτʹ JSON Λ POST ͍ͯ͠Δ͚ͩ 8 h$ps:/ /prometheus.io/docs/aler5ng/clients/ 7 h$ps:/ /prometheus.io/docs/aler5ng/alertmanager/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 15
Alertmanager ͷઃఆʹ͍ͭͯ • ΠϯετʔϧجຊతʹόΠφϦΛஔ͚ͩ͘ • Ξϥʔτͷ௨ϧʔϧɾॏෳഉআϧʔϧͳͲΛهड़͢Δ • Ξϥʔτͷϧʔϧࣗମ Prometheus ͷํʹఆٛ͢Δ
• Prometheus ͷ Rule Ͱఆ༷ٛͨ͠ʑͳϥϕϧ͕ར༻Մೳ • جຊతʹϥϕϧͷΛݩʹͯ͠௨ઌΛܾఆ͢Δ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 16
Prometheus ͷ Alert rule ͷઃఆྫ groups: - name: linux rules:
- alert: InstanceDown # AlertnameɻҰൠతʹ Grouping ͰΘΕΔ expr: up == 0 # ࣮ࡍʹΞϥʔτͷᮢͱͯ͠ΘΕΔ PromQL ͷ for: 1m # 1 ؒҎ্ܧଓͨ͠߹ʹ alertmanager ʹΔ labels: # ͜ͷ͕ Alertmanager ଆͰར༻Մೳ severity: CRITICAL annotations: # Slack Ͱ௨͞ΕΔࡍʹ annotations ͕ར༻͞ΕΔɻ description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.' summary: Instance {{ $labels.instance }} down - alert: CPUUtilization expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60 for: 1m labels: severity: CRITICAL annotations: description: '{{ $labels.instance }} has been use high cpu more than 1 minutes.' summary: Instance {{ $labels.instance }} cpu utilization is high hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 17
global: resolve_timeout: 5m route: group_by: ['alertname', 'instance'] # receiver ʹ௨͢Δ݅ʹઃఆ
group_wait: 30s # ࠷ॳͷॏෳഉআͷͨΊʹͭඵ group_interval: 5m # άϧʔϓʹରͯ͠௨Λߦ͏ִؒ # ࠷ॳ 30 ඵͬͯ௨->Ҏޙ৽͍͠Ξϥʔτ͕͋Ε 5 ຖʹ௨ repeat_interval: 1h # ࠶ૹ͞ΕΔ·Ͱͷ࣌ؒ(resolve ͍ͯ͠ͳ͚ΕԿͳ͘ͱ 1h ຖʹ௨) routes: # ΞϥʔτͷϧʔςΟϯάͷઃఆ - match_re: # Rule Ͱઃఆͨ͠λάʹରͯ͠ϧʔςΟϯάΛॻ͚Δ service: ^sre$ receiver: 'sre-pagerduty' receivers: # ΞϥʔτΛड͚औΔରͷઃఆ - name: 'sre-page' # webhook, email, pagerduty ͕͑Δ pagerduty_configs: - service_key: xxxxxxxxxxxxxxxxxxxxxxxx inhibit_rules: # Ξϥʔτͷॏෳഉআͷઃఆ - source_match: # طʹΞϥʔτ໊ɾΠϯελϯε໊͕ಉ͡, severity: 'critical' # critical ͷ alert ͕͋Δ߹ɺ target_match: # warning ͷϚʔδ͞ΕͯऔΓѻΘΕΔ severity: 'warning' equal: ['alertname', 'instance'] hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 18
Alertmanager ͷԽʹ͍ͭͯ • Alertmanager ͷԽ -mesh ΦϓγϣϯΛ͏͜ͱͰՄೳ • جຊతʹશͯͷϊʔυͰࣗΛؚΊͯ -mesh.peer
Λෳճࢦఆ͢Δ • ex.) alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002 • TCP ͷ 6783 ൪ϙʔτͰ 001 ͱ 002 ͕ΓͱΓΛ։࢝͢Δ • Prometheus ͷ alerting ઃఆ߲ͷ targets ʹ 2 ͭͷ alertmanager Λهड़͢Δ • ෦తʹ weaveworks/mesh 9 ͕༻͞ΕͯԽ͕࣮ݱ͞Ε͍ͯΔ • gossip protocol (membership) Λ༻͍ͯ CAP ͷ AP Λຬ͍ͨͯ͠Δ • ωοτϫʔΫతʹஅ͞Εͨ߹ͳͲΞϥʔτ͕ॏෳͯ͠ૹΒΕͯ͘Δ 9 h$ps:/ /github.com/weaveworks/mesh hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 19
ઃఆྫ: Alertmanager ͱ࿈ܞ͢Δ Prometheus ͷઃఆ alerting: alertmanagers: - ec2_sd_configs: #
alertmanager ࣗମͷ - region: ap-northeast-1 # service discovery ग़དྷΔ port: 9093 relabel_configs: - source_labels: [__meta_ec2_instance_state] regex: ^running$ # running ͷͷ action: keep - source_labels: [__meta_ec2_tag_Role] regex: ^alertmanager$ # Role λά͕ alertmanager ʹͳ͍ͬͯΔͷ action: keep hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 20
Exporter ʹ͍ͭͯ • Prometheus ͕ Pull ͠ʹ͍͘ઌͷαʔόΛ Exporter ͱ͍͏ •
༻్औಘ͍ͨ͠ϝτϦΫεʹԠ༷ͯ͡ʑͳ Exporter ͕͋Δ 10 • node_exporter: Linux ͷඪ४తͳϝτϦΫε • mysqld_exporter: MySQL ͷඪ४తͳϝτϦΫε • nginx_exporter: nginx_status ͷϝτϦΫε • mtail: ϩάΛ tail ͰݟͯϝτϦΫεʹมͰ͖Δ • snmp_exporter: SNMP ͷ͔ΒϝτϦΫεʹมͰ͖Δ 10 h%ps:/ /github.com/prometheus/prometheus/wiki/Default-port-alloca<ons hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 21
Exporter Λࣗ࡞͢Δ • γϯϓϧͳ HTTP ͷ endpoint Λ༻ҙ͢Δ͚ͩͰ exporter ʹͳΔ
11 • 'metrics_name value\n' Λు͘ΤϯυϙΠϯτ͕͋Εྑ͍ • ΞϓϦέʔγϣϯݻ༗ͷϝτϦΫεͳͲ؆୯ʹऩूͰ͖Δ • جຊతʹ exporter ଆͰ raw ͳΛग़ͯ͠ Prometheus ଆͰूܭ͢ΔΑ͏ʹ͢Δ • ͘͠ protocol buffer ͷϑΥʔϚοτ͋Δ • ͳ͍ͷ࡞ΔࣄʹͳΔ͕ݴޠറΓ͕ͳ͘ϑΥʔϚοτ؆୯ͳͷͰ͘͠ͳ͍ • ࣮ࡍʹ API Gateway + Lambda Ͱ AWS ͷϝτϦΫεΛग़ྗ͢ΔΛ࡞ͬͨΓ͍ͯ͠Δ 11 h$ps:/ /prometheus.io/docs/instrumen4ng/exposi4on_formats/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 22
࣮ࡍʹPrometheusࣗମͷϝτϦΫεΛோΊ༷ͨࢠ $ curl localhost:9090/metrics # HELP go_gc_duration_seconds A summary of
the GC invocation durations. # TYPE go_gc_duration_seconds summary go_gc_duration_seconds{quantile="0"} 5.9729e-05 go_gc_duration_seconds{quantile="0.25"} 9.75e-05 go_gc_duration_seconds{quantile="0.5"} 0.000117034 go_gc_duration_seconds{quantile="0.75"} 0.000157237 go_gc_duration_seconds{quantile="1"} 0.0067897 go_gc_duration_seconds_sum 10.408703235 go_gc_duration_seconds_count 33117 # HELP go_goroutines Number of goroutines that currently exist. # TYPE go_goroutines gauge go_goroutines 54 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 23
Digression: Exporter ϙʔτരൃ • Prometheus ͷ Wiki 9 ΛݟΕ͔Δ௨Γ 1
Exporter 1 ϙʔτΛ͏ • 1 ͭͷΠϯελϯεʹෳͷ Exporter ΛೖΕΔͱϙʔτΛͨ͘͞Μ͏ • sg ͳͲͷϑΝΠΞΥʔϧͷઃఆΛ͢Δͷ໘ • ͋·ΓෳͷϙʔτΛ Prometheus ʹ͚ͯެ։͢Δඞཁͳ͍ • rrreeeyyy/exporter_proxy 12 ͳͲΛͬͯղܾ͢Δ • ಛఆͷϙʔτΛͬͯ Prometheus ଆͷ metrics_path Λར༻ͯ͠ Exporter Λผ͢Δ 12 h%ps:/ /github.com/rrreeeyyy/exporter_proxy 9 h$ps:/ /github.com/weaveworks/mesh hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 24
PromQL ʹ͍ͭͯ • Prometheus Ͱ࣌ܥྻσʔλΛॲཧ͢ΔͨΊʹ༻͢ΔΫΤϦݴޠ • ׳ΕΔ·Ͱ͘͠ײ͡Δ͕׳ΕΔͱදݱྗ͕ߴ͘ศར • ೖެࣜυΩϡϝϯτͱݸਓతʹ DigitalOcean
ͷࢿྉ͕ྑ͔ͬͨ 13 14 • Alering PromQL Λར༻ͯ͠ߦ͏ • ౷ܭతʹॲཧͨ݁͠ՌͷΞϥʔτϧʔϧͳͲ͕ ॻ͚Δ • Aler:ng ͷ࣌ irate() Ͱͳ͘ rate() Λͬͨ΄͏͕ྑ͍ͳͲͷҙ͋Δ 14 h%ps:/ /www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2 13 h%ps:/ /www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 25
CPU ༻Λܭࢉ͢Δ PromQL • node_exporter Ͱऩूͨ͠ϗετ୯Ґͷ CPU ༻࣍ͷΑ͏ʹॻ͚Δ 15 •
100% ͔Β idle ͷΛҾ͍ͯΠϯελϯεΛج४ʹͯ͠ฏۉΛऔΔ • node_cpu ʹ CPU ίΞຖͷ͕ೖ͍ͬͯΔͨΊ • Alert Rule ʹ͢Δ߹ irate Λ rate ʹ͠ɺඌʹ >60 ͷᮢΛॻ͘ 100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100) 15 h%ps:/ /www.robustpercep3on.io/understanding-machine-cpu-usage/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 26
Disk ༻ͷΞϥʔτΛग़͢ Alert rule ઃఆ 16 - name: node.rules rules:
- alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0 for: 5m labels: severity: page • predict_linear ͷઢܗճؼ͕͑ΔͷͰ 4 ࣌ؒޙʹσΟ εΫ༰ྔ͕ 0 ҎԼʹͳΔΑ͏ͳͷΛΞϥʔτग़དྷΔ 16 h%ps:/ /www.robustpercep3on.io/reduce-noise-from-disk-space-alerts/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 27
Digression: Rule ϑΝΠϧͷཧʹ͍ͭͯ • Alert rule ͷཧΛ Prometheus Ͱߦ͏ඞཁ͕͋Δ •
Rule ϑΝΠϧγϯϓϧͳ YAML Ͱॻ͔ΕΔ • Zabbix ͳͲ͔ΒݟΔͱػೳ໘ʹෆΛײ͡Δ • Role Template Macro ͕͍͍ͨ... • ਖ਼ͳͱ͜ΖΉ͠Ζ͓͍ͷօ͞Μ͕Ͳ͏ཧ͍ͯ͠Δͷ͔Γ͍ͨ • WebUI (Promgen ͱ͔ʁ) ͕ݱঢ়༗ྗͳؾ͢Δ • τϦοΩʔͳ͜ͱͤͣγϯϓϧʹ͠Ζɺͱ͍͏ҙݟΘ͔Δ • Kubernetes ʹର͢Δ ksonnet ͷΑ͏ʹ jsonnet Ͱॻ͍ͯΈΔͱ͍͏Ҋ͋Γͦ͏ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 28
·ͱΊ • Prometheus Λຊ൪ʹಋೖ͢Δʹ͋ͨͬͯߟ͑ΔࣄΛઆ໌͠·ͨ͠ • Խɾεέʔϧઓུɾσʔλอ࣋ظؒͳͲ • Alertmanager Λຊ൪ಋೖ͢Δʹ͋ͨͬͯߟ͑ΔࣄΛઆ໌͠·ͨ͠ •
Խɾ࣮ࡍͷઃఆͳͲɹ • Exporter ͷࣗ࡞ PromQL ʹ͍ͭͯઆ໌͠·ͨ͠ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 29