rrreeeyyy
November 21, 2017
Technology
Introduction to Prometheus in Practice (Prometheus 実践入門) #hbstudy 79 / introduction-to-prometheus-practice
I gave a talk about Prometheus at #hbstudy 79.
Transcript
Introduction to Prometheus in Practice (Prometheus 実践入門)
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy )

Agenda
• About Prometheus
• Making Prometheus redundant
• Scaling strategy for Prometheus
• Data retention period in Prometheus
• About Alertmanager
• Making Alertmanager redundant
• About exporters
• Exporters that are likely to be useful in real-world monitoring
• Managing rule files

About Prometheus
• Prometheus is an open-source monitoring tool.
• The latest version at the time of this talk is 2.0.0 (released 11/8).
• It is inspired by Borgmon, a monitoring tool that exists inside Google.
  • Borgmon is described in detail in chapter 10 of the SRE book.
• Its main characteristics:
  • A pull-based architecture
  • A reasonably high-performance time series database
  • Programmable processing of time series data with PromQL

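Why choose Prometheus?
• It can sustain storing metrics at high resolution.
• The pull-based architecture lets you operate it with a comparatively simple setup.
• Service discovery support is extensive.
• PromQL is expressive enough to compute a wide range of statistics.
• It has joined the CNCF and integrates with Kubernetes and similar tools.
  • Being able to integrate with de facto standard tools matters.
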
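Configuring Prometheus
• Installation basically just means putting the binary in place.
• You just list the monitoring targets under scrape_configs.
• As a rule, use a *_sd_config (for example ec2_sd_configs) so that the configuration keeps up as the number of hosts grows and shrinks.
• When no service discovery mechanism fits, file_sd_config may be able to stand in.
  • Write the targets to a file in the prescribed format and Prometheus picks them up without a reload.
• Build dashboards with Grafana as a rule.
  • Set the datasource to Prometheus and write what to draw in PromQL.
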
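The file-based discovery bullet above can be sketched as follows. This is a minimal sketch, not the speaker's tooling: it assumes the JSON form of the file_sd format (a list of objects with `targets` and `labels`), and the host names, labels, and output path are made up for illustration. Writing to a temporary file and renaming it is one way to keep Prometheus from reading a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Atomically write target groups in the file_sd JSON format.

    `groups` is a list of {"targets": [...], "labels": {...}} dicts.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename it over the real one
    # so a reader never observes a partially written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(groups, tmp, indent=2)
    os.replace(tmp_path, path)


# Hypothetical hosts and labels, for illustration only.
example_groups = [
    {"targets": ["app-001:9100", "app-002:9100"], "labels": {"role": "app"}},
    {"targets": ["db-001:9100"], "labels": {"role": "db"}},
]
```

A cron job or deploy hook could call `write_file_sd("/etc/prometheus/targets/node.json", example_groups)`; the path is an assumption and has to match the `files` pattern given in the scrape configuration.
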
Configuration example: scraping EC2 instances

    - job_name: 'node'
      ec2_sd_configs:
        - region: ap-northeast-1
          port: 9100
      relabel_configs:
        - source_labels: [__meta_ec2_instance_state]
          regex: ^running$        # keep only running instances
          action: keep
        - source_labels: [__meta_ec2_tag_Role]
          regex: ^(app|db)$       # keep only instances whose Role tag is app or db
          action: keep
        # Setting target_label makes the resulting label usable
        # as a filter condition in PromQL.
        - source_labels: [__meta_ec2_tag_Name]
          target_label: instance
        - source_labels: [__meta_ec2_tag_Role]
          target_label: role
        - source_labels: [__meta_ec2_tag_Status]
          target_label: status
        - source_labels: [__meta_ec2_instance_type]
          target_label: instance_type
        - source_labels: [__meta_ec2_availability_zone]
          target_label: availability_zone
        - source_labels: [__meta_ec2_vpc_id]
          target_label: vpc_id

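To make the two `keep` rules above concrete, here is a small model of what a `keep` action does: the values of the source labels are joined with `;` and the target survives only if the whole joined string matches the regex. This is a simplified sketch in Python rather than Prometheus's own implementation, and the discovered instances are invented for illustration.

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Return the targets whose joined source label values fully match regex."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Labels that are missing are treated as empty strings.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


# Hypothetical discovered instances, for illustration only.
discovered = [
    {"__meta_ec2_instance_state": "running", "__meta_ec2_tag_Role": "app"},
    {"__meta_ec2_instance_state": "stopped", "__meta_ec2_tag_Role": "app"},
    {"__meta_ec2_instance_state": "running", "__meta_ec2_tag_Role": "batch"},
]

# Apply the two keep rules in order, as relabel_configs would.
running = keep(discovered, ["__meta_ec2_instance_state"], r"^running$")
selected = keep(running, ["__meta_ec2_tag_Role"], r"^(app|db)$")
```

Only the first instance ends up in `selected`: the second is not running and the third has a Role tag outside `app|db`.
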
Making Prometheus redundant
• Redundancy for Prometheus simply means running two servers. [1]
• Because it is pull-based, starting two servers is all it takes.
• The data on the two servers differs by at most one scrape_interval.
  • This is rarely a problem in practice.
• In practice, put Nginx in front so that if one server goes down, the other is queried instead.
• Point the Grafana datasource used for drawing graphs at the Nginx host.
[1] https://github.com/prometheus/prometheus/issues/1500

Scaling strategy for Prometheus
• Up to a volume on the order of millions of metrics, a single set should cope comfortably.
• When the volume grows beyond that, run one set of Prometheus per data center or failure domain. [2]
• With several Prometheus servers you can use federation.
  • A higher-level server scrapes the /federate endpoint of the lower-level servers.
  • In most cases, aggregate the data with recording rules on the lower-level servers first, then federate the aggregates.
• Another option is to split the datasources that Grafana refers to.
• Cloudflare, for example, aggregates data per colocation site and federates it. [3]
[2] https://www.robustperception.io/scaling-and-federating-prometheus/
[3] https://promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf

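As a rough illustration of scraping /federate, the sketch below builds the URL a higher-level server would request. The endpoint selects series through repeated `match[]` query parameters. The host name and both selectors are assumptions made up for this example, including the `job:` prefix used here for aggregated series.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build a /federate URL that selects series via repeated match[] params."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical lower-level server and selectors, for illustration only.
url = federate_url(
    "http://prometheus-dc1.example.com:9090",
    ['{__name__=~"job:.*"}', '{job="node"}'],
)
```

In a real setup the same selectors would go under `params` in a scrape job whose `metrics_path` is `/federate`; fetching only pre-aggregated series keeps the higher-level server small.
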
Digression: recording rules
• Prometheus lets you define what it calls recording rules. [4]
• A recording rule runs the PromQL expression you define at a fixed interval.
• The result can be stored as a time series under a different name.
• They are used to sample time series and to aggregate data for federation.
• The evaluation interval is determined by the rule group's interval or by evaluation_interval.
• Values defined by a recording rule can also be used in alert rules.
[4] https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/

Digression: recording rule and alert rule example

    groups:
      - name: mysql.rules
        rules:
          - record: mysql_slave_lag_seconds
            expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay
          - alert: MySQLReplicationLag
            expr: (mysql_slave_lag_seconds > 30) and ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0)
            for: 1m
            labels:
              severity: critical
            annotations:
              description: The mysql slave replication has fallen behind and is not recovering
              summary: MySQL slave replication is lagging

Data retention period in Prometheus
• Prometheus is not well suited to storing time series data for long periods. [5]
  • This is an architectural constraint that comes with fast query processing.
• By default, time series data is kept for 15 days.
• The recommended approach is to store the data in separate long-term storage. [6]
  • In practice, protocol buffer data simply arrives over HTTP.
  • Implementations exist that use InfluxDB, S3, or Chronix as remote storage.
  • It is configured with remote_read and remote_write in the Prometheus configuration.
• Another option is to federate into a Prometheus with a longer storage.tsdb.retention.
[5] http://techlife.cookpad.com/entry/timeseries-database-001
[6] https://prometheus.io/docs/prometheus/latest/storage/#remote-storage-integrations

About Alertmanager
• It receives alerts from Prometheus and handles them. [7]
• It can group, route, and deduplicate alerts.
• Searching alerts and silencing notifications can be done from the web UI or the amtool command.
• It actually works without Prometheus as well. [8]
  • Clients simply POST JSON to the /api/v1/alerts endpoint.
[7] https://prometheus.io/docs/alerting/alertmanager/
[8] https://prometheus.io/docs/alerting/clients/

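Since any client can POST alerts, a sender can be sketched in a few lines. The example below only builds the request and does not send it. The host, alert name, and labels are hypothetical, and the path follows the /api/v1/alerts endpoint named in the talk; newer Alertmanager releases expose this API under /api/v2/alerts instead, so check the version you run.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request


def build_alert_request(alertmanager_url, alertname, labels, summary):
    """Build (but do not send) a POST request carrying one alert."""
    alert = {
        "labels": {"alertname": alertname, **labels},
        "annotations": {"summary": summary},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    body = json.dumps([alert]).encode("utf-8")  # the API takes a JSON list
    return Request(
        f"{alertmanager_url.rstrip('/')}/api/v1/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and labels, for illustration only.
request = build_alert_request(
    "http://alertmanager-001:9093",
    "BatchJobFailed",
    {"instance": "batch-001", "severity": "critical"},
    "Nightly batch job failed",
)
```

Passing `request` to `urllib.request.urlopen` would deliver the alert. A client is expected to keep re-sending an alert for as long as it is firing.
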
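Configuring Alertmanager
• Installation basically just means putting the binary in place.
• You describe notification rules, deduplication rules, and so on.
• The alert rules themselves are defined on the Prometheus side.
• The labels defined in the Prometheus rules are available in Alertmanager.
• Notification destinations are basically decided from label values.
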
Example alert rule configuration for Prometheus

    groups:
      - name: linux
        rules:
          - alert: InstanceDown     # the alertname; commonly used for grouping
            expr: up == 0           # the PromQL expression used as the alert threshold
            for: 1m                 # sent to Alertmanager once it has held for 1 minute
            labels:                 # these values are available on the Alertmanager side
              severity: critical
            annotations:            # used when notifying via Slack and similar channels
              description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
              summary: 'Instance {{ $labels.instance }} down'
          - alert: CPUUtilization
            expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60
            for: 1m
            labels:
              severity: critical
            annotations:
              description: '{{ $labels.instance }} has had high CPU usage for more than 1 minute.'
              summary: 'Instance {{ $labels.instance }} CPU utilization is high'

Example Alertmanager configuration

    global:
      resolve_timeout: 5m
    route:
      receiver: 'sre-pagerduty'             # default receiver (required on the root route)
      group_by: ['alertname', 'instance']   # alerts sharing these labels go out as one notification
      group_wait: 30s        # how long to wait at first so that duplicates can be batched
      group_interval: 5m     # interval between notifications for a group
                             # wait 30 s, notify, then notify every 5 min if new alerts arrive
      repeat_interval: 1h    # time until a re-send (if unresolved, notify every 1h even with no change)
      routes:                # alert routing
        - match_re:          # routing can be written against the labels set in the rules
            service: ^sre$
          receiver: 'sre-pagerduty'
    receivers:               # who receives the alerts
      - name: 'sre-pagerduty'   # webhook, email, pagerduty, and others are available
        pagerduty_configs:
          - service_key: xxxxxxxxxxxxxxxxxxxxxxxx
    inhibit_rules:           # alert deduplication
      - source_match:        # when a critical alert with the same alertname
          severity: 'critical'   # and instance already exists,
        target_match:        # the warning alert is folded into it
          severity: 'warning'
        equal: ['alertname', 'instance']

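The inhibit rule in the configuration above can be read as a predicate over label sets. The following is a simplified model that assumes exact-match matchers only; it is not Alertmanager's actual code, and the three alerts are invented for illustration.

```python
def is_inhibited(target, firing_alerts, source_match, target_match, equal):
    """Return True if `target` is muted by some firing source alert."""

    def matches(labels, matcher):
        return all(labels.get(key) == value for key, value in matcher.items())

    # Only alerts that match target_match can be muted at all.
    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # The source mutes the target only if all `equal` labels agree.
        if all(source.get(name) == target.get(name) for name in equal):
            return True
    return False


# Hypothetical alerts, for illustration only.
critical = {"alertname": "CPUUtilization", "instance": "app-001", "severity": "critical"}
warning_same = {"alertname": "CPUUtilization", "instance": "app-001", "severity": "warning"}
warning_other = {"alertname": "CPUUtilization", "instance": "app-002", "severity": "warning"}
firing = [critical, warning_same, warning_other]
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}
```

With this rule, `warning_same` is muted because a critical alert exists for the same alertname and instance, while `warning_other` still notifies.
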
Making Alertmanager redundant
• Alertmanager can be made redundant with the -mesh options.
• Basically, on every node, specify -mesh.peer once per node, including the node itself.
  • ex.) alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002
  • 001 and 002 start talking to each other on TCP port 6783.
  • List both Alertmanagers in targets under the alerting section of the Prometheus configuration.
• Internally, redundancy is implemented with weaveworks/mesh. [9]
  • It uses a gossip protocol (membership) and provides the AP side of CAP.
  • In situations such as a network partition, alerts can be delivered more than once.
[9] https://github.com/weaveworks/mesh

Configuration example: Prometheus settings for working with Alertmanager

    alerting:
      alertmanagers:
        - ec2_sd_configs:               # service discovery also works for
            - region: ap-northeast-1    # the Alertmanagers themselves
              port: 9093
          relabel_configs:
            - source_labels: [__meta_ec2_instance_state]
              regex: ^running$          # running instances only
              action: keep
            - source_labels: [__meta_ec2_tag_Role]
              regex: ^alertmanager$     # instances whose Role tag is alertmanager
              action: keep

About exporters
• The servers that Prometheus pulls from are called exporters.
• There are many exporters, depending on the use case and the metrics you want. [10]
  • node_exporter: standard Linux metrics
  • mysqld_exporter: standard MySQL metrics
  • nginx_exporter: metrics from nginx_status
  • mtail: tails logs and converts them into metrics
  • snmp_exporter: converts SNMP values into metrics
[10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations

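Writing your own exporter
• Providing a simple HTTP endpoint is enough to make an exporter. [11]
  • All you need is an endpoint that emits 'metrics_name value\n'.
• Application-specific metrics are easy to collect this way.
• As a rule, expose raw values on the exporter side and aggregate on the Prometheus side.
• A protocol buffer format also exists.
• If no exporter exists you have to write one, but there is no language restriction and the format is simple, so it is not hard.
• In practice we have built things such as an exporter that outputs AWS metrics using API Gateway + Lambda.
[11] https://prometheus.io/docs/instrumenting/exposition_formats/
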
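As a rough sketch of how little an exporter needs, the example below serves the text format from the Python standard library. The metric names, values, and port are made up for illustration; a real exporter would read live values, and the official client libraries are usually a better starting point than hand-rolled output.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(metrics):
    """Render {name: (type, help, value)} in the text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical application metrics, for illustration only.
METRICS = {
    "myapp_queue_length": ("gauge", "Jobs waiting in the queue.", 7),
    "myapp_jobs_processed_total": ("counter", "Jobs processed so far.", 1234),
}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve /metrics until interrupted; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at port 9999 is enough for Prometheus to start collecting both series. The values are raw, so any aggregation is left to PromQL, as the slide recommends.
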
What the metrics of Prometheus itself actually look like

    $ curl localhost:9090/metrics
    # HELP go_gc_duration_seconds A summary of the GC invocation durations.
    # TYPE go_gc_duration_seconds summary
    go_gc_duration_seconds{quantile="0"} 5.9729e-05
    go_gc_duration_seconds{quantile="0.25"} 9.75e-05
    go_gc_duration_seconds{quantile="0.5"} 0.000117034
    go_gc_duration_seconds{quantile="0.75"} 0.000157237
    go_gc_duration_seconds{quantile="1"} 0.0067897
    go_gc_duration_seconds_sum 10.408703235
    go_gc_duration_seconds_count 33117
    # HELP go_goroutines Number of goroutines that currently exist.
    # TYPE go_goroutines gauge
    go_goroutines 54

Digression: exporter port explosion
• As the Prometheus wiki [10] shows, each exporter uses one port.
• Putting several exporters on one instance uses up a lot of ports.
  • Configuring firewalls such as security groups for all of them is tedious.
  • There is little reason to expose that many ports to Prometheus.
• Tools such as rrreeeyyy/exporter_proxy [12] solve this.
  • It listens on one specific port and uses metrics_path on the Prometheus side to tell the exporters apart.
[10] https://github.com/prometheus/prometheus/wiki/Default-port-allocations
[12] https://github.com/rrreeeyyy/exporter_proxy

About PromQL
• The query language used to process time series data in Prometheus.
• It feels difficult until you get used to it, but after that it is expressive and convenient.
• For getting started, the official documentation and, personally, the DigitalOcean tutorials were good. [13] [14]
• Alerting is also done with PromQL.
  • You can write alert rules on statistically processed results.
  • There are caveats, such as preferring rate() over irate() for alerting.
[13] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1
[14] https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2

PromQL for computing CPU utilization
• Per-host CPU utilization collected by node_exporter can be written as follows. [15]
  • Subtract the idle value from 100% and average per instance.
  • The averaging is needed because node_cpu holds one value per CPU core.
• To turn it into an alert rule, change irate to rate and append a threshold such as > 60.

    100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)

[15] https://www.robustperception.io/understanding-machine-cpu-usage/

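The arithmetic behind this query can be worked through by hand. The sketch below approximates it for one instance: the per-core rate of the idle counter is the fraction of time spent idle, the average over cores gives the host-wide idle fraction, and subtracting from 100 gives utilization. It ignores counter resets and the extrapolation Prometheus applies, and the samples are made up.

```python
def cpu_usage_percent(idle_samples_by_core):
    """Approximate 100 - avg(rate(idle)) * 100 for one instance.

    `idle_samples_by_core` maps a core name to [(timestamp, idle_seconds), ...].
    Counter resets and Prometheus's extrapolation are ignored for simplicity.
    """
    idle_fractions = []
    for samples in idle_samples_by_core.values():
        (t0, v0), (t1, v1) = samples[0], samples[-1]
        idle_fractions.append((v1 - v0) / (t1 - t0))  # idle seconds per second
    average_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - average_idle * 100


# Made-up samples: over 60 s, cpu0 was idle for 45 s and cpu1 for 15 s.
samples = {
    "cpu0": [(0, 1000.0), (60, 1045.0)],
    "cpu1": [(0, 2000.0), (60, 2015.0)],
}
usage = cpu_usage_percent(samples)
```

Here the idle fractions are 0.75 and 0.25, the average is 0.5, and `usage` comes out at 50 percent, which is below the 60 percent alert threshold used in the talk.
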
Alert rule for disk usage [16]

    - name: node.rules
      rules:
        - alert: DiskWillFillIn4Hours
          expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0
          for: 5m
          labels:
            severity: page

• Linear regression functions such as predict_linear are available, so you can alert on filesystems whose free space is predicted to drop to 0 or below within 4 hours.
[16] https://www.robustperception.io/reduce-noise-from-disk-space-alerts/

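What predict_linear does in this rule can be sketched as an ordinary least-squares fit followed by extrapolation. This is a simplified model: it extrapolates from the last sample, whereas Prometheus extrapolates from the evaluation time, and the disk samples are invented for illustration.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares fit over (timestamp, value) samples, extrapolated
    `seconds_ahead` past the last sample's timestamp."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Made-up free-space samples: 20 GB free, shrinking by 1 GB every 10 minutes
# over the last hour.
free_bytes = [(t, 20e9 - (t / 600) * 1e9) for t in range(0, 3601, 600)]
predicted = predict_linear(free_bytes, 4 * 3600)
would_alert = predicted < 0
```

With 14 GB left and 6 GB consumed per hour, the fit predicts roughly -10 GB four hours out, so `would_alert` is true even though the disk is still 70 percent free in this example.
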
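Digression: managing rule files
• Alert rules have to be managed on the Prometheus side.
• Rule files are written in plain YAML.
• Coming from tools such as Zabbix, the feature set feels lacking.
  • I would like roles, templates, and macros...
• Honestly, I would rather hear how those of you who use it manage your rules.
• A web UI (Promgen, perhaps?) looks like the strongest option at the moment.
• I also understand the view that you should keep things simple and avoid tricks.
• Writing the rules in jsonnet, the way ksonnet does for Kubernetes, could be another option.
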
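Summary
• Explained what to consider when introducing Prometheus into production.
  • Redundancy, scaling strategy, data retention period, and so on
• Explained what to consider when introducing Alertmanager into production.
  • Redundancy, actual configuration, and so on
• Explained how to write your own exporter and how to use PromQL.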