Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Prometheus 実践入門 #hbstudy 79 / introduction-to-p...
Search
rrreeeyyy
November 21, 2017
Technology
16
3.3k
Prometheus 実践入門 #hbstudy 79 / introduction-to-prometheus-practice
#hbstudy 79 で Prometheus の話をしました
rrreeeyyy
November 21, 2017
Tweet
Share
More Decks by rrreeeyyy
See All by rrreeeyyy
Incident Response Practices: Waroom's Features and Future Challenges
rrreeeyyy
0
180
An Efficient Incident Response Training with AI / SRE NEXT 2024 Sponsor Session
rrreeeyyy
1
3.6k
カンファレンスから見る SRE トレンド 2024 / SRE Trends from Conferences in 2024 #SRE_Findy
rrreeeyyy
4
2.2k
信頼性の育て方 / mackerel-meetup-15
rrreeeyyy
10
2.4k
SRE の歩き方・進め方 / sre-walk-through-procedure
rrreeeyyy
0
8.6k
「信頼性」を保ちつつ大規模サービスをリニューアルする / cookpad-tech-kitchen-service-embedded-sres
rrreeeyyy
11
12k
Cookpad and Prometheus
rrreeeyyy
6
20k
SRE-Lounge-8-Cookpad-Microservice-Architecture-Overview
rrreeeyyy
5
5.3k
A survey of anomaly detection methodologies for web system
rrreeeyyy
5
1.2k
Other Decks in Technology
See All in Technology
CustomCopを使ってMongoidのコーディングルールを整えてみた
jinoketani
0
220
マイクロサービスにおける容易なトランザクション管理に向けて
scalar
0
110
日本版とグローバル版のモバイルアプリ統合の開発の裏側と今後の展望
miichan
1
120
Fanstaの1年を大解剖! 一人SREはどこまでできるのか!?
syossan27
2
160
podman_update_2024-12
orimanabu
1
260
LINEスキマニにおけるフロントエンド開発
lycorptech_jp
PRO
0
330
OpenShift Virtualizationのネットワーク構成を真剣に考えてみた/OpenShift Virtualization's Network Configuration
tnk4on
0
130
なぜCodeceptJSを選んだか
goataka
0
150
株式会社ログラス − エンジニア向け会社説明資料 / Loglass Comapany Deck for Engineer
loglass2019
3
31k
サイボウズフロントエンドエキスパートチームについて / FrontendExpert Team
cybozuinsideout
PRO
5
38k
LINE Developersプロダクト(LIFF/LINE Login)におけるフロントエンド開発
lycorptech_jp
PRO
0
120
ずっと昔に Star をつけたはずの思い出せない GitHub リポジトリを見つけたい!
rokuosan
0
150
Featured
See All Featured
GraphQLの誤解/rethinking-graphql
sonatard
67
10k
Practical Tips for Bootstrapping Information Extraction Pipelines
honnibal
PRO
10
810
Build The Right Thing And Hit Your Dates
maggiecrowley
33
2.4k
Build your cross-platform service in a week with App Engine
jlugia
229
18k
Templates, Plugins, & Blocks: Oh My! Creating the theme that thinks of everything
marktimemedia
28
2.1k
The Success of Rails: Ensuring Growth for the Next 100 Years
eileencodes
44
6.9k
Speed Design
sergeychernyshev
25
670
How GitHub (no longer) Works
holman
311
140k
Side Projects
sachag
452
42k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
Helping Users Find Their Own Way: Creating Modern Search Experiences
danielanewman
29
2.3k
XXLCSS - How to scale CSS and keep your sanity
sugarenia
247
1.3M
Transcript
Prometheus ࣮ફೖ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy )
1
Agenda • Prometheus ʹ͍ͭͯ • Prometheus ͷԽʹ͍ͭͯ • Prometheus ͷεέʔϧઓུʹ͍ͭͯ
• Prometheus ͷσʔλอ࣋ظؒʹ͍ͭͯ • Alertmanager ʹ͍ͭͯ • Alertmanager ͷԽʹ͍ͭͯ • Exporter ʹ͍ͭͯ • ࣮ࡍͷࢹͰ͑ͦ͏ͳ Exporter ʹ͍ͭͯ • Rule ϑΝΠϧͷཧʹ͍ͭͯ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 2
Prometheus ʹ͍ͭͯ • Prometheus OSS ͷϞχλϦϯάπʔϧ • ݱࡏͷ࠷৽όʔδϣϯ 2.0.0
(11/8 ϦϦʔε) • Google ʹଘࡏ͍ͯ͠Δ Borgmon ͱ͍͏ϞχλϦϯάπʔϧʹΠϯεύΠΞ͞Ε͍ͯΔ • Borgmon ʹ͍ͭͯ SRE ຊ 10 ষΛಡΉͱৄ͘͠ॻ͍ͯ͋Δ • ࣍ͷΑ͏ͳಛ͕͋Δ • Pull ܕͷΞʔΩςΫνϟ • ͦΕͳΓʹߴͳ࣌ܥྻσʔλϕʔε • PromQL ʹΑΔϓϩάϥϚϒϧͳ࣌ܥྻσʔλॲཧ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 3
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 4
ͳͥ Prometheus Λબ͢Δͷ͔ • ߴ͍ղೳͰͷϝτϦΫεͷอଘʹ͑ΒΕΔ • Pull ܕͷΞʔΩςΫνϟͰൺֱత୯७ͳߏͰӡ༻Ͱ͖Δ • Service
Discovery ͕ॆ࣮͍ͯ͠Δ • PromQL ͷදݱྗ͕ߴ༷͘ʑͳ౷ܭ͕औΕΔ • CNCF ೖΓΛՌͨ͠ Kubernetes ͷ࿈ܞՄೳ • σϑΝΫτͱ͞ΕΔπʔϧͱͷ࿈ܞ͕Մೳͳͷॏཁ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 5
Prometheus ͷઃఆʹ͍ͭͯ • ΠϯετʔϧجຊతʹόΠφϦΛஔ͚ͩ͘ • ࢹ͢ΔରΛ scrape_configs Ͱॻ͍͍͚ͯͩ͘ • جຊతʹ૿ݮʹରԠͰ͖ΔΑ͏ʹ
*_sd_config Λ͏Α͏ʹ͢Δ • ରԠ͢Δ sd ͕ͳ͍࣌ file_sd_config ͰସͰ͖ΔՄೳੑ͕͋Δ • ࢦఆͷϑΥʔϚοτͰϑΝΠϧʹॻ͖ࠐΜͰஔ͘ͱ reload ແ͠ͰಡΜͰ͘ΕΔ • μογϡϘʔυͳͲجຊతʹ Grafana Λͬͯ࡞ΔΑ͏ʹ͢Δ • Datasource Λ Prometheus ʹͯ͠ඳը͢ΔରΛ PromQL Ͱॻ͚Δ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 6
ઃఆྫ: EC2 ͷΠϯελϯε༻ͷઃఆ - job_name: 'node' ec2_sd_configs: - region: ap-northeast-1
port: 9100 relabel_configs: - source_labels: [__meta_ec2_instance_state] regex: ^running$ # running ͷ͚ͩ action: keep - source_labels: [__meta_ec2_tag_Role] regex: ^(app|db)$ # Role λάʹ app, db ͕͍͍ͭͯΔͷ͚ͩ action: keep - source_labels: [__meta_ec2_tag_Name] # target_label Λࢦఆ͓ͯ͘͠ͱɺ target_label: instance # PromQL ͰͷߜΓࠐΈ݅ͱͯ͠ɺ - source_labels: [__meta_ec2_tag_Role] # ઃఆͨ͠ϥϕϧΛར༻Ͱ͖ΔΑ͏ʹͳΔ target_label: role - source_labels: [__meta_ec2_tag_Status] target_label: status - source_labels: [__meta_ec2_instance_type] target_label: instance_type - source_labels: [__meta_ec2_availability_zone] target_label: availability_zone - source_labels: [__meta_ec2_vpc_id] target_label: vpc_id hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 7
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 8
hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 9
Prometheus ͷԽʹ͍ͭͯ • Prometheus ͷԽ୯७ʹαʔόΛ 2 ىಈ͢Δ͚ͩ 1 • Pull
ܕͳͷͰ 2 ىಈ͓͚ͯͩ͘͠ͰԽʹͳΔ • σʔλ࠷େͰ scrape_interval ͕ͣΕ͚ͨͩͣΕΔ • ݱ࣮ʹʹͳΔ͜ͱগͳ͍ • ࣮ࡍʹϑϩϯτʹ Nginx Λઃஔͯ͠ยํ͕མͪͨΒ͏ยํ͕ࢀর͞ΕΔΑ͏ʹ͢Δ • άϥϑͷඳըʹ͏ Grafana ͕ࢀর͢ΔσʔλιʔεΛ Nginx ͷϗετʹઃఆ͢Δ 1 h$ps:/ /github.com/prometheus/prometheus/issues/1500 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 10
Prometheus ͷεέʔϧઓུʹ͍ͭͯ • ϝτϦΫε͕ඦສ͙Β͍·Ͱ 1 ηοτͰेࡹ͚Δͣ • ૿͖͑ͯͨ߹ DC োυϝΠϯຖʹ
1 ηοτͣͭ Prometheus Λ༻ҙ͢Δ 2 • ෳͷ Prometheus Λ༻ҙͨ͠߹ϑΣσϨʔγϣϯΛߦ͏͜ͱ͕ग़དྷΔ • ԼҐͷ Prometheus ͷ /federate ΤϯυϙΠϯτΛεΫϨΠϓ͢Δ • େମͷ߹ԼҐͷ Prometheus Ͱ Record Λ͍σʔλΛू্ͨ͠ͰϑΣσϨʔγϣϯ͢Δ • ͘͠ Grafana Ͱࢀর͢ΔσʔλιʔεΛ͚ΔͳͲ͕ߟ͑ΒΕΔ • ྫ͑ CloudFlare ͰίϩέʔγϣϯຖʹσʔλΛूͯ͠ϑΣσϨʔγϣϯ͍ͯ͠Δ 3 3 h$ps:/ /promcon.io/2017-munich/slides/monitoring-cloudflares-planet-scale-edge-network-with-prometheus.pdf 2 h$ps:/ /www.robustpercep2on.io/scaling-and-federa2ng-prometheus/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 11
Digression: Record ʹ͍ͭͯ • Prometheus Ͱ Recording rule ͱ͍͏ͷΛఆٛग़དྷΔ 4
• Recording rule ఆٛͨ͠ PromQL ΛҰఆִؒͰ࣮ߦͰ͖Δ • ࣮ߦ݁ՌΛผͷ໊લͷ࣌ܥྻσʔλͱͯ͠อଘ͢Δ͜ͱ͕ग़དྷΔ • ࣌ܥྻσʔλͷαϯϓϦϯάϑΣσϨʔγϣϯ࣌ͷूʹ͏ • ࣮ߦִؒ Rule ͷ interval ͔ evaluation_interval Ͱܾఆ͞ΕΔ • Record Ͱఆٛͨ͠ Alert rule Ͱར༻Մೳ 4 h$ps:/ /prometheus.io/docs/prometheus/latest/configura8on/recording_rules/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 12
Digression: Record/Alert ͷྫ groups: - name: mysql.rules rules: - record:
mysql_slave_lag_seconds expr: mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay - alert: MySQLReplicationLag expr: (mysql_slave_lag_seconds > 30) and ON(instance) (predict_linear(mysql_slave_lag_seconds[5m], 60 * 2) > 0) for: 1m labels: severity: critical annotations: description: The mysql slave replication has fallen behind and is not recovering summary: MySQL slave replication is lagging hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 13
Prometheus ͷσʔλอ࣋ظؒʹ͍ͭͯ • Prometheus ࣌ܥྻσʔλΛظؒอଘ͢Δͷʹ͋·Γద͍ͯ͠ͳ͍ 5 • ߴͳΫΤϦॲཧΛ࣮ݱ͢ΔͨΊͷΞʔΩςΫνϟ্ͷ੍ • σϑΥϧτͰͷ࣌ܥྻσʔλͷอ࣋ظؒ
15 ؒ • Long-term storage ͱ͍͏ผͷετϨʔδʹσʔλΛอଘ͢Δํ͕ࣜਪ͞Ε͍ͯΔ 6 • ࣮ࡍʹ HTTP Ͱ protocol buffer ͷσʔλ͕ඈΜͰདྷΔ͚ͩ • InfluxDB S3 Chronix Λ remote storage ͱ͢Δ࣮͕ଘࡏ͍ͯ͠Δ • Prometheus ͷઃఆͷ remote_read remote_write Ͱઃఆ͢Δ • ͘͠ storage.tsdb.retention Λͨ͘͠ prometheus ʹ federaFon ͤ͞ΔͳͲ 6 h$ps:/ /prometheus.io/docs/prometheus/latest/storage/#remote-storage-integra9ons 5 h$p:/ /techlife.cookpad.com/entry/7meseries-database-001 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 14
Alertmanager ʹ͍ͭͯ • Prometheus ͷ alert Λड͚औΓϋϯυϧͯ͘͠ΕΔͷ 7 • Ξϥʔτͷάϧʔϐϯά,
ϧʔςΟϯά, ॏෳഉআ͕ग़དྷΔ • Ξϥʔτͷݕࡧɾ௨ͷࢭͳͲ͕ WebUI / amtool ίϚϯυ͔ΒՄೳ • Prometheus Ͱͳ࣮ͯ͘ಈ͘ 8 • /api/v1/alerts ΤϯυϙΠϯτʹ JSON Λ POST ͍ͯ͠Δ͚ͩ 8 h$ps:/ /prometheus.io/docs/aler5ng/clients/ 7 h$ps:/ /prometheus.io/docs/aler5ng/alertmanager/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 15
Alertmanager ͷઃఆʹ͍ͭͯ • ΠϯετʔϧجຊతʹόΠφϦΛஔ͚ͩ͘ • Ξϥʔτͷ௨ϧʔϧɾॏෳഉআϧʔϧͳͲΛهड़͢Δ • Ξϥʔτͷϧʔϧࣗମ Prometheus ͷํʹఆٛ͢Δ
• Prometheus ͷ Rule Ͱఆ༷ٛͨ͠ʑͳϥϕϧ͕ར༻Մೳ • جຊతʹϥϕϧͷΛݩʹͯ͠௨ઌΛܾఆ͢Δ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 16
Prometheus ͷ Alert rule ͷઃఆྫ groups: - name: linux rules:
- alert: InstanceDown # AlertnameɻҰൠతʹ Grouping ͰΘΕΔ expr: up == 0 # ࣮ࡍʹΞϥʔτͷᮢͱͯ͠ΘΕΔ PromQL ͷ for: 1m # 1 ؒҎ্ܧଓͨ͠߹ʹ alertmanager ʹΔ labels: # ͜ͷ͕ Alertmanager ଆͰར༻Մೳ severity: CRITICAL annotations: # Slack Ͱ௨͞ΕΔࡍʹ annotations ͕ར༻͞ΕΔɻ description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.' summary: Instance {{ $labels.instance }} down - alert: CPUUtilization expr: 100 - (avg(rate(node_cpu{job="node",mode="idle"}[1m])) BY (instance) * 100) > 60 for: 1m labels: severity: CRITICAL annotations: description: '{{ $labels.instance }} has been use high cpu more than 1 minutes.' summary: Instance {{ $labels.instance }} cpu utilization is high hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 17
global: resolve_timeout: 5m route: group_by: ['alertname', 'instance'] # receiver ʹ௨͢Δ݅ʹઃఆ
group_wait: 30s # ࠷ॳͷॏෳഉআͷͨΊʹͭඵ group_interval: 5m # άϧʔϓʹରͯ͠௨Λߦ͏ִؒ # ࠷ॳ 30 ඵͬͯ௨->Ҏޙ৽͍͠Ξϥʔτ͕͋Ε 5 ຖʹ௨ repeat_interval: 1h # ࠶ૹ͞ΕΔ·Ͱͷ࣌ؒ(resolve ͍ͯ͠ͳ͚ΕԿͳ͘ͱ 1h ຖʹ௨) routes: # ΞϥʔτͷϧʔςΟϯάͷઃఆ - match_re: # Rule Ͱઃఆͨ͠λάʹରͯ͠ϧʔςΟϯάΛॻ͚Δ service: ^sre$ receiver: 'sre-pagerduty' receivers: # ΞϥʔτΛड͚औΔରͷઃఆ - name: 'sre-page' # webhook, email, pagerduty ͕͑Δ pagerduty_configs: - service_key: xxxxxxxxxxxxxxxxxxxxxxxx inhibit_rules: # Ξϥʔτͷॏෳഉআͷઃఆ - source_match: # طʹΞϥʔτ໊ɾΠϯελϯε໊͕ಉ͡, severity: 'critical' # critical ͷ alert ͕͋Δ߹ɺ target_match: # warning ͷϚʔδ͞ΕͯऔΓѻΘΕΔ severity: 'warning' equal: ['alertname', 'instance'] hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 18
Alertmanager ͷԽʹ͍ͭͯ • Alertmanager ͷԽ -mesh ΦϓγϣϯΛ͏͜ͱͰՄೳ • جຊతʹશͯͷϊʔυͰࣗΛؚΊͯ -mesh.peer
Λෳճࢦఆ͢Δ • ex.) alertmanager -mesh.peer alertmanager-001 -mesh.peer alertmanager-002 • TCP ͷ 6783 ൪ϙʔτͰ 001 ͱ 002 ͕ΓͱΓΛ։࢝͢Δ • Prometheus ͷ alerting ઃఆ߲ͷ targets ʹ 2 ͭͷ alertmanager Λهड़͢Δ • ෦తʹ weaveworks/mesh 9 ͕༻͞ΕͯԽ͕࣮ݱ͞Ε͍ͯΔ • gossip protocol (membership) Λ༻͍ͯ CAP ͷ AP Λຬ͍ͨͯ͠Δ • ωοτϫʔΫతʹஅ͞Εͨ߹ͳͲΞϥʔτ͕ॏෳͯ͠ૹΒΕͯ͘Δ 9 h$ps:/ /github.com/weaveworks/mesh hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 19
ઃఆྫ: Alertmanager ͱ࿈ܞ͢Δ Prometheus ͷઃఆ alerting: alertmanagers: - ec2_sd_configs: #
alertmanager ࣗମͷ - region: ap-northeast-1 # service discovery ग़དྷΔ port: 9093 relabel_configs: - source_labels: [__meta_ec2_instance_state] regex: ^running$ # running ͷͷ action: keep - source_labels: [__meta_ec2_tag_Role] regex: ^alertmanager$ # Role λά͕ alertmanager ʹͳ͍ͬͯΔͷ action: keep hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 20
Exporter ʹ͍ͭͯ • Prometheus ͕ Pull ͠ʹ͍͘ઌͷαʔόΛ Exporter ͱ͍͏ •
༻్औಘ͍ͨ͠ϝτϦΫεʹԠ༷ͯ͡ʑͳ Exporter ͕͋Δ 10 • node_exporter: Linux ͷඪ४తͳϝτϦΫε • mysqld_exporter: MySQL ͷඪ४తͳϝτϦΫε • nginx_exporter: nginx_status ͷϝτϦΫε • mtail: ϩάΛ tail ͰݟͯϝτϦΫεʹมͰ͖Δ • snmp_exporter: SNMP ͷ͔ΒϝτϦΫεʹมͰ͖Δ 10 h%ps:/ /github.com/prometheus/prometheus/wiki/Default-port-alloca<ons hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 21
Exporter Λࣗ࡞͢Δ • γϯϓϧͳ HTTP ͷ endpoint Λ༻ҙ͢Δ͚ͩͰ exporter ʹͳΔ
11 • 'metrics_name value\n' Λు͘ΤϯυϙΠϯτ͕͋Εྑ͍ • ΞϓϦέʔγϣϯݻ༗ͷϝτϦΫεͳͲ؆୯ʹऩूͰ͖Δ • جຊతʹ exporter ଆͰ raw ͳΛग़ͯ͠ Prometheus ଆͰूܭ͢ΔΑ͏ʹ͢Δ • ͘͠ protocol buffer ͷϑΥʔϚοτ͋Δ • ͳ͍ͷ࡞ΔࣄʹͳΔ͕ݴޠറΓ͕ͳ͘ϑΥʔϚοτ؆୯ͳͷͰ͘͠ͳ͍ • ࣮ࡍʹ API Gateway + Lambda Ͱ AWS ͷϝτϦΫεΛग़ྗ͢ΔΛ࡞ͬͨΓ͍ͯ͠Δ 11 h$ps:/ /prometheus.io/docs/instrumen4ng/exposi4on_formats/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 22
࣮ࡍʹPrometheusࣗମͷϝτϦΫεΛோΊ༷ͨࢠ $ curl localhost:9090/metrics # HELP go_gc_duration_seconds A summary of
the GC invocation durations. # TYPE go_gc_duration_seconds summary go_gc_duration_seconds{quantile="0"} 5.9729e-05 go_gc_duration_seconds{quantile="0.25"} 9.75e-05 go_gc_duration_seconds{quantile="0.5"} 0.000117034 go_gc_duration_seconds{quantile="0.75"} 0.000157237 go_gc_duration_seconds{quantile="1"} 0.0067897 go_gc_duration_seconds_sum 10.408703235 go_gc_duration_seconds_count 33117 # HELP go_goroutines Number of goroutines that currently exist. # TYPE go_goroutines gauge go_goroutines 54 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 23
Digression: Exporter ϙʔτരൃ • Prometheus ͷ Wiki 9 ΛݟΕ͔Δ௨Γ 1
Exporter 1 ϙʔτΛ͏ • 1 ͭͷΠϯελϯεʹෳͷ Exporter ΛೖΕΔͱϙʔτΛͨ͘͞Μ͏ • sg ͳͲͷϑΝΠΞΥʔϧͷઃఆΛ͢Δͷ໘ • ͋·ΓෳͷϙʔτΛ Prometheus ʹ͚ͯެ։͢Δඞཁͳ͍ • rrreeeyyy/exporter_proxy 12 ͳͲΛͬͯղܾ͢Δ • ಛఆͷϙʔτΛͬͯ Prometheus ଆͷ metrics_path Λར༻ͯ͠ Exporter Λผ͢Δ 12 h%ps:/ /github.com/rrreeeyyy/exporter_proxy 9 h$ps:/ /github.com/weaveworks/mesh hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 24
PromQL ʹ͍ͭͯ • Prometheus Ͱ࣌ܥྻσʔλΛॲཧ͢ΔͨΊʹ༻͢ΔΫΤϦݴޠ • ׳ΕΔ·Ͱ͘͠ײ͡Δ͕׳ΕΔͱදݱྗ͕ߴ͘ศར • ೖެࣜυΩϡϝϯτͱݸਓతʹ DigitalOcean
ͷࢿྉ͕ྑ͔ͬͨ 13 14 • Alering PromQL Λར༻ͯ͠ߦ͏ • ౷ܭతʹॲཧͨ݁͠ՌͷΞϥʔτϧʔϧͳͲ͕ ॻ͚Δ • Aler:ng ͷ࣌ irate() Ͱͳ͘ rate() Λͬͨ΄͏͕ྑ͍ͳͲͷҙ͋Δ 14 h%ps:/ /www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2 13 h%ps:/ /www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-1 hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 25
CPU ༻Λܭࢉ͢Δ PromQL • node_exporter Ͱऩूͨ͠ϗετ୯Ґͷ CPU ༻࣍ͷΑ͏ʹॻ͚Δ 15 •
100% ͔Β idle ͷΛҾ͍ͯΠϯελϯεΛج४ʹͯ͠ฏۉΛऔΔ • node_cpu ʹ CPU ίΞຖͷ͕ೖ͍ͬͯΔͨΊ • Alert Rule ʹ͢Δ߹ irate Λ rate ʹ͠ɺඌʹ >60 ͷᮢΛॻ͘ 100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100) 15 h%ps:/ /www.robustpercep3on.io/understanding-machine-cpu-usage/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 26
Disk ༻ͷΞϥʔτΛग़͢ Alert rule ઃఆ 16 - name: node.rules rules:
- alert: DiskWillFillIn4Hours expr: predict_linear(node_filesystem_free{job="node"}[1h], 4 * 3600) < 0 for: 5m labels: severity: page • predict_linear ͷઢܗճؼ͕͑ΔͷͰ 4 ࣌ؒޙʹσΟ εΫ༰ྔ͕ 0 ҎԼʹͳΔΑ͏ͳͷΛΞϥʔτग़དྷΔ 16 h%ps:/ /www.robustpercep3on.io/reduce-noise-from-disk-space-alerts/ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 27
Digression: Rule ϑΝΠϧͷཧʹ͍ͭͯ • Alert rule ͷཧΛ Prometheus Ͱߦ͏ඞཁ͕͋Δ •
Rule ϑΝΠϧγϯϓϧͳ YAML Ͱॻ͔ΕΔ • Zabbix ͳͲ͔ΒݟΔͱػೳ໘ʹෆΛײ͡Δ • Role Template Macro ͕͍͍ͨ... • ਖ਼ͳͱ͜ΖΉ͠Ζ͓͍ͷօ͞Μ͕Ͳ͏ཧ͍ͯ͠Δͷ͔Γ͍ͨ • WebUI (Promgen ͱ͔ʁ) ͕ݱঢ়༗ྗͳؾ͢Δ • τϦοΩʔͳ͜ͱͤͣγϯϓϧʹ͠Ζɺͱ͍͏ҙݟΘ͔Δ • Kubernetes ʹର͢Δ ksonnet ͷΑ͏ʹ jsonnet Ͱॻ͍ͯΈΔͱ͍͏Ҋ͋Γͦ͏ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 28
·ͱΊ • Prometheus Λຊ൪ʹಋೖ͢Δʹ͋ͨͬͯߟ͑ΔࣄΛઆ໌͠·ͨ͠ • Խɾεέʔϧઓུɾσʔλอ࣋ظؒͳͲ • Alertmanager Λຊ൪ಋೖ͢Δʹ͋ͨͬͯߟ͑ΔࣄΛઆ໌͠·ͨ͠ •
Խɾ࣮ࡍͷઃఆͳͲɹ • Exporter ͷࣗ࡞ PromQL ʹ͍ͭͯઆ໌͠·ͨ͠ hbstudy#79 (2017/11/20) | Yoshikawa Ryota ( @rrreeeyyy ) 29