Unit Testing for Prometheus Rules

Unit Testing for Prometheus Rules
Prometheus Meetup Tokyo
Takashi Kusumi

Agenda
▶ Operational challenges with Prometheus rules
▶ Unit testing with promtool
▶ Learning Prometheus through unit tests

Operational challenges with Prometheus rules

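Operational challenges with Prometheus rules
▶ Prometheus rules
+ Two kinds: alerting rules and recording rules
▶ Rules cannot be verified easily
+ Hard to write and change
+ Hard to review
+ Hard to learn
▶ The root cause is that preparing metric data is difficult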

Who will watch the watchmen?
▶ If a rule has a syntax error, Prometheus fails to start
▶ Take particular care when configuration is distributed by CD
▶ If you rely on configuration reloads, the failure only surfaces at the next restart
level=error error loading config from \"/etc/prometheus/config/ prometheus.yml\": one or more errors occurred while applying the new configuration (--config.file=\"/etc/prometheus/config/prometheus.yml\")"

It contains a whitespace character!

Promtool is watching you

Unit testing with promtool

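What is promtool?
▶ A CLI tool bundled with Prometheus
+ Also included in the Docker image and the Homebrew package
▶ Unit testing is implemented as a subcommand
+ promtool test rules <test-rule-file>
+ Metric data is easy to prepare
▶ Syntax checking of configuration and rule files is also handy
+ promtool check rules <rule-files>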

promtool test rules <test-rule-file>...
$ promtool test rules myalert-test.yaml
# Only the result is printed
Unit Testing: myalert-test.yaml
  SUCCESS
# The exit status is 0 when the tests pass
$ echo $?
0

When a test fails
$ promtool test rules myalert-test.yaml
Unit Testing: myalert-test.yaml
  FAILED:
    alertname:InstanceDown, time:5m0s,
        exp:"[Labels:{alertname=\"InstanceDown\", ..   # expected alert
        got:"[]"   # actual value (no alert fired)
$ echo $?   # the exit status is 1 when a test fails
1

Official documentation
https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/

How to write unit tests

The alert under test
groups:
  - name: example
    rules:
      # Fire an alert when scrapes keep failing for 5 minutes
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."

When does the alert fire?
Elapsed time  0m  1m  2m  3m  4m  5m  6m  7m  8m  9m  10m
up            1   1   0   0   0   0   0   0   0   1   1
▶ @2m: the InstanceDown alert becomes pending
▶ @7m: the InstanceDown alert fires (firing), once the condition has held for for: 5m
▶ @9m: the alert resolves
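The timeline above can be reproduced with a small simulation. This is only a sketch of the pending/firing logic under simplifying assumptions (one rule evaluation per sample, no missing data); `alert_states` is a hypothetical helper, not part of Prometheus or promtool.

```python
from typing import List, Optional


def alert_states(up_values: List[int], for_steps: int) -> List[str]:
    """Return the alert state at each step for a rule like `expr: up == 0`.

    Simplified model (assumption): the rule is evaluated once per sample,
    and `for_steps` is the `for` duration divided by the interval.
    """
    states: List[str] = []
    active_since: Optional[int] = None  # step at which the condition became true
    for step, value in enumerate(up_values):
        if value == 0:
            if active_since is None:
                active_since = step
            held = step - active_since
            states.append("firing" if held >= for_steps else "pending")
        else:
            active_since = None  # condition no longer holds: alert resolves
            states.append("inactive")
    return states


up = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # one value per minute, 0m..10m
states = alert_states(up, for_steps=5)   # for: 5m with a 1m interval
for minute, state in enumerate(states):
    print(f"{minute}m: {state}")
```

Under this model the alert is pending from 2m to 6m, firing at 7m and 8m, and inactive again from 9m, which is why the unit test evaluates at 7m.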

Unit test
rule_files:
  - myalert.yaml # load the rule file

tests: # write the tests here
  - interval: 1m # spacing between values (equivalent to scrape_interval)
    input_series: # define the assumed input time series
      - series: 'up{instance="myinstance", job="myjob"}'
        values: 1 1 0 0 0 0 0 0 0 1 1 # metric values

    alert_rule_test:
      - eval_time: 7m # when to evaluate
        alertname: InstanceDown # expected alert
        exp_alerts:
          - exp_labels: # labels expected on the alert
              instance: myinstance
              job: myjob
            exp_annotations: # annotations expected on the alert
              summary: 'Instance myinstance down'
              description: 'myinstance of job myjob has been down for more than 5 minutes.'

Testing that the alert does not fire
alert_rule_test:
  - eval_time: 6m
    alertname: InstanceDown
    exp_alerts: [] # an empty list confirms that no alert fires
  - eval_time: 7m
    # omitted: confirm that the alert fires
  - eval_time: 9m
    alertname: InstanceDown
    exp_alerts: [] # confirm that the alert has resolved

Shorthand notation for values

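Shorthand notation for values
values: 0 0 0 0 0 0 100 110 120 130 140 150 10 9 8
values: 0+0x5 100+10x5 10-1x2
Format: <initial value>+<increment>x<count> (use - instead of + to decrement)
Note: counting the initial value, each term expands to count + 1 values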
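To make the count + 1 rule concrete, here is a minimal sketch that expands the numeric shorthand into a plain list. `expand` is a hypothetical helper written for illustration; it is not promtool's parser and only handles plain numbers and the `a+bxn` / `a-bxn` forms (no `_` or `stale`).

```python
import re
from typing import List

# <initial value><+ or -><increment>x<count>, numeric forms only (assumption)
_TERM = re.compile(r"^(-?\d+(?:\.\d+)?)([+-])(\d+(?:\.\d+)?)x(\d+)$")


def expand(values: str) -> List[float]:
    """Expand a space-separated values string into individual samples."""
    out: List[float] = []
    for token in values.split():
        m = _TERM.match(token)
        if m is None:
            out.append(float(token))  # plain literal such as "3"
            continue
        start, sign, step, count = m.groups()
        delta = float(step) if sign == "+" else -float(step)
        # The initial value plus `count` further values: count + 1 in total
        out.extend(float(start) + delta * i for i in range(int(count) + 1))
    return out


print(expand("0+0x5 100+10x5 10-1x2"))
```

The printed list has 15 elements (6 + 6 + 3), matching the long form on the slide.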

Where the shorthand helps
▶ When the alert's for is long
+ e.g. for: 1d
▶ When the range vector covers a wide range
+ e.g. predict_linear(mymetric[4h], )
▶ Testing counter values
+ e.g. sum(rate(mymetric[5m]))

Trying out PromQL

Trying out PromQL
tests:
  - interval: 1m
    input_series:
      # define the input values (omitted)
    promql_expr_test:
      - eval_time: 10m
        expr: 'sum(http_requests_total)'
        exp_samples:
          - value: 3000
Recording rules are tested the same way

Checking PromQL behavior
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{path="/foo"}'
        values: 0+100x10
      - series: 'http_requests_total{path="/bar"}'
        values: 0+200x10
    promql_expr_test: # example comparing sum and sum_over_time
      - eval_time: 10m
        expr: 'sum(http_requests_total)'
        exp_samples:
          - value: 3000 # sum of the values at 10 minutes (1000 + 2000)
      - eval_time: 10m
        expr: 'sum_over_time(http_requests_total[10m])'
        exp_samples:
          - labels: '{path="/foo"}'
            value: 5500 # sum of 100 200 ... 1000
          - labels: '{path="/bar"}'
            value: 11000 # sum of 200 400 ... 2000
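The expected values in this test can be checked by hand. The sketch below redoes the arithmetic in plain Python; it does not evaluate PromQL, and the variable names are chosen only for illustration. The sample at 0m is 0 for both series, so whether the window includes it does not change the totals.

```python
# Samples at 0m, 1m, ..., 10m for each series
foo = [100 * i for i in range(11)]  # 0+100x10
bar = [200 * i for i in range(11)]  # 0+200x10

# sum(http_requests_total) at 10m: add the latest sample of every series
instant_sum = foo[-1] + bar[-1]

# sum_over_time(http_requests_total[10m]) at 10m: add up the samples
# inside the window, separately for each series
window_foo = sum(foo[1:])  # 100 + 200 + ... + 1000
window_bar = sum(bar[1:])  # 200 + 400 + ... + 2000

print(instant_sum, window_foo, window_bar)  # 3000 5500 11000
```

sum aggregates across series at one instant and returns a single sample, while sum_over_time aggregates across time and returns one sample per series, which is why the second test lists labels.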

promtool tips

Start by adding syntax checks to CI
▶ First, check rule syntax with promtool check rules
+ This alone prevents startup failures caused by syntax errors
$ promtool check rules myalert.yaml
Checking myalert.yaml
  FAILED:   # example with invalid PromQL syntax
myalert.yaml: 7:11: group "example", rule 0, "InstanceDown": could not parse expression: 1:4: parse error: unexpected "="

Running promtool with Docker
▶ promtool is included in the prom/prometheus image
▶ The entrypoint has to be overridden
▶ The container runs as the nobody user, so the -u root option may be needed
$ docker run -u root -v "$PWD:/work" -w "/work" \
    --entrypoint /bin/promtool prom/prometheus:v2.18.1 \
    test rules *.yaml

Timestamps in unit tests
▶ Time starts at the Unix epoch (1970-01-01 00:00:00)
▶ Keep this in mind for queries that use time-related functions
+ Note also that times are in UTC
tests:
  - promql_expr_test:
      - eval_time: 8760h # one year later (24*365)
        expr: 'year()'
        exp_samples:
          - value: 1971 # one year after 1970
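The 8760h figure can be sanity-checked outside Prometheus. This short sketch uses only the Python standard library to compute the UTC date that lies 8760 hours after the Unix epoch; it illustrates the arithmetic and is not how promtool itself computes year().

```python
from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)  # test time starts here
eval_time = epoch + timedelta(hours=8760)          # eval_time: 8760h

print(eval_time.isoformat())  # 1971-01-01T00:00:00+00:00
print(eval_time.year)         # 1971
```

1970 is not a leap year, so 365 days land exactly on 1971-01-01 00:00:00 UTC. An eval_time even one second shorter would still fall in 1970.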

Learning Prometheus through unit tests

Floating-point precision

0.1 + 0.2 = ?
tests:
  - promql_expr_test:
      - expr: '0.1 + 0.2'
        exp_samples:
          - value: 0.3
FAILED:
    expr: "0.1 + 0.2", time: 0s,
        exp:"{} 3E-01"
        got:"{} 3.0000000000000004E-01"

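Floating-point precision
▶ Metric values are stored as Go float64
+ Delta compression works efficiently on them
▶ 0.1 + 0.2 = 0.30000000000000004
+ This comes from floating-point precision
▶ The precision is sufficient in practice
▶ Be careful when comparing a computed result with ==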
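The same behavior shows up in any language that uses IEEE 754 double precision. Python's float is such a type, as is Go's float64, so the sketch below reproduces the result from the failed test and shows a tolerance-based comparison as one possible alternative to ==. The tolerance 1e-9 is an arbitrary value picked for illustration.

```python
import math

result = 0.1 + 0.2

exact_equal = (result == 0.3)                            # strict comparison
close_enough = math.isclose(result, 0.3, rel_tol=1e-9)  # comparison with tolerance

print(repr(result))   # 0.30000000000000004
print(exact_equal)    # False
print(close_enough)   # True
```

In a rule expression the analogous idea would be to compare the absolute difference against a small threshold rather than testing for exact equality.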

Storing 16 Bytes at Scale (PromCon 2017)
https://promcon.io/2017-munich/talks/storing-16-bytes-at-scale/

How staleness works

When does a metric stop being returned? (Staleness)
tests:
  - interval: 1m
    input_series:
      - series: 'metric1'
        values: 0 1 2 3 4 # values exist only for minutes 0 to 4
    promql_expr_test:
      - eval_time: 5m
        expr: 'metric1'
        exp_samples: [] # is no value returned at minute 5?
Note: normally a Staleness Marker (described later) is written at the end

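Metrics from the last 5 minutes can still be retrieved
FAILED:
    expr: "metric1", time: 5m0s,
        exp:"nil"
        got:"{__name__=\"metric1\"} 4E+00"
Time: 0m 1m 2m 3m 4m
▶ A value remains valid for the following 5 minutes (*1), so the last value is returned
*1 Configurable with --query.lookback-delta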

Adding a Staleness Marker
tests:
  - interval: 1m
    input_series:
      - series: 'metric1'
        values: 0 1 2 3 4 stale # staleness marker
    promql_expr_test:
      - eval_time: 5m
        expr: 'metric1'
        exp_samples: [] # this time the test passes

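The role of the Staleness Marker
Time: 0m 1m 2m 3m 4m, followed by the Staleness Marker (StaleNaN); no value is returned after it
▶ When a scrape fails or a time series ceases to exist, a special value called a Staleness Marker is written
▶ This avoids both delayed detection of a vanished series and double counting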
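The two staleness tests can be summarized in a toy model of an instant query. This is a deliberately simplified sketch, not Prometheus's actual implementation: `instant_value` and `STALE` are hypothetical names, times are whole minutes, and the lookback window is assumed to be left-open with the default length of 5 minutes.

```python
from typing import List, Optional, Tuple

STALE = object()  # stand-in for the special StaleNaN value

Sample = Tuple[int, object]  # (minute, value or STALE)


def instant_value(samples: List[Sample], eval_minute: int,
                  lookback: int = 5) -> Optional[object]:
    """Return the value an instant query would see, or None if there is none."""
    window = [(t, v) for t, v in samples
              if eval_minute - lookback < t <= eval_minute]
    if not window:
        return None  # nothing within the lookback window
    _, latest = max(window, key=lambda s: s[0])
    # A staleness marker as the latest sample means "series is gone"
    return None if latest is STALE else latest


without_marker = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
with_marker = without_marker + [(5, STALE)]

print(instant_value(without_marker, 5))   # 4: last value is still visible
print(instant_value(with_marker, 5))      # None: marker hides the series
print(instant_value(without_marker, 10))  # None: outside the lookback window
```

Without the marker the series only disappears once the lookback window has passed, which is the delayed detection that the marker is meant to avoid.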

Staleness in Prometheus 2.0 (PromCon 2017)
https://promcon.io/2017-munich/talks/staleness-in-prometheus-2-0/

Summary

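Summary
▶ With promtool, rules are easy to verify
+ Preparing metric data is simple
+ Writing and reviewing rules becomes more efficient
▶ Syntax checking of rules is useful on its own
▶ promtool is also well suited to learning PromQL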

Thank you for your attention
