[Prometheus Meetup #3] Building a Large-Scale, Ultra-High-Load System Monitoring Platform with Victoria Metrics / Monitoring Platform With Victoria Metrics

Building a Large-Scale, Ultra-High-Load System Monitoring Platform with Victoria Metrics, 2020/01/15, Prometheus Meetup #3

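Self-introduction
• 入江順也
• COLOPL, Inc. (株式会社コロプラ), joined in 2015
  ◦ Member of the infrastructure team
  ◦ In charge of operating GKE-based titles and improving operational efficiency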

Announcement: recent talk
9th Google Cloud INSIDE Games & Apps: "GKE and Cloud Spanner in action in Dragon Quest Walk"
https://www.slideshare.net/GoogleCloudPlatformJP/gke-cloud-spanner-9-google-cloud-inside-game-apps

Agenda
• Premise and background
• Evaluating 3rd-party storage
• What is Victoria Metrics?
• Why did we choose Victoria Metrics?
• Parameter Tuning
• Other measures
• Summary

150 billion

Cumulative datapoints (from 10,000+ Pods)

Premise and background

Prometheus/Grafana operations so far
At COLOPL, we use Prometheus + Grafana to monitor workloads on Kubernetes.
• Single-tenant setup: one K8s cluster per game title
• K8s cluster : Prometheus = 1:1
• max 1,500 pods
Diagram labels: Game Title1 (API), Game Title2 (PvP, API), Game Title3 (API)

One day, during a load test...

During a load test at 1/4 of the expected load...

Prometheus dies within minutes of the start... OOMKilled

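We tried to cope by scaling up Prometheus, but...
Prometheus had died from OOM, so as a stopgap we upgraded the spec of the Prometheus Node and raised resources.requests and resources.limits.
Mem: 50G → 100G → 200G ...
The result: no change!
No matter how far we raised the spec, Prometheus hung during the load test.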

Appendix) Investigating the Prometheus bottleneck

Appendix) Slow Query
Prometheus itself has no mechanism for detecting slow queries, so we would like a separate mechanism that also provides observability of Prometheus itself.

Evaluating 3rd-party storage

Recap: Prometheus architecture
Source: https://prometheus.io/docs/introduction/overview/
Annotation on the diagram: direct access is the problem

Toward reducing the memory load on Prometheus
[3rd-party storage]
・Cortex
・Thanos
・M3DB
etc.
These are products that can offload queries from Grafana. Of these, the ones we initially built and evaluated were Thanos and M3DB.

Cortex
[Issues at the time of the load test]
・Complexity
・Many components we had no track record with
・No release version
Source: https://github.com/cortexproject/cortex

Thanos
Thanos is a product developed by Improbable and hosted by the CNCF.
[Issues at the time of the load test]
Monitoring with Grafana
↓
Queries for recent data also go to Prometheus (it is referenced from Thanos Query)
↓
The Remote Read API issue (see the Appendix) squeezes Prometheus memory: nearly twice the usual memory consumption
↓
Prometheus is OOMKilled
Source: https://github.com/thanos-io/thanos

Appendix) Thanos Remote Read API issue (Prometheus < v2.13.0)
Before v2.13.0, streaming was not supported
↓
The query response is expanded in memory in both Prometheus and the Thanos Sidecar
↓
Twice the memory is required
Source: https://prometheus.io/blog/2019/10/10/remote-read-meets-streaming
Diagram labels: TSDB, Server, Thanos Sidecar

Appendix) Thanos Remote Read API issue (Prometheus < v2.13.0)
Source: https://prometheus.io/blog/2019/10/10/remote-read-meets-streaming
Diagram labels: TSDB, Prometheus, Thanos Sidecar, Thanos Query
①. A request from Thanos Query is sent to Prometheus through the Thanos Sidecar.
②. Metrics are selected from the TSDB
　　↓
Prometheus expands the entire result in memory, passes it to the Thanos Sidecar as the response, and it is finally returned to Thanos Query.
Twice the memory is used!!

Appendix) Thanos Remote Read API issue (Prometheus >= v2.13.0): streaming supported
Source: https://prometheus.io/blog/2019/10/10/remote-read-meets-streaming
Diagram labels: TSDB, Prometheus, Thanos Sidecar, Thanos Query
①. A request from Thanos Query is sent to Prometheus through the Thanos Sidecar.
②. Metrics are selected from the TSDB
　　↓
Only part of the result is expanded in memory, passed to the Thanos Sidecar as a chunked response, and returned to Thanos Query piece by piece.
Memory efficiency improves dramatically.
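The difference between the two remote-read modes can be sketched with a toy model. This is an illustration only: `Tracker`, `read_buffered`, `read_streamed` and the chunk size are hypothetical names and values, not Prometheus or Thanos APIs. The model just counts how many samples are resident at once across the Prometheus and Sidecar hops.

```python
from typing import Iterator, List


class Tracker:
    """Counts samples currently held in memory and remembers the peak."""

    def __init__(self) -> None:
        self.current = 0
        self.peak = 0

    def hold(self, n: int) -> None:
        self.current += n
        self.peak = max(self.peak, self.current)

    def release(self, n: int) -> None:
        self.current -= n


def read_buffered(samples: List[float], mem: Tracker) -> List[float]:
    """Non-streaming model: both hops materialize the full response."""
    prometheus_copy = list(samples)       # Prometheus builds the whole response
    mem.hold(len(prometheus_copy))
    sidecar_copy = list(prometheus_copy)  # the sidecar decodes the whole response
    mem.hold(len(sidecar_copy))
    mem.release(len(prometheus_copy))
    mem.release(len(sidecar_copy))
    return sidecar_copy


def read_streamed(
    samples: List[float], mem: Tracker, chunk_size: int
) -> Iterator[List[float]]:
    """Streaming model: only one chunk per hop is resident at a time."""
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        mem.hold(len(chunk))  # chunk held by Prometheus
        mem.hold(len(chunk))  # same chunk held by the sidecar while forwarding
        yield chunk
        mem.release(2 * len(chunk))


samples = [float(i) for i in range(10_000)]

buffered = Tracker()
read_buffered(samples, buffered)

streamed = Tracker()
total = sum(len(c) for c in read_streamed(samples, streamed, chunk_size=100))

print(buffered.peak, streamed.peak, total)
```

In this model the buffered peak grows with the size of the query result, while the streamed peak depends only on the chunk size.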

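M3DB
M3DB is a Remote Write product developed by Uber to solve Prometheus's scaling problems. You set up an etcd cluster and then build M3DB with an Operator or a Helm Chart.
[Issues at the time of the load test]
・High construction and management cost
・High learning cost because of M3DB-specific concepts such as Namespaces and Shards
Source: https://static.sched.com/hosted_files/kccnceu19/e0/M3%20and%20Prometheus%2C%20Monitoring%20at%20Planet%20Scale%20for%20Everyone.pdf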

Encountering Victoria Metrics
https://twitter.com/yosshi_
"There's Victoria Metrics!"
The power of the community (advice)...!

After adopting Victoria Metrics
It no longer goes down except for host errors, and operation has stabilized.

What is Victoria Metrics?

Victoria Metrics
Victoria Metrics is ... THE BEST LONG-TERM REMOTE STORAGE FOR PROMETHEUS
Official page: https://victoriametrics.com/
Features
・SIMPLIFIES MONITORING
・GLOBAL QUERY VIEW
・DESIGNED TO BE FAST
・NATIVE PROMQL SUPPORT
・LONG TERM STORAGE
・LOW RESOURCE USAGE
Several months into production operation, it has been running with very high reliability.

Architecture
[Mode] There are two: the Single version and the Cluster version. The diagram on this slide shows the Cluster version.
Source: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/README.md

Architecture - VMStorage
• The part that corresponds to the TSDB
• Deployed as a StatefulSet
• Can scale out (cannot scale in)
Source: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/README.md

Architecture - VMSelect
• Issues queries against the storage
• For the Grafana datasource, specify the VMSelect load balancer (LB)
Source: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/README.md

Architecture - VMInsert
• Handles writes to the storage
• For Prometheus remote_write, specify the VMInsert LB
Source: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/README.md

How scale-out works (consistent hashing)
With VMInsert: 1, VMStorage: 2, VMSelect: 1, each metric name is mapped to a VMStorage index:
  metric name | index
  A           | 0
  B           | 1
After scaling out to VMStorage: 3 (VMInsert: 1, VMSelect: 1), the mapping changes:
  metric name | index
  A           | 0 → 2
  B           | 1 → 0
VMSelect issues the same query to all VMStorage nodes, so data is found regardless of which node it was written to.

Appendix) How scale-out works
VMInsert: determines the target VMStorage (where the data is stored) by consistent hashing.
Source: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/app/vminsert/netstorage/insert_ctx.go#L166-L188
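One way to see why consistent hashing keeps scale-out cheap is the jump consistent hash algorithm by Lamping and Veach. The sketch below is illustrative: whether VMInsert uses this exact function is an assumption, and `pick_storage` and the metric names are hypothetical. With this particular scheme, adding a node only moves series onto the new node; nothing is reshuffled between existing nodes.

```python
import hashlib


def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash: map a 64-bit key to a bucket in [0, num_buckets)."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b


def pick_storage(metric_name: str, num_storage_nodes: int) -> int:
    """Hypothetical helper: choose a storage node index for a metric name."""
    digest = hashlib.sha256(metric_name.encode("utf-8")).digest()
    key = int.from_bytes(digest[:8], "big")
    return jump_hash(key, num_storage_nodes)


names = [f"metric_{i}" for i in range(1000)]
before = {n: pick_storage(n, 2) for n in names}  # 2 storage nodes
after = {n: pick_storage(n, 3) for n in names}   # scaled out to 3 nodes
moved = [n for n in names if before[n] != after[n]]

print(f"{len(moved)} of {len(names)} series changed node")
```

Series written before the scale-out stay on their old node, which is why the read path has to ask every node.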

Appendix) How scale-out works
VMSelect: issues the same query to every VMStorage node.
Source: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/app/vmselect/netstorage/netstorage.go#L725-L749
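The fan-out read can be sketched as follows. This is a minimal model, not the VMSelect implementation: each storage node is a hypothetical in-memory dict, and `query_node` and `fan_out_query` are made-up names. The same query goes to every node and the partial results are merged.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Sample = Tuple[int, float]      # (timestamp, value)
Node = Dict[str, List[Sample]]  # hypothetical stand-in for one VMStorage node


def query_node(node: Node, metric: str) -> List[Sample]:
    """Return whatever samples this node holds for the metric."""
    return node.get(metric, [])


def fan_out_query(nodes: List[Node], metric: str) -> List[Sample]:
    """Send the same query to every node and merge the results by timestamp."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: query_node(n, metric), nodes))
    merged = [sample for part in partials for sample in part]
    return sorted(merged)


# Samples for "A" were written to node 0 before a scale-out and to node 2 after it.
nodes: List[Node] = [
    {"A": [(1, 0.5), (2, 0.7)], "B": [(3, 2.0)]},
    {"B": [(1, 1.0), (2, 1.5)]},
    {"A": [(3, 0.9)]},
]

result_a = fan_out_query(nodes, "A")
print(result_a)
```

The cost of this design is that every query touches every storage node; the benefit is that the read path needs no knowledge of where a series was written.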

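Monitoring setup (single-cluster environment)
In the early phase of load testing we built a single-cluster environment, and it ran without problems with 8,000+ pods (one Prometheus instance).
Diagram labels: Remote Write, Global Query View, 8,000+ pods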

Monitoring setup (multi-cluster environment)
It runs normally with 10,000+ pods in total, and thanks to the Global Query View, queries can be issued without being aware of the multiple clusters (one Prometheus instance per cluster).
Diagram labels: Remote Write, Global Query View, three clusters of 1,000~3,000 pods each

Division of roles between Prometheus and Victoria Metrics
A normal Prometheus does both Read and Write (Scrape).
Here Prometheus only scrapes, and the data is consolidated in Victoria Metrics.
Diagram labels: scrape (×3), Remote Write

Why did we choose Victoria Metrics?

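Why we chose Victoria Metrics
• Simplicity: only three components (Select, Insert, Storage)
• Scalability: scale-in is not possible, but scale-out is easy
• Reliability: highly reliable and resource-efficient; through load testing and after the production release it has kept running without going down except for host errors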

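Open issues with Victoria Metrics (as of 2020/1)
• VMStorage scale-in is effectively impossible: it can be done, but data loss occurs
• No alerting: currently alert rules are configured on each Prometheus
• No Web UI: there is no screen like Prometheus's /graph, so queries are hard to debug (Grafana is required)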

Parameter Tuning

Parameter Tuning: Prometheus remote_write settings
  queue_config:
    max_shards: 30
    capacity: 20000
    max_samples_per_send: 10000
↓
• Up to 600,000 samples can be queued (30 shards × 20,000 samples)
• Samples are sent to Remote Storage once half of the capacity has accumulated
Throughput: 400k+ samples per second
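The arithmetic behind these settings can be checked with a short sketch. `QueueConfig` and its methods are hypothetical helpers, not part of Prometheus; the numbers are the ones from the slide, and the calculation assumes all 30 shards are active and every send is a full batch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueConfig:
    max_shards: int
    capacity: int              # samples that can be queued per shard
    max_samples_per_send: int  # largest batch sent in one request

    def max_queued_samples(self) -> int:
        """Upper bound on samples buffered across all shards."""
        return self.max_shards * self.capacity

    def sends_per_second(self, samples_per_second: int) -> float:
        """Requests per second needed if every send is a full batch."""
        return samples_per_second / self.max_samples_per_send

    def buffer_seconds(self, samples_per_second: int) -> float:
        """How long full queues could absorb ingestion if sending stalled."""
        return self.max_queued_samples() / samples_per_second


cfg = QueueConfig(max_shards=30, capacity=20_000, max_samples_per_send=10_000)

print(cfg.max_queued_samples())       # 600000
print(cfg.sends_per_second(400_000))  # 40.0
print(cfg.buffer_seconds(400_000))    # 1.5
```

At 400k samples per second the queues hold only about 1.5 seconds of data, so the remote storage has to keep up almost continuously.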

Appendix) Parameter Tuning: Prometheus remote_write settings
Shards: temporarily hold samples on their way from the WAL to Remote Storage.
Flow: write-ahead log (WAL) → Dynamic Queues (shard0, shard1, shard2, ...) → Remote Storage
The number of shards changes dynamically with the scrape volume.

Appendix) Parameter Tuning: Prometheus remote_write settings
Capacity: the number of samples that can be queued per shard.

Appendix) Parameter Tuning: Prometheus remote_write settings
max_samples_per_send: the maximum number of samples sent at once.
ex: when max_samples_per_send=3, the samples queued in a shard are sent to Remote Storage in batches of at most 3.

Parameter Tuning: Victoria Metrics VMSelect
• maxConcurrentRequests: 2*vCPU
  The recommended number of concurrent requests is at most twice the number of cores
• maxUniqueTimeseries: 1,000,000 (default: 300,000)
Example log:
cannot find tag filter matching less than 300001 time series; either increase -search.maxUniqueTimeseries or use more specific tag filters
The log shows the name of the parameter whose ceiling was hit, so tuning based on logs is easy.
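Log-based tuning can be sketched as below. The regular expression relies only on the "increase -flag" wording of the log line quoted on the slide. `flag_to_raise` and `vmselect_args` are hypothetical helpers, and `-search.maxConcurrentRequests` is assumed to be the full flag name for the slide's maxConcurrentRequests.

```python
import re
from typing import List, Optional

# Matches the flag name in messages such as "... either increase -some.flag or ...".
FLAG_PATTERN = re.compile(r"increase (-[A-Za-z][\w.]*)")


def flag_to_raise(log_line: str) -> Optional[str]:
    """Extract the flag that the log message suggests increasing, if any."""
    match = FLAG_PATTERN.search(log_line)
    return match.group(1) if match else None


def vmselect_args(vcpus: int, max_unique_timeseries: int) -> List[str]:
    """Build command-line arguments following the slide's rule of thumb."""
    return [
        f"-search.maxConcurrentRequests={2 * vcpus}",  # 2 * vCPU
        f"-search.maxUniqueTimeseries={max_unique_timeseries}",
    ]


line = (
    "cannot find tag filter matching less than 300001 time series; "
    "either increase -search.maxUniqueTimeseries or use more specific tag filters"
)

print(flag_to_raise(line))
print(vmselect_args(vcpus=8, max_unique_timeseries=1_000_000))
```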

Other measures

Scheduling measures
Prepare a dedicated NodePool (Node Instance Group) for Prometheus & Victoria Metrics.
Kubernetes cluster (each pool has one Instance Group in ZoneA and one in ZoneB, with the same label and taint):
  pool            | label  | taint
  backend pool    | back   | back
  app pool        | app    | app
  prometheus pool | prom   | prom
  system pool     | system | system

Scheduling measures
The number of collected metrics grew so large that kube-state-metrics was OOMKilled
→ Reserve dedicated Nodes for kube-state-metrics
Kubernetes cluster (each pool has one Instance Group in ZoneA and one in ZoneB, with the same label and taint):
  pool                    | label  | taint
  backend pool            | back   | back
  app pool                | app    | app
  kube-state-metrics pool | metric | metric
  prometheus pool         | prom   | prom

Prometheus load testing
To verify that Prometheus and the horizontal-scaling products would hold up, we generated pseudo time series and tested whether Prometheus and those products could withstand the load. The tool we used for this was the OSS tool "Avalanche".
Avalanche: https://github.com/open-fresh/avalanche
Reference article: https://blog.freshtracks.io/load-testing-prometheus-metric-ingestion-5b878711711c
Diagram: Avalanche (metric-count, series-count, interval, ...) generates pseudo metrics, and Prometheus scrapes them with:
  scrape_configs:
    - job_name: avalanche
      ...
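Sizing such a test is simple arithmetic. The sketch below assumes that one Avalanche instance exposes metric-count × series-count time series and that each series yields one sample per scrape. The function names and example values are hypothetical; only the 400k samples per second target comes from the talk.

```python
import math


def avalanche_series(metric_count: int, series_count: int) -> int:
    """Series exposed by one instance (assumed metric-count x series-count)."""
    return metric_count * series_count


def instances_needed(
    target_samples_per_second: int,
    metric_count: int,
    series_count: int,
    scrape_interval_s: int,
) -> int:
    """Instances required to reach the target ingestion rate."""
    total_series = avalanche_series(metric_count, series_count)
    # One instance contributes total_series / scrape_interval_s samples per second.
    return math.ceil(target_samples_per_second * scrape_interval_s / total_series)


series = avalanche_series(metric_count=1000, series_count=100)
needed = instances_needed(
    target_samples_per_second=400_000,
    metric_count=1000,
    series_count=100,
    scrape_interval_s=10,
)

print(series, needed)
```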

Browser crashes caused by Grafana
Too many panels and plotted points displayed in Grafana crashed the browser.

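Our analysis of Grafana showed that the time-consuming part is where the Graph panel's render is called
→ The only way to shorten display time is to reduce the places and the time spent on rendering: split panels, reduce the Resolution, etc.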

Appendix) Grafana analysis
Apart from PromQL execution time, what takes time in Grafana is JavaScript scripting + rendering.
The frontend is a mix of Angular + React + jQuery, and rendering uses jQuery's flot.
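A rough way to estimate how much a dashboard asks the browser to draw is to count the plotted points. The model below is a simplification with hypothetical names and example values. It assumes a range query returns one point per step, including both ends of the range, and it ignores any limits Grafana itself applies.

```python
def points_per_series(range_s: int, step_s: int) -> int:
    """Points returned for one series over the time range."""
    return range_s // step_s + 1


def dashboard_points(
    panels: int, series_per_panel: int, range_s: int, step_s: int
) -> int:
    """Total points the browser has to render for the whole dashboard."""
    return panels * series_per_panel * points_per_series(range_s, step_s)


six_hours = 6 * 60 * 60

fine = dashboard_points(panels=20, series_per_panel=50, range_s=six_hours, step_s=15)
coarse = dashboard_points(panels=20, series_per_panel=50, range_s=six_hours, step_s=60)

print(fine, coarse)
```

In this example a four times coarser step cuts the rendered points to about a quarter, the same effect as lowering the panel Resolution or narrowing the time range.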

Prepare a PC with good GPU performance

Summary

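Summary
• Make Prometheus scrape-only to reduce its memory load
• Adopt the product that fits your needs → this time Victoria Metrics was the match
• Load testing of inserts was achieved with Avalanche
• In Grafana, narrow down the metrics and the time range displayed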

Thank you !!
