From the Kubernetes Podcast, "Prometheus and OpenMetrics, with Richard Hartmann" (https://kubernetespodcast.com/episode/037-prometheus-and-openmetrics/):

"Prometheus came into being because a few ex-Googlers were quite unhappy with what they found in the open-source and also in the paid software world. So they basically reimplemented large parts of Borgmon into Prometheus."

From "Site Reliability Engineering":

"In recent years, monitoring has undergone a Cambrian Explosion: Riemann, Heka, Bosun, and Prometheus have emerged as open source tools that are very similar to Borgmon's time-series–based alerting. In particular, Prometheus shares many similarities with Borgmon, especially when you compare the two rule languages. The principles of variable collection and rule evaluation remain the same across all these tools and provide an environment with which you can experiment, and hopefully launch into production."

Prometheus is a time-series-based alerting system modeled on Borgmon.
Reference: Site Reliability Engineering https://sre.google/books/
From "Site Reliability Engineering":

"… was created in 2003, a new monitoring system—Borgmon—was built to complement it."

Borgmon is the monitoring system that was built for Borg.
Reference: Site Reliability Engineering https://sre.google/books/
From "Site Reliability Engineering":

"… similar to Apache Mesos. Borg manages its jobs at the cluster level."

Borg is a distributed cluster management system; Kubernetes was modeled on Borg.
Reference: Site Reliability Engineering https://sre.google/books/
From "The Site Reliability Workbook":

"… base from which to build. In this case, the basic foundations of SRE include SLOs, monitoring, alerting, toil reduction, and simplicity. Getting these basics right will set you up well to succeed on your SRE journey."

The foundations are listed in the order SLOs, monitoring, alerting, so start by understanding what an SLO is.
Reference: The Site Reliability Workbook https://sre.google/books/
• Service Level Indicator: SLI
  – commonly expressed as "number of successful events / total number of events"
  – number of successful HTTP requests / total HTTP requests (success rate; see the PromQL sketch below)
  – number of gRPC calls that completed successfully in under 100 ms / total gRPC requests
• Service Level Objective: SLO
  – a target value for the service, defined in terms of SLIs
Reference: The Site Reliability Workbook https://sre.google/books/

From "The Site Reliability Workbook":

"Even if you could achieve 100% reliability within your system, your customers would not experience 100% reliability. The chain of systems between you and your customers is often long and complex, and any of these components can fail."

From "Site Reliability Engineering":

"100% is the wrong reliability target for basically everything"

Reference: Site Reliability Engineering https://sre.google/books/
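For illustration, the HTTP success-rate SLI above can be computed directly as a PromQL expression. This is a minimal sketch: the metric name http_requests_total and its code label follow the exposition-format example used elsewhere in this deck, and the 5-minute window is an arbitrary choice.

# Fraction of HTTP requests that succeeded over the last 5 minutes
sum(rate(http_requests_total{code=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))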
• Service discovery
  – A drawback of the pull model is that the configuration has to be changed every time a target is added; service discovery automates this.
• OpenMetrics
  – A format that carries labels in addition to each sample's name and value (a minimal instrumentation sketch follows the example below):

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"} 3 1395066363000
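A minimal sketch of producing the exposition format above with the official Go client library (github.com/prometheus/client_golang); the label values and the :8080 listen address are assumptions for illustration.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counter with method/code labels, matching the example above.
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "The total number of HTTP requests.",
	},
	[]string{"method", "code"},
)

func main() {
	// Each increment updates one labeled time series.
	httpRequestsTotal.WithLabelValues("post", "200").Inc()

	// /metrics serves the text exposition format for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}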
If exporters support OpenMetrics, the expectation is that any monitoring tool will be able to consume them.

ADAM GLICK: Given all the work that you're doing with Prometheus, how did that lead you into creating the OpenMetrics Project?

RICHARD HARTMANN: Politics. It's really hard for other projects, and especially for other companies, to support something with a different name on it. Even though Prometheus itself doesn't have a profit motive, so we don't have to have sales or anything, it was hard for others to accept stuff which is named after Prometheus in their own product. And this is basically why I decided to do that.

• OpenMetrics was deliberately split out so that other companies could adopt it more easily.
• It began as the Prometheus metrics format extracted into its own project, but the relationship has since shifted: the format is now defined in OpenMetrics and Prometheus ingests it.
From the Kubernetes Podcast, "Grafana, with Torkel Ödegaard":

"I took Kibana as a starting point when I started working on Grafana. And I really wanted that easy-to-use dashboard experience, combined with an easy query builder experience. Because that's where many of my teammates were struggling: editing and understanding the Graphite queries. Because it's a small text box. It's a long query nested structure."

• Published as open source on GitHub in 2014
• Modeled on Kibana v3
• Aimed to make queries easy to build and visualize
From the Kubernetes Podcast, "Prometheus and OpenMetrics, with Richard Hartmann":

"The common wisdom used to be, you retain your data for two weeks, and you drop everything which is older. Personally, I kept all data since late 2015. So data which we collected back then is still available today. The truth is probably somewhere in the middle for most users. If you really care about persisting your data long time, you would probably be looking at something like Cortex or Thanos or Influx Data. Or one of those other tools where we have our remote read/write API, where you can just push data to those other systems and persist data over there."

Reference: Prometheus and OpenMetrics, with Richard Hartmann https://kubernetespodcast.com/episode/037-prometheus-and-openmetrics/

My personal recommendation is to keep roughly two weeks of metrics in Prometheus itself, and to use the remote read/write features if you want to retain data longer than that (a configuration sketch follows).
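A minimal sketch of a prometheus.yml that forwards samples to a long-term store such as Cortex or Thanos; the endpoint URLs are placeholders, and note that local retention is set with the --storage.tsdb.retention.time command-line flag rather than in the config file.

# Run Prometheus with bounded local retention, e.g.:
#   prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d

# prometheus.yml — forward samples to a long-term store
remote_write:
  - url: "https://long-term-store.example.com/api/v1/push"   # placeholder endpoint

# Optionally read historical data back from the same store
remote_read:
  - url: "https://long-term-store.example.com/api/v1/read"   # placeholder endpoint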
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 289
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.15.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 3.63520904e+08

The simplest example of Go runtime metrics, for monitoring only.
Rather than relying on a human checking a dashboard at a fixed time every day, the checking should be designed into the system itself (for example, as in the alerting-rule sketch below).
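For instance, a Prometheus alerting rule file could watch the go_goroutines gauge shown above; the 1000-goroutine threshold and the severity label here are hypothetical values for illustration.

groups:
  - name: go-runtime
    rules:
      - alert: TooManyGoroutines
        # Hypothetical threshold; tune to your workload.
        expr: go_goroutines > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "goroutine count has been above 1000 for 10 minutes"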
From "The Site Reliability Workbook":

"1. Binary reporting: Check that the exported metric variables change in value under certain conditions as expected.
2. Monitoring configurations: Make sure that rule evaluation produces expected results, and that specific conditions produce the expected alerts.
3. Alerting configurations: Test that generated alerts are routed to a predetermined destination, based on alert label values."

In other words: the health of the metrics, the configuration of the rules, and whether the expected alerts are emitted and routed. A monitoring system naturally needs tests of its own (a promtool sketch for layer 2 follows).
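As a sketch of layer 2, Prometheus ships promtool, which can unit-test rule files; assuming the rule above is saved as go-runtime.yml, a test file might look like this (run with: promtool test rules tests.yml).

# tests.yml
rule_files:
  - go-runtime.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A goroutine count stuck at 2000 for 20 minutes.
      - series: 'go_goroutines{instance="app:8080"}'
        values: '2000x20'
    alert_rule_test:
      # After 15 minutes, the "for: 10m" condition is satisfied.
      - eval_time: 15m
        alertname: TooManyGoroutines
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: app:8080
            exp_annotations:
              summary: "goroutine count has been above 1000 for 10 minutes"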
Our new partnership with AWS gives Grafana users more options
https://grafana.com/blog/2020/12/15/announcing-amazon-managed-service-for-grafana/

Since this is implemented on top of … , I am also hoping to see progress around … , which uses the same architecture.
how OM relates to the wider CNCF ecosystem
https://github.com/OpenObservability/OpenMetrics/issues/137
Reference: https://docs.google.com/document/d/17r3BW7-DBtdNNJ_PRvAlrnbj_RsMDkSv_lljOhI03HI/edit#heading=h.k8l30yl2bec4

The developers of the respective projects appear to be talking with one another. Rather than merging, since their scopes overlap, my expectation is that they will settle where the boundaries between the projects lie.
– Preview launch: Amazon Managed Service for Prometheus (AMP) < https://aws.amazon.com/jp/blogs/news/join-the-preview-amazon-managed-service-for-prometheus-amp/ >
– A Comprehensive Analysis of Open-Source Time Series Databases < https://www.alibabacloud.com/blog/a-comprehensive-analysis-of-open-source-time-series-databases-4_594733?spm=a2c41.12826437.0.0 >
– Loki: Prometheus-inspired, open source logging for cloud natives < https://grafana.com/blog/2018/12/12/loki-prometheus-inspired-open-source-logging-for-cloud-natives/ >
– An (only slightly technical) introduction to Loki, the Prometheus-inspired open source logging system < https://grafana.com/blog/2020/05/12/an-only-slightly-technical-introduction-to-loki-the-prometheus-inspired-open-source-logging-system/ >
– Prometheus vs. Graphite: Which Should You Choose for Time Series or Monitoring? < https://logz.io/blog/prometheus-vs-graphite >
– Announcing Grafana Tempo, a massively scalable distributed tracing system < https://grafana.com/blog/2020/10/27/announcing-grafana-tempo-a-massively-scalable-distributed-tracing-system/ >
– Loki 2.0 released: Transform logs as you're querying them, and set up alerts within Loki
– Welcome to the museum of modern Borgmon art < https://cloud.google.com/blog/ja/products/gcp/welcome-to-the-museum-of-modern-borgmon-art >
– ObservabilityCON 2020: Your guide to the newest announcements from Grafana Labs < https://grafana.com/blog/2020/10/26/observabilitycon-2020-your-guide-to-the-newest-announcements-from-grafana-labs/ >
– Our new partnership with AWS gives Grafana users more options < https://grafana.com/blog/2020/12/15/announcing-amazon-managed-service-for-grafana/ >
– The OpenMetrics specification used by Prometheus was submitted to the IETF < https://asnokaze.hatenablog.com/entry/2020/11/27/011732 >
– Like Prometheus, but for Logs - Tom Wilkie, Grafana Labs < https://sched.co/MPbj >
• KubeCon + CloudNativeCon Europe 2020 Virtual
  – Prometheus Introduction - Julius Volz, Prometheus < https://sched.co/Zex4 >
  – Scaling Prometheus: How We Got Some Thanos Into Cortex - Thor Hansen, HashiCorp & Marco Pracucci, Grafana Labs < https://sched.co/Zeuw >
  – Review and define compatibility or incompatibility between OpenMetrics and OpenTelemetry < https://sched.co/ekFx >
  – What You Need to Know About OpenMetrics - Brian Brazil, Robust Perception & Richard Hartmann, Grafana Labs < https://sched.co/ZevQ >
  – Evolution of Metric Monitoring and Alerting: Upgrade Your Prometheus Today - Bartlomiej Płotka, Red Hat; Björn Rabenstein & Richard Hartmann, Grafana Labs; & Julius Volz, PromLabs < https://sched.co/ekHn >
  – Intro to Scaling Prometheus with Cortex - Tom Wilkie, Grafana Labs & Ken Haines, Microsoft < https://sched.co/ekHh >
  – Observing Cloud Native Observables with the New SIG Observability - Bartlomiej Płotka, Red Hat & Richard Hartmann, Grafana Labs < https://sched.co/ekFx >
• ObservabilityCON
  – Observability with logs & Grafana < https://sched.co/ekHn >
• Kubernetes Podcast
  – Prometheus and OpenMetrics, with Richard Hartmann < https://kubernetespodcast.com/episode/037-prometheus-and-openmetrics/ >
  – Grafana, with Torkel Ödegaard < https://kubernetespodcast.com/episode/122-grafana/ >
  – Monitoring, Metrics and M3, with Martin Mao and Rob Skillington < https://kubernetespodcast.com/episode/084-monitoring-metrics-m3/ >