Incident Response Training for System Improvement and Staff Development

Slide 1

Slide 1 text

Incident Response Training for System Improvement and Staff Development
September 22, 2023
Matsumoto Hiroki
Marketing Cloud Platform Department
Rakuten Group, Inc.

Slide 2

Slide 2 text

Profile
松本宏紀 (Matsumoto Hiroki)
• Reliability Engineering Team
• Software Engineer
• Joined Rakuten in 2020
• Published
  • OSS: Passenger Go Exporter
  • Presentation: Choosing a Deployment Method ~ Flagger/Argo Rollouts ~ (デプロイメント手法を選択する ~ Flagger/Argo Rollouts ~)
• GKE + Java + Cassandra → Ruby + Go + Kubernetes (Private Cloud / AKS)
• Twitter: @hirokimatsumo13

Slide 3

Slide 3 text

Passing on experience

Slide 4

Slide 4 text

Table of Contents
1. Assets
2. Training
3. Examples

Slide 5

Slide 5 text

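Assets

RunBook
An instruction manual for when an incident occurs. Recovery procedures and similar steps are linked from each alert or embedded in the alert itself.

Incident Report Template
A template for reporting an incident. Its format is designed to communicate, from the user's point of view, where the impact occurred and what kind of impact it was.

Training
Deliberately trigger failures, based on past incidents or on the components of the system, and work through them all the way to recovery. The exercise surfaces problems in the system and areas where people can grow.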

Slide 6

Slide 6 text

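Training

Trainee
☑ Can cover the basics through self-study
☑ New graduate in years 1 to 3
☑ Role: Developer + Operator

Trainer
☑ 3+ years of operations experience
☑ Knowledgeable about the target product

Conductor
☑ 3+ years of operations experience
☑ Knowledgeable about the target product

Manager
☑ Receives the report

Training Environment
☑ Dedicated training environment

Flow labels in the diagram: simulated failure injection; detection, impact assessment, root-cause identification, and recovery; assistance; report.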

Slide 7

Slide 7 text

Example 1: Response Latency

Diagram components: Load Balancer, Nginx Proxy, Istio, App

Pattern   Nginx request_time   Nginx upstream_response_time   App elapsed_time
A         10                   10                             10
B         10                   10                             0.1
C         10                   0.1                            0.1

https://nginx.org/en/docs/http/ngx_http_log_module.html
https://nginx.org/en/docs/http/ngx_http_upstream_module.html

Where is the problem occurring in each of Patterns A to C?

Slide 8

Slide 8 text

Example 1: Response Latency

Nginx: request_time
"request processing time in seconds with a milliseconds resolution; time elapsed between the first bytes were read from the client and the log write after the last bytes were sent to the client."
Source: https://nginx.org/en/docs/http/ngx_http_log_module.html

Slide 9

Slide 9 text

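Example 1: Response Latency

Diagram components: Load Balancer, Nginx Proxy, Istio, App

Failure injection methods (two callouts on the diagram)

Callout 1
• Reproduce the delay with a dedicated program that slows down packet sending on the client side
• Simply apply high load (CPU / packet send and receive)

Callout 2
• Use Istio fault injection
• Simply apply high load (CPU / packet send and receive)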
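
The "dedicated program" is not shown on the slide. As one possible shape for it, the sketch below sends a request in small chunks and sleeps between them, so the receiving side spends a long time reading from the client. The function name `send_slowly` and the `chunk_size` and `delay_s` parameters are hypothetical and chosen only for illustration; the actual tool used in the training may work differently.

```python
import socket
import time


def send_slowly(host: str, port: int, payload: bytes,
                chunk_size: int = 8, delay_s: float = 0.05) -> float:
    """Send payload in small chunks with a pause after each chunk.

    Returns the total number of seconds spent sending.
    All names and defaults here are illustrative assumptions.
    """
    started = time.monotonic()
    with socket.create_connection((host, port), timeout=10) as sock:
        for offset in range(0, len(payload), chunk_size):
            sock.sendall(payload[offset:offset + chunk_size])
            # The artificial delay that stretches the client-side send time.
            time.sleep(delay_s)
    return time.monotonic() - started
```

Pointed at the proxy in a training environment, a client like this would be expected to inflate request_time while leaving the upstream and application timings small, although that depends on how the proxy buffers requests.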

Slide 10

Slide 10 text

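Example 1: Response Latency

Understand exactly where each measured time starts and ends.
Without this understanding, you may be unable to tell accurately whether the problem lies in the system at all.

Understand which measurement point is the most appropriate SLI (Service Level Indicator).
Understand that measuring only the application-side response time, for example with APM, is not sufficient.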
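
One way to read the three patterns from the earlier slide is to check the timings from the innermost component outward. The sketch below is only an illustration of that reasoning: the function name `locate_delay`, the 1-second `threshold`, and the assumption that the application's elapsed_time is in seconds like the Nginx values are not from the slides.

```python
def locate_delay(request_time: float,
                 upstream_response_time: float,
                 app_elapsed_time: float,
                 threshold: float = 1.0) -> str:
    """Guess which segment adds the latency, given three timings in seconds.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    if app_elapsed_time >= threshold:
        # The application itself reports a slow request.
        return "app"
    if upstream_response_time >= threshold:
        # Nginx waited on the upstream, but the app was fast.
        return "between nginx and app"
    if request_time >= threshold:
        # Only the client-facing measurement is slow.
        return "between client and nginx"
    return "no delay observed"


for name, timings in {"A": (10, 10, 10), "B": (10, 10, 0.1), "C": (10, 0.1, 0.1)}.items():
    print(name, locate_delay(*timings))
```

Under these assumptions Pattern A points at the application, Pattern B at the path between Nginx and the application, and Pattern C at the path between the client and Nginx.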

Slide 11

Slide 11 text

Example 2: Accuracy of PromQL

    sum(rate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m]))
      / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100 > 5

What should be considered when applying this to a real system?

Slide 12

Slide 12 text

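Example 2: Accuracy of PromQL

Difference between rate and irate
rate is calculated over the whole range; irate is calculated from the last two points.
https://prometheus.io/docs/prometheus/latest/querying/functions/

Accounting for counter resets
https://github.com/prometheus/prometheus/issues/1673
A counter does not necessarily start from 0 in the first place, and it is reset periodically.
Note: the interval and the behavior differ by version.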
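
To make the rate/irate difference and the effect of a counter reset concrete, here is a simplified sketch. It is not Prometheus's implementation: it omits the extrapolation to the range boundaries that Prometheus applies, and the function names and sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def increase_with_resets(samples: List[Sample]) -> float:
    """Total increase across samples, treating any drop in value as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples: List[Sample]) -> float:
    """Per-second average over the whole range (no boundary extrapolation)."""
    span = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / span


def simple_irate(samples: List[Sample]) -> float:
    """Per-second rate computed from only the last two samples."""
    return simple_rate(samples[-2:])


# Hypothetical scrape results: the counter resets between t=120 and t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 310)]
print(simple_rate(samples))   # average over the full range
print(simple_irate(samples))  # driven only by the final burst
```

With this data the whole-range average stays moderate while the last-two-points value is several times higher, which is why the choice of function matters for an alert threshold.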

Slide 13

Slide 13 text

Example 2: Accuracy of PromQL

reporter=destination or source
The destination side can end up including retried requests. Conversely, there are also cases where the metrics are not captured at all, so we mainly use the measurements from the source side.
https://istio.io/latest/docs/reference/config/metrics/

response_flag
Check response_flag to see what kind of problem occurred on the Istio side that caused it to return the error.
https://istio.io/latest/docs/reference/config/metrics/
https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage#config-access-log-format-response-flags

Diagram components: Nginx Proxy, Istio, App, with the source and destination reporter positions marked.

Slide 14

Slide 14 text

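Theory and Practice

Learn by watching, gain experience, and make the knowledge your own.
Rather than settling for "I don't know why, but that's how it is," work toward understanding why it is that way.
That said, the ideal is to improve the system so that such opaque elements disappear in the first place.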
