prob_code: k8s

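A lightning talk (LT) about an incident that occurred the day before the ICTSC2020 finals. The PersistentVolume backing the Prometheus pod running on Kubernetes filled up; the talk covers the resulting trouble and outage.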

proelbtn

March 09, 2021

Transcript

  1. $ kubectl get logs -n monitoring …
    ...
    level=warn ts=2021-03-05T18:06:31.492Z caller=scrape.go:972 component="scrape manager" scrape_pool=blackbox-exporter target="http://blackbox-exporter.monitoring.svc.cluster.local:9115/probe?module=icmp&target=10.5.15.194" msg="append failed" err="write to WAL: log samples: write /etc/prometheus-data/wal/00001007: disk quota exceeded"
    level=warn ts=2021-03-05T18:06:31.492Z caller=scrape.go:987 component="scrape manager" scrape_pool=blackbox-exporter target="http://blackbox-exporter.monitoring.svc.cluster.local:9115/probe?module=icmp&target=10.5.15.194" msg="appending scrape report failed" err="write to WAL: log samples: write /etc/prometheus-data/wal/00001007: disk quota exceeded"
  2. $ kubectl get sc
    NAME         PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
    csi-cephfs   rook-ceph.cephfs.csi.ceph.com   Retain          Immediate           true                   11d
  3. Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC.
  4. $ kubectl get sc
    NAME         PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
    csi-cephfs   rook-ceph.cephfs.csi.ceph.com   Retain          Immediate           true                   11d
  5. $ kubectl get pv
    NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                       STORAGECLASS   REASON   AGE
    pvc-dfb6cde7-e3d0-4e6a-b46d-4ac5dd3d97fb   20Gi       RWX            Retain           Released   monitoring/prometheus-pvc   csi-cephfs              8d
  6. $ cat pvc.yaml
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      namespace: monitoring
      name: prometheus-pvc-old
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
      volumeName: "pvc-dfb6cde7-e3d0-4e6a-b46d-4ac5dd3d97fb"
      storageClassName: csi-cephfs
  7. $ cat ds.yaml
    apiVersion: apps/v1
    kind: Deployment
    ...
          containers:
            - name: data-access
              image: ubuntu:18.04
              command: ["/bin/sh"]
              args: ["-c", "while true; do echo hello; sleep 10; done"]
              volumeMounts:
                - name: prometheus-data-old
                  mountPath: /old
                - name: prometheus-data
                  mountPath: /new
          volumes:
            - name: prometheus-data-old
              persistentVolumeClaim:
                claimName: prometheus-pvc-old
            - name: prometheus-data
              persistentVolumeClaim:
                claimName: prometheus-pvc
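  8. Lessons learned
    • Storage has to be sized properly
      ◦ Losing logs or metrics is a real problem
      ◦ Calculate the size from the daily ingestion volume and the retention setting
      ◦ Leave some headroom when sizing
    • Just because a feature is listed does not mean it is usable
      ◦ We should have checked the actual requirements carefully
    • Have a fallback ready for when the monitoring platform itself goes down
      ◦ No alert fired when Prometheus died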
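
The sizing lesson in the last slide (daily ingestion volume × retention, plus some headroom) can be sketched as a small calculation. This is only an illustrative sketch, not something from the talk: the function name `estimate_volume_bytes`, the 1 GiB/day ingestion rate, the 15-day retention, and the 20% headroom are all hypothetical values chosen for the example. Only the 20Gi capacity comes from the `kubectl get pv` output in the slides.

```python
GIB = 1024 ** 3  # bytes in one GiB


def estimate_volume_bytes(daily_ingest_bytes: int,
                          retention_days: int,
                          headroom_percent: int = 20) -> int:
    """Estimate the volume size needed for a metrics store.

    size = daily ingestion * retention days, plus a headroom percentage.
    All inputs are assumptions the operator has to measure or choose.
    """
    if daily_ingest_bytes < 0 or retention_days < 0 or headroom_percent < 0:
        raise ValueError("inputs must be non-negative")
    raw = daily_ingest_bytes * retention_days
    # Integer ceiling division so the estimate never rounds down.
    return (raw * (100 + headroom_percent) + 99) // 100


if __name__ == "__main__":
    # Hypothetical workload: 1 GiB of samples per day, kept for 15 days.
    needed = estimate_volume_bytes(1 * GIB, 15, headroom_percent=20)
    capacity = 20 * GIB  # the 20Gi PersistentVolume shown in the slides
    print(f"needed: {needed / GIB:.1f} GiB, capacity: {capacity / GIB:.1f} GiB")
    print("fits" if needed <= capacity else "volume too small")
```

Running the estimate with measured ingestion figures before creating the PersistentVolumeClaim, and again whenever scrape targets or retention change, is one way to act on the lesson; the real daily volume would have to be measured on the cluster in question.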