Velero, a Kubernetes Backup Tool, and a Few Hard-Won Lessons

2021/3/12 Ippei Murata (村田一平)

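Notes
• This talk is based on the following versions:
  Velero, Restic: 1.4.3, 1.5.3
  Velero Plugin for vSphere (vSphere Plugin): 1.0.2 (*it barely comes up)
• Some logs and other output have been edited for readability.
• Development is very active, so please also check the official site when you try things yourself: https://velero.io/docs/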

Who are you?
Ippei Murata (村田一平)
• I keep degus.
• I also keep medaka, otocinclus, kuhli loaches, and Yamato shrimp.
• TL: 41 (one more Eevee evolution and I reach TL42)

Agenda
• What this session covers
• Backing up Kubernetes
• What is Velero?
• How Velero works
• Velero features
• Trouble I ran into with Velero
• Closing

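What this session covers
• Goal of this talk: introduce Velero to people who do not know it yet
Covered:
• An introduction to a backup tool for K8s
• Uses other than backup
• Caveats when using it
Not covered:
• Very fine-grained details of Velero
• A comparison of backup software
• Velero operations case studies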

Backing up Kubernetes
Does Kubernetes need backups?
• "Kubernetes recovers by itself, so let's put it off for now."
• "We can rebuild it right away with Terraform and Ansible."

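Backing up Kubernetes
The Spotify case: how things went wrong in 2018 (KubeCon + CloudNativeCon Europe 2019)
• Deleted the production cluster by mistake in a manual operation (first time)
• Lost the production cluster because of a mistake in a Terraform script (second time)
• No service impact, because the migration to k8s was being done in stages
• Since the incidents, clusters are built with Terraform and then backed up with Ark (now Velero)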

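What is Velero?
• An open-source backup tool for Kubernetes developed by Heptio (a company started by K8s founders)
• Backs up Kubernetes resources and objects (including PVs and CRDs)
• Development is now led mainly by VMware
Timeline: '17/8 v0.3.0 (Initial) → '19/5 v1.0.0 → '21/1 v1.5.3 (Latest)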

Velero features
• Multi-cloud support
• Backups that include PVs and CRDs
• Backups per Namespace
• Extensible through plugins

Velero use cases
• Backup and restore as protection against failures
• Cluster migration (on-premises ⇔ public cloud)
• Cluster duplication

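Backup and restore
When it helps:
• Something goes wrong after a cluster update
• Key resources or objects are deleted by mistake
Setting up a Kubernetes environment is easy = breaking a Kubernetes environment is just as easy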

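Migration
When it helps:
• Moving a development environment created on AWS to an on-premises test environment
• Moving to AWS temporarily because on-premises resources are running short
* Environment-dependent objects and resources have to be recreated separately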

Cluster duplication
When it helps:
• You want to give each developer a cluster with Fluentbit, Prometheus, and Harbor already installed and configured
• You want to verify an update in advance on a cluster with the same configuration as the current one
[Figure: three clusters, each with one Master Node and three Worker Nodes]

Velero architecture
[Figure: a cluster with one Master Node and three Worker Nodes]
With Velero + the Restic plugin (the plugin for capturing PVs) installed
CRDs: backups, restores, schedules, backupstoragelocations, volumesnapshotlocations, podvolumerestores, podvolumebackups, resticrepositories, downloadrequests, serverstatusrequests, deletebackuprequests

▪ The architecture is simple
• A Deployment that acts as the control tower
• A DaemonSet that pulls up the PVs attached to each Node (*when Restic is installed)

▪ CRDs are plentiful
• Most features and operations are implemented through CRDs
  → The status of an operation can also be checked with kubectl (the velero command is often the less useful of the two)
  → Resource consumption is kept fairly low
• Features are executed through the Kubernetes API
  → Everything is asynchronous, so behavior can be hard to follow

How Velero works
[Figure: the operator runs "# velero backup create"; custom resource creation requests (Kubernetes API); a cluster with one Master Node and three Worker Nodes; storage targets: MinIO, S3, ABS, GCS, etc.]

Backup/Restore
Backups and restores can be narrowed down to a subset. Examples:
• Namespace
  velero restore create --from-backup <backup-name> --include-namespaces <namespace1>,<namespace2>
• Resource
  velero restore create --from-backup <backup-name> --include-resources deployments,configmaps
• Label
  velero backup create <backup-name> --selector <key>=<value>
• Cluster scope (*)
  velero restore create --from-backup <backup-name> --include-cluster-resources=false
(*) Resources that are not tied to a Namespace, such as PVs and StorageClasses: the ones for which kubectl api-resources shows NAMESPACED as false.

Backup/Restore
The contents of a backup can be checked with describe --details.
$ velero describe backup <backupname> --details
: (omitted)
Resource List:
  addons.cluster.x-k8s.io/v1alpha3/ClusterResourceSet:
    - default/dummyworkload-cni-antrea
    - default/dummyworkload-csi
    - default/dummyworkload-default-storage-class
    - default/dummyworkload-tkg-metadata
  addons.cluster.x-k8s.io/v1alpha3/ClusterResourceSetBinding:
    - default/dummyworkload
  apiextensions.k8s.io/v1/CustomResourceDefinition:
    - apps.kappctrl.k14s.io
    - certificaterequests.cert-manager.io
: (omitted)

Running commands before and after a backup
Arbitrary commands can be run before and after a backup. This is useful for tasks such as quiescing a file system.
kubectl annotate pod -n nginx-example -l app=nginx \
    pre.hook.backup.velero.io/command='["/sbin/fsfreeze", "--freeze", "/var/log/nginx"]' \
    pre.hook.backup.velero.io/container=fsfreeze \
    post.hook.backup.velero.io/command='["/sbin/fsfreeze", "--unfreeze", "/var/log/nginx"]' \
    post.hook.backup.velero.io/container=fsfreeze
* Notes
• If the command does not exist in the container, a container that can run it has to be added to the Pod
• For operations that touch a volume, such as fsfreeze, the volume has to be shared with that container
• Some commands need strong privileges (privileged: true)

Running commands before and after a restore
Pre-restore processing is carried out by starting an init container.
$ kubectl annotate pod -n <POD_NAMESPACE> <POD_NAME> \
    init.hook.restore.velero.io/container-name=restore-hook \
    init.hook.restore.velero.io/container-image=alpine:latest \
    init.hook.restore.velero.io/command='["/bin/ash", "-c", "date"]'
* Notes
• Pre-restore processing is not supported in v1.4.x and earlier; v1.5 or later is required
• Post-restore processing is specified the same way as for backups

Tip 1: Check progress from the Pod, not from the velero command
• The velero command cannot show fine-grained backup progress
• Some plugins run their backup asynchronously from Velero's own completion (*covered later; be careful)
$ kubectl logs -n velero deploy/velero -f
: (omitted)
time="2021-02-26T02:52:54Z" level=info msg="Processing item" backup=velero/cndo1151 logSource="pkg/backup
time="2021-02-26T02:52:54Z" level=info msg="Backing up item" backup=velero/cndo1151 logSource="pkg/backu
time="2021-02-26T02:52:54Z" level=info msg="Backed up 779 items out of an estimated total of 785 (estimate
$ velero restore logs testbk
Logs for restore "testbk" are not available until it's finished processing.
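Since the state of an operation lives in custom resources, a small polling helper can wait for a backup to finish without going through the velero CLI. This is a minimal sketch, not something from the talk: the function names backup_phase and wait_for_backup are made up here, and it assumes Velero is installed in the velero namespace and that the Backup custom resource reports its state in .status.phase (Completed, PartiallyFailed, Failed, and so on).

```shell
# Sketch only: assumes the "velero" namespace and a .status.phase field on
# the Backup custom resource. Function names are hypothetical.
VELERO_NS="${VELERO_NS:-velero}"

# Print the current phase of a Backup custom resource (e.g. InProgress).
backup_phase() {
  kubectl get backups.velero.io -n "$VELERO_NS" "$1" -o jsonpath='{.status.phase}'
}

# Poll until the backup reaches a terminal phase; succeed only on Completed.
# Usage: wait_for_backup <backup-name> [interval-seconds]
wait_for_backup() {
  while :; do
    # Return 2 if kubectl itself fails (for example, the backup does not exist).
    phase=$(backup_phase "$1") || return 2
    case "$phase" in
      Completed)
        echo "backup $1: $phase"
        return 0 ;;
      Failed|PartiallyFailed|FailedValidation)
        echo "backup $1: $phase" >&2
        return 1 ;;
      *)
        # New, InProgress, or not yet populated: wait and ask again.
        sleep "${2:-5}" ;;
    esac
  done
}
```

For example, wait_for_backup cndo1151 10 would poll every 10 seconds. The caveat from this slide still applies: a Completed phase on the Backup covers Velero itself, and a plugin may still be uploading PV data at that point.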

Tip 2: Features missing from the velero command can be operated through Kubernetes resources (not recommended as a rule)
Example: no delete subcommand is provided
Usage:
  velero snapshot-location [command]
Available Commands:
  create  Create a volume snapshot location
  get     Get snapshot locations
Deleting the custom resource object works as a substitute:
$ velero get snapshot-locations
NAME          PROVIDER
default       aws
vsl-vsphere   velero.io/vsphere
$ kubectl delete volumesnapshotlocations.velero.io -n velero vsl-vsphere
volumesnapshotlocation.velero.io "vsl-vsphere" deleted
$ velero get snapshot-locations
NAME      PROVIDER
default   aws

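Tip 3: Restoring across different K8s versions is not guaranteed
Comments from a Velero engineer:
• Given the k8s version support policy, API compatibility is expected immediately after an update
• However, incompatibility issues are also expected to come up
• Long-term preservation of k8s clusters in particular is a challenge for the future
• Backing up before and after a K8s version upgrade, as well as on a regular schedule, is recommended
Also pay attention to which Velero versions the PV plugin supports.
Example: https://github.com/vmware-tanzu/velero-plugin-for-aws#compatibility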

Trouble 1: A classic
Restic is installed, but PVs are not being backed up.
Cause:
1. When a Pod carries the annotation "backup.velero.io/backup-volumes=<volume name>", Restic backs up that volume
2. Nobody writes that annotation for you automatically (*v1.5.1 and later have an option to automate this)
# mc ls -r minio
[2021-02-24 18:52:05 PST] 29B tkg/backups/fuga/fuga-csi-volumesnapshotcontents.json.gz
[2021-02-24 18:52:05 PST] 29B tkg/backups/fuga/fuga-csi-volumesnapshots.json.gz
: (nothing is created under plugin)

Trouble 1: A classic
Workaround: specify the backup targets with an annotation, or install with the --default-volumes-to-restic option (v1.5.1 and later)
kubectl annotate pod testpod backup.velero.io/backup-volumes=testvol
:
  volumes:
  - name: testvol
    persistentVolumeClaim:
      claimName: velerotest
  containers:
  - name: busybox
    image: busybox-test:1.29
    command: [ "sleep", "365d" ]
:
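On versions without the automation option, the annotation step can be scripted so that no PVC-backed volume is forgotten. The following is a rough sketch under stated assumptions: the helper names are made up, multiple volume names are assumed to be joined with commas in the annotation value, and the kubectl in use is assumed to accept an existence filter such as ?(@.persistentVolumeClaim) in jsonpath. Check the result with kubectl describe pod before relying on it.

```shell
# Sketch only: annotate a Pod so that restic backs up all of its PVC-backed
# volumes. The annotation key comes from the slide; helper names are hypothetical.

# Join whitespace-separated words from stdin into one comma-separated string.
join_commas() {
  awk '{ for (i = 1; i <= NF; i++) printf "%s%s", (n++ ? "," : ""), $i } END { print "" }'
}

# Print the names of the Pod's volumes that reference a PersistentVolumeClaim.
# Assumption: kubectl's jsonpath supports the existence filter used here.
pvc_volume_names() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{.spec.volumes[?(@.persistentVolumeClaim)].name}'
}

# Usage: annotate_pod_for_restic <namespace> <pod>
annotate_pod_for_restic() {
  vols=$(pvc_volume_names "$1" "$2" | join_commas)
  if [ -z "$vols" ]; then
    echo "no PVC-backed volumes found on $1/$2" >&2
    return 1
  fi
  kubectl annotate pod -n "$1" "$2" --overwrite \
    "backup.velero.io/backup-volumes=$vols"
}
```

For the Pod on this slide, annotate_pod_for_restic default testpod would end up setting backup.velero.io/backup-volumes=testvol, assuming the Pod lives in the default namespace. An annotation set directly on a Pod is lost when a controller recreates the Pod, so for Deployments it belongs in the Pod template instead.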

Lesson
• Read the documentation from start to finish rather than in pieces (*the documentation is not very well organized)

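Trouble 2: Specific to offline environments
In an offline environment, a restore that includes PVs gets stuck at InProgress. kubectl get pod shows the Pod being restored dying during Init, even though it has no init container of its own.
Cause:
1. With the restic plugin, an init container is created when a PV is restored
2. The init container image had not been brought into the offline environment
3. The image reference for the init container had not been changed
$ kubectl describe pod -n velerotest testpod
Warning  Failed   5m7s (x4 over 6m37s)   Error: ErrImagePull
Warning  Failed   4m29s (x7 over 6m37s)  Error: ImagePullBackOff
Normal   BackOff  94s (x19 over 6m37s)   Back-off pulling image "velero/velero-restic-restore-helper:v1.4.3"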

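Trouble 2: Specific to offline environments
Workaround:
• Bring velero-restic-restore-helper into the environment
• Create a ConfigMap that specifies the image path
What is hard to find in velero.io/docs:
• "Air-gapped deployments" does not mention the ConfigMap
• "Restic Integration" does describe it, but frames it as a customization, which makes it hard to connect to this problem
• There is no default ConfigMap either, so if you miss that section it is very hard to work out where the image is specified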
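To make the ConfigMap workaround concrete, here is a hedged sketch of a helper that prints such a manifest. The ConfigMap name and the two labels follow what I recall from the "Restic Integration" page of the Velero docs, so verify them against the documentation for your version. The function name and the registry host in the usage note are placeholders invented for this example.

```shell
# Sketch only: print a ConfigMap that points the restic restore helper (the
# init container Velero injects on restore) at an image in a private registry.
# Name and labels are recalled from the Velero docs; verify for your version.
# Usage: restore_helper_configmap <image> [namespace]
restore_helper_configmap() {
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: restic-restore-action-config
  namespace: ${2:-velero}
  labels:
    velero.io/plugin-config: ""
    velero.io/restic: RestoreItemAction
data:
  image: $1
EOF
}
```

A possible invocation, with registry.example.local standing in for your own registry: restore_helper_configmap registry.example.local/velero/velero-restic-restore-helper:v1.4.3 | kubectl apply -f -. Printing the manifest first and applying it in a second step makes it easy to review the image path before it reaches the cluster.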

Lesson
• Read the documentation from start to finish rather than in pieces (*the documentation is not very well organized)

Trouble 3: Confirming completion
After a backup is run, its status becomes Completed, yet the plugin appears to keep running without having finished. Running a restore in this state makes the PV restore fail.
$ kubectl logs -n velero ds/datamgr-for-vsphere-plugin -f
: (omitted)
msg="Upload ongoing, Part: 84 Bytes Uploaded: 840 MB"
msg="Read returning 10485760, len(p) = 10485760, offset=891289600\n"
msg="Upload ongoing, Part: 85 Bytes Uploaded: 850 MB"
msg="Read returning 10485760, len(p) = 10485760, offset=901775360\n"
Cause: with the vSphere Plugin, the PV backup runs asynchronously from the Velero backup

Trouble 3: Confirming completion
Workaround: confirm from the plugin log that the PV backup has completed before restoring, or check the state of the custom resource
$ kubectl logs -n velero ds/datamgr-for-vsphere-plugin
: (omitted)
msg="Upload status updated from InProgress to Completed"
msg="Upload Completed"
$ kubectl get upload -n velero -o jsonpath='{.items[*].status.phase}'
Completed
* Behavior probably differs between plugins, so check this for each plugin you use
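The custom resource check above can be wrapped in a small gate that a restore script calls first. This is a sketch, not the plugin's own tooling: it reuses the kubectl query from the slide, the function names are made up, and it looks at every Upload object in the velero namespace rather than only those belonging to one backup.

```shell
# Sketch only: succeed when every Upload custom resource created by the
# vSphere plugin reports phase Completed. Function names are hypothetical.

# Print the phases of all Upload objects, separated by spaces.
upload_phases() {
  kubectl get upload -n velero -o jsonpath='{.items[*].status.phase}'
}

# Return 0 if all phases are Completed, 1 if any is not,
# and 2 if kubectl fails or no Upload objects exist.
uploads_all_completed() {
  phases=$(upload_phases) || return 2
  [ -n "$phases" ] || return 2
  for p in $phases; do
    [ "$p" = "Completed" ] || return 1
  done
  return 0
}
```

A restore script could start with uploads_all_completed || exit 1 so that it refuses to run while PV data is still being uploaded. Return code 2 is kept separate on purpose: "nothing found" usually means a wrong namespace or a different plugin, which is not the same as "still in progress".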

Lessons
• Plugins are handled separately from the Velero core
• Understand how the plugin behaves, not just Velero itself, before relying on it

Trouble 4: A bug in Velero itself
Restoring a cluster that includes CAPI makes capi-controller-manager panic after the restore.
E0105 21:55:15.865928 1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 551 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x163e040, 0x26dfda0)
    /go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
Cause:
1. Velero restores resources in alphabetical order
2. Some CRDs assume that objects of another resource have already been created (in this case, ClusterResourceSetBindings assumes that ClusterResourceSets come up first)
Workaround: specify the restore order with a Velero argument (--restore-resource-priorities)
* Resources listed in this option are restored first; anything not listed is restored in alphabetical order
* A fix is planned for v1.6.0
References:
https://github.com/kubernetes-sigs/cluster-api/issues/4105
https://github.com/vmware-tanzu/velero/pull/3446
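The ordering rule described above (prioritized resources first, everything else alphabetically) is easy to see with a toy model. The function below is an illustration written for this transcript, not Velero's actual code: restore_order is a made-up name, and the short resource names stand in for the fully qualified ones.

```shell
# Sketch only: a toy model of the restore ordering described on the slide.
# Resource names are read from stdin, one per line; $1 is a comma-separated
# priority list (may be empty).
restore_order() {
  all=$(cat)
  prios=$(printf '%s' "$1" | tr ',' ' ')
  # 1) Prioritized resources first, in the order given, if present in the input.
  for p in $prios; do
    printf '%s\n' "$all" | grep -Fx -- "$p" || true
  done
  # 2) Everything else in alphabetical order.
  rest=$(printf '%s\n' "$all" | LC_ALL=C sort)
  for p in $prios; do
    rest=$(printf '%s\n' "$rest" | grep -Fxv -- "$p" || true)
  done
  if [ -n "$rest" ]; then
    printf '%s\n' "$rest"
  fi
}
```

Feeding it clusterresourcesets, clusterresourcesetbindings, and configmaps with an empty priority list puts clusterresourcesetbindings first, because "b" sorts before "s". That is the ordering behind the panic on this slide. Passing clusterresourcesets as the priority list moves it to the front.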

Lessons
• Some resources do not recover through the reconcile loop
• It is hard to understand every CRD-related behavior → verifying your actual use cases in advance matters
• If something looks off, ask the maintainers early (the velero bug command)

Trouble 5: A bug in the plugin
A PV backup finishes successfully, but its size is zero bytes (*when using the vSphere plugin)
# mc ls -r minio
[2021-02-23 22:33:58 PST] 0B tkg/plugins/vsphere-astrolabe-repo/ivd/data/ivd:xxxx
[2021-02-23 22:33:59 PST] 3.1KiB tkg/plugins/vsphere-astrolabe-repo/ivd/md/ivd:xxxx
[2021-02-23 22:33:59 PST] 984B tkg/plugins/vsphere-astrolabe-repo/ivd/peinfo/ivd:xxxx
Cause:
1. The requested PV size had been set to a small value, 1MB, for testing
2. The vSphere Plugin was built so that it does not fetch data for PVs smaller than 10MB
Fix: change the PV size (to about 100MB, to be safe)

Lessons
• Anything not stated explicitly as a specification may not have been tested
• Test with cases that match real usage

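What I wanted to say in closing
• Make use of backup software built for Kubernetes!
• Velero is working hard!
• There are quite a few pitfalls, so verify thoroughly in advance!
• Make the effort to read the official documentation cover to cover
• If you find something, use the velero bug command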

Thank you for listening
