Slide 1

Best Practices of Production-Grade Rook/Ceph Cluster
Apr. 17th, 2023
Satoru Takeuchi, Cybozu, Inc.

Slide 2

Agenda
▌Cybozu and our storage system
▌Quick introduction to K8s and Rook
▌Advanced Rook/Ceph cluster
▌Efforts and challenges
▌Conclusion

Slide 3

Agenda
▌Cybozu and our storage system
▌Quick introduction to K8s and Rook
▌Advanced Rook/Ceph cluster
▌Efforts and challenges
▌Conclusion

Slide 4

About Cybozu
▌A leading cloud service provider in Japan
▌Providing Web services that support teamwork

Slide 5

Our infrastructure
▌We have an on-premise infrastructure
▌The current system has many problems
⚫Not scalable
⚫A lot of day-2 work
▌We are developing a new infrastructure based on
⚫Kubernetes (K8s)
⚫Ceph and Rook (Ceph orchestration in K8s)

Slide 6

Why Ceph?
▌Fulfills our requirements
⚫Block & object storage
⚫Rack failure tolerance
⚫Bit-rot tolerance
▌Open source
⚫Detailed evaluation and investigation of problems
⚫Can use our custom containers in case of emergency (see later)

Slide 7

Why Rook?
▌Manage the storage system in the K8s way, like other system components
▌Offload a lot of work to K8s
⚫Lifecycle management of hardware
⚫MON failover
⚫Restarting problematic Ceph daemons

Slide 8

Our storage system
(Diagram) Two Ceph clusters, both managed by Rook, each with 3 replicas and rack failure tolerance:
⚫Ceph cluster for RGW: applications use buckets; data OSDs on HDDs, index OSDs on LVM logical volumes carved from NVMe SSD volume groups
⚫Ceph cluster for RBD: applications use RBD volumes; OSDs on LVM logical volumes on SSDs

Slide 9

Agenda
▌Cybozu and our storage system
▌Quick introduction to K8s and Rook
▌Advanced Rook/Ceph cluster
▌Efforts and challenges
▌Conclusion

Slide 10

What is K8s?
▌A container orchestrator
▌All services run as “pods”
⚫A pod is a set of containers
(Diagram) A Kubernetes cluster runs pods, each made of one or more containers, across its nodes
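
A pod is declared as a resource; a minimal sketch (not from the slides — the name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: example-pod        # placeholder name
spec:
  containers:              # a pod is a set of one or more containers
  - name: app
    image: nginx:1.25      # placeholder image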

Slide 11

The concept of K8s
▌All configurations are described as resources
▌K8s keeps the desired state
(Diagram) A pod's PersistentVolumeClaim (PVC) resource says “a 10 GiB volume, please!”; a storage provisioner (driver) provisions the volume, and the bound PersistentVolume (PV) resource answers “there is a 10 GiB volume!”
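
The “10 GiB volume please!” request in the diagram corresponds to a PVC resource roughly like the following (a sketch; the name and storage class are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc                     # placeholder name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                     # “10 GiB volume please!”
  storageClassName: some-storage-class  # placeholder; selects the provisioner (driver)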

Slide 12

The concept of Rook
▌All Ceph components are described as resources
▌Rook keeps the desired state of Ceph clusters
(Diagram) The admin creates CephCluster and CephBlockPool resources; the Rook pod watches them and creates Pod resources for the MON, MGR, and OSD daemons plus PVC/PV resources for the OSDs' disks, spread across the nodes of the K8s cluster

Slide 13

Agenda
▌Cybozu and our storage system
▌Quick introduction to K8s and Rook
▌Advanced Rook/Ceph cluster
▌Efforts and challenges
▌Conclusion

Slide 14

Our Rook/Ceph clusters
▌Requirements
⚫3 replicas
⚫Rack failure tolerance
⚫All OSDs should be spread evenly over all racks/nodes
▌Typical operations
⚫Create and upgrade clusters
⚫Manage OSDs (add a new one, replace a damaged one)
⚫etc.

Slide 15

Create a Ceph cluster
▌Just create the following resource

kind: CephCluster
metadata:
  name: ceph-ssd
…
mgr:
  count: 2
  …
mon:
  count: 3
  …
storage:
  …
  count: 3

• All OSDs are evenly spread over all nodes and racks
• Special configurations are necessary (see later)
(Diagram) 3 MON pods, 2 MGR pods, and 3 OSD pods spread over the nodes in three racks
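
Filling in the elided parts, a minimal CephCluster manifest along these lines would look roughly as follows. This is a sketch based on the Rook documentation, not the exact manifest from the slides; the namespace, dataDirHostPath, and device-set name are assumptions:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ceph-ssd
  namespace: rook-ceph                 # assumed namespace
spec:
  cephVersion:
    image: ceph/ceph:v17.2.6           # the image tag shown on the upgrade slide
  dataDirHostPath: /var/lib/rook       # common default in the Rook examples
  mgr:
    count: 2
  mon:
    count: 3
  storage:
    storageClassDeviceSets:            # OSDs are created on top of PVCs
    - name: osd-set                    # hypothetical name
      count: 3                         # number of OSDs
      volumeClaimTemplates:
      - spec:
          …                            # see the “automatic OSD deployment” slides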

Slide 16

Create an RBD pool
▌Just create the following resources

kind: CephBlockPool
metadata:
  name: block-pool
spec:
  replicated:
    size: 3
  failureDomain: zone

kind: StorageClass
metadata:
  name: ceph-block
parameters:
  clusterID: ceph-ssd
  pool: block-pool
  csi.storage.k8s.io/fstype: ext4

• “zone” means “rack” in our clusters
(Diagram) The StorageClass “ceph-block” ties a provisioner to the pool “block-pool” (replicated: 3, failure domain: rack) in the Ceph cluster
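
For reference, a more complete StorageClass for Rook’s RBD CSI driver also names the provisioner and the CSI secrets. The following is a sketch based on the defaults in the Rook examples; the provisioner prefix (operator namespace), secret names, and secret namespace are assumptions, not taken from the slides:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com   # assumes the Rook operator runs in the rook-ceph namespace
parameters:
  clusterID: ceph-ssd                     # from the slide; normally the namespace of the CephCluster
  pool: block-pool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: ceph-ssd   # assumed cluster namespace
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: ceph-ssd    # assumed cluster namespace
reclaimPolicy: Delete
allowVolumeExpansion: true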

Slide 17

Consume an RBD image
▌Just create the following resources

kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  …
  storageClassName: ceph-block

kind: Pod
metadata:
  name: mypod
volumes:
- name: myvolume
  persistentVolumeClaim:
    claimName: myclaim

(Diagram) The ceph-block provisioner creates an RBD volume in block-pool (3 replicas, failure domain: rack), and mypod consumes it
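
Filling in the elided fields, the pair of resources would look roughly like this (a sketch; the volume size, container name, image, and mount path are placeholders):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myclaim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                 # placeholder size
  storageClassName: ceph-block
---
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: app                       # placeholder container
    image: nginx:1.25               # placeholder image
    volumeMounts:
    - name: myvolume
      mountPath: /data              # placeholder mount path
  volumes:
  - name: myvolume
    persistentVolumeClaim:
      claimName: myclaim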

Slide 18

Configurations
▌Edit the “rook-config-override” ConfigMap resource, which corresponds to “/etc/ceph/ceph.conf”
⚫Restarting the Ceph pods is necessary after that
▌Some configurations (e.g. “ceph set”) can’t be applied this way
⚫Run Ceph commands in the “toolbox” pod instead

kind: ConfigMap
metadata:
  name: rook-config-override
data:
  config: |
    debug rgw = 5/5

Slide 19

Expand a cluster
▌Just edit the CephCluster resource

kind: CephCluster
…
storage:
  …
  count: 6

• Change the count from 3 to 6
(Diagram) 6 OSD pods now spread over the nodes in three racks, alongside the MON and MGR pods

Slide 20

Replace a damaged OSD
▌Just run our home-made script
⚫There is no official job/script for this
(Diagram) The home-made script removes the damaged OSD pod and creates a new one; the other MON, MGR, and OSD pods stay spread over the three racks

Slide 21

Upgrade Rook and Ceph
▌Edit the CephCluster resource
▌All Ceph container images will be upgraded after that
▌Other work might be needed
⚫See the files under “Documentation/Upgrade” in the Rook repository

kind: CephCluster
…
image: ceph/ceph:v17.2.6

• Change from “ceph/ceph:v17.2.5”
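
In the full CephCluster resource the image string sits under spec.cephVersion; a sketch of the relevant fragment (the metadata values are assumptions):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ceph-ssd
  namespace: rook-ceph            # assumed namespace
spec:
  cephVersion:
    image: ceph/ceph:v17.2.6      # changed from ceph/ceph:v17.2.5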

Slide 22

Troubleshooting
▌Same as with other Ceph orchestrations
⚫Running Ceph commands
⚫Referring to logs and metrics
⚫Reporting bugs to upstream
▌Rook is convenient but not a silver bullet
⚫Rook can’t debug and fix bugs for you

Slide 23

Advanced configurations
▌Even OSD deployment over all nodes/racks
▌Automatic OSD deployment

Slide 24

A challenge with even OSD deployment
▌K8s deploys pods to arbitrary nodes by default
⚫OSDs might be spread unevenly
(Diagram) With a plain CephCluster (storage count: 3), all three OSDs could land on the same node, so losing one node results in data loss

Slide 25

Solution
▌Use the “TopologySpreadConstraints” feature of K8s
⚫Spread specific pods evenly over all nodes (or racks, and so on)

kind: CephCluster
…
storage:
  …
  count: 3
  …
  topologySpreadConstraints:
  - labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - rook-ceph-osd
        - rook-ceph-osd-prepare
    topologyKey: topology.kubernetes.io/hostname

(Diagram) One OSD per node
See also: https://blog.kintone.io/entry/2020/09/18/175030
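
A complete topology spread constraint also carries maxSkew and whenUnsatisfiable. A sketch of one entry as the K8s API defines it; the values shown are typical choices, not necessarily the exact ones from the slides:

topologySpreadConstraints:
- maxSkew: 1                                  # max allowed difference in matching-pod count between nodes
  topologyKey: topology.kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule            # keep pods pending rather than spread them unevenly
  labelSelector:
    matchExpressions:
    - key: app
      operator: In
      values:
      - rook-ceph-osd
      - rook-ceph-osd-prepare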

Slide 26

In our clusters
▌Two constraints, for both racks and nodes

kind: CephCluster
…
storage:
  …
  count: 12
  …
  topologySpreadConstraints:
  - labelSelector: …
    topologyKey: topology.kubernetes.io/zone
  - labelSelector: …
    topologyKey: topology.kubernetes.io/hostname

(Diagram) 12 OSDs spread evenly over three racks and the nodes within them

Slide 27

A challenge with automatic OSD deployment
▌OSD creation flow
1. Rook creates a PVC resource for an OSD
2. K8s binds a PV resource to this PVC resource
3. Rook creates the OSD on top of the block device corresponding to the PV
▌The flow is suspended if no PV resource is available
(Diagram) With storage count: 3, the third OSD stays pending because the storage provisioner has not provisioned its volume yet

Slide 28

Solution
▌Use storage providers that support dynamic provisioning
⚫e.g. many cloud storage services, TopoLVM (for local volumes)
▌Or provision PV resources beforehand
⚫e.g. local-static-provisioner

storage:
  …
  count: 3
  volumeClaimTemplates:
  - spec:
      storageClassName: nice-provisioner

(Diagram) All three OSDs come up because their volumes are provisioned on demand or pre-provisioned
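
In a full CephCluster manifest these fields sit inside a storageClassDeviceSet. A sketch along the lines of the Rook examples; the set name and volume size are placeholders:

storage:
  storageClassDeviceSets:
  - name: osd-set                         # hypothetical name
    count: 3                              # number of OSDs
    volumeClaimTemplates:
    - metadata:
        name: data                        # the data device of each OSD
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Ti                  # placeholder size
        storageClassName: nice-provisioner   # e.g. a TopoLVM or local-PV StorageClass
        volumeMode: Block                 # OSDs consume raw block devices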

Slide 29

In our clusters
▌NVMe SSD
⚫Use TopoLVM
⚫PV resources and the corresponding LVM logical volumes are created on OSD creation
▌HDD
⚫Use local persistent volumes
⚫Create PVs for all HDDs when deploying nodes
⚫PV resources are already available on OSD creation
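
As an illustration of the dynamic-provisioning path, a TopoLVM StorageClass usable in the volumeClaimTemplates might look like this. This is a sketch; the class name and device class follow TopoLVM’s documented conventions and are assumptions here, not values from the slides:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topolvm-provisioner               # hypothetical name
provisioner: topolvm.io                   # TopoLVM’s CSI driver name
parameters:
  topolvm.io/device-class: nvme           # assumed device class backed by the NVMe volume group
volumeBindingMode: WaitForFirstConsumer   # provision the logical volume where the OSD pod is scheduled
allowVolumeExpansion: true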

Slide 30

Agenda
▌Cybozu and our storage system
▌Quick introduction to K8s and Rook
▌Advanced Rook/Ceph cluster
▌Efforts and challenges
▌Conclusion

Slide 31

Daily check of upstream Rook/Ceph
▌Check every update of both the Rook and Ceph projects every day
⚫Watch important bugfix PRs and backport them if necessary
▌e.g. a data-loss bug in RGW (PR#49795)
⚫Some objects might be lost on bucket index resharding
⚫We suspended resharding operations until the upgrade to >= v17.2.6

Slide 32

Upstream-first development
▌We’ve shared everything with the Rook/Ceph communities
⚫Reduces the long-term maintenance cost
⚫Makes both communities better
▌Major contributions
⚫Implemented Rook’s advanced configurations
⚫I’ve been working as a Rook maintainer
⚫Resolved some problems in containerized Ceph clusters
⚫https://speakerdeck.com/sat/revealing-bluestore-corruption-bugs-in-containerized-ceph-clusters

Slide 33

Running custom containers
▌If there is a critical bug and we can’t wait for the next release, we want to use our custom containers
⚫Official release + critical patches
▌We have been trying to run Teuthology in our test environment to verify custom containers
⚫We succeeded in running all the tests, but most of them still fail
⚫We’ll continue to work on this problem

Slide 34

Other remaining work
▌Backup/restore
▌Remote replication
▌More automation

Slide 35

Agenda
▌Cybozu and our storage system
▌Quick introduction to K8s and Rook
▌Advanced Rook/Ceph cluster
▌Efforts and challenges
▌Conclusion

Slide 36

Conclusion
▌Rook is an attractive option for Ceph orchestration
⚫Especially if you are familiar with K8s
▌There are some advanced configurations
▌We’ll continue to provide feedback to the Rook/Ceph communities

Slide 37

Thank you!