Our infrastructure
▌We have an on-premise infrastructure
▌The current system has many problems
⚫Not scalable
⚫A lot of day-2 operational work
▌We are developing a new infrastructure
⚫Kubernetes (K8s)
⚫Ceph and Rook (Ceph orchestration in K8s)
Why Rook?
▌Manage the storage system in the K8s way, like other system components
▌Offload a lot of work to K8s
⚫Lifecycle management of hardware
⚫MON failover
⚫Restarting problematic Ceph daemons
What is K8s?
▌A container orchestration system
▌All services run as "pods" (a minimal example follows this slide)
⚫A pod is a set of containers
[Diagram: a Kubernetes cluster consists of nodes; each node runs pods, and each pod contains one or more containers]
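To make the "pod" idea concrete, here is a minimal sketch of a Pod resource with two containers; the names and images are illustrative and not from the talk.

apiVersion: v1
kind: Pod
metadata:
  name: example-pod            # hypothetical name
spec:
  containers:
    - name: app                # first container in the pod
      image: nginx:1.25
    - name: sidecar            # second container sharing the pod's network and namespaces
      image: busybox:1.36
      command: ["sleep", "infinity"]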
The concept of K8s
▌All configurations are described as resources (see the PVC sketch below)
▌K8s keeps the desired state
[Diagram: a Pod uses a PersistentVolumeClaim (PVC) resource that asks "a 10 GiB volume, please!"; a storage provisioner (driver) provisions the volume, and a PersistentVolume (PV) resource answers "there is a 10 GiB volume!"]
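As a concrete sketch of the flow in the diagram, a 10 GiB volume request looks roughly like the following PVC resource; the name and storage class are assumptions.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc              # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi              # "a 10 GiB volume, please!"
  storageClassName: example-sc   # hypothetical; selects the storage provisioner (driver)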
The concept of Rook
▌All Ceph components are described as resources
▌Rook keeps the desired state of Ceph clusters
[Diagram: on the admin node, the Rook pod watches the CephCluster and CephBlockPool resources and creates the Pod resources (MON, MGR, OSD) and the PVC/PV resources for OSDs; the MON, MGR, and OSD pods then run on the nodes of the K8s cluster, with each OSD on a node's disk]
Our Rook/Ceph clusters
▌Requirements
⚫3 replicas
⚫Rack failure tolerance (see the CephBlockPool sketch below)
⚫All OSDs should be spread evenly over all racks/nodes
▌Typical operations
⚫Create and upgrade clusters
⚫Manage OSDs (add a new one, replace a damaged one)
⚫etc.
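A minimal sketch of how the "3 replicas" and "rack failure tolerance" requirements can be expressed as a CephBlockPool resource; the pool name and namespace are assumptions, not necessarily our actual settings.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: example-pool        # hypothetical name
  namespace: rook-ceph      # assumed Rook namespace
spec:
  failureDomain: rack       # place replicas in different racks
  replicated:
    size: 3                 # 3 replicas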
Create a Ceph cluster
▌Just create the following resource (a fuller sketch follows this slide)

kind: CephCluster
metadata:
  name: ceph-ssd
…
  mgr:
    count: 2
…
  mon:
    count: 3
…
  storage:
…
    count: 3

• All OSDs are evenly spread over all nodes and racks
• Special configurations are necessary (see later)
[Diagram: three racks with two nodes each; the MON, MGR, and OSD pods are spread over the racks and nodes]
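For context, a more complete (but still minimal) CephCluster sketch might look like the following; everything beyond the counts shown on the slide (namespace, image, dataDirHostPath, device-set name) is an assumption for illustration, not our exact configuration.

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: ceph-ssd                 # name taken from the slide
  namespace: rook-ceph           # assumed namespace
spec:
  cephVersion:
    image: ceph/ceph:v17.2.6     # Ceph container image (version shown on the upgrade slide)
  dataDirHostPath: /var/lib/rook # where Rook keeps per-node state
  mon:
    count: 3
  mgr:
    count: 2
  storage:
    storageClassDeviceSets:
      - name: example-set        # hypothetical device-set name
        count: 3                 # number of OSDs
        # volumeClaimTemplates (one PVC template per OSD) omitted here;
        # see the sketch after the OSD deployment flow slide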
Configurations
▌Edit the "rook-config-override" ConfigMap resource, which corresponds to "/etc/ceph/ceph.conf"
⚫Restarting the Ceph pods is necessary after that
▌Some configurations (e.g. "ceph set") can't be changed this way
⚫Run Ceph commands in the "toolbox" pod

kind: ConfigMap
metadata:
  name: rook-config-override
data:
  config: |
    debug rgw = 5/5
Expand a cluster
▌Just edit the CephCluster resource

kind: CephCluster
…
  storage:
…
    count: 6

• Change "count" from 3 to 6
[Diagram: three racks with two nodes each; every node now runs an OSD pod, with the MON and MGR pods spread over the racks]
Replace a damaged OSD
▌Just run our home-made script
⚫There is no official job/script for this
[Diagram: the home-made script removes the damaged OSD pod and creates a new OSD pod on another node]
Upgrade Rook and Ceph
▌Edit the CephCluster resource
▌All Ceph container images will be upgraded after that
▌Other work might be needed
⚫See the files under "Documentation/Upgrade" in the Rook repository

kind: CephCluster
…
    image: ceph/ceph:v17.2.6

• Change from "ceph/ceph:v17.2.5"
Troubleshooting
▌Same as other Ceph orchestrations
⚫Running Ceph commands
⚫Referring to logs and metrics
⚫Reporting bugs to upstream
▌Rook is convenient, but it is not a silver bullet
⚫Rook can't debug and fix bugs for you
An issue with even OSD deployment
▌K8s deploys Pods to arbitrary nodes by default (see the placement sketch below)
⚫OSDs might be spread unevenly

kind: CephCluster
…
  storage:
…
    count: 3

[Diagram: all three OSD pods land on the same node; losing that one node results in data loss]
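One way to force an even spread (a sketch under assumptions, not necessarily the exact configuration we use) is to give the OSD device set a placement with topologySpreadConstraints, assuming nodes are labelled with a rack topology key such as topology.rook.io/rack.

# Fragment of a CephCluster spec (sketch): spread OSD pods over racks and nodes
storage:
  storageClassDeviceSets:
    - name: example-set                          # hypothetical device-set name
      count: 3
      placement:
        topologySpreadConstraints:
          - maxSkew: 1
            topologyKey: topology.rook.io/rack   # assumed rack label on the nodes
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: rook-ceph-osd               # label Rook puts on OSD pods
          - maxSkew: 1
            topologyKey: kubernetes.io/hostname  # spread over nodes as well
            whenUnsatisfiable: DoNotSchedule
            labelSelector:
              matchLabels:
                app: rook-ceph-osd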
An issue with automatic OSD deployment
▌OSD creation flow (see the volumeClaimTemplate sketch below)
1. Rook creates a PVC resource for an OSD
2. K8s binds a PV resource to this PVC resource
3. Rook creates an OSD on top of the block device corresponding to the PV
▌The flow is suspended if the PV resource isn't available

kind: CephCluster
…
  storage:
…
    count: 3

[Diagram: a Ceph cluster with two OSDs up; the third OSD is stuck because the storage provisioner has not provisioned its volume yet]
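The PVC that Rook creates in step 1 comes from a volumeClaimTemplate in the device set; a minimal sketch (the device-set name, size, and storage class are assumptions) looks like this.

# Fragment of a CephCluster spec (sketch): the PVC template used for each OSD
storage:
  storageClassDeviceSets:
    - name: example-set                   # hypothetical device-set name
      count: 3                            # Rook creates one PVC per OSD from the template below
      volumeClaimTemplates:
        - metadata:
            name: data                    # template for the OSD data device
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Ti              # assumed device size
            storageClassName: example-sc  # assumed; decides which provisioner answers (step 2)
            volumeMode: Block             # the OSD is created on the raw block device (step 3)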
In our cluster
▌NVMe SSD
⚫Use TopoLVM
⚫PV resources and the corresponding LVM logical volumes are created on OSD creation
▌HDD
⚫Use local persistent volumes (see the sketch below)
⚫Create PVs for all HDDs when deploying nodes
⚫PV resources are already available on OSD creation
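For the HDD case, a pre-created local PersistentVolume might look roughly like the following sketch; the device path, capacity, node name, and storage class are assumptions.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-hdd-pv                # hypothetical name
spec:
  capacity:
    storage: 4Ti                      # assumed HDD size
  accessModes:
    - ReadWriteOnce
  volumeMode: Block                   # expose the raw HDD as a block device
  storageClassName: example-hdd       # assumed storage class for HDD OSDs
  local:
    path: /dev/disk/by-id/example-hdd # hypothetical device path
  nodeAffinity:                       # local PVs must be pinned to the node that has the disk
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - example-node1       # hypothetical node name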
Daily check of upstream Rook/Ceph
▌Check every update of both the Rook and Ceph projects every day
⚫Watch the important bugfix PRs and backport them if necessary
▌e.g. a data-loss bug in RGW (PR#49795)
⚫Some objects might be lost on bucket index resharding
⚫We suspended resharding operations until the upgrade to >= v17.2.6
Upstream-first development
▌We've shared everything with the Rook/Ceph communities
⚫Reduces the long-term maintenance cost
⚫Makes both communities better
▌Major contributions
⚫Implemented Rook's advanced configurations
⚫I've been working as a Rook maintainer
⚫Resolved some problems in containerized Ceph clusters
⚫https://speakerdeck.com/sat/revealing-bluestore-corruption-bugs-in-containerized-ceph-clusters
Running custom containers
▌If there is a critical bug and we can't wait for the next release, we want to use our own custom containers
⚫Official release + critical patches
▌We have been trying to run Teuthology in our test environment to verify custom containers
⚫Succeeded in running all tests, but most of them still fail
⚫We'll continue to work on this problem
Conclusion
▌Rook is an attractive option for Ceph orchestration
⚫Especially if you are familiar with K8s
▌Some advanced configurations are necessary
▌We'll continue to provide feedback to the Rook/Ceph communities