failure tolerance
⚫Bit-rot tolerance
▌Open source
⚫Detailed evaluation and investigation of problems
⚫Use our custom container in case of emergency (see later)
▌K8s keeps the desired state
(Diagram: on a node, a Pod uses a volume. The Pod resource and a PersistentVolumeClaim (PVC) resource (“10 GiB volume please!”) are bound to a PersistentVolume (PV) resource (“There is a 10 GiB volume!”) provisioned by a storage provisioner (driver).)
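As a concrete example of declaring such a desired state, here is a minimal sketch of a PVC that asks for a 10 GiB volume; the resource name and StorageClass are assumptions for illustration, not values from this deck.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc                     # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                     # “10 GiB volume please!”
  storageClassName: my-storage-class    # assumption: any provisioner-backed StorageClass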
desired state of Ceph clusters
The concept of Rook
(Diagram: in a K8s cluster, the admin creates CephCluster and CephBlockPool resources; the Rook pod watches them and creates Pod resources (MON, MGR, OSD) plus PVC and PV resources for OSDs; the resulting MON, MGR, and OSD pods run on nodes, with each OSD pod backed by a disk.)
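For concreteness, a minimal sketch of one of the resources Rook watches; the pool name, namespace, and replication settings are assumptions for illustration.

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: example-pool        # hypothetical pool name
  namespace: rook-ceph      # assumption: the usual Rook namespace
spec:
  failureDomain: rack       # place replicas in different racks
  replicated:
    size: 3                 # keep three copies of each object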
OSDs should be spread evenly over all racks/nodes
▌Typical operations
⚫Create and upgrade clusters
⚫Manage OSDs (add a new one, replace a damaged one)
⚫etc.
kind: CephCluster
metadata:
  name: ceph-ssd
…
mgr:
  count: 2
  …
…
mon:
  count: 3
  …
storage:
  …
  count: 3

• All OSDs are evenly spread over all nodes and racks
• Special configurations are necessary (see later)
(Diagram: three racks with two nodes each; the 3 MON pods, 2 MGR pods, and 3 OSD pods are spread so that each rack holds at most one of each daemon type.)
⚫ Restarting Ceph pods is necessary after that
▌Some configurations (e.g. “ceph set”) can’t be applied this way
⚫ Run Ceph commands in the “toolbox” pod

kind: ConfigMap
metadata:
  name: rook-config-override
data:
  config: |
    debug rgw = 5/5
CephCluster
…
storage:
  …
  count: 6 (changed from 3)

(Diagram: three racks with two nodes each; after the change, every node runs one OSD pod, 6 in total, alongside the 3 MON pods and 2 MGR pods.)
job/script
Replace a damaged OSD
(Diagram: three racks with two nodes each, running 6 OSD pods, 3 MON pods, and 2 MGR pods; a home-made job/script removes the damaged OSD pod and creates a new one in its place.)
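The removal half of such a script could look roughly like the Job below, which marks the damaged OSD out and purges it. This is a hypothetical sketch, not the home-made script from this deck, and in practice the pod also needs the cluster’s ceph.conf and keyring mounted (as the Rook toolbox pod does).

apiVersion: batch/v1
kind: Job
metadata:
  name: purge-osd-3                      # hypothetical: removing the OSD with id 3
  namespace: rook-ceph                   # assumption: the usual Rook namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: purge-osd
          image: ceph/ceph:v17.2.6       # same Ceph image as the cluster
          command: ["sh", "-c"]
          args:
            # take the OSD out of service, then purge it from the CRUSH and OSD maps
            - ceph osd out osd.3 && ceph osd purge 3 --yes-i-really-mean-it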
container images will be upgraded after that
▌Other work might be needed
⚫See the files under “Documentation/Upgrade” in the Rook repository

kind: CephCluster
…
  image: ceph/ceph:v17.2.6 (changed from “ceph/ceph:v17.2.5”)
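For reference, this image field normally sits under spec.cephVersion; the nesting below is an assumption based on the standard CephCluster layout, since the slide elides it with “…”.

spec:
  cephVersion:
    image: ceph/ceph:v17.2.6   # bumping this from v17.2.5 makes Rook roll the daemons to the new image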
arbitrary nodes by default
⚫ OSDs might be unevenly spread

kind: CephCluster
…
storage:
  …
  count: 3

(Diagram: three nodes, but all three OSDs land on the same node; losing that one node results in data loss.)
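One common way to force an even spread is Rook’s placement section, which forwards standard K8s scheduling constraints to the OSD pods. The fragment below is a hedged sketch, not the configuration used in this deck: it assumes nodes carry a rack label such as topology.rook.io/rack and that OSD pods carry the app: rook-ceph-osd label.

placement:
  osd:
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.rook.io/rack     # assumed rack label on nodes
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: rook-ceph-osd                 # spread OSD pods evenly across racks
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname    # and across individual nodes
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: rook-ceph-osd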
1. Rook creates a PVC resource for an OSD
2. K8s binds a PV resource to this PVC resource
3. Rook creates an OSD on top of the block device corresponding to the PV
▌It’s suspended if the PV resource isn’t available

kind: CephCluster
…
storage:
  …
  count: 3

(Diagram: a Ceph cluster with two OSDs running and a third OSD not created yet, because its volume has not been provisioned by the storage provisioner.)
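With OSDs on PVCs, the storage section typically uses storageClassDeviceSets; a minimal sketch follows, where the set name, device size, and StorageClass are assumptions for illustration.

storage:
  storageClassDeviceSets:
    - name: ssd-set                            # hypothetical device-set name
      count: 3                                 # one PVC (and one OSD) per count
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            resources:
              requests:
                storage: 1Ti                   # assumed size of each OSD device
            storageClassName: my-storage-class # assumption: a provisioner-backed StorageClass
            volumeMode: Block                  # OSDs consume raw block devices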
the corresponding LVM logical volumes are created on OSD creation
▌HDD
⚫Use local persistent volumes
⚫Create PVs for all HDDs when deploying nodes
⚫PV resources are already available on OSD creation
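A minimal sketch of a local PV for one HDD on one node, created ahead of time so it is already available when Rook asks for an OSD volume; the device path, node name, capacity, and StorageClass name are assumptions for illustration.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: node1-sdb                          # hypothetical: one PV per HDD per node
spec:
  capacity:
    storage: 4Ti                           # assumed HDD size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-hdd              # assumption: a StorageClass with no dynamic provisioner
  volumeMode: Block                        # expose the raw disk to the OSD
  local:
    path: /dev/sdb                         # assumed device path of the HDD
  nodeAffinity:                            # local PVs must be pinned to the node that owns the disk
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1                    # hypothetical node name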
Rook and Ceph projects every day
⚫Watch the important bugfix PRs and backport them if necessary
▌e.g. a data-loss bug in RGW (PR#49795)
⚫Some objects might be lost on bucket index resharding
⚫Postpone resharding operations until the upgrade to >= v17.2.6
the long-term maintenance cost
⚫ Make both communities better
▌Major contributions
⚫ Implemented Rook’s advanced configurations
⚫ I’ve been working as a Rook maintainer
⚫ Resolved some problems in containerized Ceph clusters
⚫ https://speakerdeck.com/sat/revealing-bluestore-corruption-bugs-in-containerized-ceph-clusters
and we can’t wait for the next release, we want to use our custom containers
⚫ Official release + critical patches
▌Have been trying to run Teuthology in our test environment to verify custom containers
⚫ Succeeded in running all tests, but most of them still fail
⚫ We’ll continue to fix this problem