integration tests in VM-based test environments
  ⚫ Each test environment emulates the whole data center, including two Rook/Ceph clusters
2. If these tests pass, apply the changes to the on-premise staging system
3. If the staging system works fine, apply the changes to the on-premise production system
Rook/Ceph clusters (more than 10 OSDs in total)
  ⚫ Restarts all nodes (and therefore all OSDs)
  ⚫ This test runs more than 10 times a day
▌ Restart all nodes of the staging environment once per week
  ⚫ To verify our infrastructure’s availability and to update firmware
  ⚫ The staging environment has about 100 OSDs
Rook
  ⚫ https://tracker.ceph.com/issues/51034
▌ Failed to prepare a new OSD
▌ Due to the inconsistency of write I/O
▌ Fixed by
  ⚫ https://github.com/ceph/ceph/pull/42424
  ⚫ https://tracker.ceph.com/issues/57507
▌ BlueFS corruption just after restarting an OSD
▌ Due to a collision between BlueFS and BlueStore deferred writes
▌ Probably fixed by
  ⚫ https://github.com/ceph/ceph/pull/46890
first time
▌ Most bugs were detected in OSDs on HDD
  ⚫ HDDs seem to be good at exposing these kinds of bugs
▌ All bugs can be reproduced by stress tests
  ⚫ Creating/restarting OSDs continuously
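A stress test of this kind could be sketched as a simple round-robin restart loop. This is only an illustration, not the tooling used in the talk: `stress_restart`, `restart`, and `is_healthy` are hypothetical names, and the two callables stand in for whatever mechanism actually restarts an OSD and checks that it came back (they are faked here so the sketch runs without a cluster).

```python
import itertools
from typing import Callable, Iterable, Optional


def stress_restart(
    osd_ids: Iterable[int],
    restart: Callable[[int], None],      # hypothetical hook: restarts one OSD
    is_healthy: Callable[[int], bool],   # hypothetical hook: True if the OSD came back
    max_restarts: int,
) -> Optional[int]:
    """Restart OSDs round-robin, up to max_restarts times in total.

    Returns the id of the first OSD that does not come back healthy,
    or None if every restart succeeded.
    """
    ids = list(osd_ids)
    for osd_id in itertools.islice(itertools.cycle(ids), max_restarts):
        restart(osd_id)
        if not is_healthy(osd_id):
            return osd_id
    return None


if __name__ == "__main__":
    # Fake hooks for demonstration only; a real test would call into the cluster.
    restarted = []
    failed = stress_restart(
        osd_ids=[0, 1, 2],
        restart=restarted.append,
        is_healthy=lambda osd_id: True,
        max_restarts=7,
    )
    print(restarted, failed)
```

Stopping at the first unhealthy OSD keeps the failed state available for inspection, which matters when the bug only shows up right after a restart.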
  ⚫ The records of “pulpito.ceph.com” during Sep. 2022
▌ OSD creation: > 120000/month
  ⚫ ~30000 jobs * ~4 OSDs per test
▌ OSD restart: < 500/month
  ⚫ Only happens in upgrade tests
  ⚫ The number of upgrade test cases is small
  ⚫ Usually only run before new releases
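The arithmetic behind these figures can be checked quickly. The numbers below are the approximate values from the slide; the variable names are only for illustration, and since the restart count is an upper bound, the resulting ratio is a lower bound.

```python
# Approximate figures quoted on the slide (pulpito.ceph.com, Sep. 2022).
jobs_per_month = 30_000          # "~30000 jobs"
osds_per_job = 4                 # "~4 OSDs per test"
restarts_per_month_bound = 500   # "< 500/month" (upper bound)

creations_per_month = jobs_per_month * osds_per_job
# Because restarts are below 500/month, creations outnumber restarts
# by at least this factor.
ratio = creations_per_month / restarts_per_month_bound

print(creations_per_month)  # 120000
print(ratio)                # 240.0
```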
have both HDD and NVMe SSD
  ⚫ https://wiki.sepia.ceph.com/doku.php?id=hardware:infrastructure
▌ I couldn’t find whether the official QA uses HDD for OSD or not
  ⚫ Please let me know if you know anything about it
be nice to detect BlueFS corruption bugs just after restarting OSD
▌ If we use 10 OSDs and restarting one OSD takes 3 minutes, the number of OSD restarts is ~5000 per day
  ⚫ Ref. 3000/month (Cybozu) and 500/month (the official QA)
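The ~5000 per day estimate can be reproduced with a short calculation. This sketch assumes that all OSDs restart in parallel and that each one restarts again as soon as the previous restart finishes; the slide does not state this explicitly. `restarts_per_day` is a hypothetical helper name.

```python
def restarts_per_day(num_osds: int, minutes_per_restart: float) -> int:
    """Estimate OSD restarts in 24 hours.

    Assumes every OSD restarts back to back, in parallel with the others.
    """
    minutes_per_day = 24 * 60
    return int(num_osds * minutes_per_day / minutes_per_restart)


per_day = restarts_per_day(num_osds=10, minutes_per_restart=3)
print(per_day)  # 4800, i.e. roughly 5000 per day
```

Under these assumptions, a single day of such a test performs more restarts than the monthly reference figures of 3000 (Cybozu) and 500 (the official QA).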
many BlueFS corruption bugs
▌ The key factors
  ⚫ The frequency of creating/restarting OSDs
  ⚫ OSD on HDD
▌ The official QA would be better if it
  ⚫ Added a stress test that restarts OSDs
  ⚫ Used HDD for OSDs (if it doesn’t already)