Slide 1

Revealing BlueStore Corruption Bugs in Containerized Ceph Clusters Nov. 11th, 2022 Cybozu, Inc. Satoru Takeuchi 1

Slide 2

Table of Contents 2 ▌Cybozu and our infrastructure ▌What kind of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion

Slide 3

Table of Contents 3 ▌Cybozu and our infrastructure ▌What kind of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion

Slide 4

What is Cybozu ▌A leading cloud service provider in Japan ▌Providing software that supports teamwork 4

Slide 5

Cybozu’s infrastructure ▌The current infrastructure is a traditional VM-based system ▌Developing a new modern containerized infrastructure ⚫Using Kubernetes ⚫Using Rook/Ceph as storage 5

Slide 6

The characteristics of modern containerized systems ▌Easier to deploy and manage than traditional systems ▌Restart both nodes and containers frequently ▌Usually run integration tests per PR ⚫Create/restart containers during tests 6

Slide 7

The characteristics of the development of our infrastructure ▌Create Ceph clusters frequently ▌Create and restart OSDs frequently 7

Slide 8

Development strategy 1. Every change (PR) to our infrastructure triggers integration tests in VM-based test environments ⚫Each test environment emulates the whole data center, with two Rook/Ceph clusters 2. If these tests pass, apply the changes to the on-premises staging system 3. If the staging system works fine, apply the changes to the on-premises production system 8

Slide 9

When we create/restart OSDs ▌For each integration test ⚫Creates two Rook/Ceph clusters (> 10 OSDs in total) ⚫Restarts all nodes (which implies restarting all OSDs) ⚫This test runs > 10 times a day ▌Restart all nodes of the staging environment once per week ⚫To verify our infrastructure’s availability and to update firmware ⚫The staging environment has about 100 OSDs 9

Slide 10

# of OSD creation/restart ▌Estimated values from test/operation logs ⚫~3000/month each for OSD creation and OSD restart ▌I believe this is far larger than in traditional, non-containerized Ceph clusters 10
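The ~3000/month figure can be re-derived from the numbers on the previous slides. A minimal sketch, using only the estimates stated in this deck (treating "> 10 OSDs" and "> 10 runs/day" as 10, and a month as 30 days):

```python
# Rough re-derivation of the "~3000/month" estimates from the
# previous slides. All inputs are this deck's own stated numbers.

OSDS_PER_TEST = 10       # two Rook/Ceph clusters, > 10 OSDs in total
TEST_RUNS_PER_DAY = 10   # the integration test runs > 10 times a day
DAYS_PER_MONTH = 30

# Each test run creates its OSDs and later restarts all nodes (and
# therefore all OSDs), so creations and restarts scale identically.
ci_creations_per_month = OSDS_PER_TEST * TEST_RUNS_PER_DAY * DAYS_PER_MONTH
ci_restarts_per_month = ci_creations_per_month

# The staging environment (~100 OSDs) is restarted once per week.
staging_restarts_per_month = 100 * 4

print(ci_creations_per_month)                              # 3000
print(ci_restarts_per_month + staging_restarts_per_month)  # 3400
```

Both totals land at or slightly above 3000/month, consistent with the estimate from the test/operation logs.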

Slide 11

Table of Contents 11 ▌Cybozu and our infrastructure ▌What kind of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion

Slide 12

The bugs we have revealed ▌3 bugs: failed to create a new OSD ▌2 bugs: BlueFS corruption just after restarting an OSD 12

Slide 13

List of bugs (1/5) ▌osd: failed to initialize OSD in Rook ⚫https://tracker.ceph.com/issues/51034 ▌Failed to prepare a new OSD ▌Due to the inconsistency of write I/O ▌Fixed by ⚫ https://github.com/ceph/ceph/pull/42424 13

Slide 14

List of bugs (2/5) ▌OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error ⚫https://tracker.ceph.com/issues/54019 ▌Failed to prepare a new OSD ▌Not fixed yet 14

Slide 15

List of bugs (3/5) ▌failed to start new osd due to SIGSEGV in BlueStore::read() ⚫https://tracker.ceph.com/issues/53184 ▌Succeeded in preparing a new OSD, but failed to start it ▌Not fixed yet 15

Slide 16

List of bugs (4/5) ▌rocksdb crushed due to checksum mismatch ⚫https://tracker.ceph.com/issues/57507 ▌BlueFS corruption just after restarting an OSD ▌Due to a collision between BlueFS and BlueStore deferred writes ▌Probably fixed by ⚫https://github.com/ceph/ceph/pull/46890 16

Slide 17

List of bugs (5/5) ▌bluefs corrupted in a OSD ⚫https://tracker.ceph.com/issues/48036 ▌BlueFS corruption just after restarting an OSD ▌Due to the lack of mutual exclusion in Rook ▌Fixed by ⚫https://github.com/rook/rook/pull/6793 17

Slide 18

Additional information ▌We were the first to hit many of these bugs ▌Most bugs were detected in OSDs on HDD ⚫HDDs seem to be good for detecting these kinds of bugs ▌All bugs can be reproduced by stress tests ⚫Creating/restarting OSDs continuously 18

Slide 19

Table of Contents 19 ▌Cybozu and our infrastructure ▌What kind of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion

Slide 20

Questions ▌Why was Cybozu the first to hit these bugs? ▌Why hasn’t Ceph’s official QA process detected them? 20

Slide 21

My hypotheses ▌The number of OSD creations/restarts at Cybozu is far larger than in the official QA ▌The official QA hasn’t used HDDs for OSDs 21

Slide 22

# of OSD creation/restart in the official QA ▌Original data ⚫The records of “pulpito.ceph.com” during Sep. 2022 ▌OSD creation: > 120000/month ⚫~30000 jobs * ~4 OSDs per test ▌OSD restart: < 500/month ⚫Restarts only happen in upgrade tests ⚫# of upgrade test cases is small ⚫They usually run only before new releases 22

Slide 23

Comparing the numbers 23 ▌My hypotheses were… ⚫not correct about OSD creation ⚫correct about OSD restart

                           Cybozu    The official QA
# of OSD creations/month   > 3000    > 120000
# of OSD restarts/month    > 3000    < 500
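The comparison can be re-derived from the numbers on the previous slides; the Cybozu figures are the deck's own estimates from test/operation logs:

```python
# Re-deriving the official-QA creation figure from the
# pulpito.ceph.com records for Sep. 2022 cited on the previous slide,
# next to Cybozu's estimates.
qa_jobs_per_month = 30_000
qa_osds_per_test = 4
qa_creations_per_month = qa_jobs_per_month * qa_osds_per_test  # 120000

comparison = {
    #                        (Cybozu, official QA)
    "OSD creations/month": (3_000, qa_creations_per_month),
    "OSD restarts/month":  (3_000, 500),
}

# Creations: the official QA does ~40x more than Cybozu.
creation_ratio = qa_creations_per_month // 3_000   # 40
# Restarts: Cybozu does ~6x more than the official QA.
restart_ratio = 3_000 // 500                       # 6
```

This is why the restart count, not the creation count, is the differentiating factor.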

Slide 24

Does official QA use HDD? ▌Many machines in Sepia Lab have both HDDs and NVMe SSDs ⚫https://wiki.sepia.ceph.com/doku.php?id=hardware:infrastructure ▌I couldn’t find whether the official QA uses HDDs for OSDs or not ⚫Please let me know if you know anything about it 24

Slide 25

Table of Contents 25 ▌Cybozu and our infrastructure ▌What kind of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion

Slide 26

Proposal (1/2) ▌Add a stress test that restarts OSDs ▌This would help detect BlueFS corruption bugs that appear just after an OSD restart ▌With 10 OSDs, each restarted continuously at 3 minutes per restart, that is ~5000 OSD restarts per day ⚫Cf. ~3000/month (Cybozu) and < 500/month (the official QA) 26
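A minimal sketch of the proposed stress test, assuming a Rook/Ceph cluster where deleting an OSD pod makes its Deployment recreate it (restarting the OSD); the namespace, the `ceph-osd-id` label selector, and the 3-minute cycle time are assumptions, not part of any existing QA suite:

```python
# Sketch of an OSD-restart stress loop for a Rook/Ceph cluster.
# ASSUMPTIONS: Rook runs in namespace "rook-ceph" and labels OSD pods
# with "ceph-osd-id"; deleting a pod triggers its Deployment to start
# a replacement, which restarts the OSD.
import subprocess
import time

NAMESPACE = "rook-ceph"   # assumed Rook namespace
NUM_OSDS = 10
CYCLE_MINUTES = 3         # assumed time for one restart cycle

def restart_osd(osd_id: int) -> None:
    # Delete the OSD pod; Kubernetes recreates it from the Deployment.
    subprocess.run(
        ["kubectl", "-n", NAMESPACE, "delete", "pod",
         "-l", f"ceph-osd-id={osd_id}", "--wait=true"],
        check=True,
    )

def stress_loop() -> None:
    # Restart every OSD once per cycle, forever; a real test would
    # also verify that each OSD returns to "up" before the next cycle.
    while True:
        for osd_id in range(NUM_OSDS):
            restart_osd(osd_id)
        time.sleep(CYCLE_MINUTES * 60)

# One restart of each of the 10 OSDs every 3 minutes gives
# 10 * (24 * 60 / 3) = 4800 restarts/day -- the "~5000" above.
restarts_per_day = NUM_OSDS * (24 * 60 // CYCLE_MINUTES)
```

Even a much slower cycle would still exceed the current < 500 restarts/month in the official QA by orders of magnitude.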

Slide 27

Proposal (2/2) ▌If the official QA hasn’t used HDDs for OSDs, using them would improve it ▌It would also make the OSD-restart stress test more effective 27

Slide 28

Table of Contents 28 ▌Cybozu and our infrastructure ▌What kind of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion

Slide 29

Conclusion ▌The development of Cybozu’s modern containerized Ceph clusters revealed many BlueFS corruption bugs ▌The key factors ⚫The frequency of creating/restarting OSDs ⚫OSDs on HDD ▌The official QA would be improved by ⚫Adding a stress test that restarts OSDs ⚫Using HDDs for OSDs (if they aren’t already used) 29

Slide 30

Thank you! 30