Revealing BlueStore Corruption Bugs in Containerized Ceph Clusters

My presentation slides at Ceph Virtual
https://ceph.io/en/community/events/2022/ceph-virtual/

Satoru Takeuchi

November 12, 2022


Transcript

  1. Revealing BlueStore Corruption Bugs
    in Containerized Ceph Clusters
    Nov. 11th, 2022
    Cybozu, Inc.
    Satoru Takeuchi

  2. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  3. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  4. What is Cybozu
    ▌A leading cloud service provider in Japan
    ▌Providing software that supports teamwork


  5. Cybozu’s infrastructure
    ▌The current infrastructure is a traditional VM-based system
    ▌Developing a new, modern containerized infrastructure
    ⚫Using Kubernetes
    ⚫Using Rook/Ceph as storage


  6. The characteristics of modern containerized systems
    ▌Easier to deploy/manage than traditional systems
    ▌Restart both nodes and containers frequently
    ▌Usually run integration tests per PR
    ⚫Create/restart containers during tests


  7. The characteristics of the development of our infrastructure
    ▌Create Ceph clusters frequently
    ▌Create and restart OSDs frequently


  8. Development strategy
    1. Every change (PR) to our infrastructure kicks off integration tests in VM-based test environments
    ⚫Each test environment emulates the whole data center, with two Rook/Ceph clusters
    2. If these tests pass, apply the changes to the on-premises staging system
    3. If the staging system works fine, apply the changes to the on-premises production system
    (A rough sketch of this flow follows below.)
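
    The promotion flow can be pictured as a short pipeline. This is only a minimal sketch, assuming hypothetical wrapper commands; `run-integration-tests`, `apply-to-staging`, and `apply-to-production` are placeholders, not Cybozu's actual tooling:

```python
import subprocess
import sys

# Hypothetical per-PR promotion pipeline; the three commands below are
# placeholders, not Cybozu's real tooling.
STAGES = [
    ["run-integration-tests"],   # VM-based env emulating the data center
    ["apply-to-staging"],        # on-premises staging system
    ["apply-to-production"],     # on-premises production system
]

for cmd in STAGES:
    # A change only reaches the next stage if the previous one succeeded.
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"stage '{cmd[0]}' failed; the change is not promoted further")
```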


  9. When we create/restart OSDs
    ▌For each integration test
    ⚫Creates two Rook/Ceph clusters (> 10 OSDs in total)
    ⚫Restarts all nodes (which implies restarting all OSDs)
    ⚫This test runs > 10 times a day
    ▌Restart all nodes of the staging environment once per week
    ⚫To verify our infrastructure's availability and to update firmware
    ⚫The staging environment has about 100 OSDs


  10. # of OSD creation/restart
    ▌Estimated values from test/operation logs
    ⚫~3,000/month each for OSD creation and OSD restart
    (see the back-of-the-envelope check below)
    ▌I believe this is far more than in traditional, non-containerized Ceph clusters
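
    A back-of-the-envelope check of this estimate, using only the figures from the previous slide (the exact numbers in our logs differ slightly):

```python
# Rough check of the ~3,000/month estimate, using the slide-9 figures.
osds_per_test_env = 10   # two Rook/Ceph clusters, > 10 OSDs in total
test_runs_per_day = 10   # the integration test runs > 10 times a day
staging_osds = 100       # staging OSDs, all restarted once per week

osd_creations_per_month = osds_per_test_env * test_runs_per_day * 30
osd_restarts_per_month = osd_creations_per_month + staging_osds * 4

print(osd_creations_per_month)  # 3000 creations/month from the tests alone
print(osd_restarts_per_month)   # ~3400 restarts/month (tests + weekly staging restarts)
```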


  11. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  12. The bugs we have revealed
    ▌3 bugs: failed to create a new OSD
    ▌2 bugs: BlueFS corruption just after restarting an OSD


  13. List of bugs (1/5)
    ▌osd: failed to initialize OSD in Rook
    ⚫https://tracker.ceph.com/issues/51034
    ▌Failed to prepare a new OSD
    ▌Due to the inconsistency of write I/O
    ▌Fixed by
    ⚫ https://github.com/ceph/ceph/pull/42424

  14. List of bugs (2/5)
    ▌OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
    ⚫https://tracker.ceph.com/issues/54019
    ▌Failed to prepare a new OSD
    ▌Not fixed yet

  15. List of bugs (3/5)
    ▌failed to start new osd due to SIGSEGV in BlueStore::read()
    ⚫https://tracker.ceph.com/issues/53184
    ▌Succeeded in preparing a new OSD but failed to start it
    ▌Not fixed yet


  16. List of bugs (4/5)
    ▌rocksdb crashed due to checksum mismatch
    ⚫https://tracker.ceph.com/issues/57507
    ▌BlueFS corruption just after restarting an OSD
    ▌Due to a collision between BlueFS and BlueStore deferred writes
    ▌Probably fixed by
    ⚫https://github.com/ceph/ceph/pull/46890


  17. List of bugs (5/5)
    ▌bluefs corrupted in an OSD
    ⚫https://tracker.ceph.com/issues/48036
    ▌BlueFS corruption just after restarting an OSD
    ▌Due to the lack of mutual exclusion in Rook
    ▌Fixed by
    ⚫https://github.com/rook/rook/pull/6793


  18. Additional information
    ▌We hit many of these bugs for the first time
    ▌Most bugs were detected in OSDs on HDD
    ⚫HDDs seem to be good for detecting these kinds of bugs
    ▌All bugs can be reproduced by stress tests
    ⚫Creating/restarting OSDs continuously
    (a minimal reproduction sketch follows below)
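
    A minimal sketch of such a restart stress test against a Rook cluster, not the exact script we run. It assumes the default `rook-ceph` namespace, Rook's `rook-ceph-osd-<id>` deployment naming, a `rook-ceph-tools` toolbox deployment, and OSD IDs 0-9:

```python
import subprocess
import time

NAMESPACE = "rook-ceph"   # default Rook namespace; adjust for your cluster
OSD_IDS = range(10)       # assumes OSDs 0..9 exist

def sh(*args: str) -> str:
    """Run a command, returning stdout and raising if it fails."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

def restart_osd(osd_id: int) -> None:
    deploy = f"deployment/rook-ceph-osd-{osd_id}"
    sh("kubectl", "-n", NAMESPACE, "rollout", "restart", deploy)
    sh("kubectl", "-n", NAMESPACE, "rollout", "status", deploy, "--timeout=10m")

def ceph_health() -> str:
    # Run `ceph health` inside the Rook toolbox deployment.
    return sh("kubectl", "-n", NAMESPACE, "exec", "deploy/rook-ceph-tools",
              "--", "ceph", "health").strip()

while True:
    for osd_id in OSD_IDS:
        restart_osd(osd_id)
        time.sleep(60)  # let recovery settle before the next restart
        print(f"osd.{osd_id} restarted, health: {ceph_health()}")
        # BlueFS/BlueStore corruption typically shows up here as an OSD pod
        # that keeps crashing; stop the loop and inspect its log if so.
```

    Stressing OSD creation in the same way needs extra scaffolding (wiping the disks and re-running the Rook prepare job), so it is omitted from this sketch.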


  19. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  20. Questions
    ▌Why was Cybozu the first to hit these bugs?
    ▌Why hasn't Ceph's official QA process detected these bugs?


  21. My hypotheses
    ▌The number of OSD creations/restarts at Cybozu is far larger than in the official QA
    ▌The official QA hasn't used HDDs for OSDs


  22. # of OSD creation/restart in the official QA
    ▌Original data
    ⚫The records of “pulpito.ceph.com” during Sep. 2022
    ▌OSD creation: > 120,000/month
    ⚫~30,000 jobs × ~4 OSDs per test
    ▌OSD restart: < 500/month
    ⚫OSD restarts only happen in upgrade tests
    ⚫The number of upgrade test cases is small
    ⚫They usually run only before new releases


  23. Comparing the numbers
    ▌My hypotheses were…
    ⚫not correct about OSD creation
    ⚫correct about OSD restart
                                  Cybozu     The Official QA
    # of OSD creations/month      > 3,000    > 120,000
    # of OSD restarts/month       > 3,000    < 500


  24. Does the official QA use HDD?
    ▌Many machines in Sepia Lab have both HDD and NVMe SSD
    ⚫https://wiki.sepia.ceph.com/doku.php?id=hardware:infrastructure
    ▌I couldn't find whether the official QA uses HDD for OSDs or not
    ⚫Please let me know if you know anything about it


  25. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  26. Proposal (1/2)
    ▌Add stress tests that restart OSDs
    ▌These would help detect BlueFS corruption bugs that occur just after restarting an OSD
    ▌With 10 OSDs and ~3 minutes per restart, that is ~5,000 OSD restarts per day
    (see the rough calculation below)
    ⚫Ref.: ~3,000/month (Cybozu) and < 500/month (the official QA)
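
    The arithmetic behind the ~5,000/day figure, assuming the 10 OSDs are restarted back-to-back and independently of one another:

```python
# How ~5,000 restarts/day follows from 10 OSDs and ~3 minutes per restart.
osds = 10
minutes_per_restart = 3
restarts_per_day = (24 * 60 // minutes_per_restart) * osds
print(restarts_per_day)  # 4800, i.e. roughly 5,000 OSD restarts per day
```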


  27. Proposal (2/2)
    ▌If the official QA hasn't used HDDs for OSDs, using them would improve the official QA
    ▌It would also make the OSD-restart stress test more effective


  28. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  29. Conclusion
    ▌The development of Cybozu's modern containerized Ceph clusters
    revealed many BlueFS corruption bugs
    ▌The key factors
    ⚫The frequency of creating/restarting OSDs
    ⚫OSDs on HDD
    ▌The official QA could be improved by
    ⚫Adding a stress test that restarts OSDs
    ⚫Using HDDs for OSDs (if they aren't used already)


  30. Thank you!