
Revealing BlueStore Corruption Bugs in Containerized Ceph Clusters

My presentation slides at Ceph Virtual
https://ceph.io/en/community/events/2022/ceph-virtual/

Satoru Takeuchi

November 12, 2022

Transcript

  1. Revealing BlueStore Corruption Bugs in Containerized Ceph Clusters Nov. 11th,

    2022 Cybozu, Inc. Satoru Takeuchi 1
  2. Table of Contents 2 ▌Cybozu and our infrastructure ▌What kind

    of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion
  3. Table of Contents 3 ▌Cybozu and our infrastructure ▌What kind

    of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion
  4. What is Cybozu ▌A leading cloud service provider in Japan

    ▌Providing software that supports teamwork 4
  5. Cybozu’s infrastructure ▌The current infrastructure is a traditional VM-based system

    ▌Developing a new modern containerized infrastructure ⚫Using Kubernetes ⚫Using Rook/Ceph as storage 5
  6. The characteristics of modern containerized systems ▌Easier to deploy/manage than

    traditional systems ▌Restart both nodes and containers frequently ▌Usually run integration tests per PR ⚫Create/restart containers during tests 6
  7. The characteristics of the development of our infrastructure ▌Create Ceph

    clusters frequently ▌Create and restart OSDs frequently 7
  8. Development strategy 1. Every change (PR) to our infrastructure kicks off

    integration tests in VM-based test environments ⚫Each test environment emulates a whole data center with two Rook/Ceph clusters 2. If these tests pass, apply the changes to the on-premises staging system 3. If the staging system works fine, apply the changes to the on-premises production system 8
  9. When we create/restart OSDs ▌For each integration test ⚫Creates two

    Rook/Ceph clusters (> 10 OSDs in total) ⚫Restarts all nodes (which implies restarting all OSDs) ⚫This test runs > 10 times a day ▌Restart all nodes of the staging environment once per week ⚫To verify our infrastructure’s availability and update firmware ⚫The staging environment has about 100 OSDs 9
  10. # of OSD creation/restart ▌Estimated values from test/operation logs ⚫3000/month

    for both OSD creation and OSD restart ▌I believe this is a far larger number than in traditional non-containerized Ceph clusters 10
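
    A rough back-of-the-envelope sketch in Python of how the 3000/month figure follows from the numbers quoted on the previous slides; the actual values were taken from test/operation logs, so the constants below are only slide-level approximations, not measured data.

        # Rough check of the "3000/month" estimate, using only numbers quoted
        # on the previous slides (the real figures come from logs).
        OSDS_PER_TEST = 10               # two Rook/Ceph clusters, > 10 OSDs in total
        TEST_RUNS_PER_DAY = 10           # the integration test runs > 10 times a day
        DAYS_PER_MONTH = 30
        STAGING_OSDS = 100               # the staging environment has about 100 OSDs
        STAGING_RESTARTS_PER_MONTH = 4   # all staging nodes are restarted weekly

        creations_per_month = OSDS_PER_TEST * TEST_RUNS_PER_DAY * DAYS_PER_MONTH
        restarts_per_month = creations_per_month + STAGING_OSDS * STAGING_RESTARTS_PER_MONTH

        print(f"OSD creations/month: > {creations_per_month}")  # > 3000
        print(f"OSD restarts/month:  > {restarts_per_month}")   # > 3400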
  11. Table of Contents 11 ▌Cybozu and our infrastructure ▌What kind

    of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion
  12. The bugs we have revealed ▌3 bugs: failed to create

    a new OSD ▌2 bugs: bluefs corruption just after restarting an OSD 12
  13. List of bugs (1/5) ▌osd: failed to initialize OSD in

    Rook ⚫https://tracker.ceph.com/issues/51034 ▌Failed to prepare a new OSD ▌Due to the inconsistency of write I/O ▌Fixed by ⚫ https://github.com/ceph/ceph/pull/42424 13
  14. List of bugs (2/5) ▌OSD::mkfs: ObjectStore::mkfs failed with error (5)

    Input/output error ⚫https://tracker.ceph.com/issues/54019 ▌Failed to prepare a new OSD ▌Not fixed yet 14
  15. List of bugs (3/5) ▌failed to start new osd due

    to SIGSEGV in BlueStore::read() ⚫https://tracker.ceph.com/issues/53184 ▌Succeeded in preparing a new OSD but failed to start it ▌Not fixed yet 15
  16. List of bugs (4/5) ▌rocksdb crashed due to checksum mismatch

    ⚫https://tracker.ceph.com/issues/57507 ▌Bluefs corruption just after restarting an OSD ▌Due to a collision between BlueFS and BlueStore deferred writes ▌Probably fixed by ⚫https://github.com/ceph/ceph/pull/46890 16
  17. List of bugs (5/5) ▌bluefs corrupted in an OSD ⚫https://tracker.ceph.com/issues/48036

    ▌bluefs corruption just after restarting an OSD ▌Due to the lack of mutual exclusion in Rook ▌Fixed by ⚫https://github.com/rook/rook/pull/6793 17
  18. Additional information ▌We were the first to hit many of

    these bugs ▌Most bugs were detected on OSDs on HDD ⚫HDD seems to be good for detecting these kinds of bugs ▌All bugs can be reproduced by stress tests ⚫Creating/restarting OSDs continuously 18
  19. Table of Contents 19 ▌Cybozu and our infrastructure ▌What kind

    of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion
  20. Questions ▌Why was Cybozu the first to hit these bugs?

    ▌Why hasn’t Ceph’s official QA process detected these bugs? 20
  21. My hypotheses ▌The number of OSD creations/restarts at Cybozu is

    far larger than in the official QA ▌The official QA hasn’t used HDD for OSDs 21
  22. # of OSD creation/restart in the official QA ▌Original data

    ⚫The records of “pulpito.ceph.com” during Sep. 2022 ▌OSD creation: > 120000/month ⚫~30000 jobs * ~4 OSDs per test ▌OSD restart: < 500/month ⚫Restarts only happen in upgrade tests ⚫# of upgrade test cases is small ⚫They usually run only before new releases 22
  23. Comparing the numbers 23 ▌My hypotheses were… ⚫not correct about

    OSD creation ⚫correct about OSD restart

                                Cybozu    The Official QA
    # of OSD creation/month     > 3000    > 120000
    # of OSD restart/month      > 3000    < 500
  24. Does the official QA use HDD? ▌Many machines in Sepia Lab

    have both HDD and NVMe SSD ⚫https://wiki.sepia.ceph.com/doku.php?id=hardware:infrastructure ▌I couldn’t find whether the official QA uses HDD for OSDs or not ⚫Please let me know if you know anything about it 24
  25. Table of Contents 25 ▌Cybozu and our infrastructure ▌What kind

    of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion
  26. Proposal (1/2) ▌Add stress tests for restarting OSDs ▌This would

    help detect bluefs corruption bugs that appear just after restarting an OSD ▌If we use 10 OSDs and restarting one OSD takes 3 minutes, restarting them in parallel gives ~5000 OSD restarts per day ⚫Cf. 3000/month (Cybozu) and 500/month (the Official QA) 26
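
    A minimal sketch of such a restart stress loop, written in Python against a Rook cluster; this is an assumption-laden illustration, not something from the talk. It assumes kubectl access, Rook's default rook-ceph namespace, and Rook's default OSD pod label app=rook-ceph-osd; deleting an OSD pod makes its owning Deployment recreate it, which restarts that OSD.

        # Bug-hunting stress loop: restart all OSD pods continuously.
        # With 10 OSDs and a 3-minute cycle this gives roughly
        # 10 * (24 * 60 / 3) ~= 4800 restarts per day, i.e. the ~5000/day above.
        # Assumptions: rook-ceph namespace, app=rook-ceph-osd label, kubectl access.
        import subprocess
        import time

        NAMESPACE = "rook-ceph"    # assumed Rook namespace
        RESTART_PERIOD_SEC = 180   # ~3 minutes per restart cycle

        def osd_pods():
            """Return the names of all OSD pods in the cluster."""
            out = subprocess.run(
                ["kubectl", "-n", NAMESPACE, "get", "pods",
                 "-l", "app=rook-ceph-osd", "-o", "name"],
                check=True, capture_output=True, text=True).stdout
            return out.split()

        while True:
            # Delete every OSD pod without waiting, so all OSDs restart in parallel;
            # the Deployments bring the pods back up automatically.
            for pod in osd_pods():
                subprocess.run(["kubectl", "-n", NAMESPACE, "delete", pod,
                                "--wait=false"], check=True)
            time.sleep(RESTART_PERIOD_SEC)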
  27. Proposal (2/2) ▌If the official QA hasn’t used HDD for OSDs,

    using them would improve the official QA ▌It would also make the stress test of restarting OSDs better 27
  28. Table of Contents 28 ▌Cybozu and our infrastructure ▌What kind

    of bugs we have revealed ▌Why we hit these bugs ▌Proposals to improve the official QA process ▌Conclusion
  29. Conclusion ▌The development of Cybozu’s modern containerized Ceph clusters revealed

    many bluefs corruption bugs ▌The key factors ⚫The frequency of creating/restarting OSDs ⚫OSDs on HDD ▌The official QA could be improved by ⚫Adding a stress test of restarting OSDs ⚫Using HDD for OSDs (if it hasn’t been used) 29
  30. Thank you! 30