Revealing BlueStore Corruption Bugs in Containerized Ceph Clusters

My presentation slides at Ceph Virtual
https://ceph.io/en/community/events/2022/ceph-virtual/

Satoru Takeuchi

November 12, 2022

Transcript

  1. Table of Contents
     ▌Cybozu and our infrastructure
     ▌What kind of bugs we have revealed
     ▌Why we hit these bugs
     ▌Proposals to improve the official QA process
     ▌Conclusion
  3. What is Cybozu
     ▌A leading cloud service provider in Japan
     ▌Providing software that supports teamwork
  4. Cybozu’s infrastructure
     ▌The current infrastructure is a traditional VM-based system
     ▌Developing a new modern containerized infrastructure
       ⚫Using Kubernetes
       ⚫Using Rook/Ceph as storage
  5. The characteristics of modern containerized systems
     ▌Easier to deploy/manage than traditional systems
     ▌Both nodes and containers are restarted frequently
     ▌Integration tests usually run per PR
       ⚫Containers are created/restarted during the tests
  6. The characteristics of the development of our infrastructure
     ▌Create Ceph clusters frequently
     ▌Create and restart OSDs frequently
  7. Development strategy
     1. Every change (PR) to our infrastructure kicks off integration tests in VM-based test environments
        ⚫Each test environment emulates the whole data center, which has two Rook/Ceph clusters
     2. If the tests pass, apply the changes to the on-premise staging system
     3. If the staging system works fine, apply the changes to the on-premise production system
  8. When we create/restart OSDs
     ▌For each integration test
       ⚫Creates two Rook/Ceph clusters (more than 10 OSDs in total)
       ⚫Restarts all nodes (and therefore all OSDs)
       ⚫This test runs more than 10 times a day
     ▌Restart all nodes of the staging environment once per week
       ⚫To verify our infrastructure’s availability and to update firmware
       ⚫The staging environment has about 100 OSDs
  9. Number of OSD creations/restarts
     ▌Estimated values from test/operation logs
       ⚫3000/month each for creating and restarting OSDs
     ▌I believe these numbers are far larger than those of traditional non-containerized Ceph clusters
  11. The bugs we have revealed
      ▌3 bugs: failed to create a new OSD
      ▌2 bugs: BlueFS corruption just after restarting an OSD
  12. List of bugs (1/5)
      ▌osd: failed to initialize OSD in Rook
        ⚫https://tracker.ceph.com/issues/51034
      ▌Failed to prepare a new OSD
      ▌Due to the inconsistency of write I/O
      ▌Fixed by
        ⚫https://github.com/ceph/ceph/pull/42424
  13. List of bugs (2/5)
      ▌OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
        ⚫https://tracker.ceph.com/issues/54019
      ▌Failed to prepare a new OSD
      ▌Not fixed yet
  14. List of bugs (3/5)
      ▌failed to start new osd due to SIGSEGV in BlueStore::read()
        ⚫https://tracker.ceph.com/issues/53184
      ▌Preparing the new OSD succeeded, but starting it failed
      ▌Not fixed yet
  15. List of bugs (4/5)
      ▌rocksdb crushed due to checksum mismatch
        ⚫https://tracker.ceph.com/issues/57507
      ▌BlueFS corruption just after restarting an OSD
      ▌Due to a collision between BlueFS and BlueStore deferred writes
      ▌Probably fixed by
        ⚫https://github.com/ceph/ceph/pull/46890
  16. List of bugs (5/5)
      ▌bluefs corrupted in a OSD
        ⚫https://tracker.ceph.com/issues/48036
      ▌BlueFS corruption just after restarting an OSD
      ▌Due to the lack of mutual exclusion in Rook
      ▌Fixed by
        ⚫https://github.com/rook/rook/pull/6793
  17. Additional information
      ▌We were the first to hit many of these bugs
      ▌Most bugs were detected in OSDs on HDDs
        ⚫HDDs seem to be good for detecting these kinds of bugs
      ▌All bugs can be reproduced by stress tests
        ⚫Creating/restarting OSDs continuously
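A continuous-restart stress test like the one mentioned above could look roughly like the following. This is a minimal sketch and not the tooling used in the talk. It assumes a Rook cluster in the `rook-ceph` namespace whose OSD deployments are named `rook-ceph-osd-<id>` (Rook's usual defaults, but worth verifying in your cluster); the helper names `restart_commands` and `stress_restart` are hypothetical.

```python
import subprocess
import time

NAMESPACE = "rook-ceph"  # assumption: Rook's default namespace


def restart_commands(osd_id, namespace=NAMESPACE, timeout="300s"):
    """Build the kubectl commands that restart one OSD and wait for it."""
    # Assumption: OSD deployments follow Rook's usual naming scheme.
    target = f"deployment/rook-ceph-osd-{osd_id}"
    return [
        ["kubectl", "-n", namespace, "rollout", "restart", target],
        ["kubectl", "-n", namespace, "rollout", "status", target,
         f"--timeout={timeout}"],
    ]


def stress_restart(osd_ids, rounds, run=subprocess.run, pause=0.0):
    """Restart every OSD `rounds` times.

    Stops at the first command that exits non-zero (for example, an OSD
    that does not become ready again) and returns the number of restarts
    completed so far.
    """
    done = 0
    for _ in range(rounds):
        for osd_id in osd_ids:
            for cmd in restart_commands(osd_id):
                if run(cmd).returncode != 0:
                    return done
            done += 1
            time.sleep(pause)
    return done
```

Calling `stress_restart(range(10), rounds=100)` would attempt 1000 restarts. The `run` parameter is injectable so the loop can be dry-run with a fake runner before pointing it at a real cluster. An OSD that fails to come back is only a signal to inspect its logs; the sketch does not itself detect BlueFS corruption.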
  19. Questions
      ▌Why was Cybozu the first to hit these bugs?
      ▌Why hasn’t Ceph’s official QA process detected these bugs?
  20. My hypotheses
      ▌The number of OSD creations/restarts at Cybozu is far larger than in the official QA
      ▌The official QA hasn’t used HDDs for OSDs
  21. Number of OSD creations/restarts in the official QA
      ▌Original data
        ⚫The records of “pulpito.ceph.com” during Sep. 2022
      ▌OSD creation: > 120000/month
        ⚫~30000 jobs * ~4 OSDs per test
      ▌OSD restart: < 500/month
        ⚫Only happens in upgrade tests
        ⚫The number of upgrade test cases is small
        ⚫Upgrade tests usually run only before new releases
  22. Comparing the numbers
      ▌My hypotheses were…
        ⚫not correct about OSD creation
        ⚫correct about OSD restart

                                     Cybozu    The official QA
      # of OSD creations/month       > 3000    > 120000
      # of OSD restarts/month        > 3000    < 500
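To make the comparison concrete, the sketch below derives the official QA creation figure from the job count and computes the Cybozu-to-QA ratios. It treats the bounds from the slides ("> 3000", "< 500", and so on) as point values, which is an assumption, so the ratios are indicative only.

```python
# Figures from the slides, with bounds treated as point values (an assumption).
QA_JOBS_PER_MONTH = 30_000
QA_OSDS_PER_JOB = 4

cybozu = {"creations": 3000, "restarts": 3000}
official_qa = {
    "creations": QA_JOBS_PER_MONTH * QA_OSDS_PER_JOB,  # 120000
    "restarts": 500,
}


def cybozu_to_qa_ratio(metric):
    """Ratio of Cybozu's monthly count to the official QA's for one metric."""
    return cybozu[metric] / official_qa[metric]


print(cybozu_to_qa_ratio("creations"))  # 0.025
print(cybozu_to_qa_ratio("restarts"))   # 6.0
```

Under these assumptions the official QA creates about 40 times more OSDs than Cybozu, while Cybozu restarts OSDs at least about 6 times as often, which matches the conclusion that only the restart hypothesis holds.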
  23. Does the official QA use HDDs?
      ▌Many machines in Sepia Lab have both HDDs and NVMe SSDs
        ⚫https://wiki.sepia.ceph.com/doku.php?id=hardware:infrastructure
      ▌I couldn’t find out whether the official QA uses HDDs for OSDs or not
        ⚫Please let me know if you know anything about it
  25. Proposal (1/2)
      ▌Add stress tests that restart OSDs
      ▌It would be nice to detect BlueFS corruption bugs that appear just after restarting an OSD
      ▌If we use 10 OSDs and restarting one OSD takes 3 minutes, the number of OSD restarts is ~5000 per day
        ⚫Ref. 3000/month (Cybozu) and 500/month (the official QA)
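The ~5000/day figure follows from simple arithmetic, sketched below. It assumes the 10 OSDs are restarted in parallel and back-to-back for 24 hours, which the slide does not state explicitly; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def restarts_per_day(num_osds, minutes_per_restart):
    """Restarts per day if all OSDs restart in parallel, back-to-back."""
    restarts_per_osd = MINUTES_PER_DAY // minutes_per_restart
    return num_osds * restarts_per_osd


print(restarts_per_day(10, 3))  # 4800
```

So a single day of such a stress test would already exceed Cybozu's monthly restart count of about 3000.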
  26. Proposal (2/2)
      ▌If the official QA hasn’t used HDDs for OSDs, using them would improve the official QA
      ▌It would also make the stress test of restarting OSDs more effective
  28. Conclusion
      ▌The development of Cybozu’s modern containerized Ceph clusters revealed many BlueFS corruption bugs
      ▌The key factors
        ⚫The frequency of creating/restarting OSDs
        ⚫OSDs on HDDs
      ▌The official QA would be better if it
        ⚫Added a stress test of restarting OSDs
        ⚫Used HDDs for OSDs (if they haven’t been used)