Revealing BlueStore Corruption Bugs in Containerized Ceph Clusters

My presentation slides at Ceph Virtual
https://ceph.io/en/community/events/2022/ceph-virtual/

Satoru Takeuchi

November 12, 2022


Transcript

  1. Revealing BlueStore Corruption Bugs
    in Containerized Ceph Clusters
    Nov. 11th, 2022
    Cybozu, Inc.
    Satoru Takeuchi

  2. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  3. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  4. What is Cybozu
    ▌A leading cloud service provider in Japan
    ▌Providing software that supports teamwork


  5. Cybozu’s infrastructure
    ▌The current infrastructure is a traditional VM-based system
    ▌Developing a new, modern containerized infrastructure
    ⚫Using Kubernetes
    ⚫Using Rook/Ceph as storage


  6. The characteristics of modern containerized systems
    ▌Easier to deploy/manage than traditional systems
    ▌Restart both nodes and containers frequently
    ▌Usually run integration tests per PR
    ⚫Create/restart containers during tests


  7. The characteristics of the development of our infrastructure
    ▌Create Ceph clusters frequently
    ▌Create and restart OSDs frequently


  8. Development strategy
    1. Every change (PR) to our infrastructure kicks off integration tests in VM-based test environments
    ⚫Each test environment emulates the whole data center, with two Rook/Ceph clusters
    2. If these tests pass, apply the changes to the on-premises staging system
    3. If the staging system works fine, apply the changes to the on-premises production system
    (A rough sketch of this flow follows below.)
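
    The promotion flow can be pictured as a short pipeline. This is only a minimal sketch, assuming hypothetical wrapper commands; `run-integration-tests`, `apply-to-staging`, and `apply-to-production` are placeholders, not Cybozu's actual tooling:

```python
import subprocess
import sys

# Hypothetical per-PR promotion pipeline; the three commands below are
# placeholders, not Cybozu's real tooling.
STAGES = [
    ["run-integration-tests"],   # VM-based env emulating the data center
    ["apply-to-staging"],        # on-premises staging system
    ["apply-to-production"],     # on-premises production system
]

for cmd in STAGES:
    # A change only reaches the next stage if the previous one succeeded.
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"stage '{cmd[0]}' failed; the change is not promoted further")
```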


  9. When we create/restart OSDs
    ▌For each integration test
    ⚫Creates two Rook/Ceph clusters (> 10 OSDs in total)
    ⚫Restarts all nodes (which implies restarting all OSDs)
    ⚫This test runs > 10 times a day
    ▌Restart all nodes of the staging environment once per week
    ⚫To verify our infrastructure's availability and to update firmware
    ⚫The staging environment has about 100 OSDs


  10. # of OSD creation/restart
    ▌Estimated values from test/operation logs
    ⚫~3,000/month each for OSD creation and OSD restart
    (see the back-of-the-envelope check below)
    ▌I believe this is far more than in traditional, non-containerized Ceph clusters
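
    A back-of-the-envelope check of this estimate, using only the figures from the previous slide (the exact numbers in our logs differ slightly):

```python
# Rough check of the ~3,000/month estimate, using the slide-9 figures.
osds_per_test_env = 10   # two Rook/Ceph clusters, > 10 OSDs in total
test_runs_per_day = 10   # the integration test runs > 10 times a day
staging_osds = 100       # staging OSDs, all restarted once per week

osd_creations_per_month = osds_per_test_env * test_runs_per_day * 30
osd_restarts_per_month = osd_creations_per_month + staging_osds * 4

print(osd_creations_per_month)  # 3000 creations/month from the tests alone
print(osd_restarts_per_month)   # ~3400 restarts/month (tests + weekly staging restarts)
```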


  11. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  12. The bugs we have revealed
    ▌3 bugs: failed to create a new OSD
    ▌2 bugs: BlueFS corruption just after restarting an OSD


  13. List of bugs (1/5)
    ▌osd: failed to initialize OSD in Rook
    ⚫https://tracker.ceph.com/issues/51034
    ▌Failed to prepare a new OSD
    ▌Due to the inconsistency of write I/O
    ▌Fixed by
    ⚫ https://github.com/ceph/ceph/pull/42424

  14. List of bugs (2/5)
    ▌OSD::mkfs: ObjectStore::mkfs failed with error (5) Input/output error
    ⚫https://tracker.ceph.com/issues/54019
    ▌Failed to prepare a new OSD
    ▌Not fixed yet

  15. List of bugs (3/5)
    ▌failed to start new osd due to SIGSEGV in BlueStore::read()
    ⚫https://tracker.ceph.com/issues/53184
    ▌Succeeded in preparing a new OSD but failed to start it
    ▌Not fixed yet


  16. List of bugs (4/5)
    ▌rocksdb crashed due to checksum mismatch
    ⚫https://tracker.ceph.com/issues/57507
    ▌BlueFS corruption just after restarting an OSD
    ▌Due to a collision between BlueFS and BlueStore deferred writes
    ▌Probably fixed by
    ⚫https://github.com/ceph/ceph/pull/46890


  17. List of bugs (5/5)
    ▌bluefs corrupted in an OSD
    ⚫https://tracker.ceph.com/issues/48036
    ▌BlueFS corruption just after restarting an OSD
    ▌Due to the lack of mutual exclusion in Rook
    ▌Fixed by
    ⚫https://github.com/rook/rook/pull/6793


  18. Additional information
    ▌We hit many of these bugs for the first time
    ▌Most bugs were detected in OSDs on HDD
    ⚫HDDs seem to be good for detecting these kinds of bugs
    ▌All bugs can be reproduced by stress tests
    ⚫Creating/restarting OSDs continuously
    (a minimal reproduction sketch follows below)
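
    A minimal sketch of such a restart stress test against a Rook cluster, not the exact script we run. It assumes the default `rook-ceph` namespace, Rook's `rook-ceph-osd-<id>` deployment naming, a `rook-ceph-tools` toolbox deployment, and OSD IDs 0-9:

```python
import subprocess
import time

NAMESPACE = "rook-ceph"   # default Rook namespace; adjust for your cluster
OSD_IDS = range(10)       # assumes OSDs 0..9 exist

def sh(*args: str) -> str:
    """Run a command, returning stdout and raising if it fails."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

def restart_osd(osd_id: int) -> None:
    deploy = f"deployment/rook-ceph-osd-{osd_id}"
    sh("kubectl", "-n", NAMESPACE, "rollout", "restart", deploy)
    sh("kubectl", "-n", NAMESPACE, "rollout", "status", deploy, "--timeout=10m")

def ceph_health() -> str:
    # Run `ceph health` inside the Rook toolbox deployment.
    return sh("kubectl", "-n", NAMESPACE, "exec", "deploy/rook-ceph-tools",
              "--", "ceph", "health").strip()

while True:
    for osd_id in OSD_IDS:
        restart_osd(osd_id)
        time.sleep(60)  # let recovery settle before the next restart
        print(f"osd.{osd_id} restarted, health: {ceph_health()}")
        # BlueFS/BlueStore corruption typically shows up here as an OSD pod
        # that keeps crashing; stop the loop and inspect its log if so.
```

    Stressing OSD creation in the same way needs extra scaffolding (wiping the disks and re-running the Rook prepare job), so it is omitted from this sketch.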


  19. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  20. Questions
    ▌Why was Cybozu the first to hit these bugs?
    ▌Why hasn't Ceph's official QA process detected these bugs?


  21. My hypotheses
    ▌The number of OSD creations/restarts at Cybozu is far larger than in the official QA
    ▌The official QA hasn't used HDDs for OSDs


  22. # of OSD creation/restart in the official QA
    ▌Original data
    ⚫The records of “pulpito.ceph.com” during Sep. 2022
    ▌OSD creation: > 120,000/month
    ⚫~30,000 jobs × ~4 OSDs per test
    ▌OSD restart: < 500/month
    ⚫OSD restarts only happen in upgrade tests
    ⚫The number of upgrade test cases is small
    ⚫They usually run only before new releases


  23. Comparing the numbers
    ▌My hypotheses were…
    ⚫not correct about OSD creation
    ⚫correct about OSD restart
                                  Cybozu     The Official QA
    # of OSD creations/month      > 3,000    > 120,000
    # of OSD restarts/month       > 3,000    < 500


  24. Does the official QA use HDD?
    ▌Many machines in Sepia Lab have both HDD and NVMe SSD
    ⚫https://wiki.sepia.ceph.com/doku.php?id=hardware:infrastructure
    ▌I couldn't find whether the official QA uses HDD for OSDs or not
    ⚫Please let me know if you know anything about it


  25. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  26. Proposal (1/2)
    ▌Add stress tests that restart OSDs
    ▌These would help detect BlueFS corruption bugs that occur just after restarting an OSD
    ▌With 10 OSDs and ~3 minutes per restart, that is ~5,000 OSD restarts per day
    (see the rough calculation below)
    ⚫Ref.: ~3,000/month (Cybozu) and < 500/month (the official QA)
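
    The arithmetic behind the ~5,000/day figure, assuming the 10 OSDs are restarted back-to-back and independently of one another:

```python
# How ~5,000 restarts/day follows from 10 OSDs and ~3 minutes per restart.
osds = 10
minutes_per_restart = 3
restarts_per_day = (24 * 60 // minutes_per_restart) * osds
print(restarts_per_day)  # 4800, i.e. roughly 5,000 OSD restarts per day
```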


  27. Proposal (2/2)
    ▌If the official QA hasn't used HDDs for OSDs, using them would improve the official QA
    ▌It would also make the OSD-restart stress test more effective


  28. Table of Contents
    ▌Cybozu and our infrastructure
    ▌What kind of bugs we have revealed
    ▌Why we hit these bugs
    ▌Proposals to improve the official QA process
    ▌Conclusion


  29. Conclusion
    ▌The development of Cybozu's modern containerized Ceph clusters
    revealed many BlueFS corruption bugs
    ▌The key factors
    ⚫The frequency of creating/restarting OSDs
    ⚫OSDs on HDD
    ▌The official QA could be improved by
    ⚫Adding a stress test that restarts OSDs
    ⚫Using HDDs for OSDs (if they aren't used already)


  30. Thank you!