
Best Practices of Production-Grade Rook/Ceph Cluster


My presentation slides from Cephalocon 2023
https://sched.co/1JKZf

Satoru Takeuchi

April 17, 2023


Transcript

  1. Best Practices of Production-Grade
    Rook/Ceph Cluster
    Apr. 17th, 2023
    Cybozu, Inc.
    Satoru Takeuchi
    1


  2. Agenda
    2
    ▌Cybozu and our storage system
    ▌Quick introduction to K8s and Rook
    ▌Advanced Rook/Ceph cluster
    ▌Efforts and challenges
    ▌Conclusion


  3. Agenda
    3
    ▌Cybozu and our storage system
    ▌Quick introduction to K8s and Rook
    ▌Advanced Rook/Ceph cluster
    ▌Efforts and challenges
    ▌Conclusion


  4. About Cybozu
    ▌A leading cloud service provider in Japan
    ▌Providing web services that support teamwork
    4


  5. Our infrastructure
    ▌We have an on-premise infrastructure
    ▌The current system has many problems
    ⚫Not scalable
    ⚫A lot of day-2 work
    ▌Developing a new infrastructure
    ⚫Kubernetes (K8s)
    ⚫Ceph and Rook (Ceph orchestration in K8s)
    5


  6. Why Ceph?
    ▌Fulfills our requirements
    ⚫Block & object storage
    ⚫Rack failure tolerance
    ⚫Bit-rot tolerance
    ▌Open source
    ⚫Detailed investigation of problems is possible
    ⚫We can use our custom containers in case of emergency (see later)
    6


  7. Why Rook?
    ▌Manage the storage system in the K8s way, like other
    system components
    ▌Offload a lot of work to K8s
    ⚫Lifecycle management of hardware
    ⚫MON failover
    ⚫Restarting problematic Ceph daemons
    7


  8. Our storage system
    8
    [Diagram: Rook manages two Ceph clusters: a cluster for RGW with data OSDs on HDDs and
    index OSDs on SSDs, and a cluster for RBD with OSDs on LVM LogicalVolumes carved from an
    NVMe SSD VolumeGroup. Applications consume buckets and RBD images. Both clusters keep
    3 replicas with rack failure tolerance.]


  9. Agenda
    9
    ▌Cybozu and our storage system
    ▌Quick introduction to K8s and Rook
    ▌Advanced Rook/Ceph cluster
    ▌Efforts and challenges
    ▌Conclusion


  10. What is K8s?
    ▌A container orchestration system
    ▌All services run as “pods”
    ⚫A pod is a set of containers
    10
    [Diagram: a Kubernetes cluster of nodes, each running several pods; each pod contains one or more containers]


  11. The concept of K8s
    ▌All configurations are described as resources
    ▌K8s keeps the desired state
    11
    [Diagram: a Pod resource on a node references a PersistentVolumeClaim (PVC) resource
    (“10 GiB volume, please!”), which is bound to a PersistentVolume (PV) resource
    (“Here is a 10 GiB volume!”) whose volume is provisioned by a storage provisioner (driver)]

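    To make the PVC/PV flow on the slide above concrete, here is a minimal sketch of such a
    request; the storage class name is a placeholder, not something from the slides:

    # Ask the provisioner behind "some-storage-class" (placeholder) for a 10 GiB volume.
    # K8s binds a matching PV to this claim once the provisioner has created it.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: my-volume
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: some-storage-class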

  12. The concept of Rook
    ▌All Ceph components are described as resources
    ▌Rook keeps the desired state of Ceph clusters
    12
    [Diagram: in a K8s cluster, an admin creates CephCluster and CephBlockPool resources;
    the Rook pod watches them and creates Pod resources for the MON, MGR, and OSD daemons,
    plus PVC/PV resources for the disks backing the OSDs]


  13. Agenda
    13
    ▌Cybozu and our storage system
    ▌Quick introduction to K8s and Rook
    ▌Advanced Rook/Ceph cluster
    ▌Efforts and challenges
    ▌Conclusion


  14. Our Rook/Ceph clusters
    ▌Requirements
    ⚫3 replicas
    ⚫Rack failure tolerance
    ⚫All OSDs should be spread evenly over all racks/nodes
    ▌Typical operations
    ⚫Create and upgrade clusters
    ⚫Manage OSDs (add a new one, replace a damaged one)
    ⚫etc.
    14


  15. Create a Ceph cluster
    ▌Just create the following resource
    15
    kind: CephCluster
    metadata:
      name: ceph-ssd

    mgr:
      count: 2

    mon:
      count: 3

    storage:

      count: 3
    • All OSDs are evenly spread over all nodes and racks
    • Special configurations are necessary (see later)
    [Diagram: MON, MGR, and OSD pods spread over nodes in three racks]

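    For reference, a more complete (still minimal) CephCluster sketch in the shape Rook expects;
    the namespace, image tag, device-set name, size, and storage class are assumptions rather
    than values from the slides:

    apiVersion: ceph.rook.io/v1
    kind: CephCluster
    metadata:
      name: ceph-ssd
      namespace: rook-ceph                     # assumed namespace
    spec:
      cephVersion:
        image: quay.io/ceph/ceph:v17.2.6       # example image
      dataDirHostPath: /var/lib/rook
      mon:
        count: 3
      mgr:
        count: 2
      storage:
        storageClassDeviceSets:                # OSDs on PVCs (see later slides)
        - name: set0                           # assumed device-set name
          count: 3
          volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              accessModes: ["ReadWriteOnce"]
              volumeMode: Block
              resources:
                requests:
                  storage: 1Ti                 # example size
              storageClassName: some-storage-class   # placeholder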

  16. Create an RBD pool
    ▌Just create the following resources
    16
    kind: CephBlockPool
    metadata:
      name: block-pool
    spec:
      replicated:
        size: 3
      failureDomain: zone

    kind: StorageClass
    metadata:
      name: ceph-block
    parameters:
      clusterID: ceph-ssd
      pool: block-pool
      csi.storage.k8s.io/fstype: ext4
    • “zone” means “rack” in our clusters
    [Diagram: the ceph-block StorageClass points at a replicated pool for RBD in the Ceph cluster
    (name: block-pool, 3 replicas, failure domain: rack)]

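    A fuller sketch of these two resources with fields the slide elides; the namespace, the CSI
    provisioner name, and the secret parameters follow the standard upstream Rook examples and
    are assumptions here:

    apiVersion: ceph.rook.io/v1
    kind: CephBlockPool
    metadata:
      name: block-pool
      namespace: ceph-ssd                        # assumed: namespace of the CephCluster
    spec:
      failureDomain: zone                        # “zone” is mapped to racks in these clusters
      replicated:
        size: 3
    ---
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: ceph-block
    provisioner: rook-ceph.rbd.csi.ceph.com      # assumed: “<operator namespace>.rbd.csi.ceph.com”
    parameters:
      clusterID: ceph-ssd
      pool: block-pool
      csi.storage.k8s.io/fstype: ext4
      # ...plus the csi.storage.k8s.io/*-secret-name/-namespace parameters
      # from the standard Rook CSI examples
    reclaimPolicy: Delete
    allowVolumeExpansion: true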

  17. Consume an RBD image
    ▌Just create the following resources
    17
    kind: PersistentVolumeClaim
    metadata:
      name: myclaim
    spec:

      storageClassName: ceph-block

    kind: Pod
    metadata:
      name: mypod
    volumes:
    - name: myvolume
      persistentVolumeClaim:
        claimName: myclaim
    [Diagram: the ceph-block provisioner creates an RBD volume in the pool for RBD
    (name: block-pool, 3 replicas, failure domain: rack) and mypod consumes it]

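    A complete version of this pair with the elided fields filled in for illustration; the size,
    container image, and mount path are arbitrary examples, not values from the slides:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: myclaim
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi                # arbitrary example size
      storageClassName: ceph-block
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: mypod
    spec:
      containers:
      - name: app
        image: nginx                   # arbitrary example image
        volumeMounts:
        - name: myvolume
          mountPath: /data             # arbitrary mount path
      volumes:
      - name: myvolume
        persistentVolumeClaim:
          claimName: myclaim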

  18. Configurations
    ▌ Edit the “rook-config-override” ConfigMap resource,
    which corresponds to “/etc/ceph/ceph.conf”
    ⚫ Restarting Ceph pods is necessary after that
    ▌ Some configurations (e.g. “ceph set”) can’t be applied this way
    ⚫ Run Ceph commands in the “toolbox” pod instead
    18
    kind: ConfigMap
    metadata:
      name: rook-config-override
    data:
      config: |
        debug rgw = 5/5

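    For illustration, a complete version of the override ConfigMap and an example of applying
    settings through the toolbox pod; the namespace and the toolbox deployment name
    (rook-ceph-tools, as in the upstream Rook examples) are assumptions:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rook-config-override
      namespace: rook-ceph                  # assumed namespace
    data:
      config: |
        [global]
        debug rgw = 5/5

    # Settings that cannot go through ceph.conf can be applied from the toolbox pod, for example:
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd set noout
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph config set mon mon_osd_down_out_interval 600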

  19. Expand a cluster
    ▌Just edit the CephCluster resource
    19
    kind: CephCluster

    storage:

      count: 6    # changed from 3
    [Diagram: six OSD pods now spread over nodes in three racks, alongside the MON and MGR pods]

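    In a PVC-based cluster this is a single field change inside the device set; a sketch,
    assuming the device-set name set0 used in the earlier sketch:

    spec:
      storage:
        storageClassDeviceSets:
        - name: set0          # assumed device-set name
          count: 6            # changed from 3; Rook prepares and starts the three new OSDs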

  20. Replace a damaged OSD
    ▌Just run our home-made script
    ⚫There is no official job/script for this
    20
    [Diagram: the home-made script removes the OSD pod of the damaged OSD and creates a
    replacement OSD pod on one of the nodes, while the MON, MGR, and other OSD pods keep running]

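    The slides do not show the script itself. As a rough illustration only, a manual replacement
    along the lines of the upstream Rook/Ceph documentation might look like this; the OSD id,
    namespace, and resource names are assumptions:

    # 1. Take the damaged OSD (here: osd.3) out of the cluster and wait for data to migrate away
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out osd.3

    # 2. Stop the OSD pod and purge the OSD from the Ceph cluster
    kubectl -n rook-ceph scale deployment rook-ceph-osd-3 --replicas=0
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd purge 3 --yes-i-really-mean-it

    # 3. Delete the OSD deployment and the PVC backing the broken device,
    #    then let Rook create a replacement OSD on a new PVC
    kubectl -n rook-ceph delete deployment rook-ceph-osd-3
    kubectl -n rook-ceph delete pvc <pvc-of-the-damaged-device>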

  21. Upgrade Rook and Ceph
    ▌Edit the CephCluster resource
    ▌All Ceph containers will be upgraded to the new image after that
    ▌Other work might be needed
    ⚫See the files under “Documentation/Upgrade” in the Rook repository
    21
    kind: CephCluster

    image: ceph/ceph:v17.2.6    # changed from “ceph/ceph:v17.2.5”

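    As a sketch, upgrading Ceph is a change to spec.cephVersion.image in the CephCluster, and
    upgrading Rook itself is a change to the operator image; the namespace, resource names, and
    version tags below are assumptions following the upstream examples:

    # Upgrade Ceph: point the CephCluster at the new image and let Rook restart the daemons
    kubectl -n rook-ceph patch cephcluster ceph-ssd --type merge \
      -p '{"spec": {"cephVersion": {"image": "quay.io/ceph/ceph:v17.2.6"}}}'

    # Upgrade Rook: update the operator image (example version tag)
    kubectl -n rook-ceph set image deploy/rook-ceph-operator rook-ceph-operator=rook/ceph:v1.11.2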

  22. Troubleshooting
    ▌Same as other Ceph orchestrations
    ⚫Running Ceph commands
    ⚫Referring to logs and metrics
    ⚫Reporting bugs upstream
    ▌Rook is convenient but is not a silver bullet
    ⚫Rook can’t debug and fix Ceph bugs for you
    22

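    Typical starting points, assuming the namespace and toolbox deployment from the upstream
    Rook examples:

    # Cluster health, via the toolbox pod
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
    kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail

    # Logs of the Rook operator and of an individual Ceph daemon pod
    kubectl -n rook-ceph logs deploy/rook-ceph-operator
    kubectl -n rook-ceph logs rook-ceph-osd-3-<pod-suffix>    # placeholder pod name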

  23. Advanced configurations
    ▌Even OSD deployment over all nodes/racks
    ▌Automatic OSD deployment
    23


  24. An issue with even OSD deployment
    ▌K8s deploys Pods to arbitrary nodes by default
    ⚫ OSDs might be unevenly spread
    24
    kind: CephCluster

    storage:

      count: 3
    [Diagram: two of the three OSDs land on the same node; losing that one node results in data loss]


  25. Solution
    ▌Use the “TopologySpreadConstraints” feature of K8s
    ⚫ Spread specific pods evenly over all nodes (or racks, and so on)
    25
    kind: CephCluster

    storage:

      count: 3

      topologySpreadConstraints:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - rook-ceph-osd
            - rook-ceph-osd-prepare
        topologyKey: topology.kubernetes.io/hostname
    [Diagram: one OSD per node]
    See also:
    https://blog.kintone.io/entry/2020/09/18/175030


  26. In our clusters
    ▌Two constraints, for both nodes and racks
    26
    kind: CephCluster

    storage:

      count: 12

      topologySpreadConstraints:
      - labelSelector:

        topologyKey: topology.kubernetes.io/zone
      - labelSelector:

        topologyKey: topology.kubernetes.io/hostname
    [Diagram: 12 OSD pods spread evenly over all racks and all nodes]

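    A fuller sketch of the two constraints with the fields the slides elide; the maxSkew and
    whenUnsatisfiable values are illustrative choices, and where the constraints sit inside the
    CephCluster (for example under a device set’s placement field) depends on how the cluster
    is defined:

    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone         # one rack per zone in these clusters
      whenUnsatisfiable: DoNotSchedule                  # illustrative choice
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - rook-ceph-osd
          - rook-ceph-osd-prepare
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/hostname      # node label used in these slides
      whenUnsatisfiable: ScheduleAnyway                 # illustrative choice
      labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - rook-ceph-osd
          - rook-ceph-osd-prepare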

  27. An issue with automatic OSD deployment
    ▌OSD creation flow
    1. Rook creates a PVC resource for an OSD
    2. K8s binds a PV resource to this PVC resource
    3. Rook creates an OSD on top of the block device corresponding to the PV
    ▌OSD creation is suspended if the PV resource isn’t available
    27
    kind: CephCluster

    storage:

      count: 3
    [Diagram: only two of the three OSDs have joined the Ceph cluster; the third is stuck because
    the storage provisioner has not provisioned its volume yet]


  28. Solution
    ▌Use storage providers that support dynamic provisioning
    ⚫e.g. many cloud storage drivers, TopoLVM (for local volumes)
    ▌Or provision PV resources beforehand
    ⚫e.g. local-static-provisioner
    28
    storage:

      count: 3
      volumeClaimTemplates:
      - spec:
          storageClassName: nice-provisioner
    [Diagram: the “nice provisioner” supplies the volumes, on demand or from pre-provisioned PVs,
    and all three OSDs join the Ceph cluster]


  29. In our cluster
    ▌NVMe SSD
    ⚫Use TopoLVM
    ⚫PV resources and the corresponding LVM logical volumes are
    created on OSD creation
    ▌HDD
    ⚫Use Local persistent volumes
    ⚫Create PVs for all HDDs when deploying nodes
    ⚫PV resources are already available on OSD creation
    29

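    As an illustration of how these two flavors can be expressed, a sketch of two
    storageClassDeviceSets; the set names, counts, sizes, and storage class names
    (topolvm-provisioner for TopoLVM and a local-volume class for HDDs) are assumptions:

    storage:
      storageClassDeviceSets:
      # NVMe SSD OSDs: TopoLVM carves an LVM logical volume on demand for each PVC
      - name: ssd                                  # assumed name
        count: 3
        volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            volumeMode: Block
            resources:
              requests:
                storage: 1Ti                       # example size
            storageClassName: topolvm-provisioner  # TopoLVM storage class (assumed name)
      # HDD OSDs: bind to local PVs created in advance for every HDD
      - name: hdd                                  # assumed name
        count: 9
        volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            volumeMode: Block
            resources:
              requests:
                storage: 8Ti                       # example size
            storageClassName: local-storage        # local static provisioner class (assumed name)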

  30. Agenda
    30
    ▌Cybozu and our storage system
    ▌Quick introduction to K8s and Rook
    ▌Advanced Rook/Ceph cluster
    ▌Efforts and challenges
    ▌Conclusion


  31. Daily check of upstream Rook/Ceph
    ▌Check every update of both the Rook and Ceph
    projects every day
    ⚫Watch important bugfix PRs and backport them if necessary
    ▌e.g. a data loss bug in RGW (PR #49795)
    ⚫Some objects might be lost on bucket index resharding
    ⚫We are postponing resharding operations until the upgrade to >= v17.2.6
    31


  32. Upstream-first development
    ▌We’ve shared everything with the Rook/Ceph communities
    ⚫ Reduce the long-term maintenance cost
    ⚫ Make both communities better
    ▌Major contributions
    ⚫ Implemented Rook’s advanced configurations
    ⚫ I’ve been working as a Rook maintainer
    ⚫ Resolved some problems in containerized Ceph clusters
    ⚫ https://speakerdeck.com/sat/revealing-bluestore-corruption-bugs-in-containerized-ceph-clusters
    32


  33. Running the custom containers
    ▌If there is a critical bug and we can’t wait for the next
    release, we want to use our custom containers
    ⚫ Official release + critical patches
    ▌We have been trying to run Teuthology in our test
    environment to verify custom containers
    ⚫ Succeeded in running the whole test suite, but most tests still fail
    ⚫ We’ll continue to fix this problem
    33


  34. Other remaining work
    ▌Backup/restore
    ▌Remote Replication
    ▌More automation
    34


  35. Agenda
    35
    ▌Cybozu and our storage system
    ▌Quick introduction to K8s and Rook
    ▌Advanced Rook/Ceph cluster
    ▌Efforts and challenges
    ▌Conclusion


  36. Conclusion
    ▌Rook is an attractive option for Ceph orchestration
    ⚫Especially if you are familiar with K8s
    ▌There are some advanced configurations
    ▌We’ll continue to provide feedback to the Rook/Ceph
    communities
    36


  37. Thank you!
    37
