Capacity-aware Dynamic Volume Provisioning For LVM-based Local Storage

Satoru Takeuchi

December 07, 2022

Transcript

  1. Capacity-aware Dynamic Volume Provisioning For LVM-based Local Storage
    Dec. 7th, 2022
    Cybozu, Inc. Satoru Takeuchi
  2. Agenda
    ▌Motivation
    ▌What is TopoLVM
    ▌How TopoLVM works
    ▌What’s next
  3. Agenda
    ▌Motivation
    ▌What is TopoLVM
    ▌How TopoLVM works
    ▌What’s next
  4. About Cybozu
    ▌A leading cloud service provider in Japan
    ▌Providing software that supports teamwork
  5. Cybozu’s Kubernetes cluster
    ▌On-premises K8s cluster
    ▌Storage
      ⚫Distributed Block & Object Storage
        ⚫=> Rook/Ceph
      ⚫Local fast (NVMe SSD) storage
        ⚫=> ???
  6. Requirements for local storage
    ▌Users can create arbitrarily sized volumes
      ⚫Fixed-size disks/partitions are inconvenient
    ▌Volumes should be spread over nodes based on free storage capacity
      ⚫Use the storage capacity of each node evenly
  7. What was the best storage driver?
    ▌There was no CSI driver that met all our requirements
    ▌We decided to create a new CSI driver, TopoLVM
  8. Agenda
    ▌Motivation
    ▌What is TopoLVM
    ▌How TopoLVM works
    ▌What’s next
  9. Arbitrary volume size
    ▌TopoLVM deals with LVM VGs prepared on nodes
    ▌TopoLVM creates an LVM LV for each PV resource
    Diagram: Node0 and Node1 each hold a VG; the LVs in those VGs correspond to the PVs among the K8s resources
  10. Pod scheduling and volume provisioning (1/2)
    Diagram:
      Nodes: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB
      K8s resources: PVC (10GiB), Pod, StorageClass (TopoLVM)
      StorageClass setting: “volumeBindingMode: WaitForFirstConsumer” (a PV is bound to a PVC at pod scheduling)
      Dynamic volume provisioning (a PV is created at pod scheduling)
  11. Pod scheduling and volume provisioning (2/2)
    ▌The pod is scheduled to the node with as much free VG space as possible (in this case, Node1)
    ▌The volume is provisioned on the same node (Node1)
    Diagram:
      Nodes: Node0 free: 50GiB, Node1 free: 90GiB (runs the Pod and holds the PV, size: 10GiB), Node2 free: 5GiB
      K8s resources: PVC (10GiB), Pod, StorageClass (TopoLVM)
  12. Other features
    ▌ext4, XFS, Btrfs, and Raw Block Volume
    ▌Generic ephemeral volume
    ▌Volume expansion
    ▌Thin volume
      ⚫With thin snapshot and thin clone
  13. Community
    ▌There are many non-Cybozu users/developers
    ▌Some companies use TopoLVM in their products
  14. Agenda
    ▌Motivation
    ▌What is TopoLVM
    ▌How TopoLVM works
    ▌What’s next
  15. Challenges
    ▌Schedule a pod to the node with as much free VG space as possible
      ⚫=> Scheduler extender
    ▌Provision the volume on the same node
      ⚫=> CSI Topology
  16. Scheduler extender
    1. Start pod scheduling
    2. Filter out nodes that don’t match the conditions
      2.1 Scheduler extender (webhook): filter out nodes that don’t satisfy additional conditions
    3. Score the remaining nodes
      3.1 Scheduler extender (webhook): add a factor to the node scores
    4. The pod is scheduled to the node that has the highest score
  17. TopoLVM’s scheduler extender
    1. Start pod scheduling
    2. Filter out nodes that don’t match the conditions
      2.1 TopoLVM’s scheduler extender (webhook): filter out nodes that don’t have enough free VG space
    3. Score the remaining nodes
      3.1 TopoLVM’s scheduler extender (webhook): add a scoring factor that prefers nodes with large free VG space
    4. The pod is scheduled to the node that has the highest score
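
The filter and score steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not TopoLVM's code: the function names, the linear 0 to 10 scoring scale, and the node names are assumptions made for the example.

```python
GIB = 1 << 30


def filter_nodes(free_bytes, requested_bytes):
    """Step 2.1: keep only nodes whose free VG space covers the request."""
    return {node: free for node, free in free_bytes.items()
            if free >= requested_bytes}


def score_nodes(free_bytes, max_score=10):
    """Step 3.1: give nodes with more free VG space a higher score.

    The linear scale used here is a hypothetical choice for illustration.
    """
    if not free_bytes:
        return {}
    largest = max(free_bytes.values())
    if largest == 0:
        return {node: 0 for node in free_bytes}
    return {node: (max_score * free) // largest
            for node, free in free_bytes.items()}


# Free VG space per node, using the numbers from the slides.
free = {"node0": 50 * GIB, "node1": 100 * GIB, "node2": 5 * GIB}

candidates = filter_nodes(free, 10 * GIB)   # node2 cannot hold 10GiB
scores = score_nodes(candidates)
best = max(scores, key=scores.get)
print(sorted(candidates), scores, best)
```

In a real cluster these two steps would only contribute to the scheduler's decision; other filters and scores still apply.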
  18. The parameters of TopoLVM’s scheduler extender
    ▌TopoLVM’s scheduler extender requires two kinds of parameters
      ⚫Free VG space for each node (*1)
      ⚫Total requested TopoLVM volume size for each Pod
    ▌TopoLVM manages annotations for these parameters in node and pod resources
    *1 K8s’s StorageCapacityTracking feature can also be used, but only for filtering
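
As a rough sketch of the second parameter, the total request of a pod can be computed by summing the sizes of its claims that use a TopoLVM StorageClass. The `Claim` type, the StorageClass names, and the annotation key below are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass

GIB = 1 << 30

# Hypothetical name of a StorageClass backed by TopoLVM.
TOPOLVM_CLASSES = {"topolvm-provisioner"}


@dataclass
class Claim:
    name: str
    storage_class: str
    requested_bytes: int


def total_topolvm_request(claims, topolvm_classes=TOPOLVM_CLASSES):
    """Sum the requested sizes of the claims served by TopoLVM."""
    return sum(c.requested_bytes for c in claims
               if c.storage_class in topolvm_classes)


claims = [
    Claim("data", "topolvm-provisioner", 10 * GIB),
    Claim("logs", "topolvm-provisioner", 2 * GIB),
    Claim("shared", "ceph-block", 20 * GIB),  # not a TopoLVM volume, ignored
]

# Hypothetical annotation key; the value is the total request in bytes.
pod_annotation = {"example.invalid/capacity": str(total_topolvm_request(claims))}
print(pod_annotation)
```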
  19. CSI topology
    ▌A feature of Kubernetes
      ⚫https://kubernetes-csi.github.io/docs/topology.html
    ▌Schedules a pod to one of the nodes where its volumes are available
      ⚫Used for zone-local storage, node-local storage, and so on
    ▌TopoLVM creates a volume on the same node as the corresponding pod
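
The matching idea behind CSI topology can be sketched as follows: a volume carries a set of topology key/value pairs, and a pod using that volume may only run on nodes whose labels contain all of them. The label key and helper name below are hypothetical; this sketch does not reproduce the Kubernetes implementation.

```python
# Hypothetical topology label key that identifies a single node.
TOPOLOGY_KEY = "topology.example.invalid/node"


def nodes_for_volume(node_labels, accessible_topology):
    """Return the nodes whose labels contain every topology key/value pair."""
    return [node for node, labels in node_labels.items()
            if all(labels.get(key) == value
                   for key, value in accessible_topology.items())]


# Each node is labeled with its own name under the topology key.
node_labels = {name: {TOPOLOGY_KEY: name}
               for name in ("node0", "node1", "node2")}

# A node-local volume that lives on node1.
volume_topology = {TOPOLOGY_KEY: "node1"}

eligible = nodes_for_volume(node_labels, volume_topology)
print(eligible)
```

The same matching works for zone-local storage if the key identifies a zone instead of a node.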
  20. Example
    Diagram:
      Nodes: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB
      K8s resources: StorageClass (TopoLVM); Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB (Capacity values added by TopoLVM)
  21. Create both Pod and PVC resources
    Diagram:
      Nodes: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB
      K8s resources: PVC (10GiB); Pod Capacity: 10GiB (added by a TopoLVM webhook); StorageClass (TopoLVM); Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB
  22. Scheduler extender: Filtering
    Diagram:
      Nodes: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB
      K8s resources: PVC (10GiB); Pod Capacity: 10GiB; StorageClass (TopoLVM); Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB (not enough)
  23. Scheduler extender: Scoring
    Diagram:
      Nodes: Node0 free: 50GiB, Node1 free: 100GiB (the Pod is placed here), Node2 free: 5GiB
      K8s resources: PVC (10GiB); Pod Capacity: 10GiB; StorageClass (TopoLVM); Node0 Capacity: 50GiB, Node1 Capacity: 100GiB (👍 the largest value), Node2 Capacity: 5GiB
  24. Provisioning and binding the volume
    Diagram:
      Nodes: Node0 free: 50GiB, Node1 free: 90GiB (runs the Pod and holds the PV, size: 10GiB), Node2 free: 5GiB
      K8s resources: PVC (10GiB); Pod Capacity: 10GiB; StorageClass (TopoLVM); Node0 Capacity: 50GiB, Node1 Capacity: 100GiB (will be updated later), Node2 Capacity: 5GiB
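
The last step of the example can be modeled with a small volume group class: creating a 10GiB LV lowers Node1's free space from 100GiB to 90GiB, while the published capacity value stays at 100GiB until it is refreshed. The class, its methods, and the assumption that the VG is 100GiB in total with no other LVs are illustrative only; real provisioning goes through LVM on the node.

```python
from dataclasses import dataclass, field

GIB = 1 << 30


@dataclass
class VolumeGroup:
    """Toy model of an LVM volume group on one node."""
    size_bytes: int
    logical_volumes: dict = field(default_factory=dict)

    @property
    def free_bytes(self):
        return self.size_bytes - sum(self.logical_volumes.values())

    def create_lv(self, name, size_bytes):
        """Allocate one LV, which backs exactly one PV."""
        if name in self.logical_volumes:
            raise ValueError(f"LV {name!r} already exists")
        if size_bytes > self.free_bytes:
            raise ValueError("not enough free space in the volume group")
        self.logical_volumes[name] = size_bytes


# Assumption: Node1's VG is 100GiB in total and starts empty.
node1_vg = VolumeGroup(size_bytes=100 * GIB)

# Capacity value published for the node before provisioning.
node_annotation = {"capacity": node1_vg.free_bytes}

node1_vg.create_lv("lv-for-pv", 10 * GIB)

stale = node_annotation["capacity"]                # still 100GiB
node_annotation["capacity"] = node1_vg.free_bytes  # refreshed later: 90GiB
print(stale // GIB, node_annotation["capacity"] // GIB)
```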
  25. Agenda
    ▌Motivation
    ▌What is TopoLVM
    ▌How TopoLVM works
    ▌What’s next
  26. Next plans
    ▌Implement K8s-official capacity-aware pod scheduling
      ⚫Setting up a scheduler extender is a bit difficult
      ⚫We’re preparing a KEP
    ▌Donate the TopoLVM project to CNCF
  27. Conclusion
    ▌TopoLVM is an LVM-based CSI driver
    ▌Volumes and the corresponding pods are spread evenly across nodes
      ⚫By scheduler extender and CSI topology
    ▌We welcome new users and contributions
  28. That’s all, thank you!
    ▌Project page
      ⚫https://github.com/topolvm/topolvm
    ▌A blog post about TopoLVM
      ⚫https://blog.kintone.io/entry/topolvm