Capacity-aware Dynamic Volume Provisioning For LVM-based Local Storage

Satoru Takeuchi

December 07, 2022

Transcript

  1. Requirements for local storage
    ▌Users can create arbitrarily sized volumes
      ⚫Fixed-size disks/partitions are inconvenient
    ▌Volumes should be spread over nodes based on free storage capacity
      ⚫Use the storage capacity of each node evenly
  2. What was the best storage driver?
    ▌There was no CSI driver that met all our requirements
    ▌We decided to create a new CSI driver, TopoLVM
  3. Arbitrary volume size
    ▌TopoLVM works with LVM volume groups (VGs) prepared on the nodes
    ▌TopoLVM creates an LVM logical volume (LV) for each PV resource
    [Diagram: Node0 and Node1 each have a VG containing LVs; each LV corresponds to a PV in the K8s resources.]
  4. Pod scheduling and volume provisioning (1/2)
    [Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: a Pod and a 10GiB PVC that uses the TopoLVM StorageClass.]
    ▌StorageClass (TopoLVM): “volumeBindingMode: WaitForFirstConsumer” (a PV is bound to a PVC at pod scheduling)
    ▌Dynamic volume provisioning (a PV will be created at pod scheduling)
  5. Pod scheduling and volume provisioning (2/2)
    ▌The pod is scheduled to a node with as much free VG space as possible (in this case, Node1)
    ▌The volume is provisioned on the same node (Node1)
    [Diagram: Node0 free: 50GiB, Node1 free: 90GiB, Node2 free: 5GiB. The Pod and a PV of size 10GiB are placed on Node1; the 10GiB PVC uses the TopoLVM StorageClass.]
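The StorageClass and PVC from the two slides above could be written roughly as follows. This is a minimal sketch that builds the manifests as Python dictionaries and prints them as JSON, which `kubectl apply -f` also accepts. Only `volumeBindingMode: WaitForFirstConsumer` and the 10GiB request come from the slides; the provisioner string and the resource names are assumptions and should be checked against the installed TopoLVM version.

```python
import json

# Assumption: the provisioner name depends on the TopoLVM release in use.
# "topolvm.io" is illustrative only.
PROVISIONER = "topolvm.io"


def storage_class(name):
    """Build a StorageClass that delays PV binding until a pod is scheduled."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": PROVISIONER,
        # From the slides: the PV is bound to the PVC at pod scheduling time.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def pvc(name, storage_class_name, size):
    """Build a PVC that requests `size` from the given StorageClass."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    sc = storage_class("topolvm-example")        # hypothetical name
    claim = pvc("data", sc["metadata"]["name"], "10Gi")  # hypothetical name
    print(json.dumps([sc, claim], indent=2))
```

With immediate binding, the volume could be created before the scheduler has chosen a node, which is why the delayed binding mode matters for node-local storage.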
  6. Other features
    ▌ext4, XFS, Btrfs, and Raw Block Volume
    ▌Generic ephemeral volume
    ▌Volume expansion
    ▌Thin volume
      ⚫With thin snapshot and thin clone
  7. Challenges
    ▌Schedule a pod to a node with as much free VG space as possible
      ⚫=> Scheduler extender
    ▌Provision the volume on the same node
      ⚫=> CSI Topology
  8. Scheduler extender
    1. Start pod scheduling
    2. Filter out nodes that don’t match conditions
      2.1 Scheduler extender (webhook): filter out nodes that don’t meet additional conditions
    3. Score the remaining nodes
      3.1 Scheduler extender (webhook): add a factor to the node scores
    4. The pod is scheduled to the node that has the highest score
  9. TopoLVM’s scheduler extender
    1. Start pod scheduling
    2. Filter out nodes that don’t match conditions
      2.1 TopoLVM’s scheduler extender (webhook): filter out nodes that don’t have enough free VG space
    3. Score the remaining nodes
      3.1 TopoLVM’s scheduler extender (webhook): add a scoring factor that prefers nodes with large free VG space
    4. The pod is scheduled to the node that has the highest score
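The filter and score steps that the extender contributes can be sketched as two small functions. This is an illustration under stated assumptions, not TopoLVM's implementation: the slides do not give the scoring formula, so the proportional 0 to 10 score below is an assumption, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

GIB = 1 << 30


def filter_nodes(free_bytes: Dict[str, int], requested: int) -> Tuple[List[str], Dict[str, str]]:
    """Step 2.1: drop nodes whose free VG space cannot hold the request."""
    passed = [name for name, free in free_bytes.items() if free >= requested]
    failed = {
        name: "not enough free VG space"
        for name, free in free_bytes.items()
        if free < requested
    }
    return passed, failed


def score_nodes(free_bytes: Dict[str, int], candidates: List[str], max_score: int = 10) -> Dict[str, int]:
    """Step 3.1: score candidates so that more free VG space scores higher.

    Assumption: a score proportional to the largest free space among the
    candidates. The real extender may use a different formula.
    """
    largest = max((free_bytes[name] for name in candidates), default=0)
    if largest == 0:
        return {name: 0 for name in candidates}
    return {name: free_bytes[name] * max_score // largest for name in candidates}


if __name__ == "__main__":
    free = {"Node0": 50 * GIB, "Node1": 100 * GIB, "Node2": 5 * GIB}
    passed, failed = filter_nodes(free, 10 * GIB)
    print(passed, failed)
    print(score_nodes(free, passed))
```

The scheduler combines an extender's scores with its own, so the extender only adds a preference for nodes with more free space rather than deciding placement alone.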
  10. The parameters of TopoLVM’s scheduler extender
    ▌TopoLVM’s scheduler extender requires two kinds of parameters
      ⚫Free VG space for each node (*1)
      ⚫Total requested TopoLVM volume size for each Pod
    ▌TopoLVM manages annotations for these parameters in node and pod resources
    *1 K8s’s StorageCapacityTracking feature can also be used, but only for filtering
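Reading the two parameters back from annotations might look like the sketch below. The slides do not name the annotation keys, so both keys here are hypothetical placeholders, as is the choice to store sizes as byte counts in decimal strings.

```python
from typing import Optional

# Hypothetical annotation keys: the slides do not give the real ones.
NODE_FREE_KEY = "example.com/free-vg-bytes"
POD_REQUEST_KEY = "example.com/requested-bytes"


def _annotations(resource: dict) -> dict:
    """Return the annotations of a K8s resource dict, or {} if absent."""
    metadata = resource.get("metadata") or {}
    return metadata.get("annotations") or {}


def node_free_bytes(node: dict) -> Optional[int]:
    """Free VG space recorded on the node, or None if it is not annotated."""
    value = _annotations(node).get(NODE_FREE_KEY)
    return int(value) if value is not None else None


def pod_requested_bytes(pod: dict) -> int:
    """Total requested volume size recorded on the pod (0 if not annotated)."""
    value = _annotations(pod).get(POD_REQUEST_KEY)
    return int(value) if value is not None else 0


if __name__ == "__main__":
    node = {"metadata": {"name": "Node1",
                         "annotations": {NODE_FREE_KEY: str(100 * 2**30)}}}
    pod = {"metadata": {"name": "app",
                        "annotations": {POD_REQUEST_KEY: str(10 * 2**30)}}}
    print(node_free_bytes(node), pod_requested_bytes(pod))
```

Keeping both values on the resources themselves means the extender can decide from the scheduler's request alone, without querying the nodes at scheduling time.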
  11. CSI topology
    ▌A feature of Kubernetes
      ⚫https://kubernetes-csi.github.io/docs/topology.html
    ▌Schedule a pod to one of the nodes where its volumes are available
      ⚫Used for zone-local storage, node-local storage, and so on
    ▌TopoLVM creates a volume on the same node as the corresponding pod
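The matching rule behind CSI topology can be sketched as follows: a volume carries a list of topology segments (key/value maps), and a node can use the volume if its labels satisfy at least one segment. This is a simplified model of that rule; the topology key `example.com/node` is a hypothetical stand-in for whatever key the driver registers.

```python
from typing import Dict, List

# Hypothetical topology key for node-local storage.
TOPOLOGY_KEY = "example.com/node"


def node_can_access(node_labels: Dict[str, str],
                    accessible_topology: List[Dict[str, str]]) -> bool:
    """Return True if the node satisfies at least one topology segment.

    An empty list means the volume carries no topology constraint.
    """
    if not accessible_topology:
        return True
    return any(
        all(node_labels.get(key) == value for key, value in segment.items())
        for segment in accessible_topology
    )


if __name__ == "__main__":
    # A node-local volume that lives on Node1 only.
    volume_topology = [{TOPOLOGY_KEY: "Node1"}]
    for name in ("Node0", "Node1", "Node2"):
        labels = {TOPOLOGY_KEY: name}
        print(name, node_can_access(labels, volume_topology))
```

For node-local storage each segment names exactly one node, so once a volume exists its pod can only run where that volume was created.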
  12. Example
    [Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: StorageClass (TopoLVM) and node resources with Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB. The Capacity values are added by TopoLVM.]
  13. Create both Pod and PVC resources
    [Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: a 10GiB PVC and a Pod with Capacity: 10GiB, StorageClass (TopoLVM), and node resources with Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB. The Pod’s Capacity value is added by a TopoLVM webhook.]
  14. Scheduler extender: Filtering
    [Diagram: same state as the previous slide (Pod Capacity: 10GiB; Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB). Node2 is marked “Not enough”.]
  15. Scheduler extender: Scoring
    [Diagram: same state as the previous slide (Pod Capacity: 10GiB; Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB). Node1 is marked “👍 The largest value” and the Pod is placed on it.]
  16. Provisioning and binding the volume
    [Diagram: Node0 free: 50GiB, Node1 free: 90GiB, Node2 free: 5GiB. The Pod and a PV of size 10GiB are on Node1. K8s resources: a 10GiB PVC, the Pod with Capacity: 10GiB, StorageClass (TopoLVM), and node resources with Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB. Node1’s Capacity is marked “Will be updated later”.]
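The example slides can be replayed with the numbers they show. The sketch below is a simplified model, not TopoLVM code: it treats "largest free space wins" as the whole decision, whereas the real scheduler also weighs its own scores, and the function name `place` is hypothetical.

```python
from typing import Dict, Optional

GIB = 1 << 30


def place(free_bytes: Dict[str, int], requested: int) -> Optional[str]:
    """Pick a node for a volume and deduct the space it consumes.

    Filtering: keep nodes with enough free VG space.
    Scoring: prefer the node with the most free VG space.
    Provisioning: reduce the chosen node's free space by the request.
    Returns the chosen node name, or None if no node fits.
    """
    candidates = {n: f for n, f in free_bytes.items() if f >= requested}
    if not candidates:
        return None
    chosen = max(candidates, key=candidates.get)
    free_bytes[chosen] -= requested
    return chosen


# State from the example slides.
free = {"Node0": 50 * GIB, "Node1": 100 * GIB, "Node2": 5 * GIB}
chosen = place(free, 10 * GIB)
print(chosen, free[chosen] // GIB)  # prints "Node1 90"
```

Node2 is filtered out because 5GiB is less than the 10GiB request, Node1 wins on free space, and its free space drops from 100GiB to 90GiB, matching the final diagram.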
  17. Next plans
    ▌Implement K8s-official capacity-aware pod scheduling
      ⚫Setting up a scheduler extender is a bit difficult
      ⚫We’re preparing a KEP
    ▌Donate the TopoLVM project to CNCF
  18. Conclusion
    ▌TopoLVM is an LVM-based CSI driver
    ▌Volumes and the corresponding pods are spread evenly across nodes
      ⚫By scheduler extender and CSI topology
    ▌We welcome new users and contributions