Slide 1

Slide 1 text

Capacity-aware Dynamic Volume Provisioning for LVM-based Local Storage
Dec. 7th, 2022
Cybozu, Inc.
Satoru Takeuchi

Slide 2

Slide 2 text

Agenda
▌Motivation
▌What is TopoLVM
▌How TopoLVM works
▌What’s next

Slide 3

Slide 3 text

Agenda
▌Motivation
▌What is TopoLVM
▌How TopoLVM works
▌What’s next

Slide 4

Slide 4 text

About Cybozu
▌A leading cloud service provider in Japan
▌Providing software that supports teamwork

Slide 5

Slide 5 text

Cybozu’s Kubernetes cluster
▌On-premises K8s cluster
▌Storage
⚫Distributed block and object storage => Rook/Ceph
⚫Fast local (NVMe SSD) storage => ???

Slide 6

Slide 6 text

Requirements for local storage
▌Users can create arbitrarily sized volumes
⚫Fixed-size disks/partitions are inconvenient
▌Volumes should be spread over nodes based on free storage capacity
⚫Use the storage capacity of each node evenly

Slide 7

Slide 7 text

What was the best storage driver?
▌There was no CSI driver that met all our requirements
▌We decided to create a new CSI driver, TopoLVM

Slide 8

Slide 8 text

Agenda
▌Motivation
▌What is TopoLVM
▌How TopoLVM works
▌What’s next

Slide 9

Slide 9 text

Arbitrary volume size
▌TopoLVM works with LVM volume groups (VGs) prepared on nodes
▌TopoLVM creates an LVM logical volume (LV) for each PV resource
[Diagram: Node0 and Node1 each have a VG; the VGs hold LVs, and each LV corresponds to a PV in the K8s resources]

Slide 10

Slide 10 text

Pod scheduling and volume provisioning (1/2)
[Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: Pod, PVC 10GiB, StorageClass (TopoLVM)]
StorageClass: “volumeBindingMode: WaitForFirstConsumer” (a PV is bound to a PVC at pod scheduling)
Dynamic volume provisioning (a PV will be created at pod scheduling)

Slide 11

Slide 11 text

Pod scheduling and volume provisioning (2/2)
▌The pod is scheduled to a node with as much free VG space as possible (in this case, Node1)
▌The volume is provisioned on the same node (Node1)
[Diagram: Node0 free: 50GiB, Node1 free: 90GiB, Node2 free: 5GiB. The Pod and a PV of size 10GiB are on Node1. K8s resources: Pod, PVC 10GiB, StorageClass (TopoLVM)]

Slide 12

Slide 12 text

Other features
▌ext4, XFS, Btrfs, and raw block volumes
▌Generic ephemeral volumes
▌Volume expansion
▌Thin volumes
⚫With thin snapshots and thin clones

Slide 13

Slide 13 text

Community
▌There are many non-Cybozu users and developers
▌Some companies use TopoLVM in their products

Slide 14

Slide 14 text

Agenda
▌Motivation
▌What is TopoLVM
▌How TopoLVM works
▌What’s next

Slide 15

Slide 15 text

Challenges
▌Schedule a pod to a node with as much free VG space as possible
⚫=> Scheduler extender
▌Provision the volume on the same node
⚫=> CSI topology

Slide 16

Slide 16 text

Scheduler extender
1. Start pod scheduling
2. Filter out nodes that don’t match the conditions
2.1 Scheduler extender (webhook): filter out nodes that don’t meet additional conditions
3. Score the remaining nodes
3.1 Scheduler extender (webhook): add a factor to the node scores
4. The pod is scheduled to the node that has the highest score

Slide 17

Slide 17 text

TopoLVM’s scheduler extender
1. Start pod scheduling
2. Filter out nodes that don’t match the conditions
2.1 TopoLVM’s scheduler extender (webhook): filter out nodes that don’t have enough free VG space
3. Score the remaining nodes
3.1 TopoLVM’s scheduler extender (webhook): add a scoring factor that prefers nodes with more free VG space
4. The pod is scheduled to the node that has the highest score
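The filtering (2.1) and scoring (3.1) steps above can be sketched as two small functions. This is a minimal illustration, not TopoLVM's implementation: the function names, the linear scaling, and the 0 to 10 score range are assumptions made for the example.

```python
GIB = 1024 ** 3

def filter_nodes(free_bytes, requested_bytes):
    """Step 2.1: keep only nodes whose free VG space can hold the request."""
    return {node: free for node, free in free_bytes.items() if free >= requested_bytes}

def score_nodes(free_bytes, max_score=10):
    """Step 3.1: score nodes so that more free VG space gives a higher score.

    Linear scaling relative to the largest free space is an assumption
    for this sketch; the real scoring formula may differ.
    """
    largest = max(free_bytes.values(), default=0)
    if largest == 0:
        return {node: 0 for node in free_bytes}
    return {node: free * max_score // largest for node, free in free_bytes.items()}

# Free VG space per node, using the values from the example slides.
free = {"node0": 50 * GIB, "node1": 100 * GIB, "node2": 5 * GIB}

candidates = filter_nodes(free, 10 * GIB)   # node2 cannot hold 10GiB
scores = score_nodes(candidates)
best = max(scores, key=scores.get)
print(best, scores)
```

With a 10GiB request, node2 is filtered out and node1 receives the highest score, which matches the walkthrough in the following slides.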

Slide 18

Slide 18 text

The parameters of TopoLVM’s scheduler extender
▌TopoLVM’s scheduler extender requires two kinds of parameters
⚫Free VG space for each node (*1)
⚫Total requested TopoLVM volume size for each pod
▌TopoLVM manages annotations for these parameters in node and pod resources
*1 K8s’s StorageCapacityTracking feature can also be used, but only for filtering
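The second parameter, the total requested TopoLVM volume size for a pod, can be thought of as a sum over the pod's PVCs that use a TopoLVM StorageClass. The sketch below assumes a simplified in-memory representation; the dictionary layout and the StorageClass names (`topolvm-provisioner`, `ceph-block`) are hypothetical, and it does not show how TopoLVM actually stores the value on the pod.

```python
GIB = 1024 ** 3

def total_requested_bytes(pod_claims, pvcs, topolvm_classes):
    """Sum the requested sizes of a pod's PVCs that use a TopoLVM StorageClass."""
    return sum(
        pvcs[name]["request_bytes"]
        for name in pod_claims
        if pvcs[name]["storage_class"] in topolvm_classes
    )

# Hypothetical PVCs; only the first two are backed by TopoLVM.
pvcs = {
    "data": {"storage_class": "topolvm-provisioner", "request_bytes": 10 * GIB},
    "logs": {"storage_class": "topolvm-provisioner", "request_bytes": 2 * GIB},
    "shared": {"storage_class": "ceph-block", "request_bytes": 50 * GIB},
}

total = total_requested_bytes(["data", "logs", "shared"], pvcs, {"topolvm-provisioner"})
print(total // GIB, "GiB")
```

Only the TopoLVM-backed claims count toward the total, because only those consume VG space on the node.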

Slide 19

Slide 19 text

CSI topology
▌A feature of Kubernetes
⚫https://kubernetes-csi.github.io/docs/topology.html
▌Schedules a pod to one of the nodes where its volumes are available
⚫Used for zone-local storage, node-local storage, and so on
▌TopoLVM creates a volume on the same node as the corresponding pod
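The idea behind CSI topology can be illustrated with a small matching function: a volume advertises the topology segments where it is accessible, and a pod may only run on nodes whose labels match one of them. This is a conceptual sketch; the topology key `example.com/node` and the function name are hypothetical, not the key TopoLVM uses.

```python
TOPOLOGY_KEY = "example.com/node"  # hypothetical topology key for this sketch

def nodes_for_volume(accessible_topologies, node_labels):
    """Return the nodes on which a volume is accessible.

    accessible_topologies is a list of segment maps; a node qualifies if
    its labels match every key/value pair of at least one segment map.
    An empty list means the volume has no topology constraint.
    """
    if not accessible_topologies:
        return list(node_labels)
    return [
        node
        for node, labels in node_labels.items()
        if any(
            all(labels.get(key) == value for key, value in segments.items())
            for segments in accessible_topologies
        )
    ]

node_labels = {
    "node0": {TOPOLOGY_KEY: "node0"},
    "node1": {TOPOLOGY_KEY: "node1"},
    "node2": {TOPOLOGY_KEY: "node2"},
}

# A node-local volume that was provisioned on node1.
usable = nodes_for_volume([{TOPOLOGY_KEY: "node1"}], node_labels)
print(usable)
```

For node-local storage each node forms its own topology segment, so a provisioned volume pins its pod to exactly one node.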

Slide 20

Slide 20 text

Example
[Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: StorageClass (TopoLVM) and Node resources with Capacity values added by TopoLVM (Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB)]

Slide 21

Slide 21 text

Create both Pod and PVC resources
[Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: PVC 10GiB, Pod with Capacity: 10GiB (added by a TopoLVM webhook), StorageClass (TopoLVM), Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB]

Slide 22

Slide 22 text

Scheduler extender: Filtering
[Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: PVC 10GiB, Pod with Capacity: 10GiB, StorageClass (TopoLVM), Node0 Capacity: 50GiB, Node1 Capacity: 100GiB, Node2 Capacity: 5GiB (not enough)]

Slide 23

Slide 23 text

Scheduler extender: Scoring
[Diagram: Node0 free: 50GiB, Node1 free: 100GiB, Node2 free: 5GiB. K8s resources: PVC 10GiB, Pod with Capacity: 10GiB, StorageClass (TopoLVM), Node0 Capacity: 50GiB, Node1 Capacity: 100GiB (the largest value), Node2 Capacity: 5GiB. The Pod is placed on Node1]

Slide 24

Slide 24 text

Provisioning and binding the volume
[Diagram: Node0 free: 50GiB, Node1 free: 90GiB, Node2 free: 5GiB. The Pod and a PV of size 10GiB are on Node1. K8s resources: PVC 10GiB, Pod with Capacity: 10GiB, StorageClass (TopoLVM), Node0 Capacity: 50GiB, Node1 Capacity: 100GiB (will be updated later), Node2 Capacity: 5GiB]
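The note that Node1's Capacity "will be updated later" means the value published on the Node resource can briefly lag behind the real free VG space. The sketch below models that lag with a hypothetical `Node` class; the class, its attributes, and the explicit refresh step are illustrative assumptions, not TopoLVM's internals.

```python
GIB = 1024 ** 3

class Node:
    """Toy model of a node's VG and the capacity value published for it."""

    def __init__(self, name, free_bytes):
        self.name = name
        self.free_bytes = free_bytes           # actual free VG space on the node
        self.capacity_annotation = free_bytes  # value visible on the Node resource

    def provision(self, size_bytes):
        """Create a volume: free space shrinks at once, the annotation does not."""
        if size_bytes > self.free_bytes:
            raise ValueError(f"{self.name}: not enough free VG space")
        self.free_bytes -= size_bytes

    def refresh_annotation(self):
        """Publish the current free space (assumed to happen some time later)."""
        self.capacity_annotation = self.free_bytes

node1 = Node("node1", 100 * GIB)
node1.provision(10 * GIB)

stale = node1.capacity_annotation   # still the value from before provisioning
node1.refresh_annotation()
fresh = node1.capacity_annotation   # now reflects the provisioned volume
print(stale // GIB, fresh // GIB)
```

Until the refresh happens, a scheduling decision based on the annotation sees slightly more free space than the node really has.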

Slide 25

Slide 25 text

Agenda
▌Motivation
▌What is TopoLVM
▌How TopoLVM works
▌What’s next

Slide 26

Slide 26 text

Next plans
▌Implement K8s-official capacity-aware pod scheduling
⚫Setting up a scheduler extender is a bit difficult
⚫We’re preparing a KEP
▌Donate the TopoLVM project to CNCF

Slide 27

Slide 27 text

Conclusion
▌TopoLVM is an LVM-based CSI driver
▌Volumes and the corresponding pods are spread evenly across nodes
⚫By the scheduler extender and CSI topology
▌We welcome new users and contributions

Slide 28

Slide 28 text

That’s all, thank you!
▌Project page
⚫https://github.com/topolvm/topolvm
▌A blog post about TopoLVM
⚫https://blog.kintone.io/entry/topolvm