
Local Persistent Storage Update

Background and motivation, 1.7 + 1.8 alpha implementation, 1.9 planned features, future roadmap

Michelle Au

October 10, 2017

Transcript

  1. Agenda

     • Background and Motivation
     • 1.7 and 1.8 Alpha Recap
     • 1.9 Planned Features
     • Future Features

  2. Why Local?

     Performance
     • Directly access local SSD disks
     • High performance networked storage requires supporting infrastructure and ops ($$$$)
     Hardware utilization
     • Increasing disk utilization reduces costs

  3. Use Cases

     Downsides of local storage
     • Data availability and durability tied to node and storage
     • Application downtime if node or storage fails
     • Data loss if disk fails
     • Application placement not flexible
     • Need more resource planning and reservations
     Specialized use cases only
     • Distributed filesystems and datastores
       • Fault tolerant to node failures by maintaining data replicas
       • Ex: Cassandra, MongoDB, GlusterFS, Ceph, etc.
     • Caching
     • Persistence for avoiding cold restarts

  4. Current Challenges

     Scheduler not aware of local storage node constraints
     • Manually schedule individual pods, or build a custom operator/scheduler
     Hostpath volumes have many downsides
     • Unmanaged volume lifecycle
     • Possible path collisions from multiple pods
     • Too many privileges: can specify any path on the system
     • Hostpath disabled by administrators
     • Spec not portable across clusters/environments
     • Build a custom volume manager to allocate and reserve disks

  5. Impact to Users

     High barrier to entry for performant stateful workloads
     • Need to build custom infrastructure to manage disks and schedule applications
     • Many users choose not to run stateful workloads in Kubernetes
     • Cannot use Kubernetes to manage all workloads
     Reduced application developer velocity
     • More time building supporting infrastructure
     • Less time building the application
     Cannot leverage Kubernetes features
     • Horizontal scaling
     • Advanced scheduling
     • Portability between clusters, environments, and storage providers

  6. Project Goals

     Expose local disks as Persistent Volumes
     • Enables portability across clusters, environments and storage providers
     • Consistent K8s volume management
       • Managed lifecycle
       • Managed access isolation between pods
       • Reduces user privileges
     Make the scheduler aware of local disk placement constraints
     • Lowers barrier to entry: reduces need for custom operators and schedulers
     • Consistent application management across environments and storage providers
     • Integration with Kubernetes features
       • StatefulSets, horizontal scaling, pod affinity and anti-affinity, node taints and tolerations, etc.

  7. 1.7 and 1.8 Alpha Recap

     Proposals
     • Overview and vision: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/local-storage-overview.md
     • Detailed design: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/local-storage-pv.md
     Volume plugin: “local” volume
     • Can only be used as a Persistent Volume
     • Volume must already be formatted with a filesystem
     • Volume can be a mount point or shared directory (see the PV sketch below)

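A minimal sketch of what an alpha-era local Persistent Volume and a matching claim might look like, following the recap above. The node name, paths, capacity, and storage class name are placeholders, and the alpha feature gates (e.g. PersistentLocalVolumes) are assumed to be enabled; the node affinity annotation is the alpha mechanism covered on the next slide.

```yaml
# Sketch of an alpha local PV; all names, paths, and sizes are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
  annotations:
    # Alpha annotation that pins the PV to one node (see the next slide).
    "volume.alpha.kubernetes.io/node-affinity": '{
      "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [
          { "matchExpressions": [
            { "key": "kubernetes.io/hostname",
              "operator": "In",
              "values": ["example-node"] }
          ]}
        ]
      }
    }'
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    # Must already be formatted with a filesystem; can be a mount point
    # or a shared directory on the node.
    path: /mnt/disks/ssd1
---
# A claim that binds to the PV above through the matching storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-local-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-storage
  resources:
    requests:
      storage: 100Gi
```
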
  8. 1.7 and 1.8 Alpha Recap

     PV node affinity annotation and scheduler predicate
     • Like a node selector
     • Scheduler filters nodes by evaluating the PV’s node affinity against the node’s labels
     • For local PVs, this reduces the feasible nodes to a single node
     • Assumes the PV is already bound
     External static provisioner for local volumes
     • NOT a dynamic provisioner
     • Manages the local disk lifecycle for pre-created volumes
     • Runs as a DaemonSet on every node
     • Discovers local volumes mounted under configurable directories (example configuration sketched below)
     • Automatically creates, cleans up, and destroys local Persistent Volumes

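As a rough illustration of how the static provisioner is pointed at its discovery directories, the sketch below shows the kind of ConfigMap it consumes: a storage class mapped to a host directory whose mount points are turned into local PVs. The key names and namespace here are assumptions; the exact configuration format has varied across provisioner versions.

```yaml
# Illustrative provisioner ConfigMap (key names may differ between releases):
# each storage class maps to the host directory the DaemonSet pods scan.
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:
      # Host directory whose mount points are discovered as local PVs.
      hostDir: /mnt/disks
      # Path where hostDir is mounted inside each provisioner pod.
      mountDir: /mnt/disks
```
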
  9. 1.7 and 1.8 Alpha Recap

     E2E tests running in alpha CI suite
     • Directory-based and mount-based local volumes
     • Back-to-back pod mounts, simultaneous pod mounts
     • Local volume lifecycle with static provisioner
     • Negative cases with invalid path, invalid nodes

  10. Alpha Limitations

     Persistent Volume binding happens before pod scheduling
     • Doesn’t consider pod scheduling requirements (i.e., CPU, pod affinity, anti-affinity, taints, etc.)
     • Cannot specify multiple local volumes in a single pod spec
     • Problem with zonal storage too
     Volume binding evaluates one Persistent Volume Claim at a time
     • Cannot specify multiple local volumes in a single pod spec (see the hypothetical pod sketch below)
     • Problem with zonal storage too
     External provisioner cannot correctly detect volume capacity for new volumes mounted after the provisioner has started
     • Needs mount propagation

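To make the one-claim-at-a-time limitation concrete, consider a pod that requests two local volumes; everything below is hypothetical. Because alpha binding happens before scheduling and evaluates each PVC independently, the two claims can be bound to local PVs on different nodes, leaving the pod unschedulable.

```yaml
# Hypothetical pod with two local-volume claims. Under the alpha behavior,
# local-claim-a and local-claim-b are bound separately, so they may end up
# on PVs that belong to different nodes.
apiVersion: v1
kind: Pod
metadata:
  name: two-local-volumes
spec:
  containers:
  - name: app
    image: nginx          # placeholder image
    volumeMounts:
    - name: data
      mountPath: /data
    - name: logs
      mountPath: /logs
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: local-claim-a
  - name: logs
    persistentVolumeClaim:
      claimName: local-claim-b
```
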
  11. 1.9 Planned Features

     Raw block local volumes (@dhirajh)
     • Design: https://github.com/kubernetes/community/pull/1140
     • Plugin to implement the Block interface
     • Static provisioner to discover block devices and create block PVs (sketched below)
     • Static provisioner to clean up block devices with configurable cleanup methods
     Local provisioner security hardening (@verult)
     • Design: https://github.com/kubernetes/community/pull/1105
     • Current design gives each provisioner pod PV create/delete permissions for all PVs, including those on other nodes
     • New design splits the local provisioner into a single master with PV permissions and DaemonSet workers that only report and operate on their own node’s local volumes

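For reference, a raw block local PV under the proposed Block interface would plausibly look like the sketch below. The volumeMode: Block field comes from the raw block volume design; the device path, capacity, and names are placeholders, and node affinity is omitted for brevity.

```yaml
# Sketch of a raw block local PV (per the proposed Block interface);
# names, capacity, and the device path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-block-pv
spec:
  capacity:
    storage: 100Gi
  volumeMode: Block            # expose the raw device, no filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-block
  local:
    path: /dev/sdb             # whole block device on the node
```
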
  12. 1.9 Planned Features

     Volume topology-aware scheduling (@msau42)
     • Design: https://github.com/kubernetes/community/pull/1054
     • For all PVs, not just local volumes
     • PVC binding is delayed until a pod is scheduled (see the StorageClass sketch below)
     • Decision of which PV to bind to is moved to the default scheduler
     • Scheduler initiates binding (PV prebind) and provisioning
     • Binding transaction and PV lifecycle management remain in the PV controller
     • Impact to custom schedulers, controllers, operators, etc. that use PVCs
       • Deprecation needs to be announced well ahead of time
       • Users can pull the new binding library into their implementations

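The delayed-binding behavior described in this slide is surfaced to users through the StorageClass. A sketch of what that looks like, assuming the volumeBindingMode field proposed in the design above, is shown below; the class name is a placeholder.

```yaml
# Sketch of a StorageClass that delays PVC binding until a consuming pod is
# scheduled, so the scheduler can pick a PV on a node that fits the pod.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # statically provisioned local PVs
volumeBindingMode: WaitForFirstConsumer
```
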
  13. 1.9 Planned Features

     Volume topology-aware scheduling (cont.)
     • 1.9 Risk
       • Design issue: scheduler can be bypassed
       • Possible solutions need prototyping to determine viability and scope
       • A few ugly workarounds possible
       • Need to work with the scheduler team to come up with a long term solution

  14. Future

     Topology-aware dynamic provisioning
     • Provision storage in the correct node/zone where the Pod wants to be scheduled
     • Local volume dynamic provisioning (LVM?)
     PV taints and tolerations
     • Give up (and lose!) my volume if it’s inaccessible or unhealthy
     • Local volume health monitoring
     Inline PV
     • PV with the lifetime of a Pod
     • Not enough root disk capacity for EmptyDir
     • Use a dedicated local disk for IOPS isolation
