Lightning talk - Managing cluster lifecycle with dask-ctl

Lightning Talk: Managing cluster lifecycle with dask-ctl
Jacob Tomlinson

github.com/dask-contrib/dask-ctl
A set of tools to provide a control plane for managing the lifecycle of Dask clusters.

Tight lifecycle coupling
Today the Python process contains the cluster manager, which holds the only references to the resources it created. This makes sense if the scheduler and workers are subprocesses on the same machine, but less so when they are remote, independent resources on a cloud or HPC platform.
[Diagram: a Python Process containing the Cluster Manager; Remote Cluster resources containing a Scheduler and three Workers]

Tight lifecycle coupling
What happens to the cluster resources if the Python process that created them is killed? Sometimes the OS may clean them up. Sometimes they will time out or hit a wall time. Sometimes they will exist, and cost you money, until you manually delete them.
[Diagram: Cluster resources (a Scheduler and three Workers) marked "!??"]

Forcibly restarting your notebook kernel shouldn’t leave cluster resources on
some cloud or other platform that need to be killed manually.

Discovery
The big challenge in managing cluster lifecycle is discovering existing clusters and then reconstructing the cluster managers that represent them. With dask-ctl, cluster managers can register a discovery method via the dask_cluster_discovery entry point. To support dask-ctl, a cluster manager should implement this entry point, search for clusters (by listing jobs, pods, VMs, etc. and looking for Dask cluster resources), and return an iterable of cluster names and the cluster managers that can recreate them.
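A discovery method of this kind could be sketched as an async generator. This is a minimal sketch rather than dask-ctl's own code: ExampleCluster, FAKE_PLATFORM_JOBS, and the "dask-cluster" label are hypothetical stand-ins for a real cluster manager and a real platform listing call, and the sketch assumes the discovery function yields (name, cluster manager class) pairs.

```python
import asyncio

# Hypothetical stand-in for a platform API that lists jobs/pods/VMs.
FAKE_PLATFORM_JOBS = [
    {"id": "job-1", "labels": {"dask-cluster": "analytics"}, "role": "scheduler"},
    {"id": "job-2", "labels": {"dask-cluster": "analytics"}, "role": "worker"},
    {"id": "job-3", "labels": {}, "role": "unrelated"},
    {"id": "job-4", "labels": {"dask-cluster": "etl"}, "role": "scheduler"},
]


class ExampleCluster:
    """Hypothetical cluster manager; a real one would be a Dask cluster class."""


async def discover():
    """Yield (cluster name, cluster manager class) for each Dask cluster found."""
    seen = set()
    for job in FAKE_PLATFORM_JOBS:
        name = job["labels"].get("dask-cluster")
        # Skip resources that are not Dask clusters, and report each cluster once.
        if name is None or name in seen:
            continue
        seen.add(name)
        yield name, ExampleCluster


async def collect_discovered():
    """Drain the async generator into a list so it can be inspected."""
    return [item async for item in discover()]


discovered = asyncio.run(collect_discovered())
```

In a real package the discover function would then be registered under the dask_cluster_discovery entry point group in the package metadata, so that dask-ctl can find it without importing the cluster manager up front.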

Reconstruction
Once clusters can be discovered, every cluster manager needs a way to reconstruct its representation of a cluster from the cluster's name/uuid. dask-ctl tries to call a ClusterManager.from_name(name/uuid) class method to reconstruct the cluster object, so to support dask-ctl, cluster managers also need to implement this method.
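As an illustration, a from_name class method might look up existing resources by name instead of creating new ones. This is a sketch under stated assumptions: ExampleCluster, FAKE_PLATFORM_STATE, and the attributes stored on the object are hypothetical, and a real cluster manager would query its platform's API here.

```python
# Hypothetical record of what already exists on the platform, keyed by cluster name.
FAKE_PLATFORM_STATE = {
    "analytics": {"scheduler_address": "tcp://10.0.0.5:8786", "workers": 3},
}


class ExampleCluster:
    """Hypothetical cluster manager that can be rebuilt from a name alone."""

    def __init__(self, name, scheduler_address, workers):
        self.name = name
        self.scheduler_address = scheduler_address
        self.workers = workers

    @classmethod
    def from_name(cls, name):
        """Rebuild the manager for an existing cluster without launching anything."""
        try:
            state = FAKE_PLATFORM_STATE[name]
        except KeyError:
            raise ValueError(f"No cluster named {name!r} found") from None
        return cls(name, state["scheduler_address"], state["workers"])


cluster = ExampleCluster.from_name("analytics")
```

The important property is that from_name needs nothing from the process that originally created the cluster: the name is enough to find the resources again.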

Lifecycle
Once we can discover clusters and reconstruct the cluster managers that represent them, we can perform lifecycle management operations at any time, including:
• Getting logs
• Connecting clients
• Scaling
• Deleting
[Diagram: Create, Scale, Compute, Delete]
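Putting discovery and reconstruction together, lifecycle operations from a fresh process could be sketched as follows. This is not dask-ctl's implementation: PLATFORM, ExampleCluster, discover, and get_cluster are hypothetical names used only for illustration, and discovery is kept synchronous for brevity.

```python
# Hypothetical platform state: cluster name -> number of workers.
PLATFORM = {"analytics": 3, "etl": 1}


class ExampleCluster:
    """Hypothetical cluster manager backed by the fake PLATFORM state."""

    def __init__(self, name):
        self.name = name

    @classmethod
    def from_name(cls, name):
        if name not in PLATFORM:
            raise ValueError(f"No cluster named {name!r} found")
        return cls(name)

    def scale(self, n):
        """Set the number of workers for this cluster."""
        PLATFORM[self.name] = n

    def close(self):
        """Delete the cluster's resources from the platform."""
        del PLATFORM[self.name]


def discover():
    """Yield (name, cluster manager class) for every cluster on the platform."""
    for name in list(PLATFORM):
        yield name, ExampleCluster


def get_cluster(name):
    """Find a cluster by name via discovery, then reconstruct its manager."""
    for found_name, manager_class in discover():
        if found_name == name:
            return manager_class.from_name(name)
    raise ValueError(f"No cluster named {name!r} discovered")


# A process holding no references to the clusters can still manage them.
get_cluster("analytics").scale(5)
get_cluster("etl").close()
remaining = dict(PLATFORM)
```

Because every operation starts from a name, the same steps work whether or not the process that created the cluster is still alive.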

Utilities
• Python
• CLI

Next steps
• Implement dask-ctl support in as many cluster managers as possible.
• Add support for cluster discovery to the Dask Jupyter Lab Extension so that discovered clusters are listed in the sidebar.
• Stabilize things and move from the dask-contrib org to dask.

Thank you!
@_jacobtomlinson
github.com/dask-contrib/dask-ctl
dask-ctl.readthedocs.io
