Tight lifecycle coupling
Today the Python process contains the cluster manager, which holds the only references to the resources it created. This makes sense if the scheduler and workers are subprocesses on the same machine, but less so when they are remote, independent resources on a cloud or HPC platform.
[Diagram: a Python process containing the cluster manager; remote cluster resources comprising a scheduler and three workers]
Tight lifecycle coupling
What happens to the cluster resources if the Python process that created them is killed? Sometimes the OS may clean them up. Sometimes they will time out or hit a wall time. Sometimes they will exist until you manually delete them and cost you money.
[Diagram: cluster resources (scheduler and three workers) with no Python process attached]
Discovery
The big challenge in managing cluster lifecycle is discovering existing clusters and then reconstructing the cluster manager that represents them. With dask-ctl, cluster managers can register a discovery method via the dask_cluster_discovery entrypoint. To support dask-ctl, a cluster manager should implement this entrypoint, search for clusters (by listing jobs, pods, VMs, etc. and looking for Dask cluster resources), and return an iterable of cluster names paired with the cluster managers that can recreate them.
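The discovery step described above can be sketched roughly as follows. This is a minimal illustration, not dask-ctl's own code: `MyCluster`, `FAKE_PLATFORM_RESOURCES`, and the package path in the comment are hypothetical, and writing the discovery method as an async generator is an assumption about its shape. The in-memory list stands in for a real platform call that lists jobs, pods, or VMs.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Tuple, Type

# Hypothetical stand-in for a platform API response (jobs/pods/VMs).
FAKE_PLATFORM_RESOURCES: List[Dict[str, str]] = [
    {"id": "vm-1", "label": "dask-scheduler", "cluster": "demo-a"},
    {"id": "vm-2", "label": "dask-worker", "cluster": "demo-a"},
    {"id": "vm-3", "label": "web-server", "cluster": ""},
    {"id": "vm-4", "label": "dask-scheduler", "cluster": "demo-b"},
]


class MyCluster:
    """Hypothetical cluster manager that could recreate a discovered cluster."""


async def discover() -> AsyncIterator[Tuple[str, Type[MyCluster]]]:
    """Yield (cluster name, cluster manager class) for each Dask cluster found.

    Each cluster is identified by its scheduler resource, so workers and
    unrelated resources are skipped.
    """
    for resource in FAKE_PLATFORM_RESOURCES:
        if resource["label"] == "dask-scheduler":
            yield resource["cluster"], MyCluster


# Assumed registration in the package metadata (setup.cfg style, with
# hypothetical package and module names):
#
# [options.entry_points]
# dask_cluster_discovery =
#     mycluster = mypackage.discovery:discover


async def list_found_clusters() -> List[Tuple[str, Type[MyCluster]]]:
    # Collect everything the discovery method yields.
    return [(name, manager) async for name, manager in discover()]


found = asyncio.run(list_found_clusters())
```

Yielding the manager class alongside the name lets a caller reconstruct each cluster later without knowing in advance which cluster manager created it.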
Reconstruction
Once clusters can be discovered, every cluster manager needs a way to reconstruct its representation of a cluster from the cluster's name/UUID. In dask-ctl we try to call a ClusterManager.from_name(name/uuid) class method to reconstruct the cluster object. To support dask-ctl, cluster managers also need to implement this method.
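A `from_name` class method of the kind described above might look like the sketch below. Everything other than the `from_name` name is an assumption made for illustration: `MyCluster`, its attributes, and the `_PLATFORM` dictionary (which stands in for looking up existing resources on a cloud or HPC platform) are hypothetical.

```python
from typing import Dict, Union

# Hypothetical record of resources that already exist on the platform.
_PLATFORM: Dict[str, Dict[str, Union[str, int]]] = {
    "demo-a": {"scheduler_address": "tcp://10.0.0.1:8786", "workers": 3},
}


class MyCluster:
    """Hypothetical cluster manager that can be rebuilt from a name."""

    def __init__(self, name: str, scheduler_address: str, workers: int) -> None:
        self.name = name
        self.scheduler_address = scheduler_address
        self.workers = workers

    @classmethod
    def from_name(cls, name: str) -> "MyCluster":
        """Rebuild the manager for an existing cluster instead of creating one."""
        try:
            info = _PLATFORM[name]
        except KeyError:
            raise ValueError(f"No cluster named {name!r} was found") from None
        return cls(
            name=name,
            scheduler_address=str(info["scheduler_address"]),
            workers=int(info["workers"]),
        )


cluster = MyCluster.from_name("demo-a")
```

The key design point is that `from_name` attaches to resources that already exist rather than launching new ones, so it is safe to call from a fresh Python process.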
Lifecycle
Once we can discover clusters and reconstruct the cluster managers that represent them, we can always perform lifecycle management operations, including:
● Getting logs
● Connecting clients
● Scaling
● Deleting
[Diagram: Create, Scale, Compute, Delete]
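To make the operations listed above concrete, here is a small self-contained sketch of managing a cluster from a process that did not create it. `MyCluster` and its in-memory `_records` are hypothetical, and the method names `get_logs`, `scale`, and `close` are assumptions chosen for illustration rather than a statement of any particular cluster manager's API.

```python
from typing import Dict, List


class MyCluster:
    """Hypothetical cluster manager backed by an in-memory record."""

    # Stand-in for cluster resources that outlive the creating process.
    _records: Dict[str, Dict[str, int]] = {"demo-a": {"workers": 2}}

    def __init__(self, name: str) -> None:
        self.name = name

    @classmethod
    def from_name(cls, name: str) -> "MyCluster":
        if name not in cls._records:
            raise ValueError(f"No cluster named {name!r} was found")
        return cls(name)

    @property
    def workers(self) -> int:
        return self._records[self.name]["workers"]

    def get_logs(self) -> List[str]:
        # A real manager would fetch logs from the scheduler and workers.
        return [f"{self.name}: scheduler running with {self.workers} workers"]

    def scale(self, n: int) -> None:
        # A real manager would add or remove worker resources here.
        self._records[self.name]["workers"] = n

    def close(self) -> None:
        # A real manager would delete the remote resources here.
        del self._records[self.name]


# Reconstruct the manager, then walk through the lifecycle operations.
cluster = MyCluster.from_name("demo-a")
logs_before = cluster.get_logs()
cluster.scale(5)
workers_after_scale = cluster.workers
cluster.close()
still_exists = "demo-a" in MyCluster._records
```

With a real cluster manager, connecting a client would typically mean passing the reconstructed object to `distributed.Client(cluster)`; that step is left out here because the sketch has no running scheduler.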
Next steps
● Implement dask-ctl support on as many cluster managers as possible.
● Add support for cluster discovery to the Dask JupyterLab Extension so that discovered clusters are listed in the sidebar.
● Stabilize things and move from the dask-contrib org to dask.