
Lightning talk - Managing cluster lifecycle with dask-ctl


Jacob Tomlinson

May 20, 2021


Transcript

  1. Lightning Talk
    Managing cluster lifecycle with dask-ctl
    Jacob Tomlinson


  2. github.com/dask-contrib/dask-ctl
    A set of tools to provide a control plane for managing the
    lifecycle of Dask clusters.


  3. Tight lifecycle coupling
    Today the Python process contains the cluster manager, which contains
    the only references to the resources created.
    This makes sense if the scheduler and workers are subprocesses on the
    same machine, but less so when they are remote and independent
    resources on a cloud or HPC platform.
    [Diagram: a Python process containing the cluster manager, alongside
    remote cluster resources made up of a scheduler and three workers]


  4. Tight lifecycle coupling
    What happens to the cluster resources if the Python process that
    created them is killed?
    Sometimes the OS may clean them up.
    Sometimes they will time out or hit a wall time.
    Sometimes they will exist until you manually delete them and cost you
    money.
    [Diagram: cluster resources (a scheduler and three workers) with no
    owning process, marked "!??"]


  5. Forcibly restarting your notebook kernel shouldn’t leave cluster resources on
    some cloud or other platform that need to be killed manually.


  6. Discovery
    The big challenge in managing cluster lifecycle is discovering existing
    clusters and then reconstructing the cluster manager that represents
    each of them.
    With dask-ctl, cluster managers can register a discovery method via the
    dask_cluster_discovery entry point.
    To support dask-ctl, a cluster manager should implement this entry point,
    search for clusters (by listing jobs, pods, VMs, etc. and looking for
    Dask cluster resources), and then return an iterable of cluster names
    and the cluster managers that can recreate them.
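
    The discovery step described above can be sketched roughly as follows.
    This is a minimal illustration, not dask-ctl's actual code: ExampleCluster,
    FAKE_PLATFORM_RESOURCES and the "dask-cluster" label are hypothetical
    stand-ins for a real cluster manager and a real platform API, and a real
    discovery method may need to be asynchronous. Registering the function
    under the dask_cluster_discovery entry point happens in the package
    metadata and is not shown here.

```python
from typing import Iterator, Tuple, Type

# Hypothetical stand-in for what a platform API (jobs/pods/VMs) might return.
FAKE_PLATFORM_RESOURCES = [
    {"name": "dask-abc123-scheduler", "labels": {"dask-cluster": "abc123"}},
    {"name": "dask-abc123-worker-0", "labels": {"dask-cluster": "abc123"}},
    {"name": "unrelated-job", "labels": {}},
    {"name": "dask-def456-scheduler", "labels": {"dask-cluster": "def456"}},
]


class ExampleCluster:
    """Hypothetical cluster manager; not a real Dask class."""


def discover() -> Iterator[Tuple[str, Type[ExampleCluster]]]:
    """Yield (cluster name, cluster manager class) for each cluster found."""
    seen = set()
    for resource in FAKE_PLATFORM_RESOURCES:
        # Assumption: Dask resources carry a label naming their cluster.
        name = resource["labels"].get("dask-cluster")
        if name is not None and name not in seen:
            seen.add(name)
            yield name, ExampleCluster


discovered = list(discover())
```

    Each cluster is reported once even though it is made up of several
    resources, which is why the sketch tracks the names it has already seen.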


  7. Reconstruction
    Once clusters can be discovered, every cluster manager needs a way to
    reconstruct its representation of a cluster from the cluster's name/UUID.
    In dask-ctl we try to call a ClusterManager.from_name(name/uuid) class
    method to reconstruct the cluster object.
    To support dask-ctl, cluster managers also need to implement this method.
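
    A from_name class method might look something like the sketch below.
    Only the method name comes from the slide; ExampleCluster, its fields and
    the _PLATFORM_STATE lookup table are hypothetical and stand in for
    whatever a real cluster manager would query on its platform.

```python
from dataclasses import dataclass

# Hypothetical record of clusters that already exist on some platform.
_PLATFORM_STATE = {
    "abc123": {"scheduler_address": "tcp://10.0.0.5:8786", "workers": 3},
}


@dataclass
class ExampleCluster:
    """Hypothetical cluster manager; not a real Dask class."""

    name: str
    scheduler_address: str
    workers: int

    @classmethod
    def from_name(cls, name: str) -> "ExampleCluster":
        """Rebuild the cluster object from existing resources, by name."""
        try:
            state = _PLATFORM_STATE[name]
        except KeyError:
            raise ValueError(f"No cluster named {name!r} found") from None
        # Nothing new is created: the object only points at what exists.
        return cls(name=name, **state)


cluster = ExampleCluster.from_name("abc123")
```

    The key property is that from_name creates no new resources. It only
    builds a fresh Python object that refers to a cluster that is already
    running.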


  8. Lifecycle
    Once we can discover clusters and reconstruct the cluster managers that
    represent them, we can always perform lifecycle management operations,
    including:
    ● Getting logs
    ● Connecting clients
    ● Scaling
    ● Deleting
    [Diagram: Create → Scale → Compute → Delete]


  9. Utilities
    Python
    CLI


  10. Next steps
    ● Implement dask-ctl support in as many cluster managers as possible.
    ● Add support for cluster discovery to the Dask JupyterLab extension so
    that discovered clusters are listed in the sidebar.
    ● Stabilize things and move from the dask-contrib org to the dask org.


  11. Lightning Talk
    Thank you!
    @_jacobtomlinson
    github.com/dask-contrib/dask-ctl
    dask-ctl.readthedocs.io
