Service (AIaaS): Zero to Kubeflow: Supporting Your Data Science Teams with the Most Common Use Cases
https://developer.nvidia.com/gtc/2020/video/s22040-vid
[Figure: diagram of CPU and GPU resources; labels: (Excel, Internal Page), (Platform)]
Components Scheduler
Ref: GTC Digital October 2020: GPU-Accelerated Labs on Demand Using Virtual Computer Server
https://www.nvidia.com/en-us/on-demand/session/gtcfall20-a21371/
• Agentless (nothing to install on target nodes)
• Easier to maintain & scale than custom scripts
• Playbooks use YAML: easy to learn and read
Ref: https://www.ansible.com/
schedule jobs for GPUs on a node (Excel spreadsheet)
• Advanced container and workflow orchestration (e.g., with Kubeflow)
• Covers data permissions and security (LDAP, file permissions)
• Adds analytics and monitoring (also important for justifying the purchase)
• Advanced scheduling features (batch / gang scheduling)
• Also works without containers (no runtime overhead)
• Multi-node jobs
• Job dependencies, workflows, DAGs
• Advanced reservations
• Intelligent scheduling (not just FIFO)
• User accounting
• Other HPC-like scheduling functionality
Schedulers Comparison
Ref: GTC Silicon Valley-2019: Building and managing scalable AI infrastructure with NVIDIA DGX POD and DGX Pod Management software
https://developer.nvidia.com/gtc/2019/video/s9334
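Three of the scheduling features listed above (job dependencies as a DAG, all-or-nothing "gang" allocation of GPUs, and ordering that is not strictly FIFO) can be illustrated with a toy sketch. This is not how any real scheduler is implemented. The names `Job` and `schedule`, the single shared GPU pool, and the assumption that every job runs for exactly one round are hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Job:
    name: str
    gpus: int  # GPUs the job needs at the same time (all or nothing)
    after: list[str] = field(default_factory=list)  # jobs that must finish first


def schedule(jobs: list[Job], total_gpus: int) -> list[list[str]]:
    """Return rounds of job names; jobs in one round run concurrently.

    Toy model (assumption): every job takes one round, so all GPUs
    are free again at the start of the next round.
    """
    by_name = {job.name: job for job in jobs}
    submitted = {job.name: i for i, job in enumerate(jobs)}
    for job in jobs:
        if job.gpus > total_gpus:
            raise ValueError(f"{job.name} needs more GPUs than exist")

    # The dependency graph maps each job to its predecessors.
    sorter = TopologicalSorter({job.name: job.after for job in jobs})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    waiting: list[str] = []
    rounds: list[list[str]] = []
    while sorter.is_active():
        # Jobs whose dependencies have all finished become eligible.
        waiting.extend(sorter.get_ready())
        waiting.sort(key=submitted.__getitem__)  # submission order

        free = total_gpus
        started = []
        for name in list(waiting):
            # Gang allocation: start only if the whole request fits.
            # A job that does not fit is skipped, not blocking later,
            # smaller jobs (simple backfill instead of strict FIFO).
            if by_name[name].gpus <= free:
                free -= by_name[name].gpus
                started.append(name)
                waiting.remove(name)

        rounds.append(started)
        sorter.done(*started)  # may unlock dependent jobs
    return rounds


if __name__ == "__main__":
    jobs = [
        Job("etl", gpus=6),
        Job("train", gpus=4),
        Job("notebook", gpus=2),
        Job("eval", gpus=1, after=["train"]),
    ]
    for i, names in enumerate(schedule(jobs, total_gpus=8), start=1):
        print(f"round {i}: {names}")
    # round 1: ['etl', 'notebook']
    # round 2: ['train']
    # round 3: ['eval']
```

In this example `notebook` starts ahead of `train` even though it was submitted later, because `train` cannot get all four GPUs in the first round. A strict FIFO queue would have left two GPUs idle instead.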