
Intro to the CNCF Research User Group

Bob Killen
August 18, 2020

Interested in improving the research experience with Kubernetes, or simply in running research workloads on it? The CNCF Research User Group’s purpose is to serve as a focal point for the discussion and advancement of Research Computing using “Cloud Native” technologies. In the 6 months since its inception, the group has identified several key gaps within the ecosystem. This session is an opportunity to share with a broader audience some of the key challenges the Research User Group has identified, and to showcase updates on key tools that the research community is developing to address them. For more information visit: https://github.com/cncf/research-user-group

Transcript

  1. Why
    • Increased use of containers...everywhere.
    • Increasingly complex workflows.
    • Adoption of data-streaming and in-flight processing.
    • Greater use of interactive Science Gateways.
    • Dependence on other more persistent services.
  2. Why form a user group?
    Most research-oriented workloads are different from typical Enterprise workloads.
      ◦ Job/task focused (high rate of churn)
      ◦ Resource intensive
      ◦ Require more verbose scheduling (MPI)
      ◦ Multitenant environment
      ◦ Support for large or multiple clusters
  3. Why form a user group?
    Cloud Native is about tools but also about communities; we need a neutral place to share:
      ◦ Success stories
      ◦ Pain points
      ◦ Post-mortem stories
      ◦ Good practices
      ◦ And, most importantly, new ideas!
  4. TL;DR
    “The CNCF Research User Group’s purpose is to function as a focal point for the discussion and advancement of Research Computing using “Cloud Native” technologies. This includes enumerating current practices, identifying gaps, and directing effort to improve the Research Cloud Computing ecosystem.”
  5. Common Themes
    • Lack of knowledge of “what’s out there”
    • No best practices for large shared environments
    • Base batch capabilities incomplete
    • Multi-cluster/Federation job support lacking
    • Multi-tenancy is problematic
  6. Current Initiatives
    Research Institution Survey
      • Who is using Kubernetes for research?
      • What type of workloads are they running?
      • How have they deployed them?
    Index of resources and useful links
      • “Awesome list” of research-focused links
      • Best practices for running research clusters
    Book on best practices
      • Common topics and user stories
      • Assemble best practices into an easily consumable resource
      • Inspired by the original initiative in OpenStack
  7. Cloud Native Research Workloads Book
    Cluster Operations
      • How to run research clusters
      • Sys Admins / Operators, Research Software Engineers, Research Support Staff
    Making Research Cloud Native
      • How “to do” research in a cloud native way
      • Researchers, Teachers, Students
  8. Cluster Operations
    • Configuration Management & Cluster Lifecycle
    • Management of Clusters on top of OpenStack
    • Integration with Parallel / HPC file systems and access management
    • Managing / Exposing GPUs
    • Large Scale Clusters (1000+ nodes)
  9. Configuration Management
    Best practices around provisioning and managing clusters.
    • What tooling works well for provisioning (metal^3, kubespray, cluster-api, etc.)?
    • How is authn/authz handled?
    • Base security policy for multi-user / multi-tenant (OPA?)
  10. OpenStack / Parallel File Systems
    OpenStack
    A large number of research sites are running OpenStack.
      • Use Ironic for bare metal provisioning?
      • How to integrate with Keystone?
    Parallel File System / HPC
    Integration with a classic POSIX environment is extremely difficult.
      • How do you translate API identity to POSIX identity?
      • Can a translation mechanism be developed that's portable across different workload types?
  11. GPUs / Large Clusters
    Managing / Exposing GPUs
    GPU usage is becoming widespread, but is still a pain point to manage.
      • How are GPUs and their features being exposed/consumed by researchers?
    Large Scale Clusters
    Large clusters require a significant amount of tuning.
      • What are the best practices when running a cluster with 1000 nodes? 5000 nodes?
  12. Making Research Cloud Native
    • Scalable JupyterHub
    • Kubeflow / ML Lifecycle
    • Batch / MPI
    • Workflows / Workflow Engines / GitOps
    • DataOps (Data Lifecycle Management)
  13. Scalable JupyterHub
    JupyterHub is one of the most commonly deployed Research Applications on top of Kubernetes.
    • User/Identity Management
    • Image Management
    • Integration with classic POSIX environment (Kerberos)
  14. Kubeflow / Batch - MPI
    Kubeflow / ML Lifecycle
    Kubeflow has become the go-to open source solution for ML on top of Kubernetes.
      • Kubeflow has many tunables. What are the general best practices for deployment?
      • How do you make it multi-tenant?
    Batch / MPI
    Kubernetes does not have direct support for batch/MPI jobs.
      • Which tool(s) should be used and how?
      • How can they bring Kubernetes closer to regular “batch” computing?
  15. Workflow Engines / Data Ops
    Workflow Engines / GitOps
    There is a large ecosystem of workflow engines and tools.
      • What tools are best suited for research use?
      • How can GitOps be used effectively with research?
    DataOps
    Managing data in a “Cloud Native” environment is fundamentally different from an HPC environment.
      • Moving away from classic POSIX to object storage.
      • How should data be saved alongside an instance of a model?
  16. What we need from you
    • Are you working at a Research Organization?
      ◦ Fill out our site survey!
    • Have some useful links or resources?
      ◦ Add them to our list!
    • Are you a subject matter expert in one of the categories?
      ◦ Help work on the Cloud Native Research Workloads Book
  17. Where to find us
    • GitHub Repo: https://github.com/cncf/research-user-group
    • Mailing List: https://lists.cncf.io/g/cncf-research-user-group
    • Meetings:
      ◦ First Wednesday of the month at 11AM CET
      ◦ Third Wednesday of the month at 5PM CET