Intro to the CNCF Research User Group

CNCF Research User Group
https://github.com/cncf/research-user-group
Bob Killen - @mrbobbytables
Eduardo Arango - @carlosearango

Why
• Increased use of containers...everywhere.
• Increasingly complex workflows.
• Adoption of data-streaming and in-flight processing.
• Greater use of interactive Science Gateways.
• Dependence on other, more persistent services.

Why form a user group?
Most research-oriented workloads are different from typical Enterprise workloads.
◦ Job/task focused (high rate of churn)
◦ Resource intensive
◦ Require more verbose scheduling (MPI)
◦ Multitenant environment
◦ Support for large or multiple clusters

Why form a user group?
Cloud Native is about tools but also about communities. We need a neutral place to share:
◦ Success stories
◦ Pain points
◦ Post-mortem stories
◦ Good practices
◦ And most importantly, new ideas!

Who’s involved

TL;DR
“The CNCF Research User Group’s purpose is to function as a focal point for the discussion and advancement of Research Computing using “Cloud Native” technologies. This includes enumerating current practices, identifying gaps, and directing effort to improve the Research Cloud Computing ecosystem.”

Common Themes
• Lack of knowledge of “what’s out there”
• No best practices for large shared environments
• Base batch capabilities incomplete
• Multi-cluster/Federation job support lacking
• Multi-tenancy is problematic

Initiatives

Current Initiatives
• Research Institution Survey
◦ Who is using Kubernetes for research?
◦ What type of workloads are they running?
◦ How have they deployed them?
• Index of resources and useful links
◦ “Awesome list” of research-focused links
◦ Best practices for running research clusters
• Book on best practices
◦ Common topics and user stories
◦ Assemble best practices into an easily consumable resource
◦ Inspired by the original initiative in OpenStack

Cloud Native Research Workloads Book
• Cluster Operations
◦ How to run research clusters
◦ Sys Admins / Operators, Research Software Engineers, Research Support Staff
• Making Research Cloud Native
◦ How “to do” research in a cloud native way
◦ Researchers, Teachers, Students

Cluster Operations
• Configuration Management & Cluster Lifecycle
• Management of Clusters on top of OpenStack
• Integration with Parallel / HPC file systems and access management
• Managing / Exposing GPUs
• Large Scale Clusters (1000+ nodes)

Initiatives Cloud Native Workloads Book: Cluster Operations

Configuration Management
Best practices around provisioning and managing clusters.
• What tooling works well for provisioning (metal^3, kubespray, cluster-api, etc.)?
• How is authn/authz handled?
• Base security policy for multi-user / multi-tenant (OPA?)

OpenStack / Parallel File Systems
OpenStack
A large number of research sites are running OpenStack.
• Use Ironic for bare metal provisioning?
• How to integrate with Keystone?
Parallel File System / HPC
Integration with a classic POSIX environment is extremely difficult.
• How do you translate API identity to POSIX identity?
• Can a translation mechanism be developed that's portable across different workload types?
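One way to picture the API-to-POSIX identity question is a lookup that turns an authenticated API username into the UID/GID a pod should run as. The sketch below is only an illustration under stated assumptions, not something the group prescribes: `PosixIdentity`, `SITE_DIRECTORY`, and `to_security_context` are hypothetical names, and the static dictionary stands in for whatever directory service a site actually uses. The returned keys match the Kubernetes pod `securityContext` fields `runAsUser`, `runAsGroup`, and `supplementalGroups`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PosixIdentity:
    """Numeric identity as seen by a POSIX file system (hypothetical helper type)."""
    uid: int
    gid: int
    supplementary_gids: tuple


# Hypothetical site directory: API username -> POSIX identity.
# Assumption: a real site would query LDAP or a similar service instead.
SITE_DIRECTORY = {
    "alice@example.edu": PosixIdentity(uid=12001, gid=5000, supplementary_gids=(5000, 6100)),
}


def to_security_context(api_user: str) -> dict:
    """Build pod securityContext values for an authenticated API user."""
    ident = SITE_DIRECTORY.get(api_user)
    if ident is None:
        # Refuse unknown users rather than falling back to a default UID.
        raise KeyError(f"no POSIX identity known for {api_user!r}")
    return {
        "runAsUser": ident.uid,
        "runAsGroup": ident.gid,
        "supplementalGroups": list(ident.supplementary_gids),
    }


if __name__ == "__main__":
    print(to_security_context("alice@example.edu"))
```

A pod started with these values would touch a mounted parallel file system with the same numeric IDs the user has on the HPC side. Enforcing the mapping, and making it portable across workload types, is the open problem the slide raises and is not solved by this sketch.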

GPUs / Large Clusters
Managing / Exposing GPUs
GPU usage is becoming widespread, but is still a pain point to manage.
• How are GPUs and their features being exposed/consumed by researchers?
Large Scale Clusters
Large clusters require a significant amount of tuning.
• What are the best practices when running a cluster with 1000 nodes? 5000 nodes?
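As one concrete example of how a GPU is commonly consumed, a pod can request it as an extended resource. The sketch below builds such a manifest as a plain Python dictionary. It assumes the NVIDIA device plugin is installed so that nodes advertise the `nvidia.com/gpu` resource name; the function name, pod name, and image are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Return a Pod manifest that requests `gpus` whole GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image; the JSON output can be fed to kubectl.
    print(json.dumps(gpu_pod_manifest("gpu-test", "example.org/cuda-job:latest"), indent=2))
```

This only covers whole-GPU requests. Exposing GPU features such as sharing or partitioning is part of the open question above and depends on the device plugin in use.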

Initiatives Cloud Native Workloads Book: Making Research Cloud Native

Making Research Cloud Native
• Scalable JupyterHub
• Kubeflow / ML Lifecycle
• Batch / MPI
• Workflows / Workflow Engines / GitOps
• DataOps (Data Lifecycle Management)

Scalable JupyterHub
JupyterHub is one of the most commonly deployed research applications on top of Kubernetes.
• User/Identity Management
• Image Management
• Integration with classic POSIX environment (Kerberos)

Kubeflow / Batch - MPI
Kubeflow / ML Lifecycle
Kubeflow has become the go-to open source solution for ML on top of Kubernetes.
• Kubeflow has many tunables. What are the general best practices for deployment?
• How do you make it multi-tenant?
Batch / MPI
Kubernetes does not have direct support for batch/MPI jobs.
• Which tool(s) should be used and how?
• How can they bring Kubernetes closer to regular “batch” computing?

Workflow Engines / Data Ops
Workflow Engines / GitOps
There is a large ecosystem of workflow engines and tools.
• What tools are best suited for research use?
• How can GitOps be used effectively with research?
DataOps
Managing data in a “Cloud Native” environment is fundamentally different from an HPC environment.
• Moving away from classic POSIX to object storage.
• How should data be saved alongside an instance of a model?

What we need from you
• Are you working at a Research Organization?
◦ Fill out our site survey!
• Have some useful links or resources?
◦ Add them to our list!
• Are you a subject matter expert in one of the categories?
◦ Help work on the Cloud Native Research Workloads Book

Where to find us
• GitHub Repo: https://github.com/cncf/research-user-group
• Mailing List: https://lists.cncf.io/g/cncf-research-user-group
• Meetings:
◦ First Wednesday of the month at 11AM CET
◦ Third Wednesday of the month at 5PM CET
