Slide 1

Slide 1 text

CNCF Research User Group
https://github.com/cncf/research-user-group
Bob Killen - @mrbobbytables
Eduardo Arango - @carlosearango

Slide 2

Slide 2 text

Why?

Slide 3

Slide 3 text

Why
• Increased use of containers...everywhere.
• Increasingly complex workflows.
• Adoption of data-streaming and in-flight processing.
• Greater use of interactive Science Gateways.
• Dependence on other more persistent services.

Slide 4

Slide 4 text

Why form a user group?
Most research-oriented workloads are different from typical Enterprise workloads.
○ Job/task focused (high rate of churn)
○ Resource intensive
○ Require more verbose scheduling (MPI)
○ Multitenant environment
○ Support for large or multiple clusters

Slide 5

Slide 5 text

Why form a user group?
Cloud Native is about tools but also about communities. We need a neutral place to share:
○ Success stories
○ Pain points
○ Post-mortem stories
○ Good practices
○ And most importantly, new ideas!

Slide 6

Slide 6 text

Who’s involved

Slide 7

Slide 7 text

TL;DR
“The CNCF Research User Group’s purpose is to function as a focal point for the discussion and advancement of Research Computing using ‘Cloud Native’ technologies. This includes enumerating current practices, identifying gaps, and directing effort to improve the Research Cloud Computing ecosystem.”

Slide 8

Slide 8 text

Common Themes
● Lack of knowledge of “what’s out there”
● No best practices for large shared environments
● Base batch capabilities incomplete
● Multi-cluster/Federation job support lacking
● Multi-tenancy is problematic

Slide 9

Slide 9 text

Initiatives

Slide 10

Slide 10 text

Current Initiatives
Research Institution Survey
● Who is using Kubernetes for research?
● What type of workloads are they running?
● How have they deployed them?
Index of resources and useful links
● “Awesome list” of research-focused links
Best practices for running research clusters
● Book on best practices
● Common topics and user stories
● Assemble best practices into an easily consumable resource
● Inspired by the original initiative in OpenStack

Slide 11

Slide 11 text

Cloud Native Research Workloads Book
Cluster Operations
● How to run research clusters
● Sys Admins / Operators, Research Software Engineers, Research Support Staff
Making Research Cloud Native
● How “to do” research in a cloud native way
● Researchers, Teachers, Students

Slide 12

Slide 12 text

Cluster Operations
● Configuration Management & Cluster Lifecycle
● Management of Clusters on top of OpenStack
● Integration with Parallel / HPC file systems and access management
● Managing / Exposing GPUs
● Large Scale Clusters (1000+ nodes)

Slide 13

Slide 13 text

Initiatives Cloud Native Workloads Book: Cluster Operations

Slide 14

Slide 14 text

Configuration Management
Best practices around provisioning and managing clusters.
● What tooling works well for provisioning (Metal³, Kubespray, Cluster API, etc.)?
● How is authn/authz handled?
● Base security policy for multi-user / multi-tenant (OPA?)

Slide 15

Slide 15 text

OpenStack / Parallel File Systems
OpenStack
A large number of research sites are running OpenStack.
● Use Ironic for bare metal provisioning?
● How to integrate with Keystone?
Parallel File System / HPC
Integration with a classic POSIX environment is extremely difficult.
● How do you translate API identity to POSIX identity?
● Can a translation mechanism be developed that's portable across different workload types?
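
One way to picture the identity question on this slide is a lookup that turns an API username into the numeric IDs a POSIX file system checks, then emits them as a pod security context. The sketch below is only an illustration under stated assumptions: the `PosixIdentity` type, the `DIRECTORY` table, and the `security_context_for` helper are hypothetical, and a real site would likely query LDAP or a similar directory instead of a dictionary. The `runAsUser`, `runAsGroup`, `supplementalGroups`, and `runAsNonRoot` keys are fields of the Kubernetes pod security context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PosixIdentity:
    """Numeric IDs a POSIX file system uses for permission checks."""
    uid: int
    gid: int
    groups: tuple = ()


# Hypothetical stand-in for a site directory service (LDAP, NIS, ...).
DIRECTORY = {
    "alice@example.org": PosixIdentity(uid=10001, gid=5000, groups=(5000, 6001)),
}


def security_context_for(api_user: str) -> dict:
    """Translate an API identity into a pod-level securityContext dict."""
    try:
        ident = DIRECTORY[api_user]
    except KeyError:
        # Refuse to run rather than fall back to a default (possibly root) UID.
        raise LookupError(f"no POSIX mapping for {api_user!r}") from None
    return {
        "runAsUser": ident.uid,
        "runAsGroup": ident.gid,
        "supplementalGroups": list(ident.groups),
        "runAsNonRoot": True,
    }
```

Failing closed on an unknown user is a design choice in this sketch: files on a shared parallel file system are protected only by these numeric IDs, so an unmapped identity should not receive a default.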

Slide 16

Slide 16 text

GPUs / Large Clusters
Managing / Exposing GPUs
GPU usage is becoming widespread, but is still a pain point to manage.
● How are GPUs and their features being exposed/consumed by researchers?
Large Scale Clusters
Large clusters require a significant amount of tuning.
● What are the best practices when running a cluster with 1000 nodes? 5000 nodes?
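
As a concrete reference point for how a researcher consumes a GPU, the sketch below builds a pod manifest that requests GPUs through an extended resource limit. It assumes a cluster where the NVIDIA device plugin is installed and advertises the `nvidia.com/gpu` resource; other vendors use different resource names. The `gpu_pod_manifest` helper and the image name in the usage line are hypothetical.

```python
def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Return a Pod manifest (as a dict) that requests whole GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Extended resources are requested under limits,
                    # in whole units.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical usage: one training pod asking for two GPUs.
manifest = gpu_pod_manifest("train", "registry.example.org/train:latest", gpus=2)
```

The manifest only expresses a count of devices. Questions raised on this slide, such as exposing specific GPU features to researchers, are not covered by it.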

Slide 17

Slide 17 text

Initiatives Cloud Native Workloads Book: Making Research Cloud Native

Slide 18

Slide 18 text

Making Research Cloud Native
● Scalable JupyterHub
● Kubeflow / ML Lifecycle
● Batch / MPI
● Workflows / Workflow Engines / GitOps
● DataOps (Data Lifecycle Management)

Slide 19

Slide 19 text

Scalable JupyterHub
JupyterHub is one of the most commonly deployed research applications on top of Kubernetes.
● User/Identity Management
● Image Management
● Integration with classic POSIX environment (Kerberos)

Slide 20

Slide 20 text

Kubeflow / Batch - MPI
Kubeflow / ML Lifecycle
Kubeflow has become the go-to for an open source solution to ML on top of Kubernetes.
● Kubeflow has many tunables. What are the general best practices for deployment?
● How do you make it multi-tenant?
Batch / MPI
Kubernetes does not have direct support for batch/MPI jobs.
● Which tool(s) should be used and how?
● How can it bring Kubernetes closer to regular “batch” computing?
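
For context on the Batch / MPI point, the nearest built-in primitive is the `batch/v1` Job, which runs pods to completion. The sketch below builds such a manifest for a set of independent tasks. It is an illustration only: the `batch_job_manifest` helper, its parameter names, and the retry limit of 3 are assumptions, not a recommended tool. Note what the manifest does not say: nothing in it requires all pods to start together, which is what tightly coupled MPI ranks need.

```python
def batch_job_manifest(name, image, command, tasks, parallelism):
    """Return a batch/v1 Job manifest (as a dict) for independent tasks."""
    if tasks < 1 or parallelism < 1:
        raise ValueError("tasks and parallelism must be at least 1")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Total number of successful pod completions wanted.
            "completions": tasks,
            # Upper bound on pods running at once, never more than the tasks.
            "parallelism": min(parallelism, tasks),
            # Assumed retry budget before the Job is marked failed.
            "backoffLimit": 3,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "task", "image": image, "command": list(command)}
                    ],
                }
            },
        },
    }
```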

Slide 21

Slide 21 text

Workflow Engines / Data Ops
Workflow Engines / GitOps
There is a large ecosystem of workflow engines and tools.
● What tools are best suited for research use?
● How can GitOps be used effectively with research?
DataOps
Managing data in a “Cloud Native” environment is fundamentally different from an HPC environment.
● Moving away from classic POSIX to object.
● How should data be saved alongside an instance of a model?

Slide 22

Slide 22 text

What we need from you
● Are you working at a Research Organization?
  ○ Fill out our site survey!
● Have some useful links or resources?
  ○ Add them to our list!
● Are you a subject matter expert in one of the categories?
  ○ Help work on the Cloud Native Research Workloads Book

Slide 23

Slide 23 text

Where to find us
● GitHub Repo: https://github.com/cncf/research-user-group
● Mailing List: https://lists.cncf.io/g/cncf-research-user-group
● Meetings:
  ○ First Wednesday of the month at 11AM CET
  ○ Third Wednesday of the month at 5PM CET
