Interested in improving the research experience with Kubernetes, or simply running research workloads on it? The CNCF Research User Group’s purpose is to serve as a focal point for the discussion and advancement of Research Computing using “Cloud Native” technologies. Since its inception 6 months ago, the group has identified several key gaps within the ecosystem. This session is an opportunity to share with a broader audience some of the key challenges the Research User Group has identified, and to showcase project updates on key tools that the research community is developing to address these challenges. For more information, visit: https://github.com/cncf/research-user-group
Research User Group
Bob Killen - @mrbobbytables
Eduardo Arango - @carlosearango
• Increased use of containers...everywhere.
• Increasingly complex workflows.
• Adoption of data-streaming and in-flight
• Greater use of interactive Science Gateways.
• Dependence on other more persistent
Why form a user group?
Most research-oriented workloads are different
from typical enterprise workloads.
○ Job/task focused (high rate of churn)
○ Resource intensive
○ Require more verbose scheduling (MPI)
○ Multitenant environment
○ Support for large or multiple clusters
Why form a user group?
Cloud Native is about tools but also about
communities. We need a neutral place to share:
○ Success stories
○ Pain points
○ Post-mortem stories
○ Good practices
○ And, most importantly, new ideas!
“The CNCF Research User Group’s purpose is to function as a
focal point for the discussion and advancement of Research
Computing using “Cloud Native” technologies. This includes
enumerating current practices, identifying gaps, and directing
effort to improve the Research Cloud Computing ecosystem.”
● Lack of knowledge of “what’s out
● No best practices for large shared
● Base batch capabilities
● Multi-cluster/Federation job
● Multi-tenancy is problematic
Who is using Kubernetes
What type of workloads
are they running?
How have they deployed
Index of resources and
“Awesome list” of research
Best practices for running
Book on best practices
Common topics and user
Assemble best practices
into an easily consumable
Inspired by the original
initiative in OpenStack
Cloud Native Research Workloads Book
How to run research clusters
Sys Admins / Operators, Research
Software Engineers, Research
Making Research Cloud Native
How “to do” research in a
cloud native way
Researchers, Teachers, Students
● Configuration Management & Cluster Lifecycle
● Management of Clusters on top of OpenStack
● Integration with Parallel / HPC file systems and access
● Managing / Exposing GPUs
● Large Scale Clusters (1000+ nodes)
Cloud Native Workloads Book: Cluster Operations
Best practices around provisioning and managing clusters.
● What tooling works well for provisioning (Metal³, Kubespray,
● How is authn/authz handled
● Base security policy for multi-user / multi-tenant (OPA?)
OpenStack / Parallel File Systems
Large number of research sites
are running OpenStack.
● Use Ironic for bare metal
● How to integrate with Keystone?
Parallel File System / HPC
Integration with a classic POSIX
environment is extremely difficult.
● How do you translate API identity
to POSIX identity?
● Can a translation mechanism be
developed that's portable across
different workload types?
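To make the identity-translation question concrete, here is a minimal sketch of one possible approach, not something proposed in the talk. It assumes a site keeps a directory mapping API usernames to (UID, GID) pairs, for example exported from LDAP, and falls back to a hash-derived ID in a reserved range when no entry exists. The names `posix_identity`, `security_context`, `UID_BASE`, and `UID_SPAN` are hypothetical; `runAsUser`, `runAsGroup`, and `fsGroup` are the standard Kubernetes pod `securityContext` fields.

```python
import hashlib
from typing import Dict, Optional, Tuple

# Hypothetical reserved range for derived IDs; a real site would pick
# a range that does not collide with locally assigned accounts.
UID_BASE = 100_000
UID_SPAN = 1_000_000


def posix_identity(
    api_username: str,
    directory: Optional[Dict[str, Tuple[int, int]]] = None,
) -> Tuple[int, int]:
    """Return a (uid, gid) pair for an API username.

    An explicit directory entry wins. Otherwise the IDs are derived
    deterministically from a hash of the username, so the same user
    gets the same IDs on every cluster. Hash collisions are possible,
    which is one reason this is only a sketch.
    """
    if directory and api_username in directory:
        return directory[api_username]
    digest = hashlib.sha256(api_username.encode("utf-8")).digest()
    uid = UID_BASE + int.from_bytes(digest[:8], "big") % UID_SPAN
    return uid, uid


def security_context(
    api_username: str,
    directory: Optional[Dict[str, Tuple[int, int]]] = None,
) -> Dict[str, int]:
    """Build a pod-level securityContext fragment for the user."""
    uid, gid = posix_identity(api_username, directory)
    return {"runAsUser": uid, "runAsGroup": gid, "fsGroup": gid}
```

A fragment like this could be injected into each user's pod spec so that files written to a shared POSIX file system carry the expected ownership. Whether that injection happens in an admission controller, a spawner, or a workflow engine is left open here, which is the portability question the slide raises.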
GPUs / Large Clusters
Managing / Exposing GPUs
GPU usage is becoming
widespread, but is still a pain
point to manage.
● How are GPUs and their features
being exposed/consumed by
Large Scale Clusters
Large clusters require a
significant amount of tuning.
● What are the best practices when
running a cluster with 1000
nodes? 5000 nodes?
Cloud Native Workloads Book: Making Research Cloud Native
Making Research Cloud Native
● Scalable JupyterHub
● Kubeflow / ML Lifecycle
● Batch / MPI
● Workflows / Workflow Engines / GitOps
● DataOps (Data Lifecycle Management)
JupyterHub is one of the most commonly deployed research
applications on top of Kubernetes.
● User/Identity Management
● Image Management
● Integration with classic POSIX
Kubeflow / Batch - MPI
Kubeflow / ML Lifecycle
Kubeflow has become the go-to
open source solution for ML on top of
● Kubeflow has many tunables.
What are the general best
practices for deployment?
● How do you make it multi-tenant?
Batch / MPI
Kubernetes does not have direct
support for batch/MPI jobs.
● Which tool(s) should be used and
● How can it bring Kubernetes
closer to regular “batch”
Workflow Engines / Data Ops
Workflow Engines / GitOps
There is a large ecosystem of
workflow engines and tools.
● What tools are best suited for
● How can GitOps be used
effectively with research?
Managing data in a “Cloud Native”
environment is fundamentally different
from an HPC environment.
● Moving away from classic POSIX
● How should data be saved
alongside an instance of a model?
What we need from you
● Are you working at a Research Organization?
○ Fill out our site survey!
● Have some useful links or resources?
○ Add them to our list!
● Are you a subject matter expert in one of the categories?
○ Help work on the Cloud Native Research Workloads
Where to find us
● GitHub Repo: https://github.com/cncf/research-user-group
● Mailing List: https://lists.cncf.io/g/cncf-research-user-group
○ First Wednesday of the month at 11AM CET
○ Third Wednesday of the month at 5PM CET