Intro to the CNCF Research User Group

Bob Killen
August 18, 2020

Interested in improving the research experience with Kubernetes, or simply running research workloads on it? The CNCF Research User Group’s purpose is to serve as a focal point for the discussion and advancement of Research Computing using “Cloud Native” technologies. Since the group’s inception six months ago, it has identified several key gaps within the ecosystem. This session is an opportunity to share with a broader audience some of the key challenges the Research User Group has identified, and to showcase project updates on key tools that the research community is developing to address these challenges. For more information, visit: https://github.com/cncf/research-user-group

Transcript

  1. CNCF
    Research User Group
    https://github.com/cncf/research-user-group
    Bob Killen - @mrbobbytables
    Eduardo Arango - @carlosearango

  2. Why?

  3. Why
    • Increased use of containers...everywhere.
    • Increasingly complex workflows.
    • Adoption of data-streaming and in-flight
    processing.
    • Greater use of interactive Science Gateways.
    • Dependence on other more persistent
    services.

  4. Why form a user group?
    Most research-oriented workloads are different
    from typical enterprise workloads:
    ○ Job/task focused (high rate of churn)
    ○ Resource intensive
    ○ Require more verbose scheduling (MPI)
    ○ Multitenant environment
    ○ Support for large or multiple clusters

  5. Why form a user group?
    Cloud Native is about tools but also about
    communities. We need a neutral place to share:
    ○ Success stories
    ○ Pain points
    ○ Post-mortem stories
    ○ Good practices
    ○ And most importantly, new ideas!

  6. Who’s involved

  7. TL;DR
    “The CNCF Research User Group’s purpose is to function as a
    focal point for the discussion and advancement of Research
    Computing using “Cloud Native” technologies. This includes
    enumerating current practices, identifying gaps, and directing
    effort to improve the Research Cloud Computing ecosystem.”

  8. Common Themes
    ● Lack of knowledge of “what’s out
    there”
    ● No best practices for large shared
    environments
    ● Base batch capabilities
    incomplete
    ● Multi-cluster/Federation job
    support lacking
    ● Multi-tenancy is problematic

  9. Initiatives

  10. Current Initiatives
    ● Research Institution Survey
    ○ Who is using Kubernetes for research?
    ○ What type of workloads are they running?
    ○ How have they deployed them?
    ● Index of resources and useful links
    ○ “Awesome list” of research-focused links
    ○ Best practices for running research clusters
    ● Book on best practices
    ○ Common topics and user stories
    ○ Assemble best practices into an easily consumable resource
    ○ Inspired by the original initiative in OpenStack

  11. Cloud Native Research Workloads Book
    ● Cluster Operations
    ○ How to run research clusters
    ○ Audience: Sys Admins / Operators, Research Software Engineers,
    Research Support Staff
    ● Making Research Cloud Native
    ○ How “to do” research in a cloud native way
    ○ Audience: Researchers, Teachers, Students

  12. Cluster Operations
    ● Configuration Management & Cluster Lifecycle
    ● Management of Clusters on top of OpenStack
    ● Integration with Parallel / HPC file systems and access
    management
    ● Managing / Exposing GPUs
    ● Large Scale Clusters (1000+ nodes)

  13. Initiatives
    Cloud Native Workloads Book: Cluster Operations

  14. Configuration Management
    Best practices around provisioning and managing clusters.
    ● What tooling works well for provisioning (Metal³, kubespray,
    cluster-api, etc.)?
    ● How is authn/authz handled?
    ● Base security policy for multi-user / multi-tenant (OPA?)

  15. OpenStack / Parallel File Systems
    OpenStack
    A large number of research sites
    are running OpenStack.
    ● Use Ironic for bare metal
    provisioning?
    ● How to integrate with Keystone?
    Parallel File System / HPC
    Integration with a classic POSIX
    environment is extremely difficult.
    ● How do you translate API identity
    to POSIX identity?
    ● Can a translation mechanism be
    developed that is portable across
    different workload types?

  16. GPUs / Large Clusters
    Managing / Exposing GPUs
    GPU usage is becoming
    widespread, but is still a pain
    point to manage.
    ● How are GPUs and their features
    being exposed/consumed by
    researchers?
    Large Scale Clusters
    Large clusters require a
    significant amount of tuning.
    ● What are the best practices when
    running a cluster with 1000
    nodes? 5000 nodes?

  17. Initiatives
    Cloud Native Workloads Book: Making Research Cloud Native

  18. Making Research Cloud Native
    ● Scalable JupyterHub
    ● Kubeflow / ML Lifecycle
    ● Batch / MPI
    ● Workflows / Workflow Engines / GitOps
    ● DataOps (Data Lifecycle Management)

  19. Scalable JupyterHub
    JupyterHub is one of the most commonly deployed research
    applications on top of Kubernetes.
    ● User/Identity Management
    ● Image Management
    ● Integration with classic POSIX
    environment (Kerberos)

  20. Kubeflow / Batch - MPI
    Kubeflow / ML Lifecycle
    Kubeflow has become the go-to
    open source solution for ML on top of
    Kubernetes.
    ● Kubeflow has many tunables.
    What are the general best
    practices for deployment?
    ● How do you make it multi-tenant?
    Batch / MPI
    Kubernetes does not have direct
    support for batch/MPI jobs.
    ● Which tool(s) should be used and
    how?
    ● How can they bring Kubernetes
    closer to regular “batch”
    computing?

  21. Workflow Engines / Data Ops
    Workflow Engines / GitOps
    There is a large ecosystem of
    workflow engines and tools.
    ● What tools are best suited for
    research use?
    ● How can GitOps be used
    effectively with research?
    DataOps
    Managing data in a “Cloud Native”
    environment is fundamentally different
    from an HPC environment.
    ● Moving away from classic POSIX
    to object storage.
    ● How should data be saved
    alongside an instance of a model?

  22. What we need from you
    ● Are you working at a Research Organization?
    ○ Fill out our site survey!
    ● Have some useful links or resources?
    ○ Add them to our list!
    ● Are you a subject matter expert in one of the categories?
    ○ Help work on the Cloud Native Research Workloads
    Book

  23. Where to find us
    ● GitHub Repo: https://github.com/cncf/research-user-group
    ● Mailing List: https://lists.cncf.io/g/cncf-research-user-group
    ● Meetings:
    ○ First Wednesday of the month at 11AM CET
    ○ Third Wednesday of the month at 5PM CET
