Kubernetes - The Next Research Platform

Bob Killen
March 13, 2019

Kubernetes has become the de facto standard platform for container orchestration. Its extensibility and many integrations have paved the way for a wide variety of data science and research tooling to be built on top of it.

From all-encompassing tools like Kubeflow, which make it easy for researchers to build end-to-end machine learning pipelines, to the orchestration of specific analytics engines such as Spark, Kubernetes has made deploying and managing these systems easy. This presentation showcases some of the larger research tools in the ecosystem and explains how Kubernetes has enabled this easy form of application management.

Transcript

  1. Kubernetes
    The Next Research Platform

  2. $ whoami
    Bob Killen
    [email protected]
    Senior Research Cloud Administrator
    CNCF Ambassador
    GitHub: @mrbobbytables
    Twitter: @mrbobbytables

  3. Kubernetes TL;DR Edition
    ● Greek for “Pilot” or “Helmsman of a ship”.
    ● Container orchestration system originally developed at Google.
    ● Built with lessons learned from Borg and Omega.
    ● Designed from the ground-up as a loosely coupled collection of components
    centered around deploying, maintaining and scaling workloads.
    ● Supports both on-prem and cloud provider deployments.

  4. Kubernetes TL;DR Edition
    ● Declarative system.
    ● Steers cluster towards desired
    state.
    ● EVERYTHING is an API Object.
    ● Objects are generally described in
    YAML.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-example
    spec:
      backoffLimit: 4
      completions: 4
      parallelism: 2
      template:
        spec:
          containers:
          - name: hello
            image: alpine:latest
            command: ["/bin/sh", "-c"]
            args: ["echo hello from $HOSTNAME!"]
          restartPolicy: Never
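
    As a rough illustration of "steering towards desired state", the Python
    sketch below models how the completions and parallelism fields of the Job
    above interact. It is a simplified model, not the real Job controller:
    simulate_job is a hypothetical name, every pod is assumed to succeed, and
    pods are assumed to finish in lockstep rounds.

```python
def simulate_job(completions: int, parallelism: int) -> list:
    """Return the number of pods launched in each round.

    Simplified model (an assumption, not the real controller): every pod
    succeeds, and a round finishes before the next one starts.
    """
    succeeded = 0
    rounds = []
    while succeeded < completions:
        # Desired state: `completions` successes. Launch only what is still
        # missing, and never more than `parallelism` pods at once.
        batch = min(parallelism, completions - succeeded)
        rounds.append(batch)
        succeeded += batch
    return rounds


if __name__ == "__main__":
    # Values taken from the Job manifest above.
    print(simulate_job(completions=4, parallelism=2))  # [2, 2]
```

    The user states only the target (four successful runs, two at a time);
    the loop works out what to launch next from the gap between desired and
    observed state.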

  5. Why?

  6. Research needs
    are changing.

  7. Why?
    ● Increased use of containers...everywhere.
    ● Moving away from strict “job” style workflows.
    ● Adoption of data-streaming and in-flight processing.
    ● Greater use of interactive Science Gateways.
    ● Dependence on other more persistent services.

  8. Why Kubernetes?
    ● Kubernetes is seeing significant adoption across
    Enterprises and multiple fields of research; serving as
    both a scientific platform and substrate for application
    management.
    ● Very large, active development community.
    ● Extremely easy to extend, augment, and integrate with
    other systems.

  9. Why Kubernetes?
    Use the SAME API
    across bare metal
    and EVERY cloud
    provider.

  10. Challenges
    ● Difficult to integrate with classic multi-user POSIX
    infrastructure.
    ○ Translating API-level identity to POSIX identity.
    ● Installation on-prem/bare-metal is not as well supported.
    ● Device support and integration is a pain point.
    ○ GPUs well supported, other devices -- not as much.

  11. Challenges with Regard to HPC
    ● Difficult to integrate with classic multi-user POSIX
    infrastructure.
    ○ Translating API-level identity to POSIX identity.
    ● No “native” concept of job queue or wall time.
    ○ Up to higher level components to extend and add that functionality.
    ● Scheduler generally not as expressive as common HPC
    workload managers such as Slurm or Torque.

  12. Challenges
    Very steep learning curve
    coming from a traditional
    infrastructure background.

  13. Ecosystem

  14. Helm
    https://helm.sh

  15. Helm
    ● “Package manager” for Kubernetes.
    ○ Users only have to configure a few variables for their site, without
    needing to know most of the details of the application.
    ● Many commonly used applications packaged and
    distributed as “Helm Charts”.
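
    The "configure a few variables" idea can be sketched as a merge of
    site-specific values over a chart's defaults. The Python below is only an
    illustration of that idea, not Helm's implementation: merge_values,
    CHART_DEFAULTS, SITE_VALUES, and every key in them are hypothetical.

```python
import copy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively merge user overrides into chart defaults; overrides win."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested maps are merged key by key rather than replaced whole.
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults, standing in for the values shipped with a chart.
CHART_DEFAULTS = {
    "replicaCount": 1,
    "image": {"repository": "example/app", "tag": "1.0"},
    "service": {"type": "ClusterIP", "port": 80},
}

# Hypothetical site settings: only the keys that differ from the defaults.
SITE_VALUES = {"replicaCount": 3, "service": {"type": "LoadBalancer"}}

if __name__ == "__main__":
    print(merge_values(CHART_DEFAULTS, SITE_VALUES))
```

    The site file stays short because everything it does not mention falls
    back to the chart author's defaults.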

  16. List of Charts
    ● Aerospike
    ● Airflow
    ● Argo
    ● CockroachDB
    ● Dask
    ● Flink
    ● Hadoop
    ● Galaxy
    ● Hazelcast
    ● Ignite
    ● Jenkins
    ● JanusGraph
    ● JupyterHub
    ● Kafka
    ● KubeDB
    ● Luigi
    ● MariaDB
    ● Metabase
    ● MongoDB
    ● Moodle
    ● NATS
    ● Pachyderm
    ● Postgres
    ● Presto
    ● Pulsar
    ● RECAST
    ● RabbitMQ
    ● Spark
    ● TensorFlow
    ● Terracotta
    ● Zookeeper

  17. Controllers &
    Custom Resources
    https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

  18. Controllers & Custom Resources
    ● Custom Resource Definition (CRD).
    ● Extends current Kubernetes resources.
    ● Create your own Kubernetes API object that can be
    consumed in the SAME WAY with the SAME TOOLS as
    every other Kubernetes object.
    ● Add custom behaviors to workload management.

  19. Example: CRD
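    # Note: apiextensions.k8s.io/v1beta1 was removed in Kubernetes 1.22;
    # newer clusters use apiextensions.k8s.io/v1.
    apiVersion: apiextensions.k8s.io/v1beta1
    kind: CustomResourceDefinition
    metadata:
      # Must take the form <plural>.<group>.
      name: foos.bar.example.com
    spec:
      group: bar.example.com
      version: v1alpha1
      scope: Namespaced
      names:
        plural: foos
        singular: foo
        kind: Foo
      validation:
        openAPIV3Schema:
          properties:
            spec:
              properties:
                varFoo:
                  type: string
    ---
    # An instance of the new type; apiVersion is <group>/<version>.
    apiVersion: bar.example.com/v1alpha1
    kind: Foo
    metadata:
      name: myfoo
    spec:
      varFoo: bar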
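
    To make the role of the validation schema concrete, the Python sketch
    below checks an object against a cut-down version of the schema above. It
    is a toy validator written for this example, not the API server's
    validation logic; FOO_SCHEMA and validate are hypothetical names, and only
    the "string" type and nested "properties" are handled.

```python
# Cut-down stand-in for the openAPIV3Schema in the CRD above.
FOO_SCHEMA = {
    "properties": {
        "spec": {"properties": {"varFoo": {"type": "string"}}},
    },
}


def validate(obj, schema: dict, path: str = "") -> list:
    """Return a list of error strings; an empty list means the object passes.

    Toy validator: understands only `type: string` and nested `properties`.
    """
    if schema.get("type") == "string" and not isinstance(obj, str):
        return [f"{path or '.'}: expected string"]
    errors = []
    for name, sub_schema in schema.get("properties", {}).items():
        # Fields that are absent are simply skipped in this sketch.
        if isinstance(obj, dict) and name in obj:
            errors.extend(validate(obj[name], sub_schema, f"{path}.{name}"))
    return errors


if __name__ == "__main__":
    print(validate({"spec": {"varFoo": "bar"}}, FOO_SCHEMA))
    print(validate({"spec": {"varFoo": 42}}, FOO_SCHEMA))
```

    The first object matches the Foo instance above and passes; the second is
    rejected because varFoo is not a string.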

  20. Example: Kube-batch
    ● Controller that adds coscheduling (gang scheduling) in the form of a
    PodGroup object and additional scheduler.
    ● Developed by Huawei & IBM.
    ● Job Queues on the Road Map.
    https://github.com/kubernetes-sigs/kube-batch
    apiVersion: scheduling.incubator.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      # Object names must be lowercase.
      name: mpi-group
    spec:
      minMember: 6
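
    The all-or-nothing behaviour behind minMember can be sketched in a few
    lines of Python. This is a conceptual model only, not kube-batch's
    scheduling algorithm; gang_schedule and its slot-counting view of cluster
    capacity are assumptions made for the example.

```python
def gang_schedule(pending_pods: int, min_member: int, free_slots: int) -> int:
    """Return how many pods of the group to start right now.

    All-or-nothing: nothing starts unless at least `min_member` pods can be
    placed at the same time.
    """
    placeable = min(pending_pods, free_slots)
    return placeable if placeable >= min_member else 0


if __name__ == "__main__":
    # A six-rank MPI job on a cluster with room for only four pods waits.
    print(gang_schedule(pending_pods=6, min_member=6, free_slots=4))  # 0
    # Once six slots are free, the whole group starts together.
    print(gang_schedule(pending_pods=6, min_member=6, free_slots=6))  # 6
```

    Without this rule, a scheduler could start four of the six pods, which
    would then hold resources while waiting for peers that cannot be placed.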

  21. Example: Argo
    ● Powerful suite of workflow tools.
    ● Workflow engine supports both DAG and
    Pipeline based workflows.
    ● Built-in Event system.
    ● Integrated and used by many other organizations and
    projects.
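
    For readers unfamiliar with DAG-based workflows, the Python sketch below
    shows how a dependency graph determines which steps can run in parallel.
    It uses the standard-library graphlib module (Python 3.9+) and is not
    Argo's engine; execution_waves and the DIAMOND workflow are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies: dict) -> list:
    """Group steps into waves; every step in a wave can run in parallel.

    `dependencies` maps each step name to the steps it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Steps whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical diamond-shaped workflow: B and C both need A, D needs both.
DIAMOND = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

if __name__ == "__main__":
    print(execution_waves(DIAMOND))  # [['A'], ['B', 'C'], ['D']]
```

    A pipeline is the special case in which every wave holds exactly one
    step.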

  22. Native vs CRD
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: hello-world
    spec:
      completions: 1
      template:
        spec:
          containers:
          - name: hello
            image: alpine:latest
            command: ["/bin/sh", "-c"]
            args: ["echo Hello World"]
          restartPolicy: Never
    ---
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-
    spec:
      entrypoint: hello
      arguments:
        parameters:
        - name: message
          value: Hello World
      templates:
      - name: hello
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:latest
          command: ["/bin/sh", "-c"]
          args: ["echo {{inputs.parameters.message}}"]
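
    The Workflow above ends up running the same command as the native Job
    once its parameter placeholder is filled in. The Python sketch below
    mimics that substitution step; it is an illustration, not Argo's template
    engine, and render and PLACEHOLDER are hypothetical names.

```python
import re

# Matches placeholders of the form {{inputs.parameters.<name>}}.
PLACEHOLDER = re.compile(r"\{\{\s*inputs\.parameters\.([\w-]+)\s*\}\}")


def render(text: str, parameters: dict) -> str:
    """Replace each placeholder with the matching parameter value."""
    return PLACEHOLDER.sub(lambda match: str(parameters[match.group(1)]), text)


if __name__ == "__main__":
    # Parameter name and value taken from the Workflow manifest above.
    print(render("echo {{inputs.parameters.message}}", {"message": "Hello World"}))
```

    With the parameter from the manifest, the rendered argument is
    "echo Hello World", matching the args of the native Job.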

  23. Operators
    https://www.operatorhub.io/what-is-an-operator

  24. Operators

  25. Operator Pattern
    ● Uses Controllers & CRDs to manage complex applications.
    ● Introduced by CoreOS in 2016.
    ● Automatically handle full application lifecycle: Install,
    Configuration, Upgrade, Backup, Failover and Scaling.
    ● Multiple frameworks available supporting a wide range of
    languages and components.
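
    At its core, an operator compares the spec of a custom resource with
    what is actually running and decides what to do about the difference. The
    Python sketch below models only that decision step under simplifying
    assumptions; plan_actions, the action names, and the version/replicas
    fields are hypothetical and do not come from any operator framework.

```python
from typing import Optional


def plan_actions(desired: dict, observed: Optional[dict]) -> list:
    """List the lifecycle actions needed to move `observed` to `desired`.

    `observed` is None when the application has not been installed yet.
    """
    if observed is None:
        return ["install"]
    actions = []
    if observed["version"] != desired["version"]:
        actions.append("upgrade")
    if observed["replicas"] != desired["replicas"]:
        actions.append("scale")
    return actions


if __name__ == "__main__":
    spec = {"version": "2.4.0", "replicas": 3}
    print(plan_actions(spec, None))
    print(plan_actions(spec, {"version": "2.3.0", "replicas": 1}))
    print(plan_actions(spec, {"version": "2.4.0", "replicas": 3}))
```

    A real operator runs this comparison continuously and carries the
    application-specific knowledge of how to perform each action safely.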

  26. Example: Spark
    ● Kubernetes supported as a native cluster manager in Spark 2.3+.
    ● Spark maintainers developed their own controller, as Spark
    workload patterns did not fit the out-of-the-box
    Kubernetes core workload types.
    ● Bypasses the “default” Spark job submission process and
    uses a SparkApplication CRD.
    https://github.com/GoogleCloudPlatform/spark-on-k8s-operator

  27. Example job
    apiVersion: sparkoperator.k8s.io/v1beta1
    kind: SparkApplication
    metadata:
      name: myspark
    spec:
      type: Scala
      mode: cluster
      image: gcr.io/spark-operator/spark:v2.4.0
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: local://spark-example.jar
      sparkVersion: 2.4.0
      volumes:
      - name: test-volume
        hostPath:
          path: "/tmp"
          type: Directory
      driver:
        cores: 0.1
        coreLimit: 200m
        memory: 512m
        labels:
          version: 2.4.0
        volumeMounts:
        - name: test-volume
          mountPath: /tmp
      executor:
        cores: 1
        instances: 1
        memory: "512m"
        labels:
          version: 2.4.0
        volumeMounts:
        - name: test-volume
          mountPath: /tmp

  28. Example: Kubeflow
    “The Kubeflow project is dedicated to making deployments of machine
    learning (ML) workflows on Kubernetes simple, portable and scalable.
    Our goal is not to recreate other services, but to provide a straightforward
    way to deploy best-of-breed open-source systems for ML to diverse infrastructures.
    Anywhere you are running Kubernetes, you should be able to run Kubeflow.”
    https://www.kubeflow.org/

  29. Example: Kubeflow
    Comprehensive Machine Learning Suite.
    https://www.kubeflow.org/

  30. Kubeflow Features & Integrations
    ● Chainer Training
    ● Hyperparameter Tuning (Katib)
    ● Istio Integration (for TF Serving)
    ● Jupyter Notebooks
    ● ModelDB
    ● ksonnet
    ● MPI Training
    ● MXNet Training
    ● Pipelines
    ● PyTorch Training
    ● Seldon Serving
    ● NVIDIA TensorRT Inference Server
    ● TensorFlow Serving
    ● TensorFlow Batch Predict
    ● TensorFlow Training (TFJob)
    ● PyTorch Serving

  32. Others
    ● Aerospike
    ● Airflow
    ● ArangoDB
    ● Cassandra
    ● CouchDB
    ● Federation-v2
    ● Flink
    ● Gluster
    ● Kafka
    ● KubeDB
    ● MongoDB
    ● MySQL
    ● NATS
    ● PostgreSQL
    ● Rook
    ● Velero
    ● Vitess
    ● Zookeeper

  33. Why Kubernetes?

  34. What containers have done for
    code, application portability and
    reproducible research --
    Kubernetes has done for the
    orchestration and management
    of those things.

  35. Complex applications can be
    packaged and distributed easily.
    If Kubernetes does not provide
    the needed primitives, it is easy
    enough to extend.

  36. Questions?
    [email protected]
    GitHub: @mrbobbytables
    Twitter: @mrbobbytables
