Kubernetes - The Next Research Platform

$ whoami
Bob Killen
[email protected]
Senior Research Cloud Administrator
CNCF Ambassador
GitHub: @mrbobbytables
Twitter: @mrbobbytables

Kubernetes TL;DR Edition
• Greek for “Pilot” or “Helmsman of a ship”.
• Container orchestration system originally developed at Google.
• Built with lessons learned from Borg and Omega.
• Designed from the ground up as a loosely coupled collection of components centered around deploying, maintaining and scaling workloads.
• Supports both on-prem and cloud provider deployments.

Kubernetes TL;DR Edition
• Declarative system.
• Steers the cluster towards the desired state.
• EVERYTHING is an API Object.
• Objects are generally described in YAML.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: job-example
    spec:
      backoffLimit: 4
      completions: 4
      parallelism: 2
      template:
        spec:
          containers:
          - name: hello
            image: alpine:latest
            command: ["/bin/sh", "-c"]
            args: ["echo hello from $HOSTNAME!"]
          restartPolicy: Never
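
To illustrate what "desired state" can look like in practice, here is a minimal sketch of a Deployment. The name `web-example` and the `nginx:stable` image are assumptions chosen for illustration and do not come from the talk. The manifest only declares that three replicas should exist; the cluster's controllers then work to keep it that way, for example by starting a replacement when a pod dies.

```yaml
# Hypothetical example: names and image are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-example
spec:
  replicas: 3                  # desired state: three running pods
  selector:
    matchLabels:
      app: web-example         # must match the pod template labels below
  template:
    metadata:
      labels:
        app: web-example
    spec:
      containers:
      - name: web
        image: nginx:stable
```

Changing `replicas` and re-applying the file is the whole "scale" operation: the declared state changes, and the system converges on it.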

Research needs are changing.

Why?
• Increased use of containers...everywhere.
• Moving away from strict “job” style workflows.
• Adoption of data-streaming and in-flight processing.
• Greater use of interactive Science Gateways.
• Dependence on other, more persistent services.

Why Kubernetes?
• Kubernetes is seeing significant adoption across enterprises and multiple fields of research, serving as both a scientific platform and a substrate for application management.
• Very large, active development community.
• Extremely easy to extend, augment, and integrate with other systems.

Why Kubernetes?
Use the SAME API across bare metal and EVERY cloud provider.

Challenges
• Difficult to integrate with classic multi-user POSIX infrastructure.
  ◦ Translating API-level identity to POSIX identity.
• Installation on-prem/bare-metal is not as well supported.
• Device support and integration is a pain point.
  ◦ GPUs are well supported; other devices, not as much.
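
As a rough sketch of where these two pain points surface in a manifest, the Pod below sets a POSIX UID/GID through `securityContext` and requests one GPU. The UID and GID values and the pod name are hypothetical, and the `nvidia.com/gpu` resource only exists if the NVIDIA device plugin is installed on the cluster. Note that nothing here ties the UID to the identity of the API user who submitted the pod; enforcing that mapping is left to site policy, which is the integration problem described above.

```yaml
# Hypothetical example: UID/GID values and names are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: posix-gpu-example
spec:
  securityContext:
    runAsUser: 12345           # POSIX UID the container processes run as
    runAsGroup: 5000           # primary POSIX GID
    fsGroup: 5000              # group applied to mounted volumes that support it
  containers:
  - name: worker
    image: alpine:latest
    command: ["/bin/sh", "-c"]
    args: ["id"]               # prints the effective UID/GID
    resources:
      limits:
        nvidia.com/gpu: 1      # assumes the NVIDIA device plugin is deployed
  restartPolicy: Never
```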

Challenges with Regard to HPC
• Difficult to integrate with classic multi-user POSIX infrastructure.
  ◦ Translating API-level identity to POSIX identity.
• No “native” concept of job queue or wall time.
  ◦ Up to higher-level components to extend and add that functionality.
• Scheduler generally not as expressive as common HPC workload managers such as Slurm or Torque.
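
The closest built-in approximation of a wall-time limit that I am aware of is the `activeDeadlineSeconds` field on a Job, sketched below with a hypothetical job name and a one-hour limit. It terminates the Job once the deadline passes, but it is not a queueing mechanism: there is no fair-share, priority queue, or backfill behind it, which is why the slide calls for higher-level components.

```yaml
# Hypothetical example: name and limits are illustrative only.
apiVersion: batch/v1
kind: Job
metadata:
  name: walltime-example
spec:
  activeDeadlineSeconds: 3600  # Job is terminated after one hour of activity
  backoffLimit: 0              # do not retry failed pods
  template:
    spec:
      containers:
      - name: work
        image: alpine:latest
        command: ["/bin/sh", "-c"]
        args: ["sleep 30 && echo done"]
      restartPolicy: Never
```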

Challenges
Very steep learning curve when coming from a traditional infrastructure background.

Ecosystem

Helm https://helm.sh

Helm
• “Package manager” for Kubernetes.
  ◦ Users only have to configure a few variables for their site, without needing to know most of the application's details.
• Many commonly used applications are packaged and distributed as “Helm Charts”.
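
The "few variables" are usually supplied as a small values file that overrides the chart's defaults. The sketch below is for a hypothetical chart: every key, hostname, and storage class name is an assumption, since the accepted keys are defined by each chart's own `values.yaml`.

```yaml
# values-mysite.yaml (hypothetical; actual keys depend on the chart)
image:
  tag: "1.2.3"                 # pin the application version
ingress:
  enabled: true
  host: app.example.edu        # site-specific hostname
persistence:
  storageClass: site-nfs       # site-specific storage class
  size: 50Gi
resources:
  limits:
    cpu: "2"
    memory: 4Gi
```

A file like this is passed to `helm install` with the `-f` flag; everything not listed keeps the chart's default.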

List of Charts
• Aerospike • Airflow • Argo • CockroachDB • Dask • Flink • Hadoop • Galaxy • Hazelcast • Ignite • Jenkins • JanusGraph • JupyterHub • Kafka • KubeDB • Luigi • MariaDB • Metabase • MongoDB • Moodle • NATS • Pachyderm • Postgres • Presto • Pulsar • RECAST • RabbitMQ • Spark • Tensorflow • Terracotta • Zookeeper

Controllers & Custom Resources https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/

Controllers & Custom Resources
• Custom Resource Definition (CRD).
• Extends current Kubernetes resources.
• Create your own Kubernetes API object that can be consumed in the SAME WAY with the SAME TOOLS as every other Kubernetes object.
• Add custom behaviors to workload management.

Example: CRD

    # The definition: registers a new "Foo" kind with the API server.
    apiVersion: apiextensions.k8s.io/v1beta1
    kind: CustomResourceDefinition
    metadata:
      name: foos.bar.example.com   # must be <plural>.<group>
    spec:
      group: bar.example.com
      version: v1alpha1
      scope: Namespaced
      names:
        plural: foos
        singular: foo
        kind: Foo
      validation:
        openAPIV3Schema:
          properties:
            spec:
              properties:
                varFoo:
                  type: string
    ---
    # An instance of the new kind; apiVersion is <group>/<version>.
    apiVersion: bar.example.com/v1alpha1
    kind: Foo
    metadata:
      name: myfoo
    spec:
      varFoo: bar
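
The slide uses the `apiextensions.k8s.io/v1beta1` CRD API, which was removed in Kubernetes 1.22. As a sketch of how the same definition might be written against `apiextensions.k8s.io/v1`, assuming the same hypothetical `Foo` kind: versions move into a `versions` list, and each version carries its own structural schema with explicit `type` fields.

```yaml
# Sketch of the same hypothetical Foo CRD using the v1 API.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.bar.example.com   # <plural>.<group>
spec:
  group: bar.example.com
  scope: Namespaced
  names:
    plural: foos
    singular: foo
    kind: Foo
  versions:
  - name: v1alpha1
    served: true               # exposed through the API
    storage: true              # version used for persistence
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              varFoo:
                type: string
```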

Example: Kube-batch
• Controller that adds coscheduling (gang scheduling) in the form of a PodGroup object and an additional scheduler.
• Developed by Huawei & IBM.
• Job Queues on the Road Map.
https://github.com/kubernetes-sigs/kube-batch

    apiVersion: scheduling.incubator.k8s.io/v1alpha1
    kind: PodGroup
    metadata:
      name: mpi-group    # object names must be lowercase
    spec:
      minMember: 6       # schedule only when 6 pods can start together
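
For context, here is a hedged sketch of how a workload might opt in to a PodGroup like the one above, here assumed to be named `mpi-group`. It is based on my recollection of the kube-batch tutorial: the `scheduling.k8s.io/group-name` annotation and the `kube-batch` scheduler name are assumptions that should be checked against the project's documentation for the version in use. The job name and container are hypothetical.

```yaml
# Hypothetical example: annotation key and scheduler name are assumptions
# taken from the kube-batch tutorial as I recall it; verify before use.
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-job-example
spec:
  completions: 6
  parallelism: 6               # matches minMember: 6 on the PodGroup
  template:
    metadata:
      annotations:
        scheduling.k8s.io/group-name: mpi-group   # joins the PodGroup
    spec:
      schedulerName: kube-batch                   # bypass the default scheduler
      containers:
      - name: rank
        image: alpine:latest
        command: ["/bin/sh", "-c"]
        args: ["echo started with the rest of the gang"]
      restartPolicy: Never
```

The point of the pairing is that none of the six pods start until all six can be placed, which avoids the partial-start deadlock common to MPI-style jobs.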

Example: Argo
• Powerful suite of workflow tools.
• Workflow engine supports both DAG and Pipeline based workflows.
• Built-in Event system.
• Integrated and used by many other organizations and projects.
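
A minimal sketch of what a DAG-based Argo workflow can look like, assuming hypothetical task names (`prepare`, `analyze`, `report`) and a trivial `echo` template. Each task lists the tasks it depends on, and Argo runs a task once its dependencies have finished.

```yaml
# Hypothetical example: task names and messages are illustrative only.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-example-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: prepare
        template: echo
        arguments:
          parameters:
          - name: message
            value: prepare
      - name: analyze
        dependencies: [prepare]   # runs after prepare completes
        template: echo
        arguments:
          parameters:
          - name: message
            value: analyze
      - name: report
        dependencies: [analyze]   # runs after analyze completes
        template: echo
        arguments:
          parameters:
          - name: message
            value: report
  - name: echo                    # reusable step used by every task
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:latest
      command: ["/bin/sh", "-c"]
      args: ["echo {{inputs.parameters.message}}"]
```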

Native vs CRD

    # Native: a core batch/v1 Job
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: hello-world
    spec:
      completions: 1
      template:
        spec:
          containers:
          - name: hello
            image: alpine:latest
            command: ["/bin/sh", "-c"]
            args: ["echo Hello World"]
          restartPolicy: Never
    ---
    # CRD: the equivalent Argo Workflow
    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-
    spec:
      entrypoint: hello
      arguments:
        parameters:
        - name: message
          value: Hello World
      templates:
      - name: hello
        inputs:
          parameters:
          - name: message
        container:
          image: alpine:latest
          command: ["/bin/sh", "-c"]
          args: ["echo {{inputs.parameters.message}}"]

Operators https://www.operatorhub.io/what-is-an-operator

Operator Pattern
• Uses Controllers & CRDs to manage complex applications.
• Introduced by CoreOS in 2016.
• Automatically handles the full application lifecycle: Install, Configuration, Upgrade, Backup, Failover and Scaling.
• Multiple frameworks available, supporting a wide range of languages and components.

Example: Spark
• Kubernetes supported as a native cluster manager in Spark 2.3+.
• Spark maintainers pursued developing their own controller, as Spark workload patterns did not fit the out-of-the-box Kubernetes core workload types.
• Bypasses the “default” Spark job submission process and uses a SparkApplication CRD.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator

Example job

    apiVersion: sparkoperator.k8s.io/v1beta1
    kind: SparkApplication
    metadata:
      name: myspark
    spec:
      type: Scala
      mode: cluster
      image: gcr.io/spark-operator/spark:v2.4.0
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: local://spark-example.jar
      sparkVersion: 2.4.0
      volumes:
      - name: test-volume
        hostPath:
          path: "/tmp"
          type: Directory
      driver:
        cores: 0.1
        coreLimit: 200m
        memory: 512m
        labels:
          version: 2.4.0
        volumeMounts:
        - name: test-volume
          mountPath: /tmp
      executor:
        cores: 1
        instances: 1
        memory: "512m"
        labels:
          version: 2.4.0
        volumeMounts:
        - name: test-volume
          mountPath: /tmp

Example: Kubeflow
“The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.”
Comprehensive Machine Learning Suite.
https://www.kubeflow.org/

Kubeflow Features & Integrations
• Chainer Training • Hyperparameter Tuning (Katib) • Istio Integration (for TF Serving) • Jupyter Notebooks • ModelDB • ksonnet • MPI Training • MXNet Training • Pipelines • PyTorch Training • Seldon Serving • NVIDIA TensorRT Inference Server • TensorFlow Serving • TensorFlow Batch Predict • TensorFlow Training (TFJob) • PyTorch Serving

Others
• Aerospike • Airflow • ArangoDB • Cassandra • CouchDB • Federation-v2 • Flink • Gluster • Kafka • KubeDB • MongoDB • MySQL • NATS • PostgreSQL • Rook • Velero • Vitess • Zookeeper

Why Kubernetes?

What containers have done for code and application portability and for reproducible research, Kubernetes has done for orchestrating and managing those containers.

Complex applications can be packaged and distributed easily. If Kubernetes
does not provide the needed primitives, it is easy enough to extend.

Questions? [email protected] GitHub: @mrbobbytables Twitter: @mrbobbytables
