Spark on Kubernetes

Newer Spark versions are getting closer to supporting all native Kubernetes scheduler features. The latest Spark comes with client-mode support on Kubernetes, which enables Spark notebooks and interactive work. In this talk we will learn how Spark uses Kubernetes to spin up Spark executors and coordinate Spark jobs with a driver process. A short demo will show a Spark job running in Google Kubernetes Engine and an Apache Zeppelin notebook in client mode. We will look at two scenarios: a driver inside the K8s cluster and a driver outside the cluster. We will also talk about new K8s-Spark integration features and the Spark community's future plans with regard to K8s.

Alexey Novakov

August 27, 2019

Transcript

  1. Hadoop & Spark User Group Rhein-Main
    Alexey Novakov, Ultra Tendency
    Spark on Kubernetes

  2. Agenda
    • Current state of Spark-Kubernetes integration
    • Demo of Spark Word Count program in K8s cluster on GCP

  3. Spark Cluster Components

  4. Kubernetes Architecture

  5. Supported Cluster Managers
    • Standalone
    • Apache Mesos
    • Hadoop YARN
    • Kubernetes Native mode (since Spark v2.3)

  6. Core Idea
    • Make use of the native Kubernetes scheduler that has been added to Spark
    * The Spark-Kubernetes scheduler is still experimental. There may be future changes in configuration, entrypoints, images, etc.

  7. Before Native K8s support
    • One could run Spark in Standalone mode on K8s:
      • Deploy a Spark Master POD
      • Deploy Spark Worker PODs
      • Submit a Job/Query via:
        • a Notebook (Jupyter, Zeppelin, Spark-Notebook)
        • the spark-submit script
    • It works, but cluster resource management is less efficient, since the Kubernetes scheduler is not in the game

  8. How it works: Cluster Mode (Spark v2.3)
    https://spark.apache.org/docs/latest/running-on-kubernetes.html
    Given:
    - a K8s cluster with 2 nodes
    - the spark-submit command, a shell script which manages a Spark application
    - 3 Spark executors requested via a config parameter
    Outcome:
    - Spark creates the Driver as a POD
    - the Driver POD creates the Executors as 3 PODs
    - when the Job is completed:
      - Executor PODs are removed
      - the Driver POD stays in COMPLETED state

  9. How it works: Client Mode (Spark v2.4)
    - Client Mode means the Driver can run either:
      1) as a Kubernetes POD
      2) or as a client outside of the K8s cluster entirely
    - In both cases the Driver must be routable from the Spark executor PODs
    Use cases:
    - interactive Spark: shell, Jupyter Notebook
    [Diagram: a Kubernetes cluster (apiserver, scheduler, executors 1-3) with the Client + Spark Driver either inside the cluster (Option 1) or outside it (Option 2)]
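On routability: when the driver runs as a POD (Option 1), the Spark documentation suggests fronting it with a headless service so executors can reach it via a stable DNS name. A sketch, with illustrative service name and port:

```shell
# Sketch (names illustrative): the driver POD sits behind a headless
# service; executors connect back to it using the routable DNS name
# and port given via spark.driver.host / spark.driver.port
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode client \
  --conf spark.driver.host=my-driver-svc.default.svc.cluster.local \
  --conf spark.driver.port=7078 \
  ...
```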

  10. Cluster Mode: Submit Job
    $ bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=3 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=<spark-image> \
      local:///path/to/some-spark-job.jar
    * path in the Docker image of the Driver/Executor container
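After submission, the resulting PODs can be inspected with kubectl; Spark labels them with spark-role. A sketch (the driver POD name is illustrative; actual names are generated from the app name):

```shell
# List the driver and executor PODs created by spark-submit
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Follow the driver log to watch job progress
kubectl logs -f spark-pi-driver

# The driver POD stays in COMPLETED state for inspection; clean it up with:
kubectl delete pod spark-pi-driver
```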

  11. Client Mode: Submit Job
    $ bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --deploy-mode client \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=3 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=<spark-image> \
      file://$SPARK_HOME/path/to/some-spark-job.jar
    * paths to dependencies point to a URI served by the Driver's web server

  12. Docker Images
    Spark Driver Image repo/spark-driver:v2.2.0-kubernetes-0.5.0
    Spark Executor Image repo/spark-executor:v2.2.0-kubernetes-0.5.0
    Spark Initialization Image repo/spark-init:v2.2.0-kubernetes-0.5.0
    Spark Staging Server Image repo/spark-resource-staging-server:v2.2.0-kubernetes-0.5.0
    Spark External Shuffle Service repo/spark-shuffle:v2.2.0-kubernetes-0.5.0
    PySpark Driver Image repo/spark-driver-py:v2.2.0-kubernetes-0.5.0
    PySpark Executor Image repo/spark-executor-py:v2.2.0-kubernetes-0.5.0
    SparkR Driver Image repo/spark-driver-r:v2.2.0-kubernetes-0.5.0
    SparkR Executor Image repo/spark-executor-r:v2.2.0-kubernetes-0.5.0
    One can easily build their own images using the script available in the Spark GitHub repo
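The script in question is bin/docker-image-tool.sh in the Spark distribution; a typical invocation (the repository name and tag are placeholders) looks like:

```shell
# Build the Spark image from the unpacked Spark distribution;
# -r sets the repository prefix, -t the image tag
./bin/docker-image-tool.sh -r <repo> -t v2.4.3 build

# Push the built image(s) to the registry
./bin/docker-image-tool.sh -r <repo> -t v2.4.3 push
```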

  13. K8s Volume Mounts
    Volumes can be mounted into Driver & Executor PODs:
    1. hostPath: mounts a file or directory from the host node's filesystem into a pod.
    2. emptyDir: an initially empty volume created when a pod is assigned to a node.
    3. persistentVolumeClaim: used to mount a PersistentVolume into a pod.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: spark-data-pvc
      labels:
        app: wikipedia-analyzer
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 1Gi
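To mount a claim like the one above, Spark exposes spark.kubernetes.{driver,executor}.volumes.* properties (since v2.4). A sketch; the volume name "data" and the mount path are illustrative:

```shell
# Mount the PVC spark-data-pvc at /data in both driver and executor PODs;
# "data" is an arbitrary volume name repeated across the related keys
bin/spark-submit \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
  ...
```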

  14. Future work for K8s
    • Kerberos authentication
    • Driver resilience for Spark Streaming applications
    • POD template (mount arbitrary volumes, ConfigMaps)
    • Better support for uploading dependencies (jars) from the client
    • Dynamic resource allocation and external shuffle service

  15. •  Word Count
    •  Spark Shell / Zeppelin Notebook
    •  Google File Storage
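The Word Count itself is the classic few lines of Spark code; a minimal sketch that pipes Scala into spark-shell (the bucket name and image are placeholders, assuming a GCS connector is configured):

```shell
# Run a classic word count interactively against the K8s cluster;
# spark-shell executes the Scala fed to it on stdin
spark-shell --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> <<'EOF'
val counts = sc.textFile("gs://<bucket>/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)
EOF
```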

  16. Demo 1: spark-shell
    [Diagram: ssh into a GCE VM* running the Spark Driver, which talks to the api-server of a Kubernetes cluster; the Spark Executors run as PODs on K8s Node 1 and K8s Node 2 (each a GCE VM*)]
    *GCE VM - Google Compute Engine Virtual Machine

  17. Demo 2: Apache Zeppelin
    • Version 0.9.0-SNAPSHOT brings NEW K8s support
    • When Zeppelin runs in a Pod, it creates pods for individual interpreters
    • Key benefits are:
      1. Interpreter scale-out
      2. The Spark interpreter auto-configures Spark on Kubernetes
      3. Ability to customize the Kubernetes YAML files
      4. Spark UI access

  18. To run 0.9.0-SNAPSHOT
    • A Docker image can be built from the master branch:
      mvn package -DskipTests -Pbuild-distr
      then build an image, e.g.:
      gcr.io/spark-test-244110/zeppelin:0.9-SNAPSHOT (my image)
    • kubectl apply -f zeppelin/k8s/zeppelin-server.yaml

  19. Demo 2: zeppelin-spark-k8s
    [Diagram: port-forward from a local machine to the Zeppelin UI; the Zeppelin-Server POD in the Kubernetes cluster spawns an Interpreter/Driver POD, which in turn creates executor PODs spread over K8s Node 1 and K8s Node 2 (each a GCE VM*); the api-server coordinates]
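The port-forward shown in the diagram can be done with kubectl; the service name and ports below are assumptions based on a default zeppelin-server.yaml deployment:

```shell
# Forward local port 8080 to the Zeppelin server service in the cluster,
# then open http://localhost:8080 in a browser
kubectl port-forward svc/zeppelin-server 8080:80
```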

  20. Other topics
    • Kubernetes Operator for Spark from Google
    • Jupyter Notebook with Scala in Kubernetes client mode

  21. Thank you! Questions?
    Spark on Kubernetes
    Alexey Novakov
    Twitter: @alexey_novakov
    Email: novakov.alex at gmail.com

  22. images
    https://unsplash.com/photos/_nqApgG-QrY
    https://unsplash.com/photos/MAYsdoYpGuk
    https://en.wikipedia.org/wiki/Kubernetes#/media/File:Kubernetes.png
