Slide 1

Spark on Kubernetes
Alexey Novakov, Ultra Tendency
Hadoop & Spark User Group Rhein-Main

Slide 2

Agenda
•  Current state of Spark-Kubernetes integration
•  Demo of a Spark Word Count program in a K8s cluster on GCP

Slide 3

Spark Cluster Components

Slide 4

Kubernetes Architecture

Slide 5

Supported Cluster Managers
•  Standalone
•  Apache Mesos
•  Hadoop YARN
•  Kubernetes Native mode (since Spark v2.3)
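As a quick illustration, the cluster manager is selected purely by the scheme of the master URL. A minimal sketch (all host names and ports are placeholders):

  import org.apache.spark.sql.SparkSession

  // Master URL formats for the four supported cluster managers:
  val standalone = "spark://master-host:7077"          // Spark Standalone
  val mesos      = "mesos://mesos-master:5050"         // Apache Mesos
  val yarn       = "yarn"                              // Hadoop YARN (cluster found via HADOOP_CONF_DIR)
  val k8s        = "k8s://https://apiserver-host:6443" // Kubernetes native (since Spark 2.3)

  val spark = SparkSession.builder()
    .appName("cluster-manager-demo")
    .master(k8s) // pick one of the URLs above
    .getOrCreate()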

Slide 6

Core Idea
•  Make use of the native Kubernetes scheduler support that has been added to Spark
* The Spark-Kubernetes scheduler is still experimental. There may be future changes in configuration, entry points, images, etc.

Slide 7

Before Native K8s Support
•  One could run Spark in Standalone mode on K8s:
   •  Deploy a Spark Master POD
   •  Deploy Spark Worker PODs
•  Submit a Job/Query via:
   •  a notebook (Jupyter, Zeppelin, Spark-Notebook)
   •  the spark-submit script
•  It works, but cluster resource management is less efficient, since the Kubernetes scheduler is not in the game (a sketch of this setup follows below)
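A rough sketch of that pre-native setup: the application simply points its master URL at the Spark Master POD, typically exposed through a Kubernetes Service. The service name below is a made-up assumption:

  import org.apache.spark.sql.SparkSession

  // Standalone-on-K8s: connect to a Spark Master POD exposed through a
  // Kubernetes Service ("spark-master-svc" is a hypothetical name).
  // K8s only keeps the Master/Worker PODs alive; resource management is
  // done by Spark's own standalone scheduler, not by the K8s scheduler.
  val spark = SparkSession.builder()
    .appName("standalone-on-k8s")
    .master("spark://spark-master-svc:7077")
    .getOrCreate()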

Slide 8

How it works: Cluster Mode (Spark v2.3)
https://spark.apache.org/docs/latest/running-on-kubernetes.html

Given:
- a K8s cluster with 2 Nodes
- the spark-submit command, a shell script that manages a Spark application
- 3 Spark Executors requested as a config parameter

Outcome:
- Spark creates the Driver as a POD
- the Driver POD creates the Executors as 3 PODs
- when the Job is completed:
  - the Executor PODs are removed
  - the Driver POD stays in COMPLETED state

Slide 9

How it works: Client Mode (Spark v2.4)

- Client Mode means the Spark Driver can run either:
  1) as a Kubernetes POD, or
  2) as a client outside of the K8s cluster entirely
- In both cases the Driver must be routable from the Spark executor PODs

[Diagram: Option 1 shows the Client + Spark Driver as a POD inside the Kubernetes Cluster; Option 2 shows the Client + Spark Driver outside of it. In both options the submit goes to the apiserver/scheduler, and executor 1, 2, 3 run as PODs in the cluster.]

Use cases:
- interactive Spark: shell, Jupyter Notebook
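A hedged sketch of that routability requirement, using the standard spark.driver.host / spark.driver.port settings; every host name, the image name, and the port below are placeholder assumptions:

  import org.apache.spark.sql.SparkSession

  // Client mode: this JVM is the driver, so executor PODs must be able
  // to route back to it. All names and ports here are placeholders.
  val spark = SparkSession.builder()
    .appName("client-mode-example")
    .master("k8s://https://apiserver-host:6443")
    .config("spark.kubernetes.container.image", "repo/spark:v2.4.0")
    // Option 1 (driver runs as a POD): advertise a Service address.
    // Option 2 (driver outside K8s): advertise an address reachable
    // from inside the cluster.
    .config("spark.driver.host", "driver-svc.default.svc.cluster.local")
    .config("spark.driver.port", "29413")
    .getOrCreate()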

Slide 10

Cluster Mode: Submit Job

$ bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=3 \
    --conf spark.app.name=spark-pi \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/some-spark-job.jar

* local:// refers to a path inside the Docker image of the Driver/Executor container
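For reference, the submitted class can be as small as the sketch below, written in the spirit of the bundled org.apache.spark.examples.SparkPi (not its exact source):

  import org.apache.spark.sql.SparkSession
  import scala.math.random

  // Monte Carlo estimate of Pi: sample random points in the square
  // [-1, 1] x [-1, 1] and count how many land inside the unit circle.
  object SparkPi {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("spark-pi").getOrCreate()
      val n = 1000000
      val count = spark.sparkContext
        .parallelize(1 to n)
        .filter { _ =>
          val (x, y) = (random * 2 - 1, random * 2 - 1)
          x * x + y * y <= 1 // point falls inside the unit circle
        }
        .count()
      println(s"Pi is roughly ${4.0 * count / n}")
      spark.stop()
    }
  }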

Slide 11

Client Mode: Submit Job

$ bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode client \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=3 \
    --conf spark.app.name=spark-pi \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=<spark-image> \
    file://$SPARK_HOME/path/to/some-spark-job.jar

* a file:// path to dependencies points to a URI served by the Driver's web server

Slide 12

Docker Images

Spark Driver Image              repo/spark-driver:v2.2.0-kubernetes-0.5.0
Spark Executor Image            repo/spark-executor:v2.2.0-kubernetes-0.5.0
Spark Initialization Image      repo/spark-init:v2.2.0-kubernetes-0.5.0
Spark Staging Server Image      repo/spark-resource-staging-server:v2.2.0-kubernetes-0.5.0
Spark External Shuffle Service  repo/spark-shuffle:v2.2.0-kubernetes-0.5.0
PySpark Driver Image            repo/spark-driver-py:v2.2.0-kubernetes-0.5.0
PySpark Executor Image          repo/spark-executor-py:v2.2.0-kubernetes-0.5.0
SparkR Driver Image             repo/spark-driver-r:v2.2.0-kubernetes-0.5.0
SparkR Executor Image           repo/spark-executor-r:v2.2.0-kubernetes-0.5.0

One can easily build their own images using the script available in the Spark GitHub repo.

Slide 13

K8s Volume Mounts

Volumes can be mounted into Driver & Executor PODs:
1.  hostPath: mounts a file or directory from the host node's filesystem into a pod.
2.  emptyDir: an initially empty volume created when a pod is assigned to a node.
3.  persistentVolumeClaim: used to mount a PersistentVolume into a pod.

Example claim, referenced from the config sketch below:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: spark-data-pvc
  labels:
    app: wikipedia-analyzer
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 1Gi
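The claim above is then wired into Spark via the spark.kubernetes.*.volumes.* config keys. A sketch; the volume name "data" and the mount path are arbitrary choices, and in cluster mode these keys would normally be passed as --conf flags to spark-submit rather than set on the builder:

  import org.apache.spark.sql.SparkSession

  // Mount the PVC declared above into every executor POD.
  // Key pattern: spark.kubernetes.executor.volumes.[type].[name].*
  val spark = SparkSession.builder()
    .appName("wikipedia-analyzer")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/data")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly", "true")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "spark-data-pvc")
    .getOrCreate()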

Slide 14

Future Work for K8s
•  Kerberos authentication
•  Driver resilience for Spark Streaming applications
•  POD templates (mount arbitrary volumes, ConfigMaps)
•  Better support for uploading dependencies (jars) from the client
•  Dynamic resource allocation and external shuffle service

Slide 15

•  Word Count
•  Spark Shell / Zeppelin Notebook
•  Google File Storage

Slide 16

Demo 1: spark-shell

[Diagram: ssh into a GCE VM (Google Compute Engine Virtual Machine) that runs the Spark Driver; it talks to the api-server of the Kubernetes Cluster, whose K8s Nodes 1 and 2 (also GCE VMs) run the Spark Executor PODs.]
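The word count itself is only a few lines typed into the shell. A sketch, assuming the input already sits in a GCS bucket reachable through the GCS connector (the bucket and file names are made up):

  // Typed into spark-shell, where `spark` is the provided SparkSession.
  // gs://some-bucket/wiki.txt is a hypothetical input path.
  val counts = spark.sparkContext
    .textFile("gs://some-bucket/wiki.txt")
    .flatMap(_.split("\\s+"))
    .filter(_.nonEmpty)
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Show the 20 most frequent words.
  counts.sortBy(_._2, ascending = false).take(20).foreach(println)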

Slide 17

Demo 2: Apache Zeppelin
•  Version 0.9.0-SNAPSHOT brings new K8s support
•  When Zeppelin runs in a POD, it creates PODs for individual interpreters
•  Key benefits:
   1.  Interpreter scale-out
   2.  The Spark interpreter auto-configures Spark on Kubernetes
   3.  Ability to customize the Kubernetes YAML file
   4.  Spark UI access

Slide 18

To run 0.9.0-SNAPSHOT
•  The Docker image can be built from the master branch:
     mvn package -DskipTests -Pbuild-distr
   then build an image, e.g. gcr.io/spark-test-244110/zeppelin:0.9-SNAPSHOT (my image)
•  kubectl apply -f zeppelin/k8s/zeppelin-server.yaml

Slide 19

Demo 2: zeppelin-spark-k8s

[Diagram: port-forward from the workstation to the Zeppelin UI on the Zeppelin-Server POD; the Spark Interpreter/Driver POD and three executor PODs run on K8s Nodes 1 and 2 (GCE VMs) inside the Kubernetes Cluster, coordinated through the api-server.]

Slide 20

Other Topics
•  Kubernetes Operator for Spark from Google
•  Jupyter Notebook with Scala in Kubernetes client mode

Slide 21

Thank you! Questions?
Spark on Kubernetes
Alexey Novakov
Twitter: @alexey_novakov
Email: novakov.alex at gmail.com

Slide 22

Images
https://unsplash.com/photos/_nqApgG-QrY
https://unsplash.com/photos/MAYsdoYpGuk
https://en.wikipedia.org/wiki/Kubernetes#/media/File:Kubernetes.png