Spark on Kubernetes

New Spark versions are getting closer to using all the native Kubernetes scheduler features. The latest Spark release adds support for client mode on Kubernetes, which enables us to use Spark notebooks and work interactively. In this talk we will learn how Spark uses Kubernetes to spin up Spark executors and coordinate Spark jobs with a driver process. There will be a short demo with a Spark job running in Google Kubernetes Engine and an Apache Zeppelin notebook in client mode. We will look at two scenarios: a driver inside the K8s cluster and a driver outside the cluster. We will also talk about new K8s-Spark integration features and the future plans of the Spark community with regard to K8s.

Alexey Novakov

August 27, 2019

Transcript

1. Supported Cluster Managers
   •  Standalone
   •  Apache Mesos
   •  Hadoop YARN
   •  Kubernetes native mode (since Spark v2.3)
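   As a quick orientation (not from the slides): the cluster manager is selected by the --master URL passed to spark-submit. A sketch, where hosts are placeholders and the ports are the common defaults:

   $ spark-submit --master spark://<host>:7077 ...           # Standalone
   $ spark-submit --master mesos://<host>:5050 ...           # Apache Mesos
   $ spark-submit --master yarn ...                          # Hadoop YARN (reads HADOOP_CONF_DIR)
   $ spark-submit --master k8s://https://<host>:<port> ...   # Kubernetes native mode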
2. Core Idea
   •  Make use of the native Kubernetes scheduler that has been added to Spark
   * The Spark-Kubernetes scheduler is still experimental. There may be future changes in configuration, entrypoints, images, etc.
3. Before Native K8s Support
   •  One could run Spark in Standalone mode on K8s:
      •  Deploy a Spark Master POD
      •  Deploy Spark Worker PODs
      •  Submit a job/query via:
         •  Notebook (Jupyter, Zeppelin, Spark-Notebook)
         •  spark-submit script
   •  It works, but cluster resource management is less efficient, since the Kubernetes scheduler is not in the game (a minimal sketch of this setup follows)
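   A minimal sketch of that pre-native standalone setup, using kubectl only. The image name is a placeholder, /opt/spark is an assumed install path inside it, and passing a command after "--" requires a reasonably recent kubectl:

   $ kubectl create deployment spark-master --image=<your-spark-image> \
       -- /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
   $ kubectl expose deployment spark-master --port=7077     # creates service "spark-master"
   $ kubectl create deployment spark-worker --image=<your-spark-image> \
       -- /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
          spark://spark-master:7077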
4. How It Works: Cluster Mode (Spark v2.3)
   https://spark.apache.org/docs/latest/running-on-kubernetes.html
   Given:
   - a 2-node K8s cluster
   - the spark-submit command, a shell script to manage a Spark application
   - 3 Spark executors set as a config parameter
   Outcome:
   - Spark creates the Driver as a POD
   - the Driver POD creates the 3 Executor PODs
   - when the job is completed:
     - the Executor PODs are removed
     - the Driver POD stays in COMPLETED state
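   This lifecycle can be observed with kubectl: Spark labels the pods it creates with a spark-role label, so a sketch like the following works (the driver pod name here is a placeholder):

   $ kubectl get pods -l spark-role=driver      # stays in Completed state after the job
   $ kubectl get pods -l spark-role=executor    # removed once the job finishes
   $ kubectl logs <driver-pod-name>             # driver logs remain available afterwards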
5. How It Works: Client Mode (Spark v2.4)
   - Client mode means the driver can run either:
     1) as a Kubernetes POD, or
     2) as a client outside of the K8s cluster entirely
   - In both cases the Driver must be routable from the Spark executor PODs (a sketch of one way to achieve this follows below)
   [Diagram: Option 1 — Client + Spark Driver inside the Kubernetes cluster; Option 2 — Client + Spark Driver outside the cluster; apiserver, scheduler, executors 1-3]
   Use cases: interactive Spark — shell, Jupyter Notebook
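   For Option 1 (driver as a POD), routability is typically achieved with a headless service pointing at the driver pod. A hedged sketch — the service name, port, namespace, and the assumption that the driver pod carries a matching app=spark-driver label are mine, not from the slides:

   $ kubectl create service clusterip spark-driver --clusterip=None --tcp=7078:7078
   $ bin/spark-submit ... \
       --conf spark.driver.host=spark-driver.default.svc.cluster.local \
       --conf spark.driver.port=7078 \
       ...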
6. Cluster Mode: Submit Job

   $ bin/spark-submit \
       --master k8s://https://<api-server-host>:<api-server-port> \
       --deploy-mode cluster \
       --name spark-pi \
       --class org.apache.spark.examples.SparkPi \
       --conf spark.executor.instances=3 \
       --conf spark.app.name=spark-pi \
       --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
       --conf spark.kubernetes.container.image=<spark-image> \
       local:///path/to/some-spark-job.jar

   * the local:// path refers to a location inside the Docker image of the Driver/Executor container
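   In cluster mode, spark-submit returns once the application is submitted; a common follow-up (a sketch — take the driver pod name from spark-submit's output):

   $ kubectl logs -f <driver-pod-name>      # stream the application logs
   $ kubectl delete pod <driver-pod-name>   # remove the Completed driver pod when done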
7. Client Mode: Submit Job

   $ bin/spark-submit \
       --master k8s://https://<api-server-host>:<api-server-port> \
       --deploy-mode client \
       --name spark-pi \
       --class org.apache.spark.examples.SparkPi \
       --conf spark.executor.instances=3 \
       --conf spark.app.name=spark-pi \
       --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
       --conf spark.kubernetes.container.image=<spark-image> \
       file://$SPARK_HOME/path/to/some-spark-job.jar

   * the path to dependencies points to a URI served by the Driver's web server
8. Docker Images
   Spark Driver:                    repo/spark-driver:v2.2.0-kubernetes-0.5.0
   Spark Executor:                  repo/spark-executor:v2.2.0-kubernetes-0.5.0
   Spark Initialization:            repo/spark-init:v2.2.0-kubernetes-0.5.0
   Spark Staging Server:            repo/spark-resource-staging-server:v2.2.0-kubernetes-0.5.0
   Spark External Shuffle Service:  repo/spark-shuffle:v2.2.0-kubernetes-0.5.0
   PySpark Driver:                  repo/spark-driver-py:v2.2.0-kubernetes-0.5.0
   PySpark Executor:                repo/spark-executor-py:v2.2.0-kubernetes-0.5.0
   SparkR Driver:                   repo/spark-driver-r:v2.2.0-kubernetes-0.5.0
   SparkR Executor:                 repo/spark-executor-r:v2.2.0-kubernetes-0.5.0
   One can easily build their own images using the script available in the Spark GitHub repo
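   In recent Spark distributions that script ships as bin/docker-image-tool.sh; a sketch with placeholder repo and tag:

   $ ./bin/docker-image-tool.sh -r <repo> -t <tag> build
   $ ./bin/docker-image-tool.sh -r <repo> -t <tag> push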
9. K8s Volume Mounts
   Volumes can be mounted to Driver & Executor PODs:
   1. hostPath: mounts a file or directory from the host node's filesystem into a pod
   2. emptyDir: an initially empty volume created when a pod is assigned to a node
   3. persistentVolumeClaim: used to mount a PersistentVolume into a pod

   apiVersion: v1
   kind: PersistentVolumeClaim
   metadata:
     name: spark-data-pvc
     labels:
       app: wikipedia-analyzer
   spec:
     accessModes:
       - ReadOnlyMany
     resources:
       requests:
         storage: 1Gi
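   Wiring the PVC above into Spark is done through the spark.kubernetes.*.volumes.* properties; in this sketch "data" is an arbitrary volume name and /data an arbitrary mount path chosen for illustration:

   $ bin/spark-submit ... \
       --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data \
       --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data \
       --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
       ...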
10. Future Work for K8s
    •  Kerberos authentication
    •  Driver resilience for Spark Streaming applications
    •  POD template (mount arbitrary volumes, ConfigMaps)
    •  Better support for uploading dependencies (jars) from the client
    •  Dynamic resource allocation and external shuffle service
11. Demo 1: spark-shell
    [Diagram: ssh into a GCE VM (Google Compute Engine Virtual Machine) running the Spark Driver outside the cluster; the driver talks to the api-server, which schedules the Spark Executor PODs across K8s Node 1 and K8s Node 2 (GCE VMs) inside the Kubernetes cluster]
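    Demo 1 boils down to a client-mode spark-shell run from the VM. A sketch (spark-shell always runs in client mode, and spark.driver.host must be an address the executor pods can reach, per slide 5; the placeholder values are mine):

    $ bin/spark-shell \
        --master k8s://https://<api-server-host>:<api-server-port> \
        --conf spark.executor.instances=3 \
        --conf spark.kubernetes.container.image=<spark-image> \
        --conf spark.driver.host=<vm-address-routable-from-pods>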
12. Demo 2: Apache Zeppelin
    •  Version 0.9.0-SNAPSHOT brings NEW K8s support
    •  When Zeppelin runs in a Pod, it creates pods for the individual interpreters
    •  Key benefits:
       1. Interpreter scale-out
       2. The Spark interpreter auto-configures Spark on Kubernetes
       3. Able to customize the Kubernetes YAML file
       4. Spark UI access
13. To Run 0.9.0-SNAPSHOT
    •  The Docker image can be built from the master branch:
       mvn package -DskipTests -Pbuild-distr
       then build an image, e.g. gcr.io/spark-test-244110/zeppelin:0.9-SNAPSHOT (my image)
    •  kubectl apply -f zeppelin/k8s/zeppelin-server.yaml
14. Demo 2: zeppelin-spark-k8s
    [Diagram: a Zeppelin-Server POD and an Interpreter/Driver POD run inside the Kubernetes cluster; the driver talks to the api-server, which schedules 3 executor PODs across K8s Node 1 and K8s Node 2 (GCE VMs); the Zeppelin UI is reached via port-forward]
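    The port-forward mentioned in the diagram is plain kubectl; a sketch, where the pod name depends on what zeppelin-server.yaml creates and 8080 is Zeppelin's default port:

    $ kubectl port-forward pod/zeppelin-server 8080:8080
    # then open http://localhost:8080 in a browser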