Spark on Kubernetes

Newer Spark versions are getting closer to supporting all native Kubernetes scheduler features. The latest Spark comes with client-mode support on Kubernetes, which enables Spark notebooks and interactive work. In this talk we will learn how Spark uses Kubernetes to spin up Spark executors and coordinate Spark jobs with a driver process. A short demo will show a Spark job running in Google Kubernetes Engine and an Apache Zeppelin notebook in client mode. We will look at two scenarios: a driver inside the K8s cluster and a driver outside the cluster. We will also talk about new K8s-Spark integration features and the Spark community's future plans with regard to K8s.

Alexey Novakov

August 27, 2019

Transcript

  1. Hadoop & Spark User Group Rhein-Main
    Alexey Novakov, Ultra Tendency
    Spark on Kubernetes

  2. Agenda
    • Current state of Spark-Kubernetes integration
    • Demo of Spark Word Count program in K8s cluster on GCP

  3. Spark Cluster Components

  4. Kubernetes Architecture

  5. Supported Cluster Managers
    • Standalone
    • Apache Mesos
    • Hadoop YARN
    • Kubernetes Native mode (since Spark v2.3)

  6. Core Idea
    • Make use of the native Kubernetes scheduler that has been added to Spark
    * The Spark-Kubernetes scheduler is still experimental. There may be future changes in configuration, entrypoints, images, etc.

  7. Before Native K8s support
    • One could run Spark in Standalone mode on K8s:
      • Deploy a Spark Master POD
      • Deploy Spark Worker PODs
      • Submit a Job/Query via:
        • a Notebook (Jupyter, Zeppelin, Spark-Notebook)
        • the spark-submit script
    • It works, but cluster resource management is less efficient, since the Kubernetes scheduler is not in the game

  8. How it works: Cluster Mode (Spark v2.3)
    https://spark.apache.org/docs/latest/running-on-kubernetes.html
    Given:
    - a K8s cluster with 2 nodes
    - the spark-submit command, a shell script which manages a Spark application
    - 3 Spark executors requested via a config parameter
    Outcome:
    - Spark creates the Driver as a POD
    - the Driver POD creates the Executors as 3 PODs
    - when the Job is completed:
      - Executor PODs are removed
      - the Driver POD stays in COMPLETED state

  9. How it works: Client Mode (Spark v2.4)
    - Client Mode means the Driver can run either:
      1) as a Kubernetes POD
      2) or as a client outside of the K8s cluster entirely
    - In both cases the Driver must be routable from the Spark executor PODs
    Use cases:
    - interactive Spark: shell, Jupyter Notebook
    [Diagram: a Kubernetes cluster (apiserver, scheduler, executors 1-3) with the Client + Spark Driver either inside the cluster (Option 1) or outside it (Option 2)]
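On routability: when the driver runs as a POD (Option 1), the Spark documentation suggests fronting it with a headless service so executors can reach it via a stable DNS name. A sketch, with illustrative service name and port:

```shell
# Sketch (names illustrative): the driver POD sits behind a headless
# service; executors connect back to it using the routable DNS name
# and port given via spark.driver.host / spark.driver.port
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode client \
  --conf spark.driver.host=my-driver-svc.default.svc.cluster.local \
  --conf spark.driver.port=7078 \
  ...
```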

  10. Cluster Mode: Submit Job
    $ bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=3 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=<spark-image> \
      local:///path/to/some-spark-job.jar
    * path in the Docker image of the Driver/Executor container
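After submission, the resulting PODs can be inspected with kubectl; Spark labels them with spark-role. A sketch (the driver POD name is illustrative; actual names are generated from the app name):

```shell
# List the driver and executor PODs created by spark-submit
kubectl get pods -l spark-role=driver
kubectl get pods -l spark-role=executor

# Follow the driver log to watch job progress
kubectl logs -f spark-pi-driver

# The driver POD stays in COMPLETED state for inspection; clean it up with:
kubectl delete pod spark-pi-driver
```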

  11. Client Mode: Submit Job
    $ bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
      --deploy-mode client \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=3 \
      --conf spark.app.name=spark-pi \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=<spark-image> \
      file://$SPARK_HOME/path/to/some-spark-job.jar
    * paths to dependencies point to a URI served by the Driver's web server

  12. Docker Images
    Spark Driver Image repo/spark-driver:v2.2.0-kubernetes-0.5.0
    Spark Executor Image repo/spark-executor:v2.2.0-kubernetes-0.5.0
    Spark Initialization Image repo/spark-init:v2.2.0-kubernetes-0.5.0
    Spark Staging Server Image repo/spark-resource-staging-server:v2.2.0-kubernetes-0.5.0
    Spark External Shuffle Service repo/spark-shuffle:v2.2.0-kubernetes-0.5.0
    PySpark Driver Image repo/spark-driver-py:v2.2.0-kubernetes-0.5.0
    PySpark Executor Image repo/spark-executor-py:v2.2.0-kubernetes-0.5.0
    SparkR Driver Image repo/spark-driver-r:v2.2.0-kubernetes-0.5.0
    SparkR Executor Image repo/spark-executor-r:v2.2.0-kubernetes-0.5.0
    One can easily build their own images using the script available in the Spark GitHub repo
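The script in question is bin/docker-image-tool.sh in the Spark distribution; a typical invocation (the repository name and tag are placeholders) looks like:

```shell
# Build the Spark image from the unpacked Spark distribution;
# -r sets the repository prefix, -t the image tag
./bin/docker-image-tool.sh -r <repo> -t v2.4.3 build

# Push the built image(s) to the registry
./bin/docker-image-tool.sh -r <repo> -t v2.4.3 push
```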

  13. K8s Volume Mounts
    Volumes can be mounted into Driver & Executor PODs:
    1. hostPath: mounts a file or directory from the host node's filesystem into a pod.
    2. emptyDir: an initially empty volume created when a pod is assigned to a node.
    3. persistentVolumeClaim: used to mount a PersistentVolume into a pod.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: spark-data-pvc
      labels:
        app: wikipedia-analyzer
    spec:
      accessModes:
        - ReadOnlyMany
      resources:
        requests:
          storage: 1Gi
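To mount a claim like the one above, Spark exposes spark.kubernetes.{driver,executor}.volumes.* properties (since v2.4). A sketch; the volume name "data" and the mount path are illustrative:

```shell
# Mount the PVC spark-data-pvc at /data in both driver and executor PODs;
# "data" is an arbitrary volume name repeated across the related keys
bin/spark-submit \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.mount.path=/data \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=spark-data-pvc \
  ...
```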

  14. Future work for K8s
    • Kerberos authentication
    • Driver resilience for Spark Streaming applications
    • POD template (mount arbitrary volumes, ConfigMaps)
    • Better support for uploading dependencies (jars) from the client
    • Dynamic resource allocation and external shuffle service

  15. •  Word Count
    •  Spark Shell / Zeppelin Notebook
    •  Google File Storage
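The Word Count itself is the classic few lines of Spark code; a minimal sketch that pipes Scala into spark-shell (the bucket name and image are placeholders, assuming a GCS connector is configured):

```shell
# Run a classic word count interactively against the K8s cluster;
# spark-shell executes the Scala fed to it on stdin
spark-shell --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> <<'EOF'
val counts = sc.textFile("gs://<bucket>/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)
EOF
```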

  16. Demo 1: spark-shell
    [Diagram: ssh into a GCE VM* running the Spark Driver, which talks to the api-server of a Kubernetes cluster; the Spark Executors run as PODs on K8s Node 1 and K8s Node 2 (each a GCE VM*)]
    *GCE VM - Google Compute Engine Virtual Machine

  17. Demo 2: Apache Zeppelin
    • Version 0.9.0-SNAPSHOT brings NEW K8s support
    • When Zeppelin runs in a Pod, it creates pods for individual interpreters
    • Key benefits are:
      1. Interpreter scale-out
      2. The Spark interpreter auto-configures Spark on Kubernetes
      3. Ability to customize the Kubernetes YAML files
      4. Spark UI access

  18. To run 0.9.0-SNAPSHOT
    • A Docker image can be built from the master branch:
      mvn package -DskipTests -Pbuild-distr
      then build an image, e.g.:
      gcr.io/spark-test-244110/zeppelin:0.9-SNAPSHOT (my image)
    • kubectl apply -f zeppelin/k8s/zeppelin-server.yaml

  19. Demo 2: zeppelin-spark-k8s
    [Diagram: port-forward from a local machine to the Zeppelin UI; the Zeppelin-Server POD in the Kubernetes cluster spawns an Interpreter/Driver POD, which in turn creates executor PODs spread over K8s Node 1 and K8s Node 2 (each a GCE VM*); the api-server coordinates]
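The port-forward shown in the diagram can be done with kubectl; the service name and ports below are assumptions based on a default zeppelin-server.yaml deployment:

```shell
# Forward local port 8080 to the Zeppelin server service in the cluster,
# then open http://localhost:8080 in a browser
kubectl port-forward svc/zeppelin-server 8080:80
```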

  20. Other topics
    • Kubernetes Operator for Spark from Google
    • Jupyter Notebook with Scala in Kubernetes client mode

  21. Thank you! Questions?
    Spark on Kubernetes
    Alexey Novakov
    Twitter: @alexey_novakov
    Email: novakov.alex at gmail.com

  22. images
    https://unsplash.com/photos/_nqApgG-QrY
    https://unsplash.com/photos/MAYsdoYpGuk
    https://en.wikipedia.org/wiki/Kubernetes#/media/File:Kubernetes.png
