Slide 1

Deploy your own Spark cluster in 4 minutes using sbt
Pishen Tsai @ KKBOX

Slide 2

KKBOX / spark-deployer
● SBT plugin.
● Used in production at KKBOX.
● 100% Scala.
https://github.com/KKBOX/spark-deployer

Slide 3


No content

Slide 4

write the code → compile & assembly → create cluster → submit job → destroy cluster

Slide 5

Solutions
● spark-ec2: http://spark.apache.org/docs/latest/ec2-scripts.html
● Amazon EMR (Elastic MapReduce): https://aws.amazon.com/elasticmapreduce/details/spark
● spark-deployer

Slide 6

Solutions
● spark-ec2
● Amazon EMR (Elastic MapReduce)
● spark-deployer

Slide 7

spark-ec2
write the code
compile & assembly: sbt
create cluster: spark-ec2
submit job: scp & ssh
destroy cluster: spark-ec2

Slide 8

spark-ec2’s commands
$ sbt assembly
$ spark-ec2 -k awskey -i ~/.ssh/awskey.pem -r us-west-2 -z us-west-2a --vpc-id=vpc-a28d24c7 --subnet-id=subnet-4eb27b39 -s 2 -t c4.xlarge -m m4.large --spark-version=1.5.2 --copy-aws-credentials launch my-spark-cluster
$ scp -i ~/.ssh/awskey.pem target/scala-2.10/my_job-assembly-0.1.jar root@:~/job.jar
$ ssh -i ~/.ssh/awskey.pem root@ './spark/bin/spark-submit --class mypackage.Main --master spark://:7077 --executor-memory 6G job.jar arg0'
$ spark-ec2 -r us-west-2 destroy my-spark-cluster

Slide 9

spark-ec2
write the code
compile & assembly: sbt
create cluster: spark-ec2
submit job: scp & ssh
destroy cluster: spark-ec2
(the commands glued together with make)

Slide 10

spark-ec2’s bad parts
● Need to install sbt and spark-ec2.
● Need to design and maintain Makefiles.
● Slow startup time (~20 mins).

Slide 11

Solutions
● spark-ec2
● Amazon EMR (Elastic MapReduce)
● spark-deployer

Slide 12

emr
write the code
compile & assembly: sbt
create cluster: emr
submit job: emr
destroy cluster: emr

Slide 13

emr’s commands
$ sbt assembly
$ aws emr create-cluster --name my-spark-cluster --release-label emr-4.2.0 --instance-type m3.xlarge --instance-count 2 --applications Name=Spark --ec2-attributes KeyName=awskey --use-default-roles
$ aws emr put --cluster-id j-2AXXXXXXGAPLF --key-pair-file ~/.ssh/mykey.pem --src target/scala-2.10/my_job-assembly-0.1.jar --dest /home/hadoop/job.jar
$ aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=Spark,Name=my-emr,ActionOnFailure=CONTINUE,Args=[--executor-memory,13G,--class,mypackage.Main,/home/hadoop/job.jar,arg0]
$ aws emr terminate-clusters --cluster-id j-2AXX

Slide 14

emr
write the code
compile & assembly: sbt
create cluster: emr
submit job: emr
destroy cluster: emr
(the commands glued together with make)

Slide 15

emr’s bad parts
● Need to install sbt and emr.
● Need to design and maintain Makefiles.
● Spark’s version is old.
● Restricted machine types.

Slide 16

Since sbt is itself a powerful build tool, why don’t we let it handle all the dirty work for us?

Slide 17

Solutions
● spark-ec2
● Amazon EMR (Elastic MapReduce)
● spark-deployer

Slide 18

spark-deployer
write the code
compile & assembly: sbt
create cluster: sbt
submit job: sbt
destroy cluster: sbt

Slide 19

spark-deployer’s commands
$ sbt "sparkCreateCluster 2"
$ sbt "sparkSubmitJob arg0"
$ sbt "sparkDestroyCluster"
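These tasks come from the spark-deployer sbt plugin. A minimal sketch of the build setup, assuming the plugin coordinates published in the project’s README (the version string below is a placeholder; check the repo for the current release):

```scala
// project/plugins.sbt: pull in the spark-deployer sbt plugin.
// Coordinates and version are illustrative; see
// https://github.com/KKBOX/spark-deployer for the current release.
addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "X.Y.Z")
```

After reloading sbt, the sparkCreateCluster, sparkSubmitJob, and sparkDestroyCluster tasks shown above should be available from the sbt shell.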

Slide 20

spark-deployer’s good parts
● Need to install only sbt.
● No Makefile.
● Easy to use. Lets you focus on your code.
● Fast and parallel startup (~4 mins).
● Dynamic scale-out.
● Flexible design.

Slide 21


How to use it?

Slide 22

Prerequisites
● java
● sbt
● export AWS_ACCESS_KEY_ID=...
  export AWS_SECRET_ACCESS_KEY=...
sbt installation: http://www.scala-sbt.org/0.13/tutorial/Manual-Installation.html#Unix
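The two export lines above are the only AWS setup needed; a minimal shell sketch (the key values here are placeholders, not real credentials):

```shell
# Placeholder AWS credentials; spark-deployer reads these two
# variables from the environment when it talks to EC2.
export AWS_ACCESS_KEY_ID=AKIAEXAMPLE
export AWS_SECRET_ACCESS_KEY=examplesecretkey

# Fail fast if either variable is missing before launching sbt.
echo "${AWS_ACCESS_KEY_ID:?not set} ${AWS_SECRET_ACCESS_KEY:?not set}"
# prints: AKIAEXAMPLE examplesecretkey
```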

Slide 23


Demo

Slide 24

KKBOX / spark-deployer
Give it a try, and share!
● Report issues.
● Join our gitter channel.
● Send pull requests.
https://github.com/KKBOX/spark-deployer

Slide 25

Thank you
Pishen Tsai @ KKBOX
KKBOX / spark-deployer