
Deploy your own Spark cluster in 4 minutes using sbt.

Pishen Tsai

December 05, 2015

Transcript

  1. KKBOX / spark-deployer
     • SBT plugin.
     • Used in production at KKBOX.
     • 100% Scala.
     https://github.com/KKBOX/spark-deployer
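     Since it is an sbt plugin, enabling it is one line in project/plugins.sbt. A minimal sketch (the group ID and version shown are assumptions; copy the exact coordinates from the README at the link above):

         // project/plugins.sbt -- coordinates are illustrative, check the project README
         addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "1.x")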
  2. Solutions
     • spark-ec2: http://spark.apache.org/docs/latest/ec2-scripts.html
     • Amazon EMR (Elastic MapReduce): https://aws.amazon.com/elasticmapreduce/details/spark
     • spark-deployer
  3. spark-ec2 workflow:
     write the code → compile & assembly (sbt) → create cluster (spark-ec2) →
     submit job (scp & ssh) → destroy cluster (spark-ec2)
  4. spark-ec2’s commands
     $ sbt assembly
     $ spark-ec2 -k awskey -i ~/.ssh/awskey.pem -r us-west-2 -z us-west-2a --vpc-id=vpc-a28d24c7 --subnet-id=subnet-4eb27b39 -s 2 -t c4.xlarge -m m4.large --spark-version=1.5.2 --copy-aws-credentials launch my-spark-cluster
     $ scp -i ~/.ssh/awskey.pem target/scala-2.10/my_job-assembly-0.1.jar root@<copy-master-ip-by-yourself>:~/job.jar
     $ ssh -i ~/.ssh/awskey.pem root@<master-ip> './spark/bin/spark-submit --class mypackage.Main --master spark://<master-ip>:7077 --executor-memory 6G job.jar arg0'
     $ spark-ec2 -r us-west-2 destroy my-spark-cluster
  5. spark-ec2 workflow, wrapped in make:
     write the code → compile & assembly (sbt) → create cluster (spark-ec2) →
     submit job (scp & ssh) → destroy cluster (spark-ec2),
     with the tool invocations wrapped as make targets (a sketch follows).
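     A hypothetical Makefile of the kind this slide implies, wrapping the spark-ec2 commands from slide 4 (the cluster name, region, key, and instance settings are illustrative, and some flags are abbreviated):

         # Hypothetical Makefile wrapping the spark-ec2 commands from slide 4.
         # All values here are illustrative assumptions.
         CLUSTER = my-spark-cluster
         REGION  = us-west-2

         assembly:
         	sbt assembly

         create:
         	spark-ec2 -k awskey -i ~/.ssh/awskey.pem -r $(REGION) -s 2 -t c4.xlarge -m m4.large --spark-version=1.5.2 launch $(CLUSTER)

         destroy:
         	spark-ec2 -r $(REGION) destroy $(CLUSTER)

     Every flag from slide 4 ends up duplicated in such a file, which is exactly the maintenance burden the next slide complains about.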
  6. spark-ec2’s bad parts
     • Need to install both sbt and spark-ec2.
     • Need to design and maintain Makefiles.
     • Slow startup time (~20 minutes).
  7. emr’s commands
     $ sbt assembly
     $ aws emr create-cluster --name my-spark-cluster --release-label emr-4.2.0 --instance-type m3.xlarge --instance-count 2 --applications Name=Spark --ec2-attributes KeyName=awskey --use-default-roles
     $ aws emr put --cluster-id j-2AXXXXXXGAPLF --key-pair-file ~/.ssh/mykey.pem --src target/scala-2.10/my_job-assembly-0.1.jar --dest /home/hadoop/job.jar
     $ aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=Spark,Name=my-emr,ActionOnFailure=CONTINUE,Args=[--executor-memory,13G,--class,mypackage.Main,/home/hadoop/job.jar,arg0]
     $ aws emr terminate-clusters --cluster-ids j-2AXX
  8. emr workflow, wrapped in make:
     write the code → compile & assembly (sbt) → create cluster (emr) →
     submit job (emr) → destroy cluster (emr),
     again with the commands wrapped as make targets.
  9. emr’s bad parts
     • Need to install both sbt and the aws CLI for emr.
     • Need to design and maintain Makefiles.
     • The bundled Spark version is old.
     • Restricted machine types.
  10. Since sbt is a powerful build tool in its own right, why
      don’t we let it handle all the dirty work for us?
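      In practice that means the whole lifecycle runs inside the sbt console. A sketch of a session (the command names follow the project’s README as of late 2015 as I recall it; verify against the GitHub page):

          > sparkCreateCluster        (launch the master and workers on EC2)
          > sparkSubmitJob arg0       (assemble, upload, and spark-submit the job)
          > sparkDestroyCluster       (tear the whole cluster down)

      Cluster settings such as region, instance type, and worker count live in a configuration file inside the project (spark-deployer.conf in the versions I’ve seen), so there are no long command lines to wrap in a Makefile.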
  11. spark-deployer’s good parts
      • Need to install only sbt.
      • No Makefile.
      • Easy to use: lets you focus on your code.
      • Fast, parallel startup (~4 minutes).
      • Dynamic scale out.
      • Flexible design.
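      “Dynamic scale out” means workers can be attached to a live cluster from the same sbt console, along the lines of the sketch below (the command name is an assumption based on the project README; check before use):

          > sparkAddWorkers 2         (attach two more workers to the running cluster)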
  12. Prerequisites
      • java
      • sbt (installation guide: http://www.scala-sbt.org/0.13/tutorial/Manual-Installation.html#Unix)
      • AWS credentials exported in the environment:
        export AWS_ACCESS_KEY_ID=...
        export AWS_SECRET_ACCESS_KEY=...
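      Put together, a first session looks roughly like this (the project name is illustrative, and the credential values are your own):

          $ export AWS_ACCESS_KEY_ID=...
          $ export AWS_SECRET_ACCESS_KEY=...
          $ cd my-spark-job            (an sbt project with the plugin enabled)
          $ sbt                        (the spark-deployer commands are then available in this console)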
  13. KKBOX / spark-deployer
      Give it a try, and share!
      • Report issues.
      • Join our gitter channel.
      • Send pull requests.
      https://github.com/KKBOX/spark-deployer