
4 Useful Tips for Running Spark!


Lightning-talk (LT) slides from the 2nd "Learning Spark" reading group, held on 2015/03/29.

Takumi Yoshida

March 29, 2015



Transcript

  1. Hello! I'm Yoshida.
     • Takumi Yoshida (@yoshi0309)
     • I work as an engineer at a systems integrator (SIer).
     • Apache Solr / Elasticsearch / AWS / Apache Spark / Hadoop / machine learning …
     • About half a year of Spark experience?
  2. 4 Useful Tips
     1. Spark on Amazon EMR!
     2. The order of spark-submit options is strict
     3. RDDs cannot be nested
     4. The lookup method is slow!
     Note: this deck assumes Spark 1.2.1.
  3. 1. Spark on Amazon EMR!
     • Once you have confirmed that your job runs locally, you want to run it on a cluster and feel the speed, right?
     • But you do not have a pile of servers at hand, and no time to set that many up either.
     • You want to run on Hadoop (YARN).
  4. Install the AWS CLI and run this. That's it!
     aws emr create-cluster --region ap-northeast-1 --name SparkCluster \
       --ami-version 3.3.1 --no-auto-terminate --service-role EMR_DefaultRole \
       --instance-groups \
         InstanceCount=1,BidPrice=0.03,Name=sparkmaster,InstanceGroupType=MASTER,InstanceType=m1.large \
         InstanceCount=5,BidPrice=0.03,Name=sparkworker,InstanceGroupType=CORE,InstanceType=m1.large \
       --ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,KeyName=jpkeypair \
       --applications Name=HIVE Name=Pig Name=Hue Name=Ganglia \
       --bootstrap-actions \
         Path=s3://support.elasticmapreduce/spark/install-spark,Args=[-v,1.2.0.a,-g] \
         Path=s3://hogefoobar-takumiyoshida/bootstrap/bootstrap-urge.sh
     * In the original slide this is a single line.
  5. Automate installing your own app with a bootstrap script!
     #!/bin/bash -ex
     # run only on MASTER node.
     if grep -Fq "\"isMaster\": true" /mnt/var/lib/info/instance.json ; then
       # open the ssh port...
       perl -pi -e 's/^#?Port 22$/Port 22\nPort 443/' /etc/ssh/sshd_config
       /etc/init.d/sshd restart
       # install scala and sbt.
       # scala is in the standard bootstrap action of EMR, so you don't need to install it yourself.
       wget https://dl.bintray.com/sbt/rpm/sbt-0.13.7.rpm
       sudo rpm -ivh sbt-0.13.7.rpm
     (continued on the next slide)
  6. (continuation of the previous slide)
       # install solr and start.
       wget http://ftp.meisei-u.ac.jp/mirror/apache/dist/lucene/solr/4.10.3/solr-4.10.3.zip
       unzip solr-4.10.3.zip
       mv solr-4.10.3 /home/hadoop/solr
       cp -r /home/hadoop/solr/example/solr/collection1 /home/hadoop/solr/example/solr/collection2
       echo "name=collection2" > /home/hadoop/solr/example/solr/collection2/core.properties
       /home/hadoop/solr/bin/solr start
       # download source.
       sudo yum install git -y
       git clone https://bitbucket.org/yoshi0309/xxxxx.git
       cd urge-recommend-u2i
       git checkout develop
       # build and package.
       sbt package
     fi
  7. A bad example
     bin/spark-submit --master yarn-cluster --jars (omitted) --class Recommend \
       MyRecommend.jar s3n://abc-takumiyoshida/datasets/ s3n://abc-takumiyoshida/rates.txt \
       --driver-memory 2g --num-executors 4 --executor-memory 4g --executor-cores 2
     Everything after the application jar is passed to the application as arguments, so here --driver-memory, --num-executors, --executor-memory and --executor-cores never reach spark-submit.
  8. The correct form
     bin/spark-submit \
       --class <main-class> --master <master-url> \
       --deploy-mode <deploy-mode> \
       --conf <key>=<value> \
       ... # other options
       <application-jar> \
       [application-arguments]
  9. 3. RDDs cannot be nested
     • You cannot loop over one RDD with for or map and reference another RDD inside that loop. You get a NullPointerException.
     • The cause cannot be traced from the exception.
     • If you do not know this, it can easily cost you a whole day. (True story.)
  10. A simple example
     scala> val r1 = sc.textFile("restaurants.csv")
     r1: org.apache.spark.rdd.RDD[String] = restaurants.csv MappedRDD[3] at textFile at <console>:12
     scala> val r2 = sc.textFile("ratings.csv")
     r2: org.apache.spark.rdd.RDD[String] = ratings.csv MappedRDD[5] at textFile at <console>:12
     scala> for (r <- r1) {
          |   r2 take(5) foreach(println)
          | }
     15/02/05 08:52:16 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
     java.lang.NullPointerException
         at org.apache.spark.rdd.RDD.firstParent(RDD.scala:1239)
         at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
     (rest omitted)
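
     The deck stops at the error, so here is a minimal sketch (not from the original slides) of two common ways to avoid referencing one RDD inside another RDD's closure; the comma-separated join key is an assumption about the file format:

       // 1) Materialize the small side on the driver, then use the plain local collection.
       val r1 = sc.textFile("restaurants.csv")
       val r2 = sc.textFile("ratings.csv")
       val firstRatings = r2.take(5)                      // Array[String] on the driver
       r1.foreach { r => firstRatings.foreach(println) }  // the closure no longer captures an RDD

       // 2) If you need to combine records from both RDDs, key them and join
       //    (in a compiled Spark 1.2 app, also import org.apache.spark.SparkContext._).
       val restaurantsById = r1.map(line => (line.split(",")(0), line))
       val ratingsById     = r2.map(line => (line.split(",")(0), line))
       val joined = restaurantsById.join(ratingsById)     // RDD[(String, (String, String))]
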
  11. 4. The lookup method is slow!
     • PairRDDFunctions are handy: by assuming key-value data, they give you many convenient methods that a plain RDD does not have.
     • But the lookup method among them is extreeeemely slow, so it needs care.
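
     The deck does not show code for this tip, so here is a sketch (not from the original slides): each lookup call runs its own job over the RDD, so looking up many keys in a loop is expensive. The file format and key names below are assumptions.

       // Slow: one Spark job per looked-up key.
       val ratings = sc.textFile("ratings.csv")
         .map { line => val f = line.split(","); (f(0), f(1)) }   // (restaurantId, rating)
       val wanted = Seq("r001", "r002", "r003")                   // hypothetical keys
       val slow = wanted.map(k => k -> ratings.lookup(k))         // N jobs for N keys

       // Faster when the table is small: collect it once into a driver-side Map.
       val ratingMap = ratings.collectAsMap()
       val fast = wanted.map(k => k -> ratingMap.get(k))

       // Faster for many keys over large data: make the keys an RDD and join.
       val wantedRdd = sc.parallelize(wanted).map(k => (k, ()))
       val joined = wantedRdd.join(ratings)                       // RDD[(String, ((), String))]
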
  12. Handy PairRDDFunctions
     • join / leftOuterJoin / rightOuterJoin - join two RDDs.
     • countByKey - count the number of records per key.
     • reduceByKey - reduce the values per key.
     • groupByKey - group the values by key (the values are collected into a Seq).
     • lookup - retrieve the values for a given key.
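
     A tiny spark-shell illustration of the methods listed above, with made-up data (not from the original slides); result ordering may vary:

       val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
       val names = sc.parallelize(Seq(("a", "Alice"), ("b", "Bob")))

       pairs.join(names).collect()        // e.g. Array((a,(1,Alice)), (a,(2,Alice)), (b,(3,Bob)))
       pairs.countByKey()                 // Map(a -> 2, b -> 1)
       pairs.reduceByKey(_ + _).collect() // e.g. Array((a,3), (b,3))
       pairs.groupByKey().collect()       // e.g. Array((a,CompactBuffer(1, 2)), (b,CompactBuffer(3)))
       pairs.lookup("a")                  // Seq(1, 2)
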
  13. Summary
     1. Spark on Amazon EMR!
     2. The order of spark-submit options is strict
     3. RDDs cannot be nested
     4. The lookup method is slow!