
Hadoop vs. Spark Today


This deck introduces Hadoop and Spark: why we use them, what an RDD is in Spark, how to operate on RDDs, and what makes Spark special.

Erica Li

July 11, 2016


Transcript

  1. Environment Preparation • Google search: QuickStart Download for CDH 5.7 ◦ Platform: VirtualBox ◦ Save to Desktop • Google search: Download VirtualBox ◦ Platform: your PC ◦ Then install it 2
  2. Erica Li • shrimp_li • ericalitw ... • Data Scientist

    • inBOUND CTO & Co-Founder • Girls in Tech/Women Who Code • Taiwan Spark User Group Founder 3
  3. Exploration on the Big Data frontier - Tim Smith https://www.youtube.com/watch?time_continue=2&v=j-0cUmUyb-Y 8
  4. Big data is high volume, high velocity, and/or high variety

    information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. - Doug Laney 2012 9
  5. Big Data Problem, when? • Too many bytes (volume: data size) • Too high a rate (velocity: data speed) • Too many sources (variety: data types) • Not scalable analysis ◦ Human intensive ◦ Computer intensive 11
  6.–7. (image-only slides)

  8. What is Hadoop • A framework that allows for the

    distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org 21
  9. Question So, Hadoop is a framework that allows for the distributed processing of 1) small data? 2) large data? 22
  10. Ans: Large data # It is also capable of processing small data. However, to experience the power of Hadoop, one needs to have data in terabytes. 23
  11. Hadoop 2.X Core Components (diagram): an HDFS cluster (one NameNode, several DataNodes) and YARN (one ResourceManager, a NodeManager on each node) 25
  12. Hadoop 2.X Cluster Architecture (diagram): Master — NameNode + ResourceManager; Slave01–Slave06 — each a DataNode + NodeManager 26
  13. What do we need? • Credit card • Ubuntu Server 14.04

    LTS (HVM) • Java 7 • hadoop-2.7.1.tar.gz 28
  14. 1-Download Java 7 on all machines

    # nn, dn1, dn2
    sudo apt-get install -y python-software-properties
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install openjdk-7-jdk
    # for Java 8 instead:
    # sudo apt-get install oracle-java8-installer
    30
  15. 3-Genkey for all machines

    chmod 644 ~/.ssh/authorized_keys
    # copy .pem from local system to nn
    scp -i ~/Downloads/StarkTech.pem ~/Downloads/StarkTech.pem ubuntu@<nn-public-ip>:~/
    cp ~/StarkTech.pem ~/.ssh/
    chmod 400 ~/.ssh/StarkTech.pem
    # create the public fingerprint on namenode
    ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    # scp .pem to dn1 & dn2
    scp -i ~/.ssh/StarkTech.pem ~/.ssh/StarkTech.pem ubuntu@<dn1-public-ip>:~/.ssh/
    # copy the public fingerprint to each datanode
    cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/StarkTech.pem dn1 'cat >> ~/.ssh/authorized_keys'
    cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/StarkTech.pem dn2 'cat >> ~/.ssh/authorized_keys'
    32
  16. 4-Download it and install them

    # at nn
    cd ~
    wget http://ftp.twaren.net/Unix/Web/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
    scp ~/hadoop-2.7.1.tar.gz dn1:~/
    scp ~/hadoop-2.7.1.tar.gz dn2:~/
    tar -zxvf hadoop-2.7.1.tar.gz
    sudo mv hadoop-2.7.1 /usr/local
    cd /usr/local
    sudo mv hadoop-2.7.1 hadoop
    vim ~/.profile
    33
  17. 5-Modify variables

    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    export HADOOP_PREFIX=/usr/local/hadoop
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    # Set JAVA_HOME (java-8-oracle for Java 8)
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:$HADOOP_PREFIX/bin
    34
  18. 6-Config setting

    cd /usr/local/hadoop/etc
    sudo apt-get install git
    git clone https://github.com/wlsherica/StarkTechnology.git
    mv StarkTechnology/hadoop-config/ .
    rm -rf StarkTechnology
    mv hadoop hadoop_ori
    mv hadoop-config hadoop
    35
  19. 7-Make sure all configs are set

    # add JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    vim hadoop/hadoop-env.sh
    # dfs.replication=3
    vim hadoop/hdfs-site.xml
    # check fs.default.name
    vim hadoop/core-site.xml
    36
  20. 9-scp them all

    scp -r /usr/local/hadoop dn2:/usr/local
    scp -r /usr/local/hadoop dn1:/usr/local
    scp ~/.profile dn1:~/
    scp ~/.profile dn2:~/
    hadoop namenode -format
    38
  21. 10-Start it!

    $HADOOP_HOME/sbin/start-dfs.sh
    jps
    ssh dn1 'jps'
    ssh dn2 'jps'
    # if you want to stop them
    $HADOOP_HOME/sbin/stop-dfs.sh
    39
  22. Hadoop: growing maturity, and its bottlenecks • Not fit for small data • Potential stability issues ◦ open source platform ◦ latest stable version • General limitations ◦ Hadoop is not the only answer ◦ data collection, aggregation and integration 41
  23. (image-only slide)

  24. • Hadoop ◦ A full-stack MPP system combining big data storage (HDFS) with a parallel execution model (MapReduce) • Spark ◦ An open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers 44
  25. MapReduce (word count example): input "Deer Bear River / Car Car River / Deer Car Bear" → Splitting into lines → Mapping each word to (word, 1) → Shuffling groups the pairs by word → Reducing sums them: Bear 2, Car 3, Deer 2, River 2 http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/ 47
  26. Spark vs. Hadoop MapReduce: run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. 2014 Sort Benchmark Competition: Hadoop MapReduce — 2100 machines, 72 min; Spark — 207 machines, 23 min. http://spark.apache.org/ 48
  27. Spark vs. Hadoop MapReduce (diagram): MapReduce reads from and writes to HDFS between every step, while Spark passes intermediate results through memory (iterator to iterator) 50
  28. WordCount - MapReduce

    # mapper.py
    import sys
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print '%s\t%s' % (word, 1)

    # reducer.py
    from operator import itemgetter
    import sys
    current_word = None
    current_count = 0
    word = None
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        count = int(count)  # counts arrive as strings on stdin
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print '%s\t%s' % (current_word, current_count)
            current_count = count
            current_word = word
    if current_word == word:
        print '%s\t%s' % (current_word, current_count)
    51
  29. WordCount - Spark

    import sys
    lines = sc.textFile('wordcount.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    output = counts.map(lambda x: (x[1], x[0])).collect()
    52
  30. Apache Spark™ is a powerful open source processing engine built

    around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
  31. A Brief History (2004–2016): MapReduce paper by Google → Hadoop @ Yahoo! → Hadoop Summit → Spark paper → Apache Spark wins the sort benchmark → Spark 2.0 → What's next? 55
  32. (image-only slide)

  33. Spark 2.0 • Easier, faster, smarter ◦ Easier SQL API ◦ All 99 TPC-DS queries • DataFrames & Datasets in Scala/Java • ML pipeline persistence • Algorithms in R ◦ GLM, K-Means, survival analysis, Naive Bayes, etc. 60
  34. (image-only slide)

  35. Training Materials • Cloudera VM ◦ Cloudera CDH 5.7 •

    Spark 1.6.0 • 64-bit host OS • RAM 4G • VMware, KVM, and VirtualBox 63
  36. Cloudera Quick Start VM • QuickStart Downloads for CDH 5.7

    • CDH5 and Cloudera Manager 5 • Account ◦ username: cloudera ◦ passwd: cloudera • The root account password is cloudera 64
  37. Spark Installation

    • Downloading
      wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz
    • Tar it then mv it
      tar zxvf spark-1.6.2-bin-hadoop2.6.tgz
      cd spark-1.6.2-bin-hadoop2.6
    • Requirements: Scala 2.10.x, Java 7+, Python 2.6+, R 3.1+
  38.–42. (image-only slides)

  43. Where is Spark? • $ whereis spark • $ ls

    /usr/lib/spark • $ /usr/lib/spark/bin/pyspark • > exit() 72
  44. Spark Shell

    • Scala
      ./bin/spark-shell --master local[4]
      ./bin/spark-shell --master local[4] --jars urcode.jar
    • Python
      ./bin/pyspark --master local[4]
      PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
    74
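    Once a shell is up, the SparkContext is already bound to sc. A minimal sanity check (a sketch, assuming a local PySpark session):

      rdd = sc.parallelize(range(10))   # small in-memory RDD
      print(rdd.count())                # 10
      print(rdd.sum())                  # 45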
  45. Initializing SparkContext

    • Scala
      val conf = new SparkConf().setAppName(appName).setMaster(master)
      val sc = new SparkContext(conf)
    • Python
      from pyspark import SparkConf, SparkContext
      conf = SparkConf().setAppName(appName).setMaster(master)
      sc = SparkContext(conf=conf)
    # appName: a name for your application, shown on the cluster UI
    75
  46. What's an RDD? “The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.” (Diagram: Partition1, Partition2, ...) 80
  47. RDD operations

    • Transformations: map(func), flatMap(func), filter(func), groupByKey(), reduceByKey(), union(other), sortByKey(), ...
    • Actions: reduce(func), collect(), first(), take(n), saveAsTextFile(path), countByKey(), ...
    81
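    Transformations only build up a lineage; nothing runs until an action is called. A minimal PySpark sketch (assuming an existing SparkContext sc):

      nums = sc.parallelize([3, 1, 2, 4])           # create an RDD
      doubled = nums.map(lambda x: x * 2)           # transformation: lazy, nothing runs yet
      evens = doubled.filter(lambda x: x % 4 == 0)  # another lazy transformation
      print(evens.collect())                        # action: triggers the job -> [4, 8]
      print(doubled.reduce(lambda a, b: a + b))     # action: 6 + 2 + 4 + 8 = 20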
  48. How to Create RDD

    • Scala
      val rddStr: RDD[String] = sc.textFile("hdfs://..")
      val rddInt: RDD[Int] = sc.parallelize(1 to 100)
    • Python
      data = [1, 2, 3, 4, 5]
      distData = sc.parallelize(data)
    82
  49. Key-Value RDD

    lines = sc.textFile("data.txt")
    pairs = lines.map(lambda s: (s, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)
    • mapValues • groupByKey • reduceByKey
    83
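    A small sketch of the three pair-RDD operations listed above (assuming an existing sc; output order may vary):

      pairs = sc.parallelize([("apple", 2), ("banana", 1), ("apple", 3)])
      print(pairs.mapValues(lambda v: v * 10).collect())      # values transformed, keys untouched
      print(pairs.reduceByKey(lambda a, b: a + b).collect())  # ('apple', 5), ('banana', 1)
      print(pairs.groupByKey().mapValues(list).collect())     # ('apple', [2, 3]), ('banana', [1])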
  50. (Diagram: an RDD lineage graph with groupBy, map, union, and join operations on RDDs A–G, split by the scheduler into Stage 1, Stage 2, and Stage 3) 87
  51. Fault Tolerance

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda line: "ERROR" in line)
    # Count all the errors
    errors.cache()
    errors.count()
    # Count errors mentioning MySQL
    errors.filter(lambda line: "MySQL" in line).count()
    errors.filter(lambda line: "HDFS" in line) \
          .map(lambda x: x.split("\t")) \
          .collect()
    (Lineage diagram: HDFS file -> lines -> filter("ERROR") -> errors -> filter("HDFS"))
    88
  52. Cache

    • RDD persistence
    • Caching is a key tool for iterative algorithms and fast interactive use
    • Usage
      yourRDD.cache()
      yourRDD.persist().is_cached
      yourRDD.unpersist()
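    A short sketch of persisting a derived RDD so later actions reuse it (assumes an existing sc and a hypothetical log file path):

      from pyspark import StorageLevel

      logs = sc.textFile("hdfs://.../access.log")    # hypothetical input path
      errors = logs.filter(lambda line: "ERROR" in line)
      errors.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if it won't fit in memory
      print(errors.count())    # first action materializes and caches the filtered RDD
      print(errors.take(3))    # served from the cache instead of re-reading HDFS
      errors.unpersist()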
  53. Shared Variables

    • Broadcast variables
      broadcastVar = sc.broadcast([100, 200, 300])
      broadcastVar.value
    • Accumulators
      accum = sc.accumulator(0)
      sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
      accum.value
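    A small usage sketch: tasks read a broadcast value and write to an accumulator that the driver reads back (hypothetical lookup table, assuming an existing sc):

      lookup = sc.broadcast({"ERROR": 3, "WARN": 2, "INFO": 1})      # shipped to each executor once
      levels = sc.parallelize(["INFO", "ERROR", "INFO", "WARN"])
      print(levels.map(lambda lvl: lookup.value.get(lvl, 0)).sum())  # 1 + 3 + 1 + 2 = 7

      bad = sc.accumulator(0)
      sc.parallelize([1, -2, 3, -4]).foreach(lambda x: bad.add(1) if x < 0 else None)
      print(bad.value)  # 2 negative values seen by the tasks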
  54. # Word Count

    import sys
    lines = sc.textFile('wordcount.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    output = counts.map(lambda x: (x[1], x[0])) \
                   .sortByKey(False)
    output.take(5)
    91
  55.

    # raw data
    wordsList = ["apple", "banana", "strawberry", "lemon", "apple", "banana", "apple", "apple",
                 "apple", "apple", "apple", "lemon", "lemon", "lemon", "banana", "banana", "banana",
                 "banana", "banana", "banana", "apple", "apple", "apple", "apple"]
    wordsRDD = sc.parallelize(wordsList, 4)
    # Print out the type of wordsRDD
    print type(wordsRDD)

    def makePlural(word):
        return word + 's'
    print makePlural('banana')

    # Pass the Python function makePlural defined above into map() so every fruit word gets an 's' appended
    pluralRDD = wordsRDD.<FILL IN>
    print pluralRDD.collect()

    # Use an anonymous lambda instead of the Python function above to append 's' to each fruit word
    pluralLambdaRDD = wordsRDD.<FILL IN>
    print pluralLambdaRDD.collect()
    93
  56.

    # Use map() to compute the length of each fruit word, with Python's built-in len()
    pluralLengths = (pluralRDD.<FILL IN>)
    print pluralLengths

    # Hint: We can create the pair RDD using the map() transformation with a lambda() function to create a new RDD.
    # Build a pair RDD where every incoming fruit word is paired with the number 1
    wordPairs = wordsRDD.<FILL IN>
    print wordPairs.collect()

    # Use mapValues() to count how many times each fruit occurs
    wordCountsGrouped = wordsGrouped.<FILL IN>
    print wordCountsGrouped.collect()

    # Use reduceByKey() on wordPairs to sum up and get the count for each fruit
    wordCounts = wordPairs.<FILL IN>
    print wordCounts.collect()

    # Express the map / reduceByKey pipeline above in a single statement, then collect() and print it to get the final word-count answer
    wordCountsCollected = (wordsRDD.<FILL IN>)
    print wordCountsCollected
    94
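    One possible way to fill in the blanks above (a sketch, not the official lab solution; wordsGrouped is assumed to come from a groupByKey() step that the slide omits):

      pluralRDD = wordsRDD.map(makePlural)
      pluralLambdaRDD = wordsRDD.map(lambda word: word + 's')
      pluralLengths = (pluralRDD.map(len).collect())
      wordPairs = wordsRDD.map(lambda word: (word, 1))
      wordsGrouped = wordPairs.groupByKey()                     # assumed intermediate step
      wordCountsGrouped = wordsGrouped.mapValues(lambda ones: sum(ones))
      wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)
      wordCountsCollected = (wordsRDD.map(lambda w: (w, 1))
                                     .reduceByKey(lambda a, b: a + b)
                                     .collect())
      print wordCountsCollected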
  57. # Avoid GroupByKey (diagram, keys A and B): with reduceByKey, pairs with the same key are first combined locally on each partition — e.g. (A,1)+(A,1) → (A,2) — so only the partial sums cross the network before the final totals are produced; with groupByKey, every individual (A,1) / (B,1) pair is shuffled to the reducers before any counting happens 96
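    A minimal sketch of the two approaches (assuming an existing sc; both give the same counts, groupByKey just shuffles far more data):

      pairs = sc.parallelize(["A", "B", "A", "A", "B"]).map(lambda w: (w, 1))

      # Preferred: partial sums are computed on each partition before the shuffle
      counts = pairs.reduceByKey(lambda a, b: a + b)

      # Works, but every single (word, 1) pair crosses the network first
      counts_slow = pairs.groupByKey().mapValues(lambda ones: sum(ones))

      print(counts.collect())       # [('A', 3), ('B', 2)] (order may vary)
      print(counts_slow.collect())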
  58. # Don't copy all elements to the driver

    • Scala (the anti-pattern)
      val values = myLargeDataRDD.collect()
    • Prefer instead: take(), sample(), countByValue(), countByKey(), collectAsMap(), saving to a file, or filtering/sampling first
    97
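    A PySpark sketch of the safer alternatives (assuming an existing sc):

      big = sc.parallelize(range(100000))

      # Instead of big.collect(), which pulls every element back to the driver:
      print(big.take(5))                                     # just a handful of elements
      print(big.sample(False, 0.001).take(5))                # peek at a small random sample
      print(big.map(lambda x: x % 3).countByValue())         # small summary computed on the cluster
      print(big.filter(lambda x: x % 99999 == 0).collect())  # filter first, then collect the little that is left
      # big.saveAsTextFile("hdfs://.../big_out")             # or write the result out to storage instead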
  59. # Bad input data

    • Python
      input_rdd = sc.parallelize(["{\"value\": 1}",  # Good
                                  "bad_json",        # Bad
                                  "{\"value\": 2}",  # Good
                                  "{\"value\": 3"    # Missing brace
                                  ])
      sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
    98
  60. # Number of data partitions?

    • Spark application UI
    • Inspect it programmatically
      yourRDD.partitions.size      # Scala
      yourRDD.getNumPartitions()   # Python
    99
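    A quick sketch of inspecting and changing the partition count in PySpark (assuming an existing sc):

      rdd = sc.parallelize(range(1000), 8)
      print(rdd.getNumPartitions())    # 8

      fewer = rdd.coalesce(2)          # shrink without a full shuffle
      more = rdd.repartition(16)       # full shuffle to rebalance
      print(fewer.getNumPartitions())  # 2
      print(more.getNumPartitions())   # 16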
  61. Components

    • Data storage ◦ HDFS, Hadoop-compatible
    • API ◦ Scala, Python, Java, R
    • Management framework
      ◦ Standalone
      ◦ Mesos -> $ ./bin/spark-shell --master mesos://host:5050
      ◦ YARN -> $ ./bin/spark-shell --master yarn --deploy-mode client
      ◦ AWS EC2
    (Diagram: distributed computing API — Scala, Python, Java — on top of storage: HDFS, etc.)
  62. Cluster (diagram): the Driver Program (SparkContext) connects to a Cluster Manager / Master (Standalone, Yarn, or Mesos), which allocates Worker Nodes
  63. Cluster processing (diagram): the Driver Program (SparkContext) talks to the Cluster Manager (Master); each Worker Node runs an Executor with a cache and tasks, and receives the application code (*.jar / *.py) http://spark.apache.org/
  64. Client Mode (default) (diagram): the application is submitted from the client, and the driver (SparkContext) runs on the client machine and talks to the Master; Executors with caches run the tasks on the Worker Nodes and receive the application code (*.jar / *.py) http://spark.apache.org/
  65. Cluster Mode (diagram): the client submits the application to the Master, and the driver (SparkContext) runs on one of the Worker Nodes inside the cluster; Executors with caches run the tasks on the other Worker Nodes http://spark.apache.org/