
Hadoop vs. Spark Today


This deck introduces Hadoop and Spark: why we use them, what an RDD is in Spark, how to operate on RDDs, and what makes Spark special.


Erica Li

July 11, 2016

Transcript

  1. Hadoop vs. Spark Erica Li 1

  2. Environment Preparation • Google search: QuickStart Download for CDH 5.7 ◦

    Platform: VirtualBox ◦ Save it to the Desktop • Google search: Download VirtualBox ◦ Platform: your PC ◦ Then install it 2
  3. Erica Li • shrimp_li • ericalitw ... • Data Scientist

    • inBOUND CTO & Co-Founder • Girls in Tech/Women Who Code • Taiwan Spark User Group Founder 3
  4. SAS, R, SPSS SAS, Java, Python Python Python, SAS... 4

  5. Taiwan Spark User Group 5

  6. Course Outline • The era of big data • The age of Hadoop • The rise of Spark • Hadoop vs. Spark, the showdown 6

  7. The Era of Big Data 7

  8. 8 Exploration on the Big Data frontier - Tim Smith

    https://www.youtube.com/watch?time_continue=2&v=j-0cUmUyb-Y
  9. Big data is high volume, high velocity, and/or high variety

    information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. - Doug Laney 2012 9
  10. http://www.inside.com.tw/2015/02/06/big-data-1-origin-and-4vs 10

  11. Big Data Problem, when? • Too many bytes (volume)

    • Too high a rate (velocity) • Too many sources (variety) • Analysis that does not scale ◦ Human intensive ◦ Computer intensive 11
  12. Sources of Big Data 12 Learning Transportation Retail Health Government

    Entertainment Finance Social
  13. None
  14. The Age of Hadoop 14

  15. 15

  16. 16

  17. Why Use Hadoop 17

  18. The Hadoop Platform • A high-throughput data processing platform • Software that lets users easily write programs to process massive data • You only need to define what map

    and reduce should do [Diagram: unstructured data stored in HDFS flows through Map tasks and a Reduce step into structured output] 18
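
To make the "define map and reduce" idea concrete, here is a minimal pure-Python sketch of the two steps (illustrative only; this is not the Hadoop API, and the sample records are made up):

    # Conceptual map/reduce in plain Python (NOT the Hadoop API).
    # "records" stands in for unstructured text stored in HDFS.
    from functools import reduce

    records = ["hadoop stores files", "map and reduce do the work"]

    # Map step: turn each record into (key, value) pairs.
    mapped = [(word, 1) for line in records for word in line.split()]

    # Reduce step: aggregate all values that share a key.
    def add_pair(acc, pair):
        word, count = pair
        acc[word] = acc.get(word, 0) + count
        return acc

    word_counts = reduce(add_pair, mapped, {})
    print(word_counts)   # e.g. {'hadoop': 1, 'map': 1, 'reduce': 1, ...}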
  19. The Hadoop Ecosystem 19

  20. Mike Gualtieri 20 What is Hadoop? https://www.youtube.com/watch?v=4DgTLaFNQq0

  21. What is Hadoop • A framework that allows for the

    distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org 21
  22. Question: So, is Hadoop a framework that allows for the distributed

    processing of 1) small data? 2) large data? 22
  23. Answer: Large data. It is also capable of processing

    small data, but to experience the power of Hadoop you need data on the terabyte scale. 23
  24. http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview 24

  25. Hadoop 2.X Core Components [Diagram: the HDFS cluster (one NameNode plus several DataNodes) alongside YARN (one Resource Manager plus a Node Manager on each node)] 25
  26. Hadoop 2.X Cluster Architecture Master ----- NameNode ----- Resource Manager

    Slave01 -- DataNode -- Node Manager Slave02 -- DataNode -- Node Manager Slave03 -- DataNode -- Node Manager Slave04 -- DataNode -- Node Manager Slave05 -- DataNode -- Node Manager Slave06 -- DataNode -- Node Manager 26
  27. Hadoop Setup (AWS) 27

  28. What do we need? • Credit card • Ubuntu Server 14.04

    LTS (HVM) • Java 7 • hadoop-2.7.1.tar.gz 28
  29. Why not Java8? https://cwiki.apache.org/confluence/display/Hive/GettingStarted 29

  30. 1-Download Java 7 on all machines

    # nn, dn1, dn2
    sudo apt-get install -y python-software-properties
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install openjdk-7-jdk
    # this is for java8
    #sudo apt-get install oracle-java8-installer
    30
  31. 2-Modify /etc/hosts

    # Then add every machine's IP and hostname to it.
    sudo vim /etc/hosts
    31
  32. 3-Genkey for all machines

    chmod 644 ~/.ssh/authorized_keys
    # copy .pem from local system to nn
    scp -i ~/Downloads/StarkTech.pem ~/Downloads/StarkTech.pem ubuntu@<nn-public-ip>:~/
    cp ~/StarkTech.pem ~/.ssh/
    chmod 400 ~/.ssh/StarkTech.pem
    # create the public fingerprint on namenode
    ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    # scp .pem to dn1 & dn2
    scp -i ~/.ssh/StarkTech.pem ~/.ssh/StarkTech.pem ubuntu@<dn1-public-ip>:~/.ssh/
    # copy the public fingerprint to each datanode
    cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/StarkTech.pem dn1 'cat >> ~/.ssh/authorized_keys'
    cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/StarkTech.pem dn2 'cat >> ~/.ssh/authorized_keys'
    32
  33. 4-Download it and install them

    # at nn
    cd ~
    wget http://ftp.twaren.net/Unix/Web/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
    scp ~/hadoop-2.7.1.tar.gz dn1:~/
    scp ~/hadoop-2.7.1.tar.gz dn2:~/
    tar -zxvf hadoop-2.7.1.tar.gz
    sudo mv hadoop-2.7.1 /usr/local
    cd /usr/local
    sudo mv hadoop-2.7.1 hadoop
    vim ~/.profile
    33
  34. 5-Modify variables

    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    export HADOOP_PREFIX=/usr/local/hadoop
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    # Set JAVA_HOME (java-8-oracle for java8)
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    # Add Hadoop bin/ directory to path
    export PATH=$PATH:$HADOOP_PREFIX/bin
    34
  35. 6-Config setting

    cd /usr/local/hadoop/etc
    sudo apt-get install git
    git clone https://github.com/wlsherica/StarkTechnology.git
    mv StarkTechnology/hadoop-config/ .
    rm -rf StarkTechnology
    mv hadoop hadoop_ori
    mv hadoop-config hadoop
    35
  36. 7-Make sure all configs are in place

    # add JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    vim hadoop/hadoop-env.sh
    # dfs.replication=3
    vim hadoop/hdfs-site.xml
    # check fs.default.name
    vim hadoop/core-site.xml
    36
  37. 8-Check slaves

    vim hadoop/slaves   # add dn1 and dn2, remove localhost
    37
  38. 9-scp them all

    scp -r /usr/local/hadoop dn1:/usr/local
    scp -r /usr/local/hadoop dn2:/usr/local
    scp ~/.profile dn1:~/
    scp ~/.profile dn2:~/
    hadoop namenode -format
    38
  39. 10-Start it!

    $HADOOP_HOME/sbin/start-dfs.sh
    jps
    ssh dn1 'jps'
    ssh dn2 'jps'
    # if you want to stop them
    $HADOOP_HOME/sbin/stop-dfs.sh
    39
  40. Hadoop's Bottlenecks 40

  41. Hadoop's Growing Maturity and Its Bottlenecks • Not fit for small data • Potential stability

    issues ◦ open source platform ◦ latest stable version • General limitations ◦ Hadoop is not the only answer ◦ data collection, aggregation and integration 41
  42. From Hadoop to Spark 42

  43. 43

  44. • Hadoop ◦ A full-stack MPP system with both

    big data storage (HDFS) and a parallel execution model (MapReduce) • Spark ◦ An open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers 44
  45. http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview 45

  46. HDFS [Diagram: a file is split into blocks; the NameNode tracks which blocks live on which DataNodes across the HDFS cluster] 46
  47. MapReduce [Diagram: word count on "Deer Bear River / Car Car River / Deer Car Bear" — Splitting gives one line per mapper, Mapping emits (word, 1) pairs, Shuffling groups the pairs by word, and Reducing sums them to Bear 2, Car 3, Deer 2, River 2] http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/ 47
  48. Spark vs. Hadoop MapReduce "Run programs up to 100x faster

    than Hadoop MapReduce in memory, or 10x faster on disk." - http://spark.apache.org/ 2014 Sort Benchmark Competition: Hadoop MapReduce used 2,100 machines and took 72 minutes; Spark used 207 machines and took 23 minutes. 48
  49. Hadoop vs. Spark? 49

  50. Spark vs. Hadoop MapReduce [Diagram: MapReduce reads its input from HDFS and writes results back to HDFS on every iteration, while Spark keeps the intermediate data in memory between iterations] 50
  51. WordCount - MapReduce

    # mapper.py
    import sys
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print '%s\t%s' % (word, 1)

    # reducer.py
    from operator import itemgetter
    import sys
    current_word = None
    current_count = 0
    word = None
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        count = int(count)   # the count arrives as a string
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print '%s\t%s' % (current_word, current_count)
            current_count = count
            current_word = word
    if current_word == word:
        print '%s\t%s' % (current_word, current_count)
    51
  52. WordCount - Spark

    import sys
    lines = sc.textFile('wordcount.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    output = counts.map(lambda x: (x[1], x[0])).collect()
    52
  53. Apache Spark™ is a powerful open source processing engine built

    around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
  54. The History of Spark 54

  55. A Brief History [Timeline 2004-2016: Google's MapReduce paper, Hadoop at Yahoo!, the Hadoop Summit, the Spark paper, Apache Spark's win, Spark 2.0, and "What's Next?"] 55
  56. The Spark Ecosystem 56

  57. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

  58. 58

  59. The Future and Outlook of Spark 59

  60. Spark 2.0 • Easier, faster, smarter ◦ Easier SQL API

    ◦ All 99 TPC-DS queries • DataFrames & Datasets in Scala/Java • ML pipeline persistence • Algorithms in R ◦ GLM, K-Means, survival analysis, NB, etc. 60
  61. 61

  62. Spark Installation 62

  63. Training Materials • Cloudera VM ◦ Cloudera CDH 5.7 •

    Spark 1.6.0 • 64-bit host OS • RAM 4G • VMware, KVM, and VirtualBox 63
  64. Cloudera Quick Start VM • QuickStart Downloads for CDH 5.7

    • CDH5 and Cloudera Manager 5 • Account ◦ username: cloudera ◦ passwd: cloudera • The root account password is cloudera 64
  65. Spark Installation • Downloading wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz • Tar it

    then mv it tar zxvf spark-1.6.2-bin-hadoop2.6.tgz cd spark-1.6.2-bin-hadoop2.6 (Requirements: Scala 2.10.x, Java 7+, Python 2.6+, R 3.1+)
  66. Let’s do it 66

  67. 67

  68. 68

  69. 69

  70. 70

  71. 71

  72. Where is Spark? • $ whereis spark • $ ls

    /usr/lib/spark • $ /usr/lib/spark/bin/pyspark • > exit() 72
  73. • Python

    textF = sc.textFile("/usr/lib/spark/LICENSE")
    textF = sc.textFile("file:///usr/lib/spark/LICENSE")
    textF.count()
    textF.take(2)
    spk_line = textF.filter(lambda line: "Spark" in line)
    spk_line.count()
    73
  74. Spark Shell • Scala ./bin/spark-shell --master local[4] ./bin/spark-shell --master local[4]

    --jars urcode.jar • Python ./bin/pyspark --master local[4] PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark 74
  75. Initializing SparkContext • Scala val conf = new SparkConf().setAppName(appName) .setMaster(master)

    val sc = new SparkContext(conf) • Python conf = SparkConf().setAppName(appName).setMaster(master) sc = SparkContext(conf=conf) A name of your application to show on cluster UI 75
  76. master URLs • local, local[K], local[*] • spark://HOST:PORT • mesos://HOST:PORT

    • yarn-client • yarn-cluster 76
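
As a hedged sketch of how those master URLs are used from PySpark 1.x (the application name and the local[4] value are placeholders; swap in the spark://, mesos://, or YARN values from the slide as appropriate):

    from pyspark import SparkConf, SparkContext

    # local[4] = run locally with 4 worker threads; replace with
    # "spark://HOST:PORT", "mesos://HOST:PORT", or "yarn-client" for a cluster.
    conf = SparkConf().setAppName("masterUrlDemo").setMaster("local[4]")
    sc = SparkContext(conf=conf)

    print(sc.master)   # local[4]
    sc.stop()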
  77. Spark Standalone Mode • Launch standalone cluster ◦ master ◦

    slaves ◦ public key 77
  78. Which one is better? 78

  79. How Spark RDDs Work 79

  80. What's RDD "The main abstraction Spark provides is a resilient distributed dataset (RDD), which is

    a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures." [Diagram: Partition1, Partition2, ...] 80
  81. RDD operations • Transformations map(func) flatMap(func) filter(func) groupByKey() reduceByKey() union(other)

    sortByKey() ... • Actions reduce(func) collect() first() take(n) saveAsTextFile(path) countByKey() ... 81
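
A small illustrative example of the difference (assuming an existing SparkContext `sc`, e.g. inside pyspark): transformations only describe a new RDD and are evaluated lazily, while actions trigger the computation and return results to the driver.

    nums = sc.parallelize([1, 2, 3, 4, 5])

    evens   = nums.filter(lambda x: x % 2 == 0)   # transformation (lazy)
    doubled = evens.map(lambda x: x * 2)          # transformation (lazy)

    print(doubled.collect())                 # action -> [4, 8]
    print(nums.reduce(lambda a, b: a + b))   # action -> 15
    print(nums.take(2))                      # action -> [1, 2]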
  82. How to Create RDD • Scala val rddStr:RDD[String] = sc.textFile("hdfs://..")

    val rddInt:RDD[Int] = sc.parallelize(1 to 100) • Python data = [1, 2, 3, 4, 5] distData = sc.parallelize(data) 82
  83. Key-Value RDD lines = sc.textFile("data.txt") pairs = lines.map(lambda s: (s,

    1)) counts = pairs.reduceByKey(lambda a, b: a + b) • mapValues • groupByKey • reduceByKey 83
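
A short sketch of the three pair-RDD operations named on the slide, with made-up data and an existing SparkContext `sc` assumed:

    pairs = sc.parallelize([("apple", 1), ("banana", 2), ("apple", 3)])

    print(pairs.mapValues(lambda v: v * 10).collect())
    # [('apple', 10), ('banana', 20), ('apple', 30)]

    print(pairs.reduceByKey(lambda a, b: a + b).collect())
    # [('apple', 4), ('banana', 2)]   (order may vary)

    print([(k, list(v)) for k, v in pairs.groupByKey().collect()])
    # [('apple', [1, 3]), ('banana', [2])]   (order may vary)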
  84. Narrow Dependencies map, filter 84

  85. Narrow Dependencies union 85

  86. Wide Dependencies groupByKey, reduceByKey 86

  87. [DAG diagram: RDD A feeds B through groupBy (Stage 1); C is mapped to D, then D and E are unioned into F (Stage 2); B and F are joined into G (Stage 3). Stage boundaries fall at the wide dependencies.] 87
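
One hedged way to see those stage boundaries yourself is `toDebugString()` on the final RDD (the data below is illustrative): narrow operations such as map stay in the same stage, while the wide dependency created by reduceByKey shows up as a shuffle.

    words = sc.parallelize(["a", "b", "a", "c"])
    counts = (words.map(lambda w: (w, 1))              # narrow dependency
                   .reduceByKey(lambda a, b: a + b))   # wide dependency (shuffle)

    print(counts.toDebugString())   # lineage; the ShuffledRDD marks a new stage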
  88. Fault Tolerance

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda line: "ERROR" in line)
    # Count all the errors
    errors.cache()
    errors.count()
    # Count errors mentioning MySQL
    errors.filter(lambda line: "MySQL" in line).count()
    errors.filter(lambda line: "HDFS" in line) \
          .map(lambda x: x.split("\t")) \
          .collect()
    [Lineage diagram: the HDFS file -> lines -> filter("ERROR") -> errors -> filter("HDFS"); lost partitions can be recomputed from this lineage]
    88
  89. Cache • RDD persistence • Caching is a key tool

    for iterative algorithms and fast interactive use • Usage yourRDD.cache() yourRDD.persist().is_cached yourRDD.unpersist()
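
A minimal sketch of the usage calls above, assuming an existing SparkContext `sc` and reusing the LICENSE file path from the earlier slide; the filter condition is just an example.

    from pyspark import StorageLevel

    logs = sc.textFile("file:///usr/lib/spark/LICENSE")
    licenses = logs.filter(lambda line: "License" in line)

    licenses.cache()            # MEMORY_ONLY by default
    print(licenses.count())     # first action materializes the cache
    print(licenses.take(3))     # subsequent actions reuse the cached data

    licenses.unpersist()        # drop it, then persist with an explicit level
    licenses.persist(StorageLevel.MEMORY_AND_DISK)
    print(licenses.is_cached)   # True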
  90. Shared Variables • Broadcast variables broadcastVar = sc.broadcast([100, 200, 300])

    broadcastVar.value • Accumulators (also available in Python; only Scala can display named accumulators in the web UI) accum = sc.accumulator(0) sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x)) accum.value
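
To show why a broadcast variable is useful, here is a hedged sketch (with a made-up lookup table) of using one inside a transformation, plus a small accumulator example; both assume an existing SparkContext `sc`:

    # The read-only lookup table is shipped to each executor once,
    # instead of being serialized with every task closure.
    lookup = sc.broadcast({"a": 100, "b": 200, "c": 300})
    keys = sc.parallelize(["a", "b", "c", "a"])
    print(keys.map(lambda k: lookup.value.get(k, 0)).collect())
    # [100, 200, 300, 100]

    # Accumulator: tasks only add to it; the driver reads the final value.
    negatives = sc.accumulator(0)
    sc.parallelize([1, -2, 3, -4]).foreach(
        lambda x: negatives.add(1) if x < 0 else None)
    print(negatives.value)   # 2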
  91. # Word Count

    import sys
    lines = sc.textFile('wordcount.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    output = counts.map(lambda x: (x[1], x[0])).sortByKey(False)
    output.take(5)
    91
  92. Get Your Hands Dirty 92

  93. # raw data

    wordsList = ["apple", "banana", "strawberry", "lemon", "apple", "banana",
                 "apple", "apple", "apple", "apple", "apple", "lemon", "lemon",
                 "lemon", "banana", "banana", "banana", "banana", "banana",
                 "banana", "apple", "apple", "apple", "apple"]
    wordsRDD = sc.parallelize(wordsList, 4)

    # Print out the type of wordsRDD
    print type(wordsRDD)

    def makePlural(word):
        return word + 's'
    print makePlural('banana')

    # Pass the makePlural function above into map() so every fruit name gets an 's' appended
    pluralRDD = wordsRDD.<FILL IN>
    print pluralRDD.collect()

    # Do the same with a lambda instead of the named function
    pluralLambdaRDD = wordsRDD.<FILL IN>
    print pluralLambdaRDD.collect()
    93
  94. # Use map() with Python's built-in len() to compute the length of each fruit word

    pluralLengths = (pluralRDD.<FILL IN>)
    print pluralLengths

    # Hint: We can create the pair RDD using the map() transformation with a lambda()
    # function to create a new RDD. Build a pair RDD that pairs every fruit word with the number 1
    wordPairs = wordsRDD.<FILL IN>
    print wordPairs.collect()

    # Use mapValues() to count how many times each fruit appears
    wordCountsGrouped = wordsGrouped.<FILL IN>
    print wordCountsGrouped.collect()

    # Use reduceByKey() on wordPairs to sum the counts for each fruit
    wordCounts = wordPairs.<FILL IN>
    print wordCounts.collect()

    # Chain the map and reduceByKey steps above into a single expression, then collect()
    # and print to get the final word counts
    wordCountsCollected = (wordsRDD.<FILL IN>)
    print wordCountsCollected
    94
  95. Best Practice 95

  96. # Avoid GroupByKey

    [Diagram: the same word count done with reduceByKey and with groupByKey, using two keys (A and B). With reduceByKey, (key, 1) pairs are summed locally on each partition before the shuffle, so only small partial sums cross the network; with groupByKey, every individual (key, 1) record is shuffled to the reducers before being summed.] 96
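
The same contrast in code, as a hedged sketch with made-up keys: both lines below give identical counts, but reduceByKey combines values on each partition before the shuffle, while groupByKey ships every individual record across the network first.

    pairs = sc.parallelize([("A", 1), ("B", 1), ("A", 1), ("A", 1), ("B", 1)])

    # Preferred: partial sums are computed locally before the shuffle.
    print(pairs.reduceByKey(lambda a, b: a + b).collect())
    # [('A', 3), ('B', 2)]

    # Works, but every (key, 1) record is shuffled before summing.
    print(pairs.groupByKey().mapValues(sum).collect())
    # [('A', 3), ('B', 2)]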
  97. # Don't copy all elements to the driver • Scala anti-pattern:

    val values = myLargeDataRDD.collect() • Prefer take(), sample(), countByValue(), countByKey(), collectAsMap(), saving to a file, or filtering/sampling first 97
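
A short sketch of the alternatives listed on the slide (the RDD name and size are illustrative): pull back a sample or an aggregate instead of the whole dataset, or write it out to storage.

    bigRDD = sc.parallelize(range(100000))        # stand-in for a large RDD

    print(bigRDD.take(5))                         # a few elements, not all
    print(bigRDD.count())                         # just the size
    print(bigRDD.filter(lambda x: x % 20000 == 0).collect())  # filter first
    bigRDD.saveAsTextFile("file:///tmp/bigRDD_out")  # or save to storage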
  98. # Bad input data • Python

    input_rdd = sc.parallelize(["{\"value\": 1}",   # Good
                                "bad_json",         # Bad
                                "{\"value\": 2}",   # Good
                                "{\"value\": 3"     # Missing brace
                                ])
    sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
    98
  99. # Number of data partitions? • Spark application UI •

    Inspect it programmatically: yourRDD.partitions.size  # scala yourRDD.getNumPartitions()  # python 99
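
A small sketch of checking and changing the partition count in PySpark (the numbers are arbitrary):

    rdd = sc.parallelize(range(100), 8)
    print(rdd.getNumPartitions())        # 8

    more = rdd.repartition(16)           # full shuffle, more partitions
    fewer = rdd.coalesce(2)              # shrink without a full shuffle
    print(more.getNumPartitions())       # 16
    print(fewer.getNumPartitions())      # 2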
  100. HadoopCon 2016 9/9-9/10 8th Annual Hadoop Meetup in Taiwan http://2016.hadoopcon.org/wp/

    100
  101. Advanced 101

  102. Components • Data storage ◦ HDFS, Hadoop compatible • API

    ◦ Scala, Python, Java, R • Management framework ◦ Standalone ◦ Mesos -> $ ./bin/spark-shell --master mesos://host:5050 ◦ YARN -> $ ./bin/spark-shell --master yarn --deploy-mode client ◦ AWS EC2 [Diagram: the distributed computing API (Scala, Python, Java) sits on top of the storage layer (HDFS, etc.)]
  103. Cluster Overview [Diagram: the Driver Program (SparkContext) talks to a Cluster Manager / Master (Standalone, YARN, or Mesos), which manages the Worker Nodes]
  104. Cluster Processing [Diagram: the Driver Program (SparkContext) ships application code (*.jar, *.py) through the Cluster Manager / Master to each Worker Node, where an Executor with a cache runs the Tasks] http://spark.apache.org/
  105. Client Mode (default) [Diagram: the Driver (Client) holding the SparkContext submits the app to the Master; the driver stays on the client machine while Executors with caches run the Tasks (*.jar, *.py) on the Worker Nodes] http://spark.apache.org/
  106. Cluster Mode [Diagram: the Client submits the app to the Master, which launches the Driver on one of the Worker Nodes; Executors with caches on the other Worker Nodes run the Tasks] http://spark.apache.org/