
Hadoop vs. Spark Today


This deck introduces Hadoop and Spark: why we use them, what an RDD is in Spark, how to operate on RDDs, and what makes Spark special.


Erica Li

July 11, 2016

Transcript

  1. Hadoop vs. Spark Erica Li 1

  2. Environment Preparation • Google search: QuickStart Download for CDH 5.7 ◦

    Platform: VirtualBox ◦ Save it to the Desktop • Google search: Download VirtualBox ◦ Platform: your PC ◦ Then install it 2
  3. Erica Li • shrimp_li • ericalitw ... • Data Scientist

    • inBOUND CTO & Co-Founder • Girls in Tech/Women Who Code • Taiwan Spark User Group Founder 3
  4. SAS, R, SPSS SAS, Java, Python Python Python, SAS... 4

  5. Taiwan Spark User Group 5

  6. Course Outline • The era of big data • The age of Hadoop • The rise of Spark • Hadoop vs. Spark, the showdown 6

  7. The Era of Big Data 7

  8. 8 Exploration on the Big Data frontier - Tim Smith

    https://www.youtube.com/watch?time_continue=2&v=j-0cUmUyb-Y
  9. Big data is high volume, high velocity, and/or high variety

    information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. - Doug Laney 2012 9
  10. http://www.inside.com.tw/2015/02/06/big-data-1-origin-and-4vs 10

  11. Big Data Problem, when? • Too many bytes (volume)

    • Too high a rate (velocity) • Too many sources (variety) • Analysis that does not scale ◦ Human intensive ◦ Computer intensive 11
  12. Sources of Big Data 12 Learning Transportation Retail Health Government

    Entertainment Finance Social
  13. None
  14. The Age of Hadoop 14

  15. 15

  16. 16

  17. Why Use Hadoop 17

  18. The Hadoop Platform • A high-throughput data processing platform • Software that lets users easily write programs to process massive data • You only need to define what map

    and reduce should do [Diagram: unstructured data stored in HDFS flows through Map tasks and a Reduce step into structured output] 18
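
To make the "define map and reduce" idea concrete, here is a minimal pure-Python sketch of the two steps (illustrative only; this is not the Hadoop API, and the sample records are made up):

    # Conceptual map/reduce in plain Python (NOT the Hadoop API).
    # "records" stands in for unstructured text stored in HDFS.
    from functools import reduce

    records = ["hadoop stores files", "map and reduce do the work"]

    # Map step: turn each record into (key, value) pairs.
    mapped = [(word, 1) for line in records for word in line.split()]

    # Reduce step: aggregate all values that share a key.
    def add_pair(acc, pair):
        word, count = pair
        acc[word] = acc.get(word, 0) + count
        return acc

    word_counts = reduce(add_pair, mapped, {})
    print(word_counts)   # e.g. {'hadoop': 1, 'map': 1, 'reduce': 1, ...}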
  19. The Hadoop Ecosystem 19

  20. Mike Gualtieri 20 What is Hadoop? https://www.youtube.com/watch?v=4DgTLaFNQq0

  21. What is Hadoop • A framework that allows for the

    distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org 21
  22. Question: So, is Hadoop a framework that allows for the distributed

    processing of 1) small data? 2) large data? 22
  23. Answer: Large data. It is also capable of processing

    small data, but to experience the power of Hadoop you need data on the terabyte scale. 23
  24. http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview 24

  25. Hadoop 2.X Core Components [Diagram: the HDFS cluster (one NameNode plus several DataNodes) alongside YARN (one Resource Manager plus a Node Manager on each node)] 25
  26. Hadoop 2.X Cluster Architecture Master ----- NameNode ----- Resource Manager

    Slave01 -- DataNode -- Node Manager Slave02 -- DataNode -- Node Manager Slave03 -- DataNode -- Node Manager Slave04 -- DataNode -- Node Manager Slave05 -- DataNode -- Node Manager Slave06 -- DataNode -- Node Manager 26
  27. Hadoop Setup (AWS) 27

  28. What do we need? • Credit card • Ubuntu Server 14.04

    LTS (HVM) • Java 7 • hadoop-2.7.1.tar.gz 28
  29. Why not Java8? https://cwiki.apache.org/confluence/display/Hive/GettingStarted 29

  30. 1-Download Java 7 on all machines

    # nn, dn1, dn2
    sudo apt-get install -y python-software-properties
    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install openjdk-7-jdk
    # this is for java8
    #sudo apt-get install oracle-java8-installer
    30
  31. 2-Modify /etc/hosts

    # Then add every machine's IP and hostname to it.
    sudo vim /etc/hosts
    31
  32. 3-Genkey for all machines

    chmod 644 ~/.ssh/authorized_keys
    # copy .pem from local system to nn
    scp -i ~/Downloads/StarkTech.pem ~/Downloads/StarkTech.pem ubuntu@<nn-public-ip>:~/
    cp ~/StarkTech.pem ~/.ssh/
    chmod 400 ~/.ssh/StarkTech.pem
    # create the public fingerprint on namenode
    ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    # scp .pem to dn1 & dn2
    scp -i ~/.ssh/StarkTech.pem ~/.ssh/StarkTech.pem ubuntu@<dn1-public-ip>:~/.ssh/
    # copy the public fingerprint to each datanode
    cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/StarkTech.pem dn1 'cat >> ~/.ssh/authorized_keys'
    cat ~/.ssh/id_rsa.pub | ssh -i ~/.ssh/StarkTech.pem dn2 'cat >> ~/.ssh/authorized_keys'
    32
  33. 4-Download it and install them

    # at nn
    cd ~
    wget http://ftp.twaren.net/Unix/Web/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
    scp ~/hadoop-2.7.1.tar.gz dn1:~/
    scp ~/hadoop-2.7.1.tar.gz dn2:~/
    tar -zxvf hadoop-2.7.1.tar.gz
    sudo mv hadoop-2.7.1 /usr/local
    cd /usr/local
    sudo mv hadoop-2.7.1 hadoop
    vim ~/.profile
    33
  34. 5-Modify variables

    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin
    export HADOOP_PREFIX=/usr/local/hadoop
    export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
    # Set JAVA_HOME (java-8-oracle for java8)
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    # Add Hadoop bin/ directory to path
    export PATH=$PATH:$HADOOP_PREFIX/bin
    34
  35. 6-Config setting

    cd /usr/local/hadoop/etc
    sudo apt-get install git
    git clone https://github.com/wlsherica/StarkTechnology.git
    mv StarkTechnology/hadoop-config/ .
    rm -rf StarkTechnology
    mv hadoop hadoop_ori
    mv hadoop-config hadoop
    35
  36. 7-Make sure all configs are in place

    # add JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    vim hadoop/hadoop-env.sh
    # dfs.replication=3
    vim hadoop/hdfs-site.xml
    # check fs.default.name
    vim hadoop/core-site.xml
    36
  37. 8-Check slaves

    vim hadoop/slaves   # add dn1 and dn2, remove localhost
    37
  38. 9-scp them all

    scp -r /usr/local/hadoop dn1:/usr/local
    scp -r /usr/local/hadoop dn2:/usr/local
    scp ~/.profile dn1:~/
    scp ~/.profile dn2:~/
    hadoop namenode -format
    38
  39. 10-Start it!

    $HADOOP_HOME/sbin/start-dfs.sh
    jps
    ssh dn1 'jps'
    ssh dn2 'jps'
    # if you want to stop them
    $HADOOP_HOME/sbin/stop-dfs.sh
    39
  40. Hadoop's Bottlenecks 40

  41. Hadoop's Growing Maturity and Its Bottlenecks • Not fit for small data • Potential stability

    issues ◦ open source platform ◦ latest stable version • General limitations ◦ Hadoop is not the only answer ◦ data collection, aggregation and integration 41
  42. From Hadoop to Spark 42

  43. 43

  44. • Hadoop ◦ A full-stack MPP system with both

    big data storage (HDFS) and a parallel execution model (MapReduce) • Spark ◦ An open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers 44
  45. http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview 45

  46. HDFS [Diagram: a file is split into blocks; the NameNode tracks which blocks live on which DataNodes across the HDFS cluster] 46
  47. MapReduce [Diagram: word count on "Deer Bear River / Car Car River / Deer Car Bear" — Splitting gives one line per mapper, Mapping emits (word, 1) pairs, Shuffling groups the pairs by word, and Reducing sums them to Bear 2, Car 3, Deer 2, River 2] http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/ 47
  48. Spark vs. Hadoop MapReduce "Run programs up to 100x faster

    than Hadoop MapReduce in memory, or 10x faster on disk." - http://spark.apache.org/ 2014 Sort Benchmark Competition: Hadoop MapReduce used 2,100 machines and took 72 minutes; Spark used 207 machines and took 23 minutes. 48
  49. Hadoop vs. Spark? 49

  50. Spark vs. Hadoop MapReduce [Diagram: MapReduce reads its input from HDFS and writes results back to HDFS on every iteration, while Spark keeps the intermediate data in memory between iterations] 50
  51. WordCount - MapReduce

    # mapper.py
    import sys
    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print '%s\t%s' % (word, 1)

    # reducer.py
    from operator import itemgetter
    import sys
    current_word = None
    current_count = 0
    word = None
    for line in sys.stdin:
        line = line.strip()
        word, count = line.split('\t', 1)
        count = int(count)   # the count arrives as a string
        if current_word == word:
            current_count += count
        else:
            if current_word:
                print '%s\t%s' % (current_word, current_count)
            current_count = count
            current_word = word
    if current_word == word:
        print '%s\t%s' % (current_word, current_count)
    51
  52. WordCount - Spark

    import sys
    lines = sc.textFile('wordcount.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    output = counts.map(lambda x: (x[1], x[0])).collect()
    52
  53. Apache Spark™ is a powerful open source processing engine built

    around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
  54. The History of Spark 54

  55. A Brief History [Timeline 2004-2016: Google's MapReduce paper, Hadoop at Yahoo!, the Hadoop Summit, the Spark paper, Apache Spark's win, Spark 2.0, and "What's Next?"] 55
  56. The Spark Ecosystem 56

  57. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

  58. 58

  59. The Future and Outlook of Spark 59

  60. Spark 2.0 • Easier, faster, smarter ◦ Easier SQL API

    ◦ All 99 TPC-DS queries • DataFrames & Datasets in Scala/Java • ML pipeline persistence • Algorithms in R ◦ GLM, K-Means, survival analysis, NB, etc. 60
  61. 61

  62. Spark Installation 62

  63. Training Materials • Cloudera VM ◦ Cloudera CDH 5.7 •

    Spark 1.6.0 • 64-bit host OS • RAM 4G • VMware, KVM, and VirtualBox 63
  64. Cloudera Quick Start VM • QuickStart Downloads for CDH 5.7

    • CDH5 and Cloudera Manager 5 • Account ◦ username: cloudera ◦ passwd: cloudera • The root account password is cloudera 64
  65. Spark Installation • Downloading wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.2/spark-1.6.2-bin-hadoop2.6.tgz • Tar it

    then mv it tar zxvf spark-1.6.2-bin-hadoop2.6.tgz cd spark-1.6.2-bin-hadoop2.6 (Requirements: Scala 2.10.x, Java 7+, Python 2.6+, R 3.1+)
  66. Let’s do it 66

  67. 67

  68. 68

  69. 69

  70. 70

  71. 71

  72. Where is Spark? • $ whereis spark • $ ls

    /usr/lib/spark • $ /usr/lib/spark/bin/pyspark • > exit() 72
  73. • Python

    textF = sc.textFile("/usr/lib/spark/LICENSE")
    textF = sc.textFile("file:///usr/lib/spark/LICENSE")
    textF.count()
    textF.take(2)
    spk_line = textF.filter(lambda line: "Spark" in line)
    spk_line.count()
    73
  74. Spark Shell • Scala ./bin/spark-shell --master local[4] ./bin/spark-shell --master local[4]

    --jars urcode.jar • Python ./bin/pyspark --master local[4] PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark 74
  75. Initializing SparkContext • Scala val conf = new SparkConf().setAppName(appName) .setMaster(master)

    val sc = new SparkContext(conf) • Python conf = SparkConf().setAppName(appName).setMaster(master) sc = SparkContext(conf=conf) A name of your application to show on cluster UI 75
  76. master URLs • local, local[K], local[*] • spark://HOST:PORT • mesos://HOST:PORT

    • yarn-client • yarn-cluster 76
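
As a hedged sketch of how those master URLs are used from PySpark 1.x (the application name and the local[4] value are placeholders; swap in the spark://, mesos://, or YARN values from the slide as appropriate):

    from pyspark import SparkConf, SparkContext

    # local[4] = run locally with 4 worker threads; replace with
    # "spark://HOST:PORT", "mesos://HOST:PORT", or "yarn-client" for a cluster.
    conf = SparkConf().setAppName("masterUrlDemo").setMaster("local[4]")
    sc = SparkContext(conf=conf)

    print(sc.master)   # local[4]
    sc.stop()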
  77. Spark Standalone Mode • Launch standalone cluster ◦ master ◦

    slaves ◦ public key 77
  78. Which one is better? 78

  79. How Spark RDDs Work 79

  80. What's RDD "The main abstraction Spark provides is a resilient distributed dataset (RDD), which is

    a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures." [Diagram: Partition1, Partition2, ...] 80
  81. RDD operations • Transformations map(func) flatMap(func) filter(func) groupByKey() reduceByKey() union(other)

    sortByKey() ... • Actions reduce(func) collect() first() take(n) saveAsTextFile(path) countByKey() ... 81
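
A small illustrative example of the difference (assuming an existing SparkContext `sc`, e.g. inside pyspark): transformations only describe a new RDD and are evaluated lazily, while actions trigger the computation and return results to the driver.

    nums = sc.parallelize([1, 2, 3, 4, 5])

    evens   = nums.filter(lambda x: x % 2 == 0)   # transformation (lazy)
    doubled = evens.map(lambda x: x * 2)          # transformation (lazy)

    print(doubled.collect())                 # action -> [4, 8]
    print(nums.reduce(lambda a, b: a + b))   # action -> 15
    print(nums.take(2))                      # action -> [1, 2]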
  82. How to Create RDD • Scala val rddStr:RDD[String] = sc.textFile("hdfs://..")

    val rddInt:RDD[Int] = sc.parallelize(1 to 100) • Python data = [1, 2, 3, 4, 5] distData = sc.parallelize(data) 82
  83. Key-Value RDD lines = sc.textFile("data.txt") pairs = lines.map(lambda s: (s,

    1)) counts = pairs.reduceByKey(lambda a, b: a + b) • mapValues • groupByKey • reduceByKey 83
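
A short sketch of the three pair-RDD operations named on the slide, with made-up data and an existing SparkContext `sc` assumed:

    pairs = sc.parallelize([("apple", 1), ("banana", 2), ("apple", 3)])

    print(pairs.mapValues(lambda v: v * 10).collect())
    # [('apple', 10), ('banana', 20), ('apple', 30)]

    print(pairs.reduceByKey(lambda a, b: a + b).collect())
    # [('apple', 4), ('banana', 2)]   (order may vary)

    print([(k, list(v)) for k, v in pairs.groupByKey().collect()])
    # [('apple', [1, 3]), ('banana', [2])]   (order may vary)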
  84. Narrow Dependencies map, filter 84

  85. Narrow Dependencies union 85

  86. Wide Dependencies groupByKey, reduceByKey 86

  87. [DAG diagram: RDD A feeds B through groupBy (Stage 1); C is mapped to D, then D and E are unioned into F (Stage 2); B and F are joined into G (Stage 3). Stage boundaries fall at the wide dependencies.] 87
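
One hedged way to see those stage boundaries yourself is `toDebugString()` on the final RDD (the data below is illustrative): narrow operations such as map stay in the same stage, while the wide dependency created by reduceByKey shows up as a shuffle.

    words = sc.parallelize(["a", "b", "a", "c"])
    counts = (words.map(lambda w: (w, 1))              # narrow dependency
                   .reduceByKey(lambda a, b: a + b))   # wide dependency (shuffle)

    print(counts.toDebugString())   # lineage; the ShuffledRDD marks a new stage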
  88. Fault Tolerance

    lines = spark.textFile("hdfs://...")
    errors = lines.filter(lambda line: "ERROR" in line)
    # Count all the errors
    errors.cache()
    errors.count()
    # Count errors mentioning MySQL
    errors.filter(lambda line: "MySQL" in line).count()
    errors.filter(lambda line: "HDFS" in line) \
          .map(lambda x: x.split("\t")) \
          .collect()
    [Lineage diagram: the HDFS file -> lines -> filter("ERROR") -> errors -> filter("HDFS"); lost partitions can be recomputed from this lineage]
    88
  89. Cache • RDD persistence • Caching is a key tool

    for iterative algorithms and fast interactive use • Usage yourRDD.cache() yourRDD.persist().is_cached yourRDD.unpersist()
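
A minimal sketch of the usage calls above, assuming an existing SparkContext `sc` and reusing the LICENSE file path from the earlier slide; the filter condition is just an example.

    from pyspark import StorageLevel

    logs = sc.textFile("file:///usr/lib/spark/LICENSE")
    licenses = logs.filter(lambda line: "License" in line)

    licenses.cache()            # MEMORY_ONLY by default
    print(licenses.count())     # first action materializes the cache
    print(licenses.take(3))     # subsequent actions reuse the cached data

    licenses.unpersist()        # drop it, then persist with an explicit level
    licenses.persist(StorageLevel.MEMORY_AND_DISK)
    print(licenses.is_cached)   # True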
  90. Shared Variables • Broadcast variables broadcastVar = sc.broadcast([100, 200, 300])

    broadcastVar.value • Accumulators (also available in Python; only Scala can display named accumulators in the web UI) accum = sc.accumulator(0) sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x)) accum.value
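
To show why a broadcast variable is useful, here is a hedged sketch (with a made-up lookup table) of using one inside a transformation, plus a small accumulator example; both assume an existing SparkContext `sc`:

    # The read-only lookup table is shipped to each executor once,
    # instead of being serialized with every task closure.
    lookup = sc.broadcast({"a": 100, "b": 200, "c": 300})
    keys = sc.parallelize(["a", "b", "c", "a"])
    print(keys.map(lambda k: lookup.value.get(k, 0)).collect())
    # [100, 200, 300, 100]

    # Accumulator: tasks only add to it; the driver reads the final value.
    negatives = sc.accumulator(0)
    sc.parallelize([1, -2, 3, -4]).foreach(
        lambda x: negatives.add(1) if x < 0 else None)
    print(negatives.value)   # 2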
  91. # Word Count

    import sys
    lines = sc.textFile('wordcount.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    output = counts.map(lambda x: (x[1], x[0])).sortByKey(False)
    output.take(5)
    91
  92. Get Your Hands Dirty 92

  93. # raw data

    wordsList = ["apple", "banana", "strawberry", "lemon", "apple", "banana",
                 "apple", "apple", "apple", "apple", "apple", "lemon", "lemon",
                 "lemon", "banana", "banana", "banana", "banana", "banana",
                 "banana", "apple", "apple", "apple", "apple"]
    wordsRDD = sc.parallelize(wordsList, 4)

    # Print out the type of wordsRDD
    print type(wordsRDD)

    def makePlural(word):
        return word + 's'
    print makePlural('banana')

    # Pass the makePlural function above into map() so every fruit name gets an 's' appended
    pluralRDD = wordsRDD.<FILL IN>
    print pluralRDD.collect()

    # Do the same with a lambda instead of the named function
    pluralLambdaRDD = wordsRDD.<FILL IN>
    print pluralLambdaRDD.collect()
    93
  94. # Use map() with Python's built-in len() to compute the length of each fruit word

    pluralLengths = (pluralRDD.<FILL IN>)
    print pluralLengths

    # Hint: We can create the pair RDD using the map() transformation with a lambda()
    # function to create a new RDD. Build a pair RDD that pairs every fruit word with the number 1
    wordPairs = wordsRDD.<FILL IN>
    print wordPairs.collect()

    # Use mapValues() to count how many times each fruit appears
    wordCountsGrouped = wordsGrouped.<FILL IN>
    print wordCountsGrouped.collect()

    # Use reduceByKey() on wordPairs to sum the counts for each fruit
    wordCounts = wordPairs.<FILL IN>
    print wordCounts.collect()

    # Chain the map and reduceByKey steps above into a single expression, then collect()
    # and print to get the final word counts
    wordCountsCollected = (wordsRDD.<FILL IN>)
    print wordCountsCollected
    94
  95. Best Practice 95

  96. # Avoid GroupByKey

    [Diagram: the same word count done with reduceByKey and with groupByKey, using two keys (A and B). With reduceByKey, (key, 1) pairs are summed locally on each partition before the shuffle, so only small partial sums cross the network; with groupByKey, every individual (key, 1) record is shuffled to the reducers before being summed.] 96
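
The same contrast in code, as a hedged sketch with made-up keys: both lines below give identical counts, but reduceByKey combines values on each partition before the shuffle, while groupByKey ships every individual record across the network first.

    pairs = sc.parallelize([("A", 1), ("B", 1), ("A", 1), ("A", 1), ("B", 1)])

    # Preferred: partial sums are computed locally before the shuffle.
    print(pairs.reduceByKey(lambda a, b: a + b).collect())
    # [('A', 3), ('B', 2)]

    # Works, but every (key, 1) record is shuffled before summing.
    print(pairs.groupByKey().mapValues(sum).collect())
    # [('A', 3), ('B', 2)]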
  97. # Don't copy all elements to the driver • Scala anti-pattern:

    val values = myLargeDataRDD.collect() • Prefer take(), sample(), countByValue(), countByKey(), collectAsMap(), saving to a file, or filtering/sampling first 97
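
A short sketch of the alternatives listed on the slide (the RDD name and size are illustrative): pull back a sample or an aggregate instead of the whole dataset, or write it out to storage.

    bigRDD = sc.parallelize(range(100000))        # stand-in for a large RDD

    print(bigRDD.take(5))                         # a few elements, not all
    print(bigRDD.count())                         # just the size
    print(bigRDD.filter(lambda x: x % 20000 == 0).collect())  # filter first
    bigRDD.saveAsTextFile("file:///tmp/bigRDD_out")  # or save to storage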
  98. # Bad input data • Python

    input_rdd = sc.parallelize(["{\"value\": 1}",   # Good
                                "bad_json",         # Bad
                                "{\"value\": 2}",   # Good
                                "{\"value\": 3"     # Missing brace
                                ])
    sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
    98
  99. # Number of data partitions? • Spark application UI •

    Inspect it programmatically: yourRDD.partitions.size  # scala yourRDD.getNumPartitions()  # python 99
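
A small sketch of checking and changing the partition count in PySpark (the numbers are arbitrary):

    rdd = sc.parallelize(range(100), 8)
    print(rdd.getNumPartitions())        # 8

    more = rdd.repartition(16)           # full shuffle, more partitions
    fewer = rdd.coalesce(2)              # shrink without a full shuffle
    print(more.getNumPartitions())       # 16
    print(fewer.getNumPartitions())      # 2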
  100. HadoopCon 2016 9/9-9/10 8th Annual Hadoop Meetup in Taiwan http://2016.hadoopcon.org/wp/

    100
  101. Advanced 101

  102. Components • Data storage ◦ HDFS, Hadoop compatible • API

    ◦ Scala, Python, Java, R • Management framework ◦ Standalone ◦ Mesos -> $ ./bin/spark-shell --master mesos://host:5050 ◦ YARN -> $ ./bin/spark-shell --master yarn --deploy-mode client ◦ AWS EC2 [Diagram: the distributed computing API (Scala, Python, Java) sits on top of the storage layer (HDFS, etc.)]
  103. Cluster Overview [Diagram: the Driver Program (SparkContext) talks to a Cluster Manager / Master (Standalone, YARN, or Mesos), which manages the Worker Nodes]
  104. Cluster Processing [Diagram: the Driver Program (SparkContext) ships application code (*.jar, *.py) through the Cluster Manager / Master to each Worker Node, where an Executor with a cache runs the Tasks] http://spark.apache.org/
  105. Client Mode (default) [Diagram: the Driver (Client) holding the SparkContext submits the app to the Master; the driver stays on the client machine while Executors with caches run the Tasks (*.jar, *.py) on the Worker Nodes] http://spark.apache.org/
  106. Cluster Mode [Diagram: the Client submits the app to the Master, which launches the Driver on one of the Worker Nodes; Executors with caches on the other Worker Nodes run the Tasks] http://spark.apache.org/