
Spark Introduction 2015

Erica Li
December 04, 2015


Introduces Spark and its ecosystem to students. In this lecture, you will learn what Spark is, its features and architecture, RDDs, how to install Spark, and best practices for using it.


Transcript

  1. Spark Introduction Erica Li

  2. Erica Li • shrimp_li • ericalitw • Data Scientist •

    NPO side project • Girls in Tech Taiwan • Taiwan Spark User Group
  3. https://github.com/wlsherica

  4. Agenda • What is Spark • Hadoop vs. Spark •

    Spark Features • Spark Ecosystem • Spark Architecture • Resilient Distributed Datasets • Installation
  5. What is Spark

  6. None
  7. What’s Spark

    [Architecture diagram: a Driver Program with a SparkContext connects through a Cluster Master to several Worker Nodes; each worker runs an Executor with a cache and multiple tasks, reading data from a file system such as HDFS, S3, or local storage.]
  8. Spark Data Computation Flow

    [Diagram: data is read from a file system (HDFS, S3, local, etc.) into RDD1, which is split into partitions; a chain of transformations (Transformation1 … TransformationN) produces RDD2, RDD3, and so on, and actions on the final RDD produce the result.]
  9. Hadoop vs. Spark

  10. http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

  11. • Hadoop ◦ A full-stack MPP system with both

    distributed storage (HDFS) and a parallel execution model (MapReduce) • Spark ◦ An open-source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers
  12. HDFS

    [Diagram: an HDFS cluster with a NameNode and several DataNodes; a file is split into blocks that are distributed across the DataNodes.]
  13. MapReduce

    [Diagram: the classic word-count flow. The input lines "Deer Bear River", "Car Car River", "Deer Car Bear" are split, each word is mapped to a (word, 1) pair, the pairs are shuffled so equal keys end up together, and the reducers sum them to Bear 2, Car 3, Deer 2, River 2.] http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/
  14. Spark vs. Hadoop MapReduce Run programs up to 100x faster

    than Hadoop MapReduce in memory, or 10x faster on disk. 2014 Sort Benchmark Competition (machines / time): Hadoop MapReduce 2,100 machines, 72 minutes; Spark 207 machines, 23 minutes. http://spark.apache.org/
  15. Spark vs. Hadoop MapReduce

    [Diagram: for iterative jobs, MapReduce reads its input from HDFS and writes intermediate results back to HDFS on every pass, while Spark keeps the intermediate data in memory between iterations.]
  16. Spark Features

  17. Spark Features • Written in Scala • Runs on the JVM

    • Takes MapReduce to the next level • In-memory data storage • Near real-time processing • Lazy evaluation of queries
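To make the last bullet concrete, here is a minimal PySpark sketch of lazy evaluation (not from the deck; it assumes a SparkContext named sc, as created on the later slides): transformations only record lineage, and nothing is computed until an action runs.

    # Transformations are lazy; only the action at the end triggers a job
    rdd = sc.parallelize(range(1, 1001))           # nothing is computed yet
    squares = rdd.map(lambda x: x * x)             # still nothing: map() is a transformation
    evens = squares.filter(lambda x: x % 2 == 0)   # transformations just extend the lineage
    print evens.count()                            # count() is an action: the job runs here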
  18. Available APIs • Spark (v1.4+) currently supports the following

    languages for application development: Scala, Java, Python, and R
  19. Spark Ecosystem

  20. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

  21. Spark Architecture

  22. Components • Data storage ◦ HDFS or any Hadoop-compatible storage • API

    ◦ Scala, Python, Java, R • Management framework ◦ Standalone ◦ Mesos ◦ YARN [Diagram: a stack of three layers: distributed computing, the API (Scala, Python, Java), and storage (HDFS, etc.)]
  23. Cluster Object

    [Diagram: the Driver Program (SparkContext) talks to a Cluster Manager / Master (Standalone, YARN, or Mesos), which allocates Worker Nodes for the application.]
  24. Cluster processing

    [Diagram: the Driver Program (SparkContext) submits application code (*.jar / *.py) through the Cluster Manager (Master); each Worker Node runs an Executor with a cache that executes the tasks.] http://spark.apache.org/
  25. Client Mode (default)

    [Diagram: the driver runs inside the client that submits the application; its SparkContext registers with the Master, and the application code (*.jar / *.py) is shipped to Executors on the Worker Nodes, which run the tasks.] http://spark.apache.org/
  26. Cluster Mode

    [Diagram: the client only submits the application; the Master launches the driver (SparkContext) on one of the Worker Nodes, and the remaining workers run Executors with their caches and tasks.] http://spark.apache.org/
  27. Spark on YARN

    [Diagram: the Spark YARN client requests resources from the Resource Manager, which assigns and invokes an Application Master on a Node Manager; the Application Master hosts the SparkContext with its DAG scheduler and YarnClusterScheduler, applies for containers, and the assigned containers start ExecutorBackends that run the Executors.]
  28. How to Install Spark

  29. Training Materials • Cloudera VM ◦ Cloudera CDH 5.4 •

    Spark 1.3.0 • 64-bit host OS • 4 GB RAM • VMware, KVM, and VirtualBox
  30. Cloudera Quick Start VM • CDH5 and Cloudera Manager 5

    • Account ◦ username: cloudera ◦ passwd: cloudera • The root account password is cloudera
  31. Spark Installation • Downloading wget http://www.apache.org/dyn/closer.lua/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz • Untar it,

    then cd into it tar zxvf spark-1.4.1-bin-hadoop2.6.tgz cd spark-1.4.1-bin-hadoop2.6
  32. Let’s do it

  33. Spark Shell • Scala ./bin/spark-shell --master local[4] ./bin/spark-shell --master local[4]

    --jars urcode.jar • Python ./bin/pyspark --master local[4] PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
  34. Initializing SparkContext • Scala val conf = new SparkConf().setAppName(appName) .setMaster(master)

    val sc = new SparkContext(conf) • Python conf = SparkConf().setAppName(appName).setMaster(master) sc = SparkContext(conf=conf) A name of your application to show on cluster UI
  35. master URLs • local, local[K], local[*] • spark://HOST:PORT • mesos://HOST:PORT

    • yarn-client • yarn-cluster
  36. Which one is better?

  37. Spark Standalone Mode • Launch standalone cluster ◦ master ◦

    slaves ◦ public key • How to launch?
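To answer the "How to launch?" question, a rough sketch using the scripts shipped in Spark's sbin/ directory (host name and port are placeholders; it assumes passwordless SSH from the master to each slave, which is the public-key point above):

    # On the master machine
    ./sbin/start-master.sh                              # web UI on port 8080 by default

    # On each worker, or list workers in conf/slaves and run ./sbin/start-slaves.sh
    ./sbin/start-slave.sh spark://master-host:7077      # master-host is a placeholder

    # Point a shell or application at the cluster
    ./bin/spark-shell --master spark://master-host:7077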
  38. Resilient Distributed Datasets

  39. What’s RDD

    “The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.” [Diagram: an RDD split into Partition1, Partition2, …]
  40. RDD operations • Transformations map(func) flatMap(func) filter(func) groupByKey() reduceByKey() union(other)

    sortByKey() ... • Actions reduce(func) collect() first() take(n) saveAsTextFile(path) countByKey() ...
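A short PySpark sketch chaining a few of the operations listed above (assumes sc; the data is made up): the transformations build new RDDs, and the actions return values to the driver.

    nums = sc.parallelize([3, 1, 2, 3])
    doubled = nums.map(lambda x: x * 2)     # transformation: [6, 2, 4, 6]
    big = doubled.filter(lambda x: x > 2)   # transformation: [6, 4, 6]
    print big.collect()                     # action: returns [6, 4, 6] to the driver
    print big.reduce(lambda a, b: a + b)    # action: 16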
  41. How to Create RDD • Scala val rddStr:RDD[String] = sc.textFile("hdfs://..")

    val rddInt:RDD[Int] = sc.parallelize(1 to 100) • Python data = [1, 2, 3, 4, 5] distData = sc.parallelize(data)
  42. Key-Value RDD lines = sc.textFile("data.txt") pairs = lines.map(lambda s: (s,

    1)) counts = pairs.reduceByKey(lambda a, b: a + b) • mapValues • groupByKey • reduceByKey
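The three pair-RDD operations listed above, in a small illustrative sketch (assumes sc; result order may vary across partitions):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print pairs.mapValues(lambda v: v * 10).collect()      # [('a', 10), ('b', 20), ('a', 30)]
    print pairs.reduceByKey(lambda a, b: a + b).collect()  # [('a', 4), ('b', 2)]
    grouped = pairs.groupByKey().mapValues(list)
    print grouped.collect()                                # [('a', [1, 3]), ('b', [2])]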
  43. Narrow Dependencies map, filter

  44. Narrow Dependencies union

  45. Wide Dependencies groupByKey, reduceByKey

  46. Stages

    [Diagram: a DAG of RDDs A through G. Stage 1: A → groupBy → B. Stage 2: C → map → D, then D and E → union → F. Stage 3: B and F → join → G. The wide dependencies (groupBy, join) mark the stage boundaries.]
  47. Cache • RDD persistence • Caching is a key tool

    for iterative algorithms and fast interactive use • Usage yourRDD.cache() yourRDD.persist().is_cached yourRDD.unpersist()
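A small sketch of the usage above with an explicit storage level (the file name reuses wordcount.txt from the later word-count slide; the storage level choice is just an example):

    from pyspark import StorageLevel

    words = sc.textFile("wordcount.txt").flatMap(lambda line: line.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions around after the first action
    print words.count()                           # first action: computes and caches the RDD
    print words.distinct().count()                # second action: reuses the cached partitions
    words.unpersist()                             # free the cache when done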
  48. Shared Variables • Broadcast variables broadcastVar = sc.broadcast([100, 200, 300])

    broadcastVar.value • Accumulators accum = sc.accumulator(0) sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x)) accum.value
  49. Fault Tolerance lines = sc.textFile("hdfs://...") errors = lines.filter(lambda line: "ERROR" in line)

    # Count all the errors errors.cache() errors.count() # Count errors mentioning MySQL errors.filter(lambda line: "MySQL" in line).count() # Split and collect errors mentioning HDFS errors.filter(lambda line: "HDFS" in line).map(lambda x: x.split("\t")).collect() [Lineage diagram: lines → filter("ERROR" in line) → errors → filter("HDFS" in line) → HDFS errors]
  50. # Word Count import sys lines = sc.textFile('wordcount.txt') counts =

    lines.flatMap(lambda x: x.split(' ')) \ .map(lambda x: (x, 1)) \ .reduceByKey(lambda x, y: x + y) output = counts.map(lambda x: (x[1], x[0])) \ .sortByKey(False) output.take(5)
  51. TODO - Python

  52. wordsList = ["apple", "banana", "strawberry", "lemon", "apple", "banana", "apple", "apple",

    "apple", "apple", "apple", "lemon", "lemon", "lemon", "banana", "banana", "banana", "banana", "banana", "banana", "apple", "apple", "apple", "apple"] wordsRDD = sc.parallelize(wordsList, 4) # Print out the type of wordsRDD print type(wordsRDD) def makePlural(word): # Adds an 's' to `word` return word + 's' print makePlural('cat') # TODO: Now pass each item in the base RDD into a map() pluralRDD = wordsRDD.<FILL IN> print pluralRDD.collect() # TODO: Let's create the same RDD using a `lambda` function. pluralLambdaRDD = wordsRDD.<FILL IN> print pluralLambdaRDD.collect() # TODO: Now use `map()` and a `lambda` function to return the number of characters in each word. pluralLengths = (pluralRDD.<FILL IN>) print pluralLengths # TODO: The next step in writing our word counting program is to create a new type of RDD, called a pair RDD # Hint: # We can create the pair RDD using the `map()` transformation with a `lambda()` function to create a new RDD. wordPairs = wordsRDD.<FILL IN> print wordPairs.collect()
  53. # TODO: Use `groupByKey()` to generate a pair RDD of

    type `('word', iterator)` wordsGrouped = wordPairs.<FILL IN> for key, value in wordsGrouped.collect(): print '{0}: {1}'.format(key, list(value)) # TODO: Use `mapValues()` to obtain the counts wordCountsGrouped = wordsGrouped.<FILL IN> print wordCountsGrouped.collect() # TODO: Counting using `reduceByKey` wordCounts = wordPairs.<FILL IN> print wordCounts.collect() # TODO: put all together wordCountsCollected = (wordsRDD.<FILL IN>) print wordCountsCollected https://github.com/wlsherica/StarkTechnology/tree/master
  54. Best Practice

  55. # Avoid GroupByKey

    [Diagram: counting (A,1) and (B,1) pairs, where A and B stand for the two keys in the original (甲, 乙). With reduceByKey, pairs are summed within each partition first, so only small partial sums are shuffled before the final merge; with groupByKey, every individual (key, 1) pair is shuffled across the network before anything is counted.]
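A small PySpark comparison of the two approaches (illustrative data; both return the same counts, but reduceByKey combines values on each partition before the shuffle):

    pairs = sc.parallelize(["apple", "banana", "apple"]).map(lambda w: (w, 1))

    # Preferred: partial sums are computed locally, so less data is shuffled
    print pairs.reduceByKey(lambda a, b: a + b).collect()

    # Works, but every single (word, 1) pair crosses the network first
    print pairs.groupByKey().mapValues(sum).collect()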
  56. # Don’t copy all elements to the driver • Scala val

    values = myLargeDataRDD.collect() • Prefer alternatives: take(), sample(), countByValue(), countByKey(), collectAsMap(), saving to a file, or filtering/sampling first
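The same idea in Python, as a hedged sketch (myLargeDataRDD and the output path are placeholders):

    # Instead of myLargeDataRDD.collect(), bring back only what the driver needs
    print myLargeDataRDD.take(10)                        # first 10 elements
    print myLargeDataRDD.sample(False, 0.01).collect()   # ~1% sample, without replacement
    print myLargeDataRDD.countByValue()                  # small dict of value -> count
    myLargeDataRDD.saveAsTextFile("hdfs:///tmp/output")  # or write results out instead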
  57. # Bad input data • Python input_rdd = sc.parallelize(["{\"value\": 1}",

    # Good "bad_json", # Bad "{\"value\": 2}", # Good "{\"value\": 3" # Missing brace. ]) sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
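One defensive pattern, sketched with Python's standard json module (not part of the original slide): filter out records that fail to parse before handing the RDD to jsonRDD.

    import json

    def is_valid_json(record):
        # Keep only records that parse as JSON
        try:
            json.loads(record)
            return True
        except ValueError:
            return False

    clean_rdd = input_rdd.filter(is_valid_json)
    sqlContext.jsonRDD(clean_rdd).registerTempTable("valueTable")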
  58. # Number of data partitions? • Spark application UI •

    Inspect it programmatically yourRDD.partitions.size #scala yourRDD.getNumPartitions() #python
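If the partition count needs adjusting, a quick sketch (the numbers are arbitrary; repartition shuffles, coalesce only merges):

    rdd = sc.textFile("wordcount.txt")
    print rdd.getNumPartitions()        # whatever the input splits produced

    wider = rdd.repartition(8)          # full shuffle into 8 partitions
    narrower = wider.coalesce(2)        # merge down to 2 partitions without a full shuffle
    print wider.getNumPartitions(), narrower.getNumPartitions()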
  59. 5 Things We Hate about Spark

  60. • Memory issue • The small files problem • Spark

    streaming • Python? • Random crazy errors http://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html
  61. Review

  62. • Component ◦ Driver, Master, and Worker • Spark mode

    • RDD operations ◦ Transformations ◦ Actions • Performance?
  63. None