Slide 1

Slide 1 text

Spark Introduction Erica Li

Slide 2

Slide 2 text

Erica Li ● shrimp_li ● ericalitw ● Data Scientist ● NPO side project ● Girls in Tech Taiwan ● Taiwan Spark User Group

Slide 3

Slide 3 text

https://github.com/wlsherica

Slide 4

Slide 4 text

Agenda ● What is Spark ● Hadoop vs. Spark ● Spark Features ● Spark Ecosystem ● Spark Architecture ● Resilient Distributed Datasets ● Installation

Slide 5

Slide 5 text

What is Spark

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

What’s Spark (architecture diagram): a Driver Program with a SparkContext talks to the Cluster Master, which schedules Tasks on Executors running on Worker Nodes (each with a Cache); data is read from a File System (HDFS, S3, Local, etc.)

Slide 8

Slide 8 text

Spark Data Computation Flow (diagram): data from a File System (HDFS, S3, Local, etc.) is loaded into RDD1, which is split into partitions; Transformation1 through TransformationN produce RDD2, RDD3, and so on, and Actions on the final RDD produce the Result

Slide 9

Slide 9 text

Hadoop vs. Spark

Slide 10

Slide 10 text

http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

Slide 11

Slide 11 text

● Hadoop ○ A full-stack MPP system combining big data storage (HDFS) with a parallel execution model (MapReduce) ● Spark ○ An open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers

Slide 12

Slide 12 text

HDFS (diagram): a File is split into Blocks stored across the DataNodes of an HDFS Cluster, while the NameNode keeps the metadata that maps files to blocks

Slide 13

Slide 13 text

MapReduce (word count diagram): the input "Deer Bear River / Car Car River / Deer Car Bear" goes through Splitting, Mapping to (word, 1) pairs, Shuffling to group pairs by key, and Reducing to the final counts (Bear, 2), (Car, 3), (Deer, 2), (River, 2). http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/

Slide 14

Slide 14 text

Spark vs. Hadoop MapReduce: Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. 2014 Sort Benchmark Competition (machines / time): Hadoop MapReduce used 2100 machines and took 72 minutes; Spark used 207 machines and took 23 minutes. http://spark.apache.org/

Slide 15

Slide 15 text

Spark vs. Hadoop MapReduce (diagram): each MapReduce iteration writes its output to HDFS and the next iteration reads it back (HDFS read, iterate, HDFS write, repeated), while Spark reads the input once and keeps intermediate data in memory between iterations

Slide 16

Slide 16 text

Spark Features

Slide 17

Slide 17 text

Spark Features ● Written in Scala ● Runs on the JVM ● Takes MapReduce to the next level ● In-memory data storage ● Near real-time processing ● Lazy evaluation of queries
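A minimal PySpark sketch of the lazy evaluation point above, assuming an active SparkContext sc and the wordcount.txt file used later in the deck: the transformations only record lineage, and nothing runs until the action.

lines = sc.textFile("wordcount.txt")              # nothing is read yet
longLines = lines.filter(lambda l: len(l) > 80)   # still just a recorded transformation
print longLines.count()                           # the action triggers the actual job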

Slide 18

Slide 18 text

Available APIs ● Spark (v1.4+) currently supports the following languages for development: Scala, Java, Python, and R

Slide 19

Slide 19 text

Spark Ecosystem

Slide 20

Slide 20 text

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Slide 21

Slide 21 text

Spark Architecture

Slide 22

Slide 22 text

Components ● Data storage ○ HDFS or any Hadoop-compatible storage ● API ○ Scala, Python, Java, R ● Management framework ○ Standalone ○ Mesos ○ YARN (stack diagram: distributed computing on top of the API layer (Scala, Python, Java) on top of storage such as HDFS)

Slide 23

Slide 23 text

Cluster Objects (diagram): the Driver Program’s SparkContext connects to a Cluster Manager master (Standalone, YARN, or Mesos), which allocates Worker Nodes

Slide 24

Slide 24 text

Cluster processing (diagram): the Driver Program’s SparkContext registers with the Cluster Manager (Master), which assigns Worker Nodes; each Worker Node runs an Executor with a Cache that executes Tasks, and the application code (*.jar or *.py) is shipped to the executors. http://spark.apache.org/

Slide 25

Slide 25 text

Client Mode (default) (diagram): the application is submitted from the client, and the Driver with its SparkContext runs on the client machine itself, talking to the Master; Worker Nodes run Executors with Caches and Tasks, and the application code (*.jar or *.py) is distributed to them. http://spark.apache.org/

Slide 26

Slide 26 text

Cluster Mode (diagram): the client submits the application to the Master, and the Driver is launched on one of the Worker Nodes inside the cluster; the remaining Worker Nodes run Executors with Caches and Tasks. http://spark.apache.org/

Slide 27

Slide 27 text

Spark on YARN (diagram): the Spark YARN Client requests resources from the Resource Manager (1), which assigns (2) and invokes (3) an Application Master on a Node Manager; the Application Master, hosting the SparkContext, DAG Scheduler, and YarnClusterScheduler, applies for Containers (5), which are assigned (6) and invoked on Node Managers as ExecutorBackends running Executors

Slide 28

Slide 28 text

How to Install Spark

Slide 29

Slide 29 text

Training Materials ● Cloudera VM ○ Cloudera CDH 5.4 ● Spark 1.3.0 ● 64-bit host OS ● 4 GB RAM ● VMware, KVM, or VirtualBox

Slide 30

Slide 30 text

Cloudera Quick Start VM ● CDH 5 and Cloudera Manager 5 ● Account ○ username: cloudera ○ password: cloudera ● The root account password is also cloudera

Slide 31

Slide 31 text

Spark Installation
● Downloading
wget http://www.apache.org/dyn/closer.lua/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
● Extract it, then cd into the directory
tar zxvf spark-1.4.1-bin-hadoop2.6.tgz
cd spark-1.4.1-bin-hadoop2.6

Slide 32

Slide 32 text

Let’s do it

Slide 33

Slide 33 text

Spark Shell
● Scala
./bin/spark-shell --master local[4]
./bin/spark-shell --master local[4] --jars urcode.jar
● Python
./bin/pyspark --master local[4]
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark

Slide 34

Slide 34 text

Initializing SparkContext
● Scala
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
● Python
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
appName is a name for your application, shown on the cluster UI; master is a master URL (see the next slide).
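For reference, a complete minimal script assembled from the Python lines above; the app name "MyApp" and local[4] master are placeholder values, not from the slide.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[4]")
sc = SparkContext(conf=conf)
print sc.parallelize(range(10)).sum()   # quick action to confirm the context works
sc.stop()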

Slide 35

Slide 35 text

master URLs ● local, local[K], local[*] ● spark://HOST:PORT ● mesos://HOST:PORT ● yarn-client ● yarn-cluster
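As a hedged sketch, the same master URLs passed to spark-submit, in the deck's own shell style; app.py and the host names are placeholders.

./bin/spark-submit --master local[4] app.py             # local, 4 worker threads
./bin/spark-submit --master spark://host:7077 app.py    # standalone cluster
./bin/spark-submit --master mesos://host:5050 app.py    # Mesos cluster
./bin/spark-submit --master yarn-client app.py          # YARN, driver on the client
./bin/spark-submit --master yarn-cluster app.py         # YARN, driver inside the cluster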

Slide 36

Slide 36 text

Which one is better?

Slide 37

Slide 37 text

Spark Standalone Mode ● Launch standalone cluster ○ master ○ slaves ○ public key ● How to launch?
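One possible launch sequence, sketched under the assumption of a Spark 1.4 layout and passwordless SSH from the master to the slaves; host names are placeholders.

# On the master node
./sbin/start-master.sh                          # master listens on spark://master-host:7077
# On each worker (or list hosts in conf/slaves and run ./sbin/start-slaves.sh)
./sbin/start-slave.sh spark://master-host:7077
# Point a shell at the standalone cluster
./bin/spark-shell --master spark://master-host:7077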

Slide 38

Slide 38 text

Resilient Distributed Datasets

Slide 39

Slide 39 text

What’s RDD
“The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.” (diagram: the RDD is split into Partition1, Partition2, ...)

Slide 40

Slide 40 text

RDD operations
● Transformations: map(func), flatMap(func), filter(func), groupByKey(), reduceByKey(), union(other), sortByKey(), ...
● Actions: reduce(func), collect(), first(), take(n), saveAsTextFile(path), countByKey(), ...
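A small PySpark illustration of chaining a few of the transformations above and finishing with actions, assuming an active SparkContext sc.

nums = sc.parallelize([3, 1, 2, 3])
doubled = nums.map(lambda x: x * 2).filter(lambda x: x > 2)   # transformations: lazy
print doubled.collect()                    # action: [6, 4, 6]
print doubled.reduce(lambda a, b: a + b)   # action: 16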

Slide 41

Slide 41 text

How to Create RDD
● Scala
val rddStr: RDD[String] = sc.textFile("hdfs://..")
val rddInt: RDD[Int] = sc.parallelize(1 to 100)
● Python
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

Slide 42

Slide 42 text

Key-Value RDD
lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
● mapValues
● groupByKey
● reduceByKey
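A brief sketch of the three listed operations applied to the pairs RDD from the snippet above.

incremented = pairs.mapValues(lambda v: v + 1)    # transform only the values, keys untouched
grouped = pairs.groupByKey().mapValues(list)      # (line, [1, 1, ...]) per key
summed = pairs.reduceByKey(lambda a, b: a + b)    # same result as counts above
print grouped.take(3)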

Slide 43

Slide 43 text

Narrow Dependencies map, filter

Slide 44

Slide 44 text

Narrow Dependencies union

Slide 45

Slide 45 text

Wide Dependencies groupByKey, reduceByKey

Slide 46

Slide 46 text

DAG stages (diagram): RDDs A through G connected by groupBy, map, union, and join operations are divided into Stage 1, Stage 2, and Stage 3, with stage boundaries at the wide (shuffle) dependencies

Slide 47

Slide 47 text

Cache
● RDD persistence
● Caching is a key tool for iterative algorithms and fast interactive use
● Usage
yourRDD.cache()
yourRDD.persist().is_cached
yourRDD.unpersist()
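A sketch of persistence in use, assuming sc and the wordcount.txt file used elsewhere in the deck; StorageLevel is the standard pyspark class, and the filter term is arbitrary.

from pyspark import StorageLevel

logs = sc.textFile("wordcount.txt")
logs.persist(StorageLevel.MEMORY_AND_DISK)          # cache() is shorthand for MEMORY_ONLY
print logs.count()                                  # first action reads the file and fills the cache
print logs.filter(lambda l: "apple" in l).count()   # second action reuses the cached data
logs.unpersist()                                    # release the storage when finished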

Slide 48

Slide 48 text

Shared Variables
● Broadcast variables
broadcastVar = sc.broadcast([100, 200, 300])
broadcastVar.value
● Accumulators (tasks add to them; only the driver can read the value)
accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value

Slide 49

Slide 49 text

Fault Tolerance
lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda line: "ERROR" in line)
errors.cache()
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
errors.filter(lambda line: "HDFS" in line) \
      .map(lambda x: x.split("\t")) \
      .collect()
(lineage diagram: lines, filtered on "ERROR", gives errors; errors, filtered on "HDFS", gives the HDFS errors; lost partitions can be recomputed from this lineage)

Slide 50

Slide 50 text

# Word Count
import sys
lines = sc.textFile('wordcount.txt')
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
output = counts.map(lambda x: (x[1], x[0])) \
               .sortByKey(False)
output.take(5)

Slide 51

Slide 51 text

TODO - Python

Slide 52

Slide 52 text

wordsList = ["apple", "banana", "strawberry", "lemon", "apple", "banana", "apple", "apple", "apple", "apple", "apple", "lemon", "lemon", "lemon", "banana", "banana", "banana", "banana", "banana", "banana", "apple", "apple", "apple", "apple"] wordsRDD = sc.parallelize(wordsList, 4) # Print out the type of wordsRDD print type(wordsRDD) def makePlural(word): # Adds an 's' to `word` return word + 's' print makePlural('cat') # TODO: Now pass each item in the base RDD into a map() pluralRDD = wordsRDD. print pluralRDD.collect() # TODO: Let's create the same RDD using a `lambda` function. pluralLambdaRDD = wordsRDD. print pluralLambdaRDD.collect() # TODO: Now use `map()` and a `lambda` function to return the number of characters in each word. pluralLengths = (pluralRDD.) print pluralLengths # TODO: The next step in writing our word counting program is to create a new type of RDD, called a pair RDD # Hint: # We can create the pair RDD using the `map()` transformation with a `lambda()` function to create a new RDD. wordPairs = wordsRDD. print wordPairs.collect()

Slide 53

Slide 53 text

# TODO: Use `groupByKey()` to generate a pair RDD of type `('word', iterator)`
wordsGrouped = wordPairs.
for key, value in wordsGrouped.collect():
    print '{0}: {1}'.format(key, list(value))

# TODO: Use `mapValues()` to obtain the counts
wordCountsGrouped = wordsGrouped.
print wordCountsGrouped.collect()

# TODO: Counting using `reduceByKey`
wordCounts = wordPairs.
print wordCounts.collect()

# TODO: Put it all together
wordCountsCollected = (wordsRDD.)
print wordCountsCollected

https://github.com/wlsherica/StarkTechnology/tree/master

Slide 54

Slide 54 text

Best Practice

Slide 55

Slide 55 text

# Avoid GroupByKey (diagram, keys translated from 甲/乙 to A/B): with reduceByKey, the (A,1) and (B,1) pairs are combined on each partition first, e.g. (A,1)+(A,1) gives (A,2) and (A,2)+(A,1) gives (A,3), so only the partial sums are shuffled; with groupByKey, every individual (A,1) and (B,1) pair is shuffled across the network before being summed to the same final (A,3) and (B,4)
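A hedged sketch of the two variants the diagram compares, with keys A and B standing in for the original 甲/乙 and sc assumed.

words = sc.parallelize(["A", "A", "B", "A", "B", "B", "B"]).map(lambda w: (w, 1))
# reduceByKey: partial sums per partition, then shuffle only the partials
counts = words.reduceByKey(lambda a, b: a + b)
# groupByKey: every (key, 1) pair crosses the network before being summed
countsSlow = words.groupByKey().mapValues(lambda vals: sum(vals))
print counts.collect()       # e.g. [('A', 3), ('B', 4)] (order may vary)
print countsSlow.collect()   # same result, more shuffle traffic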

Slide 56

Slide 56 text

# Don’t copy all elements to the driver
● Scala
val values = myLargeDataRDD.collect()   // avoid this on large RDDs
● Prefer instead: take(), sample(), countByValue(), countByKey(), collectAsMap(), saving to a file, or filtering/sampling first
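A sketch of the listed alternatives in Python, reusing the myLargeDataRDD name from the slide; the HDFS path is a placeholder.

print myLargeDataRDD.take(10)                       # first 10 elements only
print myLargeDataRDD.sample(False, 0.01).collect()  # roughly 1% random sample
print myLargeDataRDD.countByValue()                 # aggregation happens on the cluster
myLargeDataRDD.saveAsTextFile("hdfs://...")         # or write results back to storage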

Slide 57

Slide 57 text

# Bad input data
● Python
input_rdd = sc.parallelize(["{\"value\": 1}",  # Good
                            "bad_json",        # Bad
                            "{\"value\": 2}",  # Good
                            "{\"value\": 3"    # Missing brace.
                            ])
sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
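One hedged way to screen out the malformed records before they reach jsonRDD, using the standard json module; the helper is not part of the original slide.

import json

def is_valid_json(s):
    # Keep only strings that parse cleanly as JSON
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

clean_rdd = input_rdd.filter(is_valid_json)
sqlContext.jsonRDD(clean_rdd).registerTempTable("valueTable")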

Slide 58

Slide 58 text

# Number of data partitions?
● Spark application UI
● Inspect it programmatically
yourRDD.partitions.size        #scala
yourRDD.getNumPartitions()     #python

Slide 59

Slide 59 text

5 Things We Hate about Spark

Slide 60

Slide 60 text

● Memory issue ● The small files problem ● Spark streaming ● Python? ● Random crazy errors http://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html

Slide 61

Slide 61 text

Review

Slide 62

Slide 62 text

● Component ○ Driver, Master, and Worker ● Spark mode ● RDD operations ○ Transformations ○ Actions ● Performance?

Slide 63

Slide 63 text

No content