
Spark Introduction 2015

Erica Li
December 04, 2015


Introduces Spark and its ecosystem to students. In this lecture, you will learn what Spark is, its features and architecture, RDDs, how to install Spark, and best practices for using it.


Transcript

  1. Spark Introduction Erica Li

  2. Erica Li • shrimp_li • ericalitw • Data Scientist •

    NPO side project • Girls in Tech Taiwan • Taiwan Spark User Group
  3. https://github.com/wlsherica

  4. Agenda • What is Spark • Hadoop vs. Spark •

    Spark Features • Spark Ecosystem • Spark Architecture • Resilient Distributed Datasets • Installation
  5. What is Spark

  6. None
  7. What’s Spark

    [Architecture diagram: a Driver Program with a SparkContext connects through a Cluster Master to several Worker Nodes; each worker runs an Executor with a cache and multiple tasks, reading data from a file system such as HDFS, S3, or local storage.]
  8. Spark Data Computation Flow

    [Diagram: data is read from a file system (HDFS, S3, local, etc.) into RDD1, which is split into partitions; a chain of transformations (Transformation1 … TransformationN) produces RDD2, RDD3, and so on, and actions on the final RDD produce the result.]
  9. Hadoop vs. Spark

  10. http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

  11. • Hadoop ◦ A full-stack MPP system with both

    distributed storage (HDFS) and a parallel execution model (MapReduce) • Spark ◦ An open-source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers
  12. HDFS

    [Diagram: an HDFS cluster with a NameNode and several DataNodes; a file is split into blocks that are distributed across the DataNodes.]
  13. MapReduce

    [Diagram: the classic word-count flow. The input lines "Deer Bear River", "Car Car River", "Deer Car Bear" are split, each word is mapped to a (word, 1) pair, the pairs are shuffled so equal keys end up together, and the reducers sum them to Bear 2, Car 3, Deer 2, River 2.] http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/
  14. Spark vs. Hadoop MapReduce Run programs up to 100x faster

    than Hadoop MapReduce in memory, or 10x faster on disk. 2014 Sort Benchmark Competition (machines / time): Hadoop MapReduce 2,100 machines, 72 minutes; Spark 207 machines, 23 minutes. http://spark.apache.org/
  15. Spark vs. Hadoop MapReduce

    [Diagram: for iterative jobs, MapReduce reads its input from HDFS and writes intermediate results back to HDFS on every pass, while Spark keeps the intermediate data in memory between iterations.]
  16. Spark Features

  17. Spark Features • Written in Scala • Runs on the JVM

    • Takes MapReduce to the next level • In-memory data storage • Near real-time processing • Lazy evaluation of queries
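To make the last bullet concrete, here is a minimal PySpark sketch of lazy evaluation (not from the deck; it assumes a SparkContext named sc, as created on the later slides): transformations only record lineage, and nothing is computed until an action runs.

    # Transformations are lazy; only the action at the end triggers a job
    rdd = sc.parallelize(range(1, 1001))           # nothing is computed yet
    squares = rdd.map(lambda x: x * x)             # still nothing: map() is a transformation
    evens = squares.filter(lambda x: x % 2 == 0)   # transformations just extend the lineage
    print evens.count()                            # count() is an action: the job runs here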
  18. Available APIs • Spark (v1.4+) currently supports the following

    languages for application development: Scala, Java, Python, and R
  19. Spark Ecosystem

  20. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

  21. Spark Architecture

  22. Components • Data storage ◦ HDFS or any Hadoop-compatible storage • API

    ◦ Scala, Python, Java, R • Management framework ◦ Standalone ◦ Mesos ◦ YARN [Diagram: a stack of three layers: distributed computing, the API (Scala, Python, Java), and storage (HDFS, etc.)]
  23. Cluster Object

    [Diagram: the Driver Program (SparkContext) talks to a Cluster Manager / Master (Standalone, YARN, or Mesos), which allocates Worker Nodes for the application.]
  24. Cluster processing

    [Diagram: the Driver Program (SparkContext) submits application code (*.jar / *.py) through the Cluster Manager (Master); each Worker Node runs an Executor with a cache that executes the tasks.] http://spark.apache.org/
  25. Client Mode (default)

    [Diagram: the driver runs inside the client that submits the application; its SparkContext registers with the Master, and the application code (*.jar / *.py) is shipped to Executors on the Worker Nodes, which run the tasks.] http://spark.apache.org/
  26. Cluster Mode

    [Diagram: the client only submits the application; the Master launches the driver (SparkContext) on one of the Worker Nodes, and the remaining workers run Executors with their caches and tasks.] http://spark.apache.org/
  27. Spark on YARN

    [Diagram: the Spark YARN client requests resources from the Resource Manager, which assigns and invokes an Application Master on a Node Manager; the Application Master hosts the SparkContext with its DAG scheduler and YarnClusterScheduler, applies for containers, and the assigned containers start ExecutorBackends that run the Executors.]
  28. How to Install Spark

  29. Training Materials • Cloudera VM ◦ Cloudera CDH 5.4 •

    Spark 1.3.0 • 64-bit host OS • 4 GB RAM • VMware, KVM, and VirtualBox
  30. Cloudera Quick Start VM • CDH5 and Cloudera Manager 5

    • Account ◦ username: cloudera ◦ passwd: cloudera • The root account password is cloudera
  31. Spark Installation • Downloading wget http://www.apache.org/dyn/closer.lua/spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz • Untar it,

    then cd into it tar zxvf spark-1.4.1-bin-hadoop2.6.tgz cd spark-1.4.1-bin-hadoop2.6
  32. Let’s do it

  33. Spark Shell • Scala ./bin/spark-shell --master local[4] ./bin/spark-shell --master local[4]

    --jars urcode.jar • Python ./bin/pyspark --master local[4] PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
  34. Initializing SparkContext • Scala val conf = new SparkConf().setAppName(appName) .setMaster(master)

    val sc = new SparkContext(conf) • Python conf = SparkConf().setAppName(appName).setMaster(master) sc = SparkContext(conf=conf) A name of your application to show on cluster UI
  35. master URLs • local, local[K], local[*] • spark://HOST:PORT • mesos://HOST:PORT

    • yarn-client • yarn-cluster
  36. Which one is better?

  37. Spark Standalone Mode • Launch standalone cluster ◦ master ◦

    slaves ◦ public key • How to launch?
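To answer the "How to launch?" question, a rough sketch using the scripts shipped in Spark's sbin/ directory (host name and port are placeholders; it assumes passwordless SSH from the master to each slave, which is the public-key point above):

    # On the master machine
    ./sbin/start-master.sh                              # web UI on port 8080 by default

    # On each worker, or list workers in conf/slaves and run ./sbin/start-slaves.sh
    ./sbin/start-slave.sh spark://master-host:7077      # master-host is a placeholder

    # Point a shell or application at the cluster
    ./bin/spark-shell --master spark://master-host:7077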
  38. Resilient Distributed Datasets

  39. What’s RDD

    “The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.” [Diagram: an RDD split into Partition1, Partition2, …]
  40. RDD operations • Transformations map(func) flatMap(func) filter(func) groupByKey() reduceByKey() union(other)

    sortByKey() ... • Actions reduce(func) collect() first() take(n) saveAsTextFile(path) countByKey() ...
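A short PySpark sketch chaining a few of the operations listed above (assumes sc; the data is made up): the transformations build new RDDs, and the actions return values to the driver.

    nums = sc.parallelize([3, 1, 2, 3])
    doubled = nums.map(lambda x: x * 2)     # transformation: [6, 2, 4, 6]
    big = doubled.filter(lambda x: x > 2)   # transformation: [6, 4, 6]
    print big.collect()                     # action: returns [6, 4, 6] to the driver
    print big.reduce(lambda a, b: a + b)    # action: 16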
  41. How to Create RDD • Scala val rddStr:RDD[String] = sc.textFile("hdfs://..")

    val rddInt:RDD[Int] = sc.parallelize(1 to 100) • Python data = [1, 2, 3, 4, 5] distData = sc.parallelize(data)
  42. Key-Value RDD lines = sc.textFile("data.txt") pairs = lines.map(lambda s: (s,

    1)) counts = pairs.reduceByKey(lambda a, b: a + b) • mapValues • groupByKey • reduceByKey
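The three pair-RDD operations listed above, in a small illustrative sketch (assumes sc; result order may vary across partitions):

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print pairs.mapValues(lambda v: v * 10).collect()      # [('a', 10), ('b', 20), ('a', 30)]
    print pairs.reduceByKey(lambda a, b: a + b).collect()  # [('a', 4), ('b', 2)]
    grouped = pairs.groupByKey().mapValues(list)
    print grouped.collect()                                # [('a', [1, 3]), ('b', [2])]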
  43. Narrow Dependencies map, filter

  44. Narrow Dependencies union

  45. Wide Dependencies groupByKey, reduceByKey

  46. Stages

    [Diagram: a DAG of RDDs A through G. Stage 1: A → groupBy → B. Stage 2: C → map → D, then D and E → union → F. Stage 3: B and F → join → G. The wide dependencies (groupBy, join) mark the stage boundaries.]
  47. Cache • RDD persistence • Caching is a key tool

    for iterative algorithms and fast interactive use • Usage yourRDD.cache() yourRDD.persist().is_cached yourRDD.unpersist()
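A small sketch of the usage above with an explicit storage level (the file name reuses wordcount.txt from the later word-count slide; the storage level choice is just an example):

    from pyspark import StorageLevel

    words = sc.textFile("wordcount.txt").flatMap(lambda line: line.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions around after the first action
    print words.count()                           # first action: computes and caches the RDD
    print words.distinct().count()                # second action: reuses the cached partitions
    words.unpersist()                             # free the cache when done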
  48. Shared Variables • Broadcast variables broadcastVar = sc.broadcast([100, 200, 300])

    broadcastVar.value • Accumulators accum = sc.accumulator(0) sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x)) accum.value
  49. Fault Tolerance lines = sc.textFile("hdfs://...") errors = lines.filter(lambda line: "ERROR" in line)

    # Count all the errors errors.cache() errors.count() # Count errors mentioning MySQL errors.filter(lambda line: "MySQL" in line).count() # Split and collect errors mentioning HDFS errors.filter(lambda line: "HDFS" in line).map(lambda x: x.split("\t")).collect() [Lineage diagram: lines → filter("ERROR" in line) → errors → filter("HDFS" in line) → HDFS errors]
  50. # Word Count import sys lines = sc.textFile('wordcount.txt') counts =

    lines.flatMap(lambda x: x.split(' ')) \ .map(lambda x: (x, 1)) \ .reduceByKey(lambda x, y: x + y) output = counts.map(lambda x: (x[1], x[0])) \ .sortByKey(False) output.take(5)
  51. TODO - Python

  52. wordsList = ["apple", "banana", "strawberry", "lemon", "apple", "banana", "apple", "apple",

    "apple", "apple", "apple", "lemon", "lemon", "lemon", "banana", "banana", "banana", "banana", "banana", "banana", "apple", "apple", "apple", "apple"] wordsRDD = sc.parallelize(wordsList, 4) # Print out the type of wordsRDD print type(wordsRDD) def makePlural(word): # Adds an 's' to `word` return word + 's' print makePlural('cat') # TODO: Now pass each item in the base RDD into a map() pluralRDD = wordsRDD.<FILL IN> print pluralRDD.collect() # TODO: Let's create the same RDD using a `lambda` function. pluralLambdaRDD = wordsRDD.<FILL IN> print pluralLambdaRDD.collect() # TODO: Now use `map()` and a `lambda` function to return the number of characters in each word. pluralLengths = (pluralRDD.<FILL IN>) print pluralLengths # TODO: The next step in writing our word counting program is to create a new type of RDD, called a pair RDD # Hint: # We can create the pair RDD using the `map()` transformation with a `lambda()` function to create a new RDD. wordPairs = wordsRDD.<FILL IN> print wordPairs.collect()
  53. # TODO: Use `groupByKey()` to generate a pair RDD of

    type `('word', iterator)` wordsGrouped = wordPairs.<FILL IN> for key, value in wordsGrouped.collect(): print '{0}: {1}'.format(key, list(value)) # TODO: Use `mapValues()` to obtain the counts wordCountsGrouped = wordsGrouped.<FILL IN> print wordCountsGrouped.collect() # TODO: Counting using `reduceByKey` wordCounts = wordPairs.<FILL IN> print wordCounts.collect() # TODO: put all together wordCountsCollected = (wordsRDD.<FILL IN>) print wordCountsCollected https://github.com/wlsherica/StarkTechnology/tree/master
  54. Best Practice

  55. # Avoid GroupByKey

    [Diagram: counting (A,1) and (B,1) pairs, where A and B stand for the two keys in the original (甲, 乙). With reduceByKey, pairs are summed within each partition first, so only small partial sums are shuffled before the final merge; with groupByKey, every individual (key, 1) pair is shuffled across the network before anything is counted.]
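A small PySpark comparison of the two approaches (illustrative data; both return the same counts, but reduceByKey combines values on each partition before the shuffle):

    pairs = sc.parallelize(["apple", "banana", "apple"]).map(lambda w: (w, 1))

    # Preferred: partial sums are computed locally, so less data is shuffled
    print pairs.reduceByKey(lambda a, b: a + b).collect()

    # Works, but every single (word, 1) pair crosses the network first
    print pairs.groupByKey().mapValues(sum).collect()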
  56. # Don’t copy all elements to the driver • Scala val

    values = myLargeDataRDD.collect() • Prefer alternatives: take(), sample(), countByValue(), countByKey(), collectAsMap(), saving to a file, or filtering/sampling first
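The same idea in Python, as a hedged sketch (myLargeDataRDD and the output path are placeholders):

    # Instead of myLargeDataRDD.collect(), bring back only what the driver needs
    print myLargeDataRDD.take(10)                        # first 10 elements
    print myLargeDataRDD.sample(False, 0.01).collect()   # ~1% sample, without replacement
    print myLargeDataRDD.countByValue()                  # small dict of value -> count
    myLargeDataRDD.saveAsTextFile("hdfs:///tmp/output")  # or write results out instead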
  57. # Bad input data • Python input_rdd = sc.parallelize(["{\"value\": 1}",

    # Good "bad_json", # Bad "{\"value\": 2}", # Good "{\"value\": 3" # Missing brace. ]) sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
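One defensive pattern, sketched with Python's standard json module (not part of the original slide): filter out records that fail to parse before handing the RDD to jsonRDD.

    import json

    def is_valid_json(record):
        # Keep only records that parse as JSON
        try:
            json.loads(record)
            return True
        except ValueError:
            return False

    clean_rdd = input_rdd.filter(is_valid_json)
    sqlContext.jsonRDD(clean_rdd).registerTempTable("valueTable")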
  58. # Number of data partitions? • Spark application UI •

    Inspect it programmatically yourRDD.partitions.size #scala yourRDD.getNumPartitions() #python
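If the partition count needs adjusting, a quick sketch (the numbers are arbitrary; repartition shuffles, coalesce only merges):

    rdd = sc.textFile("wordcount.txt")
    print rdd.getNumPartitions()        # whatever the input splits produced

    wider = rdd.repartition(8)          # full shuffle into 8 partitions
    narrower = wider.coalesce(2)        # merge down to 2 partitions without a full shuffle
    print wider.getNumPartitions(), narrower.getNumPartitions()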
  59. 5 Things We Hate about Spark

  60. • Memory issue • The small files problem • Spark

    streaming • Python? • Random crazy errors http://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html
  61. Review

  62. • Component ◦ Driver, Master, and Worker • Spark mode

    • RDD operations ◦ Transformations ◦ Actions • Performance?
  63. None