
Spark Introduction 2015

Erica Li
December 04, 2015


Introduce Spark and its ecosystem to students. In this lecture, you will learn what Spark is, its features, its architecture, resilient distributed datasets (RDDs), how to install Spark, and best practices for using it.


Transcript

  1. Erica Li • shrimp_li • ericalitw • Data Scientist •

    NPO side project • Girls in Tech Taiwan • Taiwan Spark User Group
  2. Agenda • What is Spark • Hadoop vs. Spark •

    Spark Features • Spark Ecosystem • Spark Architecture • Resilient Distributed Datasets • Installation
  3. What’s Spark

     [Architecture diagram: a Driver Program holding a Spark Context talks to a Cluster Master, which schedules work on Worker Nodes; each Worker Node runs an Executor with a cache and multiple tasks, reading data from a file system (HDFS, S3, local, etc.).]
  4. Spark Data Computation Flow

     [Diagram: data is read from the file system (HDFS, S3, local, etc.) into RDD1, which is split into partitions; Transformation1 … TransformationN produce RDD2, RDD3, … (each again partitioned), and actions on the final RDD produce the result.]
  5. • Hadoop

     ◦ A full-stack MPP system combining big data storage (HDFS) with a parallel execution model (MapReduce)
     • Spark
     ◦ An open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers
  6. MapReduce

     Word count example:
     Input: Deer Bear River Car Car River Deer Car Bear
     Splitting: [Deer Bear River] [Deer Car Bear] [Car Car River]
     Mapping: (Deer, 1) (Bear, 1) (River, 1) / (Deer, 1) (Car, 1) (Bear, 1) / (Car, 1) (Car, 1) (River, 1)
     Shuffling: (Bear, 1) (Bear, 1) / (Car, 1) (Car, 1) (Car, 1) / (Deer, 1) (Deer, 1) / (River, 1) (River, 1)
     Reducing: (Bear, 2) (Car, 3) (Deer, 2) (River, 2)
     http://www.alex-hanna.com/tworkshops/lesson-5-hadoop-and-mapreduce/
  7. Spark vs. Hadoop MapReduce

     Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
     2014 Sort Benchmark Competition:
       Hadoop MapReduce: 2100 machines, 72 minutes
       Spark: 207 machines, 23 minutes
     http://spark.apache.org/
  8. Spark vs. Hadoop MapReduce

     [Diagram: a MapReduce pipeline writes to and reads from HDFS between every stage (HDFS read → compute → HDFS write → HDFS read → compute → HDFS write), while Spark keeps the intermediate iterators in memory between stages.]
  9. Spark Features

     • Written in Scala
     • Runs on the JVM
     • Takes MapReduce to the next level
     • In-memory data storage
     • Near real-time processing
     • Lazy evaluation of queries
  10. Components

     • Data storage
     ◦ HDFS, Hadoop-compatible storage
     • API
     ◦ Scala, Python, Java, R
     • Management framework
     ◦ Standalone
     ◦ Mesos
     ◦ YARN
     [Layer diagram: Distributed computing / API (Scala, Python, Java) / Storage (HDFS, etc.)]
  11. Cluster Object

     [Diagram: a Driver Program holding a SparkContext connects to a Cluster Manager (a Standalone, YARN, or Mesos master), which manages the Worker Nodes.]
  12. Cluster processing

     [Diagram: the Driver Program's SparkContext ships the application code (*.jar, *.py) through the Cluster Manager (Master) to each Worker Node, where an Executor with a cache runs the tasks. Source: http://spark.apache.org/]
  13. Client Mode (default)

     [Diagram: the application is submitted from the client, and the driver (with its SparkContext) runs inside the client process; the Master distributes the *.jar / *.py code to Executors on the Worker Nodes, which run the tasks. Source: http://spark.apache.org/]
  14. Cluster Mode

     [Diagram: the client submits the application to the Master, which launches the driver on one of the Worker Nodes; Executors on the other Worker Nodes run the tasks. Source: http://spark.apache.org/]
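     The two modes differ only in how the application is submitted; a minimal spark-submit sketch (the class name, master URL, and jar name are placeholders, not from the deck):

        # Client mode (default): the driver runs in the submitting process
        ./bin/spark-submit --class com.example.MyApp --master spark://master:7077 --deploy-mode client myapp.jar

        # Cluster mode: the driver is launched inside the cluster on a worker node
        ./bin/spark-submit --class com.example.MyApp --master spark://master:7077 --deploy-mode cluster myapp.jar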
  15. Spark on YARN

     [Diagram: the Spark YARN Client (1) sends a request to the Resource Manager, which (2) assigns and (3) invokes an Application Master on a Node Manager; the Application Master hosts the Spark Context, DAG Scheduler, and YarnClusterScheduler, then (4) applies for containers, the Resource Manager (5) assigns containers, and (6) ExecutorBackends with their Executors are invoked in containers on the Node Managers.]
  16. Training Materials • Cloudera VM ◦ Cloudera CDH 5.4 •

    Spark 1.3.0 • 64-bit host OS • RAM 4G • VMware, KVM, and VirtualBox
  17. Cloudera Quick Start VM • CDH5 and Cloudera Manager 5

    • Account ◦ username: cloudera ◦ passwd: cloudera • The root account password is cloudera
  18. Spark Shell

     • Scala
       ./bin/spark-shell --master local[4]
       ./bin/spark-shell --master local[4] --jars urcode.jar
     • Python
       ./bin/pyspark --master local[4]
       PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
  19. Initializing SparkContext

     • Scala
       val conf = new SparkConf().setAppName(appName).setMaster(master)
       val sc = new SparkContext(conf)
     • Python
       conf = SparkConf().setAppName(appName).setMaster(master)
       sc = SparkContext(conf=conf)
     appName is the name of your application, shown on the cluster UI.
  20. What’s RDD

     “Spark provides a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.”
     [Diagram: an RDD split into Partition1, Partition2, …]
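     As a concrete illustration of that definition, a minimal PySpark sketch (assuming an existing SparkContext `sc`; the HDFS path is illustrative):

        # An RDD built from a driver-side collection, split into 4 partitions
        nums = sc.parallelize(range(1, 101), 4)

        # An RDD built from a file in a Hadoop-supported file system
        lines = sc.textFile("hdfs://namenode:9000/data.txt")

        # Persist in memory so later parallel operations can reuse it
        nums.persist()
        print nums.getNumPartitions()   # 4
        print nums.sum()                # 5050, computed across the partitions in parallel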
  21. RDD operations

     • Transformations: map(func), flatMap(func), filter(func), groupByKey(), reduceByKey(), union(other), sortByKey(), ...
     • Actions: reduce(func), collect(), first(), take(n), saveAsTextFile(path), countByKey(), ...
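     A small sketch of that split (assuming `sc`): transformations only build up the lineage, and nothing runs until an action is called.

        rdd = sc.parallelize(["a", "b", "a", "c"])

        # Transformations are lazy: these return new RDDs but do no work yet
        pairs = rdd.map(lambda w: (w, 1))
        filtered = pairs.filter(lambda kv: kv[0] != "c")
        counts = filtered.reduceByKey(lambda a, b: a + b)

        # Actions trigger the actual computation
        print counts.collect()   # [('a', 2), ('b', 1)] (order may vary)
        print counts.count()     # 2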
  22. How to Create RDD

     • Scala
       val rddStr: RDD[String] = sc.textFile("hdfs://..")
       val rddInt: RDD[Int] = sc.parallelize(1 to 100)
     • Python
       data = [1, 2, 3, 4, 5]
       distData = sc.parallelize(data)
  23. Key-Value RDD

     lines = sc.textFile("data.txt")
     pairs = lines.map(lambda s: (s, 1))
     counts = pairs.reduceByKey(lambda a, b: a + b)
     • mapValues • groupByKey • reduceByKey
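     The other two pair operations listed above can be sketched the same way (assuming `sc`; the data is illustrative):

        pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

        # mapValues transforms only the value, keeping the key (and partitioning) intact
        doubled = pairs.mapValues(lambda v: v * 2)
        print doubled.collect()   # [('a', 2), ('b', 4), ('a', 6)] (order may vary)

        # groupByKey gathers all values for each key into an iterable
        grouped = pairs.groupByKey().mapValues(list)
        print grouped.collect()   # [('a', [1, 3]), ('b', [2])] (order may vary)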
  24. Cache

     • RDD persistence
     • Caching is a key tool for iterative algorithms and fast interactive use
     • Usage
       yourRDD.cache()
       yourRDD.persist().is_cached
       yourRDD.unpersist()
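     A minimal usage sketch (assuming `sc`; the log path is illustrative): the second action is served from the cached partitions instead of re-reading the file.

        logs = sc.textFile("hdfs://.../app.log")
        errors = logs.filter(lambda line: "ERROR" in line)

        errors.cache()         # keep the filtered RDD in memory after the first use
        print errors.count()   # first action: reads the file and fills the cache
        print errors.count()   # second action: reuses the cached partitions

        errors.unpersist()     # release the cached data when done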
  25. Shared Variables

     • Broadcast variables
       broadcastVar = sc.broadcast([100, 200, 300])
       broadcastVar.value
     • Accumulators
       accum = sc.accumulator(0)
       sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
       accum.value
  26. Fault Tolerance

     lines = sc.textFile("hdfs://...")
     errors = lines.filter(lambda line: "ERROR" in line)
     # Count all the errors
     errors.cache()
     errors.count()
     # Count errors mentioning MySQL
     errors.filter(lambda line: "MySQL" in line).count()
     errors.filter(lambda line: "HDFS" in line).map(lambda x: x.split("\t")).collect()
     [Lineage diagram: lines → filter("ERROR" in line) → errors → filter("HDFS" in line) → HDFS errors]
  27. # Word Count

     import sys
     lines = sc.textFile('wordcount.txt')
     counts = lines.flatMap(lambda x: x.split(' ')) \
                   .map(lambda x: (x, 1)) \
                   .reduceByKey(lambda x, y: x + y)
     output = counts.map(lambda x: (x[1], x[0])) \
                    .sortByKey(False)
     output.take(5)
  28. wordsList = ["apple", "banana", "strawberry", "lemon", "apple", "banana", "apple", "apple",

                  "apple", "apple", "apple", "lemon", "lemon", "lemon", "banana", "banana",
                  "banana", "banana", "banana", "banana", "apple", "apple", "apple", "apple"]
     wordsRDD = sc.parallelize(wordsList, 4)
     # Print out the type of wordsRDD
     print type(wordsRDD)

     def makePlural(word):
         # Adds an 's' to `word`
         return word + 's'
     print makePlural('cat')

     # TODO: Now pass each item in the base RDD into a map()
     pluralRDD = wordsRDD.<FILL IN>
     print pluralRDD.collect()

     # TODO: Let's create the same RDD using a `lambda` function.
     pluralLambdaRDD = wordsRDD.<FILL IN>
     print pluralLambdaRDD.collect()

     # TODO: Now use `map()` and a `lambda` function to return the number of characters in each word.
     pluralLengths = (pluralRDD.<FILL IN>)
     print pluralLengths

     # TODO: The next step in writing our word counting program is to create a new type of RDD, called a pair RDD
     # Hint: We can create the pair RDD using the `map()` transformation with a `lambda()` function to create a new RDD.
     wordPairs = wordsRDD.<FILL IN>
     print wordPairs.collect()
  29. # TODO: Use `groupByKey()` to generate a pair RDD of type `('word', iterator)`

     wordsGrouped = wordPairs.<FILL IN>
     for key, value in wordsGrouped.collect():
         print '{0}: {1}'.format(key, list(value))

     # TODO: Use `mapValues()` to obtain the counts
     wordCountsGrouped = wordsGrouped.<FILL IN>
     print wordCountsGrouped.collect()

     # TODO: Counting using `reduceByKey`
     wordCounts = wordPairs.<FILL IN>
     print wordCounts.collect()

     # TODO: put all together
     wordCountsCollected = (wordsRDD.<FILL IN>)
     print wordCountsCollected

     https://github.com/wlsherica/StarkTechnology/tree/master
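     Again, one possible set of completions (a sketch, not the only valid answers):

        wordsGrouped = wordPairs.groupByKey()
        for key, value in wordsGrouped.collect():
            print '{0}: {1}'.format(key, list(value))

        wordCountsGrouped = wordsGrouped.mapValues(lambda values: sum(values))
        print wordCountsGrouped.collect()

        wordCounts = wordPairs.reduceByKey(lambda a, b: a + b)
        print wordCounts.collect()

        wordCountsCollected = (wordsRDD
                               .map(lambda word: (word, 1))
                               .reduceByKey(lambda a, b: a + b)
                               .collect())
        print wordCountsCollected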
  30. # Avoid GroupByKey

     [Diagram: counting keys A and B with reduceByKey vs. groupByKey. With reduceByKey, the (A,1) and (B,1) pairs are summed inside each partition first, so only the per-partition partial sums cross the network before the final merge. With groupByKey, every individual (A,1) / (B,1) pair is shuffled across the network before the values are combined.]
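     The difference as code (assuming `sc`): both produce the same counts, but reduceByKey combines values on each partition before the shuffle, while groupByKey ships every (key, 1) pair across the network first.

        words = sc.parallelize(["A", "A", "B", "A", "B"], 2)
        pairs = words.map(lambda w: (w, 1))

        # Preferred: partial sums are computed locally, then merged after the shuffle
        counts_reduce = pairs.reduceByKey(lambda a, b: a + b)

        # Works, but shuffles every single pair before summing on the reduce side
        counts_group = pairs.groupByKey().mapValues(lambda ones: sum(ones))

        print counts_reduce.collect()   # [('A', 3), ('B', 2)] (order may vary)
        print counts_group.collect()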
  31. # Don’t copy all elements to driver

     • Scala
       val values = myLargeDataRDD.collect()   // avoid this on large RDDs
     • Alternatives: take(), sample(), countByValue(), countByKey(), collectAsMap(), save as file, filtering/sampling
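     A sketch of those alternatives in Python, for consistency with the other examples (assuming `sc`; the output path is illustrative):

        bigRDD = sc.parallelize(range(100000))

        # Bad: pulls every element into the driver's memory
        # values = bigRDD.collect()

        # Better: inspect a small slice or a sample
        print bigRDD.take(10)
        print bigRDD.sample(False, 0.001).take(10)

        # Or aggregate on the cluster and bring back only a summary
        print bigRDD.count()

        # Or write the full result out instead of collecting it
        bigRDD.saveAsTextFile("hdfs://.../big_output")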
  32. # Bad input data

     • Python
       input_rdd = sc.parallelize(["{\"value\": 1}",  # Good
                                   "bad_json",        # Bad
                                   "{\"value\": 2}",  # Good
                                   "{\"value\": 3"    # Missing brace.
                                   ])
       sqlContext.jsonRDD(input_rdd).registerTempTable("valueTable")
  33. # Number of data partitions?

     • Spark application UI
     • Inspect it programmatically
       yourRDD.partitions.size        # Scala
       yourRDD.getNumPartitions()     # Python
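     Following on from that check, repartition() and coalesce() change the partition count when the default is a poor fit (a Python sketch assuming `sc`; the path is illustrative):

        rdd = sc.textFile("hdfs://.../data.txt")
        print rdd.getNumPartitions()

        # Increase parallelism (triggers a full shuffle)
        more = rdd.repartition(16)

        # Decrease the number of partitions, avoiding a full shuffle
        fewer = rdd.coalesce(2)

        print more.getNumPartitions(), fewer.getNumPartitions()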
  34. • Memory issue • The small files problem • Spark

    streaming • Python? • Random crazy errors http://www.infoworld.com/article/3004460/application-development/5-things-we-hate-about-spark.html
  35. • Component ◦ Driver, Master, and Worker • Spark mode

    • RDD operations ◦ Transformations ◦ Actions • Performance?