
Spark Introduction 20160426 NTU

Erica Li
April 25, 2016


Hadoop MapReduce vs. Spark: an introduction to Spark, from RDDs to infrastructure, from Streaming to GraphX. In this lecture you will get a sense of Spark's scope and learn its pros and cons. (Spark 1.6.1)




  1. APACHE SPARK TODAY! Erica Li 2016-04-26

  2. About Me @Shrimp_li ericalitw What am I working on?
     • Data Scientist • Girls in Tech Taiwan • ElasticMining CTO & Co-founder • Taiwan Spark User Group Founder • Taiwan People's Food Bank IT consultant
  3. About Me

  4. 10 THINGS TO APACHE SPARK 1. Introduction 2. Hadoop vs. Spark 3. Spark Features
     4. Spark Ecosystem 5. Spark Architecture 6. RDD 7. Streaming 8. SQL DataFrame 9. MLlib 10. GraphX

  7. 1 What is Spark

  9. Apache Spark™ is a powerful open source processing engine built

    around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
  10. A Brief Story: 2004 MapReduce paper → 2006 Hadoop @Yahoo! → 2008 Hadoop Summit
      → 2010 Spark paper → 2012 Apache Spark → 2014 won (the sort benchmark) → 2016 what's next? Spark 2.0?
  11. 2 Hadoop vs. Spark

  14. Hadoop: a full-stack massively parallel processing (MPP) system with
      both a big data store (HDFS) and a parallel execution model (MapReduce). Spark: an open source parallel processing framework that lets users run large-scale data analytics applications across clustered computers.
  15. Spark vs. Hadoop MapReduce — 2014 Sort Benchmark Competition

      Engine            Machines  Time
      Hadoop MapReduce  2100      72 min
      Spark             207       23 min

      Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. http://spark.apache.org/
  16. Spark vs. Hadoop MapReduce (diagram): MapReduce writes each iteration's output
      to HDFS and reads it back for the next iteration (HDFS read → iterate → HDFS write, repeated), while Spark keeps the intermediate data in memory between iterations after a single HDFS read.
  18. 3 Spark Features

  19. Spark Features • Written in Scala • Runs on the JVM
      • Takes MapReduce to the next level • In-memory data storage • Near real-time processing • Lazy evaluation of queries
  20. Available APIs Spark (v1.4+) currently supports the following languages
      for development: Scala, Java, Python, and R
  21. 4 Spark Ecosystem

  22. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

  23. 5 Architecture

  24. Components • Data storage: HDFS and other Hadoop-compatible stores • Cluster management framework:
      Standalone, Mesos ($ ./bin/spark-shell --master mesos://host:5050), YARN ($ ./bin/spark-shell --master yarn --deploy-mode client), AWS EC2 • Distributed computing API (Scala, Python, Java)
  25. Standalone Mode Launch a cluster: $ ./bin/spark-shell --master spark://IP:PORT
      Requirements: • Master • Slaves • Public keys
  26. Launch Cluster (diagram): the driver program (SparkContext) talks to a cluster
      manager — Standalone, YARN, or Mesos — whose master assigns worker nodes.
  27. Cluster Processing (diagram): the driver program's SparkContext asks the cluster
      manager for resources; each worker node runs an executor with a cache, executing tasks from the application's *.jar / *.py. http://spark.apache.org/
  28. Client Mode (default) (diagram): the driver (SparkContext) runs inside the client
      process that submits the app; executors on the worker nodes run its tasks from *.jar / *.py. http://spark.apache.org/
  29. Cluster Mode (diagram): the client submits the app, but the driver itself runs
      on a worker node inside the cluster; executors on the other worker nodes run its tasks.
  30. Spark on YARN (diagram): the Spark YARN client requests resources from the Resource
      Manager, which assigns an Application Master on a Node Manager; the Application Master (holding the Spark Context, DAG Scheduler, and YarnClusterScheduler) applies for containers, and the assigned containers on Node Managers run ExecutorBackends hosting the executors.
  31. Which one is better? It depends… on your infra.

  32. Spark Installation Downloading: wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
      Untar it, then cd into it: tar zxvf spark-1.6.1-bin-hadoop2.6.tgz ; cd spark-1.6.1-bin-hadoop2.6 Requirements: Scala 2.10.x, Java 7+, Python 2.6+, R 3.1+
  33. Resilient Distributed Datasets "The main abstraction Spark provides is a resilient distributed dataset
      (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures."
  34. RDD Operations — here are some common operations.
      Transformations: • map(func) • flatMap(func) • filter(func) • groupByKey() • reduceByKey() • union(other) • sortByKey()
      Actions: • reduce(func) • collect() • first() • take(n) • saveAsTextFile(path) • countByKey()
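The operations above can be sketched in plain Python (no cluster needed) to show the key behavioral split: transformations build a lazy pipeline, and nothing runs until an action is called. `FakeRDD` is a hypothetical stand-in for illustration, not the PySpark API.

```python
class FakeRDD:
    def __init__(self, data):
        self._data = data  # in real Spark this would be partitioned across nodes

    # --- transformations: return a new FakeRDD, evaluated lazily (generators) ---
    def map(self, f):
        return FakeRDD(f(x) for x in self._data)

    def flatMap(self, f):
        return FakeRDD(y for x in self._data for y in f(x))

    def filter(self, f):
        return FakeRDD(x for x in self._data if f(x))

    # --- actions: force evaluation and return a result to the "driver" ---
    def collect(self):
        return list(self._data)

    def take(self, n):
        from itertools import islice
        return list(islice(self._data, n))

    def reduce(self, f):
        from functools import reduce
        return reduce(f, self._data)

rdd = FakeRDD(range(1, 6))
squares = rdd.map(lambda x: x * x)         # nothing computed yet
evens = squares.filter(lambda x: x % 2 == 0)
print(evens.collect())                     # [4, 16]
```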
  35. How to Create RDD Java Scala Python

  36. Key-Value RDD For example • mapValues • groupByKey • reduceByKey
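A minimal plain-Python sketch of what these three operations do to (key, value) pairs; the helper names are hypothetical stand-ins, not Spark's API.

```python
from collections import defaultdict

def map_values(pairs, f):
    # mapValues: transform only the value, leave the key untouched
    return [(k, f(v)) for k, v in pairs]

def group_by_key(pairs):
    # groupByKey: collect every value for a key into a list
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, f):
    # reduceByKey: merge values per key with f (Spark also pre-merges
    # map-side, so fewer records cross the network than with groupByKey)
    out = {}
    for k, v in pairs:
        out[k] = f(out[k], v) if k in out else v
    return out

pairs = [("a", 1), ("b", 1), ("a", 1)]
print(map_values(pairs, lambda v: v * 10))       # [('a', 10), ('b', 10), ('a', 10)]
print(group_by_key(pairs))                       # {'a': [1, 1], 'b': [1]}
print(reduce_by_key(pairs, lambda x, y: x + y))  # {'a': 2, 'b': 1}
```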

  37. Narrow Dependencies map, filter

  38. Narrow Dependencies union

  39. Wide Dependencies groupByKey, reduceByKey

  40. Stages (diagram): the job's DAG is cut into stages at wide dependencies —
      A -groupBy→ B forms Stage 1; C -map→ D and the D/E union → F form Stage 2; joining B and F → G forms Stage 3.
  41. Cache • RDD persistence • Caching is a key tool
      for iterative algorithms and fast interactive use • Usage: rdd.cache() / rdd.persist()
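Why caching matters for iterative work can be sketched without a cluster: an unpersisted lazy pipeline is recomputed on every pass, while a cached (materialized) result is computed once. This is an analogy to `rdd.cache()`, not Spark code; `expensive` is a hypothetical costly step and we count its invocations.

```python
calls = {"n": 0}

def expensive(x):
    calls["n"] += 1  # tally how many times the "transformation" actually runs
    return x * x

data = range(3)

# uncached: the pipeline is re-executed for each pass over the data
run1 = [expensive(x) for x in data]
run2 = [expensive(x) for x in data]
assert calls["n"] == 6  # computed twice

# "cached": materialize once (like rdd.cache()), then reuse the result
calls["n"] = 0
cached = [expensive(x) for x in data]
run1 = list(cached)
run2 = list(cached)
assert calls["n"] == 3  # computed once
```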
  42. Shared Variables Broadcast variables Accumulators

  43. Fault Tolerance (diagram): lineage example — lines → errors → HDFS errors; lost partitions are rebuilt by replaying this lineage.

  44. #Word count (python)
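The slide's code image is not transcribed. The classic Spark word count is a flatMap → map → reduceByKey pipeline; the sketch below emulates those semantics in plain Python, with the usual PySpark chain shown as a comment.

```python
# In PySpark this would typically read:
#   sc.textFile("input.txt") \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda w: (w, 1)) \
#     .reduceByKey(lambda a, b: a + b)
def word_count(lines):
    pairs = [(w, 1) for line in lines for w in line.split()]  # flatMap + map
    counts = {}
    for w, n in pairs:                                        # reduceByKey
        counts[w] = counts.get(w, 0) + n
    return counts

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```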

  45. #word count (scala)

  46. #word count (scala) New in 1.6+

  47. #word count (scala)

  48. BEST PRACTICE Show time

  49. 1. Avoid GroupByKey (diagram; the keys 甲/乙 are "A"/"B"): with reduceByKey, each
      partition first combines its (A,1)/(B,1) pairs locally into partial sums, and only those few partial sums are shuffled; with groupByKey, every single (A,1)/(B,1) pair is shuffled across the network before the counting happens.
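The shuffle-volume difference can be made concrete: with map-side combining (as reduceByKey does) at most one record per key leaves each partition, while groupByKey ships every pair unchanged. A plain-Python illustration (not Spark code) that counts shuffled records:

```python
def shuffled_records_group_by_key(partitions):
    # groupByKey: every (key, value) pair is shuffled as-is
    return sum(len(part) for part in partitions)

def shuffled_records_reduce_by_key(partitions):
    # reduceByKey: map-side combine first -> at most one record per key per partition
    total = 0
    for part in partitions:
        combined = {}
        for k, v in part:
            combined[k] = combined.get(k, 0) + v
        total += len(combined)
    return total

partitions = [
    [("A", 1), ("A", 1), ("B", 1)],
    [("A", 1), ("B", 1), ("B", 1)],
]
print(shuffled_records_group_by_key(partitions))   # 6
print(shuffled_records_reduce_by_key(partitions))  # 4
```

The gap grows with the number of duplicate keys per partition, which is exactly why groupByKey hurts on skewed word-count-style data.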
  50. 2. Don’t copy all elements to driver take() sample countByValue()

    countByKey() collectAsMap() save as file filtering/sampling
  51. 3. Bad input data (python)
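The slide's code image is not transcribed. A common pattern for bad input is to parse inside a function that never raises and then drop the failures, so one malformed record can't kill the whole job; in PySpark that would be `rdd.map(safe_parse).filter(lambda p: p is not None)`. A plain-Python sketch:

```python
def safe_parse(line):
    # Parse "name, int" lines; return None instead of raising on bad input.
    try:
        name, value = line.split(",")
        return (name.strip(), int(value))
    except ValueError:
        return None  # malformed line

lines = ["a, 1", "b, oops", "c, 3", "broken"]
parsed = [p for p in (safe_parse(l) for l in lines) if p is not None]
print(parsed)  # [('a', 1), ('c', 3)]
```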

  52. 4. Number of partitions • Spark Application UI • Inspect it programmatically
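How records end up in partitions can be sketched in plain Python, mimicking the even split that `sc.parallelize(data, n)` performs (an illustration of the idea, not Spark's exact algorithm). Too few partitions underuse the cluster; too many add scheduling overhead.

```python
def partition(data, n):
    # Split data into n near-equal chunks, earlier chunks taking the remainder.
    data = list(data)
    size, extra = divmod(len(data), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

print(partition(range(10), 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```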

  53. 6 Spark Streaming

  56. Use cases • Streaming ETL ◦ Uber - Kafka, HDFS
      ◦ Conviva - quality of live videos • Data Enrichment ◦ KKBOX • Complex Session Analysis ◦ Pinterest - immediate user behavior • Trigger Event Detection ◦ UITOX
  57. 7 Spark SQL

  58. M. Armbrust et al. (2015), Spark SQL: Relational Data Processing in Spark. (Spark 1.3)
  59. https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/ Spark 1.3

  60. DATASETS API • Released in Spark 1.6 • RDD +
      Encoder • RDD with the Catalyst optimizer
  61. M. Armbrust et al. (2015), Spark SQL: Relational Data Processing in Spark —
      phases of query planning: SQL Query / DataFrame → Unresolved Logical Plan → (Analysis) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model) → Selected Physical Plan → (Code Generation) → RDDs
  62. Use cases • Data Query ◦ VPON ◦ Foxconn ◦

    Appier ◦ EBC ◦ ...
  63. 8 Spark MLlib

  64. MLlib • MLlib is Spark’s machine learning (ML) library •

    Its goal is to make practical machine learning scalable and easy • It consists of common learning algorithms and utilities
  65. Models & Use cases • Classification & Regression • Collaborative

    Filtering • Clustering • Dimensionality Reduction • Feature Extraction... • User behavior analysis ◦ PIXNET ◦ Pinkoi
  66. 9 GraphX

  69. Case Study 10

  71. Five things we hate about SPARK! Memory issues, small files,
      multi-use environments… CRAZY ERRORS
  72. Show you are Better 1st.

  73. 2016-04-24

  74. HadoopCon 2016 9/9-9/10 8th Annual Hadoop Meetup in Taiwan http://2016.hadoopcon.org/wp/

  75. Hiring! (Full-time) http://www.wetogether.co/hiring.html Email me!
      Frontend Developer — Job Description: • HTML5/CSS3 • JavaScript • AngularJS • SASS • SEO • jQuery • Bootstrap • RWD/Mobile First Design
      Backend Developer — Job Description: • Web • Python/Scala • MVC • HTTP • Linux • Database (SQL or No-SQL)
  77. Thank You