
Spark Introduction 20160426 NTU

Erica Li
April 25, 2016


Hadoop MapReduce vs. Spark, a Spark introduction: from RDD to infrastructure, from Streaming to GraphX. In this lecture you will get a sense of the scope of Spark and learn its pros and cons. (Spark 1.6.1)


Transcript

  1. About Me @Shrimp_li ericalitw What am I working on? • Data Scientist • Girls in Tech Taiwan • ElasticMining CTO & Co-founder • Taiwan Spark User Group Founder • Taiwan People’s Food Bank IT consultant
  2. 10 Things About Apache Spark: 1 Introduction, 2 Hadoop vs. Spark, 3 Spark Features, 4 Spark Ecosystem, 5 Spark Architecture, 6 RDD, 7 Streaming, 8 SQL DataFrame, 9 MLlib, 10 GraphX
  3. Apache Spark™ is a powerful open source processing engine built

    around speed, ease of use, and sophisticated analytics. It was originally developed at UC Berkeley in 2009.
  4. A Brief Story: 2004 MapReduce paper; 2006 Hadoop @Yahoo!; 2008 Hadoop Summit; 2010 Spark paper; 2012 Apache Spark; 2014 won the Sort Benchmark; 2016 what's next, Spark 2.0?
  5. Hadoop: a full-stack massively parallel processing (MPP) system with both big data storage (HDFS) and a parallel execution model (MapReduce). Spark: an open source parallel processing framework that enables users to run large-scale data analytics applications across clustered computers.
  6. Spark vs. Hadoop MapReduce, 2014 Sort Benchmark Competition (machines / time): Hadoop MapReduce 2100 machines / 72 min; Spark 207 machines / 23 min. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. http://spark.apache.org/
  7. Spark vs. Hadoop MapReduce (diagram): MapReduce writes to and reads from HDFS between every job (HDFS read, HDFS write, HDFS read, HDFS write), while Spark reads the input once and passes intermediate results between iterations in memory.
  8. Spark Features • Written in Scala • Runs on the JVM • Takes MapReduce to the next level • In-memory data storage • Near real-time processing • Lazy evaluation of queries
  9. Components. Data storage: HDFS and other Hadoop-compatible storage. Cluster management framework: • Standalone • Mesos: $ ./bin/spark-shell --master mesos://host:5050 • Yarn: $ ./bin/spark-shell --master yarn --deploy-mode client • AWS EC2. Distributed computing API (Scala, Python, Java).
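    For completeness, a standalone master is addressed the same way. A sketch, where host is a placeholder and 7077 is the standalone master's default port:
        $ ./bin/spark-shell --master spark://host:7077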
  10. Launch Cluster (diagram): the Driver Program's SparkContext connects to a Cluster Manager (a Standalone, Yarn, or Mesos master), which allocates Worker Nodes for the application.
  11. Cluster Processing (diagram): the Driver Program's SparkContext talks to the Cluster Manager (Master); each Worker Node hosts an Executor with a Cache, the Executors run Tasks, and the application code (*.jar, *.py) is shipped to every Executor. http://spark.apache.org/
  12. Client Mode (default, diagram): the app is submitted from the client, and the Driver with its SparkContext runs inside that client process; it registers with the Master, and the Executors on the Worker Nodes run Tasks against the shipped *.jar/*.py code. http://spark.apache.org/
  13. Cluster Mode (diagram): the client submits the app to the Master, which launches the Driver on one of the Worker Nodes; the Executors on the Worker Nodes then run the Tasks with their Caches.
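    As a sketch of how the two deploy modes are chosen at submit time on Yarn (my_app.py is a placeholder application):
        $ ./bin/spark-submit --master yarn --deploy-mode client my_app.py
        $ ./bin/spark-submit --master yarn --deploy-mode cluster my_app.py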
  14. Spark on Yarn (diagram): 1. the Spark Yarn Client sends a request to the Resource Manager; 2. the Resource Manager assigns an Application Master on a Node Manager; 3. the Application Master is invoked and hosts the SparkContext, DAG Scheduler, and YarnClusterScheduler; it then applies for containers, the Resource Manager assigns Containers on the Node Managers, and each Container's ExecutorBackend launches an Executor.
  15. Spark Installation. Download, untar, and enter the directory:
     $ wget http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
     $ tar zxvf spark-1.6.1-bin-hadoop2.6.tgz
     $ cd spark-1.6.1-bin-hadoop2.6
     Requirements: Scala 2.10.x, Java 7+, Python 2.6+, R 3.1+
  16. Resilient Distributed Datasets. "The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures." (Diagram: Partition 1, Partition 2, ...)
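    A minimal spark-shell sketch of both creation paths (the HDFS path is a placeholder):
        // From an existing Scala collection in the driver program
        val nums = sc.parallelize(1 to 100)
        // From a file in HDFS or any other Hadoop-supported file system
        val lines = sc.textFile("hdfs:///path/to/input.txt")
        // Both are RDDs whose partitions are processed in parallel
        println(nums.sum())
        println(lines.count())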
  17. RDD operations: here are some common operations. Transformations: • map(func) • flatMap(func) • filter(func) • groupByKey() • reduceByKey() • union(other) • sortByKey(). Actions: • reduce(func) • collect() • first() • take(n) • saveAsTextFile(path) • countByKey()
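    A short sketch of how these compose: transformations are lazy and only build the lineage, and nothing runs until an action is called (the sample data is illustrative):
        val words = sc.parallelize(Seq("spark", "hadoop", "spark"))
        // Transformation chain: nothing is executed yet
        val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
        // Action: runs the chain and returns results to the driver
        counts.collect().foreach(println)   // (spark,2), (hadoop,1)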
  18. Cache • RDD persistence • Caching is a key tool for iterative algorithms and fast interactive use • Usage is sketched below
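    A minimal usage sketch, assuming the counts RDD from the sketch above:
        import org.apache.spark.storage.StorageLevel
        // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
        counts.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk if memory runs short
        counts.count()     // the first action materializes and caches the partitions
        counts.count()     // later actions reuse the cached data instead of recomputing
        counts.unpersist() // release the cached data when done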
  19. 1. Avoid GroupByKey (diagram, with placeholder keys A and B): with reduceByKey, pairs sharing a key are combined on each partition before the shuffle, e.g. (A,1) and (A,1) become (A,2) locally, so only partial sums such as (A,3) and (B,4) cross the network; with groupByKey, every individual (A,1) and (B,1) pair is shuffled across the network before being combined.
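    The same contrast in code; both compute identical counts, but the first shuffles far less data:
        val pairs = sc.parallelize(Seq(("A", 1), ("B", 1), ("A", 1), ("A", 1), ("B", 1)))
        // Preferred: combines values per key on each partition before shuffling
        val reduced = pairs.reduceByKey(_ + _)
        // Avoid: ships every (key, 1) pair across the network, then sums
        val grouped = pairs.groupByKey().mapValues(_.sum)
        reduced.collect().foreach(println)   // (A,3), (B,2)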
  20. 2. Don’t copy all elements to the driver; instead of collect(), prefer • take() • sample • countByValue() • countByKey() • collectAsMap() • save as file • filtering/sampling
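    For instance, continuing with the reduced RDD from the sketch above (the output path is a placeholder):
        // Risky on large data: pulls every element into the driver's memory
        // val all = reduced.collect()
        // Safer: inspect a bounded sample or a small aggregate, or write to storage
        reduced.take(10).foreach(println)          // at most 10 elements
        println(reduced.countByKey())              // small Map of key -> count
        reduced.saveAsTextFile("hdfs:///tmp/out")  // placeholder output path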
  21. Use cases • Streaming ETL ◦ Uber - Kafka, HDFS ◦ Conviva - quality of live video • Data Enrichment ◦ KKBOX • Complex Session Analysis ◦ Pinterest - immediate user behavior • Trigger Event Detection ◦ UITOX
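    These use cases are built on Spark Streaming; a minimal word-count sketch over a socket source, where host and port are placeholders:
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        // Wrap the existing SparkContext with a 10-second batch interval
        val ssc = new StreamingContext(sc, Seconds(10))
        val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
        lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
        ssc.start()             // begin receiving and processing batches
        ssc.awaitTermination()  // block until stopped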
  22. DATASETS API • Released in Spark 1.6 • RDD + Encoder • RDD with the Catalyst optimizer
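    A small sketch of creating a Dataset with the Spark 1.6 API; the Person case class is illustrative, and the Encoders come in through the SQLContext implicits:
        import org.apache.spark.sql.SQLContext
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._   // brings Encoders into scope
        case class Person(name: String, age: Int)   // illustrative schema
        // A typed Dataset: RDD-style operations plus Catalyst optimization
        val people = Seq(Person("Ann", 30), Person("Bo", 25)).toDS()
        people.filter(_.age > 26).show()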
  23. Phases of query planning (M. Armbrust et al. (2015), Spark SQL: Relational Data Processing in Spark): a SQL Query or DataFrame is parsed into an Unresolved Logical Plan; Analysis yields the Logical Plan; Logical Optimization yields the Optimized Logical Plan; Physical Planning yields candidate Physical Plans, from which a Cost Model selects the Physical Plan; Code Generation finally emits RDDs.
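    To make the pipeline concrete, a sketch of the two front ends that feed it, reusing the people Dataset from the previous sketch:
        val df = people.toDF()            // DataFrame front end
        df.registerTempTable("people")    // SQL front end (Spark 1.6 API)
        // Both statements go through the same Catalyst planning phases
        sqlContext.sql("SELECT name FROM people WHERE age > 26").show()
        df.filter($"age" > 26).select($"name").explain(true)  // print the plans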
  24. MLlib • MLlib is Spark’s machine learning (ML) library •

    Its goal is to make practical machine learning scalable and easy • It consists of common learning algorithms and utilities
  25. Models & Use cases • Classification & Regression • Collaborative

    Filtering • Clustering • Dimensionality Reduction • Feature Extraction... • User behavior analysis ◦ PIXNET ◦ Pinkoi
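    As one concrete example of the clustering family listed above, a minimal KMeans sketch using the RDD-based MLlib API (the 2-D points are illustrative):
        import org.apache.spark.mllib.clustering.KMeans
        import org.apache.spark.mllib.linalg.Vectors
        val data = sc.parallelize(Seq(
          Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
          Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))
        val model = KMeans.train(data, 2, 20)   // k = 2 clusters, 20 iterations
        model.clusterCenters.foreach(println)
        println(model.predict(Vectors.dense(0.5, 0.5)))  // cluster assignment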
  26. Hiring! (Full-time) http://www.wetogether.co/hiring.html. Frontend Developer Job Description: • HTML5/CSS3 • JavaScript • AngularJS • SASS • SEO • jQuery • Bootstrap • RWD/Mobile First Design. Backend Developer Job Description: • Web • Python/Scala • MVC • HTTP • Linux • Database (SQL or NoSQL). Email me!