
Introduction to Apache Spark

Steven Borrelli

May 30, 2014

Transcript

  1. APACHE SPARK · STAMPEDECON 2014 · STEVEN BORRELLI @stevendborrelli · ASTERIS
  2. ABOUT ME
     Founder, Asteris (Jan 2014)
     Organizer of STL Machine Learning and Docker STL
     Systems engineering, HPC, big data & cloud
     Next generation infrastructure for developers
  3. SPARK IN FIVE SECONDS
     Spark is a replacement for MapReduce.
  4. MAPREDUCE IS AWESOME!
     Allows us to process enormous amounts of data in parallel
  5. MAPREDUCE
     MapReduce: Simplified Data Processing on Large Clusters (2004)
     Jeffrey Dean and Sanjay Ghemawat
  6. THE PROBLEMS WITH MAPREDUCE!
     API: Low-Level & Complex
  7. MAPREDUCE ISSUES!
     • Latency
     • Execution time impacted by "stragglers"
     • Lack of in-memory caching
     • Intermediate steps persisted to disk
     • No shared state
  8. THE PROBLEMS WITH MAPREDUCE!
     Not optimal for: machine learning, graphs, stream processing
  9. IMPROVING MAPREDUCE
     Apache Tez
  10. NEXT MAPREDUCE: GOALS
     • Generalize to different workloads
     • Sub-second latency
     • Scalable and fault tolerant
     • Easy-to-use API
  11. TOP SPARK FEATURES
     • Fast, fault-tolerant in-memory data structures (RDD)
     • Compatibility with the Hadoop ecosystem
     • Rich, easy-to-use API supports Machine Learning, Graphs and Streaming
     • Interactive shell

  12. RESILIENT DISTRIBUTED DATASET
     • Immutable in-memory collections
     • Fast recovery on failure
     • Control caching and persistence to memory/disk (sketch below)
     • Can partition to avoid shuffles
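A minimal sketch of controlling caching and persistence on an RDD, assuming an illustrative master, app name, and input path:

     import org.apache.spark.SparkContext
     import org.apache.spark.storage.StorageLevel

     val sc = new SparkContext("local[2]", "rdd-persistence")   // illustrative master / app name
     val errors = sc.textFile("hdfs://...")                     // placeholder path
                    .filter(_.contains("ERROR"))
                    .persist(StorageLevel.MEMORY_AND_DISK)      // keep in memory, spill to disk if needed
     // errors.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)

     println(errors.count())   // first action materializes and caches the RDD
     println(errors.count())   // second pass reads the cached data instead of re-reading HDFS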
  13. RDD LINEAGE
     lines = spark.textFile("hdfs://errors/...")      // base RDD read from HDFS
     errors = lines.filter(_.startsWith("ERROR"))     // derived RDD: only error lines
     messages = errors.map(_.split('\t')(2))          // derived RDD: third tab-separated field
  14. LANGUAGE SUPPORT
     • Spark is written in Scala
     • Uses Scala collections & Akka Actors
     • Java, Python native support (Python support can lag); lambda support in Java 8 / Spark 1.0
     • R bindings through SparkR
     • Functional programming paradigm
  15. RDD TRANSFORMATIONS
     Transformations create a new RDD:
     map, filter, flatMap, sample, union, distinct, groupByKey, reduceByKey, sortByKey, join, cogroup, cartesian
     Transformations are evaluated lazily.
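As a small illustration of lazy evaluation, a word-count pipeline as one might type in the Spark shell (sc is already defined there; the input path is a placeholder):

     val words  = sc.textFile("hdfs://.../input.txt")   // placeholder path
                    .flatMap(_.split(" "))               // transformation: lines -> words
     val counts = words.map(word => (word, 1))           // transformation: pair each word with 1
                       .reduceByKey(_ + _)               // transformation: sum the 1s per word
     // Nothing has run yet: counts is only a description of the computation.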
  16. RDD ACTIONS
     Actions return a value:
     reduce, collect, count, countByKey, countByValue, countApprox, foreach, saveAsSequenceFile, saveAsTextFile, first, take(n), takeSample, toArray
     Invoking an Action will cause all previous Transformations to be evaluated.
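Continuing the word-count sketch above, invoking an action forces the whole lineage to execute:

     counts.take(5).foreach(println)             // action: runs the job, returns five (word, count) pairs
     val distinctWords = counts.count()          // action: runs the job again, counts the pairs
     counts.saveAsTextFile("hdfs://.../out")     // action: writes the results (placeholder path)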
  17. TASK SCHEDULER
     http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-1-amp-camp-2012-spark-intro.pdf
     • Runs general task graphs
     • Pipelines functions where possible
     • Cache-aware data reuse & locality
     • Partitioning-aware to avoid shuffles (sketch below)
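A small sketch of the partitioning-aware behavior mentioned above: pre-partitioning one pair RDD so a later join does not re-shuffle it (the data is illustrative, and sc is an existing SparkContext):

     import org.apache.spark.HashPartitioner
     import org.apache.spark.SparkContext._

     val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
                    .partitionBy(new HashPartitioner(8))   // fix the partitioning once
                    .cache()                                // reuse the partitioned data across jobs
     val visits = sc.parallelize(Seq((1, "/home"), (2, "/docs")))

     // users already has a known partitioner, so only visits is shuffled for the join
     val joined = users.join(visits)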
  18. SPARK STACK
     Integrated platform for disparate workloads
  19. SPARK STREAMING
     • Micro-batch: Discretized Stream (DStream)
     • ~1 sec latency
     • Fault tolerant
     • Shares much of the same code as batch (see the sketch below)
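A minimal DStream sketch (word count over one-second micro-batches from an illustrative socket source), showing how the streaming code mirrors the batch RDD code:

     import org.apache.spark.SparkConf
     import org.apache.spark.streaming.{Seconds, StreamingContext}
     import org.apache.spark.streaming.StreamingContext._

     val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-sketch")
     val ssc  = new StreamingContext(conf, Seconds(1))      // one-second micro-batches

     val lines  = ssc.socketTextStream("localhost", 9999)   // illustrative source
     val counts = lines.flatMap(_.split(" "))
                       .map((_, 1))
                       .reduceByKey(_ + _)                  // same operations as the batch API
     counts.print()                                         // output operation, runs once per batch

     ssc.start()
     ssc.awaitTermination()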
  20. TOP 10 HASHTAGS IN LAST 10 MIN
     // Create the stream of tweets
     val tweets = ssc.twitterStream(<username>, <password>)
     // Count the tags over a 10 minute window
     val tagCounts = tweets.flatMap(status => getTags(status))
                           .countByValueAndWindow(Minutes(10), Seconds(1))
     // Sort the tags by counts
     val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                               .transform(_.sortByKey(false))
     // Show the top 10 tags
     sortedTags.foreach(showTopTags(10) _)
  21. • 10x+ speedup after data is cached
     • In-memory materialized views
     • Supports HiveQL, UDFs, etc.
     • New Catalyst SQL engine coming in 1.0 includes SchemaRDD to mix & match RDD/SQL in code (sketch below)
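A hedged sketch of mixing SQL and RDD code with SchemaRDD, based on the Spark 1.0 SQLContext API (the Person class and data are illustrative):

     import org.apache.spark.sql.SQLContext

     case class Person(name: String, age: Int)               // illustrative schema

     val sqlContext = new SQLContext(sc)                      // sc: existing SparkContext
     import sqlContext.createSchemaRDD                        // implicit RDD -> SchemaRDD conversion

     val people = sc.parallelize(Seq(Person("Ada", 36), Person("Linus", 14)))
     people.registerAsTable("people")

     // The query result is itself a SchemaRDD, so RDD operations apply to it
     val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
     teens.map(row => "Name: " + row(0)).collect().foreach(println)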
  22. • Implementation of PowerGraph, Pregel on Spark
     • 0.5x the speed of GraphLab, but more fault-tolerant
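A hedged GraphX sketch (loading a graph from an edge list and running PageRank; the path and tolerance are illustrative, and sc is an existing SparkContext):

     import org.apache.spark.graphx.GraphLoader

     // Each line of the file is an illustrative "srcId dstId" pair (placeholder path)
     val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")

     // Run PageRank to the given tolerance and look at a few vertex ranks
     val ranks = graph.pageRank(0.0001).vertices
     ranks.take(5).foreach(println)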
  23. MLLIB
     • Machine Learning library, part of Spark core
     • Uses jblas & gfortran; Python supports NumPy
     • Growing number of algorithms: SVM, ALS, Naive Bayes, K-Means, Linear & Logistic Regression (SVD/PCA, CART, L-BFGS coming in 1.x)
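A hedged MLlib sketch (K-Means over a file of space-separated numeric features; the path, k = 2, and 20 iterations are illustrative):

     import org.apache.spark.mllib.clustering.KMeans
     import org.apache.spark.mllib.linalg.Vectors

     // Parse each line of space-separated doubles into a dense feature vector (placeholder path)
     val data = sc.textFile("hdfs://.../kmeans_data.txt")
                  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
                  .cache()

     val model = KMeans.train(data, 2, 20)                    // k = 2 clusters, 20 iterations
     println("Within-set sum of squared errors: " + model.computeCost(data))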
  24. MLLIB+
     • MLI: higher-level library to support Tables (dataframes), Linear Algebra, Optimizers
     • MLI: alpha software, limited activity
     • Can use Scikit-Learn or SparkR to run models on Spark
  25. COMMUNITY
     Charts comparing contributor activity (patches, lines added, lines removed) across MapReduce, Storm, YARN, and Spark.
  26. SPARK MOMENTUM
     • 1.0 released 5/30/2014
     • Databricks: $14MM investment from Andreessen Horowitz
     • Partnerships with DataStax, Cloudera, MapR, Pivotal