
Introduction to Apache Spark


Steven Borrelli

May 30, 2014

Transcript

  1. APACHE SPARK
    STAMPEDECON 2014
    STEVEN BORRELLI
    @stevendborrelli
    ASTERIS


  2. ABOUT ME
    FOUNDER, ASTERIS (JAN 2014)
    ORGANIZER OF STL MACHINE LEARNING AND DOCKER STL
    SYSTEMS ENGINEERING, HPC, BIG DATA & CLOUD
    NEXT GENERATION INFRASTRUCTURE FOR DEVELOPERS


  3. SPARK IN FIVE SECONDS
    [Spark] is a replacement for [MapReduce]


  4. WHY DO WE NEED TO REPLACE MAPREDUCE?


  5. MAPREDUCE IS AWESOME!
    Allows us to process enormous amounts of data in parallel


  6. MAPREDUCE
    MAPREDUCE: SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS (2004)
    JEFFREY DEAN AND SANJAY GHEMAWAT


  7. HITTING THE LIMITS OF HADOOP'S MAPREDUCE


  8. THE PROBLEMS WITH MAPREDUCE
    API: Low-Level & Complex


  9. MAPREDUCE ISSUES
    • Latency
    • Execution time impacted by “stragglers”
    • Lack of in-memory caching
    • Intermediate steps persisted to disk
    • No shared state


  10. THE PROBLEMS WITH MAPREDUCE
    Not optimal for:
    MACHINE LEARNING, GRAPHS, STREAM PROCESSING


  11. IMPROVING MAPREDUCE
    APACHE TEZ


  12. NEXT MAPREDUCE: GOALS
    • Generalize to different workloads
    • Sub-Second Latency
    • Scalable and Fault Tolerant
    • Easy-to-use API


  13. TOP SPARK FEATURES
    • Fast, fault-tolerant in-memory data structures (RDDs)
    • Compatibility with the Hadoop ecosystem
    • Rich, easy-to-use API supporting Machine Learning, Graphs and Streaming
    • Interactive shell



  14. SPARK STACK


  15. RESILIENT DISTRIBUTED DATASET
    • Immutable in-memory collections
    • Fast recovery on failure
    • Control caching and persistence to memory/disk
    • Can partition to avoid shuffles
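A minimal sketch of the caching and partitioning controls listed on this slide, assuming an existing SparkContext named sc; the log path and partition count are illustrative, not from the deck:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.SparkContext._ // pair-RDD functions in the 1.0-era API

    // Load an immutable, partitioned collection of lines.
    val lines = sc.textFile("hdfs:///logs/app.log")

    // Control persistence: keep the RDD in memory, spilling to disk if needed.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    // Pre-partition a key/value RDD so later aggregations and joins
    // on the same key can avoid a shuffle.
    val byHost = lines.map(l => (l.split(" ")(0), l))
                      .partitionBy(new HashPartitioner(8))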


  16. RDD LINEAGE
    val lines = spark.textFile("hdfs://errors/...")
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))
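A possible continuation of the slide's example (not on the slide itself): the three lines above only record lineage, so nothing has executed yet. Caching and an action would look like this:

    // Mark messages for reuse across queries; still nothing has run.
    messages.cache()

    // The count() action finally triggers evaluation of the whole lineage.
    messages.filter(_.contains("mysql")).count()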


  17. LANGUAGE SUPPORT
    • Spark is written in Scala
    • Uses Scala collections & Akka Actors
    • Java and Python natively supported (Python support can lag); lambda support in Java 8 / Spark 1.0
    • R bindings through SparkR
    • Functional programming paradigm


  18. RDD TRANSFORMATIONS
    Transformations create a new RDD:
    map, filter, flatMap, sample, union, distinct,
    groupByKey, reduceByKey, sortByKey,
    join, cogroup, cartesian
    Transformations are evaluated lazily (see the sketch below).
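A minimal sketch of lazy transformations, assuming a SparkContext named sc; the data and variable names are illustrative:

    import org.apache.spark.SparkContext._ // pair-RDD functions in the 1.0-era API

    // Each call returns a new RDD that only describes the computation.
    val nums    = sc.parallelize(1 to 100)
    val doubled = nums.map(_ * 2)            // transformation
    val evens   = doubled.filter(_ % 4 == 0) // transformation
    val sums    = evens.map(n => (n % 3, n))
                       .reduceByKey(_ + _)   // still lazy: no job has run yet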


  19. RDD ACTIONS
    Actions return a value to the driver or write data out:
    reduce, collect, count, countByKey, countByValue, countApprox,
    foreach, saveAsSequenceFile, saveAsTextFile,
    first, take(n), takeSample, toArray
    Invoking an Action causes all previous Transformations to be evaluated (see the sketch below).
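Continuing the sketch above, a few actions that force the pipeline to execute; the output path is hypothetical:

    val total    = sums.values.reduce(_ + _)  // action: runs the job, returns an Int
    val firstTwo = evens.take(2)              // action: returns an Array to the driver
    evens.saveAsTextFile("hdfs:///tmp/evens") // action: writes one file per partition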


  20. TASK SCHEDULER
    http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-1-amp-camp-2012-spark-intro.pdf
    • Runs general task graphs
    • Pipelines functions where possible
    • Cache-aware data reuse & locality
    • Partitioning-aware to avoid shuffles


  21. SPARK ECOSYSTEM


  22. SPARK STACK
    Integrated platform for disparate workloads


  23. SPARK STREAMING
    • Micro-batch: Discretized Stream (DStream)
    • ~1 sec latency
    • Fault tolerant
    • Shares much of the same code as batch jobs


  24. TOP 10 HASHTAGS IN LAST 10 MIN
    // Create the stream of tweets (credentials elided on the slide)
    val tweets = ssc.twitterStream(/* ... */)

    // Count the tags over a 10-minute window, sliding every second
    val tagCounts = tweets.flatMap(status => getTags(status))
                          .countByValueAndWindow(Minutes(10), Seconds(1))

    // Sort the tags by counts
    val sortedTags = tagCounts.map { case (tag, count) => (count, tag) }
                              .transform(_.sortByKey(false))

    // Show the top 10 tags
    sortedTags.foreach(showTopTags(10) _)


  25. • 10x+ speedup after data is cached
    • In-memory materialized views
    • Supports HiveQL, UDFs, etc.
    • New Catalyst SQL engine coming in 1.0 includes SchemaRDD to mix & match RDD/SQL in code
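A minimal sketch of the 1.0-era SchemaRDD API mentioned in the last bullet, assuming a SparkContext named sc; the case class, table name, and path are hypothetical:

    import org.apache.spark.sql.SQLContext

    case class LogEntry(host: String, status: Int)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit: RDD of case classes -> SchemaRDD

    // Build a SchemaRDD from an ordinary RDD and register it as a table.
    val logs = sc.textFile("hdfs:///logs/access.log")
                 .map(_.split(" "))
                 .map(f => LogEntry(f(0), f(1).toInt))
    logs.registerAsTable("logs")

    // SQL results come back as another SchemaRDD, so RDD operations can follow.
    val errors = sqlContext.sql("SELECT host FROM logs WHERE status >= 500")
    errors.map(row => row(0)).collect().foreach(println)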


  26. • Implementation of PowerGraph, Pregel on Spark
    • 0.5x the speed of GraphLab, but more fault-tolerant
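Assuming this slide refers to GraphX (Spark's graph library, which implements the PowerGraph and Pregel models), a minimal sketch; the edge-list path is hypothetical:

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.SparkContext._

    // Build a graph from a file of "srcId dstId" edge pairs.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/followers.txt")

    // PageRank runs as a Pregel-style iterative computation on top of RDDs.
    val ranks = graph.pageRank(0.001).vertices

    // Print the top 10 vertices by rank.
    ranks.map(_.swap).sortByKey(ascending = false).take(10).foreach(println)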


  27. MLLIB
    • Machine Learning library, part of Spark core
    • Uses jblas & gfortran; Python support uses NumPy
    • Growing number of algorithms: SVM, ALS, Naive Bayes, K-Means, Linear & Logistic Regression (SVD/PCA, CART, L-BFGS coming in 1.x)
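A minimal sketch for one of the algorithms listed above, K-Means, using the MLlib API; the input path and parameters are illustrative:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One whitespace-separated feature vector per input line.
    val points = sc.textFile("hdfs:///data/points.txt")
                   .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
                   .cache()

    // Train K-Means with 5 clusters and 20 iterations.
    val model = KMeans.train(points, k = 5, maxIterations = 20)
    println("Within-set sum of squared errors: " + model.computeCost(points))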


  28. MLLIB +
    • MLI: higher-level library to support Tables (dataframes), Linear Algebra, Optimizers
    • MLI: alpha software, limited activity
    • Can use Scikit-Learn or SparkR to run models on Spark


  29. COMMUNITY
    [Bar charts comparing MapReduce, Storm, YARN and Spark by patches (0–200), lines added (0–40,000) and lines removed (0–14,000)]


  30. SPARK MOMENTUM
    • 1.0 released 5/30/2014
    • Databricks raised a $14MM investment from Andreessen Horowitz
    • Partnerships with DataStax, Cloudera, MapR, Pivotal


  31. THANKS!
    [email protected] @stevendborrelli
