
From Big Data to Fast Data

An introduction to Apache Spark.

Stefano Baghino

June 25, 2015

Transcript

  1. Really, what is it? ◎ Data that cannot be stored on a single box ◎ Requires horizontal scalability ◎ Requires a shift from traditional solutions
  2. Let's look at MapReduce ◎ Disk I/O all the time: each step reads its input from and writes its output to disk ◎ Limited model: it's difficult to fit all algorithms into the MapReduce model
  3. OK, so what is so good about Spark? It may sit on top of an existing Hadoop deployment. It builds heavily on simple functional programming ideas. It computes and caches data in-memory to deliver blazing performance.
  4. Fast? Really? Yes!
     Hadoop (102.5 TB): elapsed time 72 min, 50,400 cores, 0.67 GB/min per node
     Spark (100 TB): elapsed time 23 min, 6,592 cores, 20.7 GB/min per node
     Spark (1 PB): elapsed time 234 min, 6,080 cores, 22.5 GB/min per node
     Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  5. Working with Spark ◎ Resilient Distributed Dataset ◎ Closely resembles a Scala collection ◎ Very natural to use for Scala devs. From the user's point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
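
     To get a feel for that collection-like API, here is a minimal sketch (assuming a SparkContext named sc, as in the spark-shell; the sample names are purely illustrative):

         // Distribute a small local collection as an RDD
         val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))
         // The same higher-order functions you would use on a Scala collection
         val longNames = names.map(_.toUpperCase).filter(_.length > 3)
         longNames.collect()  // materializes the result on the driver: Array("ALICE", "ANDY")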
  6. What can I do with an RDD? Transformations produce a new RDD, extending the execution graph at each step, e.g. ◎ map ◎ flatMap ◎ filter. Actions are “terminal” operations that actually trigger the execution to extract a value, e.g. ◎ collect ◎ reduce.
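
     A small illustration of that split, again assuming the sc from the spark-shell; transformations are lazy, so nothing runs until the action is called:

         val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))
         // Transformations only describe the computation and return a new RDD
         val initials = names.map(n => n.charAt(0))
         // The action triggers the distributed execution and brings back a value
         initials.collect()  // Array('A', 'B', 'A')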
  7. The execution model 1. Create a DAG of RDDs to represent the computation 2. Create a logical execution plan for the DAG 3. Schedule and execute individual tasks
  8. The execution model in action Let's count distinct names grouped by their initial:

     sc.textFile("hdfs://...")
       .map(n => (n.charAt(0), n))
       .groupByKey()
       .mapValues(n => n.toSet.size)
       .collect()
  9. Step 1: Create the logical DAG: sc.textFile produces a HadoopRDD, map(n => (n.charAt(0), ...) a MappedRDD, groupByKey() a ShuffledRDD, mapValues(n => n.toSet...) a MappedValuesRDD, and collect() finally returns an Array[(Char, Int)].
  10. Step 2: Create the execution plan ◎ Pipeline as much as possible ◎ Split into “stages” based on the need to “shuffle” data. Stage 1 pipelines the HadoopRDD and MappedRDD (e.g. Alice, Bob, Andy become (A, Alice), (B, Bob), (A, Andy)); after the shuffle, Stage 2 runs the ShuffledRDD and MappedValuesRDD ((A, (Alice, Andy)) and (B, Bob) become (A, 2) and (B, 1)), so collect() returns res0 = [(A, 2), (B, 1)].
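
     To see those stage boundaries for yourself, one option is RDD.toDebugString; a hedged sketch (the HDFS path is a placeholder, as on the slide):

         val counts = sc.textFile("hdfs://...")
           .map(n => (n.charAt(0), n))
           .groupByKey()
           .mapValues(n => n.toSet.size)
         // Prints the lineage without running the job; the indentation
         // in the output shows where the shuffle splits the two stages
         println(counts.toDebugString)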
  11. So, how is it a Resilient Distributed Dataset? Being a lazy, immutable representation of computation, rather than an actual collection of data, RDDs achieve resiliency by simply being re-executed when their results are lost*. * Because distributed systems and Murphy's Law are best buddies.
  12. The ecosystem ◎ Spark SQL (structured data) ◎ Spark Streaming (real-time) ◎ MLLib (machine learning) ◎ GraphX (graph processing) ◎ Spark R (statistical analysis), all built on Spark Core, which runs on the Standalone Scheduler, YARN, or Mesos.
  13. What we'll see today: Spark Streaming, the real-time component of the ecosystem just shown.
  14. Spark Streaming splits the live stream into “mini-batches”. These “mini-batches” are DStreams, or discretized streams, and they are basically collections of RDDs. DStreams can be created from streaming sources or by applying transformations to an existing DStream.
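
     A minimal DStream sketch in the spirit of the talk (Spark 1.x APIs; the application name, socket source, host and port are made-up examples):

         import org.apache.spark.SparkConf
         import org.apache.spark.streaming.{Seconds, StreamingContext}

         val conf = new SparkConf().setAppName("InitialsOverTheWire")
         // Group the incoming data into 5-second mini-batches
         val ssc = new StreamingContext(conf, Seconds(5))
         // A DStream created from a streaming source (here, a plain text socket)
         val names = ssc.socketTextStream("localhost", 9999)
         // Transformations on the DStream are applied to each underlying RDD
         val countsByInitial = names.map(n => (n.charAt(0), 1)).reduceByKey(_ + _)
         countsByInitial.print()
         ssc.start()
         ssc.awaitTermination()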
  15. Example: Twitter streaming, “sentiment analysis” for dummies. Sure, it's on GitHub! https://github.com/stefanobaghino/spark-twitter-stream-example
  16. A lot more to be said! ◎ Caching ◎ Shared variables ◎ Partitioning optimization ◎ DataFrames ◎ A huge API ◎ A huge ecosystem
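
     Two of those topics in a nutshell, as a hedged sketch (cache() and broadcast are the relevant APIs; the path and the lookup table are made up for illustration):

         // Caching: keep an RDD in memory across multiple actions
         val names = sc.textFile("hdfs://...").cache()
         names.count()    // the first action computes the RDD and caches it
         names.collect()  // later actions reuse the cached partitions

         // Shared variables: broadcast a read-only value to every executor once
         val nicknames = sc.broadcast(Map("Alice" -> "Ali", "Andrew" -> "Andy"))
         val informal = names.map(n => nicknames.value.getOrElse(n, n))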
  17. The ecosystem: next time! Spark SQL, MLLib, GraphX, and Spark R, on the same Spark Core stack shown earlier.