From Big Data to Fast Data (Codemotion)

An introduction to Apache Spark

Stefano Baghino

November 20, 2015

Transcript

  1. From Big Data to Fast Data: An introduction to Apache Spark. Stefano Baghino, Codemotion Milan 2015
  2. From Big Data to Fast Data with Functional Reactive Containerized Microservices and AI-driven Monads in a galaxy far far away…
  3. Hello! I am Stefano Baghino, Software Engineer @ DATABIZ. [email protected], @stefanobaghino. Favorite PL: Scala. My hero: XKCD's Beret Guy. What I fear: [object Object]
  4. Agenda u Big Data? u Fast Data? u What do we have now?

    u How can we do better? u What is Spark? u What does it do? u How does it work? And also code, somewhere here and there.
  5. Really, what is it?
    ◎ Data that cannot be stored on a single box
    ◎ Requires horizontal scalability
    ◎ Requires a shift from traditional solutions
  6. Let's look at MapReduce
    Disk I/O all the time: each step reads its input from and writes its output to disk.
    Limited model: it's difficult to fit all algorithms into the MapReduce model.
  7. Ok, so what is so good about Spark?
    May sit on top of an existing Hadoop deployment.
    Builds heavily on simple functional programming ideas.
    Computes and caches data in-memory to deliver blazing performance.
  8. Fast? Really? Yes!
                    Hadoop (102.5 TB)   Spark (100 TB)   Spark (1 PB)
    Elapsed time    72 min              23 min           234 min
    # Cores         50,400              6,592            6,080
    Rate/node       0.67 GB/min         20.7 GB/min      22.5 GB/min
    Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  9. Working with Spark
    ◎ Resilient Distributed Dataset
    ◎ Closely resembles a Scala collection
    ◎ Very natural to use for Scala devs
    From the user's point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
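    (Not from the deck, just to make the point concrete: a minimal, self-contained Scala sketch of how collection-like the RDD API reads. The app name, local master URL and sample data are illustrative assumptions.)

        import org.apache.spark.{SparkConf, SparkContext}

        object RddLikeACollection {
          def main(args: Array[String]): Unit = {
            // A local SparkContext, just for experimenting on a laptop
            val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

            // An RDD built from an in-memory collection; the code reads like plain Scala
            val names = sc.parallelize(Seq("Alice", "Bob", "Andy"))
            val longNames = names.map(_.trim).filter(_.length > 3).collect()

            println(longNames.mkString(", "))
            sc.stop()
          }
        }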
  10. What can I do with an RDD?
    Transformations: produce a new RDD, extending the execution graph at each step
      e.g. map, flatMap, filter
    Actions: "terminal" operations that actually trigger the execution and extract a value
      e.g. collect, reduce
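    (A quick illustration of the split, assuming an existing SparkContext named sc as in the sketch above: transformations only describe the computation, and the action at the end is what actually runs it.)

        val numbers = sc.parallelize(1 to 10)

        // Transformations: lazily build a new RDD, no job is launched yet
        val doubledEvens = numbers.filter(_ % 2 == 0).map(_ * 2)

        // Actions: trigger the execution and extract a value
        val total   = doubledEvens.reduce(_ + _)   // 60
        val asArray = doubledEvens.collect()       // Array(4, 8, 12, 16, 20)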
  11. The execution model
    1. Create a DAG of RDDs to represent the computation
    2. Create a logical execution plan for the DAG
    3. Schedule and execute individual tasks
  12. The execution model in action
    Let's count distinct names grouped by their initial:

        sc.textFile("hdfs://...")
          .map(n => (n.charAt(0), n))
          .groupByKey()
          .mapValues(n => n.toSet.size)
          .collect()
  13. Step 1: Create the logical DAG
        sc.textFile("hdfs://...")       -> HadoopRDD
          .map(n => (n.charAt(0), n))   -> MappedRDD
          .groupByKey()                 -> ShuffledRDD
          .mapValues(n => n.toSet.size) -> MappedValuesRDD
          .collect()                    -> Array[(Char, Int)]
  14. Step 2: Create the execution plan
    ◎ Pipeline as much as possible
    ◎ Split into "stages" based on the need to "shuffle" data
    [Diagram: Stage 1 pipelines HadoopRDD -> MappedRDD (Alice, Bob, Andy become (A, Alice), (B, Bob), (A, Andy)); the shuffle into ShuffledRDD starts Stage 2, where MappedValuesRDD yields (A, 2) and (B, 1) and collect() returns Res0 = [(A, 2), (B, 1)]]
  15. So, how is it a Resilient Distributed Dataset?
    Being a lazy, immutable representation of computation, rather than an actual collection of data, RDDs achieve resiliency by simply being re-executed when their results are lost*.
    * because distributed systems and Murphy's Law are best buddies.
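    (A small sketch of the "RDD as a recipe" idea, with sample data as an assumption: toDebugString prints the lineage Spark would replay to rebuild lost partitions, and nothing is computed until an action is called.)

        val counts = sc.parallelize(Seq("Alice", "Bob", "Andy"))
          .map(n => (n.charAt(0), n))
          .groupByKey()
          .mapValues(names => names.toSet.size)

        // No data has been materialized yet; this is the recipe Spark keeps around
        println(counts.toDebugString)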
  16. The ecosystem
    Spark SQL (structured data), Spark Streaming (real-time), MLlib (machine learning), GraphX (graph processing), SparkR (statistical analysis)
    All on top of Spark Core, running on the Standalone Scheduler, YARN, or Mesos.
  17. What we'll see today: Spark Streaming
    [The same ecosystem diagram, with Spark Streaming (real-time) highlighted]
  18. "Mini-batches" are DStreams
    These "mini-batches" are DStreams, or discretized streams, and they are basically a collection of RDDs. DStreams can be created from streaming sources or by applying transformations to an existing DStream.
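    (A minimal word-count sketch of the DStream API; the socket source, port, 5-second batch interval and local master are illustrative assumptions, not taken from the deck.)

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object DStreamSketch {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf().setAppName("dstream-demo").setMaster("local[2]")
            // Every 5-second batch becomes one RDD inside the DStream
            val ssc = new StreamingContext(conf, Seconds(5))

            // A DStream created from a streaming source (a plain TCP socket here)
            val lines = ssc.socketTextStream("localhost", 9999)

            // Transformations on a DStream produce new DStreams, one RDD per batch
            val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
            counts.print()

            ssc.start()
            ssc.awaitTermination()
          }
        }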
  19. Example: Twitter streaming, "sentiment analysis" for dummies
    Sure, it's on GitHub! https://github.com/stefanobaghino/spark-twitter-stream-example
  20. A lot more to be said!
    ◎ Caching
    ◎ Shared variables
    ◎ Partitioning optimization
    ◎ DataFrames
    ◎ A huge API
    ◎ A huge ecosystem
  21. Tomorrow at Codemotion!
    [The ecosystem diagram again: Spark SQL, Spark Streaming, MLlib, GraphX, SparkR on Spark Core]