
From Big Data to Fast Data (DevFest)

An introduction to Apache Spark.

Stefano Baghino

October 03, 2015

Transcript

  1. Hi!
     •  Collaborator @ GDG Milano
     •  Software Engineer @ DATABIZ
     I stare at shiny rectangles and push buttons. Sometimes good things happen and people pay me for this!
  2. Wait! Who are you?
     •  Are you a lambda dancer?
     •  Do you have a PB of doges?
  3. Agenda
     Big Data? Fast Data? What do we have now? How can we do better?
     What is Spark? What does it do? How does it work?
     And also some code, here and there.
  4. Big Data
     Huge and diverse datasets. Just for today, size matters.
     You need a cluster, you need horizontal scalability. New problems require new solutions.
  5. What about Fast Data?
     You have a massive, endless stream of data. And you can’t wait for a batch computation to complete. You must have an answer now.
  6. What do we have now?
     MapReduce is one of the leading cluster computing frameworks. It scales beautifully, but it does I/O at each step, and its computational model is too restrictive.
  7. Meet Apache Spark
     It works with your cluster. Simple, expressive API with multi-language support. Computes and caches in-memory. It’s fast!
  8. Fast? Really? Yup!
                      Hadoop (102.5 TB)   Spark (100 TB)   Spark (1 PB)
     Elapsed time     72 min              23 min           234 min
     # Cores          50,400              6,592            6,080
     Rate/node        0.67 GB/min         20.7 GB/min      22.5 GB/min
     Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  9. A lot of interest around it
     All major Big Data players are supporting it. Available in the cloud from Databricks, the company behind Spark. Also available for Click to Deploy on Google Compute Engine*.
     * https://cloud.google.com/solutions/hadoop/click-to-deploy
  10. Introducing the RDD
     Say hi to Spark’s main data abstraction, the Resilient Distributed Dataset.
     From the user’s point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
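     To make that concrete, a minimal Scala sketch follows; the local master URL and the names.txt file are assumptions for illustration, not part of the original deck.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal local setup; on a real cluster you would point the master at it instead.
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD built from an in-memory collection...
    val numbers = sc.parallelize(1 to 1000000)

    // ...and one built from a (hypothetical) text file. Both behave like ordinary
    // collections, even though their partitions are spread across the cluster.
    val names = sc.textFile("names.txt")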
  11. What about resilience?
     When you have a large cluster, things fail often. You must have a way to recover from failures. RDDs hide a little secret.
  12. What can I do with an RDD?
     Transformations: produce a new RDD, extending the graph at each step (map, flatMap, filter and the like).
     Actions: “terminal” operations that execute the graph (like collect or reduce).
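     A small sketch of the difference, reusing the sc context from the earlier snippet; the sample data is made up for illustration.

    // Transformations are lazy: nothing runs when these lines are evaluated.
    val words   = sc.parallelize(Seq("spark", "scala", "streaming", "rdd"))
    val lengths = words.map(_.length)       // transformation
    val longish = lengths.filter(_ > 4)     // transformation

    // Actions trigger the actual computation on the cluster.
    val result = longish.collect()          // Array(5, 5, 9)
    val total  = longish.reduce(_ + _)      // 19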
  13. The execution model
     1. Create a DAG of RDDs to represent the computation
     2. Create the logical execution plan for the DAG
     3. Schedule and execute individual tasks
  14. The execution model in action
     Let’s look at the process in slow motion. Let’s count distinct names grouped by their initial.
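     The deck walks through this example on the slides; a hedged Scala version of it might look like the sketch below (the sample names are invented, and the original presumably read them from a file).

    // Count distinct names grouped by their initial.
    val names = sc.parallelize(Seq("Anna", "Andrea", "Anna", "Bob", "Bea", "Carl"))

    val countsByInitial = names
      .distinct()                      // transformation: drop duplicate names
      .map(name => (name.head, 1))     // transformation: key each name by its initial
      .reduceByKey(_ + _)              // transformation: count names per initial

    // Only the action below builds the execution plan and runs the tasks.
    countsByInitial.collect().foreach(println)   // (A,2), (B,2), (C,1)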
  15. So, how is it resilient?
     RDDs are “just” a lazy, immutable representation of computation. If something goes wrong, Spark will simply try again.
  16. “Mini-batching”
     These “mini-batches” are DStreams (discretized streams). A DStream is just a collection of RDDs. Create your DStreams from streaming sources and apply transformations just as you would with RDDs.
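     A minimal Spark Streaming sketch along these lines; the socket source, port and batch interval are illustrative choices, not taken from the deck.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // Every 5-second batch of input becomes one RDD in the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A socket source is just one option; Kafka, files and other sources work too.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same transformations you would apply to an RDD.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()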
  17. Homework!
     •  Caching
     •  Shared variables
     •  Partitioning optimization
     •  DataFrames
     •  Know the API
     •  Learn the ecosystem
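     As a starting point for the first two homework items, a small hedged sketch of caching and shared variables (the events.log file and its contents are made up):

    // Caching: keep a frequently reused RDD in memory across actions.
    val errors = sc.textFile("events.log").filter(_.contains("ERROR")).cache()
    println(errors.count())              // first action materializes and caches the RDD
    errors.take(10).foreach(println)     // later actions reuse the cached partitions

    // Shared variables: a read-only broadcast value shipped once per executor...
    val countryNames = sc.broadcast(Map("IT" -> "Italy", "DE" -> "Germany"))

    // ...and an accumulator, a write-only counter that tasks can add to.
    val unknownCodes = sc.accumulator(0)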