From Big Data to Fast Data
Apache Spark
Stefano Baghino / GDG Milano
Slide 2
Hi!
● Collaborator @ GDG Milano
● Software Engineer @ DATABIZ
I stare at shiny rectangles and push
buttons. Sometimes good things
happen and people pay me for this!
Slide 3
Wait! Who are you?
● Are you a lambda dancer?
● Do you have a PB of doges?
Slide 4
Agenda
Big Data?
Fast Data?
What do we have now?
How can we do better?
What is Spark?
What does it do?
How does it work?
Plus some code here and there.
Slide 5
Big Data
Huge and diverse datasets.
Just for today, size matters.
You need a cluster, you need horizontal scalability.
New problems require new solutions.
Slide 6
What about Fast Data?
You have a massive, endless stream of data.
And you can’t wait for a batch computation to complete.
You must have an answer now.
Slide 7
What do we have now?
MapReduce is one of the leading cluster computing frameworks.
It scales beautifully, but...
It does disk I/O at every step.
And its computational model is too restrictive.
Slide 8
Meet Apache Spark
It works with your cluster.
Simple, expressive API with multi-language support.
Computes and caches in memory. It’s fast!
How does it play with my cluster?
● Local
● Standalone
● Hadoop
● Mesos
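The choice of cluster manager boils down to the master URL you hand to Spark. A rough sketch (not from the slides; host names and ports are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the master URL decides where the job runs.
//   local[*]          -> all cores of this machine
//   spark://host:7077 -> a standalone Spark cluster
//   yarn              -> a Hadoop/YARN cluster ("yarn-client" on older releases)
//   mesos://host:5050 -> a Mesos cluster
object ClusterModes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cluster-modes-example")
      .setMaster("local[*]") // swap in one of the URLs above

    val sc = new SparkContext(conf)

    // A trivial job, just to prove the context is alive.
    println(sc.parallelize(1 to 100).sum())

    sc.stop()
  }
}
```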
Slide 12
A lot of interest around it
All major Big Data players are supporting it.
Available in the cloud from the makers of Spark, Databricks.
Also available for Click to Deploy on Google Compute Engine*.
* https://cloud.google.com/solutions/hadoop/click-to-deploy
Slide 13
Where do things go?
Slide 14
Introducing the RDD
Say hi to Spark’s main data abstraction, the
Resilient Distributed Dataset
From the user’s point of view, the RDD is effectively a collection,
hiding all the details of its distribution throughout the cluster.
Slide 15
CODE!!!
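The code shown on this slide is not part of the transcript; the snippet below is a minimal sketch of the idea, using a made-up list of names and a local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // parallelize turns a local collection into an RDD, distributed over the cluster...
    val names = sc.parallelize(Seq("Ada", "Alan", "Grace", "Linus", "Barbara"))

    // ...which you then use much like an ordinary Scala collection.
    names.filter(_.length <= 4).collect().foreach(println) // prints Ada, Alan

    sc.stop()
  }
}
```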
Slide 16
The anatomy of an RDD
Slide 17
What about resilience?
When you have a large cluster, things fail often.
You must have a way to recover from failures.
RDDs hide a little secret.
Slide 18
The true nature of an RDD
Slide 19
What can I do with an RDD?
Transformations
Produce a new RDD, extending the graph at each step
(map, flatMap, filter and the like)
Actions
They are “terminal” operations, executing the graph
(like collect or reduce)
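A small illustrative snippet (not from the slides, and assuming an existing SparkContext named sc) contrasting the two kinds of operations:

```scala
val numbers = sc.parallelize(1 to 10)

// Transformations: lazy, they only extend the graph, nothing runs yet.
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Actions: "terminal" operations that actually execute the graph.
val total  = evens.reduce(_ + _) // runs the whole pipeline
val asList = evens.collect()     // ditto, bringing the results to the driver
```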
Slide 20
The execution model
1. Create a DAG of RDDs to represent computation
2. Create the logical execution plan for the DAG
3. Schedule and execute individual tasks
Slide 21
The execution model in action
Let’s look at the process in slow motion.
Let’s count distinct names grouped by their initial.
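In code, the walkthrough could look roughly like this (a sketch with made-up names, again assuming an existing SparkContext named sc):

```scala
val names = sc.parallelize(Seq("Anna", "Andrea", "Anna", "Bruno", "Barbara", "Carlo"))

val countsByInitial =
  names
    .distinct()                  // drop duplicate names
    .map(name => (name.head, 1)) // key each name by its initial
    .reduceByKey(_ + _)          // count per initial

countsByInitial.collect().foreach { case (initial, count) =>
  println(s"$initial -> $count") // A -> 2, B -> 2, C -> 1 (order may vary)
}
```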
Slide 22
First, create the logical DAG
Slide 23
Then, create the execution plan
Slide 24
So, how is it resilient?
RDDs are “just” a lazy, immutable representation of computation.
If something goes wrong, Spark simply recomputes the lost pieces from that graph.
Slide 25
The ecosystem
Slide 26
The ecosystem
Slide 27
Spark Streaming
Slide 28
“Mini-batching”
The stream is split into “mini-batches” and exposed as a DStream (discretized stream).
A DStream is just a sequence of RDDs, one per batch.
Create your DStreams from streaming sources.
Apply transformations just as you would with RDDs.
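The talk's real demo is the Twitter example linked on the next slide; the snippet below is a simpler, self-contained sketch (reading text from a local socket, with placeholder host and port) showing the same mechanics:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")

    // Every 5 seconds a new mini-batch (an RDD) shows up on each DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A DStream from a streaming source: here, a plain TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations look just like the RDD ones.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```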
Slide 29
CODE!!!
Available on GitHub: https://github.com/stefanobaghino/spark-twitter-stream-example
Slide 30
Homework!
● Caching
● Shared variables
● Partitioning optimization
● DataFrames
● Know the API
● Learn the ecosystem
Slide 31
THANKS!
Psst... we’re hiring!
Java? Scala? Big Data? Front End? Come talk to me afterwards.
@stefanobaghino
[email protected]