
From Big Data to Fast Data (DevFest)

An introduction to Apache Spark.

Stefano Baghino

October 03, 2015

Transcript

  1. Hi!
     •  Collaborator @ GDG Milano
     •  Software Engineer @ DATABIZ
     I stare at shiny rectangles and push buttons. Sometimes good things happen and people pay me for this!
  2. Wait! Who are you?
     •  Are you a lambda dancer?
     •  Do you have a PB of doges?
  3. Agenda
     Big Data? Fast Data? What do we have now? How can we do better?
     What is Spark? What does it do? How does it work?
     And also some code, here and there.
  4. Big Data
     Huge and diverse datasets. Just for today, size matters.
     You need a cluster, you need horizontal scalability. New problems require new solutions.
  5. What about Fast Data?
     You have a massive, endless stream of data. And you can’t wait for a batch computation to complete. You must have an answer now.
  6. What do we have now?
     MapReduce is one of the leading cluster computing frameworks. It scales beautifully, but it does I/O at each step, and its computational model is too restrictive.
  7. Meet Apache Spark
     It works with your cluster. Simple, expressive API with multi-language support. Computes and caches in-memory. It’s fast!
  8. Fast? Really? Yup!
                      Hadoop (102.5 TB)   Spark (100 TB)   Spark (1 PB)
     Elapsed time     72 min              23 min           234 min
     # Cores          50,400              6,592            6,080
     Rate/node        0.67 GB/min         20.7 GB/min      22.5 GB/min
     Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  9. A lot of interest around it
     All major Big Data players are supporting it. Available in the cloud from Databricks, the company behind Spark. Also available for Click to Deploy on Google Compute Engine*.
     * https://cloud.google.com/solutions/hadoop/click-to-deploy
  10. Introducing the RDD
     Say hi to Spark’s main data abstraction, the Resilient Distributed Dataset.
     From the user’s point of view, the RDD is effectively a collection, hiding all the details of its distribution throughout the cluster.
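     To make that concrete, a minimal Scala sketch follows; the local master URL and the names.txt file are assumptions for illustration, not part of the original deck.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal local setup; on a real cluster you would point the master at it instead.
    val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD built from an in-memory collection...
    val numbers = sc.parallelize(1 to 1000000)

    // ...and one built from a (hypothetical) text file. Both behave like ordinary
    // collections, even though their partitions are spread across the cluster.
    val names = sc.textFile("names.txt")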
  11. What about resilience?
     When you have a large cluster, things fail often. You must have a way to recover from failures. RDDs hide a little secret.
  12. What can I do with an RDD?
     Transformations: produce a new RDD, extending the graph at each step (map, flatMap, filter and the like).
     Actions: “terminal” operations that execute the graph (like collect or reduce).
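     A small sketch of the difference, reusing the sc context from the earlier snippet; the sample data is made up for illustration.

    // Transformations are lazy: nothing runs when these lines are evaluated.
    val words   = sc.parallelize(Seq("spark", "scala", "streaming", "rdd"))
    val lengths = words.map(_.length)       // transformation
    val longish = lengths.filter(_ > 4)     // transformation

    // Actions trigger the actual computation on the cluster.
    val result = longish.collect()          // Array(5, 5, 9)
    val total  = longish.reduce(_ + _)      // 19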
  13. The execution model
     1. Create a DAG of RDDs to represent the computation
     2. Create the logical execution plan for the DAG
     3. Schedule and execute individual tasks
  14. The execution model in action
     Let’s look at the process in slow motion. Let’s count distinct names grouped by their initial.
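     The deck walks through this example on the slides; a hedged Scala version of it might look like the sketch below (the sample names are invented, and the original presumably read them from a file).

    // Count distinct names grouped by their initial.
    val names = sc.parallelize(Seq("Anna", "Andrea", "Anna", "Bob", "Bea", "Carl"))

    val countsByInitial = names
      .distinct()                      // transformation: drop duplicate names
      .map(name => (name.head, 1))     // transformation: key each name by its initial
      .reduceByKey(_ + _)              // transformation: count names per initial

    // Only the action below builds the execution plan and runs the tasks.
    countsByInitial.collect().foreach(println)   // (A,2), (B,2), (C,1)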
  15. So, how is it resilient?
     RDDs are “just” a lazy, immutable representation of computation. If something goes wrong, Spark will simply try again.
  16. “Mini-batching”
     These “mini-batches” are DStreams (discretized streams). A DStream is just a collection of RDDs. Create your DStreams from streaming sources and apply transformations just as you would with RDDs.
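     A minimal Spark Streaming sketch along these lines; the socket source, port and batch interval are illustrative choices, not taken from the deck.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // Every 5-second batch of input becomes one RDD in the DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A socket source is just one option; Kafka, files and other sources work too.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same transformations you would apply to an RDD.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()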
  17. Homework!
     •  Caching
     •  Shared variables
     •  Partitioning optimization
     •  DataFrames
     •  Know the API
     •  Learn the ecosystem
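     As a starting point for the first two homework items, a small hedged sketch of caching and shared variables (the events.log file and its contents are made up):

    // Caching: keep a frequently reused RDD in memory across actions.
    val errors = sc.textFile("events.log").filter(_.contains("ERROR")).cache()
    println(errors.count())              // first action materializes and caches the RDD
    errors.take(10).foreach(println)     // later actions reuse the cached partitions

    // Shared variables: a read-only broadcast value shipped once per executor...
    val countryNames = sc.broadcast(Map("IT" -> "Italy", "DE" -> "Germany"))

    // ...and an accumulator, a write-only counter that tasks can add to.
    val unknownCodes = sc.accumulator(0)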