From Big Data to Fast Data
Apache Spark
Stefano Baghino / GDG Milano
Slide 2
Hi!
● Collaborator @ GDG Milano
● Software Engineer @ DATABIZ
I stare at shiny rectangles and push
buttons. Sometimes good things
happen and people pay me for this!
Slide 3
Wait! Who are you?
● Are you a lambda dancer?
● Do you have a PB of doges?
Slide 4
Agenda
Big Data?
Fast Data?
What do we have now?
How can we do better?
What is Spark?
What does it do?
How does it work?
Plus some code here and there.
Slide 5
Big Data
Huge and diverse datasets.
Just for today, size matters.
You need a cluster, you need horizontal scalability.
New problems require new solutions.
Slide 6
What about Fast Data?
You have a massive, endless stream of data.
And you can’t wait for a batch computation to complete.
You must have an answer now.
Slide 7
What do we have now?
MapReduce is one of the leading cluster computing frameworks.
It scales beautifully, but...
It does disk I/O at every step.
And its computational model is too restrictive.
Slide 8
Meet Apache Spark
It works with your cluster.
Simple, expressive API with multi-language support.
Computes and caches in memory. It’s fast!
How does it play with my cluster?
● Local
● Standalone
● Hadoop
● Mesos
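The choice of cluster manager boils down to the master URL you hand to Spark. A rough sketch (not from the slides; host names and ports are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the master URL decides where the job runs.
//   local[*]          -> all cores of this machine
//   spark://host:7077 -> a standalone Spark cluster
//   yarn              -> a Hadoop/YARN cluster ("yarn-client" on older releases)
//   mesos://host:5050 -> a Mesos cluster
object ClusterModes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cluster-modes-example")
      .setMaster("local[*]") // swap in one of the URLs above

    val sc = new SparkContext(conf)

    // A trivial job, just to prove the context is alive.
    println(sc.parallelize(1 to 100).sum())

    sc.stop()
  }
}
```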
Slide 12
A lot of interest around it
All major Big Data players are supporting it.
Available in the cloud from the makers of Spark, Databricks.
Also available for Click to Deploy on Google Compute Engine*.
* https://cloud.google.com/solutions/hadoop/click-to-deploy
Slide 13
Where do things go?
Slide 14
Introducing the RDD
Say hi to Spark’s main data abstraction, the
Resilient Distributed Dataset
From the user’s point of view, the RDD is effectively a collection,
hiding all the details of its distribution throughout the cluster.
Slide 15
CODE!!!
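The code shown on this slide is not part of the transcript; the snippet below is a minimal sketch of the idea, using a made-up list of names and a local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // parallelize turns a local collection into an RDD, distributed over the cluster...
    val names = sc.parallelize(Seq("Ada", "Alan", "Grace", "Linus", "Barbara"))

    // ...which you then use much like an ordinary Scala collection.
    names.filter(_.length <= 4).collect().foreach(println) // prints Ada, Alan

    sc.stop()
  }
}
```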
Slide 16
The anatomy of an RDD
Slide 17
What about resilience?
When you have a large cluster, things fail often.
You must have a way to recover from failures.
RDDs hide a little secret.
Slide 18
The true nature of an RDD
Slide 19
What can I do with an RDD?
Transformations
Produce a new RDD, extending the graph at each step
(map, flatMap, filter and the like)
Actions
They are “terminal” operations, executing the graph
(like collect or reduce)
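A small illustrative snippet (not from the slides, and assuming an existing SparkContext named sc) contrasting the two kinds of operations:

```scala
val numbers = sc.parallelize(1 to 10)

// Transformations: lazy, they only extend the graph, nothing runs yet.
val doubled = numbers.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// Actions: "terminal" operations that actually execute the graph.
val total  = evens.reduce(_ + _) // runs the whole pipeline
val asList = evens.collect()     // ditto, bringing the results to the driver
```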
Slide 20
The execution model
1. Create a DAG of RDDs to represent computation
2. Create the logical execution plan for the DAG
3. Schedule and execute individual tasks
Slide 21
The execution model in action
Let’s look at the process in slow motion.
Let’s count distinct names grouped by their initial.
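In code, the walkthrough could look roughly like this (a sketch with made-up names, again assuming an existing SparkContext named sc):

```scala
val names = sc.parallelize(Seq("Anna", "Andrea", "Anna", "Bruno", "Barbara", "Carlo"))

val countsByInitial =
  names
    .distinct()                  // drop duplicate names
    .map(name => (name.head, 1)) // key each name by its initial
    .reduceByKey(_ + _)          // count per initial

countsByInitial.collect().foreach { case (initial, count) =>
  println(s"$initial -> $count") // A -> 2, B -> 2, C -> 1 (order may vary)
}
```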
Slide 22
First, create the logical DAG
Slide 23
Then, create the execution plan
Slide 24
So, how is it resilient?
RDDs are “just” a lazy, immutable representation of computation.
If something goes wrong, Spark simply recomputes the lost pieces from that graph.
Slide 25
The ecosystem
Slide 26
The ecosystem
Slide 27
Spark Streaming
Slide 28
“Mini-batching”
The stream is split into “mini-batches” and exposed as a DStream (discretized stream).
A DStream is just a sequence of RDDs, one per batch.
Create your DStreams from streaming sources.
Apply transformations just as you would with RDDs.
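The talk's real demo is the Twitter example linked on the next slide; the snippet below is a simpler, self-contained sketch (reading text from a local socket, with placeholder host and port) showing the same mechanics:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")

    // Every 5 seconds a new mini-batch (an RDD) shows up on each DStream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // A DStream from a streaming source: here, a plain TCP socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations look just like the RDD ones.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```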
Slide 29
CODE!!!
Available on GitHub: https://github.com/stefanobaghino/spark-twitter-stream-example
Slide 30
Homework!
● Caching
● Shared variables
● Partitioning optimization
● DataFrames
● Know the API
● Learn the ecosystem
Slide 31
THANKS!
Psst... we’re hiring!
Java? Scala? Big Data? Front End? Come talk to me afterwards.
@stefanobaghino
[email protected]