The Spark Ecosystem
Fast and Expressive Big Data Analytics in Scala

Matei Zaharia and Reynold Xin
University of California, Berkeley
www.spark-project.org
What is Spark?

Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)
Often 5× less code
Project History

Spark started in 2009, open sourced 2010

In use at Intel, Yahoo!, Adobe, Quantifind, Conviva, Ooyala, Bizo and others

17 companies now contributing code
A Growing Stack

Part of the Berkeley Data Analytics Stack (BDAS) project to build an open source next-gen analytics system

Built on Spark:
» Shark: SQL
» Spark Streaming: real-time
» GraphX: graph
» MLbase: machine learning
» …
This Talk

» Spark introduction & use cases
» GraphX: graph computation
» Shark: SQL over Spark

See tomorrow for a talk on Streaming!
Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing

All 3 need faster data sharing across parallel jobs
Data Sharing in MapReduce

[Figure: each iteration (iter. 1, iter. 2, …) reads its input from HDFS and writes its output back to HDFS; each ad-hoc query (query 1-3) re-reads the input from HDFS to produce its result]

Slow due to replication, serialization, and disk IO
Data Sharing in Spark

[Figure: after one-time processing of the input, iterations (iter. 1, iter. 2, …) and queries (query 1-3) share data through distributed memory]

10-100× faster than network and disk
Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure

Programming interface
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala shell
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))
messages.cache()

messages.filter(_.contains("foo")).count            // action
messages.filter(_.contains("bar")).count
...

[Figure: the driver sends tasks to workers; each worker reads one block of the file, then answers later queries from its in-memory cache, returning results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
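To inspect such a lineage interactively, the RDD API offers toDebugString; a minimal sketch from the shell, assuming sc is the shell's SparkContext (as in later Apache Spark shells) and leaving the path elided as above:

val messages = sc.textFile("hdfs://...")   // HadoopRDD
  .filter(_.contains("error"))             // FilteredRDD
  .map(_.split('\t')(2))                   // MappedRDD

// Prints the chain of parent RDDs; this is exactly the chain Spark
// re-runs to recompute a lost partition
println(messages.toDebugString)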
Example: Logistic Regression

Goal: find best line separating two sets of points

[Figure: scatter plot of + and – points, with a random initial line converging to the target separator]
Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load once, reuse every iteration
var w = Vector.random(D)                               // random initial separating plane

for (i <- 1 to ITERATIONS) {
  // each pass computes the gradient in parallel over the cached points
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
Logistic Regression Performance

[Chart: running time (s) vs number of iterations (1-30), Hadoop vs Spark]

Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
Demo
Supported Operators

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
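A quick sketch exercising a few of these from the shell (the tiny inline dataset is made up for illustration; sc is the shell's SparkContext):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val names = sc.parallelize(Seq(("a", "alpha"), ("b", "beta")))

pairs.map(_._2).reduce(_ + _)        // 6
pairs.filter(_._2 > 1).count()       // 2
pairs.reduceByKey(_ + _).collect()   // Array((a,4), (b,2))
pairs.join(names).collect()          // Array((a,(1,alpha)), (a,(3,alpha)), (b,(2,beta)))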
Other Engine Features

General operator graphs (e.g. map-reduce-reduce)

Hash-based reduces (faster than Hadoop's sort)

Controlled data partitioning to lower communication (see the sketch below)

PageRank performance, iteration time (s):
» Hadoop: 171
» Basic Spark: 72
» Spark + Controlled Partitioning: 23
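Controlled partitioning is what drives the PageRank numbers above: partition the link table once, and every iteration's join finds links and ranks already co-located. A sketch along the lines of the well-known Spark PageRank example (the path, names, and the org.apache.spark import path of later releases are illustrative assumptions):

import org.apache.spark.HashPartitioner

// Hash-partition the links once and cache them; ranks inherits the
// same partitioner via mapValues, so the join below never re-shuffles
val links = sc.textFile("hdfs://.../links")
  .map { line => val p = line.split('\t'); (p(0), p(1)) }
  .groupByKey()
  .partitionBy(new HashPartitioner(100))
  .cache()

var ranks = links.mapValues(_ => 1.0)
for (i <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap {
    case (urls, rank) => urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.take(5).foreach(println)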
Spark Community

» 700+ meetup members
» 30+ external contributors
» 17 companies contributing
This Talk

» Spark introduction & use cases
» GraphX: graph computation
» Shark: SQL over Spark
Graphs are Essential to Data Mining

» Identify influential people and information
» Find communities
» Target ads and products
» Model complex data dependencies
Specialized Graph Systems

[Figure: logos of specialized graph systems, e.g. Pregel]
Specialized Graph Systems

[Figure: example graph over vertices A-F]

1. APIs to capture complex dependencies, i.e. graph parallelism vs data parallelism
2. Exploit graph structure to reduce communication and computation
How is GraphX different from ____?

Answer: Simplicity
Simplicity

Integration with Spark: no disparate system
» ETL (extract, transform, load)
» Consumption of graph output
» Fault-tolerance
» Use the Scala REPL for interactive graph mining

Programmability: leveraging the Scala/Spark API
» Implemented the GraphLab / Pregel APIs in 20 lines of code
» PageRank in 5 lines of code
Resilient Distributed Graphs

An extension of Spark RDDs
» Immutable, partitioned set of vertices and edges
» Constructed using RDD[Edge] and RDD[Vertex]

Additional set of primitives (3 functions) for graph computations
» Able to express most graph algorithms (PageRank, Shortest Path, Connected Components, ALS, …)
» Implemented GraphLab / Pregel in 20 lines of code
val vertices = spark.textFile("hdfs://path/pages.csv")
val edges = spark.textFile("hdfs://path/to/links.csv")
                 .map(line => new Edge(line.split('\t')))
val g = new Graph(vertices, edges).cache

println(g.vertices.count)
println(g.edges.count)

val g1 = g.filterVertices(_.split('\t')(2) == "Berkeley")

val ranks = Analytics.pageRank(g1, numIter = 10)
println(ranks.vertices.sum)

[Figure: the GraphX stack: a Resilient Distributed Graph at the base, the Pregel and GraphLab APIs above it, and algorithms (PageRank, Shortest Path, Connected Components, ALS) on top]
Early Performance

Benefits from Spark's:
» In-memory caching
» Hash-based operators
» Controlled data partitioning

PageRank on 16 nodes, iteration time (s): Hadoop 1340 vs GraphX 165

Alpha coming in June / July!
This Talk

» Spark introduction & use cases
» GraphX: graph computation
» Shark: SQL over Spark
What is Shark?

Columnar SQL analytics engine for Spark
» Supports both SQL and complex analytics
» Up to 100× faster than Apache Hive

Compatible with Apache Hive
» HiveQL, UDF/UDAF, SerDes, Scripts
» Runs on existing Hive warehouses

In use at Yahoo! for fast in-memory OLAP

Spark Integration

Unified system for SQL, graph processing, machine learning

All share the same set of workers and caches:

def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("""SELECT * FROM user u
                       JOIN comment c ON c.uid=u.uid""")

val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}
val trainedVector = logRegress(features.cache())
Teaser: Spark Streaming

sc.twitterStream(...)
  .flatMap(_.getText.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

Come see our talk tomorrow at 2:30!
Getting Started

Visit www.spark-project.org for
» Video tutorials
» Online exercises (EC2)
» Docs and API guides

Easy to run in local mode, standalone clusters, Apache Mesos, YARN or EC2 (see the sketch below)
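A minimal local-mode word count, as a sketch (the object and input names are illustrative; import paths follow later Apache Spark releases, while releases of this era used the bare spark package):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // implicits for pair-RDD operators

object WordCount {
  def main(args: Array[String]) {
    // "local[*]" runs Spark inside this JVM on all available cores
    val sc = new SparkContext("local[*]", "WordCount")
    val counts = sc.textFile("words.txt")
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)
    sc.stop()
  }
}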
Training camp at Berkeley in August
Conclusion

Big data analytics is evolving to include:
» More complex analytics (e.g. machine learning)
» More interactive ad-hoc queries
» More real-time stream processing

Spark is a fast, unified platform for these apps

Look for our training camp at Berkeley this August!

spark-project.org
Backup Slides
Behavior with Not Enough RAM

Iteration time (s) by % of working set in memory:
» Cache disabled: 68.8
» 25%: 58.1
» 50%: 40.7
» 75%: 29.7
» Fully cached: 11.5
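This graceful degradation comes from Spark's storage levels: cached partitions that no longer fit are recomputed from lineage or spilled. A sketch (data stands in for any RDD; the StorageLevel import path follows later Apache Spark releases):

import org.apache.spark.storage.StorageLevel

// An RDD's storage level can only be set once; these are alternatives.

// cache() = MEMORY_ONLY: partitions that don't fit are dropped and
// recomputed from lineage the next time they are needed
data.cache()

// MEMORY_AND_DISK: spill partitions that don't fit to local disk
// data.persist(StorageLevel.MEMORY_AND_DISK)

// MEMORY_ONLY_SER: store partitions serialized to fit more in RAM
// data.persist(StorageLevel.MEMORY_ONLY_SER)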
Fault Tolerance

RDDs track lineage information to rebuild on failure:

file.map(rec => (rec.`type`, 1))
    .reduceByKey(_ + _)
    .filter { case (recordType, count) => count > 10 }

[Figure: lineage graph: input file → map → reduce → filter]