ReSpark: Automatic Caching for Iterative Applications in Apache Spark

Michael Mior • Rochester Institute of Technology
Kenneth Salem • University of Waterloo

Apache Spark
▸ Framework for large-scale distributed data processing
▸ Widely used for analytics tasks
▸ Contains algorithms for graph processing, machine learning, etc.

Apache Spark Model
▸ Series of lazy transformations, followed by actions that force evaluation of all transformations
▸ Each step produces a resilient distributed dataset (RDD)
▸ Intermediate results can be cached in memory or on disk, optionally serialized
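To make the cost of lazy evaluation concrete, here is a minimal sketch in plain Scala rather than Spark. A collection view stands in for an uncached RDD, and materializing it with `toVector` stands in for caching. The names `LazyRecomputeSketch` and `expensive` and the sample data are assumptions made up for this illustration.

```scala
object LazyRecomputeSketch {
  /** Returns how often `expensive` ran when a result is reused
    * (without caching, with caching). */
  def run(): (Int, Int) = {
    var evaluations = 0
    def expensive(x: Int): Int = { evaluations += 1; x * 2 }
    val data = List(1, 2, 3, 4)

    // Lazy "transformation": nothing is computed when the view is built.
    val lazyDoubled = data.view.map(expensive)
    val total1 = lazyDoubled.sum // first "action" evaluates every element
    val max1 = lazyDoubled.max   // second "action" evaluates them all again
    val uncached = evaluations

    evaluations = 0
    // "Cached" variant: materialize once, then reuse the stored result.
    val cached = data.view.map(expensive).toVector
    val total2 = cached.sum
    val max2 = cached.max
    (uncached, evaluations)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

With four elements and two actions, the uncached variant should evaluate `expensive` twice per element, while the materialized variant evaluates it once per element.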

Spark Caching Best Practices
“Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… …deciding which ones to cache may be challenging.”
Source: https://unraveldata.com/to-cache-or-not-to-cache/

PageRank Example

    var rankGraph = graph.outerJoinVertices(...).map(...)
    var iteration = 0
    while (iteration < numIter) {
      rankGraph.persist()
      val rankUpdates = rankGraph.aggregateMessages(...)
      prevRankGraph = rankGraph
      rankGraph = rankGraph.outerJoinVertices(rankUpdates)
        .persist()
      rankGraph.edges.foreachPartition(...)
      prevRankGraph.unpersist()
    }
    rankGraph.vertices.values.sum()

Transformations

    var rankGraph = graph.outerJoinVertices(...).map(...)
    var iteration = 0
    while (iteration < numIter) {
      rankGraph.persist()
      val rankUpdates = rankGraph.aggregateMessages(...)
      prevRankGraph = rankGraph
      rankGraph = rankGraph.outerJoinVertices(rankUpdates)
        .persist()
      rankGraph.edges.foreachPartition(...)
      prevRankGraph.unpersist()
    }
    rankGraph.vertices.values.sum()

Highlighted on the slide: .outerJoinVertices(...).map(...), .aggregateMessages(...), .outerJoinVertices(rankUpdates)

Actions

    var rankGraph = graph.outerJoinVertices(...).map(...)
    var iteration = 0
    while (iteration < numIter) {
      rankGraph.persist()
      val rankUpdates = rankGraph.aggregateMessages(...)
      prevRankGraph = rankGraph
      rankGraph = rankGraph.outerJoinVertices(rankUpdates)
        .persist()
      rankGraph.edges.foreachPartition(...)
      prevRankGraph.unpersist()
    }
    rankGraph.vertices.values.sum()

Highlighted on the slide: rankGraph.edges.foreachPartition(...), rankGraph.vertices.values.sum()

PageRank RDDs
Some RDDs are used more than once

Spark Model Caching

    var rankGraph = graph.outerJoinVertices(...).map(...)
    var iteration = 0
    while (iteration < numIter) {
      rankGraph.persist()
      val rankUpdates = rankGraph.aggregateMessages(...)
      prevRankGraph = rankGraph
      rankGraph = rankGraph.outerJoinVertices(rankUpdates)
        .persist()
      rankGraph.edges.foreachPartition(...)
      prevRankGraph.unpersist()
    }
    rankGraph.vertices.values.sum()

Highlighted on the slide: rankGraph.persist(), .persist(), rankGraph.edges.foreachPartition(...), prevRankGraph.unpersist()

Caching
▸ Understanding caching is not easy
▸ Persisting is lazy, but unpersisting is not
▸ Difficult without deep Spark knowledge
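The asymmetry between lazy persisting and eager unpersisting can be sketched with a toy model. This is plain Scala and not Spark's implementation: `ToyDataset`, `computeCount`, and `isMaterialized` are hypothetical names invented for the example, and only the behavior described in the bullets above is modeled.

```scala
// Toy stand-in for an RDD (an assumption for illustration, not Spark code).
final class ToyDataset[A](compute: () => Vector[A]) {
  private var persistRequested = false
  private var stored: Option[Vector[A]] = None
  var computeCount = 0

  // Lazy: only records the request; nothing is computed or stored yet.
  def persist(): this.type = { persistRequested = true; this }

  // Eager: stored data is dropped immediately.
  def unpersist(): this.type = { persistRequested = false; stored = None; this }

  def isMaterialized: Boolean = stored.isDefined

  // An "action": computes unless a stored copy exists, and stores the
  // result only if persist() was requested beforehand.
  def collect(): Vector[A] = stored.getOrElse {
    computeCount += 1
    val result = compute()
    if (persistRequested) stored = Some(result)
    result
  }
}
```

In this model, calling `persist()` and then `unpersist()` before any action means nothing was ever cached, which is one way the ordering of these calls can surprise users.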

Unexpected spark caching behavior
“…The strange behavior that I'm seeing is that spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected that to happen only once because of the immediate caching on the next line, but that's not the case.…”
Source: StackOverflow https://stackoverflow.com/q/30835703/123695

ReSpark

    var rankGraph = graph.outerJoinVertices(...).map(...)
    var iteration = 0
    while (iteration < numIter) {
      rankGraph.persist()
      val rankUpdates = rankGraph.aggregateMessages(...)
      prevRankGraph = rankGraph
      rankGraph = rankGraph.outerJoinVertices(rankUpdates)
        .persist()
      rankGraph.edges.foreachPartition(...)
      prevRankGraph.unpersist()
    }
    rankGraph.vertices.values.sum()

RDDs
▸ Each RDD maintains a lineage: the transformations needed to recompute it
▸ RDDs in a Spark program are identified by call site (location in program code)
▸ RDDs with the same call site can be expected to have similar behavior
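One way to picture a call site is as a "file:line" key taken from the JVM stack trace. The sketch below is an assumption for illustration only: the slides do not say how ReSpark or Spark captures call sites, and `CallSiteSketch.here` is a hypothetical helper.

```scala
object CallSiteSketch {
  /** Returns "file:line" for the code that called `here()`. */
  def here(): String = {
    val frames = new Throwable().getStackTrace
    // frames(0) is `here` itself; frames(1) is its caller.
    val caller = frames(1)
    s"${caller.getFileName}:${caller.getLineNumber}"
  }
}
```

Objects created on the same source line inside a loop get the same key on every iteration, which is the property that lets behavior observed in earlier iterations inform later ones.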

ReSpark

    var rankGraph = graph.outerJoinVertices(...).map(...)
    var iteration = 0
    while (iteration < numIter) {
      val rankUpdates = rankGraph.aggregateMessages(...)
      prevRankGraph = rankGraph
      rankGraph = rankGraph.outerJoinVertices(rankUpdates)
    }
    rankGraph.vertices.values.sum()

Annotations shown as the slide builds:
▸ rankGraph: 0
▸ rankGraph: 2 (Persist!)
▸ rankGraph: 0 (Unpersist!)

ReSpark
▸ The use of each RDD, based on transformations and actions, is analyzed for each call site
▸ RDDs created at a call site are persisted if more than one use is expected
▸ RDDs are automatically unpersisted once they have been used the expected number of times
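The policy in these bullets can be sketched as a small use-count tracker keyed by call site. This is a hypothetical illustration, not ReSpark's code: the class name, method names, numeric dataset ids, and the rule of taking the maximum observed use count as the expectation are all assumptions.

```scala
import scala.collection.mutable

// Hypothetical sketch of a use-count caching policy (not ReSpark's code).
final class UseCountCachePolicy {
  // Expected uses per call site, learned from earlier datasets (assumed rule).
  private val expectedUses =
    mutable.Map.empty[String, Int].withDefaultValue(0)
  // Remaining expected uses for datasets that are currently persisted.
  private val remaining = mutable.Map.empty[Long, Int]

  /** Record how often a finished dataset created at `site` was used. */
  def recordObservedUses(site: String, uses: Int): Unit =
    expectedUses(site) = math.max(expectedUses(site), uses)

  /** On creation: returns true if the new dataset should be persisted. */
  def onCreate(id: Long, site: String): Boolean = {
    val uses = expectedUses(site)
    if (uses > 1) { remaining(id) = uses; true } else false
  }

  /** On each use: returns true when the dataset should now be unpersisted. */
  def onUse(id: Long): Boolean = remaining.get(id) match {
    case Some(1) => remaining -= id; true
    case Some(n) => remaining(id) = n - 1; false
    case None    => false // not persisted, nothing to release
  }
}
```

A dataset from a call site with an expected use count of 2 would be persisted on creation and released on its second use, matching the "2, Persist!" and "0, Unpersist!" annotations in the walkthrough.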

PageRank on ReSpark
Without any caching, many jobs take hours!

Evaluation
▸ Tested with multiple applications in the SparkBench benchmarking suite
▸ We remove manual cache annotations and instead run with ReSpark
▸ Added overhead ranges from <2% to ~16%
▸ No manual annotation required!

Future Work
▸ Improved cache replacement policies
▸ Adaptive caching based on information available at runtime
▸ Static analysis for information about future transformations

Questions?
