ReSpark: Automatic Caching for Iterative Applications in Apache Spark

Slide 1

Slide 1 text

ReSpark: Automatic Caching for Iterative Applications in Apache Spark
Michael Mior • Rochester Institute of Technology
Kenneth Salem • University of Waterloo

Slide 2

Slide 2 text

Apache Spark
▸ Framework for large-scale distributed data processing
▸ Widely used for analytics tasks
▸ Contains algorithms for graph processing, machine learning, etc.

Slide 3

Slide 3 text

Apache Spark Model
▸ A series of lazy transformations is followed by actions that force evaluation of all pending transformations
▸ Each step produces a resilient distributed dataset (RDD)
▸ Intermediate results can be cached in memory or on disk, optionally serialized
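The lazy-transformation and action model can be illustrated without a cluster. The sketch below is only an analogy built on plain Scala collection views, not Spark code: the view plays the role of an uncached RDD, `sum` and `max` play the role of actions, and `toVector` plays the role of caching. The names `LazyEvaluationDemo` and `Counts` are made up for this illustration.

```scala
object LazyEvaluationDemo {
  // How often the mapping function had run at five points in time.
  final case class Counts(
      afterTransform: Int,
      afterFirstAction: Int,
      afterSecondAction: Int,
      afterMaterialize: Int,
      afterCachedActions: Int)

  def run(): Counts = {
    var evaluations = 0

    // "Transformation": a view is lazy, so the function does not run yet.
    val squares = (1 to 5).view.map { x => evaluations += 1; x * x }
    val afterTransform = evaluations

    // First "action": forces the mapping function to run.
    val total = squares.sum
    val afterFirstAction = evaluations

    // Second "action" on the uncached view: the work is repeated.
    val largest = squares.max
    val afterSecondAction = evaluations

    // "Caching": materialize the results once...
    val cached = squares.toVector
    val afterMaterialize = evaluations

    // ...so later "actions" reuse them without re-running the function.
    val cachedTotal = cached.sum
    val cachedLargest = cached.max
    require(total == cachedTotal && largest == cachedLargest)

    Counts(afterTransform, afterFirstAction, afterSecondAction,
      afterMaterialize, evaluations)
  }
}
```

Calling `LazyEvaluationDemo.run()` shows the counter staying at zero until the first action, growing again on the second action, and no longer changing once the results have been materialized.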

Slide 4

Slide 4 text

Spark Caching Best Practices
Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… deciding which ones to cache may be challenging.
Source: https://unraveldata.com/to-cache-or-not-to-cache/

Slide 5

Slide 5 text

PageRank Example

var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
  rankGraph.persist()
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
    .persist()
  rankGraph.edges.foreachPartition(...)
  prevRankGraph.unpersist()
}
rankGraph.vertices.values.sum()

Slide 6

Slide 6 text

Transformations

var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
  rankGraph.persist()
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
    .persist()
  rankGraph.edges.foreachPartition(...)
  prevRankGraph.unpersist()
}
rankGraph.vertices.values.sum()

Transformations in this code: .outerJoinVertices(...).map(...), .aggregateMessages(...), .outerJoinVertices(rankUpdates)

Slide 7

Slide 7 text

Actions

var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
  rankGraph.persist()
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
    .persist()
  rankGraph.edges.foreachPartition(...)
  prevRankGraph.unpersist()
}
rankGraph.vertices.values.sum()

Actions in this code: rankGraph.edges.foreachPartition(...), rankGraph.vertices.values.sum()

Slide 8

Slide 8 text

PageRank RDDs
Some RDDs are used more than once

Slide 9

Slide 9 text

Spark Model Caching

var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
  rankGraph.persist()
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
    .persist()
  rankGraph.edges.foreachPartition(...)
  prevRankGraph.unpersist()
}
rankGraph.vertices.values.sum()

Caching-related calls in this code: rankGraph.persist(), .persist(), rankGraph.edges.foreachPartition(...), prevRankGraph.unpersist()

Slide 11

Slide 11 text

Caching
▸ Understanding caching is not easy
▸ Persisting is lazy, but unpersisting is not
▸ Difficult without deep Spark knowledge
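The asymmetry between persisting and unpersisting can be made concrete with a toy model. The class below is a hypothetical stand-in written for this illustration and does not use the Spark API: `persist()` only sets a flag, the data is stored when the first action (`collect()`) runs, and `unpersist()` drops the stored data right away.

```scala
// Toy model of a dataset with Spark-like caching semantics (not the Spark API).
final class ModelDataset[A](compute: () => Vector[A]) {
  private var marked = false                   // set by persist(), nothing stored yet
  private var cached: Option[Vector[A]] = None // filled by the first action
  private var runs = 0                         // how often compute() was executed

  // Lazy: only records the intent to cache.
  def persist(): Unit = marked = true

  // Eager: the cached data is dropped immediately.
  def unpersist(): Unit = { marked = false; cached = None }

  // An "action": uses the cache if present, otherwise recomputes.
  def collect(): Vector[A] = cached match {
    case Some(data) => data
    case None =>
      runs += 1
      val data = compute()
      if (marked) cached = Some(data)
      data
  }

  def isMaterialized: Boolean = cached.isDefined
  def computations: Int = runs
}
```

In this model, calling `persist()` and then inspecting `isMaterialized` yields `false` until `collect()` has run once, whereas `isMaterialized` is `false` again immediately after `unpersist()`, with no action needed. An `unpersist()` placed too early therefore discards data that a later action still needs.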

Slide 12

Slide 12 text

Unexpected spark caching behavior
“…The strange behavior that I'm seeing is that spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected that to happen only once because of the immediate caching on the next line, but that's not the case.…”
Source: StackOverflow https://stackoverflow.com/q/30835703/123695

Slide 13

Slide 13 text

ReSpark

var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
  rankGraph.persist()
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
    .persist()
  rankGraph.edges.foreachPartition(...)
  prevRankGraph.unpersist()
}
rankGraph.vertices.values.sum()

Slide 14

Slide 14 text

RDDs
▸ Each RDD maintains a lineage: the transformations needed to recompute it
▸ RDDs in a Spark program are identified by call site (location in program code)
▸ RDDs with the same call site can be expected to have similar behavior
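To see why a call site is a useful key for loop-generated RDDs, here is a minimal sketch of capturing one on the JVM. It assumes the call site is simply the file name and line number of the caller's stack frame; the helper `callSite()` and the object `CallSites` are hypothetical names for this illustration and are not how Spark or ReSpark record call sites.

```scala
object CallSites {
  // Hypothetical helper: returns "File.scala:line" for the code that called it.
  def callSite(): String = {
    val frames = new Throwable().getStackTrace
    // frames(0) is this method (where the Throwable was created);
    // frames(1) is the frame of whoever called callSite().
    val caller = frames(1)
    s"${caller.getFileName}:${caller.getLineNumber}"
  }

  // Collects the call site observed on each pass through a loop body.
  def sitesInLoop(iterations: Int): Vector[String] = {
    val builder = Vector.newBuilder[String]
    var i = 0
    while (i < iterations) {
      builder += callSite() // same source line on every iteration
      i += 1
    }
    builder.result()
  }

  // A call made from a different source line yields a different call site.
  def elsewhere(): String = callSite()
}
```

Every iteration of the loop reports the same call site, which is what lets behavior observed in one iteration serve as a prediction for the datasets created in the next.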

Slide 15

Slide 15 text

ReSpark

var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 0
while (iteration < numIter) {
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
}
rankGraph.vertices.values.sum()

rankGraph: 0

Slide 16

Slide 16 text

Slide 17

Slide 17 text

ReSpark

var rankGraph = graph.outerJoinVertices(...).map(...)
var iteration = 1
while (iteration < numIter) {
  val rankUpdates = rankGraph.aggregateMessages(...)
  prevRankGraph = rankGraph
  rankGraph = rankGraph.outerJoinVertices(rankUpdates)
}
rankGraph.vertices.values.sum()

rankGraph: 0
Unpersist!

Slide 18

Slide 18 text

ReSpark
▸ The use of each RDD by transformations and actions is analyzed for each call site
▸ RDDs created at a call site are persisted if more than one use is expected
▸ RDDs are automatically unpersisted once they have been used the expected number of times
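The three rules above can be sketched as a small use tracker. This is a simplified model under stated assumptions, not the ReSpark implementation: `Dataset` and `UseTracker` are hypothetical names, datasets are reduced to an id plus a call site, and the expected use count for a call site is assumed to be the largest count observed so far for any dataset created there.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an RDD: an id plus the call site that created it.
final case class Dataset(id: Int, callSite: String)

final class UseTracker {
  // Assumption: expected uses per call site = largest use count seen so far.
  private val expected = mutable.Map.empty[String, Int].withDefaultValue(0)
  private val observed = mutable.Map.empty[Int, Int].withDefaultValue(0)
  private val persisted = mutable.Set.empty[Int]
  private var nextId = 0

  // A new dataset is persisted if its call site is expected to be used more than once.
  def create(callSite: String): Dataset = {
    val ds = Dataset(nextId, callSite)
    nextId += 1
    if (expected(callSite) > 1) persisted += ds.id
    ds
  }

  // Records one use (a transformation or action reading the dataset).
  def use(ds: Dataset): Unit = {
    val n = observed(ds.id) + 1
    observed(ds.id) = n
    val exp = expected(ds.callSite)
    if (persisted(ds.id) && n >= exp) persisted -= ds.id // all expected uses seen
    if (n > exp) expected(ds.callSite) = n               // learn for later datasets
  }

  def isPersisted(ds: Dataset): Boolean = persisted(ds.id)
  def expectedUses(callSite: String): Int = expected(callSite)
}
```

In a loop like the PageRank example, the dataset from the first iteration is not persisted because nothing is known about its call site yet. Once two uses have been observed, the dataset created at the same call site in the next iteration is persisted on creation and released after its second use.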

Slide 19

Slide 19 text

PageRank on ReSpark
Without any caching, many jobs take hours!

Slide 20

Slide 20 text

Evaluation
▸ Tested with multiple applications in the SparkBench benchmarking suite
▸ We remove manual cache annotations and instead run with ReSpark
▸ Added overhead ranges from <2% to ~16%
▸ No manual annotation required!

Slide 21

Slide 21 text

Future Work
▸ Improved cache replacement policies
▸ Adaptive caching based on information available at runtime
▸ Static analysis for information about future transformations

Slide 22

Slide 22 text

Questions?