ReSpark: Automatic Caching for Iterative Applications in Apache Spark

Apache Spark is a distributed computing framework used for big data processing. A common pattern in many Spark applications is to iteratively evolve a dataset until reaching some user-specified convergence condition. Unfortunately, some aspects of Spark’s execution model make it difficult for developers who are not familiar with the implementation-level details of Spark to write efficient iterative programs. Since results are constructed iteratively and results from previous iterations may be used multiple times, effective use of caching is necessary to avoid recomputing intermediate results. Currently, developers of Spark applications must manually indicate which intermediate results should be cached. We present a method for using metadata already captured by Spark to automate caching decisions for many Spark programs. We show how this allows Spark applications to benefit from caching without the need for manual caching annotations.

Michael Mior

December 11, 2020

Transcript

  1. ReSpark: Automatic Caching for Iterative Applications in Apache Spark
     Michael Mior • Rochester Institute of Technology
     Kenneth Salem • University of Waterloo
  2. Apache Spark
     ▸ Framework for large-scale distributed data processing
     ▸ Widely used for analytics tasks
     ▸ Contains algorithms for graph processing, machine learning, etc.
  3. Apache Spark Model
     ▸ Series of lazy transformations, followed by actions that force evaluation of all preceding transformations
     ▸ Each step produces a resilient distributed dataset (RDD)
     ▸ Intermediate results can be cached in memory or on disk, optionally serialized
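
     A minimal sketch of this model, assuming an existing SparkContext sc (the input path and transformations are illustrative, not from the talk):

       import org.apache.spark.storage.StorageLevel

       val lines  = sc.textFile("hdfs://.../input.txt")        // transformations are lazy...
       val words  = lines.flatMap(_.split(" "))
       val counts = words.map(w => (w, 1)).reduceByKey(_ + _)  // ...so nothing has executed yet

       counts.persist(StorageLevel.MEMORY_AND_DISK_SER)        // cache in memory/on disk, serialized
       val total = counts.count()                               // action: forces evaluation, fills the cache
       val multi = counts.filter { case (_, n) => n > 1 }.count() // second action: reuses the cached RDD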
  4. Spark Caching Best Practices
     “Caching is very useful for applications that re-use an RDD multiple times. Caching all of the generated RDDs is not a good strategy… …deciding which ones to cache may be challenging.”
     Source: https://unraveldata.com/to-cache-or-not-to-cache/
  5. PageRank Example
     var rankGraph = graph.outerJoinVertices(...).map(...)
     var iteration = 0
     while (iteration < numIter) {
       rankGraph.persist()
       val rankUpdates = rankGraph.aggregateMessages(...)
       prevRankGraph = rankGraph
       rankGraph = rankGraph.outerJoinVertices(rankUpdates).persist()
       rankGraph.edges.foreachPartition(...)
       prevRankGraph.unpersist()
       iteration += 1
     }
     rankGraph.vertices.values.sum()
  6. Transformations
     The same PageRank code, with the lazy transformations highlighted:
     .outerJoinVertices(...).map(...)
     .aggregateMessages(...)
     .outerJoinVertices(rankUpdates)
  7. Actions
     The same PageRank code, with the actions that force evaluation highlighted:
     rankGraph.edges.foreachPartition(...)
     rankGraph.vertices.values.sum()
  8. Spark Model Caching
     The same PageRank code, with the manual caching calls highlighted:
     rankGraph.persist()
     .persist()
     rankGraph.edges.foreachPartition(...)  (action run to materialize the cached RDD)
     prevRankGraph.unpersist()
  10. Caching
     ▸ Understanding caching behavior is not easy
     ▸ Persisting is lazy, but unpersisting is not (see the sketch below)
     ▸ Getting it right is difficult without deep Spark knowledge
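
     A minimal sketch of this asymmetry, assuming an existing SparkContext sc (the RDD and its transformation are illustrative):

       val data   = sc.parallelize(1 to 1000000)
       val result = data.map(_ * 2)  // stand-in for an expensive transformation

       result.persist()              // lazy: no job runs and nothing is cached yet
       result.count()                // first action: computes result and fills the cache
       result.reduce(_ + _)          // second action: served from the cache

       result.unpersist()            // eager: takes effect immediately, not at the next action
       result.count()                // recomputed from lineage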
  11. Unexpected spark caching behavior
     “…The strange behavior that I'm seeing is that spark stages corresponding to val c = a.map(...) are happening 10 times. I would have expected that to happen only once because of the immediate caching on the next line, but that's not the case.…”
     Source: StackOverflow, https://stackoverflow.com/q/30835703/123695
  12. ReSpark
     The manually annotated PageRank code that ReSpark makes unnecessary:
     var rankGraph = graph.outerJoinVertices(...).map(...)
     var iteration = 0
     while (iteration < numIter) {
       rankGraph.persist()
       val rankUpdates = rankGraph.aggregateMessages(...)
       prevRankGraph = rankGraph
       rankGraph = rankGraph.outerJoinVertices(rankUpdates).persist()
       rankGraph.edges.foreachPartition(...)
       prevRankGraph.unpersist()
       iteration += 1
     }
     rankGraph.vertices.values.sum()
  13. RDDs
     ▸ Each RDD maintains a lineage: the transformations needed to recompute it
     ▸ RDDs in a Spark program are identified by their call site (the location in the program code where they are created)
     ▸ RDDs with the same call site can be expected to have similar behavior (see the sketch below)
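
     A small sketch of the call-site idea, assuming an existing SparkContext sc (the file name and line number are invented for illustration):

       var rdd = sc.parallelize(1 to 100)
       for (i <- 0 until 10) {
         // All 10 RDDs created by the next line share a single call site,
         // e.g. "map at Example.scala:5", so they can be expected to be
         // used in the same way on every iteration.
         rdd = rdd.map(_ + 1)
       }
       println(rdd.toDebugString)  // prints the lineage of the final RDD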
  14. ReSpark
     The same PageRank code with the manual caching calls removed; ReSpark counts the expected uses of each RDD per call site:
     var rankGraph = graph.outerJoinVertices(...).map(...)
     var iteration = 0
     while (iteration < numIter) {
       val rankUpdates = rankGraph.aggregateMessages(...)
       prevRankGraph = rankGraph
       rankGraph = rankGraph.outerJoinVertices(rankUpdates)
       iteration += 1
     }
     rankGraph.vertices.values.sum()
     rankGraph: 0
  15. ReSpark
     The same code, after rankGraph has been used twice at this call site:
     rankGraph: 2 → Persist!
  16. ReSpark
     The same code in the next iteration (iteration = 1), once the expected uses have been consumed:
     rankGraph: 0 → Unpersist!
  17. ReSpark
     ▸ For each call site, ReSpark analyzes how the RDDs created there are used by later transformations and actions
     ▸ RDDs created at a call site are persisted if more than one use is expected
     ▸ RDDs are automatically unpersisted once they have been used the expected number of times (a sketch of this policy follows)
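
     A minimal sketch of one way such a policy could be expressed; the class, method names, and counting scheme are assumptions for illustration, not ReSpark's actual implementation:

       import scala.collection.mutable

       // Hypothetical advisor: persists an RDD when its call site's history
       // predicts more than one use, and unpersists it once those uses occur.
       class CacheAdvisor {
         // Uses counted for the most recent RDD created at each call site
         private val observedUses = mutable.Map.empty[String, Int].withDefaultValue(0)
         // Outstanding uses expected for the RDD currently live at each call site
         private val remaining = mutable.Map.empty[String, Int]

         /** A new RDD was created at `site`; returns true if it should be persisted. */
         def onCreate(site: String): Boolean = {
           val expected = observedUses(site)  // assume it behaves like its predecessor
           remaining(site) = expected
           observedUses(site) = 0             // start counting uses of the new RDD
           expected > 1                       // persist only if more than one use is expected
         }

         /** The RDD from `site` was used; returns true if it can now be unpersisted. */
         def onUse(site: String): Boolean = {
           observedUses(site) += 1
           val left = remaining.getOrElse(site, 0) - 1
           remaining(site) = left
           left == 0                          // used the expected number of times
         }
       }

     On the PageRank walkthrough above, the first rankGraph sees two uses, so the RDD created at the same call site in the next iteration would be persisted on creation and unpersisted after its second use.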
  18. Evaluation
     ▸ Tested with multiple applications from the SparkBench benchmarking suite
     ▸ Manual cache annotations are removed and the applications run with ReSpark instead
     ▸ Added overhead ranges from under 2% to roughly 16%
     ▸ No manual annotations required!
  19. Future Work
     ▸ Improved cache replacement policies
     ▸ Adaptive caching based on information available at runtime
     ▸ Static analysis to gather information about future transformations