Clash of the Lambdas

This is the slide deck of my talk at ICOOOLPS14.

http://arxiv.org/abs/1406.6631
http://biboudis.github.io/clashofthelambdas/

Aggelos Biboudis

July 28, 2014

Transcript

  1. Clash of the Lambdas Through the Lens of Streaming APIs

    Aggelos Biboudis (1), Nick Palladinos (2), Yannis Smaragdakis (1)
    (1) University of Athens
    (2) Nessos Information Technologies, SA
    9th ICOOOLPS: Implementation, Compilation, Optimization of OO Languages, Programs and Systems Workshop, 2014
    July 28th, 2014
    A. Biboudis et al. Clash of the Lambdas ICOOOLPS’14 1 / 27
  7. What do we want to know?

    Clash of the Lambdas = lambda translation + standard stream APIs
    Benchmarks in Scala, Java, C#, F#.
    Bird’s eye view of the lambda translation techniques.
    Simple pipelines.
    Sequential and Parallel, Windows and Linux.
    Optimizing Frameworks: declarative queries → loop-based imperative code.
  8. In This Talk

    Implementation Techniques for Lambdas and Streaming: Scala, Java, C#/F#
    Optimizing Frameworks: ScalaBlitz, LinqOptimizer
    Microbenchmarking: Methodology, Tools, Setup, Benchmarks
    Results
  9. Scala: Translation scheme of lambdas

    lambda: a class that extends scala.runtime.AbstractFunction[0-22] is generated.
    lambda w/ free variables: the generated class includes private member fields that get initialized at instantiation time.
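
The class-per-lambda scheme above can be imitated by hand. The following is a minimal Java sketch, not the code scalac emits: the class name ScaleBy and the helper ClosureDemo are hypothetical, and Java's DoubleUnaryOperator stands in for scala.runtime.AbstractFunction1. The captured free variable becomes a private field that is set once, in the constructor.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hand-written analogue of a compiler-generated closure class.
// ScaleBy is a hypothetical name used only for illustration.
final class ScaleBy implements DoubleUnaryOperator {
    // The free variable of the lambda "d => d * factor" becomes a field.
    private final double factor;

    // The field is initialized when the closure object is instantiated.
    ScaleBy(double factor) {
        this.factor = factor;
    }

    @Override
    public double applyAsDouble(double d) {
        return d * factor;
    }
}

final class ClosureDemo {
    // Applies the closure object to every element of v.
    static double[] scale(double[] v, double factor) {
        DoubleUnaryOperator f = new ScaleBy(factor);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = f.applyAsDouble(v[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(scale(new double[] {1, 2, 3}, 2.0)));
    }
}
```

A lambda without free variables would need no fields at all, which is why the two cases are listed separately on the slide.
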
  10. Scala: Streaming API (Views)

    Scala Views enable lazy transformations on collections.

        def sumOfSquares(v: Array[Double]): Double = {
          val sum: Double = v
            .view
            .map(d => d * d)
            .sum
          sum
        }

    In 2.11, views over parallel collections were deprecated.
  11. Scala: Streaming API (Views)

    Views for Iterable collections are defined by re-interpreting the iterator method.
    e.g., a map operation costs 3 virtual calls (next, hasNext, f) per element returned by its iterator.
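
To make the three calls concrete, here is a minimal Java sketch of a pull-style map iterator. It is an illustration of the technique only, not the Scala library's implementation; MapIterator and LazyMapDemo are hypothetical names.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

// A lazy "map" expressed by wrapping another iterator.
final class MapIterator<T, R> implements Iterator<R> {
    private final Iterator<T> inner;
    private final Function<? super T, ? extends R> f;

    MapIterator(Iterator<T> inner, Function<? super T, ? extends R> f) {
        this.inner = inner;
        this.f = f;
    }

    @Override
    public boolean hasNext() {
        return inner.hasNext();          // virtual call: hasNext
    }

    @Override
    public R next() {
        return f.apply(inner.next());    // virtual calls: next and f
    }
}

final class LazyMapDemo {
    // Nothing is computed until the consumer pulls elements.
    static double sumOfSquares(Iterable<Double> v) {
        Iterator<Double> it = new MapIterator<Double, Double>(v.iterator(), d -> d * d);
        double sum = 0;
        while (it.hasNext()) {
            sum += it.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(Arrays.asList(1.0, 2.0, 3.0)));
    }
}
```

Each additional lazy combinator adds another wrapper of this kind, so the per-element call count grows with the length of the pipeline. The sketch also boxes every element as a Double, which is a separate cost from the virtual calls.
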
  12. Java: Lambdas

    The new API:

        public double sumOfSquares(double[] v) {
          double sum = DoubleStream.of(v)
            .map(d -> d * d)
            .sum();
          return sum;
        }

    A lambda can be used anywhere a functional interface is needed.
    Like inner and anonymous classes, lambda expressions can capture variables.
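
As a small illustration of the two statements above, the sketch below defines its own functional interface and passes it a capturing lambda and a method reference. The names Transform and LambdaDemo are hypothetical and are not part of the JDK or of the talk's code.

```java
// Any interface with a single abstract method can be the target of a lambda.
@FunctionalInterface
interface Transform {
    double apply(double x);
}

final class LambdaDemo {
    // Accepts any Transform: a lambda, a method reference, or a class instance.
    static double sumWith(double[] v, Transform t) {
        double sum = 0;
        for (double d : v) {
            sum += t.apply(d);
        }
        return sum;
    }

    // 'scale' is effectively final, so the lambda is allowed to capture it.
    static double sumOfScaledSquares(double[] v, double scale) {
        return sumWith(v, d -> scale * d * d);
    }

    // A non-capturing method reference works in the same position.
    static double sumOfAbs(double[] v) {
        return sumWith(v, Math::abs);
    }

    public static void main(String[] args) {
        System.out.println(sumOfScaledSquares(new double[] {1, 2, 3}, 2.0));
        System.out.println(sumOfAbs(new double[] {-1, 2, -3}));
    }
}
```
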
  13. Java: Translation based on invokedynamic

    The compiler emits an invokedynamic call site that refers to a recipe for building the lambda, instead of generating a class; linking the call site is a 1-time cost.
    Class-generation at compile-time is avoided.
    Fewer classes for class-loading.
    Favors inlining optimizations (e.g. in non-capturing lambdas we can get constant-loads).
  14. Java: Streaming API

    The main philosophy is: Stream & a bitset of Characteristics

        source/generator |> lazy |> lazy |> eager/reduce

    Intermediate operators are gathered CPS-style.
    A compact data structure of transformations is applied at the source.
    Flow of characteristics can be used for optimizations (e.g. sorting an already sorted stream).
    A bulk operation can be optimized (e.g. a do...while loop).
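
The characteristics bitset is visible through the public Spliterator API. The sketch below only inspects the flags that two standard collections report at the source; it does not show how the stream pipeline propagates or exploits them internally. CharacteristicsDemo is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Spliterator;
import java.util.TreeSet;

final class CharacteristicsDemo {
    // True if the source advertises that its elements come in sorted order.
    static boolean reportsSorted(Collection<?> c) {
        return c.spliterator().hasCharacteristics(Spliterator.SORTED);
    }

    // True if the source advertises an exact element count up front.
    static boolean reportsSized(Collection<?> c) {
        return c.spliterator().hasCharacteristics(Spliterator.SIZED);
    }

    public static void main(String[] args) {
        Collection<Integer> list = new ArrayList<>(Arrays.asList(3, 1, 2));
        Collection<Integer> tree = new TreeSet<>(list);
        System.out.println("ArrayList sorted? " + reportsSorted(list));
        System.out.println("TreeSet sorted?   " + reportsSorted(tree));
        System.out.println("ArrayList sized?  " + reportsSized(list));
    }
}
```

A pipeline that starts from a source already flagged as SORTED has the information it needs to treat a later sort as redundant, which is the kind of optimization the slide alludes to.
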
  15. C# and F#: Translation scheme of lambdas

    C# lambdas are always assigned to delegates.
    If they capture free variables, these are fields in a compiler-generated type (otherwise just static methods).
    F# lambdas are represented as compiler-generated classes that inherit FSharpFunc<T,R>.
  16. C# and F#

    LINQ was introduced in C# 3.0 as fluent-style method calls:

        nums.Where(x => x % 2 == 0).Select(x => x * x).Sum();

    with the equivalent query comprehension syntactic sugar:

        (from x in nums where x % 2 == 0 select x * x).Sum();

    F# is inspired by OCaml and is a first-class citizen of .NET.
    The Seq module is the streaming API of F#:

        nums |> Seq.filter (fun x -> x % 2 = 0) |> Seq.map (fun x -> x * x) |> Seq.sum
  17. C# and F#: based on IEnumerable/IEnumerator

    IEnumerable is a factory for IEnumerator objects.
    IEnumerator keeps the state of the iteration (Current & MoveNext).
    e.g., a .Select(func) combinator:
      Returns a SelectEnumerable object encapsulating the inner source.
      We get a SelectEnumerator that passes inner’s Current to func.
      3 virtual calls (MoveNext, Current, func) per element per iterator.
  18. Optimizing Frameworks

    Many ideas come from the DB world on query optimization.
    Various versions of deforestation (Wadler, 88).
    Steno (Murray, 11).
  19. ScalaBlitz Optimizations

    Box elimination
    Lambda inlining
    Basic support for loop fusion
    Work-stealing tree scheduling
    Operates at compile time.
    Full-fledged AST is available for further optimizations.
  20. ScalaBlitz: Example

    Enclose in an optimize block for sequential:

        def sumOfSquaresSeqOptimized(v: Array[Double]): Double = {
          optimize {
            val sum: Double = v
              .map(d => d * d)
              .sum
            sum
          }
        }

    Two imports & the .toPar combinator for parallel.
  21. LinqOptimizer Optimizations

    Lambda inlining
    Loop fusion
    Nested loop generation
    Tuple elimination
    Operates at runtime.
    Transforms LINQ expression trees.
  22. LinqOptimizer: Example

    An example with nested structure:

        var sum = (from num in nums.AsQueryExpr()
                   from _num in _nums
                   where num % 2 == 0
                   select num * _num).Sum().Compile();

    Effectively optimizes to:

        int sum = 0;
        for (int index = 0; index < nums.Length; index++) {
          for (int _index = 0; _index < _nums.Length; _index++) {
            int num = nums[index];
            int _num = _nums[_index];
            if (num % 2 == 0)
              sum += num * _num;
          }
        }
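
The same declarative-to-loop correspondence can be written out in Java. This is a hand-made sketch of what loop fusion produces, not output from LinqOptimizer or any other framework; FusionDemo and its method names are hypothetical. The stream version filters before the nested traversal rather than after it, which yields the same sum.

```java
import java.util.stream.IntStream;

final class FusionDemo {
    // Declarative form: a filter, a nested traversal, and a reduction.
    static int declarative(int[] nums, int[] others) {
        return IntStream.of(nums)
                .filter(num -> num % 2 == 0)
                .flatMap(num -> IntStream.of(others).map(other -> num * other))
                .sum();
    }

    // Fused form: one nested loop, no intermediate streams or lambdas.
    static int fused(int[] nums, int[] others) {
        int sum = 0;
        for (int i = 0; i < nums.length; i++) {
            for (int j = 0; j < others.length; j++) {
                int num = nums[i];
                int other = others[j];
                if (num % 2 == 0) {
                    sum += num * other;
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] nums = {1, 2, 3, 4};
        int[] others = {10, 20};
        System.out.println(declarative(nums, others) + " == " + fused(nums, others));
    }
}
```
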
  23. Tools and Reproducibility

    Java Microbenchmarking Harness [1] for Java and Scala.
    A custom tool, named LambdaMicrobenchmarking [2], that follows microbenchmarking practices from P. Sestoft’s "Microbenchmarks in Java and C#" and uses the same method for reporting standard deviation and error as JMH.
    Our project is on GitHub [3].

    [1] http://openjdk.java.net/projects/code-tools/jmh
    [2] https://github.com/biboudis/LambdaMicrobenchmarking
    [3] http://biboudis.github.io/clashofthelambdas
  24. Setup

    3GB fixed heap size for JVM.
    10 measurement iterations and 10 warm-up iterations, on a forked VM for each benchmark.
    GC collections forced.
    Compiler optimizations on.
    TieredCompilation off (see Appendix for a fun-with-flags section).

                    Windows                  Ubuntu Linux
    Version         8.1                      13.10/3.11.0-24
    Architecture    x64                      x64
    CPU             Intel Core i5-3360M vPro 2.8GHz
    Cores           2 physical x 2 logical
    Memory          4GB

                    Windows                  Ubuntu Linux
    Java            Java 8 (b132)/JVM 1.8
    Scala           2.10.4/JVM 1.8
    C#              C#5/CLR v4.0             C# mono 3.4.0.0/mono 3.4.0
    F#              F#3.1/CLR v4.0           F# open-source 3.0/mono 3.4.0
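
The warm-up-then-measure routine described above can be sketched in a few lines of Java. This is an assumption-laden illustration and is neither JMH nor LambdaMicrobenchmarking: MiniHarness is a hypothetical name, it does not fork a VM, System.gc() is only a hint to the runtime, and the sample standard deviation used here is an assumed choice of statistic.

```java
import java.util.function.LongSupplier;

final class MiniHarness {
    // Arithmetic mean of the samples.
    static double mean(double[] xs) {
        double total = 0;
        for (double x : xs) {
            total += x;
        }
        return total / xs.length;
    }

    // Sample standard deviation (divides by n - 1).
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    // Runs 'warmup' untimed iterations, then returns per-iteration times in ms.
    static double[] measure(LongSupplier body, int warmup, int iterations) {
        long sink = 0; // consume results so the work cannot be optimized away
        for (int i = 0; i < warmup; i++) {
            sink += body.getAsLong();
        }
        double[] millis = new double[iterations];
        for (int i = 0; i < iterations; i++) {
            System.gc(); // request a collection before each timed run
            long start = System.nanoTime();
            sink += body.getAsLong();
            millis[i] = (System.nanoTime() - start) / 1e6;
        }
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink);
        }
        return millis;
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        double[] t = measure(() -> {
            long s = 0;
            for (long d : data) s += d;
            return s;
        }, 10, 10);
        System.out.printf("mean %.3f ms, sd %.3f ms%n", mean(t), stdDev(t));
    }
}
```
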
  25. Benchmarks

    Tests
      baseline
      sum
      sumOfSquares
      sumOfSquaresEven
      cart
      ref

    Input data
      For N = 10,000,000 we used N long integers for all sum* tests.
      The cart test iterates over an outer array of 1,000,000 long integers and an inner one of 10.
      For ref we used N instances.

    Sequential and Parallel tests.
    Windows and Linux.
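
For orientation, the sum* tests could look roughly as follows with the Java 8 stream API. The method bodies are an assumed reading of the test names, not the benchmark suite's actual source; Pipelines is a hypothetical class name, and cart and ref are left out.

```java
import java.util.stream.LongStream;

final class Pipelines {
    // baseline: the hand-written loop the stream versions are compared against.
    static long sumOfSquaresEvenBaseline(long[] v) {
        long acc = 0;
        for (int i = 0; i < v.length; i++) {
            if (v[i] % 2 == 0) {
                acc += v[i] * v[i];
            }
        }
        return acc;
    }

    static long sum(long[] v) {
        return LongStream.of(v).sum();
    }

    static long sumOfSquares(long[] v) {
        return LongStream.of(v).map(d -> d * d).sum();
    }

    static long sumOfSquaresEven(long[] v) {
        return LongStream.of(v).filter(d -> d % 2 == 0).map(d -> d * d).sum();
    }

    // Parallel variant: same pipeline, executed on the common fork/join pool.
    static long sumOfSquaresEvenPar(long[] v) {
        return LongStream.of(v).parallel().filter(d -> d % 2 == 0).map(d -> d * d).sum();
    }

    public static void main(String[] args) {
        long[] v = LongStream.range(0, 1_000).toArray();
        System.out.println(sumOfSquaresEven(v) == sumOfSquaresEvenBaseline(v));
    }
}
```
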
  26. Results (1/3)

  27. Results (2/3)

  28. Results (3/3)
  29. Observations: Standard libraries

    Java exhibits the best performance.
    Java pays the cost for capturing lambdas with heap-allocated objects.
    The Scala stream API (views) pays the cost of boxing/unboxing, iterator and function object abstraction penalties.
    The Scala strict API performs significantly better (although intermediate collections are produced).
    F# achieved the best scaling, 2.6x-4.3x.
    Scala in the ref tests is directly comparable to other implementations (although some internal boxing effect was revealed).
    C# and F# scale poorly on mono/Linux.
  30. Observations: Optimizing frameworks

    ScalaBlitz improved Scala in virtually all cases. Notable is the 52x speedup for sum.
    ScalaBlitz shows a 5.7x speedup for cart, mainly due to the work-stealing iterator that is used.
    LinqOptimizer improved all cases of the C#/F# benchmarks.
  31. Future Work

    Measure the effect of more combinators.
    How are measurements affected as a function of the number of processors?
    Java can benefit from an optimizing framework.
    C#, F# and Scala Stream APIs can benefit from internal iteration.
  32. Thank You: QnA

    Play with our code on GitHub.
    Reproduce the results on your H/W and let us know!
    @biboudis, @NickPalladinos
  33. Standard Deviation

    Column order: Windows (Java, Scala-Views, Scala-Strict, C#, F#) | Linux (Java, Scala-Views, Scala-Strict, C#, F#)

    sumBaseline                 0.011 0.015 1.214 0.168 | 0.054 0.011 0.552 0.818
    sumSeq                      0.015 0.607 0.277 2.407 0.525 | 0.014 0.449 0.475 0.359 1.015
    sumSeqOpt                   0.010 0.536 0.212 | 0.022 0.248 0.730
    sumPar                      0.035 2.348 2.622 0.895 4.371 | 0.009 3.653 1.827 106.800 117.358
    sumParOpt                   0.017 0.075 0.196 | 0.026 1.400 2.010
    sumOfSquaresBaseline        0.008 0.016 0.129 0.202 | 0.023 0.013 0.799 1.072
    sumOfSquaresSeq             0.009 1.049 2.052 0.763 3.755 | 0.019 1.331 0.895 1.193 1.116
    sumOfSquaresSeqOpt          1.104 0.215 0.292 | 0.238 0.583 0.171
    sumOfSquaresPar             0.008 3.691 9.355 2.745 0.162 | 0.017 2.807 6.347 23.856 40.342
    sumOfSquaresParOpt          0.036 0.433 0.094 | 0.136 0.782 0.485
    sumOfSquaresEvenBaseline    0.044 0.085 0.204 0.393 | 0.059 0.035 0.906 1.270
    sumOfSquaresEvenSeq         0.121 1.157 1.510 3.789 4.838 | 0.096 1.159 1.042 0.895 1.680
    sumOfSquaresEvenSeqOpt      0.550 2.052 5.351 | 0.162 0.847 0.522
    sumOfSquaresEvenPar         0.025 5.184 8.207 5.943 2.556 | 0.027 4.905 16.252 46.739 21.465
    sumOfSquaresEvenParOpt      0.502 0.115 0.128 | 0.483 1.737 4.390
    cartBaseline                0.060 0.041 0.015 1.007 | 0.010 0.010 0.040 0.113
    cartSeq                     0.749 6.195 3.939 4.284 5.840 | 0.510 2.437 5.486 0.954 2.791
    cartSeqOpt                  0.666 0.148 0.232 | 0.763 0.751 0.307
    cartPar                     0.131 13.167 13.165 4.954 7.855 | 0.243 7.641 7.484 10.963 7.546
    cartParOpt                  2.694 0.904 1.371 | 2.642 1.810 1.310
    refBaseline                 0.069 0.259 0.159 0.360 | 0.152 0.288 1.740 1.566
    refSeq                      0.221 1.077 0.719 1.267 3.415 | 0.237 0.438 0.353 1.269 0.639
    refSeqOpt                   0.284 2.082 1.437 | 0.235 2.409 1.643
    refPar                      0.119 5.123 0.853 8.548 2.556 | 0.271 6.904 0.765 44.879 27.644
    refParOpt                   0.247 0.782 0.187 | 0.112 1.445 2.592
  34. Observations: Fun with flags

    What is the effect if we disable adaptive sizing of the heap?
      Scala parallel tests benefit, with 1.1x-2.9x speedups.
      However ... a 10%-15% performance degradation is caused in the majority of the sequential tests.
    Tiered compilation had a minor positive effect on the Scala sum tests.
      All other tests showed a 10% performance degradation.
      In Java, all sum* and cart tests showed performance degradation.
  35. Java: Streaming API

    Intermediate operators are gathered CPS-ish:

        <T, R> Stream<R> map(Stream<T> source, Function<T, R> mapper) {
          return new MapperStream<T, R>(source) {
            Consumer<T> wrap(Consumer<R> consumer) {
              return new Consumer<T>() {
                public void accept(T v) {
                  consumer.accept(mapper.apply(v));
                }
              };
            }
          };
        }

    A bulk operation can be optimized:

        do { consumer.accept(a[i]); } while (++i < hi);
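
A self-contained version of the consumer-wrapping idea is sketched below. It is a toy push-style stream written for illustration, not the JDK's implementation: PushStream and all of its methods are hypothetical names. Each map wraps the downstream consumer, and the source drives the whole chain with a single loop (internal iteration).

```java
import java.util.function.DoubleConsumer;
import java.util.function.DoubleUnaryOperator;

// A stream is represented by the one thing it can do:
// push all of its elements into a consumer.
interface PushStream {
    void forEach(DoubleConsumer sink);

    // Source: the only place where a loop over the data exists.
    static PushStream of(double[] a) {
        return sink -> {
            for (int i = 0; i < a.length; i++) {
                sink.accept(a[i]);
            }
        };
    }

    // map wraps the downstream consumer instead of producing elements itself.
    default PushStream map(DoubleUnaryOperator f) {
        return sink -> this.forEach(v -> sink.accept(f.applyAsDouble(v)));
    }

    // Terminal operation: installs the final consumer and triggers the source loop.
    default double sum() {
        double[] acc = {0};
        forEach(v -> acc[0] += v);
        return acc[0];
    }
}

final class PushStreamDemo {
    static double sumOfSquares(double[] v) {
        return PushStream.of(v).map(d -> d * d).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(new double[] {1, 2, 3}));
    }
}
```

Compared with a pull-style iterator chain, the consumer here is called once per element per stage and there is no hasNext/next pair to dispatch, which is the motivation the talk gives for internal iteration.
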