Slide 1

Slide 1 text

Clash of the Lambdas
Through the Lens of Streaming APIs
Aggelos Biboudis (1), Nick Palladinos (2), Yannis Smaragdakis (1)
(1) University of Athens
(2) Nessos Information Technologies, SA
9th ICOOOLPS: Implementation, Compilation, Optimization of OO Languages, Programs and Systems Workshop, 2014
July 28th, 2014
A. Biboudis et al. Clash of the Lambdas ICOOOLPS’14 1 / 27

Slide 7

Slide 7 text

What do we want to know?
Clash of the Lambdas = lambda translation + standard stream APIs
Benchmarks in Scala, Java, C#, F#.
Bird’s eye view of the lambda translation techniques.
Simple pipelines.
Sequential and Parallel, Windows and Linux.
Optimizing Frameworks: declarative queries → loop-based imperative code.

Slide 8

Slide 8 text

In This Talk
Implementation Techniques for Lambdas and Streaming
  Scala
  Java
  C#/F#
Optimizing Frameworks
  ScalaBlitz
  LinqOptimizer
Microbenchmarking Methodology
  Tools
  Setup
  Benchmarks
Results

Slide 9

Slide 9 text

Scala: Translation scheme of lambdas
Lambda: a class that extends scala.runtime.AbstractFunction[0-22] is generated.
Lambda with free variables: the generated class includes private member fields that get initialized at instantiation time.
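
The translation scheme above can be mimicked by hand. The following Java sketch is illustrative only: AbstractFunction1 here is a hypothetical stand-in for the Scala runtime class, and AddOffsetClosure and ClosureDemo are made-up names, not what scalac actually emits.

```java
// Hypothetical stand-in for scala.runtime.AbstractFunction1 (illustration only).
abstract class AbstractFunction1<A, R> {
    abstract R apply(A arg);
}

// What a compiler could generate for the lambda `x => x + offset`,
// where `offset` is a free variable of the enclosing scope.
final class AddOffsetClosure extends AbstractFunction1<Double, Double> {
    private final double offset;          // captured free variable

    AddOffsetClosure(double offset) {     // initialized at instantiation time
        this.offset = offset;
    }

    @Override
    Double apply(Double x) {
        return x + offset;                // boxed argument and boxed result
    }
}

final class ClosureDemo {
    // Each evaluation of the lambda expression allocates a closure object.
    static double applyWithOffset(double offset, double x) {
        AbstractFunction1<Double, Double> f = new AddOffsetClosure(offset);
        return f.apply(x);
    }
}
```

A lambda without free variables would need no fields, so a single shared instance could serve every use.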

Slide 10

Slide 10 text

Scala: Streaming API (Views)
Scala Views enable lazy transformations on collections.

    def sumOfSquares(v: Array[Double]): Double = {
      val sum: Double = v
        .view
        .map(d => d * d)
        .sum
      sum
    }

In 2.11, views over parallel collections were deprecated.

Slide 11

Slide 11 text

Scala: Streaming API (Views)
Views for Iterable collections are defined by re-interpreting the iterator method.
For example, a map operation costs 3 virtual calls (next, hasNext, f) for each element its iterator yields.
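
The per-element cost can be made concrete with a small Java sketch of a lazily mapped view. MappedView and MappedViewDemo are hypothetical names chosen for this illustration; this is not the Scala collections implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

// A lazily mapped view: nothing is computed until the iterator is consumed.
final class MappedView<A, B> implements Iterable<B> {
    private final Iterable<A> source;
    private final Function<A, B> f;

    MappedView(Iterable<A> source, Function<A, B> f) {
        this.source = source;
        this.f = f;
    }

    // The view is defined purely by re-interpreting iterator().
    @Override
    public Iterator<B> iterator() {
        final Iterator<A> inner = source.iterator();
        return new Iterator<B>() {
            @Override public boolean hasNext() { return inner.hasNext(); }   // virtual call: hasNext
            @Override public B next() { return f.apply(inner.next()); }      // virtual calls: next and f
        };
    }
}

final class MappedViewDemo {
    static double sumOfSquares(Iterable<Double> values) {
        double sum = 0.0;
        for (double sq : new MappedView<Double, Double>(values, d -> d * d)) {
            sum += sq;
        }
        return sum;
    }
}
```

Every additional combinator stacked on top adds another layer of the same three calls, and the generic signature forces boxing of primitive elements.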

Slide 12

Slide 12 text

Java: Lambdas
The new API:

    public double sumOfSquares(double[] v) {
        double sum = DoubleStream.of(v)
            .map(d -> d * d)
            .sum();
        return sum;
    }

A lambda can be used anywhere a Functional Interface is needed.
Like inner and anonymous classes, lambda expressions can capture variables.
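
As a minimal sketch of the capturing versus non-capturing distinction (CapturingLambdaDemo and sumScaled are names invented for this example):

```java
import java.util.function.DoubleUnaryOperator;
import java.util.stream.DoubleStream;

final class CapturingLambdaDemo {
    // `factor` is effectively final, so the lambda may capture it.
    // DoubleUnaryOperator is one of the standard functional interfaces.
    static double sumScaled(double[] v, double factor) {
        DoubleUnaryOperator scale = d -> d * factor;   // capturing lambda
        return DoubleStream.of(v).map(scale).sum();
    }

    // No free variables: a non-capturing lambda.
    static double sumOfSquares(double[] v) {
        return DoubleStream.of(v).map(d -> d * d).sum();
    }
}
```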

Slide 13

Slide 13 text

Java: Translation based on invokedynamic
invokedynamic refers to a recipe for constructing the lambda instead of generating its bytecode up front (linkage is a 1-time cost).
Class generation at compile time is avoided.
Fewer classes for class-loading.
Favors inlining optimizations (e.g. in non-capturing lambdas we can get constant-loads).

Slide 14

Slide 14 text

Java: Streaming API
The main philosophy is: Stream & a bitset of Characteristics

    source/generator |> lazy |> lazy |> eager/reduce

Intermediate operators are gathered CPS-style.
A compact data structure of transformations is applied at the source.
Flow of characteristics can be used for optimizations (e.g. sorting an already sorted stream can be skipped).
A bulk operation can be optimized (e.g. a do...while loop).
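
The lazy/eager split can be observed from user code. The sketch below counts how often the mapper runs before and after the terminal operation; LazyPipelineDemo and countMapperCalls are names made up for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.DoubleStream;

final class LazyPipelineDemo {
    // Returns {mapper calls before the terminal operation, mapper calls after it}.
    static int[] countMapperCalls(double[] v) {
        AtomicInteger calls = new AtomicInteger();
        DoubleStream pipeline = DoubleStream.of(v)                        // source
                .filter(d -> d > 0)                                       // lazy
                .map(d -> { calls.incrementAndGet(); return d * d; });    // lazy
        int before = calls.get();     // intermediate operators have only been recorded
        pipeline.sum();               // eager reduction pulls the data through
        int after = calls.get();
        return new int[] { before, after };
    }
}
```

For the input {1, -2, 3} the mapper has not run at all before sum() and has run twice afterwards, once per element that passes the filter.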

Slide 15

Slide 15 text

C# and F#: Translation scheme of lambdas
C# lambdas are always assigned to delegates.
If they capture free variables, these are fields in a compiler-generated type (otherwise just static methods).
F# lambdas are represented as compiler-generated classes that inherit FSharpFunc.

Slide 16

Slide 16 text

C# and F#
LINQ was introduced in C# 3.0 as fluent-style method calls:

    nums.Where(x => x % 2 == 0).Select(x => x * x).Sum();

with the equivalent query comprehension syntactic sugar:

    (from x in nums
     where x % 2 == 0
     select x * x).Sum();

F# is inspired by OCaml and is a first-class citizen of .NET.
The Seq module is the Streaming API of F#:

    nums
    |> Seq.filter (fun x -> x % 2 = 0)
    |> Seq.map (fun x -> x * x)
    |> Seq.sum

Slide 17

Slide 17 text

C# and F#: based on IEnumerable/IEnumerator
IEnumerable is a factory for IEnumerator objects.
IEnumerator keeps the state of the iteration (Current & MoveNext).
e.g., a .Select(func) combinator:
  Returns a SelectEnumerable object encapsulating the inner source.
  We get a SelectEnumerator that passes inner’s Current to func.
  3 virtual calls (MoveNext, Current, func) per element per iterator.
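
The factory/state split described above can be sketched in Java. The Enumerable and Enumerator interfaces below are hypothetical renderings of the .NET interfaces, written only to show the shape of a Select combinator; they are not the actual .NET implementation.

```java
import java.util.function.Function;

// Hypothetical Java renderings of .NET's IEnumerator / IEnumerable.
interface Enumerator<T> {
    boolean moveNext();   // advance; false when exhausted
    T current();          // element at the current position
}

interface Enumerable<T> {
    Enumerator<T> getEnumerator();   // factory for fresh enumerators
}

final class ArrayEnumerable<T> implements Enumerable<T> {
    private final T[] items;
    ArrayEnumerable(T[] items) { this.items = items; }

    @Override public Enumerator<T> getEnumerator() {
        return new Enumerator<T>() {
            private int index = -1;   // iteration state lives in the enumerator
            @Override public boolean moveNext() { return ++index < items.length; }
            @Override public T current() { return items[index]; }
        };
    }
}

// Encapsulates the inner source; its enumerator passes inner's current to func.
final class SelectEnumerable<A, B> implements Enumerable<B> {
    private final Enumerable<A> inner;
    private final Function<A, B> func;

    SelectEnumerable(Enumerable<A> inner, Function<A, B> func) {
        this.inner = inner;
        this.func = func;
    }

    @Override public Enumerator<B> getEnumerator() {
        final Enumerator<A> e = inner.getEnumerator();
        return new Enumerator<B>() {
            @Override public boolean moveNext() { return e.moveNext(); }       // virtual call 1
            @Override public B current() { return func.apply(e.current()); }   // virtual calls 2 and 3
        };
    }
}

final class SelectDemo {
    static int sumOfSquares(Integer[] nums) {
        Enumerator<Integer> e =
            new SelectEnumerable<Integer, Integer>(new ArrayEnumerable<Integer>(nums), x -> x * x)
                .getEnumerator();
        int sum = 0;
        while (e.moveNext()) sum += e.current();
        return sum;
    }
}
```

Because the consumer pulls one element at a time through the whole chain, the runtime cannot easily collapse the chain into a single loop.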

Slide 18

Slide 18 text

Optimizing Frameworks
Many ideas come from the DB world on query optimization.
Various versions of deforestation (Wadler, 88).
Steno (Murray, 11).

Slide 19

Slide 19 text

ScalaBlitz
Optimizations:
  Box elimination
  Lambda inlining
  Basic support for loop fusion
  Work-stealing tree scheduling
Operates at compile time.
Full-fledged AST is available for further optimizations.

Slide 20

Slide 20 text

ScalaBlitz: Example
Enclose the pipeline in an optimize block for the sequential case:

    def sumOfSquaresSeqOptimized(v: Array[Double]): Double = {
      optimize {
        val sum: Double = v
          .map(d => d * d)
          .sum
        sum
      }
    }

For the parallel case: two imports and the .toPar combinator.
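
To illustrate the kind of code that box elimination, lambda inlining, and loop fusion aim for, here is a Java sketch that puts a declarative pipeline next to a hand-fused loop. It is an assumption about the general shape of the result, not the output of ScalaBlitz; FusionDemo and its method names are invented for this example.

```java
import java.util.stream.DoubleStream;

final class FusionDemo {
    // Declarative pipeline: the mapper is invoked through a lambda per element.
    static double sumOfSquaresPipeline(double[] v) {
        return DoubleStream.of(v).map(d -> d * d).sum();
    }

    // Hand-fused equivalent: lambda body inlined into a single loop,
    // no intermediate collection, primitives never boxed.
    static double sumOfSquaresFused(double[] v) {
        double sum = 0.0;
        for (int i = 0; i < v.length; i++) {
            double d = v[i];
            sum += d * d;
        }
        return sum;
    }
}
```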

Slide 21

Slide 21 text

LinqOptimizer
Optimizations:
  Lambda inlining
  Loop fusion
  Nested loop generation
  Tuple elimination
Operates at runtime.
Transforms LINQ expression trees.

Slide 22

Slide 22 text

LinqOptimizer: Example
An example with nested structure:

    var sum = (from num in nums.AsQueryExpr()
               from _num in _nums
               where num % 2 == 0
               select num * _num).Sum().Compile();

Effectively optimizes to:

    int sum = 0;
    for (int index = 0; index < nums.Length; index++)
    {
        for (int _index = 0; _index < _nums.Length; _index++)
        {
            int num = nums[index];
            int _num = _nums[_index];
            if (num % 2 == 0)
                sum += num * _num;
        }
    }

Slide 23

Slide 23 text

Tools and Reproducibility
Java Microbenchmark Harness (JMH) [1] for Java and Scala.
A custom tool, named LambdaMicrobenchmarking [2], that follows microbenchmarking practices from P. Sestoft’s Microbenchmarks in Java and C# and uses the same method for reporting standard deviation and error as JMH.
Our project is on github [3].
[1] http://openjdk.java.net/projects/code-tools/jmh
[2] https://github.com/biboudis/LambdaMicrobenchmarking
[3] http://biboudis.github.io/clashofthelambdas

Slide 24

Slide 24 text

Setup
3GB fixed heap size for the JVM.
10 measurement iterations and 10 warmup iterations on a forked VM for each benchmark.
Garbage collections forced.
Compiler optimizations on.
TieredCompilation off (see Appendix for a fun-with-flags section).

                Windows                  Ubuntu Linux
Version         8.1                      13.10/3.11.0-24
Architecture    x64                      x64
CPU             Intel Core i5-3360M vPro 2.8GHz (both)
Cores           2 physical x 2 logical (both)
Memory          4GB (both)

                Windows                  Ubuntu Linux
Java            Java 8 (b132)/JVM 1.8 (both)
Scala           2.10.4/JVM 1.8 (both)
C#              C#5/CLR v4.0             C# mono 3.4.0.0/mono 3.4.0
F#              F#3.1/CLR v4.0           F# open-source 3.0/mono 3.4.0

Slide 25

Slide 25 text

Benchmarks
Tests:
  baseline
  sum
  sumOfSquares
  sumOfSquaresEven
  cart
  ref
Input data:
  For N = 10,000,000 we used N long integers for all sum* tests.
  The cart test iterates over an outer array of 1,000,000 long integers and an inner one of 10.
  For ref we used N instances.
Sequential and Parallel tests. Windows and Linux.
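
The slide names the tests without spelling out the pipelines. The Java sketch below is one plausible reading inferred from the test names and input description (sum, square then sum, filter evens then square then sum, and a Cartesian product that sums the products); the exact definitions used in the benchmark suite are an assumption here, and ref is omitted because its shape is not stated. BenchmarkPipelines is a name invented for this example.

```java
import java.util.stream.LongStream;

final class BenchmarkPipelines {
    static long sum(long[] v) {
        return LongStream.of(v).sum();
    }

    static long sumOfSquares(long[] v) {
        return LongStream.of(v).map(d -> d * d).sum();
    }

    static long sumOfSquaresEven(long[] v) {
        return LongStream.of(v)
                .filter(d -> d % 2 == 0)
                .map(d -> d * d)
                .sum();
    }

    // Cartesian product of an outer and an inner array, summing the products.
    static long cart(long[] outer, long[] inner) {
        return LongStream.of(outer)
                .flatMap(d -> LongStream.of(inner).map(dP -> dP * d))
                .sum();
    }
}
```

Appending .parallel() after the source would give the parallel variant of each pipeline under the same assumption.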

Slide 26

Slide 26 text

Results (1/3)

Slide 27

Slide 27 text

Results (2/3)

Slide 28

Slide 28 text

Results (3/3)

Slide 29

Slide 29 text

Observations: Standard libraries
Java exhibits the best performance.
Java pays the cost for capturing lambdas with heap allocated objects.
The Scala stream API (views) pays the cost of boxing/unboxing, plus iterator and function object abstraction penalties.
The Scala strict API performs significantly better (although intermediate collections are produced).
F# achieved the best scaling, 2.6x-4.3x.
Scala in the ref tests is directly comparable to other implementations (although some internal boxing effect was revealed).
C# and F# scale poorly on mono/Linux.

Slide 30

Slide 30 text

Observations: Optimizing frameworks
ScalaBlitz improved Scala in virtually all cases. Notable is a 52x speedup for the sum test.
ScalaBlitz shows a 5.7x speedup for the cart test, mainly due to the work-stealing iterator that is used.
LinqOptimizer improved all cases of C# / F# benchmarks.

Slide 31

Slide 31 text

Future Work
Measure the effect of more combinators.
How are measurements affected as a function of the number of processors?
Java can benefit from an optimizing framework.
C#, F# and Scala Stream APIs can benefit from internal iteration.

Slide 32

Slide 32 text

Thank You: QnA
Play with our code on github.
Reproduce the results on your H/W and let us know!
@biboudis, @NickPalladinos

Slide 33

Slide 33 text

Standard Deviation
Columns, for Windows and then for Linux: Java, Scala-Views, Scala-Strict, C#, F#.
Each row lists its values in column order, Windows | Linux. Baseline and Opt rows have fewer entries because not every column applies to them.

Benchmark                   Windows                          | Linux
sumBaseline                 0.011 0.015 1.214 0.168          | 0.054 0.011 0.552 0.818
sumSeq                      0.015 0.607 0.277 2.407 0.525    | 0.014 0.449 0.475 0.359 1.015
sumSeqOpt                   0.010 0.536 0.212                | 0.022 0.248 0.730
sumPar                      0.035 2.348 2.622 0.895 4.371    | 0.009 3.653 1.827 106.800 117.358
sumParOpt                   0.017 0.075 0.196                | 0.026 1.400 2.010
sumOfSquaresBaseline        0.008 0.016 0.129 0.202          | 0.023 0.013 0.799 1.072
sumOfSquaresSeq             0.009 1.049 2.052 0.763 3.755    | 0.019 1.331 0.895 1.193 1.116
sumOfSquaresSeqOpt          1.104 0.215 0.292                | 0.238 0.583 0.171
sumOfSquaresPar             0.008 3.691 9.355 2.745 0.162    | 0.017 2.807 6.347 23.856 40.342
sumOfSquaresParOpt          0.036 0.433 0.094                | 0.136 0.782 0.485
sumOfSquaresEvenBaseline    0.044 0.085 0.204 0.393          | 0.059 0.035 0.906 1.270
sumOfSquaresEvenSeq         0.121 1.157 1.510 3.789 4.838    | 0.096 1.159 1.042 0.895 1.680
sumOfSquaresEvenSeqOpt      0.550 2.052 5.351                | 0.162 0.847 0.522
sumOfSquaresEvenPar         0.025 5.184 8.207 5.943 2.556    | 0.027 4.905 16.252 46.739 21.465
sumOfSquaresEvenParOpt      0.502 0.115 0.128                | 0.483 1.737 4.390
cartBaseline                0.060 0.041 0.015 1.007          | 0.010 0.010 0.040 0.113
cartSeq                     0.749 6.195 3.939 4.284 5.840    | 0.510 2.437 5.486 0.954 2.791
cartSeqOpt                  0.666 0.148 0.232                | 0.763 0.751 0.307
cartPar                     0.131 13.167 13.165 4.954 7.855  | 0.243 7.641 7.484 10.963 7.546
cartParOpt                  2.694 0.904 1.371                | 2.642 1.810 1.310
refBaseline                 0.069 0.259 0.159 0.360          | 0.152 0.288 1.740 1.566
refSeq                      0.221 1.077 0.719 1.267 3.415    | 0.237 0.438 0.353 1.269 0.639
refSeqOpt                   0.284 2.082 1.437                | 0.235 2.409 1.643
refPar                      0.119 5.123 0.853 8.548 2.556    | 0.271 6.904 0.765 44.879 27.644
refParOpt                   0.247 0.782 0.187                | 0.112 1.445 2.592

Slide 34

Slide 34 text

Observations: Fun with flags
What is the effect if we disable adaptive sizing of the heap?
Scala parallel tests benefit with 1.1x-2.9x speedups.
However, a 10%-15% performance degradation is caused in the majority of the sequential tests.
Tiered compilation had a minor positive effect on Scala sum tests. All others presented a 10% performance degradation.
In Java, all sum* and cart tests presented performance degradation.

Slide 35

Slide 35 text

Java: Streaming API
Intermediate operators are gathered CPS-ish:

    Stream map(Stream source, Function mapper) {
        return new MapperStream(source) {
            Consumer wrap(Consumer consumer) {
                return new Consumer() {
                    void accept(T v) {
                        consumer.accept(mapper.apply(v));
                    }
                };
            }
        };
    }

A bulk operation can be optimized:

    do { consumer.accept(a[i]); } while (++i < hi);
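
The slide's fragment omits generics and the surrounding types. A self-contained sketch of the same idea follows; PushStream and PushStreamDemo are hypothetical miniature types written for this illustration and are not the java.util.stream internals.

```java
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical miniature push-style stream (not the java.util.stream internals).
abstract class PushStream<T> {
    // Drive every element of this stream into `sink`.
    abstract void forEach(Consumer<T> sink);

    static <T> PushStream<T> of(T[] a) {
        return new PushStream<T>() {
            @Override void forEach(Consumer<T> sink) {
                if (a.length == 0) return;
                int i = 0;
                // Bulk traversal at the source: one tight loop pushes all elements.
                do { sink.accept(a[i]); } while (++i < a.length);
            }
        };
    }

    <R> PushStream<R> map(Function<T, R> mapper) {
        final PushStream<T> source = this;
        return new PushStream<R>() {
            @Override void forEach(Consumer<R> sink) {
                // Wrap the downstream consumer, in the spirit of the slide's wrap(...).
                source.forEach(v -> sink.accept(mapper.apply(v)));
            }
        };
    }
}

final class PushStreamDemo {
    static int sumOfSquares(Integer[] nums) {
        int[] sum = {0};   // mutable cell so the lambda can accumulate
        PushStream.of(nums).map(x -> x * x).forEach(x -> sum[0] += x);
        return sum[0];
    }
}
```

Here the source controls the iteration and the operators only compose consumers, which is why the whole pipeline runs inside the single do...while loop.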