Aggelos Biboudis [1], Nick Palladinos [2], Yannis Smaragdakis [1]
[1] University of Athens
[2] Nessos Information Technologies, SA
9th ICOOOLPS Implementation, Compilation, Optimization of OO Languages, Programs and Systems Workshop, 2014
July 28th, 2014
A. Biboudis et al. Clash of the Lambdas ICOOOLPS’14 1 / 27
Clash of the Lambdas = lambda translation + standard stream APIs
- Benchmarks in Scala, Java, C#, F#.
- Bird’s eye view of the lambda translation techniques.
- Simple pipelines.
- Sequential and Parallel, Windows and Linux.
- Optimizing Frameworks: declarative queries → loop-based imperative code.
Translation scheme of lambdas
- lambda: a class that extends scala.runtime.AbstractFunction[0-22] is generated.
- lambda with free variables: the generated class includes private member fields that get initialized at instantiation time.
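The shape of such a generated class can be pictured with a hand-written equivalent. The sketch below is in Java rather than Scala, and the names `ScaleBy` and `ClosureSketch` are hypothetical; it only mirrors the idea (a captured variable stored in a private field that is set at instantiation) and is not the actual compiler output.

```java
import java.util.function.DoubleUnaryOperator;

// Hand-written stand-in for the class a compiler could generate for
// the lambda `d -> d * factor`, where `factor` is a free variable.
final class ScaleBy implements DoubleUnaryOperator {
    private final double factor;   // the captured free variable

    ScaleBy(double factor) {       // initialized at instantiation time
        this.factor = factor;
    }

    @Override
    public double applyAsDouble(double d) {
        return d * factor;
    }
}

final class ClosureSketch {
    static double sumScaled(double[] v, double factor) {
        // Equivalent in behavior to writing: d -> d * factor
        DoubleUnaryOperator f = new ScaleBy(factor);
        double sum = 0.0;
        for (double d : v) {
            sum += f.applyAsDouble(d);
        }
        return sum;
    }
}
```

Each evaluation of a capturing lambda translated this way allocates one such object, which is why capture has a cost that a non-capturing lambda can avoid.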
(Views)
Scala Views enable lazy transformations on collections:

    def sumOfSquares(v: Array[Double]): Double = {
      val sum: Double = v
        .view
        .map(d => d * d)
        .sum
      sum
    }

In 2.11, views over parallel collections were deprecated.
(Views)
- Views for Iterable collections are defined by re-interpreting the iterator method.
- e.g., a map operation costs 3 virtual calls (next, hasNext, f) for every element its iterator yields.
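The three calls per element can be made concrete with a pull-style map iterator. This is a minimal sketch in Java, not the Scala library code; `MapIterator` and `ViewSketch` are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Function;

// Lazy map over an iterator: nothing is computed until next() is called.
final class MapIterator<A, B> implements Iterator<B> {
    private final Iterator<A> inner;
    private final Function<A, B> f;

    MapIterator(Iterator<A> inner, Function<A, B> f) {
        this.inner = inner;
        this.f = f;
    }

    @Override
    public boolean hasNext() {
        return inner.hasNext();        // virtual call: hasNext
    }

    @Override
    public B next() {
        return f.apply(inner.next());  // virtual calls: next and f
    }
}

final class ViewSketch {
    static double sumOfSquares(Double[] v) {
        Iterator<Double> squares =
            new MapIterator<Double, Double>(Arrays.asList(v).iterator(), d -> d * d);
        double sum = 0.0;
        while (squares.hasNext()) {
            sum += squares.next();     // the boxed Double is unboxed here
        }
        return sum;
    }
}
```

Every additional lazy operator wraps the previous iterator in the same way, so the per-element call count grows with the length of the pipeline.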
new API:

    import java.util.stream.DoubleStream;

    public double sumOfSquares(double[] v) {
        double sum = DoubleStream.of(v)
            .map(d -> d * d)
            .sum();
        return sum;
    }

- A lambda can be used anywhere a functional interface is needed.
- Like inner and anonymous classes, lambda expressions can capture variables.
on invokedynamic
- invokedynamic refers to a recipe for creating the lambda instead of generating bytecode (linking is a 1-time cost).
- Class generation at compile time is avoided.
- Fewer classes for class-loading.
- Favors inlining optimizations (e.g. in non-capturing lambdas we can get constant-loads).
The main philosophy is: Stream & a bitset of Characteristics

    source/generator |> lazy |> lazy |> eager/reduce

- Intermediate operators are gathered CPS-style.
- A compact data structure of transformations is applied at the source.
- Flow of characteristics can be used for optimizations (e.g. sorting an already sorted stream).
- A bulk operation can be optimized (e.g. a do...while loop).
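Both points can be observed from user code. The sketch below assumes only the public `java.util.stream` and `java.util.Spliterator` APIs; `PipelineSketch` and its method names are hypothetical. It shows that intermediate operators run nothing until the terminal operation, and that a source can report the SORTED characteristic.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class PipelineSketch {
    // Returns {map calls before the terminal op, map calls after it, sum}.
    static int[] lazyThenEager() {
        AtomicInteger calls = new AtomicInteger();
        List<Integer> source = Arrays.asList(1, 2, 3, 4);

        // Two lazy stages: building the pipeline executes no lambda.
        Stream<Integer> pipeline = source.stream()
            .filter(x -> x % 2 == 0)
            .map(x -> { calls.incrementAndGet(); return x * x; });

        int before = calls.get();

        // The eager/reduce stage pulls the whole pipeline through.
        int sum = pipeline.mapToInt(Integer::intValue).sum();
        return new int[] { before, calls.get(), sum };
    }

    // True if the collection's stream source reports itself as sorted.
    static boolean reportsSorted(Collection<Integer> c) {
        return c.stream().spliterator().hasCharacteristics(Spliterator.SORTED);
    }
}
```

A `TreeSet` source reports SORTED while a list built with `Arrays.asList` does not; a stage such as `sorted()` can use that information to skip redundant work.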
Translation scheme of lambdas
- C# lambdas are always assigned to delegates. If they capture free variables, these become fields of a compiler-generated type (otherwise the lambdas are just static methods).
- F# lambdas are represented as compiler-generated classes that inherit FSharpFunc<T,R>.
LINQ was introduced in C# 3.0.
- as fluent-style method calls:

    nums.Where(x => x % 2 == 0).Select(x => x * x).Sum();

- with the equivalent query comprehension syntactic sugar:

    (from x in nums where x % 2 == 0 select x * x).Sum();

F# is inspired by OCaml and is a first-class citizen of .NET.
- the Seq module is the streaming API of F#:

    nums |> Seq.filter (fun x -> x % 2 = 0) |> Seq.map (fun x -> x * x) |> Seq.sum
based on IEnumerable/IEnumerator
- IEnumerable is a factory for IEnumerator objects.
- IEnumerator keeps the state of the iteration (Current & MoveNext).
- e.g., a .Select(func) combinator:
  - Returns a SelectEnumerable object encapsulating the inner source.
  - We get a SelectEnumerator that passes inner’s Current to func.
  - 3 virtual calls (MoveNext, Current, func) per element per iterator.
- Many ideas come from the DB world on query optimization.
- Various versions of deforestation (Wadler, 88).
- Steno (Murray, 11).
- Box elimination
- Lambda inlining
- Basic support for loop fusion
- Work-stealing tree scheduling
- Operates at compile time.
- Full-fledged AST is available for further optimizations.
Enclose in optimize block for sequential:

    def sumOfSquaresSeqOptimized(v: Array[Double]): Double = {
      optimize {
        val sum: Double = v
          .map(d => d * d)
          .sum
        sum
      }
    }

Two imports & the .toPar combinator for parallel.
- Lambda Inlining
- Loop Fusion
- Nested Loop generation
- Tuple Elimination
- Operates at runtime
- Transforms LINQ expression trees
An example with nested structure:

    var sum = (from num in nums.AsQueryExpr()
               from _num in _nums
               where num % 2 == 0
               select num * _num).Sum().Compile();

Effectively optimizes to:

    int sum = 0;
    for (int index = 0; index < nums.Length; index++)
    {
        for (int _index = 0; _index < _nums.Length; _index++)
        {
            int num = nums[index];
            int _num = _nums[_index];
            if (num % 2 == 0)
                sum += num * _num;
        }
    }
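The same declarative-to-loop correspondence can be written out by hand in Java. This is a sketch under stated assumptions: `CartSketch` and its method names are hypothetical, the Java Stream API performs no such rewriting itself, and the fused loop is simply the code an optimizing framework would aim to produce.

```java
import java.util.Arrays;

final class CartSketch {
    // Declarative form: pair every num with every inner value, keep the
    // pairs where num is even, and sum the products.
    static long declarative(long[] nums, long[] inner) {
        return Arrays.stream(nums)
            .flatMap(num -> Arrays.stream(inner)
                .filter(n -> num % 2 == 0)
                .map(n -> num * n))
            .sum();
    }

    // Fused form: the same computation as two nested loops with no
    // intermediate streams or lambda objects.
    static long fused(long[] nums, long[] inner) {
        long sum = 0;
        for (int i = 0; i < nums.length; i++) {
            for (int j = 0; j < inner.length; j++) {
                long num = nums[i];
                long n = inner[j];
                if (num % 2 == 0) {
                    sum += num * n;
                }
            }
        }
        return sum;
    }
}
```

Both methods return the same value for any input; only the second avoids creating one inner stream per outer element.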
Java and Scala
- A custom tool, named LambdaMicrobenchmarking [2], that follows microbenchmarking practices from P. Sestoft’s "Microbenchmarks in Java and C#" and uses the same method for reporting standard deviation and error as JMH [1].
- Our project is on GitHub [3].

[1] http://openjdk.java.net/projects/code-tools/jmh
[2] https://github.com/biboudis/LambdaMicrobenchmarking
[3] http://biboudis.github.io/clashofthelambdas
iterations/10 warmup on forked VM each benchmark.
- GC collections forced.
- Compiler optimizations on.
- TieredCompilation off (see Appendix for a fun-with-flags section).

                Windows                          Ubuntu Linux
Version         8.1                              13.10/3.11.0-24
Architecture    x64                              x64
CPU             Intel Core i5-3360M vPro 2.8GHz (both)
Cores           2 physical x 2 logical (both)
Memory          4GB (both)

                Windows                          Ubuntu Linux
Java            Java 8 (b132)/JVM 1.8 (both)
Scala           2.10.4/JVM 1.8 (both)
C#              C#5/CLR v4.0                     mono 3.4.0.0/mono 3.4.0
F#              F#3.1/CLR v4.0                   open-source 3.0/mono 3.4.0
Input data
- For N = 10,000,000 we used N long integers for all sum* tests.
- The cart test iterates over an outer array of 1,000,000 long integers and an inner one of 10.
- For ref we used N instances.
- Sequential and Parallel tests. Windows and Linux.
- Java pays the cost for capturing lambdas with heap-allocated objects.
- Scala stream API (views) pays the cost of boxing/unboxing, iterator and function object abstraction penalties.
- Scala strict API performs significantly better (although intermediate collections are produced).
- F# achieved the best scaling of 2.6x-4.3x.
- Scala in the refs tests is directly comparable to other implementations (although some internal boxing effect was revealed).
- C# and F# scale poorly on mono/Linux.
all cases.
- Notable is a 52x speedup for the sum.
- ScalaBlitz shows a 5.7x speedup for the cart, mainly due to the work-stealing iterator that is used.
- LinqOptimizer improved all cases of C# / F# benchmarks.
- How are measurements affected as a function of the number of processors?
- Java can benefit from an optimizing framework.
- C#, F# and Scala Stream APIs can benefit from internal iteration.
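Internal iteration means the source drives the loop and pushes each element into the downstream stages, instead of the consumer pulling through hasNext/next or MoveNext/Current. The following is a minimal push-style sketch in Java; `PushStream` is a hypothetical interface introduced for illustration and is not part of any of the libraries discussed.

```java
import java.util.function.DoubleConsumer;
import java.util.function.DoubleUnaryOperator;

// A push-style stream: the only primitive is "run the loop and hand
// every element to the given sink".
@FunctionalInterface
interface PushStream {
    void forEach(DoubleConsumer sink);

    // Source: a plain loop over the array, owned by the stream itself.
    static PushStream of(double[] v) {
        return sink -> {
            for (double d : v) {
                sink.accept(d);
            }
        };
    }

    // Lazy stage: wraps the downstream sink, adds no loop of its own.
    default PushStream map(DoubleUnaryOperator f) {
        return sink -> this.forEach(d -> sink.accept(f.applyAsDouble(d)));
    }

    // Eager stage: runs the single source loop and accumulates.
    default double sum() {
        double[] acc = { 0.0 };
        forEach(d -> acc[0] += d);
        return acc[0];
    }
}
```

With this design a pipeline such as `PushStream.of(v).map(d -> d * d).sum()` executes one loop at the source, and each stage contributes a single call per element rather than an iterator protocol.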
disable adaptive sizing of the heap?
- Scala parallel tests benefit, with 1.1x-2.9x speedups.
- However ... a 10%-15% performance degradation is caused in the majority of the sequential tests.
- Tiered compilation had a minor positive effect on Scala sum tests. All others presented a 10% performance degradation.
- In Java, all sum* and cart tests presented performance degradation.