Object Composition — class Employee(...) composed with class Vector[T] { … } gives Vector[Employee]. [diagram: the vector holds a reference to each Employee object, each with its own ID, NAME and SALARY fields] Traversal requires dereferencing a pointer for each employee.
A Better Representation ● more efficient heap usage ● faster iteration [diagram: EmployeeVector stores the ID, NAME and SALARY columns as contiguous arrays, instead of Vector[Employee] pointing to one object per employee]
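To make the contrast concrete, here is a minimal Scala sketch of the two layouts; EmployeeVector and its fields are illustrative names, not the talk's actual implementation:

// Object composition: one heap object per employee, the vector holds references.
case class Employee(id: Int, name: String, salary: Long)

val employees: Vector[Employee] = Vector(
  Employee(1, "Ann", 50000L),
  Employee(2, "Bob", 45000L)
)

// Columnar ("struct of arrays") alternative: three flat arrays, no per-employee
// object, so iteration walks contiguous memory instead of chasing pointers.
class EmployeeVector(
  val ids: Array[Int],
  val names: Array[String],
  val salaries: Array[Long]
) {
  def size: Int = ids.length
  def salaryAt(i: Int): Long = salaries(i)
}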
The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector: other constructs are also affected – Generics (including all collections) and Functions ● We know better representations – Manual changes don't scale – The compiler should apply them
Current Optimizers ● They do a great job – But they have to respect semantics – Support every corner case – Have to be conservative :( ● Programmers have control – They know what/when/how data is accessed – And can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native? Challenge: there is no means of telling the compiler what/when to speculate
Transformation ● Definition — done by the programmer: can't be automated, based on experience, based on speculation, a one-time effort ● Application — done by the compiler (automated): by hand it is repetitive and complex, affects code readability, is verbose and error-prone
Scenario — class Employee(...) and class Vector[T] { … }, with Vector[Employee] transformed to the columnar EmployeeVector representation [diagram as on the earlier slides] — then class NewEmployee(...) extends Employee(...) adds a DEPT field. Oooops...
Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!
Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope" No, it's not a macro. It's a marker for the compiler plugin. (You can't do this with macros)
Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope" ● Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
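As a rough sketch of how a scope looks in code — transform and VectorOfEmployeeOpt are the marker and transformation description used later in the talk; the stub below only stands in for the compiler plugin so the sketch is self-contained:

// Stand-in for the plugin's transform marker; with the real plugin,
// transform(...) { ... } is inlined right after parsing, not an ordinary call.
def transform[T](description: AnyRef)(body: => T): T = body
object VectorOfEmployeeOpt   // hypothetical transformation description

val total: Long = transform(VectorOfEmployeeOpt) {
  // Inside the scope the plugin rewrites Vector[Employee] to the optimized
  // representation; values entering and leaving the scope are converted.
  employees.map(_.salary).sum
}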
Composition — original and transformed code calling each other [diagram: calls between original code and transformed code, under the same or a different transformation] ● Original code calling original code: the easy one — do nothing.
Composition ● Original code calling transformed code (or the other way around): automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back.
Composition ● Transformed code calling transformed code under the same transformation: the hard one — do not introduce any conversions, even across separate compilation.
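A rough idea of the conversions the compiler inserts at a scope boundary, written by hand against the EmployeeVector sketch from earlier (toVector and fromVector are hypothetical names):

// From the optimized representation back to the original one...
def toVector(ev: EmployeeVector): Vector[Employee] =
  (0 until ev.size).map(i => Employee(ev.ids(i), ev.names(i), ev.salaries(i))).toVector

// ...and from the original representation into the optimized one.
def fromVector(es: Vector[Employee]): EmployeeVector =
  new EmployeeVector(es.map(_.id).toArray, es.map(_.name).toArray, es.map(_.salary).toArray)

// When transformed code calls original code (or back), the plugin inserts such
// conversions automatically; when both sides use the same transformation, the
// goal is to insert none at all.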
Scopes — trait Printer[T] { def print(elements: Vector[T]): Unit }; class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } — method print in the class implements method print in the trait.
Scopes — trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt) { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } — the signature of method print changes according to the transformation, so it no longer implements the trait. Taken care of by the compiler for you!
Retrofitting value class status ● Tuples in Scala are specialized but are still objects (not value classes), so they are not as optimized as they could be [diagram: (3,5) as a heap object with a header and a reference vs. encoded into a single long, (3L << 32) + 5] ● Result: 14x faster, lower heap requirements.
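The packed-long encoding from the slide, spelled out as a small self-contained sketch (the shift is parenthesized so it applies before the addition):

// Pack an Int pair into a single Long: high 32 bits hold the first element,
// low 32 bits the second. No heap allocation, no object header, no reference.
def encode(a: Int, b: Int): Long = (a.toLong << 32) | (b & 0xFFFFFFFFL)
def fst(p: Long): Int = (p >>> 32).toInt
def snd(p: Long): Int = p.toInt

val packed = encode(3, 5)                     // stands in for the tuple (3, 5)
assert(fst(packed) == 3 && snd(packed) == 5)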
Spark RDD (Resilient Distributed Dataset) [diagram: primary data (e.g. a CSV file) and derived data (e.g. primary.map(f))] — how does mapping work?
Mapping an RDD [diagram: serialized data ↔ encoded data; the user function f: X → Y is wrapped between a decode and an encode step] — the modified user function is automatically derived by the compiler.
Mapping an RDD [diagram: serialized data ↔ encoded data, processed directly by the modified user function (automatically derived by the compiler)] — nowhere near as simple as it looks.
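One way to picture the derived function, as a hand-written sketch — Encoder and mapEncoded are hypothetical names, not Spark's or the plugin's API:

// An encoder pairs a user-facing type T with its encoded representation Enc.
trait Encoder[T, Enc] {
  def encode(t: T): Enc
  def decode(e: Enc): T
}

// Naive derived mapper: decode each element, apply the user function f, re-encode.
// The point of the transformation is to go further and, where possible, rewrite f
// to work on the encoded data directly, without materializing the object at all.
def mapEncoded[T, U, ET, EU](data: Iterator[ET], f: T => U)(
    implicit inEnc: Encoder[T, ET], outEnc: Encoder[U, EU]): Iterator[EU] =
  data.map(e => outEnc.encode(f(inEnc.decode(e))))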
Challenge: Transformation not possible ● Example: calling an outside (untransformed) method ● Solution: issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope ● Reuse the machinery in miniboxing scala-miniboxing.org
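A sketch of the warning scenario, reusing the stub transform marker and VectorOfEmployeeOpt from the earlier sketch (both hypothetical):

// An outside, untransformed method: it expects the original Vector[Employee].
def untransformedTotal(es: Vector[Employee]): Long = es.map(_.salary).sum

transform(VectorOfEmployeeOpt) {
  // Inside the scope, es would live in the optimized representation, so this call
  // forces a conversion back to Vector[Employee]; the plugin warns about it and
  // suggests enclosing untransformedTotal in a scope as well.
  def report(es: Vector[Employee]): Long = untransformedTotal(es)
}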
Challenge: Internal API ● Spark internals rely on Iterator[T] – Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers ● Solution: Extensive refactoring/rewrite
Prototype Hack ● Modified version of Spark core – RDD data representation is configurable ● It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done
Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!
Deforestation and Language Semantics ● Notice that we changed language semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
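A small example of the reordering, assuming map and filter both have side effects:

// Eager: all map side effects run before any filter side effect.
Vector(1, 2).map    { x => println(s"map $x");    x     }
            .filter { x => println(s"filter $x"); x > 1 }
// prints: map 1, map 2, filter 1, filter 2

// Lazy / fused (here via a view): the effects interleave per element.
Vector(1, 2).view
            .map    { x => println(s"map $x");    x     }
            .filter { x => println(s"filter $x"); x > 1 }
            .toVector
// prints: map 1, filter 1, map 2, filter 2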
Deforestation and Language Semantics ● Such transformations are only acceptable with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in
Code Generation ● Also known as – Deep Embedding – Multi-Stage Programming ● Awesome speedups, but restricted to small DSLs ● SparkSQL uses code gen to improve performance – By 2-4x over Spark
Low-level Optimizers ● Java JIT Compiler – Access to the low-level code – Can assume a (local) closed world – Can speculate based on profiles ● Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics
Scala Macros ● Many optimizations can be done with macros – :) Lots of power – :( Lots of responsibility ● Scala compiler invariants ● Object-oriented model ● Modularity ● Can we restrict macros so they're safer? – Data-centric metaprogramming