• About to graduate ;) • Working on program transformations, focusing on data representation • Author of miniboxing, which improves generics performance by up to 20x • Contributed to the Scala compiler and the scaladoc tool • @VladUreche • [email protected] • scala-miniboxing.org
The pointer-heavy object layout makes Vector[Employee] suboptimal • Not limited to Vector, other classes are affected too – Spark pain point: functions/closures – We'd like a "structured" representation throughout • Challenge: no means of communicating this to the compiler (the sketch below makes the two representations concrete)
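A minimal sketch of the trade-off; the Employee fields (ID, NAME, SALARY) and the EmployeeVector name come from the slides, everything else is assumed:

  class Employee(val id: Int, val name: String, val salary: Int)

  // Generic container: one heap object (header + pointers) per element.
  val employees: Vector[Employee] =
    Vector(new Employee(1, "Ann", 100), new Employee(2, "Bob", 200))

  // Hand-specialized "structure of arrays": flat, cache-friendly storage,
  // but a one-time manual effort that no longer looks like a Vector[T].
  class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Int]) {
    def length: Int = ids.length
    def apply(i: Int): Employee = new Employee(ids(i), names(i), salaries(i))
  }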
Transforming the data representation by hand (programmer) is: • based on experience • based on speculation • a one-time effort • repetitive and complex • bad for code readability • verbose • error-prone → better left to the compiler (automated)
[Figure: class Vector[T] { … } storing Employee objects with fields ID, NAME, SALARY is transformed into an EmployeeVector holding one flat array per field: ID ..., NAME ..., SALARY ...] But then someone writes class NewEmployee(...) extends Employee(...) with an extra DEPT field – Oooops... the specialized EmployeeVector has no column for DEPT (see the sketch below)
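Continuing the sketch above, subclassing shows why the hand-written container is fragile; NewEmployee and the DEPT field come from the slide, the body is assumed:

  // A subclass introduced after EmployeeVector was written:
  class NewEmployee(id: Int, name: String, salary: Int, val dept: String)
      extends Employee(id, name, salary)

  // EmployeeVector only has columns for id, name and salary, so storing a
  // NewEmployee silently drops dept: the manual transformation assumed a
  // closed world that this subclass just broke.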
• Scopes are inlined immediately after the parser – Definitions are visible outside the "scope" • They mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
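A sketch of what such a scope could look like; the transform marker and the transformation object below are assumptions for illustration, not the actual API:

  // Hypothetical no-op marker standing in for the real scope syntax:
  object transform {
    def apply[T](transformation: AnyRef)(body: => T): T = body
  }
  object VectorOfEmployeeToEmployeeVector // hypothetical transformation description

  transform(VectorOfEmployeeToEmployeeVector) {
    // Inside the scope, Vector[Employee] is compiled to the EmployeeVector
    // representation; values crossing the scope boundary are converted,
    // and values that can't be converted can be rejected with an error.
    val v: Vector[Employee] = Vector(new Employee(1, "Ann", 100))
    v.map(_.salary)
  }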
Which representation is best? It depends. [Figure: the same data in three forms – Vector[Employee] as objects with fields ID, NAME, SALARY; EmployeeVector as one array per field (ID ..., NAME ..., SALARY ...); Tungsten repr. as a <compressed binary blob>]
Calling across representations – • Original code ↔ Transformed code • Same transformation ↔ Different transformation – the compiler automatically introduces conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back (sketched below)
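A sketch of the conversions involved; toEmployeeVector and toVector are assumed names, not the real API:

  def toEmployeeVector(v: Vector[Employee]): EmployeeVector =
    new EmployeeVector(
      v.map(_.id).toArray, v.map(_.name).toArray, v.map(_.salary).toArray)

  def toVector(ev: EmployeeVector): Vector[Employee] =
    Vector.tabulate(ev.length)(ev.apply)

  // At a boundary, the compiler rewrites a call like process(v) into
  // process(toEmployeeVector(v)), so the two worlds interoperate.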
class EmployeePrinter extends Printer[Employee] { def print(employee: Vector[Employee]) = ... } Method print in the class implements method print in the trait. But inside a transformation scope, the signature of method print changes according to the transformation → it no longer implements the trait. Taken care of by the compiler for you! (see the sketch below)
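One way the compiler can take care of it, sketched with an assumed trait definition and the toEmployeeVector conversion from above: keep a bridge method with the original signature that converts and forwards.

  trait Printer[T] { def print(elements: Vector[T]): Unit }

  class EmployeePrinter extends Printer[Employee] {
    // Transformed method: operates on the efficient representation.
    def print(employees: EmployeeVector): Unit = { /* ... */ }
    // Compiler-generated bridge: keeps the trait's signature, converts, forwards.
    def print(employees: Vector[Employee]): Unit =
      print(toEmployeeVector(employees))
  }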
Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be [Figure: the tuple (3,5) stored as a header + reference pointing to a heap object holding 3 and 5] Transforming the representation makes it 14x faster, with lower heap requirements
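A minimal sketch of the kind of representation change that buys such speedups: packing a pair of Ints into one Long, so there is no object, header or pointer at all. The encoding is an assumption used for illustration, not necessarily the talk's exact scheme:

  def pack(x: Int, y: Int): Long = (x.toLong << 32) | (y.toLong & 0xFFFFFFFFL)
  def first(p: Long): Int  = (p >>> 32).toInt
  def second(p: Long): Int = p.toInt

  val p = pack(3, 5)   // a single primitive long instead of a Tuple2 object
  assert(first(p) == 3 && second(p) == 5)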
• DataFrames do predicate push-down and code generation – Code is specialized for the data representation – Functions are specialized for the data representation • RDDs don't do predicate push-down or code generation – This is what makes them slower • Datasets do code generation too – Code and functions are specialized for the data representation
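A hedged sketch of the difference, using the standard Spark API (the data is made up): the RDD predicate is an opaque closure, while the DataFrame predicate is a structured expression Spark can analyze.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  case class Emp(id: Int, name: String, salary: Int)

  val spark = SparkSession.builder().appName("repr-demo").master("local[*]").getOrCreate()
  import spark.implicits._
  val data = Seq(Emp(1, "Ann", 100), Emp(2, "Bob", 200))

  // RDD: the filter is an opaque JVM closure over Emp objects; Spark cannot
  // look inside it, so no predicate push-down and no code generation.
  val slow = spark.sparkContext.parallelize(data).filter(_.salary > 100)

  // DataFrame: the predicate is a structured expression over the columnar
  // (Tungsten) representation; Spark can push it down and generate code.
  val fast = data.toDF().filter(col("salary") > 100)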
• Solution: issue compiler warnings – Explain why the transformation is not possible: due to the method call – Suggest how to fix it: enclose the method in a scope • Reuse the machinery in miniboxing scala-miniboxing.org
• Solution: – Adapt data-centric metaprogramming to Spark – Trade generality for simplicity – Do the right thing in most cases • Where are we now?
• The data representation is configurable • But it's very limited: – Custom data representation only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done
• JIT compilers optimize low-level code – They can assume a (local) closed world – They can speculate based on profiles • The best optimizations break semantics – You can't do that in the JIT compiler! – Only the programmer can decide to break semantics
• Macros: – :) lots of power – :( lots of responsibility • They must respect: – Scala compiler invariants – The object-oriented model – Modularity • Can we restrict macros so they're safer? – Data-centric metaprogramming