Data-centric Metaprogramming - Scala Days 2016

Presentation at Scala Days 2016

Vlad Ureche

June 16, 2016

Transcript

  1. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … } The Vector collection in the Scala library
  2. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … } The Vector collection in the Scala library Corresponds to a table row
  3. Object Composition class Employee(...) ID NAME SALARY Vector[Employee] ID NAME

    SALARY ID NAME SALARY class Vector[T] { … } Traversal requires dereferencing a pointer for each employee.
  4. A Better Representation NAME ... NAME EmployeeVector ID ID ...

    ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY
  5. A Better Representation • more efficient heap usage • faster

    iteration NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY
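    As a rough sketch of the column-oriented layout above, an EmployeeVector could look like the following in plain Scala. The Employee fields and their types are assumptions for illustration; the slides only show the ID, NAME and SALARY columns.

      case class Employee(id: Int, name: String, salary: Double)

      // Hypothetical struct-of-arrays layout: one array per field instead of one
      // heap object per employee, so iteration walks contiguous arrays rather
      // than chasing a pointer for each element.
      class EmployeeVector(ids: Array[Int], names: Array[String], salaries: Array[Double]) {
        def length: Int = ids.length
        def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))
      }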
  6. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected
  7. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections)
  8. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions
  9. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations
  10. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations – Manual changes don't scale
  11. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations – Manual changes don't scale – The compiler should do that
  12. Current Optimizers • They do a great job What about

    the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  13. Current Optimizers • They do a great job – But

    have to respect semantics What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  14. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  15. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  16. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  17. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  18. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  19. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native? Challenge: No means of telling the compiler what/when to speculate
  20. Transformation Definition Application • can't be automated • based on

    experience • based on speculation • one-time effort
  21. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort
  22. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone
  23. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  24. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  25. Data-Centric Metaprogramming object VectorOfEmployeeOpt extends Transformation { type Target =

    Vector[Employee] type Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... }
  26. object VectorOfEmployeeOpt extends Transformation { type Target = Vector[Employee] type

    Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } Data-Centric Metaprogramming What to transform? What to transform to?
  27. object VectorOfEmployeeOpt extends Transformation { type Target = Vector[Employee] type

    Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } Data-Centric Metaprogramming How to transform?
  28. Data-Centric Metaprogramming object VectorOfEmployeeOpt extends Transformation { type Target =

    Vector[Employee] type Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } How to run methods on the updated representation?
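    Read as a single listing, the transformation object from the preceding four slides looks roughly like the sketch below, reusing the Employee and EmployeeVector types sketched earlier. The Transformation trait and the exact bypass_* signatures belong to the speaker's compiler-plugin prototype; here the trait is a placeholder and the bypass methods are given an explicit receiver parameter so the sketch compiles, which may differ from the plugin's actual convention.

      // Placeholder for the plugin's Transformation marker trait.
      trait Transformation { type Target; type Result }

      object VectorOfEmployeeOpt extends Transformation {
        type Target = Vector[Employee]   // what to transform
        type Result = EmployeeVector     // what to transform to

        // how to transform: conversions in both directions
        def toResult(t: Target): Result = ???
        def toTarget(t: Result): Target = ???

        // how to run Vector's methods on the updated representation
        def bypass_length(v: Result): Int = ???
        def bypass_apply(v: Result, i: Int): Employee = ???
        def bypass_update(v: Result, i: Int, e: Employee): Unit = ???
        def bypass_toString(v: Result): String = ???
      }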
  29. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  30. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  31. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY
  32. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT
  33. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT
  34. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT Oooops...
  35. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee
  36. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How?
  37. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!
  38. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary )
  39. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }
  40. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Now the method operates on the EmployeeVector representation.
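    Written out, the scope on this slide is the following; transform(...) is the compiler plugin's surface syntax, so treat this as a sketch rather than a drop-in snippet:

      transform(VectorOfEmployeeOpt) {
        def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
          for (employee <- employees) yield
            employee.copy(salary = (1 + by) * employee.salary)
      }

    Inside the scope, employees is stored and traversed as an EmployeeVector; callers outside still see a Vector[Employee].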
  41. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope"
  42. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope" No, it's not a macro. It's a marker for the compiler plugin. (You can't do this with macros)
  43. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope"
  44. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope" • Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
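    A hedged sketch of what the boundary conversions mean in practice: values entering or leaving the scope pass through the transformation's toResult/toTarget methods (the names used on the earlier transformation-object slides), roughly as if the compiler had written:

      // Outside the scope: the default representation.
      val vec: Vector[Employee] = Vector(Employee(1, "John Doe", 100.0))

      // Entering the scope converts to the optimized representation;
      // leaving it converts back. The plugin inserts these calls automatically.
      val inside:  EmployeeVector   = VectorOfEmployeeOpt.toResult(vec)
      val outside: Vector[Employee] = VectorOfEmployeeOpt.toTarget(inside)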
  45. Best ...? NAME ... NAME EmployeeVector ID ID ... ...

    SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY
  46. Best ...? Compact binary repr. <compact binary blob> NAME ...

    NAME EmployeeVector ID ID ... ... SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY
  47. Best ...? EmployeeJSON { id: 123, name: “John Doe” salary:

    100 } Compact binary repr. <compact binary blob> NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY
  48. Scopes allow mixing data representations transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee],

    by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }
  49. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the EmployeeVector representation.
  50. Scopes transform(VectorOfEmployeeCompact) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the compact binary representation.
  51. Scopes transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the JSON-based representation.
  52. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation)
  53. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation)
  54. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation) • Calls between them
  55. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation) • Calls between them ???
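    Laid out as code, the composition example is the following: index1Percent is original code using the default Vector[Employee] representation, while indexSalary is compiled against the JSON-based representation, and the question on this slide is what happens at the call between them. (The literal is written 0.01f here so it matches the Float parameter.)

      // Original code, default representation.
      def index1Percent(employees: Vector[Employee]): Vector[Employee] =
        indexSalary(employees, 0.01f)

      // Transformed code, different representation. VectorOfEmployeeJSON is the
      // transformation object named on the slide; its definition is not shown.
      transform(VectorOfEmployeeJSON) {
        def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
          for (employee <- employees) yield
            employee.copy(salary = (1 + by) * employee.salary)
      }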
  56. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  57. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  58. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Easy one. Do nothing
  59. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  60. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  61. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  62. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back
  63. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  64. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  65. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  66. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Hard one. Do not introduce any conversions. Even across separate compilation
  67. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  68. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Hard one. Automatically introduce double conversions (and warn the programmer), e.g. EmployeeVector → Vector[Employee] → CompactEmpVector
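    A sketch of the double-conversion case: when caller and callee sit under different transformations, the value has to go back through the common Vector[Employee] representation. CompactEmpVector and VectorOfEmployeeCompact are only named in the deck, so their definitions below are placeholders.

      class CompactEmpVector(bytes: Array[Byte])

      object VectorOfEmployeeCompact {
        def toResult(t: Vector[Employee]): CompactEmpVector = ???
        def toTarget(c: CompactEmpVector): Vector[Employee] = ???
      }

      // Conceptually, the compiler inserts at the boundary:
      //   EmployeeVector -> Vector[Employee] -> CompactEmpVector
      def crossRepresentations(ev: EmployeeVector): CompactEmpVector =
        VectorOfEmployeeCompact.toResult(VectorOfEmployeeOpt.toTarget(ev))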
  69. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  70. Composition calling overriding • Original code • Transformed code •

    Original code • Transformed code • Same transformation • Different transformation
  71. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class

    EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }
  72. trait Printer[T] { def print(elements: Vector[T]): Unit } class EmployeePrinter

    extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } Scopes Method print in the class implements method print in the trait
  73. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class

    EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }
  74. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } }
  75. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait
  76. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait. Taken care of by the compiler for you!
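    Written out, the example is the trait and class below: after the transformation, print inside the scope effectively operates on an EmployeeVector, so without help it would no longer override Printer.print; the plugin generates the necessary bridging so the code compiles as written. The listing simply reflows the slide.

      trait Printer[T] {
        def print(elements: Vector[T]): Unit
      }

      transform(VectorOfEmployeeOpt) {
        class EmployeePrinter extends Printer[Employee] {
          def print(elements: Vector[Employee]): Unit = ???   // body elided on the slide
        }
      }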
  77. Column-oriented Storage NAME ... NAME EmployeeVector ID ID ... ...

    SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY
  78. Column-oriented Storage NAME ... NAME EmployeeVector ID ID ... ...

    SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY iteration is 5x faster
  79. Retrofitting value class status Tuples in Scala are specialized but

    are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference
  80. Retrofitting value class status 0l + 3 << 32 +

    5 (3,5) Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference
  81. Retrofitting value class status 0l + 3 << 32 +

    5 (3,5) Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference 14x faster, lower heap requirements
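    The expression 0l + 3 << 32 + 5 on the slide is shorthand for packing the pair (3, 5) into a single Long. Below is a minimal sketch of that encoding, independent of the plugin; explicit shifts and masks are used because, read literally, Scala's operator precedence would parse the slide's expression differently.

      // Pack two Ints into one Long: high 32 bits hold the first element,
      // low 32 bits hold the second.
      def pack(a: Int, b: Int): Long = (a.toLong << 32) | (b & 0xFFFFFFFFL)
      def first(p: Long): Int  = (p >>> 32).toInt
      def second(p: Long): Int = p.toInt

      // pack(3, 5) == 0x0000000300000005L; first/second recover (3, 5)
      // without allocating a tuple object on the heap.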
  82. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function
  83. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18
  84. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18 6x faster
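    A hand-written model of what the deforestation transformation achieves: instead of materializing List(2, 3, 4) and List(4, 6, 8), the two map steps are accumulated into one function and the sum is computed in a single traversal. ListDeforestation itself is the plugin's transformation object; the sketch only imitates its effect manually.

      // Naive pipeline: builds two intermediate lists before summing.
      val naive = List(1, 2, 3).map(_ + 1).map(_ * 2).sum                // 18

      // Accumulated-function view: fuse the maps, traverse once, no intermediates.
      val f: Int => Int = ((x: Int) => x + 1).andThen(_ * 2)
      val fused = List(1, 2, 3).foldLeft(0)((acc, x) => acc + f(x))      // 18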
  85. Research ahead!* (* This may not make it into a product. But you can play
    with it nevertheless.)
  86. Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

    Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file)
  87. Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

    Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file) How does mapping work?
  88. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode
  89. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Allocate object Allocate object
  90. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Allocate object Allocate object
  91. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode
  92. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Modified user function (automatically derived by the compiler)
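    The picture on these slides, written out: a naive RDD map decodes each record into an object, applies the user function and re-encodes, allocating twice per element, and the goal of the transformation is to derive a function that works on the encoded data directly. Everything below (Encoded, decode, encode) is a hypothetical stand-in for Spark's actual machinery, reusing the Employee type from earlier.

      type Encoded = Array[Byte]
      def decode(rec: Encoded): Employee = ???   // allocates an Employee object
      def encode(e: Employee): Encoded   = ???   // allocates an encoded record

      val f: Employee => Employee = identity     // the user function

      // Naive mapping: decode -> f -> encode, two allocations per element.
      def naiveMap(data: Iterator[Encoded]): Iterator[Encoded] =
        data.map(rec => encode(f(decode(rec))))

      // What the compiler would derive: a single Encoded => Encoded function
      // that avoids materializing the intermediate objects.
      def derivedF(rec: Encoded): Encoded = ???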
  93. Mapping an RDD serialized data encoded data encoded data Modified

    user function (automatically derived by the compiler)
  94. Mapping an RDD serialized data encoded data encoded data Modified

    user function (automatically derived by the compiler) Nowhere near as simple as it looks
  95. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call
  96. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope
  97. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope • Reuse the machinery in miniboxing scala-miniboxing.org
  98. Challenge: Internal API • Spark internals rely on Iterator[T] –

    Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers
  99. Challenge: Internal API • Spark internals rely on Iterator[T] –

    Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers • Solution: Extensive refactoring/rewrite
  100. Prototype Hack • Modified version of Spark core – RDD

    data representation is configurable
  101. Prototype Hack • Modified version of Spark core – RDD

    data representation is configurable • It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done
  102. Prototype Hack
    sc.parallelize(/* 1 million */ records)
      .map(x => ...)
      .filter(x => ...)
      .collect()
    More details in my talk at Spark Summit EU 2015
  103. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes.
  104. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really.
  105. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!
  106. Deforestation and Language Semantics • Notice that we changed language

    semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
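    A concrete way to see the effects reordering: with side-effecting maps, the eager version finishes the first map over the whole list before starting the second, while the fused (lazy) version interleaves the effects per element.

      val xs = List(1, 2, 3)

      // Eager collections: prints a1 a2 a3 b2 b3 b4
      xs.map { x => print(s"a$x "); x + 1 }
        .map { x => print(s"b$x "); x * 2 }

      // Deforested / fused pipeline: prints a1 b2 a2 b3 a3 b4
      xs.foreach { x0 =>
        val x1 = { print(s"a$x0 "); x0 + 1 }
        val x2 = { print(s"b$x1 "); x1 * 2 }
      }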
  107. Deforestation and Language Semantics • Such transformations are only acceptable

    with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in
  108. Code Generation • Also known as – Deep Embedding –

    Multi-Stage Programming • Awesome speedups, but restricted to small DSLs • SparkSQL uses code gen to improve performance – By 2-4x over Spark
  109. Low-level Optimizers • Java JIT Compiler – Access to the

    low-level code – Can assume a (local) closed world – Can speculate based on profiles
  110. Low-level Optimizers • Java JIT Compiler – Access to the

    low-level code – Can assume a (local) closed world – Can speculate based on profiles • Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics
  111. Scala Macros • Many optimizations can be done with macros

    – :) Lots of power – :( Lots of responsibility • Scala compiler invariants • Object-oriented model • Modularity
  112. Scala Macros • Many optimizations can be done with macros

    – :) Lots of power – :( Lots of responsibility • Scala compiler invariants • Object-oriented model • Modularity • Can we restrict macros so they're safer? – Data-centric metaprogramming