Vlad Ureche, PhD in the Scala Team @ EPFL. Soon to graduate ;) ● Working on program transformations focusing on data representation ● Author of miniboxing, which improves generics performance by up to 20x ● Contributed to the Scala compiler and to the scaladoc tool. @VladUreche [email protected] scala-miniboxing.org
Motivation Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data, used with permission. Performance gap between RDDs and DataFrames
Motivation ● RDD: strongly typed, slower ● DataFrame: dynamically typed, faster ● Dataset: strongly typed, faster, but only mid-way. Why just mid-way? What can we do to speed them up?
Object Composition: class Employee(...) has fields ID, NAME, SALARY; class Vector[T] { … } is generic. A Vector[Employee] holds a pointer to a separate Employee object (ID, NAME, SALARY) for each element. Traversal requires dereferencing a pointer for each employee.
A Better Representation: an EmployeeVector that stores the IDs, NAMEs and SALARYs of all employees in separate, contiguous arrays, instead of a Vector[Employee] of individual objects ● more efficient heap usage ● faster iteration
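To make the layouts concrete, here is a minimal Scala sketch of the two representations (the names Employee and EmployeeVector follow the slides; the fields and methods are illustrative):

  class Employee(val id: Long, val name: String, val salary: Long)

  // object representation: a vector of pointers, one heap object per employee
  val employees: Vector[Employee] =
    Vector(new Employee(1, "Ann", 2000), new Employee(2, "Bob", 1800))

  // structured representation: one object, three flat arrays, no per-employee dereference
  class EmployeeVector(val ids: Array[Long], val names: Array[String], val salaries: Array[Long]) {
    def salarySum: Long = salaries.sum   // iterates a contiguous array, cache-friendly
  }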
The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other classes also affected – Spark pain point: Functions/closures – We'd like a "structured" representation throughout Challenge: No means of communicating this to the compiler
Transformation ● Definition (by the programmer): can't be automated, based on experience, based on speculation, a one-time effort ● Application: repetitive and complex, affects code readability, is verbose, is error-prone, so it is left to the compiler (automated)
Scenario: Vector[Employee] has been transformed into the EmployeeVector representation (separate ID, NAME and SALARY arrays). Then someone defines class NewEmployee(...) extends Employee(...) with an extra DEPT field. Oooops...
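A sketch of why the subclass breaks the transformed layout (continuing the hypothetical EmployeeVector from above):

  class Employee(val id: Long, val name: String, val salary: Long)
  class NewEmployee(id: Long, name: String, salary: Long, val dept: String) extends Employee(id, name, salary)

  // An EmployeeVector has only id/name/salary columns: storing a NewEmployee in it
  // silently drops the DEPT field, so the "optimized" representation is no longer
  // equivalent to Vector[Employee].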
Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!
Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope" ● Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
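A sketch of what a scope can look like, reusing the transform marker and the VectorOfEmployeeOpt transformation object shown later in the talk (the method body is illustrative):

  transform(VectorOfEmployeeOpt) {
    def totalSalary(employees: Vector[Employee]): Long =
      employees.map(_.salary).sum   // inside the scope, Vector[Employee] uses the EmployeeVector layout
  }
  // callers outside the scope keep passing ordinary Vector[Employee] values;
  // conversions are inserted automatically at the scope boundary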
Composition ● Code can be – Left untransformed (using the original representation) – Transformed using different representations ● Calling combinations to consider: original code calling original code, original code calling transformed code (and back), transformed code calling code under the same transformation, transformed code calling code under a different transformation
Composition ● Original code calling original code: the easy one, do nothing
Composition ● Original code calling transformed code (or back): automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back
Composition ● Transformed code calling code under the same transformation: the hard one, do not introduce any conversions, even across separate compilation
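The boundary conversions could look roughly like the helpers below (hypothetical code, continuing the EmployeeVector sketch; in practice the compiler inserts the calls for you):

  object EmployeeVector {
    // original representation → transformed representation
    def fromVector(v: Vector[Employee]): EmployeeVector =
      new EmployeeVector(v.map(_.id).toArray, v.map(_.name).toArray, v.map(_.salary).toArray)
    // transformed representation → original representation
    def toVector(ev: EmployeeVector): Vector[Employee] =
      ev.ids.indices.map(i => new Employee(ev.ids(i), ev.names(i), ev.salaries(i))).toVector
  }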
Scopes
trait Printer[T] { def print(elements: Vector[T]): Unit }
class EmployeePrinter extends Printer[Employee] { def print(employee: Vector[Employee]) = ... }
Method print in the class implements method print in the trait
Scopes
trait Printer[T] { def print(elements: Vector[T]): Unit }
transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] { def print(employee: Vector[Employee]) = ... }
}
The signature of method print changes according to the transformation → it no longer implements the trait. Taken care of by the compiler for you!
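One way a compiler can keep the trait satisfied is to emit a bridge: a method with the original signature that converts the argument and forwards to the transformed method (an illustrative sketch, not necessarily the plugin's exact mechanism):

  class EmployeePrinter extends Printer[Employee] {
    // transformed method, operating on the efficient representation
    def print(employees: EmployeeVector): Unit = { /* ... */ }
    // generated bridge, so Printer[Employee] is still implemented
    def print(elements: Vector[Employee]): Unit = print(EmployeeVector.fromVector(elements))
  }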
Retrofitting value class status: Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be. On the heap, (3, 5) is a reference to an object with a header plus the fields 3 and 5; as a value class it can be encoded into a single long, (3L << 32) + 5. 14x faster, lower heap requirements.
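A minimal sketch of the encoding with a hand-written value class (illustrative; the talk's transformation derives this automatically for tuples):

  class IntPair(val repr: Long) extends AnyVal {
    def _1: Int = (repr >>> 32).toInt          // high 32 bits
    def _2: Int = repr.toInt                   // low 32 bits
  }
  object IntPair {
    def apply(a: Int, b: Int): IntPair =
      new IntPair((a.toLong << 32) | (b.toLong & 0xFFFFFFFFL))   // pack both ints into one long
  }
  // IntPair(3, 5) carries no heap object: it is just the long (3L << 32) + 5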
Spark ● Optimizations – DataFrames do deforestation – DataFrames do predicate push-down – DataFrames do code generation ● Code is specialized for the data representation ● Functions are specialized for the data representation
Spark ● Optimizations – RDDs do not do deforestation – RDDs do not do predicate push-down – RDDs do not do code generation ● Code is not specialized for the data representation ● Functions are not specialized for the data representation. This is what makes them slower.
Spark ● Optimizations – Datasets do deforestation – Datasets do predicate push-down – Datasets do code generation ● Code is specialized for the data representation ● But user functions are not specialized for the data representation, which is why Datasets are only mid-way
User Functions: the data is stored serialized/encoded; to run a user function f: X → Y, each element must be decoded, passed through f, and encoded again. The goal is a modified user function (automatically derived by the compiler) that works on the encoded data.
User Functions: the modified user function (automatically derived by the compiler) maps encoded data directly to encoded data, with no decode/encode steps. Nowhere near as simple as it looks.
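An illustrative sketch of the difference (hypothetical names, not Spark's actual encoder API):

  class Employee(val id: Long, val name: String, val salary: Long)
  // user function, written against the object representation
  val raise: Employee => Employee = e => new Employee(e.id, e.name, e.salary + 1000)
  // running it over encoded data today conceptually means, per element:
  //   encode(raise(decode(encodedRow)))
  // the compiler-derived version would instead update only the salary column,
  // never materializing Employee objects, e.g. (hypothetical):
  //   salaryColumn.map(_ + 1000L)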
Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope ● Reuse the machinery in miniboxing scala-miniboxing.org
Challenge: Internal API changes ● Spark internals rely on Iterator[T] – Requires materializing values – Needs to be replaced throughout the code base by rather complex buffers ● Solution: Extensive refactoring/rewrite
Challenge: Automation ● Existing code should run out of the box ● Solution: – Adapt data-centric metaprogramming to Spark – Trade generality for simplicity – Do the right thing for most of the cases Where are we now?
Prototype Hack ● Modified version of Spark core – RDD data representation is configurable ● It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done
Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Opaque data → Structured data – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!
Deforestation and Language Semantics ● Notice that we changed language semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
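A small illustration of the reordering with plain Scala collections (not the transformation itself):

  val xs = Vector(1, 2, 3)
  // eager: all the map effects run first, then all the filter effects
  xs.map { x => println(s"map $x"); x * 2 }.filter { x => println(s"filter $x"); x > 2 }
  // lazy (deforested): the effects interleave, one element at a time
  xs.view.map { x => println(s"map $x"); x * 2 }.filter { x => println(s"filter $x"); x > 2 }.toVector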
Deforestation and Language Semantics ● Such transformations are only acceptable with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in
Code Generation ● Also known as – Deep Embedding – Multi-Stage Programming ● Awesome speedups, but restricted to small DSLs ● SparkSQL uses code gen to improve performance – By 2-4x over Spark
Low-level Optimizers ● Java JIT Compiler – Access to the low-level code – Can assume a (local) closed world – Can speculate based on profiles ● Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics
Scala Macros ● Many optimizations can be done with macros – :) Lots of power – :( Lots of responsibility ● Scala compiler invariants ● Object-oriented model ● Modularity ● Can we restrict macros so they're safer? – Data-centric metaprogramming