Slide 1

Slide 1 text

Data-centric Metaprogramming Vlad Ureche

Slide 2

Slide 2 text

Vlad Ureche @VladUreche

Slide 3

Slide 3 text

Vlad Ureche Software Engineer at Cyberhaven.io scala-miniboxing.org Ex-Scala Team at EPFL

Slide 4

Slide 4 text

STOP Please ask if things are not clear!

Slide 5

Slide 5 text

Motivation Transformation Applications Challenges Conclusion Functions

Slide 6

Slide 6 text

Object Composition

Slide 7

Slide 7 text

Object Composition class Vector[T] { … }

Slide 8

Slide 8 text

Object Composition class Vector[T] { … } The Vector collection in the Scala library

Slide 9

Slide 9 text

Object Composition class Employee(...) ID NAME SALARY class Vector[T] { … } The Vector collection in the Scala library

Slide 10

Slide 10 text

Object Composition class Employee(...) ID NAME SALARY class Vector[T] { … } The Vector collection in the Scala library Corresponds to a table row

Slide 11

Slide 11 text

Object Composition class Employee(...) ID NAME SALARY class Vector[T] { … }

Slide 12

Slide 12 text

Object Composition class Employee(...) ID NAME SALARY class Vector[T] { … }

Slide 13

Slide 13 text

Object Composition class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY class Vector[T] { … }

Slide 14

Slide 14 text

Object Composition class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY class Vector[T] { … } Traversal requires dereferencing a pointer for each employee.

Slide 15

Slide 15 text

A Better Representation Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 16

Slide 16 text

A Better Representation NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 17

Slide 17 text

A Better Representation ● more efficient heap usage ● faster iteration NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY
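The two layouts on this slide can be sketched in plain Scala. This is a minimal illustration, not the plugin's code: the field types of Employee, the three-array layout of EmployeeVector and the helper names (toColumns, totalSalaryRows, totalSalaryColumns) are all assumptions made for the example.

```scala
object ColumnLayoutSketch {
  // Row-oriented element: every Employee is a separate heap object.
  // Field types are assumed; the slides only name ID, NAME and SALARY.
  final case class Employee(id: Int, name: String, salary: Float)

  // Column-oriented layout: one array per field (hypothetical layout).
  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float]) {
    def length: Int = ids.length
    def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))
  }

  // Copy a Vector[Employee] into the columnar layout.
  def toColumns(v: Vector[Employee]): EmployeeVector =
    new EmployeeVector(
      v.map(_.id).toArray,
      v.map(_.name).toArray,
      v.map(_.salary).toArray)

  // Row layout: each step follows a reference to an Employee object.
  def totalSalaryRows(v: Vector[Employee]): Float =
    v.foldLeft(0f)((sum, e) => sum + e.salary)

  // Column layout: a linear scan over one primitive array.
  def totalSalaryColumns(ev: EmployeeVector): Float = {
    var sum = 0f
    var i = 0
    while (i < ev.length) {
      sum += ev.salaries(i)
      i += 1
    }
    sum
  }
}
```

Both traversals compute the same total; only the memory access pattern differs.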

Slide 18

Slide 18 text

The Problem ● Vector[T] is unaware of Employee

Slide 19

Slide 19 text

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal

Slide 20

Slide 20 text

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other constructs also affected

Slide 21

Slide 21 text

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other constructs also affected – Generics (including all collections)

Slide 22

Slide 22 text

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions

Slide 23

Slide 23 text

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions ● We know better representations

Slide 24

Slide 24 text

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions ● We know better representations – Manual changes don't scale

Slide 25

Slide 25 text

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions ● We know better representations – Manual changes don't scale – The compiler should do that

Slide 26

Slide 26 text

Current Optimizers

Slide 27

Slide 27 text

Current Optimizers What about the Scala.js optimizer?

Slide 28

Slide 28 text

Current Optimizers What about the Scala.js optimizer? What about the Dotty Linker?

Slide 29

Slide 29 text

Current Optimizers What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 30

Slide 30 text

Current Optimizers ● They do a great job What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 31

Slide 31 text

Current Optimizers ● They do a great job – But have to respect semantics What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 32

Slide 32 text

Current Optimizers ● They do a great job – But have to respect semantics – Support every corner case What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 33

Slide 33 text

Current Optimizers ● They do a great job – But have to respect semantics – Support every corner case – Have to be conservative :( What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 34

Slide 34 text

Current Optimizers ● They do a great job – But have to respect semantics – Support every corner case – Have to be conservative :( ● Programmers have control What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 35

Slide 35 text

Current Optimizers ● They do a great job – But have to respect semantics – Support every corner case – Have to be conservative :( ● Programmers have control – What/When/How is accessed What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 36

Slide 36 text

Current Optimizers ● They do a great job – But have to respect semantics – Support every corner case – Have to be conservative :( ● Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?

Slide 37

Slide 37 text

Current Optimizers ● They do a great job – But have to respect semantics – Support every corner case – Have to be conservative :( ● Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native? Challenge: No means of telling the compiler what/when to speculate

Slide 38

Slide 38 text

Choice: Safe or Fast

Slide 39

Slide 39 text

Choice: Safe or Fast This is where my work comes in...

Slide 40

Slide 40 text

Data-Centric Metaprogramming ● compiler plug-in that allows ● Tuning data representation ● Website: scala-ildl.org

Slide 41

Slide 41 text

Motivation Transformation Applications Challenges Conclusion Functions

Slide 42

Slide 42 text

Transformation Definition Application

Slide 43

Slide 43 text

Transformation Definition Application ● can't be automated ● based on experience ● based on speculation ● one-time effort

Slide 44

Slide 44 text

Transformation programmer Definition Application ● can't be automated ● based on experience ● based on speculation ● one-time effort

Slide 45

Slide 45 text

Transformation programmer Definition Application ● can't be automated ● based on experience ● based on speculation ● one-time effort ● repetitive and complex ● affects code readability ● is verbose ● is error-prone

Slide 46

Slide 46 text

Transformation programmer Definition Application ● can't be automated ● based on experience ● based on speculation ● one-time effort ● repetitive and complex ● affects code readability ● is verbose ● is error-prone compiler (automated)

Slide 47

Slide 47 text

Transformation programmer Definition Application ● can't be automated ● based on experience ● based on speculation ● one-time effort ● repetitive and complex ● affects code readability ● is verbose ● is error-prone compiler (automated)

Slide 48

Slide 48 text

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

Slide 49

Slide 49 text

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

What to transform? What to transform to?

Slide 50

Slide 50 text

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

How to transform?

Slide 51

Slide 51 text

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector
  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...
  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

How to run methods on the updated representation?
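To make the three questions concrete, here is a self-contained sketch of such a description object with the elided bodies filled in. It is an assumption-laden stand-in, not the plugin's API: the Transformation trait below is a local placeholder, the EmployeeVector layout is invented for the example, and the bypass methods take the receiver as an explicit first parameter, which the slide does not show.

```scala
object TransformationSketch {
  final case class Employee(id: Int, name: String, salary: Float)

  // Hypothetical columnar layout (not shown on the slides).
  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  // Local placeholder for the plugin's Transformation trait (assumption).
  trait Transformation {
    type Target
    type Result
    def toResult(t: Target): Result
    def toTarget(r: Result): Target
  }

  object VectorOfEmployeeOpt extends Transformation {
    // What to transform, and what to transform it to.
    type Target = Vector[Employee]
    type Result = EmployeeVector

    // How to transform: conversions in both directions.
    def toResult(t: Target): Result =
      new EmployeeVector(
        t.map(_.id).toArray,
        t.map(_.name).toArray,
        t.map(_.salary).toArray)

    def toTarget(r: Result): Target =
      Vector.tabulate(r.ids.length)(i => bypass_apply(r, i))

    // How to run methods on the new representation.
    // The explicit receiver parameter `r` is an assumption of this sketch.
    def bypass_length(r: Result): Int = r.ids.length

    def bypass_apply(r: Result, i: Int): Employee =
      Employee(r.ids(i), r.names(i), r.salaries(i))

    def bypass_update(r: Result, i: Int, v: Employee): Unit = {
      r.ids(i) = v.id
      r.names(i) = v.name
      r.salaries(i) = v.salary
    }
  }
}
```

A round trip through toResult and toTarget should give back an equal Vector[Employee], which is the property the conversions need to preserve.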

Slide 52

Slide 52 text

Transformation programmer Definition Application ● can't be automated ● based on experience ● based on speculation ● one-time effort ● repetitive and complex ● affects code readability ● is verbose ● is error-prone compiler (automated)

Slide 53

Slide 53 text

Transformation programmer Definition Application ● can't be automated ● based on experience ● based on speculation ● one-time effort ● repetitive and complex ● affects code readability ● is verbose ● is error-prone compiler (automated)

Slide 54

Slide 54 text

http://infoscience.epfl.ch/record/207050?ln=en

Slide 55

Slide 55 text

Motivation Transformation Applications Challenges Conclusion Functions

Slide 56

Slide 56 text

Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation? Composition

Slide 57

Slide 57 text

Scenario class Employee(...) ID NAME SALARY class Vector[T] { … }

Slide 58

Slide 58 text

Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY class Vector[T] { … }

Slide 59

Slide 59 text

Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY

Slide 60

Slide 60 text

Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT

Slide 61

Slide 61 text

Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT

Slide 62

Slide 62 text

Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT Oooops...

Slide 63

Slide 63 text

Open World Assumption ● Globally anything can happen

Slide 64

Slide 64 text

Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee

Slide 65

Slide 65 text

Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How?

Slide 66

Slide 66 text

Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!

Slide 67

Slide 67 text

Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary )

Slide 68

Slide 68 text

Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }

Slide 69

Slide 69 text

Scopes

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees) yield employee.copy(
      salary = (1 + by) * employee.salary
    )
}

Now the method operates on the EmployeeVector representation.

Slide 70

Slide 70 text

Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope"

Slide 71

Slide 71 text

Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope" No, it's not a macro. It's a marker for the compiler plugin. (You can't do this with macros)

Slide 72

Slide 72 text

Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope"

Slide 73

Slide 73 text

Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope" ● Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
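The effect of a scope can be approximated by hand. The sketch below shows indexSalary as written on the slides, a hand-written stand-in for what a transformed version might look like, and an untransformed caller whose values pass through conversions at the scope boundary. The names indexSalaryOpt and indexViaColumns, the conversion helpers and the EmployeeVector layout are assumptions for illustration; the plugin's actual output is not shown in the slides.

```scala
object ScopeSketch {
  final case class Employee(id: Int, name: String, salary: Float)

  // Hypothetical columnar layout (not shown on the slides).
  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  def toResult(v: Vector[Employee]): EmployeeVector =
    new EmployeeVector(
      v.map(_.id).toArray,
      v.map(_.name).toArray,
      v.map(_.salary).toArray)

  def toTarget(ev: EmployeeVector): Vector[Employee] =
    Vector.tabulate(ev.ids.length)(i =>
      Employee(ev.ids(i), ev.names(i), ev.salaries(i)))

  // What the programmer writes inside the scope (from the slides).
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee <- employees) yield employee.copy(
      salary = (1 + by) * employee.salary)

  // Hand-written stand-in for the transformed method: only the
  // salary column is recomputed, no Employee objects are allocated.
  def indexSalaryOpt(employees: EmployeeVector, by: Float): EmployeeVector =
    new EmployeeVector(
      employees.ids.clone(),
      employees.names.clone(),
      employees.salaries.map(s => (1 + by) * s))

  // Untransformed caller: incoming and outgoing values are converted
  // at the boundary of the scope.
  def indexViaColumns(employees: Vector[Employee], by: Float): Vector[Employee] =
    toTarget(indexSalaryOpt(toResult(employees), by))
}
```

Seen from outside the scope, both paths return the same Vector[Employee].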

Slide 74

Slide 74 text

Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation? Composition

Slide 75

Slide 75 text

Best Representation? Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 76

Slide 76 text

Best Representation? It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 77

Slide 77 text

Best ...? NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 78

Slide 78 text

Best ...? Compact binary repr. NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 79

Slide 79 text

Best ...? EmployeeJSON { id: 123, name: "John Doe", salary: 100 } Compact binary repr. NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 80

Slide 80 text

Scopes allow mixing data representations transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }

Slide 81

Slide 81 text

Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the EmployeeVector representation.

Slide 82

Slide 82 text

Scopes transform(VectorOfEmployeeCompact) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the compact binary representation.

Slide 83

Slide 83 text

Scopes transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the JSON-based representation.

Slide 84

Slide 84 text

Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation? Composition

Slide 85

Slide 85 text

Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f)

Slide 86

Slide 86 text

Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... }

Slide 87

Slide 87 text

Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... }

Slide 88

Slide 88 text

Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } ● Original code (using the default representation)

Slide 89

Slide 89 text

Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } ● Original code (using the default representation) ● Transformed code (using a different representation)

Slide 90

Slide 90 text

Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } ● Original code (using the default representation) ● Transformed code (using a different representation) ● Calls between them

Slide 91

Slide 91 text

Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } ● Original code (using the default representation) ● Transformed code (using a different representation) ● Calls between them ???

Slide 92

Slide 92 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 93

Slide 93 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 94

Slide 94 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation Easy one. Do nothing

Slide 95

Slide 95 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 96

Slide 96 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 97

Slide 97 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 98

Slide 98 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation Automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back

Slide 99

Slide 99 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 100

Slide 100 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 101

Slide 101 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 102

Slide 102 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation Hard one. Do not introduce any conversions. Even across separate compilation

Slide 103

Slide 103 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 104

Slide 104 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation Hard one. Automatically introduce double conversions (and warn the programmer), e.g. EmployeeVector → Vector[Employee] → CompactEmpVector

Slide 105

Slide 105 text

Composition calling ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 106

Slide 106 text

Composition calling overriding ● Original code ● Transformed code ● Original code ● Transformed code ● Same transformation ● Different transformation

Slide 107

Slide 107 text

Scopes trait Printer[T] { def print(elements: Vector[T]): Unit }

Slide 108

Slide 108 text

Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }

Slide 109

Slide 109 text

trait Printer[T] { def print(elements: Vector[T]): Unit } class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } Scopes Method print in the class implements method print in the trait

Slide 110

Slide 110 text

Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }

Slide 111

Slide 111 text

Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt) { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } }

Slide 112

Slide 112 text

Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt) { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait

Slide 113

Slide 113 text

Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt) { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait Taken care of by the compiler for you!
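One way to picture how the override can be preserved is a bridge method: the class keeps a method with the trait's original signature that converts its argument and forwards to the transformed method. The sketch below is hand-written and rests on assumptions: the name printOpt, the StringBuilder used to capture output, and the bridge approach itself are illustrative, since the slides do not say how the compiler handles this case.

```scala
object BridgeSketch {
  final case class Employee(id: Int, name: String, salary: Float)

  // Hypothetical columnar layout (not shown on the slides).
  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  def toResult(v: Vector[Employee]): EmployeeVector =
    new EmployeeVector(
      v.map(_.id).toArray,
      v.map(_.name).toArray,
      v.map(_.salary).toArray)

  trait Printer[T] {
    def print(elements: Vector[T]): Unit
  }

  class EmployeePrinter extends Printer[Employee] {
    // Output is collected in a buffer so the sketch is easy to check.
    val out = new StringBuilder

    // Stand-in for the transformed method: it takes the new representation,
    // so on its own it would no longer implement Printer.print.
    def printOpt(elements: EmployeeVector): Unit = {
      var i = 0
      while (i < elements.ids.length) {
        out.append(elements.names(i)).append('\n')
        i += 1
      }
    }

    // Bridge with the original signature: convert, then forward.
    def print(elements: Vector[Employee]): Unit =
      printOpt(toResult(elements))
  }
}
```

Callers that only know Printer[Employee] keep working, at the cost of one conversion per call through the bridge.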

Slide 114

Slide 114 text

Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation? Composition

Slide 115

Slide 115 text

Column-oriented Storage NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY

Slide 116

Slide 116 text

Column-oriented Storage NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY iteration is 5x faster

Slide 117

Slide 117 text

Retrofitting value class status (3,5) 3 5 Header reference

Slide 118

Slide 118 text

Retrofitting value class status Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference

Slide 119

Slide 119 text

Retrofitting value class status (3L << 32) + 5 (3,5) Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference

Slide 120

Slide 120 text

Retrofitting value class status (3L << 32) + 5 (3,5) Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference 14x faster, lower heap requirements
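The encoding of a pair of Ints in a single Long can be sketched as follows. The helper names pack, first and second are made up for this example. Note that in Scala + binds tighter than <<, so the shift has to be parenthesized; the sketch also masks the low half so that negative second components do not spill into the high half.

```scala
object PackedPairSketch {
  // High 32 bits hold the first component, low 32 bits the second.
  def pack(a: Int, b: Int): Long =
    (a.toLong << 32) | (b.toLong & 0xFFFFFFFFL)

  // Unsigned shift brings the high half down; toInt keeps the low 32 bits.
  def first(p: Long): Int = (p >>> 32).toInt

  def second(p: Long): Int = p.toInt
}
```

For example, PackedPairSketch.pack(3, 5) is the Long with 3 in the upper half and 5 in the lower half, and first and second recover 3 and 5 without allocating a tuple object.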

Slide 121

Slide 121 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum

Slide 122

Slide 122 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4)

Slide 123

Slide 123 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8)

Slide 124

Slide 124 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18

Slide 125

Slide 125 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18

Slide 126

Slide 126 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum }

Slide 127

Slide 127 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function

Slide 128

Slide 128 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function

Slide 129

Slide 129 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18

Slide 130

Slide 130 text

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18 6x faster
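The "accumulate function, accumulate function, compute" idea can be imitated with a small wrapper that composes the mapped functions and only traverses the list when sum is called. This is a sketch under assumptions: FusedList and fuse are invented names, and the ListDeforestation transformation in the slides may work differently.

```scala
object DeforestationSketch {
  // Eager pipeline: builds List(2,3,4), then List(4,6,8), then sums.
  def eager(xs: List[Int]): Int = xs.map(_ + 1).map(_ * 2).sum

  // Holds the source list plus the composition of all mapped functions.
  final class FusedList[A, B](source: List[A], f: A => B) {
    // Accumulate the function; no intermediate list is built.
    def map[C](g: B => C): FusedList[A, C] =
      new FusedList(source, f.andThen(g))

    // Single traversal that applies the composed function per element.
    def sum(implicit num: Numeric[B]): B =
      source.foldLeft(num.zero)((acc, a) => num.plus(acc, f(a)))
  }

  def fuse[A](xs: List[A]): FusedList[A, A] =
    new FusedList(xs, (a: A) => a)

  // Same pipeline shape, one pass over the input.
  def fused(xs: List[Int]): Int = fuse(xs).map(_ + 1).map(_ * 2).sum
}
```

Both versions return 18 for List(1, 2, 3); the fused one allocates no intermediate lists.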

Slide 131

Slide 131 text

Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation? Composition

Slide 132

Slide 132 text

Research ahead* ! * This may not make it into a product. But you can play with it nevertheless.

Slide 133

Slide 133 text

Spark

Slide 134

Slide 134 text

Spark RDD (Resilient Distributed Dataset)

Slide 135

Slide 135 text

Spark RDD (Resilient Distributed Dataset) Key abstraction in Spark

Slide 136

Slide 136 text

Spark RDD (Resilient Distributed Dataset)

Slide 137

Slide 137 text

Spark RDD (Resilient Distributed Dataset)

Slide 138

Slide 138 text

Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

Slide 139

Slide 139 text

Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file) Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file)

Slide 140

Slide 140 text

Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file) Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file) How does mapping work?

Slide 141

Slide 141 text

Mapping an RDD X Y user function f

Slide 142

Slide 142 text

Mapping an RDD serialized data encoded data X Y user function f decode

Slide 143

Slide 143 text

Mapping an RDD serialized data encoded data X Y encoded data user function f decode encode

Slide 144

Slide 144 text

Mapping an RDD serialized data encoded data X Y encoded data user function f decode encode Allocate object Allocate object

Slide 145

Slide 145 text

Mapping an RDD serialized data encoded data X Y encoded data user function f decode encode Allocate object Allocate object

Slide 146

Slide 146 text

Mapping an RDD serialized data encoded data X Y encoded data user function f decode encode

Slide 147

Slide 147 text

Mapping an RDD serialized data encoded data X Y encoded data user function f decode encode Modified user function (automatically derived by the compiler)

Slide 148

Slide 148 text

Mapping an RDD serialized data encoded data encoded data Modified user function (automatically derived by the compiler)

Slide 149

Slide 149 text

Mapping an RDD serialized data encoded data encoded data Modified user function (automatically derived by the compiler) Nowhere near as simple as it looks
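The difference between the decode/apply/encode pipeline and a function that works directly on encoded data can be sketched without Spark. Everything here is an assumption made for illustration: the Point record, the 8-byte encoding via java.nio.ByteBuffer and the user function f are invented, and Spark's real serialization formats are different.

```scala
import java.nio.ByteBuffer

object RddMapSketch {
  final case class Point(x: Int, y: Int)

  // Toy encoding: two 4-byte big-endian Ints (not Spark's format).
  def encode(p: Point): Array[Byte] =
    ByteBuffer.allocate(8).putInt(p.x).putInt(p.y).array()

  def decode(bytes: Array[Byte]): Point = {
    val buf = ByteBuffer.wrap(bytes)
    Point(buf.getInt(), buf.getInt())
  }

  // Example user function.
  def f(p: Point): Point = Point(p.x + 1, p.y * 2)

  // Baseline: decode, apply f, encode. Allocates two Point objects per record.
  def mapBaseline(bytes: Array[Byte]): Array[Byte] =
    encode(f(decode(bytes)))

  // Hand-derived equivalent of f that reads and writes the encoded form
  // directly, so no Point object is allocated.
  def mapEncoded(bytes: Array[Byte]): Array[Byte] = {
    val in = ByteBuffer.wrap(bytes)
    val x = in.getInt(0)
    val y = in.getInt(4)
    ByteBuffer.allocate(8).putInt(x + 1).putInt(y * 2).array()
  }
}
```

Here mapEncoded was written by hand; the point of the slides is that such a function would be derived from f by the compiler.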

Slide 150

Slide 150 text

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method

Slide 151

Slide 151 text

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings

Slide 152

Slide 152 text

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call

Slide 153

Slide 153 text

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope

Slide 154

Slide 154 text

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope ● Reuse the machinery in miniboxing scala-miniboxing.org

Slide 155

Slide 155 text

Challenge: Internal API

Slide 156

Slide 156 text

Challenge: Internal API ● Spark internals rely on Iterator[T] – Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers

Slide 157

Slide 157 text

Challenge: Internal API ● Spark internals rely on Iterator[T] – Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers ● Solution: Extensive refactoring/rewrite

Slide 158

Slide 158 text

Prototype

Slide 159

Slide 159 text

Prototype Hack

Slide 160

Slide 160 text

Prototype Hack ● Modified version of Spark core – RDD data representation is configurable

Slide 161

Slide 161 text

Prototype Hack ● Modified version of Spark core – RDD data representation is configurable ● It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done

Slide 162

Slide 162 text

Prototype Hack sc.parallelize(/* 1 million */ records). map(x => ...). filter(x => ...). collect()

Slide 163

Slide 163 text

sc.parallelize(/* 1 million */ records). map(x => ...). filter(x => ...). collect() Prototype Hack More details in my talk at Spark Summit EU 2015

Slide 164

Slide 164 text

Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation? Composition

Slide 165

Slide 165 text

Conclusion ● Object-oriented composition → inefficient representation

Slide 166

Slide 166 text

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming

Slide 167

Slide 167 text

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Use the best representation for your data!

Slide 168

Slide 168 text

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Use the best representation for your data! – Is it possible? Yes.

Slide 169

Slide 169 text

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really.

Slide 170

Slide 170 text

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!

Slide 171

Slide 171 text

Thank you! Check out scala-ildl.org.

Slide 172

Slide 172 text

Thank you! Check out scala-ildl.org.

Slide 174

Slide 174 text

Deforestation and Language Semantics ● Notice that we changed language semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
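The reordering of effects is easy to observe by logging inside the mapped functions. In this sketch the lazy pipeline is simulated with a standard Iterator rather than with the plugin, and the log messages and method names are made up for the example.

```scala
import scala.collection.mutable.ListBuffer

object EffectOrderSketch {
  // Eager: the first map finishes over the whole list before the second starts.
  def eagerTrace(): (Int, List[String]) = {
    val log = ListBuffer.empty[String]
    val total = List(1, 2, 3)
      .map { x => log += s"add $x"; x + 1 }
      .map { x => log += s"mul $x"; x * 2 }
      .sum
    (total, log.toList)
  }

  // Lazy (iterator-based): both functions run per element, interleaved.
  def lazyTrace(): (Int, List[String]) = {
    val log = ListBuffer.empty[String]
    val total = List(1, 2, 3).iterator
      .map { x => log += s"add $x"; x + 1 }
      .map { x => log += s"mul $x"; x * 2 }
      .sum
    (total, log.toList)
  }
}
```

Both pipelines compute 18, but the eager log reads add, add, add, mul, mul, mul while the lazy log alternates add and mul, so code that depends on the order of side effects can behave differently after the transformation.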

Slide 175

Slide 175 text

Deforestation and Language Semantics ● Such transformations are only acceptable with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in

Slide 176

Slide 176 text

Code Generation ● Also known as – Deep Embedding – Multi-Stage Programming ● Awesome speedups, but restricted to small DSLs ● SparkSQL uses code gen to improve performance – By 2-4x over Spark

Slide 177

Slide 177 text

Low-level Optimizers ● Java JIT Compiler – Access to the low-level code – Can assume a (local) closed world – Can speculate based on profiles

Slide 178

Slide 178 text

Low-level Optimizers ● Java JIT Compiler – Access to the low-level code – Can assume a (local) closed world – Can speculate based on profiles ● Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics

Slide 179

Slide 179 text

Scala Macros ● Many optimizations can be done with macros – :) Lots of power – :( Lots of responsibility ● Scala compiler invariants ● Object-oriented model ● Modularity

Slide 180

Slide 180 text

Scala Macros ● Many optimizations can be done with macros – :) Lots of power – :( Lots of responsibility ● Scala compiler invariants ● Object-oriented model ● Modularity ● Can we restrict macros so they're safer? – Data-centric metaprogramming