Slide 1

DATA-CENTRIC METAPROGRAMMING Vlad Ureche

Slide 2

Vlad Ureche: PhD student in the Scala Team @ EPFL, soon to graduate ;) ● Working on program transformations, focusing on data representation ● Author of miniboxing, which improves generics performance by up to 20x ● Contributor to the Scala compiler and the scaladoc tool. @VladUreche ● [email protected] ● scala-miniboxing.org

Slide 3

Research ahead*! * This may not make it into a product, but you can play with it nevertheless.

Slide 4

STOP! Please ask if things are not clear!

Slide 5

Motivation Transformation Applications Challenges Conclusion Spark

Slide 6

Motivation. Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data, used with permission.

Slide 7

Motivation. Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data, used with permission. Performance gap between RDDs and DataFrames.

Slide 8

Motivation: RDD vs. DataFrame

Slide 9

Motivation: RDD (● strongly typed ● slower) vs. DataFrame

Slide 10

Motivation: RDD (● strongly typed ● slower) vs. DataFrame (● dynamically typed ● faster)

Slide 11

Motivation: RDD (● strongly typed ● slower) vs. DataFrame (● dynamically typed ● faster)

Slide 12

Motivation: RDD (● strongly typed ● slower) vs. DataFrame (● dynamically typed ● faster) vs. ? (● strongly typed ● faster)

Slide 13

Motivation: RDD (● strongly typed ● slower) vs. DataFrame (● dynamically typed ● faster) vs. Dataset (● strongly typed ● faster)

Slide 14

Motivation: RDD (● strongly typed ● slower) vs. DataFrame (● dynamically typed ● faster) vs. Dataset (● strongly typed ● faster, mid-way)

Slide 15

Motivation: RDD (● strongly typed ● slower) vs. DataFrame (● dynamically typed ● faster) vs. Dataset (● strongly typed ● faster, mid-way). Why just mid-way? What can we do to speed them up?

Slide 16

Object Composition

Slide 17

Object Composition class Vector[T] { … }

Slide 18

Object Composition class Vector[T] { … } The Vector collection in the Scala library

Slide 19

Object Composition: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }

Slide 20

Object Composition: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … } (the Vector collection in the Scala library). An Employee corresponds to a table row.

Slide 21

Object Composition: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }

Slide 22

Object Composition: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }

Slide 23

Object Composition: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }. [diagram: Vector[Employee] holds a reference to each Employee object (ID, NAME, SALARY)]

Slide 24

Object Composition: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }. [diagram: Vector[Employee] holds a reference to each Employee object] Traversal requires dereferencing a pointer for each employee.

Slide 25

A Better Representation. [diagram: Vector[Employee] holds a reference to each Employee object (ID, NAME, SALARY)]

Slide 26

A Better Representation. [diagram: EmployeeVector stores the IDs, NAMEs and SALARYs in separate columns, next to Vector[Employee] holding a reference per Employee]

Slide 27

A Better Representation ● more efficient heap usage ● faster iteration. [diagram: column-wise EmployeeVector (ID, NAME, SALARY columns) vs. Vector[Employee] holding a reference per employee]
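
To make the two layouts concrete, here is a minimal sketch in plain Scala (assuming a simple Employee case class; the names are illustrative, not code generated by the plugin):

case class Employee(id: Int, name: String, salary: Float)

// Column-wise layout: one array per field, so a scan over salaries
// walks a single contiguous array instead of chasing one pointer per employee.
class EmployeeVector(
    val ids: Array[Int],
    val names: Array[String],
    val salaries: Array[Float]) {

  def length: Int = ids.length

  // Rebuild a row-wise Employee on demand.
  def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))

  // Example traversal that never touches an Employee object.
  def totalSalary: Float = {
    var sum = 0.0f
    var i = 0
    while (i < salaries.length) { sum += salaries(i); i += 1 }
    sum
  }
}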

Slide 28

The Problem ● Vector[T] is unaware of Employee

Slide 29

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal

Slide 30

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other classes also affected

Slide 31

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other classes also affected – Spark pain point: Functions/closures

Slide 32

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other classes also affected – Spark pain point: Functions/closures – We'd like a "structured" representation throughout

Slide 33

The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other classes also affected – Spark pain point: Functions/closures – We'd like a "structured" representation throughout Challenge: No means of communicating this to the compiler

Slide 34

Choice: Safe or Fast

Slide 35

Choice: Safe or Fast This is where my work comes in...

Slide 36

Data-Centric Metaprogramming ● a compiler plug-in that allows tuning the data representation ● Website: scala-ildl.org

Slide 37

Motivation Transformation Applications Challenges Conclusion Spark

Slide 38

Transformation Definition Application

Slide 39

Transformation. Definition: ● can't be automated ● based on experience ● based on speculation ● one-time effort. Application.

Slide 40

Transformation. Definition (programmer): ● can't be automated ● based on experience ● based on speculation ● one-time effort. Application.

Slide 41

Transformation. Definition (programmer): ● can't be automated ● based on experience ● based on speculation ● one-time effort. Application: ● repetitive and complex ● affects code readability ● is verbose ● is error-prone.

Slide 42

Transformation. Definition (programmer): ● can't be automated ● based on experience ● based on speculation ● one-time effort. Application: ● repetitive and complex ● affects code readability ● is verbose ● is error-prone → done by the compiler (automated).

Slide 43

Transformation. Definition (programmer): ● can't be automated ● based on experience ● based on speculation ● one-time effort. Application: ● repetitive and complex ● affects code readability ● is verbose ● is error-prone → done by the compiler (automated).

Slide 44

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector

  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

Slide 45

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector

  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

What to transform? What to transform to? (Target and Result)

Slide 46

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector

  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

How to transform? (toResult and toTarget)

Slide 47

Data-Centric Metaprogramming

object VectorOfEmployeeOpt extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector

  def toResult(t: Target): Result = ...
  def toTarget(t: Result): Target = ...

  def bypass_length: Int = ...
  def bypass_apply(i: Int): Employee = ...
  def bypass_update(i: Int, v: Employee) = ...
  def bypass_toString: String = ...
  ...
}

How to run methods on the updated representation? (the bypass_* methods)
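
For intuition, here is a hedged sketch of how the two conversions could be written for the column-wise EmployeeVector sketched earlier. The Transformation trait below is a minimal stand-in so the example is self-contained, and VectorOfEmployeeOptSketch is an illustrative name; the real trait and generated code come from the scala-ildl plugin.

trait Transformation {
  type Target
  type Result
  def toResult(t: Target): Result
  def toTarget(r: Result): Target
}

object VectorOfEmployeeOptSketch extends Transformation {
  type Target = Vector[Employee]
  type Result = EmployeeVector

  // Split the employees into one array per field.
  def toResult(t: Target): Result =
    new EmployeeVector(
      t.map(_.id).toArray,
      t.map(_.name).toArray,
      t.map(_.salary).toArray)

  // Zip the columns back into row-wise Employee objects.
  def toTarget(r: Result): Target =
    Vector.tabulate(r.length)(r.apply)
}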

Slide 48

Transformation. Definition (programmer): ● can't be automated ● based on experience ● based on speculation ● one-time effort. Application: ● repetitive and complex ● affects code readability ● is verbose ● is error-prone → done by the compiler (automated).

Slide 49

Transformation. Definition (programmer): ● can't be automated ● based on experience ● based on speculation ● one-time effort. Application: ● repetitive and complex ● affects code readability ● is verbose ● is error-prone → done by the compiler (automated).

Slide 50

http://infoscience.epfl.ch/record/207050?ln=en

Slide 51

Motivation Transformation Applications Challenges Conclusion Spark

Slide 52

Motivation Transformation Applications Challenges Conclusion Spark Open World Best Representation? Composition

Slide 53

Scenario: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }

Slide 54

Scenario: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }. [diagram: Vector[Employee] holds a reference to each Employee object]

Slide 55

Scenario: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }. [diagrams: Vector[Employee] with references; column-wise EmployeeVector with ID, NAME, SALARY columns]

Slide 56

Scenario: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }. [diagrams: Vector[Employee] with references; column-wise EmployeeVector] class NewEmployee(...) extends Employee(...) [fields: ID, NAME, SALARY, DEPT]

Slide 57

Scenario: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }. [diagrams: Vector[Employee] with references; column-wise EmployeeVector] class NewEmployee(...) extends Employee(...) [fields: ID, NAME, SALARY, DEPT]

Slide 58

Scenario: class Employee(...) [fields: ID, NAME, SALARY]; class Vector[T] { … }. [diagrams: Vector[Employee] with references; column-wise EmployeeVector] class NewEmployee(...) extends Employee(...) [fields: ID, NAME, SALARY, DEPT] Oooops...

Slide 59

Open World Assumption ● Globally anything can happen

Slide 60

Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee

Slide 61

Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How?

Slide 62

Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!

Slide 63

Scopes

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Slide 64

Scopes

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Slide 65

Scopes

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Now the method operates on the EmployeeVector representation.

Slide 66

Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope"

Slide 67

Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope" ● Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
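
As a hedged illustration of those boundary conversions (conceptual, not actual plugin output; indexSalary_transformed is a hypothetical name), a call into the indexSalary scope from untransformed code would be bridged roughly like this:

// Outside the scope, indexSalary keeps its original signature.
val employees: Vector[Employee] =
  Vector(Employee(1, "John Doe", 100f), Employee(2, "Jane Doe", 120f))

// Conceptually, the compiler rewrites the call along the lines of
//   toTarget(indexSalary_transformed(toResult(employees), 0.1f))
// so the caller never sees the EmployeeVector representation.
val raised: Vector[Employee] = indexSalary(employees, 0.1f)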

Slide 68

Motivation Transformation Applications Challenges Conclusion Spark Open World Best Representation? Composition

Slide 69

Best Representation? [diagram: Vector[Employee] holding references to Employee objects]

Slide 70

Best Representation? It depends. [diagram: Vector[Employee] holding references to Employee objects]

Slide 71

Best ...? It depends. [diagrams: Vector[Employee]; column-wise EmployeeVector]

Slide 72

Best ...? It depends. [diagrams: Vector[Employee]; column-wise EmployeeVector; Tungsten repr.]

Slide 73

Best ...? It depends. [diagrams: Vector[Employee]; column-wise EmployeeVector; Tungsten repr.; EmployeeJSON { id: 123, name: “John Doe”, salary: 100 }]

Slide 74

Scopes allow mixing data representations

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Slide 75

Scopes

transform(VectorOfEmployeeOpt) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Operating on the EmployeeVector representation.

Slide 76

Scopes

transform(VectorOfEmployeeCompact) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Operating on the compact binary representation.

Slide 77

Scopes

transform(VectorOfEmployeeJSON) {
  def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
    for (employee ← employees)
      yield employee.copy(salary = (1 + by) * employee.salary)
}

Operating on the JSON-based representation.

Slide 78

Motivation Transformation Applications Challenges Conclusion Spark Open World Best Representation? Composition

Slide 79

Composition ● Code can be – Left untransformed (using the original representation) – Transformed using different representations

Slide 80

Composition ● Code can be – Left untransformed (using the original representation) – Transformed using different representations. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 81

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 82

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 83

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation] Easy one. Do nothing.

Slide 84

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 85

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 86

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 87

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation] Automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back.

Slide 88

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 89

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 90

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 91

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation] Hard one. Do not introduce any conversions, even across separate compilation.

Slide 92

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 93

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation] Hard one. Automatically introduce double conversions (and warn the programmer), e.g. EmployeeVector → Vector[Employee] → CompactEmpVector.

Slide 94

Composition. [diagram: original or transformed code calling original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 95

Composition. [diagram: original or transformed code calling or overriding original or transformed code, where the transformed callee may use the same or a different transformation]

Slide 96

Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}

Slide 97

Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}

Method print in the class implements method print in the trait.

Slide 98

Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

class EmployeePrinter extends Printer[Employee] {
  def print(employee: Vector[Employee]) = ...
}

Slide 99

Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}

Slide 100

Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}

The signature of method print changes according to the transformation → it no longer implements the trait.

Slide 101

Scopes

trait Printer[T] {
  def print(elements: Vector[T]): Unit
}

transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] {
    def print(employee: Vector[Employee]) = ...
  }
}

The signature of method print changes according to the transformation → it no longer implements the trait. Taken care of by the compiler for you!
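
One plausible way to reconcile the two signatures, shown as a hedged sketch (the plugin's actual strategy may differ; BridgedEmployeePrinter and print_opt are illustrative names): keep a bridge method with the original signature that converts the argument and forwards to the optimized version.

class BridgedEmployeePrinter extends Printer[Employee] {
  // Optimized entry point, operating on the transformed representation.
  def print_opt(employees: EmployeeVector): Unit =
    println(s"${employees.length} employees")

  // Bridge with the original signature, so Printer[Employee] stays implemented.
  def print(employees: Vector[Employee]): Unit =
    print_opt(VectorOfEmployeeOptSketch.toResult(employees))
}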

Slide 102

Motivation Transformation Applications Challenges Conclusion Spark Open World Best Representation? Composition

Slide 103

Column-oriented Storage. [diagram: column-wise EmployeeVector (ID, NAME, SALARY columns) vs. Vector[Employee] holding references to Employee objects]

Slide 104

Column-oriented Storage. [diagram: column-wise EmployeeVector vs. Vector[Employee]] Iteration is 5x faster.

Slide 105

Retrofitting value class status. [diagram: the tuple (3,5) on the heap: object header, reference, and the fields 3 and 5]

Slide 106

Retrofitting value class status. Tuples in Scala are specialized, but they are still objects (not value classes), so they are not as optimized as they could be. [diagram: the tuple (3,5) on the heap: object header, reference, and the fields 3 and 5]

Slide 107

Retrofitting value class status. Tuples in Scala are specialized, but they are still objects (not value classes), so they are not as optimized as they could be. (3,5) encoded as a single Long: (3L << 32) + 5. [diagram: the tuple (3,5) on the heap: object header, reference, and the fields 3 and 5]

Slide 108

Retrofitting value class status. Tuples in Scala are specialized, but they are still objects (not value classes), so they are not as optimized as they could be. (3,5) encoded as a single Long: (3L << 32) + 5. [diagram: the tuple (3,5) on the heap: object header, reference, and the fields 3 and 5] 14x faster, lower heap requirements.
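
A minimal sketch of that encoding idea (illustrative; the plugin derives such code automatically, and pack/fst/snd are hypothetical helpers): pack the two Ints of the pair into one Long and unpack on access, so no tuple object is ever allocated.

def pack(a: Int, b: Int): Long = (a.toLong << 32) | (b & 0xFFFFFFFFL)
def fst(p: Long): Int = (p >>> 32).toInt
def snd(p: Long): Int = p.toInt

val p = pack(3, 5)   // a single primitive long: no header, no reference
assert(fst(p) == 3 && snd(p) == 5)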

Slide 109

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum

Slide 110

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4)

Slide 111

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8)

Slide 112

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18

Slide 113

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18

Slide 114

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum }

Slide 115

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function

Slide 116

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function

Slide 117

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18

Slide 118

Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation) { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18 6x faster
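
What the transformed pipeline conceptually compiles to is a single traversal, roughly the hand-fused loop below; no intermediate List(2,3,4) or List(4,6,8) is ever built.

var sum = 0
for (x <- List(1, 2, 3))
  sum += (x + 1) * 2
// sum == 18, computed in one pass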

Slide 119

Motivation Transformation Applications Challenges Conclusion Spark Open World Best Representation? Composition

Slide 120

Research ahead*! * This may not make it into a product, but you can play with it nevertheless.

Slide 121

Spark ● Optimizations – DataFrames do deforestation – DataFrames do predicate push-down – DataFrames do code generation ● Code is specialized for the data representation ● Functions are specialized for the data representation

Slide 122

Spark ● Optimizations – RDDs get no deforestation – RDDs get no predicate push-down – RDDs get no code generation ● Code is not specialized for the data representation ● Functions are not specialized for the data representation

Slide 123

Spark ● Optimizations – RDDs get no deforestation – RDDs get no predicate push-down – RDDs get no code generation ● Code is not specialized for the data representation ● Functions are not specialized for the data representation. This is what makes them slower.

Slide 124

Spark ● Optimizations – Datasets do deforestation – Datasets do predicate push-down – Datasets do code generation ● Code is specialized for the data representation ● Functions are specialized for the data representation

Slide 125

User Functions. [diagram: user function f maps X to Y]

Slide 126

User Functions. [diagram: serialized data → encoded data → decode → X → user function f]

Slide 127

User Functions. [diagram: serialized data → encoded data → decode → X → user function f → Y → encode → encoded data]

Slide 128

User Functions. [diagram: serialized data → encoded data → decode → X → user function f → Y → encode → encoded data] Decode and encode each allocate an object.

Slide 129

User Functions. [diagram: serialized data → encoded data → decode → X → user function f → Y → encode → encoded data] Decode and encode each allocate an object.

Slide 130

User Functions. [diagram: serialized data → encoded data → decode → X → user function f → Y → encode → encoded data]

Slide 131

User Functions. [diagram: serialized data → encoded data → decode → X → user function f → Y → encode → encoded data] Modified user function (automatically derived by the compiler).

Slide 132

User Functions. [diagram: serialized data → encoded data → modified user function (automatically derived by the compiler) → encoded data]

Slide 133

User Functions. [diagram: serialized data → encoded data → modified user function (automatically derived by the compiler) → encoded data] Nowhere near as simple as it looks.
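
To see the shape of the win, here is a toy sketch (the Long encoding, all names, and the derivation itself are assumed for illustration): the naive path decodes each record into an object, applies the user function, and re-encodes, while the derived path applies the equivalent update directly on the encoding.

final case class Person(id: Int, salary: Int)

def decode(row: Long): Person = Person((row >>> 32).toInt, row.toInt)
def encode(p: Person): Long = (p.id.toLong << 32) | (p.salary & 0xFFFFFFFFL)

// The user function f, written against objects:
val raise: Person => Person = p => p.copy(salary = p.salary + 100)

// Naive path: decode → f → encode, allocating a Person per record.
def applyNaive(rows: Array[Long]): Array[Long] =
  rows.map(r => encode(raise(decode(r))))

// Derived path: the same update applied directly on the encoded Long,
// with no object allocation (what the compiler would derive automatically).
def applyDerived(rows: Array[Long]): Array[Long] =
  rows.map(r => (r & 0xFFFFFFFF00000000L) | ((r.toInt + 100) & 0xFFFFFFFFL))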

Slide 134

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method

Slide 135

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings

Slide 136

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call

Slide 137

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope

Slide 138

Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope ● Reuse the machinery in miniboxing scala-miniboxing.org

Slide 139

Challenge: Internal API changes

Slide 140

Challenge: Internal API changes ● Spark internals rely on Iterator[T] – Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers

Slide 141

Challenge: Internal API changes ● Spark internals rely on Iterator[T] – Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers ● Solution: Extensive refactoring/rewrite

Slide 142

Challenge: Automation

Slide 143

Challenge: Automation ● Existing code should run out of the box

Slide 144

Challenge: Automation ● Existing code should run out of the box ● Solution: – Adapt data-centric metaprogramming to Spark – Trade generality for simplicity – Do the right thing for most of the cases

Slide 145

Challenge: Automation ● Existing code should run out of the box ● Solution: – Adapt data-centric metaprogramming to Spark – Trade generality for simplicity – Do the right thing for most of the cases Where are we now?

Slide 146

Prototype

Slide 147

Prototype Hack

Slide 148

Prototype Hack ● Modified version of Spark core – RDD data representation is configurable

Slide 149

Prototype Hack ● Modified version of Spark core – RDD data representation is configurable ● It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done

Slide 150

Prototype Hack

sc.parallelize(/* 1 million */ records).
  map(x => ...).
  filter(x => ...).
  collect()

Slide 151

Prototype Hack

sc.parallelize(/* 1 million */ records).
  map(x => ...).
  filter(x => ...).
  collect()

Slide 152

Prototype Hack

sc.parallelize(/* 1 million */ records).
  map(x => ...).
  filter(x => ...).
  collect()

Not yet 2x faster, but 1.45x faster.

Slide 153

Motivation Transformation Applications Challenges Conclusion Spark Open World Best Representation? Composition

Slide 154

Conclusion ● Object-oriented composition → inefficient representation

Slide 155

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming

Slide 156

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Opaque data → Structured data

Slide 157

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Opaque data → Structured data – Is it possible? Yes.

Slide 158

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Opaque data → Structured data – Is it possible? Yes. – Is it easy? Not really.

Slide 159

Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Opaque data → Structured data – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!

Slide 160

Thank you! Check out scala-ildl.org.

Slide 161

Deforestation and Language Semantics ● Notice that we changed language semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
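
A concrete example of such a reordering, assuming the fused pipeline processes elements one at a time:

// Eager semantics: the first map finishes before the second starts,
// so the prints are a1 a2 a3 b2 b3 b4.
// After deforestation the two functions run fused, per element,
// and the prints interleave: a1 b2 a2 b3 a3 b4.
List(1, 2, 3)
  .map { x => println("a" + x); x + 1 }
  .map { x => println("b" + x); x * 2 }
  .sum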

Slide 162

Deforestation and Language Semantics ● Such transformations are only acceptable with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in

Slide 163

Code Generation ● Also known as – Deep Embedding – Multi-Stage Programming ● Awesome speedups, but restricted to small DSLs ● SparkSQL uses code gen to improve performance – By 2-4x over Spark

Slide 164

Low-level Optimizers ● Java JIT Compiler – Access to the low-level code – Can assume a (local) closed world – Can speculate based on profiles

Slide 165

Low-level Optimizers ● Java JIT Compiler – Access to the low-level code – Can assume a (local) closed world – Can speculate based on profiles ● Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics

Slide 166

Scala Macros ● Many optimizations can be done with macros – :) Lots of power – :( Lots of responsibility ● Scala compiler invariants ● Object-oriented model ● Modularity

Slide 167

Scala Macros ● Many optimizations can be done with macros – :) Lots of power – :( Lots of responsibility ● Scala compiler invariants ● Object-oriented model ● Modularity ● Can we restrict macros so they're safer? – Data-centric metaprogramming