Data-centric Metaprogramming - Scala Days 2016

Presentation at Scala Days 2016

Vlad Ureche

June 16, 2016

Transcript

  1. Data-centric Metaprogramming Vlad Ureche

  2. Vlad Ureche @VladUreche vlad.ureche@gmail.com

  3. Vlad Ureche: Software Engineer at Cyberhaven.io, ex-Scala Team at EPFL, scala-miniboxing.org

  4. STOP Please ask if things are not clear!

  5. Motivation Transformation Applications Challenges Conclusion Functions

  6. Object Composition

  7. Object Composition class Vector[T] { … }

  8. Object Composition class Vector[T] { … } The Vector collection

    in the Scala library
  9. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … } The Vector collection in the Scala library
  10. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … } The Vector collection in the Scala library Corresponds to a table row
  11. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … }
  12. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … }
  13. Object Composition class Employee(...) ID NAME SALARY Vector[Employee] ID NAME

    SALARY ID NAME SALARY class Vector[T] { … }
  14. Object Composition class Employee(...) ID NAME SALARY Vector[Employee] ID NAME

    SALARY ID NAME SALARY class Vector[T] { … } Traversal requires dereferencing a pointer for each employee.
  15. A Better Representation [diagram: Vector[Employee] pointing to separate (ID, NAME, SALARY) objects]

  16. A Better Representation [diagram: Vector[Employee] pointing to separate (ID, NAME, SALARY) objects, next to EmployeeVector storing the ID, NAME and SALARY columns]

  17. A Better Representation • more efficient heap usage • faster iteration [diagram: Vector[Employee] next to the column-based EmployeeVector]
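
The column layout on this slide can be sketched in plain Scala. This is only an illustration: the slides elide the constructor of `Employee`, so the field types, the `fromRows` helper and the `totalSalary` method are assumptions made for the example.

```scala
object ColumnSketch {
  // Hypothetical row type: the slides only name the fields ID, NAME and SALARY.
  final case class Employee(id: Int, name: String, salary: Float)

  // Column-oriented layout: one array per field instead of one object per row.
  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float]) {

    def length: Int = ids.length

    // A row object is rebuilt only when a caller asks for one.
    def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))

    // Scans one primitive array; no per-employee object is dereferenced.
    def totalSalary: Float = {
      var sum = 0f
      var i = 0
      while (i < salaries.length) { sum += salaries(i); i += 1 }
      sum
    }
  }

  object EmployeeVector {
    // Assumed helper: splits row objects into three columns.
    def fromRows(rows: Vector[Employee]): EmployeeVector =
      new EmployeeVector(
        rows.map(_.id).toArray,
        rows.map(_.name).toArray,
        rows.map(_.salary).toArray)
  }
}
```

Summing salaries touches only the `salaries` array, which is the kind of access pattern where the slide's "faster iteration" claim would apply.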
  18. The Problem • Vector[T] is unaware of Employee

  19. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal
  20. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected
  21. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections)
  22. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions
  23. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations
  24. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations – Manual changes don't scale
  25. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations – Manual changes don't scale – The compiler should do that
  26. Current Optimizers

  27. Current Optimizers What about the Scala.js optimizer?

  28. Current Optimizers What about the Scala.js optimizer? What about the

    Dotty Linker?
  29. Current Optimizers What about the Scala.js optimizer? What about the

    Dotty Linker? Scala Native?
  30. Current Optimizers • They do a great job What about

    the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  31. Current Optimizers • They do a great job – But

    have to respect semantics What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  32. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  33. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  34. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  35. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  36. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  37. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native? Challenge: No means of telling the compiler what/when to speculate
  38. Choice: Safe or Fast

  39. Choice: Safe or Fast This is where my work comes

    in...
  40. Data-Centric Metaprogramming • compiler plug-in that allows • Tuning data

    representation • Website: scala-ildl.org
  41. Motivation Transformation Applications Challenges Conclusion Functions

  42. Transformation Definition Application

  43. Transformation Definition Application • can't be automated • based on

    experience • based on speculation • one-time effort
  44. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort
  45. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone
  46. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  47. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  48. Data-Centric Metaprogramming object VectorOfEmployeeOpt extends Transformation { type Target =

    Vector[Employee] type Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... }
  49. object VectorOfEmployeeOpt extends Transformation { type Target = Vector[Employee] type

    Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } Data-Centric Metaprogramming What to transform? What to transform to?
  50. object VectorOfEmployeeOpt extends Transformation { type Target = Vector[Employee] type

    Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } Data-Centric Metaprogramming How to transform?
  51. Data-Centric Metaprogramming object VectorOfEmployeeOpt extends Transformation { type Target =

    Vector[Employee] type Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } How to run methods on the updated representation?
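
To make the elided bodies concrete, here is a self-contained sketch of what such a description object might contain. The `Transformation` trait below is a local stand-in, not the plugin's real base type, and passing the receiver explicitly to the `bypass_` methods is an assumption; the slides elide both the bodies and the receiver.

```scala
object TransformationSketch {
  // Hypothetical field types; the slides elide the Employee constructor.
  final case class Employee(id: Int, name: String, salary: Float)

  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  // Local stand-in for the base trait supplied by the compiler plugin.
  trait Transformation {
    type Target
    type Result
    def toResult(t: Target): Result
    def toTarget(r: Result): Target
  }

  object VectorOfEmployeeOpt extends Transformation {
    type Target = Vector[Employee] // what to transform
    type Result = EmployeeVector   // what to transform to

    // How to transform: split the rows into columns and back.
    def toResult(t: Target): Result =
      new EmployeeVector(
        t.map(_.id).toArray, t.map(_.name).toArray, t.map(_.salary).toArray)

    def toTarget(r: Result): Target =
      Vector.tabulate(r.ids.length)(i => Employee(r.ids(i), r.names(i), r.salaries(i)))

    // How to run methods directly on the new representation
    // (receiver passed explicitly here; an assumption of this sketch).
    def bypass_length(r: Result): Int = r.ids.length
    def bypass_apply(r: Result, i: Int): Employee =
      Employee(r.ids(i), r.names(i), r.salaries(i))
  }
}
```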
  52. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  53. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  54. http://infoscience.epfl.ch/record/207050?ln=en

  55. Motivation Transformation Applications Challenges Conclusion Functions

  56. Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation?

    Composition
  57. Scenario class Employee(...) ID NAME SALARY class Vector[T] { …

    }
  58. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … }
  59. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY
  60. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT
  61. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT
  62. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT Oooops...
  63. Open World Assumption • Globally anything can happen

  64. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee
  65. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How?
  66. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!
  67. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary )
  68. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }
  69. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Now the method operates on the EmployeeVector representation.
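
For comparison, this is roughly what a hand-written `indexSalary` over the column layout could look like, which is the kind of code the scope is meant to spare the programmer from writing. The `EmployeeVector` fields are assumed, and the compiler's actual output may differ.

```scala
object ScopeSketch {
  // Assumed column layout; field names are illustrative.
  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  // Hand-written counterpart of indexSalary: only the salary column is
  // rewritten. The id and name arrays are shared with the input, which is
  // safe only as long as nobody mutates them.
  def indexSalary(employees: EmployeeVector, by: Float): EmployeeVector =
    new EmployeeVector(
      employees.ids,
      employees.names,
      employees.salaries.map(s => (1 + by) * s))
}
```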
  70. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope"
  71. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope" No, it's not a macro. It's a marker for the compiler plugin. (You can't do this with macros)
  72. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope"
  73. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope" • Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
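
One way to picture "reject unexpected values" is a conversion that refuses subclasses it cannot represent. The sketch below is an assumption about how such a check could be written by hand; the class shapes and the `toResult` name mirror the slides, but the check itself is not taken from the plugin.

```scala
object OpenWorldSketch {
  // Hypothetical shapes for the classes named on the Scenario slides.
  class Employee(val id: Int, val name: String, val salary: Float)
  class NewEmployee(id: Int, name: String, salary: Float, val dept: String)
      extends Employee(id, name, salary)

  final class EmployeeVector(
      val ids: Array[Int],
      val names: Array[String],
      val salaries: Array[Float])

  // Incoming values pass through this conversion. A subclass such as
  // NewEmployee would silently lose its DEPT field, so it is rejected.
  def toResult(rows: Vector[Employee]): EmployeeVector = {
    rows.foreach { e =>
      require(e.getClass == classOf[Employee],
        s"unexpected subclass: ${e.getClass.getName}")
    }
    new EmployeeVector(
      rows.map(_.id).toArray, rows.map(_.name).toArray, rows.map(_.salary).toArray)
  }
}
```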
  74. Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation?

    Composition
  75. Best Representation? [diagram: Vector[Employee] as rows of (ID, NAME, SALARY)]

  76. Best Representation? It depends. [diagram: Vector[Employee] as rows of (ID, NAME, SALARY)]

  77. Best ...? It depends. [diagram: Vector[Employee] rows; EmployeeVector with ID, NAME and SALARY columns]

  78. Best ...? It depends. [diagram: Vector[Employee] rows; EmployeeVector columns; compact binary repr. <compact binary blob>]

  79. Best ...? It depends. [diagram: Vector[Employee] rows; EmployeeVector columns; compact binary repr. <compact binary blob>; EmployeeJSON { id: 123, name: “John Doe”, salary: 100 }]
  80. Scopes allow mixing data representations transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee],

    by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }
  81. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the EmployeeVector representation.
  82. Scopes transform(VectorOfEmployeeCompact) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the compact binary representation.
  83. Scopes transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the JSON-based representation.
  84. Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation?

    Composition
  85. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f)

  86. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... }
  87. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... }
  88. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation)
  89. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation)
  90. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation) • Calls between them
  91. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01f) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation) • Calls between them ???
  92. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  93. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  94. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Easy one. Do nothing
  95. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  96. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  97. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  98. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Automatically introduce conversions between values in the two representations e.g. EmployeeVector → Vector[Employee] or back
  99. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  100. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  101. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  102. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Hard one. Do not introduce any conversions. Even across separate compilation
  103. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  104. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Hard one. Automatically introduce double conversions (and warn the programmer) e.g. EmployeeVector → Vector[Employee] → CompactEmpVector
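
The single and double conversions can be spelled out by hand. In this sketch the two representations, the conversion helpers and the `countCompact` method are all hypothetical; the byte layout of `CompactEmpVector` in particular is invented for the example, since the slides show it only as an opaque blob.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object CompositionSketch {
  final case class Employee(id: Int, name: String, salary: Float)
  final class EmployeeVector(
      val ids: Array[Int], val names: Array[String], val salaries: Array[Float])
  final class CompactEmpVector(val bytes: Array[Byte])

  def toColumns(rows: Vector[Employee]): EmployeeVector =
    new EmployeeVector(
      rows.map(_.id).toArray, rows.map(_.name).toArray, rows.map(_.salary).toArray)

  def fromColumns(ev: EmployeeVector): Vector[Employee] =
    Vector.tabulate(ev.ids.length)(i => Employee(ev.ids(i), ev.names(i), ev.salaries(i)))

  // Invented layout: a row count followed by (id, name, salary) records.
  def toCompact(rows: Vector[Employee]): CompactEmpVector = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(rows.length)
    rows.foreach { e => out.writeInt(e.id); out.writeUTF(e.name); out.writeFloat(e.salary) }
    out.flush()
    new CompactEmpVector(bytes.toByteArray)
  }

  def fromCompact(c: CompactEmpVector): Vector[Employee] = {
    val in = new DataInputStream(new ByteArrayInputStream(c.bytes))
    Vector.fill(in.readInt())(Employee(in.readInt(), in.readUTF(), in.readFloat()))
  }

  // Stands for a method compiled inside a compact-representation scope.
  def countCompact(c: CompactEmpVector): Int =
    new DataInputStream(new ByteArrayInputStream(c.bytes)).readInt()

  // Original code calling transformed code: one conversion at the boundary.
  def callFromOriginal(rows: Vector[Employee]): Int =
    countCompact(toCompact(rows))

  // Code under a different transformation: double conversion through the
  // default representation, EmployeeVector -> Vector[Employee] -> CompactEmpVector.
  def callFromColumns(ev: EmployeeVector): Int =
    countCompact(toCompact(fromColumns(ev)))
}
```

The double conversion allocates every row object once just to re-encode it, which suggests why the slide says the programmer is warned in that case.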
  105. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  106. Composition calling overriding • Original code • Transformed code •

    Original code • Transformed code • Same transformation • Different transformation
  107. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit }

  108. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class

    EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }
  109. trait Printer[T] { def print(elements: Vector[T]): Unit } class EmployeePrinter

    extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } Scopes Method print in the class implements method print in the trait
  110. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class

    EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }
  111. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } }
  112. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait
  113. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait Taken care of by the compiler for you!
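
One plausible way to keep the trait contract intact is a bridge method that converts and forwards. The slides do not say how the plugin does it, so the sketch below is an assumption written by hand; the `out` buffer and the `toColumns` helper exist only to make the example observable.

```scala
object BridgeSketch {
  final case class Employee(id: Int, name: String, salary: Float)
  final class EmployeeVector(
      val ids: Array[Int], val names: Array[String], val salaries: Array[Float])

  def toColumns(rows: Vector[Employee]): EmployeeVector =
    new EmployeeVector(
      rows.map(_.id).toArray, rows.map(_.name).toArray, rows.map(_.salary).toArray)

  trait Printer[T] { def print(elements: Vector[T]): Unit }

  class EmployeePrinter extends Printer[Employee] {
    val out = new StringBuilder // collects output so the sketch is testable

    // Transformed method: its signature now mentions EmployeeVector,
    // so on its own it no longer implements Printer.print.
    def print(elements: EmployeeVector): Unit = {
      var i = 0
      while (i < elements.ids.length) {
        out.append(elements.names(i)).append('\n')
        i += 1
      }
    }

    // Bridge with the original signature: converts, then forwards.
    def print(elements: Vector[Employee]): Unit = print(toColumns(elements))
  }
}
```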
  114. Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation?

    Composition
  115. Column-oriented Storage [diagram: Vector[Employee] as rows of (ID, NAME, SALARY); EmployeeVector with ID, NAME and SALARY columns]

  116. Column-oriented Storage: iteration is 5x faster [diagram: Vector[Employee] as rows of (ID, NAME, SALARY); EmployeeVector with ID, NAME and SALARY columns]
  117. Retrofitting value class status [diagram: (3,5) stored as a reference to a heap object with a header and the fields 3 and 5]

  118. Retrofitting value class status Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be [diagram: (3,5) stored as a reference to a heap object with a header and the fields 3 and 5]

  119. Retrofitting value class status (3,5) encoded as the long (3L << 32) + 5 Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be [diagram: (3,5) stored as a reference to a heap object with a header and the fields 3 and 5]

  120. Retrofitting value class status (3,5) encoded as the long (3L << 32) + 5 Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be [diagram: (3,5) stored as a reference to a heap object with a header and the fields 3 and 5] 14x faster, lower heap requirements
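
The long encoding of an integer pair can be written out as follows. The `IntPair` object and its method names are made up for this sketch. Note that `+` binds tighter than `<<` in Scala, so the shift needs parentheses, and the low half must be masked so that a negative second component does not overwrite the high half.

```scala
object IntPair {
  // High 32 bits hold the first component, low 32 bits the second.
  def pack(a: Int, b: Int): Long = (a.toLong << 32) | (b.toLong & 0xFFFFFFFFL)

  def first(p: Long): Int = (p >>> 32).toInt
  def second(p: Long): Int = p.toInt // truncation keeps the low 32 bits
}
```

With this encoding a pair lives in a single primitive `Long`, so no tuple object, header or reference is needed.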
  121. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum

  122. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4)

  123. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8)

  124. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18

  125. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18

  126. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum }
  127. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function
  128. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function
  129. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18
  130. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18 6x faster
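
The "accumulate function" idea can be imitated with a small wrapper that composes the mapped functions instead of building intermediate lists. `FusedList` is a hypothetical class written for this sketch; it is not the representation used by `ListDeforestation`.

```scala
// Holds the source list plus the composition of all functions mapped so far.
final class FusedList[A, B](source: List[A], f: A => B) {
  // map only accumulates the function; no intermediate list is allocated.
  def map[C](g: B => C): FusedList[A, C] =
    new FusedList[A, C](source, f andThen g)

  // A single traversal applies the accumulated function and computes the sum.
  def sum(implicit num: Numeric[B]): B =
    source.foldLeft(num.zero)((acc, a) => num.plus(acc, f(a)))
}

object FusedList {
  def apply[A](xs: List[A]): FusedList[A, A] =
    new FusedList[A, A](xs, (a: A) => a)
}
```

`FusedList(List(1, 2, 3)).map(_ + 1).map(_ * 2).sum` evaluates to 18, the same result as the eager pipeline, but the lists `List(2,3,4)` and `List(4,6,8)` are never materialized.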
  131. Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation?

    Composition
  132. Research ahead* ! * This may not make it into

    a product. But you can play with it nevertheless.
  133. Spark

  134. Spark RDD (Resilient Distributed Dataset)

  135. Spark RDD (Resilient Distributed Dataset) Key abstraction in Spark

  136. Spark RDD (Resilient Distributed Dataset)

  137. Spark RDD (Resilient Distributed Dataset)

  138. Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

  139. Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

    Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file)
  140. Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

    Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file) How does mapping work?
  141. Mapping an RDD X Y user function f

  142. Mapping an RDD serialized data encoded data X Y user

    function f decode
  143. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode
  144. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Allocate object Allocate object
  145. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Allocate object Allocate object
  146. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode
  147. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Modified user function (automatically derived by the compiler)
  148. Mapping an RDD serialized data encoded data encoded data Modified

    user function (automatically derived by the compiler)
  149. Mapping an RDD serialized data encoded data encoded data Modified

    user function (automatically derived by the compiler) Nowhere near as simple as it looks
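
The difference between the decode/apply/encode path and a function that works on the encoded data can be shown on a toy record. Everything here is assumed for illustration: the `Record` type, the 8-byte layout and the `raise` function are not Spark's encoding, and the "modified" function is written by hand rather than derived by a compiler.

```scala
import java.nio.ByteBuffer

object EncodedMapSketch {
  final case class Record(id: Int, salary: Float)

  // Invented layout: 4 bytes of id followed by 4 bytes of salary.
  def encode(r: Record): Array[Byte] =
    ByteBuffer.allocate(8).putInt(r.id).putFloat(r.salary).array()

  def decode(bytes: Array[Byte]): Record = {
    val b = ByteBuffer.wrap(bytes)
    Record(b.getInt(), b.getFloat())
  }

  // The user function f, written against objects.
  def raise(r: Record): Record = r.copy(salary = r.salary * 1.5f)

  // Standard path: decode allocates a Record, f allocates another one.
  def mapViaObjects(bytes: Array[Byte]): Array[Byte] =
    encode(raise(decode(bytes)))

  // Modified function: reads and writes the fields in place in the encoded
  // form. It still allocates the output buffer, but no Record objects.
  def mapEncoded(bytes: Array[Byte]): Array[Byte] = {
    val in = ByteBuffer.wrap(bytes)
    ByteBuffer.allocate(8).putInt(in.getInt(0)).putFloat(in.getFloat(4) * 1.5f).array()
  }
}
```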
  150. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

  151. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings
  152. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call
  153. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope
  154. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope • Reuse the machinery in miniboxing scala-miniboxing.org
  155. Challenge: Internal API

  156. Challenge: Internal API • Spark internals rely on Iterator[T] –

    Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers
  157. Challenge: Internal API • Spark internals rely on Iterator[T] –

    Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers • Solution: Extensive refactoring/rewrite
  158. Prototype

  159. Prototype Hack

  160. Prototype Hack • Modified version of Spark core – RDD

    data representation is configurable
  161. Prototype Hack • Modified version of Spark core – RDD

    data representation is configurable • It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done
  162. Prototype Hack sc.parallelize(/* 1 million */ records). map(x => ...).

    filter(x => ...). collect()
  163. sc.parallelize(/* 1 million */ records). map(x => ...). filter(x =>

    ...). collect() Prototype Hack More details in my talk at Spark Summit EU 2015
  164. Motivation Transformation Applications Challenges Conclusion Functions Open World Best Representation?

    Composition
  165. Conclusion • Object-oriented composition → inefficient representation

  166. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming
  167. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data!
  168. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes.
  169. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really.
  170. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!
  171. Thank you! Check out scala-ildl.org.

  172. Thank you! Check out scala-ildl.org.

  174. Deforestation and Language Semantics • Notice that we changed language

    semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
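
The reordering of effects is easy to observe with standard collections. The sketch below uses `Iterator` as a stand-in for a lazy, fused pipeline (an assumption; it is not what the deforestation transformation generates) and records the order in which the two functions run.

```scala
import scala.collection.mutable.ArrayBuffer

object EffectOrder {
  // Eager: the first map finishes over the whole list before the second starts.
  def eager(): List[String] = {
    val log = ArrayBuffer.empty[String]
    List(1, 2)
      .map { x => log += s"inc $x"; x + 1 }
      .map { x => log += s"dbl $x"; x * 2 }
    log.toList
  }

  // Lazy: each element flows through both functions before the next one.
  def fused(): List[String] = {
    val log = ArrayBuffer.empty[String]
    List(1, 2).iterator
      .map { x => log += s"inc $x"; x + 1 }
      .map { x => log += s"dbl $x"; x * 2 }
      .toList
    log.toList
  }
}
```

`EffectOrder.eager()` logs `inc 1, inc 2, dbl 2, dbl 3`, while `EffectOrder.fused()` logs `inc 1, dbl 2, inc 2, dbl 3`: the results are identical, but side effects interleave differently.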
  175. Deforestation and Language Semantics • Such transformations are only acceptable

    with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in
  176. Code Generation • Also known as – Deep Embedding –

    Multi-Stage Programming • Awesome speedups, but restricted to small DSLs • SparkSQL uses code gen to improve performance – By 2-4x over Spark
  177. Low-level Optimizers • Java JIT Compiler – Access to the

    low-level code – Can assume a (local) closed world – Can speculate based on profiles
  178. Low-level Optimizers • Java JIT Compiler – Access to the

    low-level code – Can assume a (local) closed world – Can speculate based on profiles • Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics
  179. Scala Macros • Many optimizations can be done with macros

    – :) Lots of power – :( Lots of responsibility • Scala compiler invariants • Object-oriented model • Modularity
  180. Scala Macros • Many optimizations can be done with macros

    – :) Lots of power – :( Lots of responsibility • Scala compiler invariants • Object-oriented model • Modularity • Can we restrict macros so they're safer? – Data-centric metaprogramming