Data-centric Metaprogramming - Scala Days 2016

Presentation at Scala Days 2016

Vlad Ureche

June 16, 2016

Transcript

  1. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … } The Vector collection in the Scala library
  2. Object Composition class Employee(...) ID NAME SALARY class Vector[T] {

    … } The Vector collection in the Scala library Corresponds to a table row
  3. Object Composition class Employee(...) ID NAME SALARY Vector[Employee] ID NAME

    SALARY ID NAME SALARY class Vector[T] { … } Traversal requires dereferencing a pointer for each employee.
  4. A Better Representation NAME ... NAME EmployeeVector ID ID ...

    ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY
  5. A Better Representation • more efficient heap usage • faster

    iteration NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY
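    As a rough sketch of the column-oriented layout above, an EmployeeVector could look like the following in plain Scala. The Employee fields and their types are assumptions for illustration; the slides only show the ID, NAME and SALARY columns.

      case class Employee(id: Int, name: String, salary: Double)

      // Hypothetical struct-of-arrays layout: one array per field instead of one
      // heap object per employee, so iteration walks contiguous arrays rather
      // than chasing a pointer for each element.
      class EmployeeVector(ids: Array[Int], names: Array[String], salaries: Array[Double]) {
        def length: Int = ids.length
        def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))
      }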
  6. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected
  7. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections)
  8. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions
  9. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations
  10. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations – Manual changes don't scale
  11. The Problem • Vector[T] is unaware of Employee – Which

    makes Vector[Employee] suboptimal • Not limited to Vector, other constructs also affected – Generics (including all collections) and Functions • We know better representations – Manual changes don't scale – The compiler should do that
  12. Current Optimizers • They do a great job What about

    the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  13. Current Optimizers • They do a great job – But

    have to respect semantics What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  14. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  15. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  16. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  17. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  18. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native?
  19. Current Optimizers • They do a great job – But

    have to respect semantics – Support every corner case – Have to be conservative :( • Programmers have control – What/When/How is accessed – Can break semantics (speculate) What about the Scala.js optimizer? What about the Dotty Linker? Scala Native? Challenge: No means of telling the compiler what/when to speculate
  20. Transformation Definition Application • can't be automated • based on

    experience • based on speculation • one-time effort
  21. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort
  22. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone
  23. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  24. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  25. Data-Centric Metaprogramming object VectorOfEmployeeOpt extends Transformation { type Target =

    Vector[Employee] type Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... }
  26. object VectorOfEmployeeOpt extends Transformation { type Target = Vector[Employee] type

    Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } Data-Centric Metaprogramming What to transform? What to transform to?
  27. object VectorOfEmployeeOpt extends Transformation { type Target = Vector[Employee] type

    Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } Data-Centric Metaprogramming How to transform?
  28. Data-Centric Metaprogramming object VectorOfEmployeeOpt extends Transformation { type Target =

    Vector[Employee] type Result = EmployeeVector def toResult(t: Target): Result = ... def toTarget(t: Result): Target = ... def bypass_length: Int = ... def bypass_apply(i: Int): Employee = ... def bypass_update(i: Int, v: Employee) = ... def bypass_toString: String = ... ... } How to run methods on the updated representation?
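    Read as a single listing, the transformation object from the preceding four slides looks roughly like the sketch below, reusing the Employee and EmployeeVector types sketched earlier. The Transformation trait and the exact bypass_* signatures belong to the speaker's compiler-plugin prototype; here the trait is a placeholder and the bypass methods are given an explicit receiver parameter so the sketch compiles, which may differ from the plugin's actual convention.

      // Placeholder for the plugin's Transformation marker trait.
      trait Transformation { type Target; type Result }

      object VectorOfEmployeeOpt extends Transformation {
        type Target = Vector[Employee]   // what to transform
        type Result = EmployeeVector     // what to transform to

        // how to transform: conversions in both directions
        def toResult(t: Target): Result = ???
        def toTarget(t: Result): Target = ???

        // how to run Vector's methods on the updated representation
        def bypass_length(v: Result): Int = ???
        def bypass_apply(v: Result, i: Int): Employee = ???
        def bypass_update(v: Result, i: Int, e: Employee): Unit = ???
        def bypass_toString(v: Result): String = ???
      }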
  29. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  30. Transformation programmer Definition Application • can't be automated • based

    on experience • based on speculation • one-time effort • repetitive and complex • affects code readability • is verbose • is error-prone compiler (automated)
  31. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY
  32. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT
  33. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT
  34. Scenario class Employee(...) ID NAME SALARY Vector[Employee] ID NAME SALARY

    ID NAME SALARY class Vector[T] { … } NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY class NewEmployee(...) extends Employee(...) ID NAME SALARY DEPT Oooops...
  35. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee
  36. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How?
  37. Open World Assumption • Globally anything can happen • Locally

    you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!
  38. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary )
  39. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }
  40. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Now the method operates on the EmployeeVector representation.
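    Written out, the scope on this slide is the following; transform(...) is the compiler plugin's surface syntax, so treat this as a sketch rather than a drop-in snippet:

      transform(VectorOfEmployeeOpt) {
        def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
          for (employee <- employees) yield
            employee.copy(salary = (1 + by) * employee.salary)
      }

    Inside the scope, employees is stored and traversed as an EmployeeVector; callers outside still see a Vector[Employee].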
  41. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope"
  42. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope" No, it's not a macro. It's a marker for the compiler plugin. (You can't do this with macros)
  43. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope"
  44. Scopes • Can wrap statements, methods, even entire classes –

    Inlined immediately after the parser – Definitions are visible outside the "scope" • Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
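    A hedged sketch of what the boundary conversions mean in practice: values entering or leaving the scope pass through the transformation's toResult/toTarget methods (the names used on the earlier transformation-object slides), roughly as if the compiler had written:

      // Outside the scope: the default representation.
      val vec: Vector[Employee] = Vector(Employee(1, "John Doe", 100.0))

      // Entering the scope converts to the optimized representation;
      // leaving it converts back. The plugin inserts these calls automatically.
      val inside:  EmployeeVector   = VectorOfEmployeeOpt.toResult(vec)
      val outside: Vector[Employee] = VectorOfEmployeeOpt.toTarget(inside)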
  45. Best ...? NAME ... NAME EmployeeVector ID ID ... ...

    SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY
  46. Best ...? Compact binary repr. <compact binary blob> NAME ...

    NAME EmployeeVector ID ID ... ... SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY
  47. Best ...? EmployeeJSON { id: 123, name: “John Doe” salary:

    100 } Compact binary repr. <compact binary blob> NAME ... NAME EmployeeVector ID ID ... ... SALARY SALARY It depends. Vector[Employee] ID NAME SALARY ID NAME SALARY
  48. Scopes allow mixing data representations transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee],

    by: Float): Vector[Employee] = for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) }
  49. Scopes transform(VectorOfEmployeeOpt) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the EmployeeVector representation.
  50. Scopes transform(VectorOfEmployeeCompact) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the compact binary representation.
  51. Scopes transform(VectorOfEmployeeJSON) { def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =

    for (employee ← employees) yield employee.copy( salary = (1 + by) * employee.salary ) } Operating on the JSON-based representation.
  52. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation)
  53. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation)
  54. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation) • Calls between them
  55. Composition def index1Percent(employees: Vector[Employee]) = indexSalary(employees, 0.01) transform(VectorOfEmployeeJSON) { def

    indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] = ... } • Original code (using the default representation) • Transformed code (using a different representation) • Calls between them ???
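    Laid out as code, the composition example is the following: index1Percent is original code using the default Vector[Employee] representation, while indexSalary is compiled against the JSON-based representation, and the question on this slide is what happens at the call between them. (The literal is written 0.01f here so it matches the Float parameter.)

      // Original code, default representation.
      def index1Percent(employees: Vector[Employee]): Vector[Employee] =
        indexSalary(employees, 0.01f)

      // Transformed code, different representation. VectorOfEmployeeJSON is the
      // transformation object named on the slide; its definition is not shown.
      transform(VectorOfEmployeeJSON) {
        def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
          for (employee <- employees) yield
            employee.copy(salary = (1 + by) * employee.salary)
      }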
  56. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  57. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  58. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Easy one. Do nothing
  59. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  60. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  61. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  62. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back
  63. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  64. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  65. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  66. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Hard one. Do not introduce any conversions. Even across separate compilation
  67. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  68. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation Hard one. Automatically introduce double conversions (and warn the programmer), e.g. EmployeeVector → Vector[Employee] → CompactEmpVector
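    A sketch of the double-conversion case: when caller and callee sit under different transformations, the value has to go back through the common Vector[Employee] representation. CompactEmpVector and VectorOfEmployeeCompact are only named in the deck, so their definitions below are placeholders.

      class CompactEmpVector(bytes: Array[Byte])

      object VectorOfEmployeeCompact {
        def toResult(t: Vector[Employee]): CompactEmpVector = ???
        def toTarget(c: CompactEmpVector): Vector[Employee] = ???
      }

      // Conceptually, the compiler inserts at the boundary:
      //   EmployeeVector -> Vector[Employee] -> CompactEmpVector
      def crossRepresentations(ev: EmployeeVector): CompactEmpVector =
        VectorOfEmployeeCompact.toResult(VectorOfEmployeeOpt.toTarget(ev))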
  69. Composition calling • Original code • Transformed code • Original

    code • Transformed code • Same transformation • Different transformation
  70. Composition calling overriding • Original code • Transformed code •

    Original code • Transformed code • Same transformation • Different transformation
  71. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class

    EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }
  72. trait Printer[T] { def print(elements: Vector[T]): Unit } class EmployeePrinter

    extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } Scopes Method print in the class implements method print in the trait
  73. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } class

    EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... }
  74. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } }
  75. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait
  76. Scopes trait Printer[T] { def print(elements: Vector[T]): Unit } transform(VectorOfEmployeeOpt)

    { class EmployeePrinter extends Printer[Employee] { def print(elements: Vector[Employee]) = ... } } The signature of method print changes according to the transformation → it no longer implements the trait. Taken care of by the compiler for you!
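    Written out, the example is the trait and class below: after the transformation, print inside the scope effectively operates on an EmployeeVector, so without help it would no longer override Printer.print; the plugin generates the necessary bridging so the code compiles as written. The listing simply reflows the slide.

      trait Printer[T] {
        def print(elements: Vector[T]): Unit
      }

      transform(VectorOfEmployeeOpt) {
        class EmployeePrinter extends Printer[Employee] {
          def print(elements: Vector[Employee]): Unit = ???   // body elided on the slide
        }
      }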
  77. Column-oriented Storage NAME ... NAME EmployeeVector ID ID ... ...

    SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY
  78. Column-oriented Storage NAME ... NAME EmployeeVector ID ID ... ...

    SALARY SALARY Vector[Employee] ID NAME SALARY ID NAME SALARY iteration is 5x faster
  79. Retrofitting value class status Tuples in Scala are specialized but

    are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference
  80. Retrofitting value class status 0l + 3 << 32 +

    5 (3,5) Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference
  81. Retrofitting value class status 0l + 3 << 32 +

    5 (3,5) Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be (3,5) 3 5 Header reference 14x faster, lower heap requirements
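    The expression 0l + 3 << 32 + 5 on the slide is shorthand for packing the pair (3, 5) into a single Long. Below is a minimal sketch of that encoding, independent of the plugin; explicit shifts and masks are used because, read literally, Scala's operator precedence would parse the slide's expression differently.

      // Pack two Ints into one Long: high 32 bits hold the first element,
      // low 32 bits hold the second.
      def pack(a: Int, b: Int): Long = (a.toLong << 32) | (b & 0xFFFFFFFFL)
      def first(p: Long): Int  = (p >>> 32).toInt
      def second(p: Long): Int = p.toInt

      // pack(3, 5) == 0x0000000300000005L; first/second recover (3, 5)
      // without allocating a tuple object on the heap.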
  82. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function
  83. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18
  84. Deforestation List(1,2,3).map(_ + 1).map(_ * 2).sum List(2,3,4) List(4,6,8) 18 transform(ListDeforestation)

    { List(1,2,3).map(_ + 1).map(_ * 2).sum } accumulate function accumulate function compute: 18 6x faster
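    A hand-written model of what the deforestation transformation achieves: instead of materializing List(2, 3, 4) and List(4, 6, 8), the two map steps are accumulated into one function and the sum is computed in a single traversal. ListDeforestation itself is the plugin's transformation object; the sketch only imitates its effect manually.

      // Naive pipeline: builds two intermediate lists before summing.
      val naive = List(1, 2, 3).map(_ + 1).map(_ * 2).sum                // 18

      // Accumulated-function view: fuse the maps, traverse once, no intermediates.
      val f: Int => Int = ((x: Int) => x + 1).andThen(_ * 2)
      val fused = List(1, 2, 3).foldLeft(0)((acc, x) => acc + f(x))      // 18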
  85. Research ahead!* (* This may not make it into a product. But you can play
    with it nevertheless.)
  86. Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

    Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file)
  87. Spark RDD (Resilient Distributed Dataset) Primary Data (e.g. CSV file)

    Primary Data (e.g. CSV file) Derived Data (e.g. primary.map(f)) Primary Data (e.g. CSV file) How does mapping work?
  88. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode
  89. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Allocate object Allocate object
  90. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Allocate object Allocate object
  91. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode
  92. Mapping an RDD serialized data encoded data X Y encoded

    data user function f decode encode Modified user function (automatically derived by the compiler)
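    The picture on these slides, written out: a naive RDD map decodes each record into an object, applies the user function and re-encodes, allocating twice per element, and the goal of the transformation is to derive a function that works on the encoded data directly. Everything below (Encoded, decode, encode) is a hypothetical stand-in for Spark's actual machinery, reusing the Employee type from earlier.

      type Encoded = Array[Byte]
      def decode(rec: Encoded): Employee = ???   // allocates an Employee object
      def encode(e: Employee): Encoded   = ???   // allocates an encoded record

      val f: Employee => Employee = identity     // the user function

      // Naive mapping: decode -> f -> encode, two allocations per element.
      def naiveMap(data: Iterator[Encoded]): Iterator[Encoded] =
        data.map(rec => encode(f(decode(rec))))

      // What the compiler would derive: a single Encoded => Encoded function
      // that avoids materializing the intermediate objects.
      def derivedF(rec: Encoded): Encoded = ???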
  93. Mapping an RDD serialized data encoded data encoded data Modified

    user function (automatically derived by the compiler)
  94. Mapping an RDD serialized data encoded data encoded data Modified

    user function (automatically derived by the compiler) Nowhere near as simple as it looks
  95. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call
  96. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope
  97. Challenge: Transformation not possible • Example: Calling outside (untransformed) method

    • Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope • Reuse the machinery in miniboxing scala-miniboxing.org
  98. Challenge: Internal API • Spark internals rely on Iterator[T] –

    Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers
  99. Challenge: Internal API • Spark internals rely on Iterator[T] –

    Requires materializing values – Needs to be replaced throughout the code base – By rather complex buffers • Solution: Extensive refactoring/rewrite
  100. Prototype Hack • Modified version of Spark core – RDD

    data representation is configurable
  101. Prototype Hack • Modified version of Spark core – RDD

    data representation is configurable • It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done
  102. Prototype Hack
    sc.parallelize(/* 1 million */ records)
      .map(x => ...)
      .filter(x => ...)
      .collect()
    More details in my talk at Spark Summit EU 2015
  103. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes.
  104. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really.
  105. Conclusion • Object-oriented composition → inefficient representation • Solution: data-centric

    metaprogramming – Use the best representation for your data! – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!
  106. Deforestation and Language Semantics • Notice that we changed language

    semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
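    A concrete way to see the effects reordering: with side-effecting maps, the eager version finishes the first map over the whole list before starting the second, while the fused (lazy) version interleaves the effects per element.

      val xs = List(1, 2, 3)

      // Eager collections: prints a1 a2 a3 b2 b3 b4
      xs.map { x => print(s"a$x "); x + 1 }
        .map { x => print(s"b$x "); x * 2 }

      // Deforested / fused pipeline: prints a1 b2 a2 b3 a3 b4
      xs.foreach { x0 =>
        val x1 = { print(s"a$x0 "); x0 + 1 }
        val x2 = { print(s"b$x1 "); x1 * 2 }
      }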
  107. Deforestation and Language Semantics • Such transformations are only acceptable

    with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in
  108. Code Generation • Also known as – Deep Embedding –

    Multi-Stage Programming • Awesome speedups, but restricted to small DSLs • SparkSQL uses code gen to improve performance – By 2-4x over Spark
  109. Low-level Optimizers • Java JIT Compiler – Access to the

    low-level code – Can assume a (local) closed world – Can speculate based on profiles
  110. Low-level Optimizers • Java JIT Compiler – Access to the

    low-level code – Can assume a (local) closed world – Can speculate based on profiles • Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics
  111. Scala Macros • Many optimizations can be done with macros

    – :) Lots of power – :( Lots of responsibility • Scala compiler invariants • Object-oriented model • Modularity
  112. Scala Macros • Many optimizations can be done with macros

    – :) Lots of power – :( Lots of responsibility • Scala compiler invariants • Object-oriented model • Modularity • Can we restrict macros so they're safer? – Data-centric metaprogramming