
Data-centric Metaprogramming @ Spark Summit EU 2015

Project website: http://scala-ildl.org

The recording is available at: https://www.youtube.com/watch?v=83VfxITTNPE

Describing data-centric metaprogramming and how it can be used to improve the Dataset implementation.

Vlad Ureche

October 29, 2015

Transcript

  1. Vlad Ureche
     PhD in the Scala Team @ EPFL. Soon to graduate ;)
     • Working on program transformations focusing on data representation
     • Author of miniboxing, which improves generics performance by up to 20x
     • Contributed to the Scala compiler and to the scaladoc tool
     @VladUreche · [email protected] · scala-miniboxing.org
  2. Research ahead!*
     * This may not make it into a product. But you can play with it nevertheless.
  3. Motivation
     RDD: • strongly typed • slower
     DataFrame: • dynamically typed • faster
     ?: • strongly typed • faster
  4. Motivation
     RDD: • strongly typed • slower
     DataFrame: • dynamically typed • faster
     Dataset: • strongly typed • faster
  5. Motivation
     RDD: • strongly typed • slower
     DataFrame: • dynamically typed • faster
     Dataset: • strongly typed • faster (mid-way)
  6. Motivation
     RDD: • strongly typed • slower
     DataFrame: • dynamically typed • faster
     Dataset: • strongly typed • faster (mid-way)
     Why just mid-way? What can we do to speed them up?
  7. Object Composition
     class Employee(...)       [ ID | NAME | SALARY ]
     class Vector[T] { ... }   (the Vector collection in the Scala library)
  8. Object Composition
     class Employee(...)       [ ID | NAME | SALARY ]   (corresponds to a table row)
     class Vector[T] { ... }   (the Vector collection in the Scala library)
  9. Object Composition
     Vector[Employee]: a vector of pointers to Employee objects, each [ ID | NAME | SALARY ].
     Traversal requires dereferencing a pointer for each employee.
  10. A Better Representation
      Vector[Employee]: one [ ID | NAME | SALARY ] object per row.
      EmployeeVector: all IDs stored together, all NAMEs together, all SALARYs together (column-wise).
  11. A Better Representation
      EmployeeVector, compared to Vector[Employee]:
      • more efficient heap usage
      • faster iteration
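      To make the diagram concrete, here is a minimal sketch of what such a column-wise class could look like; the shape mirrors the slides, but the code itself is illustrative, not the talk's:

          class Employee(val id: Int, val name: String, val salary: Float)

          // Sketch of a column-wise store: one flat array per field instead of
          // one heap object per row, so iteration needs no pointer chasing.
          class EmployeeVector(
              val ids: Array[Int],
              val names: Array[String],
              val salaries: Array[Float]) {
            def length: Int = ids.length
            def apply(i: Int): Employee = new Employee(ids(i), names(i), salaries(i))
            def totalSalary: Float = salaries.sum   // touches a single contiguous array
          }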
  12. The Problem
      • Vector[T] is unaware of Employee
        – which makes Vector[Employee] suboptimal
      • Not limited to Vector: other classes are also affected
  13. The Problem
      • Vector[T] is unaware of Employee
        – which makes Vector[Employee] suboptimal
      • Not limited to Vector: other classes are also affected
        – Spark pain point: functions/closures
  14. The Problem
      • Vector[T] is unaware of Employee
        – which makes Vector[Employee] suboptimal
      • Not limited to Vector: other classes are also affected
        – Spark pain point: functions/closures
        – We'd like a "structured" representation throughout
  15. The Problem
      • Vector[T] is unaware of Employee
        – which makes Vector[Employee] suboptimal
      • Not limited to Vector: other classes are also affected
        – Spark pain point: functions/closures
        – We'd like a "structured" representation throughout
      Challenge: no means of communicating this to the compiler.
  16. Transformation
      Definition: • can't be automated • based on experience • based on speculation • one-time effort
      Application:
  17. Transformation
      Definition (by the programmer): • can't be automated • based on experience • based on speculation • one-time effort
      Application:
  18. Transformation
      Definition (by the programmer): • can't be automated • based on experience • based on speculation • one-time effort
      Application: • repetitive and complex • affects code readability • is verbose • is error-prone
  19–20. Transformation
      Definition (by the programmer): • can't be automated • based on experience • based on speculation • one-time effort
      Application (by the compiler, automated): • repetitive and complex • affects code readability • is verbose • is error-prone
  21. Data-Centric Metaprogramming
      object VectorOfEmployeeOpt extends Transformation {
        type Target = Vector[Employee]
        type Result = EmployeeVector

        def toResult(t: Target): Result = ...
        def toTarget(t: Result): Target = ...

        def bypass_length: Int = ...
        def bypass_apply(i: Int): Employee = ...
        def bypass_update(i: Int, v: Employee) = ...
        def bypass_toString: String = ...
        ...
      }
  22. Data-Centric Metaprogramming
      (same code) Target and Result answer: what to transform? what to transform to?
  23. Data-Centric Metaprogramming
      (same code) toResult and toTarget answer: how to transform?
  24. Data-Centric Metaprogramming
      (same code) The bypass_* methods answer: how to run methods on the updated representation?
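      For intuition, the two conversion methods could be filled in roughly as follows, reusing the EmployeeVector sketch from above (illustrative code, not the project's actual implementation):

          object Conversions {
            // Vector[Employee] -> EmployeeVector: split rows into columns
            def toResult(t: Vector[Employee]): EmployeeVector =
              new EmployeeVector(
                t.map(_.id).toArray,
                t.map(_.name).toArray,
                t.map(_.salary).toArray)

            // EmployeeVector -> Vector[Employee]: rebuild one object per row
            def toTarget(t: EmployeeVector): Vector[Employee] =
              Vector.tabulate(t.length)(i => t(i))
          }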
  25–26. Transformation
      Definition (by the programmer): • can't be automated • based on experience • based on speculation • one-time effort
      Application (by the compiler, automated): • repetitive and complex • affects code readability • is verbose • is error-prone
  27. Scenario
      class Employee(...)   [ ID | NAME | SALARY ]
      class Vector[T] { ... }
      Vector[Employee] (row-wise) vs. EmployeeVector (column-wise)
  28–29. Scenario
      Now someone adds:
      class NewEmployee(...) extends Employee(...)   [ ID | NAME | SALARY | DEPT ]
  30. Scenario
      Oooops... (EmployeeVector has no column for the extra DEPT field)
  31. Open World Assumption
      • Globally, anything can happen
      • Locally, you have full control:
        – make class Employee final, or
        – limit the transformation to code that uses Employee
  32. Open World Assumption
      (same) How?
  33. Open World Assumption
      (same) How? Using Scopes!
  34–35. Scopes
      transform(VectorOfEmployeeOpt) {
        def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
          for (employee ← employees) yield employee.copy(
            salary = (1 + by) * employee.salary
          )
      }
  36. Scopes
      (same code) Now the method operates on the EmployeeVector representation.
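      Conceptually, inside the scope the compiler rewrites the method to work on the columns directly; a hand-written sketch of the result (the real tool's output may differ):

          def indexSalary(employees: EmployeeVector, by: Float): EmployeeVector =
            // Only the salary column is touched; no Employee objects are
            // allocated or dereferenced during the traversal.
            new EmployeeVector(
              employees.ids,
              employees.names,
              employees.salaries.map(s => (1 + by) * s))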
  37. Scopes
      • Can wrap statements, methods, even entire classes
        – inlined immediately after the parser
        – definitions are visible outside the "scope"
  38. Scopes
      • Can wrap statements, methods, even entire classes
        – inlined immediately after the parser
        – definitions are visible outside the "scope"
      • Mark locally closed parts of the code
        – incoming/outgoing values go through conversions
        – you can reject unexpected values
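      A sketch of what happens at a scope boundary, in terms of the conversions defined earlier; loadEmployees is a hypothetical stand-in, and the conversion calls are inserted by the compiler, not written by hand:

          def loadEmployees(): Vector[Employee] =          // hypothetical data source
            Vector(new Employee(1, "Ann", 100f))

          val employees: Vector[Employee] = loadEmployees()

          // Entering the scope: the compiler inserts the Target -> Result conversion
          val optimized: EmployeeVector = Conversions.toResult(employees)

          // ... transformed code runs on the EmployeeVector representation ...

          // Leaving the scope: the compiler inserts the Result -> Target conversion
          val back: Vector[Employee] = Conversions.toTarget(optimized)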
  39. Best ...?
      Vector[Employee] vs. EmployeeVector: it depends.
  40. Best ...?
      Vector[Employee] vs. EmployeeVector vs. the Tungsten representation
      (<compressed binary blob>): it depends.
  41. Best ...?
      Vector[Employee] vs. EmployeeVector vs. the Tungsten representation vs. EmployeeJSON
      ({ id: 123, name: "John Doe", salary: 100 }): it depends.
  42. Scopes allow mixing data representations
      transform(VectorOfEmployeeOpt) {
        def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
          for (employee ← employees) yield employee.copy(
            salary = (1 + by) * employee.salary
          )
      }
  43. Scopes
      transform(VectorOfEmployeeOpt) {
        def indexSalary(employees: Vector[Employee], by: Float): Vector[Employee] =
          for (employee ← employees) yield employee.copy(
            salary = (1 + by) * employee.salary
          )
      }
      Operating on the EmployeeVector representation.
  44. Scopes
      transform(VectorOfEmployeeCompact) { /* same indexSalary */ }
      Operating on the compact binary representation.
  45. Scopes
      transform(VectorOfEmployeeJSON) { /* same indexSalary */ }
      Operating on the JSON-based representation.
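      Each variant only swaps the transformation object; a different target representation means a different pair of conversions. A self-contained sketch of what a JSON-backed pair might look like (names and encoding here are illustrative, not the project's API):

          object JsonConversions {
            def toResult(t: Vector[Employee]): Vector[String] =
              t.map(e => s"""{ "id": ${e.id}, "name": "${e.name}", "salary": ${e.salary} }""")

            def toTarget(t: Vector[String]): Vector[Employee] =
              ???   // parsing back would need a JSON parser; omitted in this sketch
          }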
  46. Composition
      • Code can be:
        – left untransformed (using the original representation)
        – transformed using different representations
  47. Composition
      • Code can be left untransformed or transformed, which gives a calling matrix:
        original code and transformed code calling original code and transformed code,
        with the same transformation or a different one on each side.
  48–49. Composition
      The calling matrix: original and transformed code calling each other.
  50. Composition
      Original code calling original code: easy one. Do nothing.
  51–53. Composition
      The calling matrix, continued.
  54. Composition
      Original code calling transformed code (or the other way around):
      automatically introduce conversions between values in the two representations,
      e.g. EmployeeVector → Vector[Employee] or back.
  55–57. Composition
      The calling matrix, continued.
  58. Composition
      Transformed code calling transformed code, same transformation: hard one.
      Do not introduce any conversions, even across separate compilation.
  59. Composition
      The calling matrix, continued.
  60. Composition
      Transformed code calling transformed code, different transformations: hard one.
      Automatically introduce double conversions (and warn the programmer),
      e.g. EmployeeVector → Vector[Employee] → CompactEmpVector.
  61. Composition
      The calling matrix, continued.
  62. Composition
      The same matrix applies to overriding, not just calling:
      original and transformed code, same or different transformation.
  63. Scopes
      trait Printer[T] { def print(elements: Vector[T]): Unit }
      class EmployeePrinter extends Printer[Employee] {
        def print(employee: Vector[Employee]) = ...
      }
  64. Scopes
      (same code) Method print in the class implements method print in the trait.
  65. Scopes
      (same code)
  66. Scopes
      trait Printer[T] { def print(elements: Vector[T]): Unit }
      transform(VectorOfEmployeeOpt) {
        class EmployeePrinter extends Printer[Employee] {
          def print(employee: Vector[Employee]) = ...
        }
      }
  67. Scopes
      (same code) The signature of method print changes according to the
      transformation → it no longer implements the trait.
  68. Scopes
      (same code) Taken care of by the compiler for you!
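      One plausible mechanism, sketched by hand here (the slides only say the compiler takes care of it), is a bridge method that keeps the trait contract while forwarding to the optimized version:

          trait Printer[T] { def print(elements: Vector[T]): Unit }

          class EmployeePrinter extends Printer[Employee] {
            // Optimized overload, operating on the transformed representation:
            def print(employees: EmployeeVector): Unit = { /* fast path */ }

            // Bridge with the original signature, so Printer[Employee] is
            // still implemented; it converts and forwards:
            def print(elements: Vector[Employee]): Unit =
              print(Conversions.toResult(elements))
          }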
  69. Column-oriented Storage
      Vector[Employee] (row-wise) vs. EmployeeVector (column-wise).
  70. Column-oriented Storage
      With EmployeeVector, iteration is 5x faster.
  71. Retrofitting value class status
      Tuples in Scala are specialized but are still objects (not value classes)
      = not as optimized as they could be.
      (3,5) is stored as a heap object (header plus the fields 3 and 5),
      accessed through a reference.
  72. Retrofitting value class status
      Retrofitted, (3,5) is encoded into a single long: 0l + 3 << 32 + 5.
  73. Retrofitting value class status
      Result: 14x faster, lower heap requirements.
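      Written out as real code (the slide's 0l + 3 << 32 + 5 shorthand needs parentheses to parse as intended), the packing could look like this sketch:

          // Pack an (Int, Int) pair into one Long: high 32 bits hold the
          // first element, low 32 bits the second.
          def pack(a: Int, b: Int): Long = (a.toLong << 32) | (b & 0xFFFFFFFFL)
          def fst(p: Long): Int = (p >>> 32).toInt
          def snd(p: Long): Int = p.toInt

          val p = pack(3, 5)                  // no heap allocation for the pair
          assert(fst(p) == 3 && snd(p) == 5)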
  74. Deforestation
      List(1,2,3).map(_ + 1).map(_ * 2).sum
      evaluates eagerly: List(2,3,4), then List(4,6,8), then 18.
      transform(ListDeforestation) {
        List(1,2,3).map(_ + 1).map(_ * 2).sum
      }
      instead accumulates the two functions.
  75. Deforestation
      (same) accumulate function, accumulate function, then compute: 18.
  76. Deforestation
      (same) 6x faster.
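      The idea in miniature (a sketch of the technique, not the ListDeforestation transformation itself): compose the accumulated functions and traverse once, building no intermediate lists:

          val f = (x: Int) => x + 1
          val g = (x: Int) => x * 2
          val fused = f.andThen(g)            // accumulate the two mapped functions

          // One traversal; List(2,3,4) and List(4,6,8) are never materialized:
          val result = List(1, 2, 3).foldLeft(0)((acc, x) => acc + fused(x))
          assert(result == 18)                // same as List(1,2,3).map(f).map(g).sum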
  77. Research ahead!*
      * This may not make it into a product. But you can play with it nevertheless.
  78. Spark
      • Optimizations:
        – DataFrames do deforestation
        – DataFrames do predicate push-down
        – DataFrames do code generation
      • Code is specialized for the data representation
      • Functions are specialized for the data representation
  79. Spark
      • Optimizations:
        – RDDs do not do deforestation
        – RDDs do not do predicate push-down
        – RDDs do not do code generation
      • Neither code nor functions are specialized for the data representation
  80. Spark
      (same as slide 79) This is what makes them slower.
  81. Spark
      • Optimizations:
        – Datasets do deforestation
        – Datasets do predicate push-down
        – Datasets do code generation
      • Code is specialized for the data representation
      • Functions are specialized for the data representation
  82–83. User Functions
      Today's pipeline: serialized data → encoded data → decode (allocate object)
      → X → user function f → Y → encode (allocate object) → encoded data.
  84. User Functions
      The decode → f → encode section is replaced by a modified user function,
      automatically derived by the compiler.
  85. User Functions
      New pipeline: serialized data → encoded data → modified user function
      (automatically derived by the compiler) → encoded data.
  86. User Functions
      (same) Nowhere near as simple as it looks.
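      Schematically, the derivation turns a function on decoded objects into one on the encoded form. Reusing the packed-long sketch from slide 73 as a stand-in encoding (Spark's real encoders and the Tungsten format are far more involved):

          // User function on decoded values:
          val f: ((Int, Int)) => (Int, Int) = { case (a, b) => (a + 1, b * 2) }

          // Hand-derived encoded version: same logic, but no tuple is ever
          // allocated; it reads and writes the packed Long directly.
          val fEncoded: Long => Long = p => pack(fst(p) + 1, snd(p) * 2)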
  87. Challenge: Transformation not possible
      • Example: calling an outside (untransformed) method
      • Solution: issue compiler warnings
        – explain why it's not possible: due to the method call
  88. Challenge: Transformation not possible
      • Example: calling an outside (untransformed) method
      • Solution: issue compiler warnings
        – explain why it's not possible: due to the method call
        – suggest how to fix it: enclose the method in a scope
  89. Challenge: Transformation not possible
      (same, plus:)
      • Reuse the machinery in miniboxing (scala-miniboxing.org)
  90. Challenge: Internal API changes
      • Spark internals rely on Iterator[T]
        – requires materializing values
        – needs to be replaced throughout the code base
        – by rather complex buffers
  91. Challenge: Internal API changes
      (same, plus:)
      • Solution: extensive refactoring/rewrite
  92. Challenge: Automation
      • Existing code should run out of the box
      • Solution:
        – adapt data-centric metaprogramming to Spark
        – trade generality for simplicity
        – do the right thing for most of the cases
  93. Challenge: Automation
      (same) Where are we now?
  94. Prototype Hack
      • Modified version of Spark core
        – RDD data representation is configurable
  95. Prototype Hack
      • Modified version of Spark core
        – RDD data representation is configurable
      • It's very limited:
        – custom data representation only in map, filter and flatMap
        – otherwise we revert to costly objects
        – large parts of the automation still need to be done
  96. Prototype Hack
      sc.parallelize(/* 1 million */ records).
        map(x => ...).
        filter(x => ...).
        collect()
      Not yet 2x faster, but 1.45x faster.
  97. Conclusion
      • Object-oriented composition → inefficient representation
      • Solution: data-centric metaprogramming
        – opaque data → structured data
        – Is it possible? Yes.
  98. Conclusion
      (same, plus:)
        – Is it easy? Not really.
  99. Conclusion
      (same, plus:)
        – Is it worth it? You tell me!
  100. Deforestation and Language Semantics
      • Notice that we changed language semantics:
        – before: collections were eager
        – after: collections are lazy
        – this can lead to effects reordering
  101. Deforestation and Language Semantics
      • Such transformations are only acceptable with programmer consent
        – JIT compilers/staged DSLs can't change semantics
        – metaprogramming (macros) can, but it should be documented/opt-in
  102. Code Generation
      • Also known as:
        – deep embedding
        – multi-stage programming
      • Awesome speedups, but restricted to small DSLs
      • SparkSQL uses code gen to improve performance
        – by 2-4x over Spark
  103. Low-level Optimizers
      • Java JIT compiler:
        – access to the low-level code
        – can assume a (local) closed world
        – can speculate based on profiles
  104. Low-level Optimizers
      • Java JIT compiler: (as above)
      • Best optimizations break semantics
        – you can't do this in the JIT compiler!
        – only the programmer can decide to break semantics
  105. Scala Macros
      • Many optimizations can be done with macros
        – :) lots of power
        – :( lots of responsibility
          • Scala compiler invariants
          • object-oriented model
          • modularity
  106. Scala Macros
      (same, plus:)
      • Can we restrict macros so they're safer?
        – data-centric metaprogramming