Data-centric Metaprogramming @ Spark Summit EU 2015

Project website: http://scala-ildl.org

The recording is available at: https://www.youtube.com/watch?v=83VfxITTNPE

Describes data-centric metaprogramming and how it can be used to speed up the Dataset implementation.

Vlad Ureche

October 29, 2015
Transcript

  1. DATA-CENTRIC
    METAPROGRAMMING
    Vlad Ureche

  2. Vlad Ureche
    PhD in the Scala Team @ EPFL. Soon to graduate ;)

    Working on program transformations focusing on data representation

    Author of miniboxing, which improves generics performance by up to 20x

    Contributed to the Scala compiler and to the scaladoc tool.
@VladUreche
    [email protected]
    scala-miniboxing.org

  3. Research ahead*
    * This may not make it into a product.
    But you can play with it nevertheless.

  4. STOP
    Please ask if things
    are not clear!

  5. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark

  6. Motivation
    Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data and used with permission.

  7. Motivation
    Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data and used with permission.
    Performance gap between
    RDDs and DataFrames

  8. Motivation
    RDD DataFrame

  9. Motivation
    RDD

    strongly typed

    slower
    DataFrame

  10. Motivation
    RDD

    strongly typed

    slower
    DataFrame

    dynamically typed

    faster

  12. Motivation
    RDD

    strongly typed

    slower
    DataFrame

    dynamically typed

    faster
    ?

    strongly typed

    faster

  13. Motivation
    RDD

    strongly typed

    slower
    DataFrame

    dynamically typed

    faster
    Dataset

    strongly typed

    faster

  14. Motivation
    RDD

    strongly typed

    slower
    DataFrame

    dynamically typed

    faster
    Dataset

    strongly typed

    mid-way faster

  15. Motivation
    RDD

    strongly typed

    slower
    DataFrame

    dynamically typed

    faster
    Dataset

    strongly typed

    mid-way faster
    Why just mid-way?
    What can we do to speed them up?

  16. Object Composition

  17. Object Composition
    class Vector[T] { … }

  18. Object Composition
    class Vector[T] { … }
    The Vector collection
    in the Scala library

  19. Object Composition
    class Employee(...)
    ID NAME SALARY
    class Vector[T] { … }
    The Vector collection
    in the Scala library

  20. Object Composition
    class Employee(...)
    ID NAME SALARY
    class Vector[T] { … }
    The Vector collection
    in the Scala library
    Corresponds to
    a table row

  21. Object Composition
    class Employee(...)
    ID NAME SALARY
    class Vector[T] { … }

  23. Object Composition
    class Employee(...)
    ID NAME SALARY
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
    class Vector[T] { … }

  24. Object Composition
    class Employee(...)
    ID NAME SALARY
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
    class Vector[T] { … }
    Traversal requires
    dereferencing a pointer
    for each employee.

  25. A Better Representation
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  26. A Better Representation
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  27. A Better Representation

    more efficient heap usage

    faster iteration
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
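For illustration, here is a minimal sketch of the columnar layout above, assuming one array per field (the field types are assumptions; the slides only name the fields):

    class Employee(val id: Int, val name: String, val salary: Float)

    // One array per field instead of one heap object per employee:
    class EmployeeVector(
        val ids: Array[Int],
        val names: Array[String],
        val salaries: Array[Float]) {
      def length: Int = ids.length
      def apply(i: Int): Employee =
        new Employee(ids(i), names(i), salaries(i))
    }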

  28. The Problem

    Vector[T] is unaware of Employee

  29. The Problem

    Vector[T] is unaware of Employee
    – Which makes Vector[Employee] suboptimal

  30. The Problem

    Vector[T] is unaware of Employee
    – Which makes Vector[Employee] suboptimal

    Not limited to Vector, other classes also affected

  31. The Problem

    Vector[T] is unaware of Employee
    – Which makes Vector[Employee] suboptimal

    Not limited to Vector, other classes also affected
    – Spark pain point: Functions/closures

  32. The Problem

    Vector[T] is unaware of Employee
    – Which makes Vector[Employee] suboptimal

    Not limited to Vector, other classes also affected
    – Spark pain point: Functions/closures
    – We'd like a "structured" representation throughout

  33. The Problem

    Vector[T] is unaware of Employee
    – Which makes Vector[Employee] suboptimal

    Not limited to Vector, other classes also affected
    – Spark pain point: Functions/closures
    – We'd like a "structured" representation throughout
    Challenge: No means of
    communicating this
    to the compiler

  34. Choice: Safe or Fast

  35. Choice: Safe or Fast
    This is where my
    work comes in...

  36. Data-Centric Metaprogramming

    Compiler plug-in that allows tuning the data representation

    Website: scala-ildl.org

  37. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark

  38. Transformation
    Definition Application

  39. Transformation
    Definition:
    – can't be automated
    – based on experience
    – based on speculation
    – one-time effort

  40. Transformation
    Definition (programmer):
    – can't be automated
    – based on experience
    – based on speculation
    – one-time effort

  41. Transformation
    Definition (programmer):
    – can't be automated
    – based on experience
    – based on speculation
    – one-time effort
    Application:
    – repetitive and complex
    – affects code readability
    – is verbose
    – is error-prone

  42. Transformation
    Definition (programmer):
    – can't be automated
    – based on experience
    – based on speculation
    – one-time effort
    Application (compiler, automated):
    – repetitive and complex
    – affects code readability
    – is verbose
    – is error-prone

  44. Data-Centric Metaprogramming
    object VectorOfEmployeeOpt extends Transformation {
      type Target = Vector[Employee]
      type Result = EmployeeVector

      def toResult(t: Target): Result = ...
      def toTarget(t: Result): Target = ...

      def bypass_length: Int = ...
      def bypass_apply(i: Int): Employee = ...
      def bypass_update(i: Int, v: Employee) = ...
      def bypass_toString: String = ...
      ...
    }

  45. Data-Centric Metaprogramming
    object VectorOfEmployeeOpt extends Transformation {
      type Target = Vector[Employee]
      type Result = EmployeeVector

      def toResult(t: Target): Result = ...
      def toTarget(t: Result): Target = ...

      def bypass_length: Int = ...
      def bypass_apply(i: Int): Employee = ...
      def bypass_update(i: Int, v: Employee) = ...
      def bypass_toString: String = ...
      ...
    }
    What to transform?
    What to transform to?

  46. Data-Centric Metaprogramming
    object VectorOfEmployeeOpt extends Transformation {
      type Target = Vector[Employee]
      type Result = EmployeeVector

      def toResult(t: Target): Result = ...
      def toTarget(t: Result): Target = ...

      def bypass_length: Int = ...
      def bypass_apply(i: Int): Employee = ...
      def bypass_update(i: Int, v: Employee) = ...
      def bypass_toString: String = ...
      ...
    }
    How to transform?

  47. Data-Centric Metaprogramming
    object VectorOfEmployeeOpt extends Transformation {
      type Target = Vector[Employee]
      type Result = EmployeeVector

      def toResult(t: Target): Result = ...
      def toTarget(t: Result): Target = ...

      def bypass_length: Int = ...
      def bypass_apply(i: Int): Employee = ...
      def bypass_update(i: Int, v: Employee) = ...
      def bypass_toString: String = ...
      ...
    }
    How to run methods on the updated representation?
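For illustration, the two conversions could be written as below, reusing the EmployeeVector sketch from earlier; the bodies are assumptions, since the slide elides them, and the bypass_* methods would delegate to EmployeeVector in the same spirit:

    object VectorOfEmployeeOpt extends Transformation {
      type Target = Vector[Employee]
      type Result = EmployeeVector

      // Target → Result: split the objects into per-field arrays
      def toResult(t: Target): Result =
        new EmployeeVector(
          t.map(_.id).toArray,
          t.map(_.name).toArray,
          t.map(_.salary).toArray)

      // Result → Target: rebuild one Employee object per row
      def toTarget(t: Result): Target =
        Vector.tabulate(t.length)(i => t(i))
    }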

  48. Transformation
    Definition (programmer):
    – can't be automated
    – based on experience
    – based on speculation
    – one-time effort
    Application (compiler, automated):
    – repetitive and complex
    – affects code readability
    – is verbose
    – is error-prone

  50. http://infoscience.epfl.ch/record/207050?ln=en

  51. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark

  52. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark
    Open World
    Best Representation?
    Composition

  53. Scenario
    class Employee(...)
    ID NAME SALARY
    class Vector[T] { … }

  54. Scenario
    class Employee(...)
    ID NAME SALARY
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
    class Vector[T] { … }

  55. Scenario
    class Employee(...)
    ID NAME SALARY
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
    class Vector[T] { … }
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...

  56. Scenario
    class Employee(...)
    ID NAME SALARY
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
    class Vector[T] { … }
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    class NewEmployee(...)
    extends Employee(...)
    ID NAME SALARY DEPT

  58. Scenario
    class Employee(...)
    ID NAME SALARY
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
    class Vector[T] { … }
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    class NewEmployee(...)
    extends Employee(...)
    ID NAME SALARY DEPT
    Oooops...

  59. Open World Assumption

    Globally anything can happen

  60. Open World Assumption

    Globally anything can happen

    Locally you have full control:
    – Make class Employee final or
    – Limit the transformation to code that uses Employee

  61. Open World Assumption

    Globally anything can happen

    Locally you have full control:
    – Make class Employee final or
    – Limit the transformation to code that uses Employee
    How?

  62. Open World Assumption

    Globally anything can happen

    Locally you have full control:
    – Make class Employee final or
    – Limit the transformation to code that uses Employee
    How?
    Using
    Scopes!

  63. Scopes
    transform(VectorOfEmployeeOpt) {
      def indexSalary(employees: Vector[Employee],
                      by: Float): Vector[Employee] =
        for (employee ← employees)
          yield employee.copy(
            salary = (1 + by) * employee.salary
          )
    }

  65. Scopes
    transform(VectorOfEmployeeOpt) {
      def indexSalary(employees: Vector[Employee],
                      by: Float): Vector[Employee] =
        for (employee ← employees)
          yield employee.copy(
            salary = (1 + by) * employee.salary
          )
    }
    Now the method operates
    on the EmployeeVector
    representation.

  66. Scopes

    Can wrap statements, methods, even entire classes
    – Inlined immediately after the parser
    – Definitions are visible outside the "scope"

  67. Scopes

    Can wrap statements, methods, even entire classes
    – Inlined immediately after the parser
    – Definitions are visible outside the "scope"

    Mark locally closed parts of the code
    – Incoming/outgoing values go through conversions
    – You can reject unexpected values
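A sketch of those boundary conversions (loadEmployees is made up for illustration, and the conversions are inserted by the compiler, not written by hand):

    val employees: Vector[Employee] = loadEmployees()  // original representation

    transform(VectorOfEmployeeOpt) {
      // entering the scope: employees is converted with toResult;
      // any value flowing back out is converted with toTarget
      indexSalary(employees, by = 0.1f)
    }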

  68. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark
    Open World
    Best Representation?
    Composition

  69. Best Representation?
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  70. Best Representation?
    It depends.
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  71. Best ...?
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    It depends.
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  72. Best ...?
    Tungsten repr.

    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    It depends.
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  73. Best ...?
    EmployeeJSON
    {
      id: 123,
      name: "John Doe",
      salary: 100
    }
    Tungsten repr.

    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    It depends.
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  74. Scopes allow mixing data representations
    transform(VectorOfEmployeeOpt) {
      def indexSalary(employees: Vector[Employee],
                      by: Float): Vector[Employee] =
        for (employee ← employees)
          yield employee.copy(
            salary = (1 + by) * employee.salary
          )
    }

  75. Scopes
    transform(VectorOfEmployeeOpt) {
      def indexSalary(employees: Vector[Employee],
                      by: Float): Vector[Employee] =
        for (employee ← employees)
          yield employee.copy(
            salary = (1 + by) * employee.salary
          )
    }
    Operating on the
    EmployeeVector
    representation.

  76. Scopes
    transform(VectorOfEmployeeCompact) {
      def indexSalary(employees: Vector[Employee],
                      by: Float): Vector[Employee] =
        for (employee ← employees)
          yield employee.copy(
            salary = (1 + by) * employee.salary
          )
    }
    Operating on the
    compact binary
    representation.

  77. Scopes
    transform(VectorOfEmployeeJSON) {
      def indexSalary(employees: Vector[Employee],
                      by: Float): Vector[Employee] =
        for (employee ← employees)
          yield employee.copy(
            salary = (1 + by) * employee.salary
          )
    }
    Operating on the
    JSON-based
    representation.

  78. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark
    Open World
    Best Representation?
    Composition

  79. Composition

    Code can be
    – Left untransformed (using the original representation)
    – Transformed using different representations

  80. Composition

    Code can be
    – Left untransformed (using the original representation)
    – Transformed using different representations
    [Diagram: a matrix of calls between Original code and Transformed code,
    distinguishing the Same transformation from a Different transformation]

  83. Composition
    [Diagram: the calling matrix from before]
    Easy one. Do nothing

  87. Composition
    [Diagram: the calling matrix from before]
    Automatically introduce conversions between values in the two
    representations, e.g. EmployeeVector → Vector[Employee] or back
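Concretely, when transformed code calls into original code, the conversion lands at the call site (a sketch; untransformedSize is made up for illustration):

    def untransformedSize(v: Vector[Employee]): Int = v.size  // original code

    transform(VectorOfEmployeeOpt) {
      def count(employees: Vector[Employee]): Int =
        untransformedSize(employees)  // compiler inserts toTarget(employees) here
    }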

  91. Composition
    [Diagram: the calling matrix from before]
    Hard one. Do not introduce any conversions.
    Even across separate compilation

  93. Composition
    [Diagram: the calling matrix from before]
    Hard one. Automatically introduce double conversions
    (and warn the programmer),
    e.g. EmployeeVector → Vector[Employee] → CompactEmpVector

  95. Composition
    [Diagram: the same matrix, now covering both calling and overriding
    between Original code and Transformed code]

  96. Scopes
    trait Printer[T] {
      def print(elements: Vector[T]): Unit
    }

    class EmployeePrinter extends Printer[Employee] {
      def print(employee: Vector[Employee]) = ...
    }

  97. Scopes
    trait Printer[T] {
      def print(elements: Vector[T]): Unit
    }

    class EmployeePrinter extends Printer[Employee] {
      def print(employee: Vector[Employee]) = ...
    }
    Method print in the class implements method print in the trait

  99. Scopes
    trait Printer[T] {
      def print(elements: Vector[T]): Unit
    }

    transform(VectorOfEmployeeOpt) {
      class EmployeePrinter extends Printer[Employee] {
        def print(employee: Vector[Employee]) = ...
      }
    }

  100. Scopes
    trait Printer[T] {
      def print(elements: Vector[T]): Unit
    }

    transform(VectorOfEmployeeOpt) {
      class EmployeePrinter extends Printer[Employee] {
        def print(employee: Vector[Employee]) = ...
      }
    }
    The signature of method print changes according to the
    transformation: it no longer implements the trait

  101. Scopes
    trait Printer[T] {
      def print(elements: Vector[T]): Unit
    }

    transform(VectorOfEmployeeOpt) {
      class EmployeePrinter extends Printer[Employee] {
        def print(employee: Vector[Employee]) = ...
      }
    }
    The signature of method print changes according to the
    transformation: it no longer implements the trait
    Taken care of by the compiler for you!
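One way to picture what the compiler generates (a sketch; printTransformed and the explicit conversion are illustrative, the plugin produces this plumbing itself): the transformed method gets a new signature, and a bridge with the trait's signature forwards to it:

    class EmployeePrinter extends Printer[Employee] {
      // transformed version, operating on the EmployeeVector representation:
      def printTransformed(employees: EmployeeVector): Unit =
        for (i <- 0 until employees.length)
          println(employees(i).name)

      // compiler-generated bridge that still implements the trait:
      def print(elements: Vector[Employee]): Unit =
        printTransformed(VectorOfEmployeeOpt.toResult(elements))
    }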

  102. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark
    Open World
    Best Representation?
    Composition

  103. Column-oriented Storage
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY

  104. Column-oriented Storage
    EmployeeVector
    ID ID ...
    NAME NAME ...
    SALARY SALARY ...
    Vector[Employee]
    ID NAME SALARY
    ID NAME SALARY
    iteration is 5x faster

  105. Retrofitting value class status
    [Diagram: the tuple (3,5) on the heap: a reference to an object
    holding a Header and the fields 3 and 5]

  106. Retrofitting value class status
    Tuples in Scala are specialized, but they are still objects
    (not value classes), so they are not as optimized as they could be.
    [Diagram: the tuple (3,5) on the heap: a reference to an object
    holding a Header and the fields 3 and 5]

  107. Retrofitting value class status
    (3,5) → (3L << 32) + 5
    Tuples in Scala are specialized, but they are still objects
    (not value classes), so they are not as optimized as they could be.
    [Diagram: the tuple (3,5) on the heap: a reference to an object
    holding a Header and the fields 3 and 5]

  108. Retrofitting value class status
    (3,5) → (3L << 32) + 5
    Tuples in Scala are specialized, but they are still objects
    (not value classes), so they are not as optimized as they could be.
    [Diagram: the tuple (3,5) on the heap: a reference to an object
    holding a Header and the fields 3 and 5]
    14x faster, lower heap requirements
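A sketch of the packing with the operator precedence made explicit (the helper names are made up; the mask keeps a negative second component from smearing into the high bits):

    // Pack two Ints into one Long: high 32 bits hold the first component.
    def pack(x: Int, y: Int): Long = (x.toLong << 32) | (y.toLong & 0xFFFFFFFFL)
    def fst(t: Long): Int = (t >>> 32).toInt
    def snd(t: Long): Int = t.toInt

    val t = pack(3, 5)  // the tuple (3,5) as a single Long, no heap object
    assert(fst(t) == 3 && snd(t) == 5)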

  109. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum

  110. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4)

  111. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4) List(4,6,8)

  112. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4) List(4,6,8) 18

  114. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4) List(4,6,8) 18
    transform(ListDeforestation) {
      List(1,2,3).map(_ + 1).map(_ * 2).sum
    }

  115. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4) List(4,6,8) 18
    transform(ListDeforestation) {
      List(1,2,3).map(_ + 1).map(_ * 2).sum
    }
    accumulate function

  116. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4) List(4,6,8) 18
    transform(ListDeforestation) {
      List(1,2,3).map(_ + 1).map(_ * 2).sum
    }
    accumulate function
    accumulate function

  117. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4) List(4,6,8) 18
    transform(ListDeforestation) {
      List(1,2,3).map(_ + 1).map(_ * 2).sum
    }
    accumulate function
    accumulate function
    compute: 18

  118. Deforestation
    List(1,2,3).map(_ + 1).map(_ * 2).sum
    List(2,3,4) List(4,6,8) 18
    transform(ListDeforestation) {
      List(1,2,3).map(_ + 1).map(_ * 2).sum
    }
    accumulate function
    accumulate function
    compute: 18
    6x faster
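What the transformation amounts to, as a sketch in plain Scala: the two mapped functions are accumulated and composed, and a single traversal computes the sum without building the intermediate lists:

    // Instead of materializing List(2,3,4) and List(4,6,8):
    val f = ((x: Int) => x + 1).andThen(_ * 2)  // the accumulated functions
    val result = List(1, 2, 3).foldLeft(0)((acc, x) => acc + f(x))  // compute: 18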

  119. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark
    Open World
    Best Representation?
    Composition

  120. Research ahead*
    * This may not make it into a product.
    But you can play with it nevertheless.

  121. Spark

    Optimizations
    – DataFrames do deforestation
    – DataFrames do predicate push-down
    – DataFrames do code generation

    Code is specialized for the data representation

    Functions are specialized for the data representation

  122. Spark

    Optimizations
    – RDDs don't do deforestation
    – RDDs don't do predicate push-down
    – RDDs don't do code generation

    Code is not specialized for the data representation

    Functions are not specialized for the data representation

  123. Spark

    Optimizations
    – RDDs don't do deforestation
    – RDDs don't do predicate push-down
    – RDDs don't do code generation

    Code is not specialized for the data representation

    Functions are not specialized for the data representation
    This is what makes them slower

  124. Spark

    Optimizations
    – Datasets do deforestation
    – Datasets do predicate push-down
    – Datasets do code generation

    Code is specialized for the data representation

    Functions are not specialized for the data representation

  125. User Functions
    [Diagram: a user function f mapping X to Y]

  126. User Functions
    [Diagram: serialized data → encoded data → decode → X → user function f → Y]

  127. User Functions
    [Diagram: serialized data → encoded data → decode → X → user function f → Y
    → encode → encoded data]

  128. User Functions
    [Diagram: the same pipeline; decode and encode each allocate an object]

  131. User Functions
    [Diagram: the same pipeline, with the decode → f → encode section bracketed]
    Modified user function (automatically derived by the compiler)

  132. User Functions
    [Diagram: serialized data → encoded data → modified user function
    (automatically derived by the compiler) → encoded data]

  133. User Functions
    [Diagram: serialized data → encoded data → modified user function
    (automatically derived by the compiler) → encoded data]
    Nowhere near as simple as it looks
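A sketch of the idea using the packed pair from the value-class example (encodedF is illustrative, not the plugin's actual output; pack, fst and snd are the helpers sketched earlier):

    val f: ((Int, Int)) => (Int, Int) = { case (a, b) => (a + 1, b * 2) }

    // Derived version working directly on the Long encoding,
    // with no tuple allocated between decode and encode:
    val encodedF: Long => Long = t => pack(fst(t) + 1, snd(t) * 2)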

  134. Challenge: Transformation not possible

    Example: Calling outside (untransformed) method

  135. Challenge: Transformation not possible

    Example: Calling outside (untransformed) method

    Solution: Issue compiler warnings

  136. Challenge: Transformation not possible

    Example: Calling outside (untransformed) method

    Solution: Issue compiler warnings
    – Explain why it's not possible: due to the method call

  137. Challenge: Transformation not possible

    Example: Calling outside (untransformed) method

    Solution: Issue compiler warnings
    – Explain why it's not possible: due to the method call
    – Suggest how to fix it: enclose the method in a scope

  138. Challenge: Transformation not possible

    Example: Calling outside (untransformed) method

    Solution: Issue compiler warnings
    – Explain why it's not possible: due to the method call
    – Suggest how to fix it: enclose the method in a scope

    Reuse the machinery in miniboxing
    scala-miniboxing.org

  139. Challenge: Internal API changes

  140. Challenge: Internal API changes

    Spark internals rely on Iterator[T]
    – Requires materializing values
    – Needs to be replaced throughout the code base
    – By rather complex buffers

  141. Challenge: Internal API changes

    Spark internals rely on Iterator[T]
    – Requires materializing values
    – Needs to be replaced throughout the code base
    – By rather complex buffers

    Solution: Extensive refactoring/rewrite

  142. Challenge: Automation

  143. Challenge: Automation

    Existing code should run out of the box

  144. Challenge: Automation

    Existing code should run out of the box

    Solution:
    – Adapt data-centric metaprogramming to Spark
    – Trade generality for simplicity
    – Do the right thing for most of the cases

  145. Challenge: Automation

    Existing code should run out of the box

    Solution:
    – Adapt data-centric metaprogramming to Spark
    – Trade generality for simplicity
    – Do the right thing for most of the cases
    Where are we now?

  146. Prototype

  147. Prototype Hack

  148. Prototype Hack

    Modified version of Spark core
    – RDD data representation is configurable

  149. Prototype Hack

    Modified version of Spark core
    – RDD data representation is configurable

    It's very limited:
    – Custom data repr. only in map, filter and flatMap
    – Otherwise we revert to costly objects
    – Large parts of the automation still need to be done

  150. Prototype Hack
    sc.parallelize(/* 1 million */ records).
      map(x => ...).
      filter(x => ...).
      collect()

  152. Prototype Hack
    sc.parallelize(/* 1 million */ records).
      map(x => ...).
      filter(x => ...).
      collect()
    Not yet 2x faster, but 1.45x faster
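A hypothetical concrete instance of the pipeline, just to make its shape explicit (the benchmarked lambdas are elided on the slide; Record and the predicate are made up):

    case class Record(id: Int, value: Float)

    sc.parallelize(1 to 1000000).
      map(i => Record(i, i * 0.5f)).
      filter(r => r.value > 1000f).
      collect()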

  153. Motivation
    Transformation
    Applications
    Challenges
    Conclusion
    Spark
    Open World
    Best Representation?
    Composition

  154. Conclusion

    Object-oriented composition → inefficient representation

  155. Conclusion

    Object-oriented composition → inefficient representation

    Solution: data-centric metaprogramming

  156. Conclusion

    Object-oriented composition → inefficient representation

    Solution: data-centric metaprogramming
    – Opaque data → Structured data

  157. Conclusion

    Object-oriented composition → inefficient representation

    Solution: data-centric metaprogramming
    – Opaque data → Structured data
    – Is it possible? Yes.

  158. Conclusion

    Object-oriented composition → inefficient representation

    Solution: data-centric metaprogramming
    – Opaque data → Structured data
    – Is it possible? Yes.
    – Is it easy? Not really.

  159. Conclusion

    Object-oriented composition → inefficient representation

    Solution: data-centric metaprogramming
    – Opaque data → Structured data
    – Is it possible? Yes.
    – Is it easy? Not really.
    – Is it worth it? You tell me!

  160. Thank you!
    Check out scala-ildl.org.

  161. Deforestation and Language Semantics

    Notice that we changed language semantics:
    – Before: collections were eager
    – After: collections are lazy
    – This can lead to effects reordering
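A small way to observe such reordering in plain Scala, using a lazy view as a stand-in for the transformed pipeline (the trace helper is made up for illustration):

    def trace(tag: String)(x: Int): Int = { println(s"$tag($x)"); x }

    // Eager list: all f's run before any g: f(1) f(2) g(1) g(2)
    List(1, 2).map(trace("f")).map(trace("g")).sum

    // Lazy view: f and g interleave per element: f(1) g(1) f(2) g(2)
    List(1, 2).view.map(trace("f")).map(trace("g")).sum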

  162. Deforestation and Language Semantics

    Such transformations are only acceptable with
    programmer consent
    – JIT compilers/staged DSLs can't change semantics
    – metaprogramming (macros) can, but it should be
    documented/opt-in

  163. Code Generation

    Also known as
    – Deep Embedding
    – Multi-Stage Programming

    Awesome speedups, but restricted to small DSLs

    SparkSQL uses code gen to improve performance
    – By 2-4x over Spark

  164. Low-level Optimizers

    Java JIT Compiler
    – Access to the low-level code
    – Can assume a (local) closed world
    – Can speculate based on profiles

  165. Low-level Optimizers

    Java JIT Compiler
    – Access to the low-level code
    – Can assume a (local) closed world
    – Can speculate based on profiles

    Best optimizations break semantics
    – You can't do this in the JIT compiler!
    – Only the programmer can decide to break semantics

  166. Scala Macros

    Many optimizations can be done with macros
    – :) Lots of power
    – :( Lots of responsibility

    Scala compiler invariants

    Object-oriented model

    Modularity

  167. Scala Macros

    Many optimizations can be done with macros
    – :) Lots of power
    – :( Lots of responsibility

    Scala compiler invariants

    Object-oriented model

    Modularity

    Can we restrict macros so they're safer?
    – Data-centric metaprogramming
