Vlad Ureche, PhD in the Scala Team @ EPFL. Soon to graduate ;) ● Working on program transformations focusing on data representation ● Author of miniboxing, which improves generics performance by up to 20x ● Contributed to the Scala compiler and to the scaladoc tool. @VladUreche [email protected] scala-miniboxing.org
Motivation Comparison graph from http://fr.slideshare.net/databricks/spark-summit-eu-2015-spark-dataframes-simple-and-fast-analysis-of-structured-data, used with permission. Performance gap between RDDs and DataFrames
Motivation ● RDD: strongly typed, slower ● DataFrame: dynamically typed, faster ● Dataset: strongly typed, faster, but only mid-way. Why just mid-way? What can we do to speed them up?
Object Composition: class Employee(...) has fields ID, NAME, SALARY; class Vector[T] { … } is generic. A Vector[Employee] holds a pointer to a separate Employee object (ID, NAME, SALARY) for each element. Traversal requires dereferencing a pointer for each employee.
A Better Representation: an EmployeeVector that stores the IDs, NAMEs and SALARYs of all employees in separate, contiguous arrays, instead of a Vector[Employee] of individual objects ● more efficient heap usage ● faster iteration
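To make the layouts concrete, here is a minimal Scala sketch of the two representations (the names Employee and EmployeeVector follow the slides; the fields and methods are illustrative):

  class Employee(val id: Long, val name: String, val salary: Long)

  // object representation: a vector of pointers, one heap object per employee
  val employees: Vector[Employee] =
    Vector(new Employee(1, "Ann", 2000), new Employee(2, "Bob", 1800))

  // structured representation: one object, three flat arrays, no per-employee dereference
  class EmployeeVector(val ids: Array[Long], val names: Array[String], val salaries: Array[Long]) {
    def salarySum: Long = salaries.sum   // iterates a contiguous array, cache-friendly
  }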
The Problem ● Vector[T] is unaware of Employee – Which makes Vector[Employee] suboptimal ● Not limited to Vector, other classes also affected – Spark pain point: Functions/closures – We'd like a "structured" representation throughout Challenge: No means of communicating this to the compiler
Transformation ● Definition (by the programmer): can't be automated, based on experience, based on speculation, a one-time effort ● Application: repetitive and complex, affects code readability, is verbose, is error-prone, so it is left to the compiler (automated)
Scenario: Vector[Employee] has been transformed into the EmployeeVector representation (separate ID, NAME and SALARY arrays). Then someone defines class NewEmployee(...) extends Employee(...) with an extra DEPT field. Oooops...
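A sketch of why the subclass breaks the transformed layout (continuing the hypothetical EmployeeVector from above):

  class Employee(val id: Long, val name: String, val salary: Long)
  class NewEmployee(id: Long, name: String, salary: Long, val dept: String) extends Employee(id, name, salary)

  // An EmployeeVector has only id/name/salary columns: storing a NewEmployee in it
  // silently drops the DEPT field, so the "optimized" representation is no longer
  // equivalent to Vector[Employee].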
Open World Assumption ● Globally anything can happen ● Locally you have full control: – Make class Employee final or – Limit the transformation to code that uses Employee How? Using Scopes!
Scopes ● Can wrap statements, methods, even entire classes – Inlined immediately after the parser – Definitions are visible outside the "scope" ● Mark locally closed parts of the code – Incoming/outgoing values go through conversions – You can reject unexpected values
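A sketch of what a scope can look like, reusing the transform marker and the VectorOfEmployeeOpt transformation object shown later in the talk (the method body is illustrative):

  transform(VectorOfEmployeeOpt) {
    def totalSalary(employees: Vector[Employee]): Long =
      employees.map(_.salary).sum   // inside the scope, Vector[Employee] uses the EmployeeVector layout
  }
  // callers outside the scope keep passing ordinary Vector[Employee] values;
  // conversions are inserted automatically at the scope boundary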
Composition ● Code can be – Left untransformed (using the original representation) – Transformed using different representations ● Calling combinations to consider: original code calling original code, original code calling transformed code (and back), transformed code calling code under the same transformation, transformed code calling code under a different transformation
Composition ● Original code calling original code: the easy one, do nothing
Composition ● Original code calling transformed code (or back): automatically introduce conversions between values in the two representations, e.g. EmployeeVector → Vector[Employee] or back
Composition ● Transformed code calling code under the same transformation: the hard one, do not introduce any conversions, even across separate compilation
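The boundary conversions could look roughly like the helpers below (hypothetical code, continuing the EmployeeVector sketch; in practice the compiler inserts the calls for you):

  object EmployeeVector {
    // original representation → transformed representation
    def fromVector(v: Vector[Employee]): EmployeeVector =
      new EmployeeVector(v.map(_.id).toArray, v.map(_.name).toArray, v.map(_.salary).toArray)
    // transformed representation → original representation
    def toVector(ev: EmployeeVector): Vector[Employee] =
      ev.ids.indices.map(i => new Employee(ev.ids(i), ev.names(i), ev.salaries(i))).toVector
  }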
Scopes
trait Printer[T] { def print(elements: Vector[T]): Unit }
class EmployeePrinter extends Printer[Employee] { def print(employee: Vector[Employee]) = ... }
Method print in the class implements method print in the trait
Scopes
trait Printer[T] { def print(elements: Vector[T]): Unit }
transform(VectorOfEmployeeOpt) {
  class EmployeePrinter extends Printer[Employee] { def print(employee: Vector[Employee]) = ... }
}
The signature of method print changes according to the transformation → it no longer implements the trait. Taken care of by the compiler for you!
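One way a compiler can keep the trait satisfied is to emit a bridge: a method with the original signature that converts the argument and forwards to the transformed method (an illustrative sketch, not necessarily the plugin's exact mechanism):

  class EmployeePrinter extends Printer[Employee] {
    // transformed method, operating on the efficient representation
    def print(employees: EmployeeVector): Unit = { /* ... */ }
    // generated bridge, so Printer[Employee] is still implemented
    def print(elements: Vector[Employee]): Unit = print(EmployeeVector.fromVector(elements))
  }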
Retrofitting value class status: Tuples in Scala are specialized but are still objects (not value classes) = not as optimized as they could be. On the heap, (3, 5) is a reference to an object with a header plus the fields 3 and 5; as a value class it can be encoded into a single long, (3L << 32) + 5. 14x faster, lower heap requirements.
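A minimal sketch of the encoding with a hand-written value class (illustrative; the talk's transformation derives this automatically for tuples):

  class IntPair(val repr: Long) extends AnyVal {
    def _1: Int = (repr >>> 32).toInt          // high 32 bits
    def _2: Int = repr.toInt                   // low 32 bits
  }
  object IntPair {
    def apply(a: Int, b: Int): IntPair =
      new IntPair((a.toLong << 32) | (b.toLong & 0xFFFFFFFFL))   // pack both ints into one long
  }
  // IntPair(3, 5) carries no heap object: it is just the long (3L << 32) + 5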
Spark ● Optimizations – DataFrames do deforestation – DataFrames do predicate push-down – DataFrames do code generation ● Code is specialized for the data representation ● Functions are specialized for the data representation
Spark ● Optimizations – RDDs do not do deforestation – RDDs do not do predicate push-down – RDDs do not do code generation ● Code is not specialized for the data representation ● Functions are not specialized for the data representation. This is what makes them slower.
Spark ● Optimizations – Datasets do deforestation – Datasets do predicate push-down – Datasets do code generation ● Code is specialized for the data representation ● But user functions are not specialized for the data representation, which is why Datasets are only mid-way
User Functions: the data is stored serialized/encoded; to run a user function f: X → Y, each element must be decoded, passed through f, and encoded again. The goal is a modified user function (automatically derived by the compiler) that works on the encoded data.
User Functions: the modified user function (automatically derived by the compiler) maps encoded data directly to encoded data, with no decode/encode steps. Nowhere near as simple as it looks.
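An illustrative sketch of the difference (hypothetical names, not Spark's actual encoder API):

  class Employee(val id: Long, val name: String, val salary: Long)
  // user function, written against the object representation
  val raise: Employee => Employee = e => new Employee(e.id, e.name, e.salary + 1000)
  // running it over encoded data today conceptually means, per element:
  //   encode(raise(decode(encodedRow)))
  // the compiler-derived version would instead update only the salary column,
  // never materializing Employee objects, e.g. (hypothetical):
  //   salaryColumn.map(_ + 1000L)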
Challenge: Transformation not possible ● Example: Calling outside (untransformed) method ● Solution: Issue compiler warnings – Explain why it's not possible: due to the method call – Suggest how to fix it: enclose the method in a scope ● Reuse the machinery in miniboxing scala-miniboxing.org
Challenge: Internal API changes ● Spark internals rely on Iterator[T] – Requires materializing values – Needs to be replaced throughout the code base by rather complex buffers ● Solution: Extensive refactoring/rewrite
Challenge: Automation ● Existing code should run out of the box ● Solution: – Adapt data-centric metaprogramming to Spark – Trade generality for simplicity – Do the right thing for most of the cases Where are we now?
Prototype Hack ● Modified version of Spark core – RDD data representation is configurable ● It's very limited: – Custom data repr. only in map, filter and flatMap – Otherwise we revert to costly objects – Large parts of the automation still need to be done
Conclusion ● Object-oriented composition → inefficient representation ● Solution: data-centric metaprogramming – Opaque data → Structured data – Is it possible? Yes. – Is it easy? Not really. – Is it worth it? You tell me!
Deforestation and Language Semantics ● Notice that we changed language semantics: – Before: collections were eager – After: collections are lazy – This can lead to effects reordering
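A small illustration of the reordering with plain Scala collections (not the transformation itself):

  val xs = Vector(1, 2, 3)
  // eager: all the map effects run first, then all the filter effects
  xs.map { x => println(s"map $x"); x * 2 }.filter { x => println(s"filter $x"); x > 2 }
  // lazy (deforested): the effects interleave, one element at a time
  xs.view.map { x => println(s"map $x"); x * 2 }.filter { x => println(s"filter $x"); x > 2 }.toVector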
Deforestation and Language Semantics ● Such transformations are only acceptable with programmer consent – JIT compilers/staged DSLs can't change semantics – metaprogramming (macros) can, but it should be documented/opt-in
Code Generation ● Also known as – Deep Embedding – Multi-Stage Programming ● Awesome speedups, but restricted to small DSLs ● SparkSQL uses code gen to improve performance – By 2-4x over Spark
Low-level Optimizers ● Java JIT Compiler – Access to the low-level code – Can assume a (local) closed world – Can speculate based on profiles ● Best optimizations break semantics – You can't do this in the JIT compiler! – Only the programmer can decide to break semantics
Scala Macros ● Many optimizations can be done with macros – :) Lots of power – :( Lots of responsibility ● Scala compiler invariants ● Object-oriented model ● Modularity ● Can we restrict macros so they're safer? – Data-centric metaprogramming