Slide 1

Slide 1 text

Highest-performance stream processing, in (Meta)OCaml and Scala 3
Aggelos Biboudis, LAMP/EPFL
ML勉強会 (ML study group), Tokyo, 6 July 2019

Slide 2

Slide 2 text

Who am I?
• Aggelos Biboudis (Άγγελος Μπιμπούδης)
• PhD @ University of Athens, at PLAST (led by Yannis Smaragdakis); research on meta-programming for streams and on programming languages
• Postdoc @ EPFL (LAMP, led by Martin Odersky), working on Scala 3
• Research focus on meta-programming
twitter.com/biboudis

Slide 3

Slide 3 text

On the agenda
• Strymonas: a library design for stream fusion, written in MetaOCaml, Scala 2/LMS and Scala 3
  [based on the POPL'17 talk "Stream Fusion, to Completeness"]
  (people: Oleg Kiselyov, Aggelos Biboudis, Nick Palladinos, Yannis Smaragdakis)
• A glimpse of Scala 3 meta-programming
  (the subset of features used to port Strymonas)
  [based on the ScalaDays 2019 talk by Nicolas Stucki]

Slide 4

Slide 4 text

Back in 2014: A pet project

Slide 5

Slide 5 text

2014: Experimenting with push-based streams in MLton

“While we wait for the compiler to learn that optimization, we can verify that things would be better if we didn't conflate the driving of the "Stream.map" on the v2 and the "Stream.flatMap" on the v1 [about the cart benchmark]. To do so, make a code-clone of "Stream.ofArray" and use one for v1 and the other for v2” [to help MLton generate nested loops]

– Matthew Fluet, [MLton-user] mailing list, 2014
https://sourceforge.net/p/mlton/mailman/message/33032176/

Slide 6

Slide 6 text

Goals
To design a generative library for fast streams …
• stream of elements, functionally
• lazy, finite & infinite, sequential, one-shot => bulk processing
• isolating the task: in-memory, no batching, no windowing, no tensors, no forking
… that supports a wide range and complex combinations of operators …
… and generates loop-based, fused code with zero allocations.

Slide 7

Slide 7 text

strymonas.github.io

Slide 8

Slide 8 text

Guaranteed Performance
✓ no intermediate results, no buffers
✓ no closure creation
✓ function calls should get inlined
✓ no deconstruction and construction of tuples at run-time
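
To make the checklist concrete, here is a minimal plain-OCaml sketch. It is not strymonas code, and the function names are made up for illustration. It contrasts a pipeline built from standard-library combinators, which allocates an intermediate array and calls closures, with the kind of loop a fusing library aims to generate.

```ocaml
(* Unfused: Array.map allocates an intermediate array, and both
   combinators call a closure per element. *)
let sum_squares_unfused (arr : int array) : int =
  Array.fold_left ( + ) 0 (Array.map (fun x -> x * x) arr)

(* Hand-fused: one loop, no intermediate array, no closure calls.
   This is the shape of code the library is meant to generate. *)
let sum_squares_fused (arr : int array) : int =
  let s = ref 0 in
  for i = 0 to Array.length arr - 1 do
    let el = arr.(i) in
    s := !s + el * el
  done;
  !s

let () =
  let arr = [| 0; 1; 2; 3; 4 |] in
  assert (sum_squares_unfused arr = sum_squares_fused arr);
  Printf.printf "%d\n" (sum_squares_fused arr)  (* prints 30 *)
```

Both functions compute the same result; the difference is only in what happens at run-time.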

Slide 9

Slide 9 text

Staging Stream Fusion
[slide figure; labels: "staging", "also our main example for this talk"]

Slide 10

Slide 10 text

Staging Stream Fusion
[slide figure; label: "and much more complex..."]

Slide 11

Slide 11 text

Benchmarks: Scala/LMS

Slide 12

Slide 12 text

Benchmarks: OCaml/BER MetaOCaml

Slide 13

Slide 13 text

Multi-Stage Programming
• think of code templates
• quotes: brackets to create well-{formed, scoped, typed} templates
    let c: int code = .< 1 + 2 >.
• splices: create holes
    let cf x = .< .~x + .~x >.
• synthesize code
    cf c ~> .< (1 + 2) + (1 + 2) >.
• we can generate code at run-time
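
For readers without a MetaOCaml installation, the quote/splice example above can be imitated in plain OCaml with a toy expression tree. This is only a sketch: the `expr` type and the `show` and `eval` helpers are assumptions made for illustration, and, unlike real brackets, a hand-rolled tree gives no static guarantee that the generated code is well-typed or well-scoped.

```ocaml
(* A toy stand-in for MetaOCaml's 'a code: an expression tree. *)
type expr =
  | Int of int
  | Add of expr * expr

(* Plays the role of  let c = .< 1 + 2 >.  *)
let c : expr = Add (Int 1, Int 2)

(* Plays the role of  let cf x = .< .~x + .~x >. :
   the argument is spliced into both holes. *)
let cf (x : expr) : expr = Add (x, x)

(* Pretty-print the generated code. *)
let rec show = function
  | Int n -> string_of_int n
  | Add (a, b) -> Printf.sprintf "(%s + %s)" (show a) (show b)

(* "Run" the generated code. *)
let rec eval = function
  | Int n -> n
  | Add (a, b) -> eval a + eval b

let () =
  print_endline (show (cf c));        (* prints ((1 + 2) + (1 + 2)) *)
  Printf.printf "%d\n" (eval (cf c))  (* prints 6 *)
```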

Slide 14

Slide 14 text

Step 0: Naive Staging
• start from an F-coalgebra signature (an Unfold)
• sprinkle the code with staging annotations (binding-time analysis)

    type α stream = ∃σ. σ code * (σ code → (α,σ) stream_shape code)
    type ('a,'z) stream_shape = | Nil | Cons of 'a * 'z

Slide 15

Slide 15 text

Step 0: Naive Staging

    let map : ('a code -> 'b code) -> 'a stream -> 'b stream =
      fun f (s,step) ->
        let new_step = fun s -> .<
          match .~(step s) with
          | Nil -> Nil
          | Cons (a,t) -> Cons (.~(f .<a>.), t)>.
        in (s,new_step);;
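
For comparison, here is a sketch of the unstaged unfold-style stream that Step 0 annotates, written in plain OCaml. The existential ∃σ is encoded with a GADT constructor; `of_arr` and `fold` are illustrative definitions that the slide does not show, so treat their exact form as an assumption.

```ocaml
type ('a, 'z) stream_shape = Nil | Cons of 'a * 'z

(* A stream is a hidden state plus a step function (an unfold). *)
type 'a stream = Stream : 's * ('s -> ('a, 's) stream_shape) -> 'a stream

(* State is (index, array); each step yields one element. *)
let of_arr (arr : 'a array) : 'a stream =
  let step (i, a) =
    if i < Array.length a then Cons (a.(i), (i + 1, a)) else Nil
  in
  Stream ((0, arr), step)

(* map wraps the stepper and rebuilds the stream_shape value. *)
let map f (Stream (s, step)) =
  let new_step s =
    match step s with
    | Nil -> Nil
    | Cons (a, t) -> Cons (f a, t)
  in
  Stream (s, new_step)

(* The consumer drives the stepper until it returns Nil. *)
let fold f z (Stream (s, step)) =
  let rec loop z s =
    match step s with
    | Nil -> z
    | Cons (a, t) -> loop (f z a) t
  in
  loop z s

let () =
  let r = of_arr [| 0; 1; 2; 3; 4 |] |> map (fun x -> x * x) |> fold ( + ) 0 in
  Printf.printf "%d\n" r  (* prints 30 *)
```

Every step here allocates a `Cons` cell and a state pair, which is the overhead the later steps set out to remove.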

Slide 16

Slide 16 text

Result (step 0)

    let rec loop_1 z_2 s_3 =
      match
        match
          match s_3 with                                  (* of_arr *)
          | (i_4, arr_5) ->
              if i_4 < (Array.length arr_5)
              then Cons ((arr_5.(i_4)),((i_4 + 1), arr_5))
              else Nil
        with                                              (* map *)
        | Nil -> Nil
        | Cons (a_6,t_7) -> Cons ((a_6 * a_6), t_7)
      with                                                (* sum *)
      | Nil -> z_2
      | Cons (a_8,t_9) -> loop_1 (z_2 + a_8) t_9

no intermediate ✓   function inlining ✓   various overheads ✗ ✗ ✗

Slide 17

Slide 17 text

Step 1: fusing the stepper
• the stepper has known structure, though!

    let map : ('a code -> 'b code) -> 'a st_stream -> 'b st_stream =
      fun f (s, step) ->
        let new_step s k = step s @@ function
          | Nil -> k Nil
          | Cons (a,t) -> .<let a' = .~(f .<a>.) in .~(k @@ Cons (.<a'>., t))>.
        in (s, new_step) ;;

stream_shape is static and factored out of the dynamic code

* Anders Bondorf. 1992. Improving binding times without explicit CPS-conversion. In LFP ’92
* Oleg Kiselyov, Why a program in CPS specializes better, http://okmij.org/ftp/meta-programming/#bti
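
The continuation-passing shape of the stepper can be tried out without staging. The sketch below is an unstaged, plain-OCaml approximation: the `stepper` record with its polymorphic answer type, and the `of_arr` and `fold` definitions, are assumptions added for illustration. In the staged version the `stream_shape` value is consumed at generation time; here it only shows how the continuation `k` replaces returning a `stream_shape`.

```ocaml
type ('a, 'z) stream_shape = Nil | Cons of 'a * 'z

(* The stepper hands its result to a continuation k;
   'w is the continuation's answer type. *)
type ('a, 's) stepper =
  { step : 'w. 's -> (('a, 's) stream_shape -> 'w) -> 'w }

type 'a st_stream = Stream : 's * ('a, 's) stepper -> 'a st_stream

let of_arr (arr : 'a array) : 'a st_stream =
  let step (i, a) k =
    if i < Array.length a then k (Cons (a.(i), (i + 1, a))) else k Nil
  in
  Stream ((0, arr), { step })

(* map no longer rebuilds and re-matches a result:
   it passes the transformed shape straight to k. *)
let map f (Stream (s, stepper)) =
  let new_step s k =
    stepper.step s (function
      | Nil -> k Nil
      | Cons (a, t) -> k (Cons (f a, t)))
  in
  Stream (s, { step = new_step })

let fold f z (Stream (s, stepper)) =
  let rec loop z s =
    stepper.step s (function
      | Nil -> z
      | Cons (a, t) -> loop (f z a) t)
  in
  loop z s

let () =
  let r = of_arr [| 0; 1; 2; 3; 4 |] |> map (fun x -> x * x) |> fold ( + ) 0 in
  Printf.printf "%d\n" r  (* prints 30 *)
```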

Slide 18

Slide 18 text

Result

    let rec loop_1 z_2 s_3 =
      match s_3 with
      | (i_4, arr_5) ->
          if i_4 < (Array.length arr_5)
          then
            let el_6 = arr_5.(i_4) in
            let a'_7 = el_6 * el_6 in
            loop_1 (z_2 + a'_7) ((i_4 + 1), arr_5)
          else z_2

stepper inlined ✓   pattern matching ✗ ✗

Slide 19

Slide 19 text

Step 2: fusing the state
• no pair-allocation in loop: state passed in and mutated

    (int * α array) code ~> int ref code * α array code

    let of_arr : 'a array code -> 'a st_stream =
      let init arr k = .<
        let i = ref 0 and arr = .~arr in
        .~(k (.<i>.,.<arr>.))>.
      and step (i,arr) k = .<
        if !(.~i) < Array.length .~arr
        then
          let el = (.~arr).(!(.~i)) in
          incr .~i;
          .~(k @@ Cons (.<el>., ()))
        else .~(k Nil)>.
      in fun arr -> (init arr,step)

Slide 20

Slide 20 text

Result

    let i_8 = ref 0 and arr_9 = [|0;1;2;3;4|] in
    let rec loop_10 z_11 =
      if ! i_8 < Array.length arr_9
      then
        let el_12 = arr_9.(! i_8) in
        incr i_8;
        let a'_13 = el_12 * el_12 in
        loop_10 (z_11+a'_13)
      else z_11

no pattern matching ✓   recursion ✗ ✗ ✗

Slide 21

Slide 21 text

Step 3: generating imperative loops
start with For-form and if needed transform to Unfold

    let of_arr : 'a array code -> 'a stream = fun arr ->
      let init k = ..)>.
      and upper_bound arr = ..
      and index arr i k = ..)>.
      in (init, For {upb;index})
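
The idea of starting from a For-form and falling back to an Unfold-form can be sketched without staging. The code below is a simplified illustration under stated assumptions: state lives in closures instead of an explicit σ, there is no `code` type, and the field meanings (`upb` is the last valid index, `term ()` is true while elements remain) are choices made for this example only.

```ocaml
(* Simplified, unstaged producers (illustrative, not the library's type). *)
type 'a producer =
  | For of { upb : int; index : int -> 'a }
  | Unfold of { term : unit -> bool; step : unit -> 'a }

(* Arrays start out in For-form: indices 0..upb. *)
let of_arr (arr : 'a array) : 'a producer =
  For { upb = Array.length arr - 1; index = (fun i -> arr.(i)) }

(* Convert the For-form to the more general Unfold-form. *)
let for_unfold : 'a producer -> 'a producer = function
  | For { upb; index } ->
      let i = ref 0 in
      Unfold { term = (fun () -> !i <= upb);
               step = (fun () -> let a = index !i in incr i; a) }
  | p -> p

(* The consumer picks the loop form from the producer's shape. *)
let fold (f : 'z -> 'a -> 'z) (z : 'z) (p : 'a producer) : 'z =
  let acc = ref z in
  (match p with
   | For { upb; index } ->
       for i = 0 to upb do acc := f !acc (index i) done
   | Unfold { term; step } ->
       while term () do acc := f !acc (step ()) done);
  !acc

let () =
  let arr = [| 0; 1; 2; 3; 4 |] in
  let sq x = x * x in
  let sum p = fold (fun z a -> z + sq a) 0 p in
  Printf.printf "%d %d\n" (sum (of_arr arr)) (sum (for_unfold (of_arr arr)))
  (* prints 30 30 *)
```

A for-loop is available only while the producer stays in For-form; an operator that needs early termination or an unknown element count would first apply a conversion like `for_unfold`.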

Slide 22

Slide 22 text

Result

    let s_1 = ref 0 in
    let arr_2 = [|0;1;2;3;4|] in
    for i_3 = 0 to (Array.length arr_2) - 1 do
      let el_4 = arr_2.(i_3) in
      let t_5 = el_4 * el_4 in
      s_1 := !s_1 + t_5
    done;
    !s_1

loop-based/fused ✓

Slide 23

Slide 23 text

Final Datatype

    type card_t = AtMost1 | Many

    type (α,σ) producer_t =
      | For    of {upb:   σ → int code;
                   index: σ → int code → (α → unit code) → unit code}
      | Unfold of {term: σ → bool code;
                   card: card_t;
                   step: σ → (α → unit code) → unit code}
    and α producer =
      ∃σ. (∀ω. (σ → ω code) → ω code) * (α,σ) producer_t
    and α st_stream =
      | Linear of α producer
      | Nested of ∃β. β producer * (β → α st_stream)
    and α stream = α code st_stream

• Linearity (filter and flat_map)
• Sub-ranging and infinite streams (take and unfold)
• Fusing parallel streams (zip)

Slide 24

Slide 24 text

strymonas ported in Scala 3
• As a test case on the master branch
• As a case study of the paper A Practical Unification of Multi-stage Programming and Macros, GPCE'18, Boston (Nicolas Stucki, Aggelos Biboudis, Martin Odersky)

Slide 25

Slide 25 text

Part 2: Metaprogramming in Scala 3

Slide 26

Slide 26 text

Scala 2.x
• Semantic APIs/Scala 2.x macros (Eugene Burmako)
  • Scala Reflect: thin wrapper over compiler internals
    • but portability problem
    • but learning curve
• Semantic APIs/tool writers:
  • Scalameta/SemanticDB (Eugene Burmako, Ólafur Páll Geirsson)

Slide 27

Slide 27 text

Scala 3.x/Dotty
• Dotty gets dedicated language features for meta-programming
• Dotty Semantic API: TASTy, a portable format
• Dotty Macros:
  • Generative à la MetaOCaml
  • Analytical:
    • quote patterns
    • à la scala-reflect (based on tasty-reflect)
• Semantic APIs/tool writers
  • SemanticDB experimental support in Dotty

Slide 28

Slide 28 text

What we will see today
(https://dotty.epfl.ch/docs/reference/metaprogramming/toc.html)
• Inline: inlining as a meta-programming feature
• Match Types: computing new types
• Generative metaprogramming: Quotes & Splices
  • Compile-time =:= Macros
  • Run-time =:= Staging
• Analytical metaprogramming:
  • Pattern matching using quotes
  • Unseal a quote and analyse with TASTy Reflect (typed AST API)
needed for strymonas

Slide 29

Slide 29 text

Inline Definitions
• Guaranteed inline
• Potentially recursive
• Potentially type specializing return type
• Potentially macro entry point

    inline def log[T](msg: String)(op: => T): T = ...

Slide 30

Slide 30 text

Inline Definitions

    object Logger {
      var indent = 0
      inline def log[T](msg: String)(op: => T): T = {
        println(s"${" " * indent}start $msg")
        indent += 1
        val result = op
        indent -= 1
        println(s"${" " * indent}$msg = $result")
        result
      }
    }

Slide 31

Slide 31 text

Inline Definitions

    Logger.log("123L^5") {
      power(123L, 5)
    }

expands to

    val msg = "123L^5"
    println(s"${" " * Logger.indent}start $msg")
    Logger.indent += 1
    val result = power(123L, 5)
    Logger.indent -= 1
    println(s"${" " * Logger.indent}$msg = $result")
    result

Slide 32

Slide 32 text

Recursive Inline

    inline def power(x: Long, n: Int): Long = {
      if (n == 0) 1L
      else if (n % 2 == 1) x * power(x, n - 1)
      else {
        val y: Long = x * x
        power(y, n / 2)
      }
    }

power(x, 10) expands to

    val x = expr
    val y = x * x
    y * {
      val y2 = y * y
      val y3 = y2 * y2
      y3 * 1L
    }

power(x, n) with n unknown at compile time: boom (the recursive inlining never bottoms out)

Slide 33

Slide 33 text

Inline Parameters

    inline def power(x: Long, inline n: Int): Long = ...

The argument must be a known constant value
• Primitive values: Boolean, Int, Double, String, …
• Some case classes: Option, ...

Slide 34

Slide 34 text

Inline Conditional

    inline if (n == 0) 1L
    else inline if (n % 2 == 1) x * power(x, n - 1)

• the condition must reduce to a constant at compile time
• only one branch will remain

Slide 35

Slide 35 text

Inline Match

    trait Nat
    case object Zero extends Nat
    case class Succ[N <: Nat](n: N) extends Nat

    inline def toInt(n: Nat): Int = inline n match {
      case Zero => 0
      case Succ(n1) => toInt(n1) + 1
    }

    val natTwo = toInt(Succ(Succ(Zero))) // effectively val natTwo: Int = 2

• one case must match the scrutinee
• only one case will remain

Slide 36

Slide 36 text

Specializing inline (example 1)

    inline def toInt(n: Nat) <: Int = ...

    val natTwo = toInt(Succ(Succ(Zero)))
    val natTwo: 2 = 2   // effectively this
    val natTwo: Int = 2 // instead of this

Slide 37

Slide 37 text

Specializing inline (example 2)

    class A
    class B extends A {
      def meth(): Unit = ...
    }

    inline def choose(b: Boolean) <: A =
      inline if (b) new A() else new B()

    val a: A = choose(true)
    val b: B = choose(false)

    // error: meth() not defined on A
    choose(true).meth()
    // Ok
    choose(false).meth()

Slide 38

Slide 38 text

Inline implicit matches

    inline def setFor[T]: Set[T] = implicit match {
      case ord: Ordering[T] => new TreeSet[T]
      case _ => new HashSet[T]
    }

• implicit search available in a functional context
• so if an Ordering[String] can be summoned, setFor[String] gives a TreeSet

Slide 39

Slide 39 text

Quotes and Splices

            quotes                        splices
    terms   val expr: Expr[T] = '{ e }    '{ val e: T = ${ expr } }
    types   val t: Type[T] = '[ T ]       '{ val e2: ${ t } = e }

Slide 40

Slide 40 text

Phase Consistency Check
"For any free variable reference, the number of quoted scopes and the number of spliced scopes between the reference and its definition must be equal"
(inspired by MetaML’s levels/cross-safety check)

    def powerExpr(x: Expr[Long], n: Int): Expr[Long] = {
      if (n == 0) '{ 1L }
      else if (n % 2 == 1) '{ $x * ${ powerExpr(x, n - 1) } }
      else '{
        val y: Long = $x * $x
        ${ powerExpr('y, n / 2) }
      }
    }
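
The structure of powerExpr (a static exponent driving the recursion, a dynamic base that is only ever code) can be mirrored in plain OCaml over a toy expression tree. This is a sketch under stated assumptions: the `expr` type, `fresh`, `show` and `eval` are invented for illustration, it uses `int` rather than `Long`, and nothing here enforces the phase consistency check that real quotes and splices get from the compiler.

```ocaml
(* Toy code representation (a stand-in for Expr[Long] / int code). *)
type expr =
  | Var of string
  | Lit of int
  | Mul of expr * expr
  | Let of string * expr * expr

(* Fresh names for generated let-bindings. *)
let counter = ref 0
let fresh () = incr counter; Printf.sprintf "y%d" !counter

(* Mirrors powerExpr: n is known now, x is a piece of code for later. *)
let rec power_expr (x : expr) (n : int) : expr =
  if n = 0 then Lit 1
  else if n mod 2 = 1 then Mul (x, power_expr x (n - 1))
  else
    let y = fresh () in
    Let (y, Mul (x, x), power_expr (Var y) (n / 2))

(* Pretty-print the generated code. *)
let rec show = function
  | Var v -> v
  | Lit n -> string_of_int n
  | Mul (a, b) -> Printf.sprintf "(%s * %s)" (show a) (show b)
  | Let (v, e, body) ->
      Printf.sprintf "(let %s = %s in %s)" v (show e) (show body)

(* Evaluate generated code in an environment of variable bindings. *)
let rec eval env = function
  | Var v -> List.assoc v env
  | Lit n -> n
  | Mul (a, b) -> eval env a * eval env b
  | Let (v, e, body) -> eval ((v, eval env e) :: env) body

let () =
  let code = power_expr (Var "x") 10 in
  print_endline (show code);
  Printf.printf "%d\n" (eval [ ("x", 2) ] code)  (* prints 1024 *)
```

The printed code contains no conditionals on `n`: all tests on the exponent ran during generation, which is the same division of labour as in the Scala version.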

Slide 41

Slide 41 text

Macro definition through inline and top-level splice

    inline def power(x: Long, inline n: Int): Long =
      ${ powerExpr('x, n) }

    private def powerExpr(x: Expr[Long], n: Int): Expr[Long] = ...

Slide 42

Slide 42 text

Inline defs and Macros side-by-side

Inline:

    inline def power(x: Long, inline n: Int): Long = {
      inline if (n == 0) 1L
      else inline if (n % 2 == 1) x * power(x, n - 1)
      else {
        val y: Long = x * x
        power(y, n / 2)
      }
    }

• Conditions must be constant foldable by the compiler
• Low syntactic overhead

Macro:

    def powerExpr(x: Expr[Long], n: Int): Expr[Long] = {
      if (n == 0) '{ 1L }
      else if (n % 2 == 1) '{ $x * ${ powerExpr(x, n - 1) } }
      else '{
        val y: Long = $x * $x
        ${ powerExpr('y, n / 2) }
      }
    }

• Conditions must be constant foldable by the user
• Arbitrary compiled code

Slide 43

Slide 43 text

Pattern Matching over Quotes

    inline def swap(tuple: =>(Int, Long)): (Long, Int) =
      ${ swapExpr('tuple) }

    val x = (1, 2L)
    swap(x)        // inlines: x.swap
    swap((3, 4L))  // inlines: (4L, 3)

    def swapExpr(tuple: Expr[(Int, Long)]) given Reflection: Expr[(Long, Int)] = {
      tuple match {
        case '{ ($x1, $x2) } => '{ ($x2, $x1) }
        case _ => '{ $tuple.swap }
      }
    }

Slide 44

Slide 44 text

Staging = macros + run

    // make available the necessary toolbox for runtime code generation
    implicit val toolbox: scala.quoted.Toolbox =
      scala.quoted.Toolbox.make(getClass.getClassLoader)

    val f: Array[Int] => Int = run {
      val stagedSum = '{ (arr: Array[Int]) => ${sum('arr)}}
      println(stagedSum.show)
      stagedSum
    }

    f.apply(Array(1, 2, 3)) // Returns 6

Slide 45

Slide 45 text

Basic points
• stream fusion is a domain-specific optimization
• domain-specific optimizations are better tackled outside the general-purpose compiler
• multi-stage programming is not a trivial task
• Scala 3 offers dedicated support for metaprogramming based on a portable format; MetaOCaml projects can be ported with minimal effort

Slide 46

Slide 46 text

Thank you for your attention
twitter.com/biboudis