Highest-performance stream processing, in (Meta)OCaml and Scala 3

This talk, which I gave at the 6th ML Day (Meta Language) event in Tokyo, introduces strymonas, a library design and implementation of stream fusion using multi-stage programming, along with part of the new metaprogramming facilities proposed for Scala 3.

The original implementation of strymonas was done in BER MetaOCaml and in Scala using LMS. It has since been ported to Scala 3. In the talk I introduce the key features of the new metaprogramming facilities, focusing on what was needed for the port (inlining, staging/runtime code generation).

Strymonas is joint work with Oleg Kiselyov, Nick Palladinos and Yannis Smaragdakis. The talk is based on my POPL '17 talk about Strymonas and the Scala Days '19 talk by Nicolas Stucki. Nicolas Stucki works at LAMP/EPFL, implementing the quotes, splices and inlining mechanisms that form part of the metaprogramming facilities in Scala 3.

https://ml-lang.connpass.com/event/136687/

Aggelos Biboudis

July 06, 2019

Transcript

  1. 2.

    Who am I?
    • Aggelos Biboudis (Άγγελος Μπιμπούδης)
    • PhD @ University of Athens at PLAST, led by Yannis Smaragdakis; did research on meta-programming for streams and programming languages
    • Postdoc @ EPFL, led by Martin Odersky, on Scala 3
    • Research focus on meta-programming
    twitter.com/biboudis
  2. 3.

    On the agenda
    • Strymonas: a library design for stream fusion written in MetaOCaml, Scala 2/LMS and Scala 3
      [talk based on the "Stream Fusion, to Completeness" talk, POPL '17]
      (people: Oleg Kiselyov, Aggelos Biboudis, Nick Palladinos, Yannis Smaragdakis)
    • A glimpse of Scala 3 meta-programming
      (the subset of features used to port Strymonas)
      [talk based on the Scala Days '19 talk by Nicolas Stucki]
  3. 5.

    2014: Experimenting with push-based streams in MLton

    “While we wait for the compiler to learn that optimization, we can verify that things would be better if we didn't conflate the driving of the "Stream.map" on the v2 and the "Stream.flatMap" on the v1 [about the cart benchmark]. To do so, make a code-clone of "Stream.ofArray" and use one for v1 and the other for v2” [to assist MLton generate nested loops]

    – Matthew Fluet, [MLton-user] mailing list, 2014
    https://sourceforge.net/p/mlton/mailman/message/33032176/
  4. 6.

    Goals
    To design a generative library for fast streams …
    • stream of elements, functionally
    • lazy, finite & infinite, sequential, one-shot => bulk processing
    • isolating the task: in-memory, no batching, no windowing, no tensors, no forking
    … that supports a wide range and complex combinations of operators …
    … and generates loop-based, fused code with zero allocations.
  5. 8.

    Guaranteed Performance
    ✓ no intermediate results, no buffers
    ✓ no closure creation
    ✓ function calls should get inlined
    ✓ no deconstructions and constructions of tuples at run-time
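To make the checklist concrete, here is a minimal sketch in plain OCaml (no staging) that contrasts a pipeline built from standard-library combinators with the kind of loop a fused stream is meant to end up as. The names `sum_squares_pipeline` and `sum_squares_fused` are invented for this illustration and are not part of strymonas.

```ocaml
(* Pipeline style: Array.map allocates an intermediate array, and both
   combinators call a closure for every element. *)
let sum_squares_pipeline (arr : int array) : int =
  Array.fold_left ( + ) 0 (Array.map (fun x -> x * x) arr)

(* Hand-fused style: a single loop, no intermediate array, and the
   squaring is written inline in the loop body. *)
let sum_squares_fused (arr : int array) : int =
  let s = ref 0 in
  for i = 0 to Array.length arr - 1 do
    let el = arr.(i) in
    s := !s + (el * el)
  done;
  !s

let () =
  let arr = [| 0; 1; 2; 3; 4 |] in
  Printf.printf "%d %d\n" (sum_squares_pipeline arr) (sum_squares_fused arr)
```

Both functions compute the same sum; the goal of the library is to let users write the first style while getting code shaped like the second.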
  6. 13.

    Multi-Stage Programming
    • think of code templates
    • quotes: brackets to create well-{formed, scoped, typed} templates
        let c: int code = .< 1 + 2 >.
    • splices: create holes
        let cf x = .< .~x + .~x >.
    • synthesize code
        cf c ~> .< (1 + 2) + (1 + 2) >.
    • we can generate code at run-time
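The bracket and escape syntax above needs BER MetaOCaml. As a rough approximation that runs in plain OCaml, the sketch below models a code template as a small expression tree; the `code` type and the `show` and `eval` helpers are hypothetical stand-ins, not MetaOCaml's `'a code`, and they give none of its typing or scoping guarantees.

```ocaml
(* Hypothetical stand-in for MetaOCaml's 'a code: a tiny expression AST. *)
type code =
  | Int of int
  | Add of code * code

(* Plays the role of .< 1 + 2 >. *)
let c : code = Add (Int 1, Int 2)

(* Plays the role of .< .~x + .~x >. : x is spliced into both holes. *)
let cf (x : code) : code = Add (x, x)

(* Pretty-print generated code, parenthesising nested additions. *)
let rec show = function
  | Int n -> string_of_int n
  | Add (a, b) -> atom a ^ " + " ^ atom b

and atom = function
  | Int n -> string_of_int n
  | e -> "(" ^ show e ^ ")"

(* Interpret generated code; stands in for running it. *)
let rec eval = function
  | Int n -> n
  | Add (a, b) -> eval a + eval b

let () =
  print_endline (show (cf c));
  Printf.printf "%d\n" (eval (cf c))
```

Splicing `c` into `cf` duplicates the template `1 + 2` rather than its value, which is the behaviour the `cf c` line on the slide describes.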
  7. 14.

    Step 0: Naive Staging
    • start from an F-co-algebras signature (an Unfold)
    • sprinkle the code with staging annotations (binding time analysis)
    type ('a,'z) stream_shape = | Nil | Cons of 'a * 'z
    type α stream = ∃σ. σ code * (σ code → (α,σ) stream_shape code)
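Before any staging annotations are added, the unfold signature can be written in plain OCaml. The following is a sketch under the assumption that the existential state type σ is encoded with a GADT constructor; `fold` is a helper name chosen here for the consumer.

```ocaml
type ('a, 'z) stream_shape =
  | Nil
  | Cons of 'a * 'z

(* A stream is some hidden state 's plus a step function over it. *)
type 'a stream = Stream : 's * ('s -> ('a, 's) stream_shape) -> 'a stream

(* Producer: the state is the pair (index, array). *)
let of_arr (arr : 'a array) : 'a stream =
  let step (i, arr) =
    if i < Array.length arr then Cons (arr.(i), (i + 1, arr)) else Nil
  in
  Stream ((0, arr), step)

(* Transformer: wraps the step function, leaving the state untouched. *)
let map f (Stream (s, step)) =
  let new_step s =
    match step s with
    | Nil -> Nil
    | Cons (a, t) -> Cons (f a, t)
  in
  Stream (s, new_step)

(* Consumer: drives the step function until it returns Nil. *)
let fold f z (Stream (s, step)) =
  let rec loop z s =
    match step s with
    | Nil -> z
    | Cons (a, t) -> loop (f z a) t
  in
  loop z s

let () =
  of_arr [| 0; 1; 2; 3; 4 |]
  |> map (fun x -> x * x)
  |> fold ( + ) 0
  |> Printf.printf "%d\n"
```

Every element here costs a pair allocation, a `Cons` allocation and two pattern matches, which are the overheads the staged steps then remove one by one.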
  8. 15.

    Step 0: Naive Staging
    let map : ('a code -> 'b code) -> 'a stream -> 'b stream =
      fun f (s,step) ->
        let new_step = fun s ->
          .< match .~(step s) with
             | Nil -> Nil
             | Cons (a,t) -> Cons (.~(f .<a>.), t)>.
        in (s,new_step);;
  9. 16.

    Result (step 0)
    let rec loop_1 z_2 s_3 =
      match
        match
          match s_3 with                                   (* of_arr *)
          | (i_4, arr_5) ->
            if i_4 < (Array.length arr_5)
            then Cons ((arr_5.(i_4)),((i_4 + 1), arr_5))
            else Nil
        with                                               (* map *)
        | Nil -> Nil
        | Cons (a_6,t_7) -> Cons ((a_6 * a_6), t_7)
      with                                                 (* sum *)
      | Nil -> z_2
      | Cons (a_8,t_9) -> loop_1 (z_2 + a_8) t_9
    no intermediate ✓
    function inlining ✓
    various overheads ✗ ✗ ✗
  10. 17.

    Step 1: fusing the stepper
    • stepper has known structure though!
    let map : ('a code -> 'b code) -> 'a st_stream -> 'b st_stream =
      fun f (s, step) ->
        let new_step s k = step s @@ function
          | Nil -> k Nil
          | Cons (a,t) -> .<let a' = .~(f a) in .~(k @@ Cons (.<a'>., t))>.
        in (s, new_step) ;;
    stream_shape is static and factored out of the dynamic code
    * Anders Bondorf. 1992. Improving binding times without explicit CPS-conversion. In LFP ’92
    * Oleg Kiselyov, Why a program in CPS specializes better, http://okmij.org/ftp/meta-programming/#bti
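The continuation-passing shape of the stepper can be tried out without staging. In this plain OCaml sketch the answer type is fixed to `int` for simplicity, and `stepper`, `of_arr_step`, `map_step` and `sum` are names made up for the illustration. In the staged version the continuation is applied while the code is being generated, so the match on `Nil`/`Cons` never reaches the generated code; here it still runs at run time.

```ocaml
type ('a, 'z) stream_shape =
  | Nil
  | Cons of 'a * 'z

(* A stepper in continuation-passing style: rather than returning a shape,
   it passes the shape to the continuation. Answer type fixed to int. *)
type ('a, 's) stepper = 's -> (('a, 's) stream_shape -> int) -> int

(* The state is just the current index. *)
let of_arr_step (arr : int array) : (int, int) stepper =
 fun i k -> if i < Array.length arr then k (Cons (arr.(i), i + 1)) else k Nil

(* map rewrites the shape before handing it on to the continuation. *)
let map_step (f : 'a -> 'b) (step : ('a, 's) stepper) : ('b, 's) stepper =
 fun s k ->
  step s (function
    | Nil -> k Nil
    | Cons (a, t) -> k (Cons (f a, t)))

(* The consumer supplies the final continuation, which loops. *)
let sum (step : (int, 's) stepper) (s0 : 's) : int =
  let rec loop z s =
    step s (function
      | Nil -> z
      | Cons (a, t) -> loop (z + a) t)
  in
  loop 0 s0

let () =
  let step = map_step (fun x -> x * x) (of_arr_step [| 0; 1; 2; 3; 4 |]) in
  Printf.printf "%d\n" (sum step 0)
```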
  11. 18.

    Result
    let rec loop_1 z_2 s_3 =
      match s_3 with
      | (i_4, arr_5) ->
        if i_4 < (Array.length arr_5)
        then
          let el_6 = arr_5.(i_4) in
          let a'_7 = el_6 * el_6 in
          loop_1 (z_2 + a'_7) ((i_4 + 1), arr_5)
        else z_2
    stepper inlined ✓
    pattern matching ✗ ✗
  12. 19.

    Step 2: fusing the state
    let of_arr : 'a array code -> 'a st_stream =
      let init arr k =
        .< let i = ref 0 and arr = .~arr in .~(k (.<i>.,.<arr>.))>.
      and step (i,arr) k =
        .< if !(.~i) < Array.length .~arr
           then
             let el = (.~arr).(!(.~i)) in
             incr .~i;
             .~(k @@ Cons (.<el>., ()))
           else .~(k Nil)>.
      in fun arr -> (init arr,step)
    (int * α array) code ~> int ref code * α array code
    • no pair-allocation in loop: state passed in and mutated
  13. 20.

    Result
    let i_8 = ref 0 and arr_9 = [|0;1;2;3;4|] in
    let rec loop_10 z_11 =
      if ! i_8 < Array.length arr_9
      then
        let el_12 = arr_9.(! i_8) in
        incr i_8;
        let a'_13 = el_12 * el_12 in
        loop_10 (z_11+a'_13)
      else z_11
    no pattern matching ✓
    recursion ✗ ✗ ✗
  14. 21.

    Step 3: generating imperative loops
    let of_arr : 'a array code -> 'a stream =
      fun arr ->
        let init k = .<let arr = .~arr in .~(k .<arr>.)>.
        and upb arr = .<Array.length .~arr - 1>.
        and index arr i k = .<let el = (.~arr).(.~i) in .~(k .<el>.)>.
        in (init, For {upb;index})
    start with For-form and if needed transform to Unfold
  15. 22.

    Result
    let s_1 = ref 0 in
    let arr_2 = [|0;1;2;3;4|] in
    for i_3 = 0 to (Array.length arr_2) - 1 do
      let el_4 = arr_2.(i_3) in
      let t_5 = el_4 * el_4 in
      s_1 := !s_1 + t_5
    done;
    !s_1
    loop-based/fused ✓
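The For form from step 3 can also be approximated in plain OCaml, without staging. This is only a sketch: the `producer` type below is a simplified, unstaged analogue with just an upper bound and an indexing function, and `of_arr`, `map` and `sum` are written for this illustration rather than taken from the library.

```ocaml
(* Unstaged analogue of the For producer: an inclusive upper bound and an
   indexing function that passes element i to a continuation. *)
type 'a producer = For of { upb : int; index : int -> ('a -> unit) -> unit }

let of_arr (arr : 'a array) : 'a producer =
  For { upb = Array.length arr - 1; index = (fun i k -> k arr.(i)) }

(* map only rewrites the indexing function; the loop bound is unchanged. *)
let map (f : 'a -> 'b) (For { upb; index } : 'a producer) : 'b producer =
  For { upb; index = (fun i k -> index i (fun a -> k (f a))) }

(* The consumer owns the for loop and the accumulator. *)
let sum (For { upb; index } : int producer) : int =
  let s = ref 0 in
  for i = 0 to upb do
    index i (fun a -> s := !s + a)
  done;
  !s

let () =
  of_arr [| 0; 1; 2; 3; 4 |]
  |> map (fun x -> x * x)
  |> sum
  |> Printf.printf "%d\n"
```

Because the bound and the element access are kept separate, the consumer can emit a plain `for` loop; with staging, the closures in `index` are resolved at generation time and leave no trace in the result.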
  16. 23.

    Final Datatype
    type card_t = AtMost1 | Many
    type (α,σ) producer_t =
      | For of {upb: σ → int code;
                index: σ → int code → (α → unit code) → unit code}
      | Unfold of {term: σ → bool code;
                   card: card_t;
                   step: σ → (α → unit code) → unit code}
    and α producer = ∃σ. (∀ω. (σ → ω code) → ω code) * (α,σ) producer_t
    and α st_stream =
      | Linear of α producer
      | Nested of ∃β. β producer * (β → α st_stream)
    and α stream = α code st_stream
    • Linearity (filter and flat_map)
    • Sub-ranging and infinite streams (take and unfold)
    • Fusing parallel streams (zip)
  17. 24.

    strymonas ported to Scala 3
    • As a test-case on the master branch (link)
    • As a case study in the paper A Practical Unification of Multi-stage Programming and Macros, GPCE '18, Boston (Nicolas Stucki, Aggelos Biboudis, Martin Odersky) (link)
  18. 26.

    Scala 2.x
    • Semantic APIs/Scala 2.x macros (Eugene Burmako)
      • Scala Reflect: thin wrapper over compiler internals
      • but portability problem
      • but learning curve
    • Semantic APIs/tool writers:
      • Scalameta/SemanticDB (Eugene Burmako, Ólafur Páll Geirsson)
  19. 27.

    Scala 3.x/Dotty
    • Dotty gets dedicated language features for meta-programming
    • Dotty Semantic API: TASTy, a portable format
    • Dotty Macros:
      • Generative à la MetaOCaml
      • Analytical:
        • quote patterns
        • à la scala-reflect (based on tasty-reflect)
    • Semantic APIs/tool writers
      • SemanticDB experimental support in Dotty
  20. 28.

    What we will see today (https://dotty.epfl.ch/docs/reference/metaprogramming/toc.html)
    • Inline: inlining as a meta-programming feature
    • Match Types: computing new types
    • Generative metaprogramming: Quotes & Splices
      • Compile-time =:= Macros
      • Run-time =:= Staging
    • Analytical metaprogramming:
      • Pattern matching using quotes
      • Unseal a quote and analyse with TASTy Reflect (typed AST API)
    needed for strymonas
  21. 29.

    Inline Definitions
    • Guaranteed inline
    • Potentially recursive
    • Potentially type specializing return type
    • Potentially macro entry point
    inline def log[T](msg: String)(op: => T): T = ...
  22. 30.

    Inline Definitions
    object Logger {
      var indent = 0
      inline def log[T](msg: String)(op: => T): T = {
        println(s"${" " * indent}start $msg")
        indent += 1
        val result = op
        indent -= 1
        println(s"${" " * indent}$msg = $result")
        result
      }
    }
  23. 31.

    Inline Definitions
    Logger.log("123L^5") { power(123L, 5) }

    expands to

    val msg = "123L^5"
    println(s"${" " * indent}start $msg")
    Logger.indent += 1
    val result = power(123L, 5)
    Logger.indent -= 1
    println(s"${" " * indent}$msg = $result")
    result
  24. 32.

    Recursive Inline
    inline def power(x: Long, n: Int): Long = {
      if (n == 0) 1L
      else if (n % 2 == 1) x * power(x, n - 1)
      else {
        val y: Long = x * x
        power(y, n / 2)
      }
    }

    power(x, 10) expands to

    val x = expr
    val y = x * x
    y * {
      val y2 = y * y
      val y3 = y2 * y2
      y3 * 1L
    }

    power(x, n) expands to: boom (n is not a known constant)
  25. 33.

    Inline Parameters
    inline def power(x: Long, inline n: Int): Long = ...
    The argument must be a known constant value
    • Primitive values: Boolean, Int, Double, String, …
    • Some case classes: Option, ...
  26. 34.

    Inline Conditional
    inline if (n == 0) 1L
    else inline if (n % 2 == 1) x * power(x, n - 1)
    • condition must be reduced
    • only one branch will remain
  27. 35.

    Inline Match
    trait Nat
    case object Zero extends Nat
    case class Succ[N <: Nat](n: N) extends Nat

    inline def toInt(n: Nat): Int = inline n match {
      case Zero => 0
      case Succ(n1) => toInt(n1) + 1
    }

    val natTwo = toInt(Succ(Succ(Zero))) // effectively val natTwo: Int = 2
    • one case must match the scrutinee
    • only one case will remain
  28. 36.

    Specializing inline (example 1)
    inline def toInt(n: Nat) <: Int = ...
    val natTwo = toInt(Succ(Succ(Zero)))
    val natTwo: 2 = 2   // effectively this
    val natTwo: Int = 2 // instead of this
  29. 37.

    Specializing inline (example 2)
    class A
    class B extends A { def meth(): Unit = ... }

    inline def choose(b: Boolean) <: A =
      inline if (b) new A() else new B()

    val a: A = choose(true)
    val b: B = choose(false)

    // error: meth() not defined on A
    choose(true).meth()
    // Ok
    choose(false).meth()
  30. 38.

    Inline implicit matches
    inline def setFor[T]: Set[T] = implicit match {
      case ord: Ordering[T] => new TreeSet[T]
      case _ => new HashSet[T]
    }
    • implicit search available in a functional context
    • so if an Ordering[String] can be summoned, setFor[String] gives a TreeSet
  31. 39.

    Quotes and Splices
               quotes                        splices
    terms      val expr: Expr[T] = '{ e }    '{ val e: T = ${ expr } }
    types      val t: Type[T] = '[ T ]       '{ val e2: ${ t } = e }
  32. 40.

    Phase Consistency Check
    "For any free variable reference, the number of quoted scopes and the number of spliced scopes between the reference and its definition must be equal"
    *(inspired by MetaML’s levels/cross-safety check)

    def powerExpr(x: Expr[Long], n: Int): Expr[Long] = {
      if (n == 0) '{ 1L }
      else if (n % 2 == 1) '{ $x * ${ powerExpr(x, n - 1) } }
      else '{
        val y: Long = $x * $x
        ${ powerExpr('y, n / 2) }
      }
    }
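The same generator can be mimicked in plain OCaml to see which parts are static and which end up in the generated code. This is a sketch only: the `expr` type is a hypothetical AST standing in for `Expr[Long]` (it is neither a Scala 3 nor a MetaOCaml API), names for the generated bindings come from a simple counter, and no phase consistency is checked.

```ocaml
(* Hypothetical AST standing in for Expr[Long]. *)
type expr =
  | Lit of int
  | Var of string
  | Mul of expr * expr
  | Let of string * expr * expr (* let name = bound in body *)

(* Fresh names for generated let-bindings. *)
let counter = ref 0

let fresh () =
  incr counter;
  Printf.sprintf "y%d" !counter

(* Mirrors powerExpr: n is static and drives the recursion at generation
   time; x is a piece of code that is only multiplied in the output. *)
let rec power_expr (x : expr) (n : int) : expr =
  if n = 0 then Lit 1
  else if n mod 2 = 1 then Mul (x, power_expr x (n - 1))
  else
    let y = fresh () in
    Let (y, Mul (x, x), power_expr (Var y) (n / 2))

(* Interpreter for the generated code; env maps variable names to values. *)
let rec eval env = function
  | Lit n -> n
  | Var v -> List.assoc v env
  | Mul (a, b) -> eval env a * eval env b
  | Let (v, bound, body) -> eval ((v, eval env bound) :: env) body

(* Printer, to inspect what was generated. *)
let rec show = function
  | Lit n -> string_of_int n
  | Var v -> v
  | Mul (a, b) -> "(" ^ show a ^ " * " ^ show b ^ ")"
  | Let (v, bound, body) -> "let " ^ v ^ " = " ^ show bound ^ " in " ^ show body

let () =
  let code = power_expr (Var "x") 10 in
  print_endline (show code);
  Printf.printf "%d\n" (eval [ ("x", 3) ] code)
```

The conditionals on `n` disappear from the output because they are evaluated while generating; only multiplications and let-bindings on `x` remain, as in the expansion of `power(x, 10)`.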
  33. 41.

    Macro definition through inline and top-level splice
    inline def power(x: Long, inline n: Int): Long =
      ${ powerExpr('x, n) }

    private def powerExpr(x: Expr[Long], n: Int): Expr[Long] = ...
  34. 42.

    Inline refs and Macros side-by-side

    inline def power(x: Long, inline n: Int): Long = {
      inline if (n == 0) 1L
      else inline if (n % 2 == 1) x * power(x, n - 1)
      else {
        val y: Long = x * x
        power(y, n / 2)
      }
    }
    • Conditions must be constant foldable by the compiler
    • Low syntactic overhead

    def powerExpr(x: Expr[Long], n: Int): Expr[Long] = {
      if (n == 0) '{ 1L }
      else if (n % 2 == 1) '{ $x * ${ powerExpr(x, n - 1) } }
      else '{
        val y: Long = $x * $x
        ${ powerExpr('y, n / 2) }
      }
    }
    • Conditions must be constant foldable by the user
    • Arbitrary compiled code
  35. 43.

    Pattern Matching over Quotes
    inline def swap(tuple: =>(Int, Long)): (Long, Int) = ${ swapExpr('tuple) }

    val x = (1, 2L)
    swap(x)       // inlines: x.swap
    swap((3, 4L)) // inlines: (4L, 3)

    def swapExpr(tuple: Expr[(Int, Long)]) given Reflection: Expr[(Long, Int)] = {
      tuple match {
        case '{ ($x1, $x2) } => '{ ($x2, $x1) }
        case _ => '{ $tuple.swap }
      }
    }
  36. 44.

    Staging = macros + run
    // make available the necessary toolbox for runtime code generation
    implicit val toolbox: scala.quoted.Toolbox = scala.quoted.Toolbox.make(getClass.getClassLoader)

    val f: Array[Int] => Int = run {
      val stagedSum = '{ (arr: Array[Int]) => ${sum('arr)}}
      println(stagedSum.show)
      stagedSum
    }
    f.apply(Array(1, 2, 3)) // Returns 6
  37. 45.

    Basic points
    • stream-fusion is a domain-specific optimization
    • domain-specific optimizations are better tackled outside the general purpose compiler
    • multi-stage programming is not a trivial task
    • Scala 3 offers dedicated support for metaprogramming based on a portable format
    • MetaOCaml projects can be ported with minimal effort