Slide 1

Slide 1 text

State of Scala API in Apache Flink
Alexey Novakov, Solution Architect at Ververica

Slide 2

Slide 2 text

CONTENTS
1. Why use Scala
2. Usage of Scala in Apache Flink
3. Apache Flink Scala API
4. Scala tools for Flink job development

Slide 3

Slide 3 text

Part 01 Why use Scala

Slide 4

Slide 4 text

Why use Scala
Scala is a programming language that is more than 15 years old, with a mature ecosystem of tools, libraries, and books.

@main def hello() = println("Hello, World!")

- Expressive and concise syntax, with support for scripting (see the snippet below)
- Unique language features combining FP and OOP
- Compiles to JVM bytecode, JavaScript, and native code
- Spark, Flink, Akka, Kafka: all use Scala
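To make "expressive and concise" concrete, here is a small illustrative Scala 3 snippet (the Shape types are hypothetical, not from the talk) that mixes OOP-style types with FP-style pattern matching and immutable collections:

enum Shape:
  case Circle(radius: Double)
  case Rect(width: Double, height: Double)

def area(s: Shape): Double = s match
  case Shape.Circle(r)  => math.Pi * r * r
  case Shape.Rect(w, h) => w * h

@main def shapes() =
  val figures = List(Shape.Circle(1.0), Shape.Rect(2.0, 3.0))
  // immutable collections plus higher-order functions keep the code short
  println(figures.map(area).sum)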

Slide 5

Slide 5 text

1. Editors: VSCode with Metals plugin, IntelliJ IDEA with Scala plugin https://scalameta.org/metals/docs/editors/vscode/
2. REPL: console, Ammonite
3. CLI: scala-cli, Ammonite
4. Build tools: Mill
5. Libraries/Frameworks: scalatest, ZIO, Cats, Akka HTTP, Spark, Play, fs2, Slick, and more
6. Library Registry: https://index.scala-lang.org/

Slide 6

Slide 6 text

Scala Books
I recommend these personally:
- Programming in Scala, Fifth Edition, by Martin Odersky
- Functional Programming in Scala, by Paul Chiusano and Rúnar Bjarnason
- Scala Cookbook, by Alvin Alexander
- Scala for the Impatient, by Cay S. Horstmann

Slide 7

Slide 7 text

Part 02 Scala in Apache Flink

Slide 8

Slide 8 text

Scala Versions
Scala 2.12 released on Oct 28, 2016
Scala 2.13 released on Jun 7, 2019
Scala 3.0 released on May 21, 2021
Flink Scala API is still on 2.12.
Binaries compiled for different Scala minor versions are generally not compatible with each other (Scala 2.13 and 3.x are the exception: they can depend on each other).
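In sbt this shows up in the artifact names: the %% operator appends the Scala binary suffix (_2.12, _2.13, _3), so a library published only for one binary version cannot be used from another. A small sketch (library and versions chosen only for illustration):

// build.sbt sketch: %% resolves the Scala-binary-specific artifact
ThisBuild / scalaVersion := "2.12.18"

// resolves flink-streaming-scala_2.12; switching scalaVersion to 2.13 or 3.x
// would fail to resolve, because Flink does not publish those binary versions
libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" % "1.17.1"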

Slide 9

Slide 9 text

Dependency Tree: before Flink 1.15, Scala is coupled
Flink modules in Java & Scala have a compile-time dependency on the Scala 2.11/2.12 standard library.
User app modules (a Flink job in Scala, using the DataStream Scala or Java API) therefore also have to use the Scala 2.11/2.12 standard library.
Switching to a newer Scala (2.13, 3.x standard library) is not possible.

Slide 10

Slide 10 text

Dependency Tree: since Flink 1.15, Scala is no longer tightly coupled
Flink modules in Java & Scala now bundle a shaded Scala 2.12 standard library.
User app modules (a Flink job in Scala, using the DataStream Java API) can bring their own Scala 2.13 or 3.x standard library.
Switching to newer Scala is possible.
https://flink.apache.org/2022/02/22/scala-free-in-one-fifteen/
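Concretely, since 1.15 a Scala 3 job can depend only on Flink's Java artifacts, which carry no Scala binary suffix. A minimal build.sbt sketch (versions are illustrative):

// build.sbt sketch: Scala 3 job against Java-only Flink artifacts
ThisBuild / scalaVersion := "3.3.0"

libraryDependencies ++= Seq(
  // single % (no Scala suffix), so any Scala version on the user side works
  "org.apache.flink" % "flink-streaming-java" % "1.17.1",
  "org.apache.flink" % "flink-clients"        % "1.17.1"
)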

Slide 11

Slide 11 text

• Flink’s Scala version is shaded and does not clash with the user’s Scala
• To use Scala 2.13 or 3.x, remove the flink-scala JAR from the Flink distribution:

$ rm flink-dist/lib/flink-scala*

• And then use the Java API from your Scala code:

@main def job =
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env
    .fromElements(1, 2, 3, 4, 5, 6)
    .filter(_ % 2 == 1)
    .map(i => i * i)
    .print()
  env.execute()

However, users have to provide Scala serializers, as sketched below. See more later.
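Because Flink's type extraction treats Scala case classes and tuples as generic types, using the Java API directly usually means passing a TypeInformation yourself; otherwise execution falls back to Kryo or fails for lambdas whose return type cannot be extracted. A minimal sketch of what that can look like (the Click type is hypothetical; a dedicated Scala serializer library, covered later, is usually the better option):

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

// hypothetical event type, only for illustration
case class Click(userId: String, count: Int)

@main def jobWithExplicitTypeInfo() =
  val env = StreamExecutionEnvironment.getExecutionEnvironment

  // for a case class this is a GenericTypeInfo backed by Kryo
  val clickInfo: TypeInformation[Click] = TypeInformation.of(classOf[Click])

  env
    .fromElements(Click("a", 1), Click("b", 2))
    .map(c => c.copy(count = c.count + 1))
    .returns(clickInfo) // explicit type info for the map result
    .print()

  env.execute("explicit TypeInformation example")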

Slide 12

Slide 12 text

The Future of Scala in Apache Flink: Flink PMC's Decision
Background: the attempt to add support for Scala 2.13 failed (FLINK-13414)
1. Develop in Scala via the Java API
   • Pros: you can freely choose the latest Scala version since Flink 1.15
   • Cons: you have to define your own serializers
2. All Flink Scala APIs are deprecated and will be removed in a future Flink version
3. Internal Scala modules will be kept or rewritten in Java (if possible)

Slide 13

Slide 13 text

Part 03 Apache Flink Scala API

Slide 14

Slide 14 text

Official Scala API Extension
Add an import for the Scala API:

import org.apache.flink.api.scala._

object Main extends App {
  val env = ExecutionEnvironment.getExecutionEnvironment
  val text = env.fromElements(
    "To be, or not to be,--that is the question:--",
    "Whether 'tis nobler in the mind to suffer",
    "The slings and arrows of outrageous fortune",
    "Or to take arms against a sea of troubles,")

  val counts = text
    .flatMap(value => value.split("\\s+"))
    .map(value => (value, 1))
    .groupBy(0)
    .sum(1)

  counts.writeAsCsv("output.txt", "\n", " ")
  env.execute("Scala WordCount Example")
}

https://index.scala-lang.org/apache/flink/artifacts/flink-streaming-scala/1.17.1?binary-version=_2.12

Slide 15

Slide 15 text

Ways to use new Scala in Flink (since Flink 1.15 only)
1. flink-scala-api: a fork of the Flink Scala bindings originally created by Findify
   https://github.com/flink-extended/flink-scala-api (Scala support: 2.12, 2.13, 3.x)
2. flink4s: Scala 3.x wrapper for Apache Flink
   https://github.com/ariskk/flink4s
3. Direct usage of the Flink Java API
   "org.apache.flink" % "flink-streaming-java" % "1.15.0" // or newer
   Caution: you need to provide your own type serializers

Slide 16

Slide 16 text

Migration

// original API import
import org.apache.flink.streaming.api.scala._

// flink-scala-api imports
import org.apache.flink.api._
import org.apache.flink.api.serializers._

Usage:
libraryDependencies += "org.flinkextended" %% "flink-scala-api" % "1.16.2_1.0.0"

Choose your version: "1.15.4_1.0.0", "1.16.2_1.0.0", or "1.17.1_1.0.0"
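For a runnable project, the dependency above typically sits in build.sbt next to the Flink runtime artifacts. A minimal sketch, assuming Flink 1.16.2 and Scala 3 (the Provided scope and version numbers are illustrative, not prescribed here):

// build.sbt sketch
ThisBuild / scalaVersion := "3.3.0"

libraryDependencies ++= Seq(
  // Scala API wrapper with compile-time serializer derivation
  "org.flinkextended" %% "flink-scala-api" % "1.16.2_1.0.0",
  // Flink runtime pieces; Provided because a Flink cluster already ships them
  "org.apache.flink" % "flink-streaming-java" % "1.16.2" % Provided,
  "org.apache.flink" % "flink-clients"        % "1.16.2" % Provided
)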

Slide 17

Slide 17 text

Example Job (flink-extended/flink-scala-api)

import org.apache.flink.api._
import org.apache.flink.api.serializers._

@main def SocketTextStreamWordCount(hostName: String, port: Int) =
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env
    .socketTextStream(hostName, port)
    .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
    .map((_, 1))
    .keyBy(_._1)
    .sum(1)
    .print()
  env.execute("Scala SocketTextStreamWordCount Example")

Terminal 1:
% nc -lk 9999
hello flink scala api

Terminal 2 (sbt run):
Connecting to server socket localhost:9999
[info] 3> (hello,1)
[info] 8> (flink,1)
[info] 1> (scala,1)
[info] 1> (api,1)

Slide 18

Slide 18 text

Serializer Derivation with flink-extended/flink-scala-api:

import org.apache.flink.api.serializers._

case class Foo(x: Int) {
  def inc(a: Int) = copy(x = x + a)
}

// defined explicitly for caching purposes;
// if not defined, it is derived automatically
implicit lazy val fooTypeInfo: TypeInformation[Foo] = deriveTypeInformation[Foo]

env
  .fromElements(Foo(1), Foo(2), Foo(3))
  .map(x => x.inc(1)) // taken as an implicit
  .map(x => x.inc(2)) // again, no re-derivation

Slide 19

Slide 19 text

Main Features of flink-extended/flink-scala-api
- Automatic compile-time derivation of Flink serializers for simple Scala types and Algebraic Data Types (see the sketch below)
- Zero runtime reflection
- No silent fallback to Kryo serialization (compile error instead)
- Extendable with custom serializers for deeply-nested types
- Easy to migrate: mimics the old Scala API
- Scala 3 support
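To illustrate the ADT support, here is a minimal sketch of deriving a serializer for a sealed trait hierarchy; the Event, Click and Purchase types are hypothetical and exist only for this example:

import org.apache.flink.api._
import org.apache.flink.api.serializers._
import org.apache.flink.api.common.typeinfo.TypeInformation

// hypothetical ADT, only for illustration
sealed trait Event
case class Click(user: String) extends Event
case class Purchase(user: String, amount: Double) extends Event

// derived once at compile time for the whole sealed hierarchy;
// an unsupported member type is a compile error, not a silent Kryo fallback
implicit lazy val eventInfo: TypeInformation[Event] = deriveTypeInformation[Event]

@main def adtJob() =
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env
    .fromElements[Event](Click("alice"), Purchase("bob", 9.99))
    .keyBy {
      case Click(u)       => u
      case Purchase(u, _) => u
    }
    .print()
  env.execute("ADT serializer derivation example")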

Slide 20

Slide 20 text

Part 04 Scala tools for Flink job development

Slide 21

Slide 21 text

sbt-assembly plugin: to build a fat JAR

# file: project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.0.0")

# file: build.sbt
lazy val root = (project in file("."))
  .settings(
    // define the Main Class in case there are many
    assembly / mainClass := Some("org.example.MyMainClass"),
    …
  )

> sbt assembly
> ls target/scala-3*/*.jar
target/scala-3.3.0/my-flink-project-0.1.jar
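Flink fat JARs often run into duplicate files under META-INF across dependencies. A commonly used merge strategy for build.sbt, shown here as a sketch rather than part of the original slide (assumes sbt-assembly 2.x):

// build.sbt sketch: resolve duplicate files when building the fat JAR
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.filterDistinctLines // keep service loader entries
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard             // drop signatures/manifests
  case "reference.conf"                          => MergeStrategy.concat              // merge Typesafe config files
  case _                                         => MergeStrategy.first
}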

Slide 22

Slide 22 text

scala-cli: it can compile, run, package, and more
Just one file and a single command package a UDF into a JAR.

// file: multisetToString.scala
//> using scala "3"
//> using dep "org.apache.flink:flink-table-api-java:1.15.4"

import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.table.annotation.DataTypeHint
import java.util.{Map => JMap}

class MultisetToString extends ScalarFunction:
  def eval(
      @DataTypeHint("MULTISET") mset: JMap[Integer, String]
  ) = mset.toString

scala-cli package --jvm 11 \
  multisetToString.scala \
  -o udfs.jar \
  --library -f
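scala-cli can also run a whole Flink job locally from a single file. A minimal sketch, assuming flink-scala-api plus flink-clients on the classpath for local execution (file name and versions are illustrative):

// file: wordLength.scala   run with: scala-cli run wordLength.scala
//> using scala "3.3.0"
//> using dep "org.flinkextended::flink-scala-api:1.16.2_1.0.0"
//> using dep "org.apache.flink:flink-clients:1.16.2"

import org.apache.flink.api._
import org.apache.flink.api.serializers._

@main def wordLength() =
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env
    .fromElements("hello", "flink", "scala", "api")
    .map(w => (w, w.length)) // tuple serializer derived at compile time
    .print()
  env.execute("scala-cli word length example")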

Slide 23

Slide 23 text

Ammonite REPL (local mode)

Add dependencies:
@ import $ivy.`org.flinkextended::flink-scala-api:1.16.2_1.0.0`
@ import $ivy.`org.apache.flink:flink-clients:1.16.2`

@ import org.apache.flink.api._
@ import org.apache.flink.api.serializers._

@ val env = StreamExecutionEnvironment.getExecutionEnvironment
env: StreamExecutionEnvironment = org.apache.flink.api.StreamExecutionEnvironment@1e226bcd

@ env.fromElements(1, 2, 3, 4, 5, 6).filter(_ % 2 == 1).map(i => i * i).print()
res5: org.apache.flink.streaming.api.datastream.DataStreamSink[Int] = org.apache.flink.streaming.api.datastream.DataStreamSink@71e2c6d8

Result:
@ env.execute()
4> 1
8> 25
6> 9
res6: common.JobExecutionResult = Program execution finished
Job with JobID 5a947a757f4e74c2a06dcfe80ba4fde8 has finished. Job Runtime: 345 ms

See more at https://ammonite.io

Slide 24

Slide 24 text

Jupyter Notebook with a Scala kernel
Jupyter + Almond provides a user experience similar to Apache Zeppelin.
Almond: a Scala kernel for Jupyter
https://almond.sh/

Slide 25

Slide 25 text

Flink Job Template
Install sbt first, then run:

> sbt new novakov-alexey/flink-scala-api.g8

name [My Flink Scala Project]: new-flink-app
flinkVersion [1.17.1]: // press enter to use 1.17.1
Template applied in /Users/myhome/dev/git/./new-flink-app

The command above generates a "WordCount" Flink job in Scala 3:

new-flink-app
├── build.sbt
├── project
│   └── build.properties
└── src
    └── main
        └── scala
            └── com
                └── example
                    └── WordCount.scala

Slide 26

Slide 26 text

You can use the latest Scala in your Flink jobs. There are two open-source Scala wrappers available.
The Scala ecosystem provides better tools for Flink job development, debugging and deployment: coursier, scala-cli, Ammonite, sbt, Scastie.
Large code bases in Scala remain maintainable, unlike in Java.
If you follow the functional programming paradigm in your Flink jobs, it is even more beneficial for long-term maintenance.
Try to develop your next job with flink-scala-api.
Learn more at https://www.scala-lang.org/

Slide 27

Slide 27 text

Thanks Alexey Novakov, Solution Architect Contact info alexey at ververica.com