
State of Scala API in Apache Flink

As a Scala developer writing a new Flink job, you expect to use the latest Scala 3 version rather than the one Flink was compiled with. Using Scala 2.13 or Scala 3 was not really possible until Flink 1.15 came out. In this talk we will review how the Scala API was implemented in Apache Flink prior to version 1.15 and what has changed in that release. Apache Flink chose quite the opposite approach to the Apache Spark project for letting Scala developers use any Scala version, and that is an interesting discussion on its own.

Alexey Novakov

September 04, 2023

Transcript

  1. CONTENTS
     1. Why use Scala
     2. Usage of Scala in Apache Flink
     3. Apache Flink Scala API
     4. Scala tools for Flink job development
  2. Why use Scala: Scala is a more than 15-year-old programming language with a mature ecosystem of tools, libraries and books.
     @main def hello() = println("Hello, World!")
     - Expressive and concise syntax, with support for scripting
     - Unique language features combining FP and OOP
     - Compiles to JVM, JavaScript and native code
     - Spark, Flink, Akka, Kafka: all use Scala
  3. 1. Editors: VSCode with the Metals plugin (https://scalameta.org/metals/docs/editors/vscode/), IntelliJ IDEA with the Scala plugin
     2. REPL: console, Ammonite
     3. CLI: scala-cli, Ammonite
     4. Build tools: Mill
     5. Libraries/Frameworks: ScalaTest, ZIO, Cats, Akka HTTP, Spark, Play, fs2, Slick, and more
     6. Library Registry: https://index.scala-lang.org/
  4. Scala Books (I recommend these personally):
     - Programming in Scala, Fifth Edition, Martin Odersky
     - Functional Programming in Scala, Paul Chiusano and Runar Bjarnason
     - Scala Cookbook, Alvin Alexander
     - Scala for the Impatient, Cay S. Horstmann
  5. Scala Versions:
     - Scala 2.12 released on Oct 28, 2016
     - Scala 2.13 released on Jun 7, 2019
     - Scala 3.0 released on May 21, 2021
     The Flink Scala API is still on 2.12, while Scala 2.13 and 3.x binaries can depend on each other.
  6. Dependency Tree before Flink 1.15: Scala is coupled. Flink modules in Java and Scala have a compile-time dependency on the Scala 2.11/2.12 standard library, so a Flink job in Scala (the user app modules, using the DataStream Scala or Java API) is also tied to Scala 2.11/2.12. Switching to a newer Scala (2.13, 3.x) is not possible.
  7. Dependency Tree since Flink 1.15: Scala is no longer tightly coupled. Flink modules in Java and Scala depend on a shaded Scala 2.12 standard library, while a Flink job in Scala (the user app modules, using the DataStream Java API) can depend on the Scala 2.13 or 3.x standard library. Switching to a newer Scala is possible. https://flink.apache.org/2022/02/22/scala-free-in-one-fifteen/
  8. • Flink's Scala version is shaded and does not clash with the user's Scala
     • To use Scala 2.13 or 3.x, remove the flink-scala JAR from the Flink distribution:
       $ rm flink-dist/lib/flink-scala*
     • And then use the Java API from your Scala code:
       @main def job =
         val env = StreamExecutionEnvironment.getExecutionEnvironment
         env
           .fromElements(1, 2, 3, 4, 5, 6)
           .filter(_ % 2 == 1)
           .map(i => i * i)
           .print()
         env.execute()
     However, users have to provide Scala serializers for their own types (more on this later); the sketch below shows what that means with the plain Java API.
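For anything beyond Flink's built-in types, "providing serializers" means supplying TypeInformation yourself when you stay on the plain Java API. A minimal sketch, assuming only flink-streaming-java on the classpath; the Click type, the ToClick function and all other names are illustrative, not from the slides:

    import org.apache.flink.api.common.functions.MapFunction
    import org.apache.flink.api.common.typeinfo.TypeInformation
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

    // illustrative event type
    case class Click(userId: String, count: Int)

    // a top-level MapFunction, so it is trivially serializable
    class ToClick extends MapFunction[String, Click]:
      override def map(userId: String): Click = Click(userId, 1)

    @main def clicks() =
      val env = StreamExecutionEnvironment.getExecutionEnvironment

      // TypeInformation.of sees a case class as a "generic" type backed by Kryo;
      // with the plain Java API you own this choice explicitly, while the wrappers
      // shown on the next slides derive efficient serializers at compile time
      val clickInfo: TypeInformation[Click] = TypeInformation.of(classOf[Click])

      env
        .fromElements("alice", "bob")
        .map(new ToClick)
        .returns(clickInfo) // type hint for the operator output
        .print()

      env.execute()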
  9. The Future of Scala in Apache Flink: the Flink PMC's decision
     Background: an attempt to add support for Scala 2.13 failed (FLINK-13414)
     1. Develop in Scala via the Java API
        • Pros: you can freely choose the latest Scala version since Flink 1.15
        • Cons: you have to define your own serializers
     2. All Flink Scala APIs are deprecated and will be removed in a future Flink version
     3. Internal Scala modules will be kept or rewritten in Java (if possible)
  10. Official Scala API Extension. Add an import for the Scala API:
      import org.apache.flink.api.scala._

      object Main extends App {
        val env = ExecutionEnvironment.getExecutionEnvironment
        val text = env.fromElements(
          "To be, or not to be,--that is the question:--",
          "Whether 'tis nobler in the mind to suffer",
          "The slings and arrows of outrageous fortune",
          "Or to take arms against a sea of troubles,")
        val counts = text
          .flatMap(value => value.split("\\s+"))
          .map(value => (value, 1))
          .groupBy(0)
          .sum(1)
        counts.writeAsCsv("output.txt", "\n", " ")
        env.execute("Scala WordCount Example")
      }

      https://index.scala-lang.org/apache/flink/artifacts/flink-streaming-scala/1.17.1?binary-version=_2.12
  11. Ways to use new Scala in Flink (since Flink 1.15 only):
      1. flink-scala-api: a fork of the Flink Scala bindings originally created by Findify
         https://github.com/flink-extended/flink-scala-api (Scala support: 2.12, 2.13, 3.x)
      2. flink4s: a Scala 3.x wrapper for Apache Flink
         https://github.com/ariskk/flink4s
      3. Direct usage of the Flink Java API
         "org.apache.flink" % "flink-streaming-java" % "1.15.0" // or newer
         Caution: you need to provide your own type serializers
  12. Migration
      // original API import
      import org.apache.flink.streaming.api.scala._

      // flink-scala-api imports
      import org.apache.flink.api._
      import org.apache.flink.api.serializers._

      Usage (choose your version):
      libraryDependencies += "org.flinkextended" %% "flink-scala-api" % "1.17.1_1.0.0"
      // also published: "1.15.4_1.0.0", "1.16.2_1.0.0"
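To put this into a project, a minimal build.sbt sketch; the Scala and artifact versions come from these slides, everything else (layout, scope choices) is an assumption:

    // build.sbt: a minimal sketch, adjust versions to your Flink release
    ThisBuild / scalaVersion := "3.3.0"

    libraryDependencies ++= Seq(
      // Scala API wrapper; the version prefix must match the Flink release you target
      "org.flinkextended" %% "flink-scala-api" % "1.17.1_1.0.0",
      // lets you execute jobs from a local JVM (sbt run, tests)
      "org.apache.flink"   % "flink-clients"   % "1.17.1"
    )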
  13. Example Job (flink-extended/flink-scala-api)
      import org.apache.flink.api._
      import org.apache.flink.api.serializers._

      @main def SocketTextStreamWordCount(hostName: String, port: Int) =
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env
          .socketTextStream(hostName, port)
          .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
          .map((_, 1))
          .keyBy(_._1)
          .sum(1)
          .print()
        env.execute("Scala SocketTextStreamWordCount Example")

      Terminal 1:
      % nc -lk 9999
      hello flink scala api

      Terminal 2 (sbt run):
      Connecting to server socket localhost:9999
      [info] 3> (hello,1)
      [info] 8> (flink,1)
      [info] 1> (scala,1)
      [info] 1> (api,1)
  14. Serializer Derivation (flink-extended/flink-scala-api):
      import org.apache.flink.api.serializers._

      case class Foo(x: Int) {
        def inc(a: Int) = copy(x = x + a)
      }

      // defined explicitly for caching purposes;
      // if not defined, it is derived automatically
      implicit lazy val fooTypeInfo: TypeInformation[Foo] = deriveTypeInformation[Foo]

      env
        .fromElements(Foo(1), Foo(2), Foo(3))
        .map(x => x.inc(1)) // taken as an implicit
        .map(x => x.inc(2)) // again, no re-derivation
  15. Main Features of flink-extended/flink-scala-api
      - Automatic compile-time derivation of Flink serializers for simple Scala types and Algebraic Data Types
      - Zero runtime reflection
      - No silent fallback to Kryo serialization (compile error instead)
      - Extensible with custom serializers for deeply nested types
      - Easy to migrate: mimics the old Scala API
      - Scala 3 support
      A small derivation sketch for an ADT follows below.
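A minimal sketch of that compile-time derivation for an algebraic data type, using the same imports as the previous slides; the Event hierarchy and all names are illustrative, not from the slides:

    import org.apache.flink.api._
    import org.apache.flink.api.serializers._
    import org.apache.flink.api.common.typeinfo.TypeInformation

    // illustrative event ADT
    sealed trait Event
    case class PageView(userId: String, url: String)   extends Event
    case class Purchase(userId: String, amount: Double) extends Event

    // derived once at compile time and cached via the implicit val;
    // an unsupported member type fails compilation instead of silently using Kryo
    implicit lazy val eventTypeInfo: TypeInformation[Event] = deriveTypeInformation[Event]

    @main def events() =
      val env = StreamExecutionEnvironment.getExecutionEnvironment
      env
        .fromElements[Event](PageView("alice", "/home"), Purchase("bob", 9.99))
        .keyBy {
          case PageView(user, _)  => user
          case Purchase(user, _)  => user
        }
        .print()
      env.execute()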
  16. sbt-assembly plugin. To build a fat JAR:

      // file: project/plugins.sbt
      addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.0.0")

      // file: build.sbt
      lazy val root = (project in file("."))
        .settings(
          // define the main class in case there are many
          assembly / mainClass := Some("org.example.MyMainClass"),
          …
        )

      > sbt assembly
      > ls target/scala-3*/*.jar
      target/scala-3.3.0/my-flink-project-0.1.jar
  17. scala-cli: it can compile, run, package and more.

      // file: multisetToString.scala
      //> using scala "3"
      //> using dep "org.apache.flink:flink-table-api-java:1.15.4"

      import org.apache.flink.table.functions.ScalarFunction
      import org.apache.flink.table.annotation.DataTypeHint
      import java.util.{Map => JMap}

      class MultisetToString extends ScalarFunction:
        def eval(@DataTypeHint("MULTISET<INT>") mset: JMap[Integer, String]) =
          mset.toString

      scala-cli package --jvm 11 \
        multisetToString.scala \
        -o udfs.jar \
        --library -f

      Just one file and a single command packages a UDF into a JAR.
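One possible way to use the packaged UDF from the Table API; this is not on the slides, the registration name is illustrative, and it assumes udfs.jar plus a Table runtime (e.g. flink-table-planner-loader and flink-table-runtime) are on the classpath:

    import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

    @main def registerUdf() =
      val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

      // register the class packaged into udfs.jar under a SQL-callable name
      tEnv.createTemporarySystemFunction("MULTISET_TO_STRING", classOf[MultisetToString])

      // illustrative usage from SQL, where COLLECT(...) builds a MULTISET:
      //   SELECT MULTISET_TO_STRING(COLLECT(x)) FROM some_table GROUP BY some_key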
  18. Ammonite REPL (add dependencies, run in local mode, see the result):

      @ import $ivy.`org.flinkextended::flink-scala-api:1.16.2_1.0.0`
      @ import $ivy.`org.apache.flink:flink-clients:1.16.2`
      @ import org.apache.flink.api._
      @ import org.apache.flink.api.serializers._

      @ val env = StreamExecutionEnvironment.getExecutionEnvironment
      env: StreamExecutionEnvironment = org.apache.flink.api.StreamExecutionEnvironment@1e226bcd

      @ env.fromElements(1, 2, 3, 4, 5, 6).filter(_ % 2 == 1).map(i => i * i).print()
      res5: org.apache.flink.streaming.api.datastream.DataStreamSink[Int] = org.apache.flink.streaming.api.datastream.DataStreamSink@71e2c6d8

      @ env.execute()
      4> 1
      8> 25
      6> 9
      res6: common.JobExecutionResult = Program execution finished
      Job with JobID 5a947a757f4e74c2a06dcfe80ba4fde8 has finished.
      Job Runtime: 345 ms

      See more at https://ammonite.io
  19. Jupyter Notebook with a Scala kernel: Jupyter + Almond provides a similar user experience to Apache Zeppelin. Almond is a Scala kernel for Jupyter: https://almond.sh/
  20. Flink Job Template. Install sbt first, then run:

      > sbt new novakov-alexey/flink-scala-api.g8

      name [My Flink Scala Project]: new-flink-app
      flinkVersion [1.17.1]: // press enter to use 1.17.1
      Template applied in /Users/myhome/dev/git/./new-flink-app

      new-flink-app
      ├── build.sbt
      ├── project
      │   └── build.properties
      └── src
          └── main
              └── scala
                  └── com
                      └── example
                          └── WordCount.scala

      The above command generates a "WordCount" Flink job in Scala 3.
  21. Summary:
      - You can use the latest Scala in your Flink jobs. There are two open-source Scala wrappers available.
      - The Scala ecosystem provides better tools for developing, debugging and deploying Flink jobs: coursier, scala-cli, Ammonite, sbt, Scastie.
      - Large code-bases in Scala remain maintainable, unlike in Java.
      - If you follow the functional programming paradigm in your Flink jobs, the long-term maintenance benefit is even bigger.
      - Try developing your next job with flink-scala-api. Learn more at https://www.scala-lang.org/