Flink Forward 2023: State of Scala API in Apache Flink

This talk covers the state of the Scala API for Apache Flink and how to use newer Scala versions with the latest Apache Flink release.

Alexey Novakov

November 08, 2023

Transcript

  1. Flink Forward 2023 © State of Scala API in Apache Flink

    Alexey Novakov, Solution Architect @ Ververica
  2. Contents

    1. Why use Scala
    2. Usage of Scala in Apache Flink
    3. Apache Flink Scala API
    4. Scala tools for Flink jobs
  3. Why use Scala

    ◦ Expressive and concise syntax, with scripting support
    ◦ Unique language features that support both FP and OOP
    ◦ Compiles to JVM, JavaScript and native code
    ◦ Spark, Flink, Akka and Kafka all use Scala

    Scala is a programming language more than 15 years old, with a mature ecosystem of tools, libraries and many books.

        @main def hello() = println("Hello, World!")
  4. Scala Tools & Libraries

    1. Editors: VSCode with Metals plugin, IntelliJ IDEA with Scala plugin
    2. REPL: console, Ammonite
    3. CLI: scala-cli
    4. Build tools: Mill
    5. Libraries/Frameworks: scalatest, ZIO, Cats, Akka HTTP, Spark, Play, fs2, Slick, and more
    6. Library Registry: https://index.scala-lang.org/
  5. Scala Versions

    - Scala 2.12 released on Oct 28, 2016
    - Scala 2.13 released on Jun 7, 2019
    - Scala 3.0 released on May 21, 2021

    Diagram labels: "Binary compatible"; "Flink Scala API is still on 2.12"
  6. Dependency Tree: before Flink 1.15

    Scala is coupled. Switch to new Scala is not possible.

    Diagram labels:
    - Apache Flink modules: Flink modules in Java/Scala; DataStream Scala/Java API; Scala 2.11, 2.12 std. library
    - User app modules: Flink job in Scala, with a compile-time dependency on the Flink modules, which implies the Scala 2.11, 2.12 std. library instead of the Scala 2.13, 3.x std. library
  7. Dependency Tree: since Flink 1.15

    Scala is no longer tightly coupled. Switch to newer Scala is possible.

    Diagram labels:
    - Apache Flink modules: Flink modules in Java/Scala; DataStream Java API; shaded Scala 2.12 std. library
    - User app modules: Flink job in Scala, with a compile-time dependency on the Flink modules and its own Scala 2.13, 3.x std. library
  8. Since Flink 1.15

    • Flink’s Scala version is “shaded” and does not clash with the user’s Scala
    • To use Scala 2.13 or 3.x, remove the flink-scala JAR from the Flink distribution:

        $ rm flink-dist/lib/flink-scala*

    • Then use the Java API from your Scala code:

        // Java API entry point
        import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

        @main def job =
          val env = StreamExecutionEnvironment.getExecutionEnvironment
          env
            .fromElements(1, 2, 3, 4, 5, 6)
            .filter(_ % 2 == 1)
            .map(i => i * i)
            .print()
          env.execute()

    However, users have to provide Scala serializers. See the solution on the later slides.
  9. Flink PMCs Decision

    Background: the attempt to add support for Scala 2.13 failed (see FLINK-13414 in Jira)

    1. Users continue to develop in Scala via the Java API
       - Pros: freedom to choose any Scala version
       - Cons: it requires defining your own serializers
    2. All Flink Scala APIs are deprecated and will be removed in future Flink versions
    3. Flink internal Scala modules will be kept or rewritten in Java (if possible)
  10. Official Scala API Extension

    Add special “import” for the DataStream API

        import org.apache.flink.api.scala._

        object Main extends App {
          val env = ExecutionEnvironment.getExecutionEnvironment
          val text = env.fromElements(
            "To be, or not to be,--that is the question:--",
            "Whether 'tis nobler in the mind to suffer",
            "The slings and arrows of outrageous fortune",
            "Or to take arms against a sea of troubles,")
          val counts = text
            .flatMap(value => value.split("\\s+"))
            .map(value => (value, 1))
            .groupBy(0)
            .sum(1)
          counts.writeAsCsv("output.txt", "\n", " ")
          env.execute("Scala WordCount Example")
        }

    https://index.scala-lang.org/apache/flink/artifacts/flink-streaming-scala/1.17.1?binary-version=_2.12
    Scala binary version: 2.12
  11. Ways to use new Scala with Flink

    1. flink-extended/flink-scala-api: a fork of the Flink Scala bindings originally created by Findify (great effort of Roman Grebennikov)
    2. ariskk/flink4s: Scala 3.x wrapper for Apache Flink
    3. Direct* usage of the Flink Java API:

        "org.apache.flink" % "flink-streaming-java" % "x.y.z"

    *Caution: you need to bring your own type serializers
  12. Migration to flink-scala-api

        // flink-scala-api imports
        import org.apache.flinkx.api.*
        import org.apache.flinkx.api.serializers.*

        // original API import
        import org.apache.flink.streaming.api.scala.*

        // build.sbt
        libraryDependencies += "org.flinkextended" %% "flink-scala-api" % "1.16.2_1.1.0"

    Choose your version: "1.15.4_1.1.0", "1.16.2_1.1.0" or "1.17.1_1.1.0"
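
The dependency line from slide 12 can be placed in a complete build file. The following is a minimal sketch, assuming Scala 3.3.0 (the version visible in the target/scala-3.3.0 path on slide 16) and Flink 1.17.1. The project name and the Provided scope on flink-clients are assumptions and are not specified in the slides.

```scala
// build.sbt: minimal sketch for a Flink job built on flink-scala-api
ThisBuild / scalaVersion := "3.3.0"

lazy val root = (project in file("."))
  .settings(
    name := "my-flink-project", // assumed project name
    libraryDependencies ++= Seq(
      // version format: <Flink version>_<wrapper version>, as listed on the slide
      "org.flinkextended" %% "flink-scala-api" % "1.17.1_1.1.0",
      // assumption: the Flink cluster supplies this at runtime, hence Provided;
      // drop the scope when running the job directly from sbt
      "org.apache.flink" % "flink-clients" % "1.17.1" % Provided
    )
  )
```

The Flink part of the wrapper version should match the Flink version of the target cluster.
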
  13. Example Job (flink-extended/flink-scala-api)

        import org.apache.flinkx.api.*
        import org.apache.flinkx.api.serializers.*

        @main def socketWordCount(hostName: String, port: Int) =
          val env = StreamExecutionEnvironment.getExecutionEnvironment
          env
            .socketTextStream(hostName, port)
            .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
            .map((_, 1))
            .keyBy(_._1)
            .sum(1)
            .print()
          env.execute("Scala socketWordCount Example")

    Input, typed into a local socket server:

        % nc -lk 9999
        hello flink scala api

    Job output:

        Connecting to server socket localhost:9999
        [info] 3> (hello,1)
        [info] 8> (flink,1)
        [info] 1> (scala,1)
        [info] 1> (api,1)
  14. Serializer Derivation

        import org.apache.flinkx.api.serializers.*

        case class Foo(x: Int) {
          def inc(a: Int) = copy(x = x + a)
        }

        // defined explicitly for caching purpose on compilation.
        // If not defined, then it is derived automatically
        implicit lazy val fooTypeInfo: TypeInformation[Foo] =
          deriveTypeInformation[Foo]

        env
          .fromElements(Foo(1), Foo(2), Foo(3))
          .map(x => x.inc(1)) // taken as an implicit
          .map(x => x.inc(2)) // again, no re-derivation
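
The reuse of one explicitly defined implicit on slide 14 can be illustrated without Flink. The sketch below is only an analogy: `TypeInfo`, `Derivation.derive` and `mapTyped` are hypothetical stand-ins for `TypeInformation`, `deriveTypeInformation` and the DataStream `map`, and the derivation cost is modelled with a runtime counter, whereas the library performs derivation at compile time.

```scala
//> using scala "3"

// Hypothetical stand-in for Flink's TypeInformation; not the real class.
final case class TypeInfo[T](description: String)

object Derivation:
  // Counts how many times the "expensive" derivation has run.
  var runs: Int = 0

  // Hypothetical stand-in for deriveTypeInformation.
  def derive[T](description: String): TypeInfo[T] =
    runs += 1
    TypeInfo[T](description)

final case class Foo(x: Int):
  def inc(a: Int): Foo = copy(x = x + a)

object Foo:
  // Derived on first use, then shared by every later implicit lookup.
  implicit lazy val fooTypeInfo: TypeInfo[Foo] =
    Derivation.derive[Foo]("Foo(x: Int)")

// Like a DataStream map, this demands type information for its result type.
def mapTyped[A, B](in: List[A])(f: A => B)(implicit info: TypeInfo[B]): List[B] =
  in.map(f)

// The first call resolves fooTypeInfo; the second call reuses the same instance.
val step1: List[Foo] = mapTyped(List(Foo(1), Foo(2), Foo(3)))(_.inc(1))
val result: List[Foo] = mapTyped(step1)(_.inc(2))

@main def cachingDemo(): Unit =
  println(result)
  println(s"derivations: ${Derivation.runs}")
```

Both `mapTyped` calls pick up the same `fooTypeInfo` value from the companion object, so the counter is incremented only once.
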
  15. Main Features (flink-extended/flink-scala-api)

    - Automatic compile-time derivation of Flink serializers for simple Scala and Algebraic Data Types
    - Zero runtime reflection
    - No silent fallback to Kryo serialization (compile error)
    - Extendable with custom serializers for deeply-nested types
    - Easy to migrate: mimics old Scala API
    - Scala 3 support
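
The "compile error instead of a silent fallback" point on slide 15 follows from resolving serializers as type class instances at compile time. The sketch below shows that general mechanism in plain Scala 3. It is not the library's implementation: `Ser` and `serialize` are hypothetical names, and `scala.compiletime.testing.typeChecks` is used only to observe whether a snippet would compile.

```scala
//> using scala "3"
import scala.compiletime.testing.typeChecks

// Hypothetical stand-in for a serializer type class; not the library's real API.
trait Ser[T]:
  def write(value: T): String

object Ser:
  given intSer: Ser[Int] = new Ser[Int] {
    def write(value: Int): String = value.toString
  }
  given stringSer: Ser[String] = new Ser[String] {
    def write(value: String): String = value
  }
  // A pair is serializable only if both of its components are.
  given pairSer[A, B](using a: Ser[A], b: Ser[B]): Ser[(A, B)] = new Ser[(A, B)] {
    def write(value: (A, B)): String =
      s"(${a.write(value._1)},${b.write(value._2)})"
  }

// Requires a serializer at compile time; there is no runtime fallback.
def serialize[T](value: T)(using ser: Ser[T]): String = ser.write(value)

val encoded: String = serialize(("flink", 1))

// typeChecks reports whether a snippet would compile at this point.
val pairCompiles: Boolean = typeChecks("""serialize(("flink", 1))""")
val fileCompiles: Boolean = typeChecks("""serialize(new java.io.File("x"))""")

@main def noFallbackDemo(): Unit =
  println(encoded)
  println(s"pair compiles: $pairCompiles, file compiles: $fileCompiles")
```

Because no `Ser[java.io.File]` instance exists, the second snippet is rejected by the compiler rather than handled by a generic serializer at runtime.
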
  16. sbt assembly plugin

    To build a fat-jar:

        // project/plugins.sbt
        addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.0.0")

        // build.sbt
        lazy val root = (project in file("."))
          .settings(
            // optionally define a main class in case there are multiple
            assembly / mainClass := Some("org.example.MyMainClass"),
            …
          )

        > sbt assembly
        > ls target/scala-3*/*.jar
        target/scala-3.3.0/my-flink-project-0.1.jar
  17. scala-cli

    It can compile, run, package and more

    multisetToString.scala:

        //> using scala "3"
        //> using dep "org.apache.flink:flink-table-api-java:1.15.4"

        import org.apache.flink.table.functions.ScalarFunction
        import org.apache.flink.table.annotation.DataTypeHint
        import java.util.{Map => JMap}

        class MultisetToString extends ScalarFunction:
          def eval(
            @DataTypeHint("MULTISET<INT>") mset: JMap[Integer, String]
          ) = mset.toString

    Just one file and a single command packages a UDF into a JAR:

        scala-cli package --jvm 11 \
          multisetToString.scala \
          -o udfs.jar \
          --library -f
  18. Ammonite REPL

    See more at https://ammonite.io

    Add dependencies:

        @ import $ivy.`org.flinkextended::flink-scala-api:1.16.2_1.0.0`
        @ import $ivy.`org.apache.flink:flink-clients:1.16.2`

    Local mode:

        @ import org.apache.flinkx.api.*
        @ import org.apache.flinkx.api.serializers.*
        @ val env = StreamExecutionEnvironment.getExecutionEnvironment
        env: StreamExecutionEnvironment = org.apache.flink.api.StreamExecutionEnvironment@1e226bcd
        @ env.fromElements(1, 2, 3, 4, 5, 6).filter(_ % 2 == 1).map(i => i * i).print()
        res5: org.apache.flink.streaming.api.datastream.DataStreamSink[Int] = org.apache.flink.streaming.api.datastream.DataStreamSink@71e2c6d8

    Result:

        @ env.execute()
        4> 1
        8> 25
        6> 9
        res6: common.JobExecutionResult = Program execution finished
        Job with JobID 5a947a757f4e74c2a06dcfe80ba4fde8 has finished.
        Job Runtime: 345 ms
  19. Jupyter Notebook with Scala kernel

    Jupyter+Almond provides a user experience similar to Apache Zeppelin

    Almond: a Scala kernel for Jupyter
    https://almond.sh/
  20. Flink Job Template

    Install SBT first, then run:

        > sbt new novakov-alexey/flink-scala-api.g8
        name [My Flink Scala Project]: new-flink-app
        flinkVersion [1.17.1]: // press enter to use 1.17.1
        Template applied in /Users/myhome/dev/git/./new-flink-app

    The above command generates a “WordCount” Flink job in Scala 3:

        new-flink-app
        ├── build.sbt
        ├── project
        │   └── build.properties
        └── src
            └── main
                └── scala
                    └── com
                        └── example
                            └── WordCount.scala
  21. Summary

    - You can use the latest Scala in your Flink jobs. There are 2 community wrappers available
    - The Scala ecosystem provides better tools for Flink job development, debugging and deployment: Coursier, Scala-CLI, Ammonite, SBT, Scastie
    - Large codebases in Scala remain maintainable, unlike in Java
    - The FP paradigm makes it easy to compose your Flink jobs
    - Try developing your next job with flink-scala-api

    More information:
    https://www.scala-lang.org/
    https://flink.apache.org/2022/02/22/scala-free-in-one-fifteen/
    https://github.com/novakov-alexey/flink-sandbox