Kotlin ❤️ Data Science?

There is a mismatch between software engineering and data science. My talk addresses this mismatch and explores whether Kotlin can help bring these two worlds closer together.

Preslav Rachev

January 28, 2019

Kotlin ❤ Data Science?*
Preslav Rachev @ KI labs // 28.01.2019
* Data science and data engineering

Who am I?
- A software engineer, working at KI labs.
- Passionate about Kotlin and data.
- A genuinely curious individual who loves writing. ✏
- Also, an inventor of funny faces.
✏ https://preslav.me

The IT Reality of 2019
- "AI" has become a favourite topic among business managers and software engineers when discussing company innovation strategies.
- Tech media is only making it worse.
- Data is everywhere, but getting useful knowledge out of it is far different from what management and engineering imagine.

AI, ML, DS?!?
- AI is what brings the VC money in.
- ML (a.k.a. sophisticated brute force) is what gets the job done.
- ML models are narrowly limited to a given domain.
- DS is the craft of finding which ML model works for a particular case, and which doesn't.

Data Science Definition
Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.
Source: Wikipedia

Data Science Workflow
1. Form a hypothesis
2. Load data from various sources
3. Clean, transform and unify
4. Extract features
5. Use the features to run a model
6. Visualize and report findings
7. Support or refute the hypothesis

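The workflow above can be sketched in plain Kotlin with nothing but the standard library. This is only an illustration: the `Measurement` class, the hard-coded data and the hypothesis ("every city has a mean temperature above 20 degrees") are invented for the example and do not come from the talk.

```kotlin
// Hypothetical record type, invented for this sketch.
data class Measurement(val city: String, val temperature: Double?)

// Step 2: load data (hard-coded here; in practice a file, a database or an API).
fun loadData(): List<Measurement> = listOf(
    Measurement("Munich", 21.5),
    Measurement("Munich", null), // a missing value
    Measurement("Sofia", 24.0),
    Measurement("Sofia", 23.0)
)

// Step 3: clean by dropping rows without a temperature.
fun clean(rows: List<Measurement>): List<Pair<String, Double>> =
    rows.mapNotNull { row -> row.temperature?.let { row.city to it } }

// Step 4: extract a feature, here the mean temperature per city.
fun meanByCity(rows: List<Pair<String, Double>>): Map<String, Double> =
    rows.groupBy({ it.first }, { it.second })
        .mapValues { (_, temps) -> temps.average() }

// Steps 5 and 7: a deliberately trivial "model" that checks the hypothesis.
fun supportsHypothesis(means: Map<String, Double>, threshold: Double = 20.0): Boolean =
    means.values.all { it > threshold }

fun main() {
    val means = meanByCity(clean(loadData()))
    println(means)                     // Step 6: report the findings
    println(supportsHypothesis(means)) // Step 7: support or refute
}
```

Each step is an ordinary function, so the exploratory part and the production part can live in the same codebase.
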
Motivation
There is a mismatch between software engineering and data science practices:
- Software engineering works best when building well-defined systems.
- Requirements rarely change entirely, but evolve over time.
- Data science deals with supporting and refuting hypotheses.
- Lots of uncertainty, which requires seamless exploration and visualisation.

Motivation
Due to this mismatch, systems often end up becoming complex tech-stack mash-ups, where each side treats the other as some sort of black box.
- Such systems are difficult to maintain and require lots of different skills and practices.
The question arises: Could there be a single tech stack that allows software engineers and data scientists to work in peace, but also to contribute directly to a single codebase?

Kotlin == the missing link?
- A multi-paradigm programming language with a fluent syntax
- Strong community and enterprise backing
- Access to the entire universe of JVM knowledge and libraries
But there are a few important pieces which are not quite there yet:
- Fully integrated scripting capabilities
- Playground environments (e.g. notebooks)
- Data wrangling and visualization libraries that take advantage of the above

Kotlin Top 10 Features
Kotlin is a multi-paradigm programming language, equally easy to learn by both Java and Python programmers. My personal Top 10:
1. Static typing
2. Immutability and Null-safety
3. Higher-order functions
4. Chain-able sequences
5. Data classes
6. Extension methods
7. Sealed classes
8. Coroutines
9. Default and named arguments
10. Multi-platform support

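Default and named arguments (item 9) are the feature Python programmers tend to recognise first. A minimal sketch, assuming a made-up `describe` helper that is not part of any library:

```kotlin
import java.util.Locale

// Hypothetical helper: summarises a series of numbers.
// `label` and `digits` have defaults, so callers only pass what they need.
fun describe(values: List<Double>, label: String = "series", digits: Int = 2): String {
    // Locale.ROOT keeps the decimal separator a dot on every machine.
    val mean = String.format(Locale.ROOT, "%.${digits}f", values.average())
    return "$label: n=${values.size}, mean=$mean"
}

fun main() {
    val data = listOf(1.0, 2.0, 4.0)
    println(describe(data))                                // series: n=3, mean=2.33
    println(describe(data, digits = 1, label = "latency")) // latency: n=3, mean=2.3
}
```

Named arguments can be passed in any order, which keeps calls with several optional parameters readable without overloads or builder classes.
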
Static Typing
    List<Integer> nums = new ArrayList<>(Arrays.asList(1, 2, 3)); // Java
    val nums = listOf(1, 2, 3) // Kotlin

Immutability and Null-Safety
    // Every variable in Kotlin must be assigned a value,
    // unless explicitly declared with `lateinit`
    val x = 100 // cannot be reassigned, ever!
    var y = 200 // this one can
    lateinit var z: String // A value will be provided later (lateinit needs an explicit, non-primitive type)

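The slide names null-safety without showing it, so here is a minimal sketch of nullable types, the safe-call operator `?.` and the Elvis operator `?:`. The `ages` map and the `ageLabel` function are invented for illustration.

```kotlin
// Hypothetical lookup table, invented for this sketch.
val ages: Map<String, Int> = mapOf("Chris" to 31, "Will" to 32)

fun ageLabel(name: String): String {
    // Map lookups return Int?, because the key may be absent.
    val age: Int? = ages[name]
    // Writing `age + 1` here would not compile: the compiler insists
    // that the null case is handled first.
    return age?.let { "$name is $it" } ?: "$name: age unknown"
}

fun main() {
    println(ageLabel("Chris")) // Chris is 31
    println(ageLabel("Guy"))   // Guy: age unknown
}
```

Missing values are common in real data sets, so having the compiler track them is arguably one of Kotlin's more useful properties for data work.
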
Higher-Order Functions and DSL support
A higher-order function is a function that takes functions as parameters, or returns a function.
    route("/portal") {
        route("articles") { … }
        route("admin") {
            intercept(ApplicationCallPipeline.Features) { … } // verify admin privileges
            route("article/{id}") { … } // manage article with {id}
            route("profile/{id}") { … } // manage profile with {id}
        }
    }

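The routing snippet above depends on a web framework. For a self-contained view of the same idea, the sketch below shows both halves of the definition, plus a lambda with receiver, which is the building block DSLs like that one rely on. All function names here are invented for illustration.

```kotlin
// Takes a function as a parameter.
fun <T> applyTwice(value: T, f: (T) -> T): T = f(f(value))

// Returns a function.
fun scaleBy(factor: Double): (Double) -> Double = { x -> x * factor }

// Takes a lambda with receiver: inside the block, `this` is a StringBuilder,
// so its methods can be called without a qualifier. DSLs are built this way.
fun report(build: StringBuilder.() -> Unit): String =
    StringBuilder().apply(build).toString()

fun main() {
    println(applyTwice(3) { it * 2 }) // 12
    println(scaleBy(0.5)(10.0))       // 5.0
    println(report {
        append("rows: ")
        append(42)
    })                                // rows: 42
}
```
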
Data classes and Chain-able Sequences
    data class Person(val name: String, val age: Int)

    val people = listOf(
        Person("Chris Martin", 31),
        Person("Will Champion", 32),
        Person("Jonny Buckland", 33),
        Person("Guy Berryman", 34),
        Person("Mhris Cartin", 30))

    println(people
        .asSequence() // convert to sequence
        .filter { it.age > 30 } // lazy eval (intermediate op)
        .map { it.name.split(" ").map { it[0] }.joinToString("") } // lazy eval (intermediate op)
        .map { it.uppercase() } // lazy eval (intermediate op); toUpperCase() before Kotlin 1.5
        .toList()) // terminal operation
Tip: Combine these with coroutines to construct declarative data pipelines.

Sealed Classes
    sealed class ArithmeticOperation
    class Add(var a: Int, var b: Int): ArithmeticOperation()
    class Subtract(var a: Int, var b: Int): ArithmeticOperation()
    class Multiply(var a: Int, var b: Int): ArithmeticOperation()
    class Divide(var a: Int, var b: Int): ArithmeticOperation()

    fun execute(op: ArithmeticOperation) = when (op) {
        is Add -> op.a + op.b
        is Subtract -> op.a - op.b
        is Multiply -> op.a * op.b
        is Divide -> op.a / op.b
    }

Extension Methods
    fun String.underscore(): String {
        return this.replace(" ", "_")
    }

    print("hello world".underscore()) // "hello_world"

Infix support
    infix fun Number.toPowerOf(exponent: Number): Double {
        return Math.pow(this.toDouble(), exponent.toDouble())
    }

    3 toPowerOf 2 // 9.0
    9 toPowerOf 0.5 // 3.0

The Ecosystem
No programming language in the world will do the job without an abundant library ecosystem to pick and choose from.
- The Kotlin Standard Library will be your first choice, yet by far not the only one.
- Kotlin is standing on the shoulders of giants (e.g. the JVM).
- The future prospects of integrating low-level libraries through Kotlin Native are even more promising.

What the JVM has to offer...
    Library               Functionality
    Apache Hadoop         Batch processing
    Apache Spark          Data streaming
    ND4J                  Scientific computing (similar to NumPy)
    Apache Commons Math   Math and computing utils
    Weka                  ML/NLP (similar to SciPy)
    Tablesaw              Visualization (similar to Matplotlib and Plot.ly)
    TensorFlow for Java   Deep ML
    Deeplearning4j        Deep ML
And many, many more...

...besides, a young ecosystem of libs targeting Kotlin's unique features:
    Library             Functionality
    Krangl              Data wrangling (similar to Pandas)
    Kravis              Visualisation (similar to Matplotlib and Plot.ly)
    Koma                Scientific computing (similar to SciPy)
    kotlin-statistics   Scientific computing and statistics
    komputation         Neural network for the JVM written in Kotlin and CUDA C
Still, no real Pandas yet.

Many of the above libraries use standards for communicating input data (e.g. CSV) or results (e.g. trained ML models, reports, aggregated data sets, visualisations, etc.).
- At the very least, this means that one can create an environment in which data scientists keep using their favourite tools and communicate their findings to the software engineers using those standards.
- Kotlin can become a common ground for understanding each other's code.
OK, but can we go one step further from there?

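As an illustration of exchanging results through a standard format, the sketch below reads a small CSV report, of the kind a data scientist might export from their own tools, into Kotlin data classes. The column names, the `Score` class and the sample values are invented, and the parser is deliberately naive: it assumes no quoted fields or embedded commas, for which a proper CSV library would be the better choice.

```kotlin
// Hypothetical result row, invented for this sketch.
data class Score(val model: String, val accuracy: Double)

// Naive CSV parsing: header on the first line, comma-separated, no quoting.
fun parseScores(csv: String): List<Score> =
    csv.trim().lines()
        .drop(1) // skip the header row
        .filter { it.isNotBlank() }
        .map { line ->
            val (model, accuracy) = line.split(",").map { it.trim() }
            Score(model, accuracy.toDouble())
        }

// Picks the row with the highest accuracy, or null for an empty report.
fun best(scores: List<Score>): Score? = scores.maxByOrNull { it.accuracy }

fun main() {
    val csv = """
        model,accuracy
        baseline,0.71
        random-forest,0.84
    """.trimIndent()
    println(best(parseScores(csv))) // Score(model=random-forest, accuracy=0.84)
}
```
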
The ultimate data scientist peace requires three more things:
1. Better Kotlin scripting support
2. A solid REPL (Read-Eval-Print Loop) console
3. Tools that encourage experimentation and interactive programming

Scripting Support
A large portion of the work of the data team involves the use and deployment of executable scripts.
This is one field where Python is off the charts.
Kotlin Script is unfinished, slow and painful to work with.
KEEP-75

KScript
KScript is an open-source project that tries to improve the performance of Kotlin scripts and to reduce the friction when working with 3rd-party libs:
    #!/usr/bin/env kscript
    @file:DependsOn("de.mpicbg.scicomp:kutils:0.4")

    import de.mpicbg.scicomp.bioinfo.openFasta

    if (args.size != 1) {
        System.err.println("Usage: CountRecords <fasta>")
        kotlin.system.exitProcess(-1)
    }

    val records = openFasta(java.io.File(args[0]))
    println(records.count())

REPL
Kotlin has a REPL (Read-Eval-Print Loop), but it is a tough beast.
IntelliJ extends the Kotlin REPL and makes it a bit nicer to work with.
Check out KShell as an alternative.

Interactive Programming
Also known as notebooks or playgrounds, tools like Jupyter allow for a unique mix of narrative and code.
- Let programmers play around with data and libs in a visual, REPL-like environment
- Great for sharing and explaining difficult concepts
Kotlin Jupyter
Kotlin Playground

What did we learn?
- Kotlin is a great language with a mature library ecosystem.
- It lacks some of the tooling that data scientists need.
- The community and JetBrains are working hard to fill the gaps.
- We wouldn't have come this far if it weren't for these folks: @ligee, @thomasnield9727, @holgerbrandl and many more around the #datascience community on Slack.

Links
- The Connection Between Data Science, Machine Learning and Artificial Intelligence
- Awesome Kotlin: a curated list of libraries and resources