
Kotlin ❤️ Data Science?

There is a mismatch between software engineering and data science. My talk addresses this mismatch and examines whether Kotlin can help bring these two worlds closer together.

Preslav Rachev

January 28, 2019

Transcript

  1. Kotlin ❤️ Data Science?*
    Preslav Rachev @ KI labs // 28.01.2019
    * Data science and data engineering
    @preslavrachev / (https://preslav.me), 2019 1


  2. Who am I?
    — A software engineer, working at KI labs.
    — Passionate about Kotlin and data.
    — A genuinely curious individual who loves writing.

    Also, an inventor of funny faces

    https://preslav.me


  4. KotlinConf 2018

  5. The IT Reality of 2019
    — "AI" has become a favourite topic among business managers and software
    engineers when discussing company innovation strategies.
    — Tech media is only making it worse.
    — Data is everywhere, but getting useful knowledge out of it is very
    different from what management and engineering imagine.

  6. AI, ML, DS?!?

  7. AI, ML, DS?!?
    — AI is what brings the VC money in.
    — ML (a.k.a. sophisticated brute-force) is what gets the job done.
    — ML models are narrowly limited to a given domain.
    — DS is the craft of finding which ML model works for a particular case,
    and which doesn't.

  8. Data Science Definition
    Data science is a "concept to unify statistics, data analysis, machine learning
    and their related methods" in order to "understand and analyze actual
    phenomena" with data. It employs techniques and theories drawn from
    many fields within the context of mathematics, statistics, information
    science, and computer science.
    — Wikipedia

  9. Data Science Workflow
    1. Form a hypothesis
    2. Load data from various sources
    3. Clean, transform and unify
    4. Extract features
    5. Use the features to run a model
    6. Visualize and report findings
    7. Support or refute the hypothesis
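    The seven steps above can be sketched in plain Kotlin. This is only an
    illustration under stated assumptions: the `Measurement` type, the
    hard-coded data, the use of Pearson's correlation as a stand-in "model",
    and the 0.5 threshold are all hypothetical and not taken from the talk.

```kotlin
import kotlin.math.sqrt

// Hypothetical record type; the field names are assumptions for illustration.
data class Measurement(val heightCm: Double?, val weightKg: Double?)

// Steps 2-3: "load" raw rows (hard-coded by the caller) and drop incomplete ones.
fun loadAndClean(raw: List<Measurement>): List<Pair<Double, Double>> =
    raw.mapNotNull { m ->
        val h = m.heightCm
        val w = m.weightKg
        if (h != null && w != null) h to w else null
    }

// Steps 4-5: treat the two columns as features and compute Pearson's r
// as a very small stand-in for "running a model".
fun pearson(points: List<Pair<Double, Double>>): Double {
    val xs = points.map { it.first }
    val ys = points.map { it.second }
    val meanX = xs.average()
    val meanY = ys.average()
    val cov = xs.zip(ys).sumOf { (x, y) -> (x - meanX) * (y - meanY) }
    val varX = xs.sumOf { (it - meanX) * (it - meanX) }
    val varY = ys.sumOf { (it - meanY) * (it - meanY) }
    return cov / sqrt(varX * varY)
}

fun main() {
    // Step 1: hypothesis: taller people tend to weigh more.
    val raw = listOf(
        Measurement(160.0, 55.0),
        Measurement(170.0, 65.0),
        Measurement(null, 70.0), // incomplete row, removed during cleaning
        Measurement(180.0, 75.0),
        Measurement(190.0, 85.0)
    )
    val r = pearson(loadAndClean(raw))
    // Steps 6-7: report the finding and decide; the 0.5 threshold is arbitrary.
    println("Pearson r = $r")
    println(if (r > 0.5) "Hypothesis supported" else "Hypothesis not supported")
}
```

    In a real project each step would likely live in its own function or
    module, so that engineers and data scientists can swap out one stage
    without touching the others.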

  10. Motivation
    There is a mismatch between software engineering and data science practices:
    — Software engineering works best when building well-defined systems
    — Requirements rarely change entirely, but evolve over time
    — Data science deals with supporting and refuting hypotheses
    — Lots of uncertainty, which requires seamless exploration and visualisation

  11. Motivation
    Due to this mismatch, systems often end up becoming complex tech-stack
    mash-ups, where each side treats the other as some sort of a black box.
    - Difficult to maintain, and they require lots of different skills and practices
    The question arises:
    Could there be a single tech stack that allows software engineers and data
    scientists to work in peace, but also directly contribute to a single codebase?


  13. Kotlin == the missing link?
    — A multi-paradigm programming language with a fluent syntax
    — Strong community and enterprise backing
    — Access to the entire universe of JVM knowledge and libraries
    But there are a few important pieces, which are not quite there yet:
    — Fully integrated scripting capabilities
    — Playground environments (e.g. notebooks)
    — Data wrangling and visualization libraries that take advantage of the above

  14. Kotlin Top 10 Features
    Kotlin is a multi-paradigm programming language, equally easy to learn
    by both Java and Python programmers.
    My personal Top 10:
    1. Static typing
    2. Immutability and Null-safety
    3. Higher-order functions
    4. Chain-able sequences
    5. Data classes
    6. Extension methods
    7. Sealed classes
    8. Coroutines
    9. Default and named arguments
    10. Multi-platform support
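    Item 9 in the list, default and named arguments, could look roughly like
    this. The `describe` function and its parameters are hypothetical names
    chosen for illustration only.

```kotlin
import java.util.Locale

// Hypothetical helper: `label` and `precision` have defaults, so callers may omit them.
fun describe(values: List<Double>, label: String = "series", precision: Int = 2): String {
    val mean = values.average()
    // Locale.ROOT keeps the decimal separator a dot regardless of system settings.
    return "$label: mean=" + "%.${precision}f".format(Locale.ROOT, mean)
}

fun main() {
    val data = listOf(1.0, 2.0, 4.0)
    println(describe(data))                             // prints "series: mean=2.33"
    println(describe(data, precision = 1, label = "x")) // prints "x: mean=2.3"
}
```

    Named arguments can be passed in any order, which tends to make calls with
    several optional parameters easier to read.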

  15. Static Typing
    List<Integer> nums = new ArrayList<>(Arrays.asList(1, 2, 3)); // Java
    val nums = listOf(1, 2, 3) // Kotlin infers List<Int>
    Immutability and Null-Safety
    // Every variable in Kotlin must be assigned a value,
    // unless explicitly declared with `lateinit`
    val x = 100 // cannot be reassigned, ever!
    var y = 200 // this one can
    lateinit var z: String // A value will be provided later (lateinit needs an explicit, non-primitive type)
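    The slide names null-safety but shows only immutability. A minimal sketch
    of the null-safety side might look like this; `labelLength` is a
    hypothetical helper, not something from the talk.

```kotlin
// Hypothetical helper: returns the length of a possibly missing label, or 0.
// `?.` only dereferences when the value is non-null; `?:` supplies the fallback.
fun labelLength(label: String?): Int = label?.length ?: 0

fun main() {
    val name: String = "Kotlin"  // non-nullable: assigning null would not compile
    val nickname: String? = null // nullable: the '?' makes null an allowed value
    println(labelLength(name))     // prints 6
    println(labelLength(nickname)) // prints 0
}
```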

  16. Higher-Order Functions and DSL support
    A higher-order function is a function that takes functions as parameters, or
    returns a function.
    route("/portal") {
        route("articles") { … }
        route("admin") {
            intercept(ApplicationCallPipeline.Features) { … } // verify admin privileges
            route("article/{id}") { … } // manage article with {id}
            route("profile/{id}") { … } // manage profile with {id}
        }
    }
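    Stripped of the routing DSL, both halves of the definition (taking a
    function and returning one) can be shown with the standard library alone.
    `applyToAll` and `scaleBy` are hypothetical names used for illustration.

```kotlin
// Takes a function as a parameter: applies `transform` to every element.
fun applyToAll(values: List<Double>, transform: (Double) -> Double): List<Double> =
    values.map(transform)

// Returns a function: builds a scaler for the given factor.
fun scaleBy(factor: Double): (Double) -> Double = { it * factor }

fun main() {
    val doubled = applyToAll(listOf(1.0, 2.0, 3.0), scaleBy(2.0))
    println(doubled) // prints [2.0, 4.0, 6.0]
}
```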

  17. Data classes and Chain-able Sequences
    data class Person(val name: String, val age: Int)
    val people =
        listOf(Person("Chris Martin", 31),
               Person("Will Champion", 32),
               Person("Jonny Buckland", 33),
               Person("Guy Berryman", 34),
               Person("Mhris Cartin", 30))
    println(people
        .asSequence() // convert to sequence
        .filter { it.age > 30 } // lazy eval (intermediate op)
        .map {
            // build initials from the first letter of each name part
            it.name.split(" ").map { part -> part[0] }.joinToString("")
        } // lazy eval (intermediate op)
        .map { it.uppercase() } // lazy eval; uppercase() replaces the deprecated toUpperCase() since Kotlin 1.5
        .toList()) // terminal operation
    Tip: Combine these with coroutines to construct declarative data pipelines.
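    A full coroutine pipeline would need the separate kotlinx.coroutines
    library, so the sketch below stays within the standard library and uses
    the `sequence { }` builder, which is itself implemented with suspending
    functions. The `readings` generator and its values are hypothetical.

```kotlin
// Lazily produces values one at a time; `yield` suspends the generator
// until the consumer asks for the next element.
fun readings(): Sequence<Int> = sequence {
    var value = 1
    while (true) {
        yield(value)
        value *= 2
    }
}

fun main() {
    val result = readings()
        .filter { it > 4 } // intermediate op, evaluated lazily
        .map { it * 10 }   // intermediate op, evaluated lazily
        .take(3)           // stops the otherwise infinite generator
        .toList()          // terminal operation
    println(result) // prints [80, 160, 320]
}
```

    Because every stage is lazy, the infinite generator is only advanced as far
    as `take(3)` requires.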

  18. Sealed Classes
    sealed class ArithmeticOperation
    class Add(val a: Int, val b: Int) : ArithmeticOperation()
    class Subtract(val a: Int, val b: Int) : ArithmeticOperation()
    class Multiply(val a: Int, val b: Int) : ArithmeticOperation()
    class Divide(val a: Int, val b: Int) : ArithmeticOperation()
    fun execute(op: ArithmeticOperation) = when (op) {
        is Add -> op.a + op.b
        is Subtract -> op.a - op.b
        is Multiply -> op.a * op.b
        is Divide -> op.a / op.b
    }

  19. Extension Methods
    fun String.underscore(): String {
        return this
            .replace(" ", "_")
    }
    print("hello world".underscore()) // "hello_world"
    Infix support
    infix fun Number.toPowerOf(exponent: Number): Double {
        return Math.pow(this.toDouble(), exponent.toDouble())
    }
    3 toPowerOf 2 // 9.0
    9 toPowerOf 0.5 // 3.0

  20. The Ecosystem
    No programming language will do the job without an abundant library
    ecosystem to pick and choose from.
    — The Kotlin Standard Library will be your first choice, but by far not the
    only one.
    — Kotlin is standing on the shoulders of giants (e.g. the JVM)
    — The future prospects of integrating low-level libraries through
    Kotlin Native are even more promising

  21. What the JVM has to offer...
    Library               Functionality
    Apache Hadoop         Batch Processing
    Apache Spark          Data Streaming
    ND4J                  scientific computing (similar to NumPy)
    Apache Commons Math   Math and computing utils
    Weka                  ML/NLP (similar to SciPy)
    Tablesaw              Visualization (similar to Matplotlib and Plot.ly)
    TensorFlow for Java   Deep ML
    Deeplearning4j        Deep ML
    And many, many more...

  22. ...besides, a young ecosystem of libs targeting Kotlin's unique features:
    Library             Functionality
    Krangl              Data wrangling (similar to Pandas)
    Kravis              Visualisation (similar to Matplotlib and Plot.ly)
    Koma                scientific computing (similar to SciPy)
    kotlin-statistics   scientific computing and statistics
    komputation         neural network for the JVM written in Kotlin and CUDA C
    Still, no real Pandas yet

  23. Many of the above libraries use standard formats for exchanging input data
    (e.g. CSV) or results (e.g. trained ML models, reports, aggregated data
    sets, visualisations, etc).
    — At the very least, this means that one can create an environment in
    which data scientists keep using their favourite tools and
    communicate their findings to the software engineers using those
    standards.
    — Kotlin can become a common ground for understanding each other's code
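    As a rough illustration of such a hand-off, the sketch below reads a small
    CSV text with the Kotlin standard library only and aggregates it. The
    `Sale` type, the column names and the data are hypothetical, and the
    parser assumes no quoted fields or embedded commas.

```kotlin
// Hypothetical row type; the column names are assumptions for illustration.
data class Sale(val region: String, val amount: Double)

// Parses simple comma-separated text with a header row.
// Assumes no quoted fields or embedded commas.
fun parseSales(csv: String): List<Sale> =
    csv.lineSequence()
        .drop(1) // skip the header row
        .filter { it.isNotBlank() }
        .map { line ->
            val (region, amount) = line.split(",")
            Sale(region.trim(), amount.trim().toDouble())
        }
        .toList()

// Sums the amounts per region, keeping the order in which regions first appear.
fun totalsByRegion(sales: List<Sale>): Map<String, Double> =
    sales.groupBy { it.region }
        .mapValues { (_, rows) -> rows.sumOf { it.amount } }

fun main() {
    val csv = """
        region,amount
        north,10.5
        south,4.0
        north,2.5
    """.trimIndent()
    println(totalsByRegion(parseSales(csv))) // prints {north=13.0, south=4.0}
}
```

    For real data a dedicated CSV library would be the safer choice, since
    quoting and escaping rules are easy to get wrong by hand.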
    OK, but can we go one step further from there?


  25. The ultimate data scientist peace requires three more things:
    1. Better Kotlin scripting support
    2. A solid REPL (Read-Eval-Print Loop) console
    3. Tools that encourage experimentation and interactive programming

  26. Scripting Support
    A large portion of the data team's work involves writing and deploying
    executable scripts.
    This is one area where Python excels.
    Kotlin Script is unfinished, slow and painful to work with.
    KEEP-75

  27. KScript
    Is an open-source project that tries to improve the performance of Kotlin
    scripts, and reduce the friction when working with 3rd-party libs:
    #!/usr/bin/env kscript
    @file:DependsOn("de.mpicbg.scicomp:kutils:0.4")
    import de.mpicbg.scicomp.bioinfo.openFasta
    if (args.size != 1) {
        System.err.println("Usage: CountRecords <fasta>")
        kotlin.system.exitProcess(-1)
    }
    val records = openFasta(java.io.File(args[0]))
    println(records.count())

  28. REPL
    Kotlin has a REPL (Read-Eval-Print Loop), but it is a tough beast.
    IntelliJ extends the Kotlin REPL and makes it a bit nicer to work with.
    Check out KShell as an alternative.

  29. Interactive Programming
    Also known as notebooks or playgrounds, tools like Jupyter allow for a
    unique mix of narrative and code.
    — Let programmers play around with data and libs in a visual, REPL-like
    environment
    — Great for sharing and explaining difficult concepts
    Kotlin Jupyter
    Kotlin Playground

  30. What did we learn?
    — Kotlin is a great language with a mature library ecosystem.
    — It lacks some of the tooling that data scientists need.
    — The community and JetBrains are working hard to fill the gaps.
    — We wouldn't have come this far if it weren't for these folks:
    @ligee, @thomasnield9727, @holgerbrandl and many more
    around the #datascience community on Slack.

  31. Thank You!
    Short Demo
    Questions?

  32. Links
    — The Connection Between Data Science, Machine Learning and Artificial Intelligence
    — Awesome Kotlin - a curated list of libraries and resources

  33. Why Kotlin and not Scala?