There is a mismatch between software engineering and data science. My talk addresses this mismatch and examines whether Kotlin can help bring the two worlds closer together.
favourite topic among business managers and software engineers, when discussing company innovation strategies.
- Tech media is only making it worse.
- Data is everywhere, but getting useful knowledge is far different from what management and engineering imagine.

@preslavrachev / (https://preslav.me), 2019

Money in.
- ML (a.k.a. sophisticated brute force) is what gets the job done.
- ML models are narrowly limited to a given domain.
- DS is the craft of finding which ML model works for a particular case, and which doesn't.

statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.
Source: Wikipedia

from various sources
3. Clean, transform and unify
4. Extract features
5. Use the features to run a model
6. Visualize and report findings
7. Support or refute the hypothesis

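The workflow above could be sketched in plain Kotlin roughly as follows. This is only a minimal illustration using the standard library: the inline CSV data, the `Measurement` class, the helper names, and the hypothesis ("group B has a higher mean than group A") are all made up for the example, and a simple comparison of means stands in for a real model.

```kotlin
// Hypothetical record type for this example; not from the talk.
data class Measurement(val group: String, val value: Double)

// Collect and clean: skip the header, drop malformed rows and non-numeric values.
fun parse(rawCsv: String): List<Measurement> =
    rawCsv.lineSequence()
        .drop(1)                                         // skip the header row
        .map { line -> line.split(",").map { it.trim() } }
        .filter { it.size == 2 }                         // keep only well-formed rows
        .mapNotNull { (group, value) ->
            value.toDoubleOrNull()?.let { Measurement(group, it) }
        }
        .toList()

// Feature extraction and "model": the per-group mean is the only feature,
// and comparing means is a stand-in for a real model.
fun groupMeans(data: List<Measurement>): Map<String, Double> =
    data.groupBy { it.group }
        .mapValues { (_, rows) -> rows.map { it.value }.average() }

fun main() {
    // Made-up data; the "n/a" row simulates dirty input.
    val raw = """
        group,value
        A,1.0
        A,3.0
        B,4.0
        B,n/a
        B,6.0
    """.trimIndent()

    val means = groupMeans(parse(raw))

    // Report the findings, then support or refute the hypothesis.
    println(means)
    val supported = means.getValue("B") > means.getValue("A")
    println(if (supported) "Hypothesis supported" else "Hypothesis refuted")
}
```

In a real project each step would typically be backed by a dedicated library rather than hand-written helpers; the point here is only that the steps map naturally onto chained collection operations.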
science practices:
- Software engineering works best when building well-defined systems
- Requirements rarely change entirely, but evolve over time
- Data science deals with supporting and refuting hypotheses.
- Lots of uncertainty, which requires seamless exploration and visualisation

complex tech-stack mash-ups, where each side treats the other as some sort of a black box.
- Difficult to maintain and requires lots of different skills and practices

The question arises: Could there be a single tech stack that allows software engineers and data scientists to work in peace, and also to contribute directly to a single codebase?

with a fluent syntax
- Strong community and enterprise backing
- Access to the entire universe of JVM knowledge and libraries

But there are a few important pieces that are not quite there yet:
- Fully integrated scripting capabilities
- Playground environments (e.g. notebooks)
- Data wrangling and visualization libraries that take advantage of the above

equally easy to learn by both Java and Python programmers.

My personal Top 10:
1. Static typing
2. Immutability and null-safety
3. Higher-order functions
4. Chainable sequences
5. Data classes
6. Extension functions
7. Sealed classes
8. Coroutines
9. Default and named arguments
10. Multi-platform support

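Several of the features in this list can be seen working together in a short sketch. The `Reading` hierarchy, `describe`, and `summarize` are hypothetical names invented for this illustration, not code from the talk.

```kotlin
// Sealed class: the compiler knows every subtype, so `when` needs no `else` branch.
sealed class Reading

// Data classes: equals, hashCode, toString and copy are generated automatically.
data class Valid(val sensor: String, val value: Double) : Reading()
data class Invalid(val sensor: String, val reason: String) : Reading()

// Extension function: adds describe() to Reading without modifying the class.
fun Reading.describe(): String = when (this) {
    is Valid -> "$sensor=$value"
    is Invalid -> "$sensor failed: $reason"
}

// Default and named arguments, plus a higher-order function parameter.
fun summarize(
    readings: List<Reading>,
    threshold: Double = 0.0,
    transform: (Double) -> Double = { it }
): Double =
    readings.asSequence()              // chainable, lazily evaluated sequence
        .filterIsInstance<Valid>()     // static typing: the result is Sequence<Valid>
        .map { transform(it.value) }
        .filter { it > threshold }
        .sum()

fun main() {
    val readings = listOf(
        Valid("a", 1.5),
        Invalid("b", "timeout"),
        Valid("c", 2.5)
    )
    readings.forEach { println(it.describe()) }
    println(summarize(readings))                          // defaults only
    println(summarize(readings, threshold = 2.0))         // named argument
    println(summarize(readings, transform = { it * 2 }))  // custom transform
}
```

Coroutines and multi-platform support are left out here because they need additional libraries or build configuration.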
Java
    val nums = listOf(1, 2, 3) // Kotlin

Immutability and Null-Safety

    // Every variable in Kotlin must be assigned a value,
    // unless explicitly declared with `lateinit`
    val x = 100 // cannot be reassigned
    var y = 200 // this one can
    lateinit var z: String // a value will be provided later (lateinit needs an explicit, non-primitive type)

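The slide title also mentions null-safety, which the snippet itself does not show. A minimal sketch of the idea, using a hypothetical helper `nameLength` that is not part of the talk, might look like this:

```kotlin
// Returns the length of the trimmed name, or 0 when no name is given.
// `String?` marks the parameter as nullable; plain `String` could never hold null.
fun nameLength(name: String?): Int =
    name?.trim()?.length ?: 0   // safe calls yield null on a null receiver; Elvis supplies the fallback

fun main() {
    val present: String? = " Ada "
    val missing: String? = null
    // val broken: String = null   // would not compile: String is a non-null type
    println(nameLength(present))
    println(nameLength(missing))
}
```

The compiler rejects a direct `name.length` on a `String?`, so the null case has to be handled explicitly before the code compiles.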
the job, without an abundant library ecosystem to pick and choose from.
- The Kotlin Standard Library will be your first choice, but by far not the only one.
- Kotlin is standing on the shoulders of giants (e.g. the JVM)
- The future prospects of integrating low-level libraries with Kotlin/Native are even more promising

Batch Processing
- Apache Spark: Data Streaming
- ND4J: scientific computing (similar to NumPy)
- Apache Commons Math: Math and computing utils
- Weka: ML/NLP (similar to SciPy)
- Tablesaw: Visualization (similar to Matplotlib and Plot.ly)
- TensorFlow for Java: Deep ML
- Deeplearning4j: Deep ML

And many, many more...

Library: Functionality
- Krangl: Data wrangling (similar to Pandas)
- Kravis: Visualisation (similar to Matplotlib and Plot.ly)
- Koma: scientific computing (similar to SciPy)
- kotlin-statistics: scientific computing and statistics
- komputation: neural network for the JVM written in Kotlin and CUDA C

Still, no real Pandas yet.

data (e.g. CSV) or results (e.g. trained ML models, reports, aggregated data sets, visualisations, etc).
- At the very least, this means that one can create an environment in which data scientists keep using their favourite tools and communicate their findings to the software engineers using those standards.
- Kotlin can become a common ground for understanding each other's code

OK, but can we go one step further from there?

data team involves the use and deployment of executable scripts. This is one field where Python excels.

Kotlin Script is unfinished, slow and painful to work with.

KEEP-75

performance of Kotlin scripts, and reduce the friction when working with 3rd-party libs:

    #!/usr/bin/env kscript
    @file:DependsOn("de.mpicbg.scicomp:kutils:0.4")

    import de.mpicbg.scicomp.bioinfo.openFasta

    if (args.size != 1) {
        System.err.println("Usage: CountRecords <fasta>")
        kotlin.system.exitProcess(-1)
    }

    val records = openFasta(java.io.File(args[0]))
    println(records.count())

a tough beast. IntelliJ extends the Kotlin REPL and makes it a bit nicer to work with. Check out KShell as an alternative.

Jupyter allow for a unique mix of narrative and code.
- Let programmers play around with data and libs in a visual, REPL-like environment
- Great for sharing and explaining difficult concepts

Kotlin Jupyter
Kotlin Playground

with a mature library ecosystem.
- It lacks some of the tooling that data scientists need.
- The community and JetBrains are working hard to fill the gaps.
- We wouldn't have come this far if it weren't for these folks: @ligee, @thomasnield9727, @holgerbrandl and many more around the #datascience community on Slack.