DataFrame - a Swiss Army Knife of Java Data Processing

by Andrus Adamchik, @andrus_a

About me...

Founder of ObjectStyle, a bespoke software dev shop

Open Source Projects

Is Java a Good Choice for Data Processing?

Apps are about working with data
Java representation of data: “data object”, List, Set, Map

    record Product(String name, String color, double price) {}
    List<Product> products = List.of(..);

CRUD - Create, Read, Update, Delete
Reading / changing data without altering its structure => ORM, etc.

Data transformation

    Map<String, Long> counts = products
        .stream()
        .map(Product::color)
        .collect(Collectors.groupingBy(
            Function.identity(), Collectors.counting()));

Altering structure (“projection”, join, aggregation…) => streams, generic structures
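The snippet above can be turned into a runnable sketch with only the JDK. The class name ColorCounts, the helper countByColor and the sample products are assumptions made for illustration; only the Product record and the stream pipeline come from the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ColorCounts {

    // Same shape as the Product record shown in the slides
    record Product(String name, String color, double price) {}

    // Counts products per color. The result (color -> count) has a
    // different structure than the input list of products.
    static Map<String, Long> countByColor(List<Product> products) {
        return products.stream()
                .map(Product::color)
                .collect(Collectors.groupingBy(
                        Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the talk
        List<Product> products = List.of(
                new Product("chair", "red", 49.99),
                new Product("table", "brown", 199.00),
                new Product("lamp", "red", 25.50));

        // Iteration order of the resulting map is not guaranteed
        System.out.println(countByColor(products));
    }
}
```

Note that the output type is already a generic Map rather than a domain object, which is where the next slide picks up.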

Data transformation
• No joins, etc.
• Java type system gets in the way of representing intermediate stream results:
  • Generic structures quickly become complex => Map<A,List<Map<B,Set<C>>>>
  • Tuples are kinda missing (Map.Entry<A,B> ?)
Missing lots of useful operations, inconvenient data structures
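To make the "no joins" and "tuples are kinda missing" points concrete, here is a minimal sketch of an inner join written by hand with streams. The Discount record, the ManualJoin class and the sample data are hypothetical and exist only for this illustration. The sketch also assumes at most one discount per color, since Collectors.toMap throws on duplicate keys.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ManualJoin {

    record Product(String name, String color, double price) {}

    // Hypothetical second data set, introduced only for this example
    record Discount(String color, double percent) {}

    // Inner join on "color": index one side by key, then look up from the other.
    // Map.Entry stands in for the tuple type that Java lacks.
    static List<Map.Entry<Product, Discount>> join(
            List<Product> products, List<Discount> discounts) {

        // Assumes one discount per color; duplicates would throw here
        Map<String, Discount> byColor = discounts.stream()
                .collect(Collectors.toMap(Discount::color, Function.identity()));

        return products.stream()
                .filter(p -> byColor.containsKey(p.color()))
                .map(p -> Map.entry(p, byColor.get(p.color())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Product> products = List.of(
                new Product("chair", "red", 49.99),
                new Product("table", "brown", 199.00));
        List<Discount> discounts = List.of(new Discount("red", 10.0));

        // Only products with a matching discount survive the inner join
        join(products, discounts).forEach(e ->
                System.out.println(e.getKey().name() + " -> " + e.getValue().percent()));
    }
}
```

Every new join key or join type (left, outer) would need another variation of this code, and the result type grows more awkward with each extra data set.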

Better abstraction is “table”

    select color, count(1)
    from product
    group by color
    order by count(1) desc

Used in: databases, but also R, pandas, Spark, Flink, etc.
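For comparison, a rough JDK-only equivalent of the query above might look like the sketch below. The class name SortedColorCounts, the helper colorsByCountDesc and the sample data are assumptions for illustration. The grouping and the ordering become two separate stream passes, and the "row" type is again Map.Entry.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SortedColorCounts {

    record Product(String name, String color, double price) {}

    // Roughly: select color, count(1) from product
    //          group by color order by count(1) desc
    static List<Map.Entry<String, Long>> colorsByCountDesc(List<Product> products) {
        // "group by color" with "count(1)"
        Map<String, Long> counts = products.stream()
                .collect(Collectors.groupingBy(Product::color, Collectors.counting()));

        // "order by count(1) desc": a second pass over the map entries.
        // The relative order of colors with equal counts is not defined.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample data
        List<Product> products = List.of(
                new Product("chair", "red", 49.99),
                new Product("table", "brown", 199.00),
                new Product("lamp", "red", 25.50));

        colorsByCountDesc(products).forEach(e ->
                System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```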

DataFrame - a “missing” data structure
Table is the data object

DFLib - a Java DataFrame library
• Open source at dflib.org
• Core is dependency-free
• Provides DataFrame object
• Provides generic data operations

Demo Environment - Jupyter Notebook

    $ jupyter lab

Requires Java kernel, e.g. dflib/jjava

Demo: ETL hockey games => team standings

How to choose?
• CRUD (none or minimal transformation) => data objects / ORM
• Simple transforms => Java streams
• All data is in the same DB => SQL (respect “data locality”)
• Data sets are very big, not easy to partition => Spark, Flink
Objects/streams/ORM vs DataFrame vs Big Data vs SQL

When to use DataFrame?
• Data fits in the app memory
• Complex transforms; streams produce nested generics, such as Map<A,List<Map<B,Set<C>>>>
• You keep reimplementing custom join and group by over and over
• Tabular data source and/or target
• Mix of data sources (CSV, Excel, DB, JSON, etc.)

When to use DataFrame? Examples:
• Data exploration (understand what’s in the dataset)
• Ad-hoc log analysis
• Web app uploads Excel files
• “No drama” ETL
• … share your own (on GitHub under “Discussions”, or ping me on Twitter)

Conclusions
• DataFrame - a “Swiss Army Knife” of data processing
• No special infrastructure required
• Like SQL, but “composable”
• Not a replacement for ORM, SQL, streams, but complementary to them
• Notebooks and charts - another addition to the Java data toolbox

Links
• DFLib: https://dflib.org/ | https://github.com/dflib/dflib
• Jupyter project: https://jupyter.org/
• JJava Jupyter Kernel: https://github.com/dflib/jjava
• Install Jupyter on Mac: https://dflib.org/docs/1.x/#jupyter
• My Twitter: @andrus_a
Appreciate a GitHub “star”
