DataFrame - a Swiss Army Knife of Java Data Processing

As Java developers, we do a lot of data processing. If you have terabytes pumped through your system daily, maybe you would reach for Spark, Flink or some other “big data” solution. But there are also many everyday tasks that do not warrant the complexity of traditional data pipelines. Some examples are analysis of app logs, cleaning up and persisting Excel files, simple ETL copying tables between different databases, etc. So, how can you use “big data” techniques without big data infrastructure?

This talk focuses on the “DataFrame”: an in-memory, table-like data structure whose operations include column and row filtering, transformations, joins, aggregations, etc. I use the open source DFLib library (https://dflib.org) and a Jupyter notebook to demonstrate how to do data processing in any Java app without much fuss.

Andrus Adamchik

June 13, 2024

Transcript

  1. Apps are about working with data

    Java representation of data: “data object”, List, Set, Map

    record Product(String name, String color, double price) {}
    List<Product> products = List.of(..);
  2. CRUD - Create, Read, Update, Delete

    Reading / changing data without altering its structure => ORM, etc.
  3. Data transformation

    Map<String, Long> counts = products
        .stream()
        .map(Product::color)
        .collect(Collectors.groupingBy(
            Function.identity(),
            Collectors.counting()));

    Altering structure (“projection”, join, aggregation…) => streams, generic structures
  4. Data transformation

    • No joins, etc.
    • Java type system gets in the way of representing intermediate stream results:
      • Generic structures quickly become complex => Map<A,List<Map<B,Set<C>>>>
      • Tuples are kinda missing (Map.Entry<A,B> ?)

    Missing lots of useful operations, inconvenient data structures
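A minimal sketch of the problem slide 4 describes, reusing the Product record from slide 1. The class name, the sample data and the price threshold are made up for illustration. A two-level grouping already produces a nested generic type, and a simple (name, price) pair has to borrow Map.Entry because Java has no tuple type.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NestedGenericsSketch {

    // Same shape as the Product record on slide 1
    record Product(String name, String color, double price) {}

    // Group product names by color, then by "price above threshold".
    // Two levels of grouping are enough to get a three-level generic type.
    static Map<String, Map<Boolean, List<String>>> namesByColorAndPricey(
            List<Product> products, double threshold) {
        return products.stream().collect(
                Collectors.groupingBy(
                        Product::color,
                        Collectors.partitioningBy(
                                p -> p.price() > threshold,
                                Collectors.mapping(Product::name, Collectors.toList()))));
    }

    // With no tuple type, a (name, price) pair is carried in a Map.Entry
    static Map.Entry<String, Double> cheapest(List<Product> products) {
        return products.stream()
                .min(Comparator.comparingDouble(Product::price))
                .map(p -> Map.entry(p.name(), p.price()))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Made-up sample data
        List<Product> products = List.of(
                new Product("mug", "red", 5.0),
                new Product("pen", "red", 1.2),
                new Product("cap", "blue", 12.5));

        System.out.println(namesByColorAndPricey(products, 4.0));
        System.out.println(cheapest(products));
    }
}
```

Each extra grouping level or extra paired value adds another layer to the declared type, which is the friction the slide points at.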
  5. Better abstraction is “table”

    select color, count(1) from product group by color order by count(1) desc

    Used in: databases, but also R, pandas, Spark, Flink, etc.
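For comparison, here is a sketch of the same query written with plain Java streams rather than a table abstraction. It is not DFLib code. The class name and sample data are assumptions made for illustration, and the Product record is the one from slide 1. The "order by" step needs a second pass over the grouped map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CountByColorSketch {

    // Same shape as the Product record on slide 1
    record Product(String name, String color, double price) {}

    // select color, count(1) from product group by color order by count(1) desc
    static Map<String, Long> countByColor(List<Product> products) {
        // "group by color" and "count(1)"
        Map<String, Long> counts = products.stream()
                .map(Product::color)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // "order by count(1) desc": sort the entries, keep the order in a LinkedHashMap
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a,
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Made-up sample data
        List<Product> products = List.of(
                new Product("mug", "red", 5.0),
                new Product("cap", "blue", 12.5),
                new Product("pen", "red", 1.2),
                new Product("bag", "red", 30.0),
                new Product("hat", "blue", 9.0),
                new Product("pin", "green", 0.5));

        // prints {red=3, blue=2, green=1}
        System.out.println(countByColor(products));
    }
}
```

The one-line SQL statement becomes two stream pipelines and an explicit ordered map, which illustrates why the slide calls the table the better abstraction for this kind of work.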
  6. DFLib - a Java DataFrame library

    • Open source at dflib.org
    • Core is dependency-free
    • Provides DataFrame object
    • Provides generic data operations
  7. How to choose?

    • CRUD (none or minimal transformation) => data objects / ORM
    • Simple transforms => Java streams
    • All data is in the same DB => SQL (respect “data locality”)
    • Data sets are very big, not easy to partition => Spark, Flink

    Objects/streams/ORM vs DataFrame vs Big Data vs SQL
  8. When to use DataFrame?

    • Data fits in the app memory
    • Complex transforms; streams produce nested generics, such as Map<A,List<Map<B,Set<C>>>>
    • You keep reimplementing custom join and group by over and over
    • Tabular data source and/or target
    • Mix of data sources (CSV, Excel, DB, JSON, etc.)
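The "reimplementing custom join" bullet might look like the following in practice: a hand-written inner hash join in plain Java. This is only a sketch. The Order and OrderLine records, the class name and the sample data are hypothetical and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HandRolledJoinSketch {

    // Same shape as the Product record on slide 1
    record Product(String name, String color, double price) {}

    // Hypothetical second data set, invented for this illustration
    record Order(String productName, int quantity) {}

    // Hypothetical result type; each new join shape needs another record like this
    record OrderLine(String productName, String color, double total) {}

    // Inner hash join of orders to products on the product name
    static List<OrderLine> join(List<Product> products, List<Order> orders) {
        // Build side: index products by the join key
        Map<String, Product> byName = new HashMap<>();
        for (Product p : products) {
            byName.put(p.name(), p);
        }

        // Probe side: look up each order and drop orders without a matching product
        List<OrderLine> result = new ArrayList<>();
        for (Order o : orders) {
            Product p = byName.get(o.productName());
            if (p != null) {
                result.add(new OrderLine(p.name(), p.color(), p.price() * o.quantity()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data
        List<Product> products = List.of(
                new Product("mug", "red", 5.0),
                new Product("pen", "blue", 1.5));
        List<Order> orders = List.of(
                new Order("mug", 2),
                new Order("pen", 4),
                new Order("ghost", 1));

        join(products, orders).forEach(System.out::println);
    }
}
```

Changing the join key, the join type (left, outer) or the output columns means rewriting most of this method, which is the repetition the bullet refers to.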
  9. When to use DataFrame?

    Examples:
    • Data exploration (understand what’s in the dataset)
    • Ad-hoc log analysis
    • Web app uploads Excel files
    • “No drama” ETL
    • … share your own (on GitHub under “Discussions”, or ping me on Twitter)
  10. Conclusions

    • DataFrame - a “Swiss Army Knife” of data processing
    • No special infrastructure required
    • Like SQL, but “composable”
    • Not a replacement of ORM, SQL, streams, but complementary to them
    • Notebooks and charts - another addition to the Java data toolbox
  11. Links

    • DFLib: https://dflib.org/ | https://github.com/dflib/dflib
    • Jupyter project: https://jupyter.org/
    • JJava Jupyter Kernel: https://github.com/dflib/jjava
    • Install Jupyter on Mac: https://dflib.org/docs/1.x/#jupyter
    • My Twitter: @andrus_a

    Appreciate a GitHub “star”