DataFrame - a Swiss Army Knife of Java Data Processing

As Java developers, we do a lot of data processing. If you have terabytes pumped through your system daily, maybe you would reach for Spark, Flink or some other “big data” solution. But there are also many everyday tasks that do not warrant the complexity of traditional data pipelines. Some examples are analysis of app logs, cleaning up and persisting Excel files, simple ETL copying tables between different databases, etc. So, how can you use “big data” techniques without big data infrastructure?

This talk focuses on “DataFrame” - an in-memory table-like data structure with operations including column / row filtering and transformations, joins, aggregations, etc. I am using the open source DFLib library (https://dflib.org) and a Jupyter notebook to demonstrate how to do data processing in any Java app without much fuss.

Andrus Adamchik

June 13, 2024

Transcript

  1. How do we process data?

    • I work with big data using Spark DataFrame in Python
    • Dashboards with pandas DataFrame in Python, SQL and Jupyter
    • I can do anything with Objects, Lists and Streams
  2. Is Java Even Appropriate for Data Processing?

    Or am I missing something? 😵💫😕 Should I start learning Python? 🤔
  3. Apps are about working with data

    Java representation of data: “data object”, List, Set, Map

    record Product(String name, String color, double price) {}
    List<Product> products = List.of(..);
  4. CRUD - Create, Read, Update, Delete

    Reading / changing data without altering its structure => ORM, etc.
  5. Data transformation

    Map<String, Long> counts = products
        .stream()
        .map(Product::color)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

    Altering structure (“projection”, join, aggregation…) => streams, generic structures
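The grouping on slide 5 can be run as a complete program along these lines. This is a minimal sketch: the `ColorCounts` class name and the sample products are made up for illustration, and only the `Product` record and the stream pipeline come from the slides.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ColorCounts {

    // Same shape as the Product record on slide 3
    record Product(String name, String color, double price) {}

    // Counts products per color using the pipeline from slide 5
    static Map<String, Long> countByColor(List<Product> products) {
        return products.stream()
                .map(Product::color)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the talk
        List<Product> products = List.of(
                new Product("chair", "red", 49.99),
                new Product("table", "brown", 149.0),
                new Product("lamp", "red", 19.5));

        // Key order is unspecified because groupingBy returns a HashMap by default
        System.out.println(countByColor(products));
    }
}
```

Note how the result is no longer a list of `Product` but a generic `Map<String, Long>`: the transformation changed the structure of the data, which is the point the slide makes.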
  6. Data transformation

    • Gatherers are a step in the right direction
    • No joins
    • Java type system gets in the way of representing intermediate stream results:
      • Generic structures quickly become unwieldy => Map<A,List<Map<B,Set<C>>>>
      • Tuples are kinda missing (Map.Entry<A,B> ?)

    Missing lots of useful operations, inconvenient, wasteful data structures
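To make the "no joins" and "tuples are missing" points from slide 6 concrete, here is one way an inner join might be hand-written with streams. It is a sketch under assumptions: the `Discount` and `DiscountedProduct` records and the sample data are hypothetical, and the code assumes at most one discount per color (`Collectors.toMap` throws on duplicate keys).

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ManualJoin {

    // Same shape as the Product record on slide 3
    record Product(String name, String color, double price) {}

    // Hypothetical second "table", invented for this example
    record Discount(String color, double percent) {}

    // Without a tuple type, every join result shape needs its own record
    record DiscountedProduct(String name, double finalPrice) {}

    // Inner join on color: index one side in a map, then probe it from the other side
    static List<DiscountedProduct> join(List<Product> products, List<Discount> discounts) {
        // Assumes one discount per color; duplicates would make toMap throw
        Map<String, Discount> byColor = discounts.stream()
                .collect(Collectors.toMap(Discount::color, Function.identity()));

        return products.stream()
                .filter(p -> byColor.containsKey(p.color())) // drop rows with no match
                .map(p -> new DiscountedProduct(
                        p.name(),
                        p.price() * (1 - byColor.get(p.color()).percent() / 100)))
                .toList();
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the talk
        List<Product> products = List.of(
                new Product("chair", "red", 100.0),
                new Product("table", "brown", 200.0));
        List<Discount> discounts = List.of(new Discount("red", 50.0));

        System.out.println(join(products, discounts));
    }
}
```

The index map, the filter and the one-off result record are all boilerplate that has to be rewritten for every new join.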
  7. Why don't we model data as data?

    The "other guys" are using tables (aka DataFrames), but not us. Can I have it too?
  8. DFLib - a Java DataFrame library

    • Open source at dflib.org
    • Core is dependency-free
    • Provides DataFrame object
    • Provides generic data operations
  9. Jupyter Notebook - Excel for Nerds

    $ jupyter lab

    Requires a Java kernel. DFLib provides one at dflib/jjava
  10. Performance - Memory Use

    1_000_000 rows x 2 columns:

    i1     i2
    ------ ------
    0      0
    1      1
    2      2
    ...
    999997 999997
    999998 999998
    999999 999999

    record IntRow(int i1, int i2) { }

    // 1_000_000 elements
    List<IntRow> list = ...
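The comparison on slide 10 is between a list of row objects and a table of primitive columns. The sketch below shows the two layouts side by side in plain Java. The `IntColumns` holder is hypothetical and is not DFLib's actual implementation; it only illustrates the idea that a column of ints can live in a single `int[]`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.IntStream;

public class RowsVsColumns {

    // Row layout from the slide: one object per row
    record IntRow(int i1, int i2) {}

    // Hypothetical column layout: one primitive array per column
    record IntColumns(int[] i1, int[] i2) {
        int height() {
            return i1.length;
        }
    }

    // Builds n row objects, each holding (i, i)
    static List<IntRow> buildRows(int n) {
        List<IntRow> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            rows.add(new IntRow(i, i));
        }
        return rows;
    }

    // Builds the same data as two int arrays
    static IntColumns buildColumns(int n) {
        return new IntColumns(IntStream.range(0, n).toArray(), IntStream.range(0, n).toArray());
    }

    // Sums column i2 by visiting every row object
    static long sumI2(List<IntRow> rows) {
        long sum = 0;
        for (IntRow row : rows) {
            sum += row.i2();
        }
        return sum;
    }

    // Sums column i2 by scanning a single contiguous array
    static long sumI2(IntColumns columns) {
        long sum = 0;
        for (int value : columns.i2()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        System.out.println(sumI2(buildRows(n)));
        System.out.println(sumI2(buildColumns(n)));
    }
}
```

In the row layout every `IntRow` carries its own object header and is reached through a reference held by the list, while the column layout stores just 4 bytes per value plus one array header per column. The exact saving depends on the JVM and its settings.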
  11. Ideas for When to use DataFrames

    • Data exploration (understand what’s in the dataset)
    • Ad-hoc log analysis
    • Excel, CSV files (also Avro, Parquet, etc.)
    • “No drama” ETL
  12. Conclusions

    • DataFrame - a “Swiss Army Knife” of data processing
    • No special infrastructure required
    • Like SQL, but “composable”
    • Not a replacement of ORM, SQL, streams, but complementary to them
    • Notebooks and charts - another addition to the Java data toolbox
  13. Links

    • DFLib DataFrame and JJava Kernel: https://dflib.org/
    • Jupyter project: https://jupyter.org/
    • Andrus's Twitter: @andrus_a