Pandas 2, Polars and Dask (PyDataLondon 2023)

ianozsvald
June 03, 2023

Pandas 2 brings new Arrow data types, faster calculations and better scalability. Dask scales Pandas across cores. Polars is a new competitor to Pandas designed around Arrow with native multicore support. Which should you choose for modern research workflows? We'll solve a "just about fits in RAM" data task using all three, discussing the pros and cons of each. You'll leave with a clear idea of whether Pandas 2, Dask or Polars is the tool for your team to invest in.
https://london2023.pydata.org/cfp/talk/D7HGQL/

Transcript

  1. Pandas 2, Polars or Dask?
    PyData London 2023 Talk
    @IanOzsvald – ianozsvald.com
    @GilesWeaver


  2. We are Ian Ozsvald & Giles Weaver
    Interim Chief Data Scientist
    2nd Edition!
    Data Scientist
    By Ian Ozsvald ([ian]@ianozsvald[.com]) and Giles Weaver (@gilesweaver)

  3. 3 interesting DataFrame libraries
    Lots of change in the ecosystem in recent years
    Which library should you use?
    We learned Polars in 2 weeks
    We benchmark. All benchmarks are lies

  4. Motoscape Charity Rally
    Ian - “Let’s do something silly”
    2,000 mile round trip, <£1k car
    Ideally it shouldn’t break or explode
    https://bit.ly/JustGivingIan

  5. Car Test Data (UK DVLA)
    17 years of roadtest passes or fails
    30M vehicles/year, [C|T]SV text files
    Text→Parquet made easy with Dask
    600M rows in total

  6. Pandas 2 – what’s new?
    Pandas 15 years old, NumPy based
    PyArrow first class alongside NumPy
    Internal clean-ups so less RAM used
    Copy on Write (off by default)

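As a small illustration of the Copy on Write point above: in pandas 2.x the behaviour is opt-in through the `mode.copy_on_write` option. The sketch below uses a made-up `mileage` column and is not taken from the talk's notebook.

```python
import pandas as pd

# Copy on Write is off by default in pandas 2.x, so opt in explicitly.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"mileage": [10_000, 20_000, 30_000]})

# The Series shares memory with df until one of them is written to.
subset = df["mileage"]

# Writing triggers the copy, so the parent DataFrame is left untouched.
subset.iloc[0] = -1

print(df["mileage"].tolist())
print(subset.tolist())
```

With the option enabled, modifying `subset` never changes `df`, which removes the ambiguity behind the `SettingWithCopyWarning`.
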
  7. PyArrow vs NumPy – which to use?
    NumExpr & bottleneck both installed
    Checks for identical results in notebook
    Chart labels: String dtype, Nullable integer dtype, Backend
    NumPy strings expensive in RAM, e.g. 82M rows: 39GB NumPy, 11GB Arrow

  8. Pandas+Arrow, query, Seaborn
    You can optimise by hand – mask, then choose columns to go faster

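The "mask, then choose columns" optimisation can be sketched as follows. The column names and random data are assumptions for illustration; the talk's real query is not reproduced here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame(
    {
        "make": rng.choice(["VOLVO", "VW", "FORD"], size=n),
        "test_mileage": rng.integers(0, 200_000, size=n),
        "test_result": rng.choice(["P", "F"], size=n),
        "colour": rng.choice(["RED", "BLUE", "SILVER"], size=n),
    }
)

# Convenient form: query() filters rows across every column,
# and the unwanted columns are dropped afterwards.
via_query = df.query("make == 'VOLVO'")[["test_mileage", "test_result"]]

# Hand-optimised form: build a boolean mask, then select rows and
# only the needed columns in a single .loc call.
mask = df["make"] == "VOLVO"
via_mask = df.loc[mask, ["test_mileage", "test_result"]]

print(via_query.equals(via_mask))
```

Both forms return the same rows; whether the second is noticeably faster depends on how many columns the frame has, so it is worth timing on your own data.
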
  9. Polars – what’s in it?
    Rust based, Python front-end, 3 years old
    Arrow (not NumPy)
    Inherently multi-core and parallelised
    Eager and Lazy API (+Query Planner)
    Beta out-of-core (medium data) support

  10. Polars – same query & Seaborn
    (Lazy df is even faster)

  11. A more advanced query
    Polars eager (no “lazy() / collect()” call) takes 6s
    Pandas+NumPy takes 25s (i.e. slower)
    Possibly we can further optimise this by hand (?)
    Enables the Query Planner optimisations

  12. First conclusions
    Pandas+Arrow probably faster than Pandas+NumPy
    Polars seems to be faster than Pandas+Arrow
    Maybe you can make Pandas “as fast”, but you have to experiment – Polars is “just fast”
    All benchmarks are lies – your mileage will vary

  13. Should I buy Volvo V50 – mileage?

  14. Volvo V50 lasts <24 hours
    https://bit.ly/JustGivingIan

  15. Resampling a timeseries
    This dataset is in-RAM (2021-2022)
    There’s a limit to how much we can instantiate into memory, even if we’re careful with sub-selection and dtypes

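For readers unfamiliar with resampling, here is a minimal pandas sketch over the same 2021-2022 window. The data is synthetic (one fake test per day) and the column names are assumptions, so it only shows the mechanics, not the talk's results.

```python
import pandas as pd

# Synthetic stand-in: exactly one test recorded per day across 2021-2022.
dates = pd.date_range("2021-01-01", "2022-12-31", freq="D")
tests = pd.DataFrame({"test_date": dates, "n_tests": 1})

# Resampling needs a datetime index; "W" groups rows into weekly bins.
weekly = tests.set_index("test_date")["n_tests"].resample("W").sum()

print(weekly.head())
```

The same call with a different rule string gives other bin sizes, which is convenient for spotting seasonal dips in the counts.
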
  16. Scanning 640M rows of larger dataset
    Implicit Lazy DataFrame
    11 seconds, 640M rows, circa 850 partitions (files)

  17. April drop was due to lockdown

  18. Vehicle ownership increases, Hybrids growing
    We have to touch all parquet files, so we can’t easily use Pandas
    MOT after 3 years of age for all vehicles

  19. Dask scales Pandas (and lots more)

  20. Adding columns=[...] saves 20s

  21. For the rally we bought a ‘99 Passat
    Chart labels: Dead before 2023, Still alive, Us
    https://bit.ly/JustGivingIan

  22. 3min+ with default 4 workers (*4 threads)
    1min with 12 workers (*1 thread), hand tuned
    Giles had to push directives to the Arrow read, set shuffle on set_index and agg

  23. Issues encountered

  24. Thoughts on our testing
    Haven’t checked to_numpy(), Numba, apply, rolling, writing partitioned Parquet (Polars)
    NaN / Missing behaviour different Polars/Pandas
    sklearn partial support (sklearn assumes Pandas API) – but maybe Pandas+Arrow has copy issues too?
    Arrow timeseries/str different to Pandas NumPy?

  25. Medium-data conclusions
    Dask ddf and Polars can perform similarly
    Dask learning curve harder, especially for performance
    Dask does a lot more (e.g. Bag, ML, NumPy, clusters, diagnostics)

  26. Summary
    Experiment, we have options!
    I love receiving postcards (email me)
    Sponsor Ian (JustGiving)
    Join after for our discussion
    https://bit.ly/JustGivingIan

  27. Appendix

  28. Arrow RAM usage great!
    Polars is similar as it uses Arrow

  29. Manual Query Planning

  30. 3min+ with default 4 workers (*4 threads)
    1min with 12 workers (*1 thread) – hand tuned
    Giles had to sort the Parquet (6 mins) & change groupby agg shuffle, else performance much worse