Flying Pandas - Modin, Dask and Vaex

ianozsvald
April 25, 2020

A 10-minute talk at Remote Pizza Python on when you might replace Pandas with Modin, Dask or Vaex for bigger-than-RAM and parallelised computation.

Transcript

  1. Flying Pandas - Dask, Modin and Vaex (live from London)
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    Remote Pizza Python 2020


  2. Introductions

    Interim Chief Data Scientist

    19+ years experience

    Team coaching & public courses
    – Higher Performance!

    2nd Edition, May 2020
    By [ian]@ianozsvald[.com] Ian Ozsvald


  3. Today’s goal

    When to use Modin or Dask

    A quick peek at Vaex


  4. When does Pandas get smelly?

    10 million rows is “probably fine” but needs tens of GB of RAM

    Probably only single core, built for in-RAM computation

    Complex 10-year-old codebase, hard to optimise

    The following tools are Pandas-like (each with differences)
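
A rough back-of-envelope sketch of why 10 million rows can need tens of GB. The column count and the 8 bytes per value (float64 or int64) are illustrative assumptions, not figures from the talk, and string columns or temporary copies made during operations would typically push the real number higher.

```python
def estimate_ram_gb(n_rows, n_numeric_cols, bytes_per_value=8):
    """Lower-bound RAM estimate for a purely numeric table, in GB (1e9 bytes)."""
    return n_rows * n_numeric_cols * bytes_per_value / 1e9


# Hypothetical shape: 10 million rows by 250 numeric columns.
print(estimate_ram_gb(10_000_000, 250))   # 20.0
```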


  5. Modin

    A new “algebra” for DataFrames, reimplemented functions & Pandas fallback

    Young project, drop-in replacement

    Uses Ray for parallel computation

    Easy to experiment with
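
As a sketch of what "drop-in replacement" can look like in practice: the usual form is simply `import modin.pandas as pd` in place of `import pandas as pd`. The two helper functions below are hypothetical names used only for illustration; they pick Modin when it is installed and plain Pandas otherwise, so the rest of a script can keep using `pd` unchanged.

```python
import importlib
import importlib.util


def dataframe_module_name():
    """Pick Modin's Pandas-compatible module when installed, else plain Pandas."""
    if importlib.util.find_spec("modin") is not None:
        return "modin.pandas"
    return "pandas"


def load_dataframe_module():
    """Import whichever library was picked; callers bind the result to `pd`."""
    return importlib.import_module(dataframe_module_name())


# Usage (needs pandas, or modin plus an engine such as Ray, to be installed):
# pd = load_dataframe_module()
# df = pd.DataFrame({"a": [1, 2, 3]})
# print(df["a"].sum())
```

Only the import changes here. Whether a given operation is actually faster under Modin depends on the workload, so it is worth timing both.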

  6. Modin


  7. Modin
    https://github.com/modin-project/modin/issues/1390


  8. Dask Distributed DataFrame

    Mature project, Array (NumPy), Bag (list-like)

    Distributed dataframe for Pandas – row blocks, not cols
    https://dask.readthedocs.io/en/latest/dataframe.html
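
The "row blocks, not cols" idea can be sketched in plain Python. This is a conceptual illustration only, not Dask's implementation, and every function name is hypothetical. Rows are split into consecutive blocks, each block yields a small partial result on its own, and the partials are combined at the end.

```python
def split_into_row_blocks(rows, block_size):
    """Split a list of row dicts into consecutive blocks of rows."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def block_sum_and_count(block, column):
    """Partial result computed independently on one block."""
    values = [row[column] for row in block]
    return sum(values), len(values)


def mean_over_blocks(blocks, column):
    """Combine the per-block partial results into a global mean."""
    partials = [block_sum_and_count(block, column) for block in blocks]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


rows = [{"id": i, "value": float(i)} for i in range(10)]
blocks = split_into_row_blocks(rows, block_size=4)
print([len(block) for block in blocks])     # [4, 4, 2]
print(mean_over_blocks(blocks, "value"))    # 4.5
```

Because each block only needs to produce a sum and a count, the blocks can be processed one at a time or on different workers without the whole table being in RAM at once.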

  9. Dask – remember to “.compute()”
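
Why the `.compute()` call matters can be shown with a toy model of lazy evaluation. The `Lazy` class below is hypothetical and far simpler than Dask's task graphs. It only illustrates that building an expression runs nothing, and that work happens when `compute()` is called.

```python
class Lazy:
    """A deferred computation: stores a function and its inputs, runs nothing yet."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        """Evaluate any lazy inputs first, then apply this node's function."""
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which steps have actually executed


def load(n):
    calls.append("load")
    return list(range(n))


def total(values):
    calls.append("total")
    return sum(values)


graph = Lazy(total, Lazy(load, 5))   # builds the graph only
print(calls)                         # []
print(graph.compute())               # 10
print(calls)                         # ['load', 'total']
```

With a Dask DataFrame the pattern is similar: an expression such as a groupby followed by a mean describes the work, and calling `.compute()` on it returns the concrete result.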

  10. Dask – mature & rich diagnostics
    groupby task-graph


  11. Dask Distributed DataFrame

    “Slower” than Pandas but happily works for 100GBs+

    Lots of docs & help on StackOverflow

    Great for 1 or n machines for bigger-than-RAM tasks

    Give Workers lots of RAM (else they die!)


  12. Vaex

    “New” project (not “Pandas”)

    Memory mapped, virtual columns & lazy computation

    New string dtype (RAM efficient)

    See article (single laptop, billions of samples) ->
    https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385
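
The three ideas named on this slide (memory mapping, virtual columns, lazy computation) can be sketched with the Python standard library alone. This does not use Vaex or mirror its API, and the function names and file name are hypothetical. A column of float64 values lives in a file, is mapped rather than loaded, and a derived "virtual" column is evaluated value by value only while it is being reduced.

```python
import mmap
import os
import tempfile
from array import array


def virtual_squared(column):
    """A 'virtual column': an expression evaluated on demand, never stored."""
    return (v * v for v in column)


def sum_of_squares_from_disk(values):
    """Write values to a binary file, memory-map it, and reduce lazily."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "x.f64")
        with open(path, "wb") as f:
            array("d", values).tofile(f)   # raw float64 column on disk

        with open(path, "rb") as f, \
                mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            x = memoryview(mm).cast("d")   # view onto the mapped file, no copy
            try:
                return sum(virtual_squared(x))
            finally:
                x.release()   # release the view so the map can close cleanly


print(sum_of_squares_from_disk([1.0, 2.0, 3.0, 4.0]))   # 30.0
```

The squared column never exists as a whole in memory. The operating system pages the file in as the reduction walks over it, which is the property that makes bigger-than-RAM columns workable.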


  13. Summary

    Dask on bigger data, Modin if in RAM. See the Budapest talk on Monday for smaller dataframes
    meetup.com/PyData-Budapest

    See blog for my classes

    I’d love a postcard if you learned something new
