
Making Pandas Fly

Talk at PyDataBudapest on preprocessing data with Dask and then making Pandas run much faster and use less RAM: https://ianozsvald.com/2020/04/27/flying-pandas-and-making-pandas-fly-virtual-talks-this-weekend-on-faster-data-processing-with-pandas-modin-dask-and-vaex/

ianozsvald

April 27, 2020
Transcript

  1. Making Pandas Fly (live from London)
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyDataBudapest 2020


  2. Interim Chief Data Scientist

    19+ years experience

    Team coaching & public courses
    – Higher Performance!
    Introductions
    By [ian]@ianozsvald[.com] Ian Ozsvald
    2nd Edition, May 2020


  3. Preparing data with Dask

    Pandas
    – Saving RAM
    – Calculating faster by dropping to Numpy & Numba

    What does Covid 19 do to the (UK) economy?
    Today’s goal
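
As a rough illustration of the "dropping to Numpy & Numba" item, the sketch below takes the NumPy array out of a Series and sums it in a compiled loop. The function name `total` and the sample data are made up for illustration, and the `try`/`except` fallback is only an assumption so the snippet still runs where Numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, use a no-op decorator (plain Python speed).
    def njit(func):
        return func


@njit
def total(arr):
    # Plain loop over a NumPy array; Numba compiles it on the first call.
    acc = 0.0
    for i in range(arr.shape[0]):
        acc += arr[i]
    return acc


ser = pd.Series(np.arange(1_000_000, dtype=np.float64))
result = total(ser.to_numpy())  # pass the NumPy array, not the Series
```

The first call includes compilation time, so any timing comparison should be made on a second call.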


  4. 10 million rows “probably fine” but needs tens of GB of RAM

    Probably only single core, built for in-RAM computation

    Complex 10yr codebase, hard to optimise
    When does Pandas get smelly?


  5. Mature project, Array (NumPy), Bag (list-like)

    Distributed dataframe for Pandas – row blocks, not cols
    Dask Distributed DataFrame
    https://dask.readthedocs.io/en/latest/dataframe.html

  6. Dask – remember to “.compute()”

  7. Dask – mature & rich diagnostics
    groupby task-graph


  8. Live dataframe demo...
    Dask Demo


  9. “Slower” than Pandas but happily works for 100GBs+

    Lots of docs & help on StackOverflow

    Great for 1 or n machines for bigger-than-RAM tasks

    Give Workers lots of RAM (else they die!)
    Dask Distributed DataFrame


  10. Get more into RAM with smaller dtypes

    Use smaller dtypes for numbers

    Make smarter function choices

    Compile with Numba
    Making Pandas fly
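
The dtype points above can be sketched as follows. The frame and its column names (`visits`, `price`, `region`) are hypothetical, and using a categorical for the repeated strings is an assumption about what fits such data, not something stated on the slide.

```python
import numpy as np
import pandas as pd

n = 100_000
idx = np.arange(n, dtype=np.int64)
df = pd.DataFrame({
    "visits": idx % 100,                 # int64, values 0..99
    "price": np.linspace(0.0, 1.0, n),   # float64
    "region": np.array(["north", "south", "east", "west"])[idx % 4],
})
before = df.memory_usage(deep=True).sum()

small = df.assign(
    visits=pd.to_numeric(df["visits"], downcast="unsigned"),  # smallest unsigned int that fits
    price=df["price"].astype("float32"),      # half the bytes, at reduced precision
    region=df["region"].astype("category"),   # few distinct strings stored once
)
after = small.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

Downcasting floats trades precision for RAM, so it is worth checking that results still agree to the tolerance you need.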

  11. NumPy vs Pandas overhead
    25 files, 83 functions
    Very few NumPy calls!

  12. Overhead...

  13. Overhead with ser.values.sum()
    18 files, 51 functions
    Many fewer Pandas calls (but still a lot!)
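
A small sketch for comparing the two call paths yourself, assuming a Series of random floats (the size and repeat count are arbitrary). No timings are quoted here because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.random.default_rng(0).random(1_000_000))

via_pandas = ser.sum()         # goes through Pandas' own machinery
via_numpy = ser.values.sum()   # calls NumPy's sum on the underlying array

t_pandas = timeit.timeit(lambda: ser.sum(), number=20)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=20)
print(f"ser.sum(): {t_pandas:.4f}s   ser.values.sum(): {t_numpy:.4f}s")
```

One caveat: `ser.sum()` skips NaN by default, while `ser.values.sum()` returns NaN if any value is missing, so the two are only interchangeable on data without missing values.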

  14. Covid 19 UK economic impact?


  15. Dask on Bigger Data, Modin if in RAM. Vaex for many strings

    See blog for my classes

    I’d love a postcard if you learned something new
    Summary


  16. “New” project (not “Pandas”)

    Memory mapped, virtual columns & lazy computation

    New string dtype (RAM efficient)

    See article (single laptop, billions of samples) ->
    Vaex
    https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385


  17. A new “algebra” for DataFrames, reimplemented functions & Pandas fallback

    Young project, drop-in replacement

    Uses Ray for parallel computation

    Easy to experiment with
    Modin

  18. Modin


  19. Modin
    https://github.com/modin-project/modin/issues/1390
