
PyDataUK_Making_Pandas_Fly.pdf

ianozsvald

May 05, 2020

Transcript

  1. Making Pandas Fly (live from London)
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyDataUK 2020


  2. Introductions
    Interim Chief Data Scientist

    19+ years experience

    Team coaching & public courses
    – Higher Performance!

    2nd Edition, May 2020
    By [ian]@ianozsvald[.com] Ian Ozsvald


  3. Today’s goal
    Pandas
    – Saving RAM
    – Calculating faster by dropping to NumPy & Numba

    A brief look at Modin for faster in-RAM Pandas ops

    What does Covid 19 do to the (UK) economy?
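
As an illustration of the "Saving RAM" goal, here is a minimal sketch of two common techniques: storing a repetitive string column as a categorical and downcasting an integer column. The column names and data are hypothetical and are not taken from the talk; the actual savings depend on your data and Pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a low-cardinality string column and an integer
# column whose values fit comfortably in 8 bits.
n_rows = 200_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["London", "Leeds", "Bristol"], size=n_rows),
    "age": rng.integers(0, 100, size=n_rows),
})

before = df.memory_usage(deep=True).sum()

# Each distinct string is stored once when the column is categorical.
df["city"] = df["city"].astype("category")
# downcast picks the smallest unsigned integer dtype that holds the values.
df["age"] = pd.to_numeric(df["age"], downcast="unsigned")

after = df.memory_usage(deep=True).sum()
print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

Passing `deep=True` matters here: without it, object-dtype string columns report only the size of their pointers, so the "before" figure would be understated.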

  4. NumPy vs Pandas overhead
    25 files, 83 functions
    Very few NumPy calls!

  5. Overhead...

  6. Overhead with ser.values.sum()
    18 files, 51 functions
    Many fewer Pandas calls (but still a lot!)
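
A rough way to see this overhead for yourself is to time both calls on the same Series. This is only a sketch with made-up data (the name `ser` follows the slide title); timings vary by machine and Pandas version, and the gap tends to be most visible on small Series where fixed call overhead dominates.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical input: a Series of one million floats with no NaNs.
ser = pd.Series(np.random.default_rng(0).random(1_000_000))

# Series.sum() goes through Pandas machinery (dispatch, NaN handling).
pandas_total = ser.sum()
# .values exposes the underlying NumPy array, so this is a plain
# NumPy reduction.
numpy_total = ser.values.sum()

# Both give the same answer up to floating-point rounding.
assert np.isclose(pandas_total, numpy_total)

t_pandas = timeit.timeit(ser.sum, number=100)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=100)
print(f"ser.sum(): {t_pandas:.4f}s  ser.values.sum(): {t_numpy:.4f}s")
```

One caveat when dropping to NumPy: `ser.sum()` skips NaN values by default, whereas the NumPy array's `.sum()` returns NaN if any element is NaN, so the two are only interchangeable on clean data.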


  7. Modin
    A new “algebra” for DataFrames, reimplemented functions & Pandas fallback

    Young project, drop-in replacement

    Uses Ray for parallel computation

    Easy to experiment with
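
"Drop-in replacement" means in practice that only the import line changes. The sketch below tries Modin first and falls back to plain Pandas if Modin is not installed; the try/except wrapper and the toy data are assumptions added for illustration, not something shown in the talk.

```python
# Try Modin first; fall back to plain Pandas if it is not installed.
try:
    import modin.pandas as pd  # drop-in replacement namespace
    backend = "modin"
except ImportError:
    import pandas as pd
    backend = "pandas"

# Hypothetical toy data; Modin is aimed at much larger frames than this.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# The rest of the code is written against the usual Pandas API.
totals = df.groupby("key")["value"].sum()

print(backend, totals.to_dict())
```

Because the rest of the code is unchanged, switching back is a one-line edit, which is what makes Modin cheap to try on an existing script.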

  8. Modin


  9. Modin
    https://github.com/modin-project/modin/issues/1390

  10. Covid 19 UK economic impact?


  11. Summary
    Modin if big df, else check your Pandas choices. Swifter multicore

    See blog for my classes, also Thoughts & Jobs email list

    I’d love a postcard if you learned something new


  12. Vaex
    “New” project (not “Pandas”)

    Memory mapped, virtual columns & lazy computation

    New string dtype (RAM efficient)

    See article (single laptop, billions of samples) ->
    https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385


  13. When does Pandas get smelly?
    10 million rows “probably fine” but needs 10s of GB of RAM

    Probably only single core, built for in-RAM computation

    Complex 10yr codebase, hard to optimise


  14. Dask Distributed DataFrame
    Mature project, Array (NumPy), Bag (list-like)

    Distributed dataframe for Pandas – row blocks, not cols
    https://dask.readthedocs.io/en/latest/dataframe.html

  15. Dask – remember to “.compute()”

  16. Dask – mature & rich diagnostics
    groupby task-graph


  17. Dask Demo
    Live dataframe demo...


  18. Dask Distributed DataFrame
    “Slower” than Pandas but happily works for 100GBs+

    Lots of docs & help on StackOverflow

    Great for 1 or n machines for bigger-than-RAM tasks

    Give Workers lots of RAM (else they die!)