Sprinting Pandas (London Python)

ianozsvald
October 22, 2020

Transcript

  1. Sprinting Pandas (live in London)
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    London Python October 2020

  2. Introductions
    Interim Chief Data Scientist
    19+ years experience
    Team coaching & public courses
    – I’m sharing from my Higher Performance Python course
    By [ian]@ianozsvald[.com] Ian Ozsvald
    2nd Edition!

  3. Today’s goal
    Pandas
    – Saving RAM to fit in more data
    – Calculating faster by dropping to NumPy
    Advice for “being highly performant”
    Has Covid 19 affected UK Company Registrations?

  4. Strings are expensive and slow

  5. Categoricals are cheap and fast!
    Circa 1% of previous memory cost
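
    A minimal sketch of the comparison on the two slides above, assuming synthetic data (the labels and row count are made up for illustration, not the Companies House data used in the talk). The saving you see depends on how long and how repetitive your strings are.

```python
import numpy as np
import pandas as pd

# Hypothetical data: many rows drawn from a handful of repeated labels.
rng = np.random.default_rng(0)
labels = np.array(["Private Limited Company",
                   "Charitable Organisation",
                   "Limited Partnership"])
ser_str = pd.Series(rng.choice(labels, size=100_000))

# Convert the repeated strings to a categorical column.
ser_cat = ser_str.astype("category")

# deep=True counts the string payloads, not just the pointers.
bytes_str = ser_str.memory_usage(index=False, deep=True)
bytes_cat = ser_cat.memory_usage(index=False, deep=True)
print(f"strings:  {bytes_str / 1e6:.2f} MB")
print(f"category: {bytes_cat / 1e6:.2f} MB")
```

    The categorical stores each distinct label once plus a small integer code per row, which is where the saving comes from.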

  6. Categoricals
    “.cat” accessor
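
    A small sketch of what the “.cat” accessor exposes, using a made-up Series (the labels are hypothetical):

```python
import pandas as pd

# Hypothetical categorical Series.
ser = pd.Series(["ltd", "plc", "ltd", "charity", "ltd"], dtype="category")

# The distinct labels are stored once...
print(ser.cat.categories)
# ...and each row holds a small integer code pointing at a label.
print(ser.cat.codes)

# Renaming touches only the label table, not every row.
renamed = ser.cat.rename_categories({"ltd": "Private Limited Company"})
print(renamed)
```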

  7. Categoricals – over 10x speed up (on this data)!

  8. Categoricals – index queries faster!
    Circa 500x speed-up!
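
    A sketch of an index query on a categorical index, assuming a tiny made-up frame (the column names are hypothetical). It shows the mechanics only; the speed-up quoted above is from the talk's own data, so time it on yours.

```python
import pandas as pd

# Hypothetical frame with a repeated label column.
df = pd.DataFrame({
    "category": ["ltd", "plc", "ltd", "charity"],
    "n_companies": [10, 20, 30, 40],
})
df["category"] = df["category"].astype("category")

# Setting a categorical column as the index gives a CategoricalIndex.
df_indexed = df.set_index("category")
print(type(df_indexed.index).__name__)

# Label lookup against the categorical index.
ltd_rows = df_indexed.loc["ltd"]
print(ltd_rows)
```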

  9. float64 is default and a bit expensive

  10. float32 “half-price” and a bit faster
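
    A minimal sketch of the float64 to float32 trade-off, assuming random values in [0, 1) (illustrative data only). Whether the lost precision is acceptable depends on your data, so check the error before committing.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column; float64 is the Pandas default.
rng = np.random.default_rng(0)
ser64 = pd.Series(rng.random(100_000))
ser32 = ser64.astype("float32")

bytes64 = ser64.memory_usage(index=False)
bytes32 = ser32.memory_usage(index=False)
print(f"float64: {bytes64:,} bytes, float32: {bytes32:,} bytes")

# Largest absolute rounding error introduced by the downcast.
max_err = float((ser64 - ser32.astype("float64")).abs().max())
print(f"max abs error: {max_err:.2e}")
```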

  11. Make choices to save RAM
    Including the index (previously we ignored it) we still save circa 50% RAM, so you can fit in more rows of data

  12. “dtype_diet” gives you advice

  13. Drop to NumPy if you know you can
    Caveat – Pandas mean is not np mean, the fair comparison is to np nanmean
    which is slower – see my blog or PyDataAmsterdam 2020 talk for details
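
    The caveat can be seen on a four-element Series (made-up values). Pandas skips NaN by default, plain NumPy does not, so dropping to NumPy is only a like-for-like swap if you know there are no NaNs or you use the nan-aware functions.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])
arr = ser.values  # the underlying NumPy array

pandas_mean = ser.mean()         # skipna=True by default
numpy_mean = np.mean(arr)        # NaN propagates into the result
numpy_nanmean = np.nanmean(arr)  # ignores NaN, comparable to Pandas

print(pandas_mean, numpy_mean, numpy_nanmean)

# The same applies to sums: arr.sum() is nan here, np.nansum is not.
pandas_sum = ser.sum()
numpy_nansum = np.nansum(arr)
```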

  14. NumPy vs Pandas overhead (ser.sum())
    25 files, 83 functions
    Very few NumPy calls!
    Thanks! https://github.com/ianozsvald/callgraph_james_powell

  15. Overhead...

  16. Overhead with ser.values.sum()
    18 files, 51 functions
    Many fewer Pandas calls (but still a lot!)

  17. Is Pandas unnecessarily slow – NO!
    https://github.com/pandas-dev/pandas/issues/34773 – the truth is a bit complicated!

  18. Being highly performant
    Install optional (but great!) Pandas dependencies
    – bottleneck
    – numexpr
    Investigate https://github.com/ianozsvald/dtype_diet
    Investigate my ipython_memory_usage (PyPI/Conda)
    https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html

  19. Pure Python is “slow” and expressive
    Deliberately poor function – pretend this is clever but slow!

  20. Compile to Numba judiciously
    Near 10x speed-up!
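
    A sketch of compiling a plain-Python loop with Numba. The function here is a hypothetical stand-in, not the one shown on the slide, and the import fallback is an assumption added so the sketch still runs (uncompiled) where Numba is not installed. Time it on your own function before relying on any speed-up.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback: without Numba the function runs as ordinary Python.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    # An explicit loop: slow in pure Python, a good fit for Numba.
    total = 0.0
    for i in range(values.shape[0]):
        total += values[i] * values[i]
    return total


# Pass the NumPy array (e.g. ser.values), not the Pandas Series.
arr = np.arange(1_000, dtype=np.float64)
result = sum_of_squares(arr)
print(result)
```

    The first call includes compilation time, so measure the second and later calls.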

  21. Parallelise with Dask for multi-core
    Make plain-Python code multi-core
    Note I had to drop text index column due to speed-hit
    Data copy cost can overwhelm any benefits so (always) profile & time

  22. Being highly performant
    Mistakes slow us down (PAY ATTENTION!)
    – Try nullable Int64 & boolean, forthcoming Float64
    – Write tests (unit & end-to-end)
    – Lots more material & my newsletter on my blog IanOzsvald.com
    – Time saving docs:

  23. Vaex / Modin
    Memory mapped & lazy computation
    – New string dtype (RAM efficient)
    Modin sits on Pandas, new “algebra” for dfs
    – Drop in replacement, easy to try
    See talks on my blog:

  24. When to try Vaex
    You have a huge dataset on a single hard drive
    Memory mapped files (HDF5) are best
    NumPy types and simpler Pandas-like functions
    Investment – similar but different API to Pandas
    https://github.com/vaexio/vaex/issues/968

  25. When to try Dask
    You want Pandas but ran out of RAM on 1 machine
    You want multi-machine cluster scalability
    You want multi-core support for operations like groupby on parallelisable datasets
    Investment – quick start then a learning curve

  26. When to try Modin
    You want all of Pandas
    You have lots of RAM and many CPUs
    You’re doing groupby operations on many columns
    Investment – easy to try
    https://github.com/modin-project/modin/issues/1390

  27. Covid 19’s effect on UK Economy?
    Sharp decline in corporate registration after Lockdown – then apparent surge (perhaps just backed-up paperwork?).
    Will the recovery “last”?
    All open data, you can do similar things!

  28. Summary
    Make it right then make it fast
    Think about being performant
    See blog for my classes
    I’d love a postcard if you learned something new!

  29. Be faster by learning new approaches