Making Pandas Fly (EuroPython 2020)

ianozsvald
July 24, 2020

I revisit my first EuroPython tutorial (given in 2011) with this reprise on higher performance: saving RAM, Categoricals, NumPy, Numba and Dask.
Details here: https://ianozsvald.com/2020/07/24/making-pandas-fly-at-europython-2020/

Transcript

  1. Making Pandas Fly (live from London)
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    EuroPython 2020


  2. Introductions
    Interim Chief Data Scientist
    19+ years experience
    Team coaching & public courses
    – I’m sharing from my Higher Performance Python course
    2nd Edition!
    By [ian]@ianozsvald[.com] Ian Ozsvald


  3. Thank the organisers!
    All volunteers – go say thank you in #lobby
    They’ve put in a huge amount of volunteered work for us!


  4. Today’s goal
    Pandas
    – Saving RAM to fit in more data
    – Calculating faster by dropping to NumPy

    Advice for “being highly performant”

    Has Covid 19 affected UK Company Registrations?

  5. Strings are expensive and slow

  6. Categoricals are cheap and fast!
    Circa 1% of previous memory cost
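
A minimal sketch of the comparison this slide makes, assuming a hypothetical low-cardinality string column (the talk's own data is not reproduced here, so the saving you measure will differ from the circa 1% quoted above):

```python
import pandas as pd

# Hypothetical stand-in data: many rows, only four distinct strings.
ser_str = pd.Series(["LONDON", "CARDIFF", "LEEDS", "BRISTOL"] * 100_000)
ser_cat = ser_str.astype("category")

# deep=True counts the string payloads, not just the pointers to them.
bytes_str = ser_str.memory_usage(deep=True, index=False)
bytes_cat = ser_cat.memory_usage(deep=True, index=False)

print(f"strings:     {bytes_str:>12,} bytes")
print(f"categorical: {bytes_cat:>12,} bytes")
```

The saving depends on how few distinct values the column has; a column of mostly unique strings gains little or nothing.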

  7. Categoricals
    “.cat” accessor
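
A short sketch of what the `.cat` accessor exposes, using made-up status labels as an assumption in place of the slide's data:

```python
import pandas as pd

# Hypothetical labels; the slide's real column is not shown in the transcript.
ser = pd.Series(["active", "dissolved", "active", "liquidation"], dtype="category")

print(ser.cat.categories)  # the unique labels, stored once
print(ser.cat.codes)       # one small integer code per row

# Renaming works on the handful of categories rather than on every row.
renamed = ser.cat.rename_categories(str.upper)
print(renamed.tolist())
```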

  8. Categoricals – over 10x speed up (on this data)!

  9. Categoricals – index queries faster!
    Circa 500x speed-up!
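
One way to set up the kind of index query this slide refers to, sketched on hypothetical data. It only shows that the string index and the categorical index return the same rows; the speed-up itself should be timed on your own data (for example with `%timeit` in IPython):

```python
import pandas as pd

# Hypothetical frame with a low-cardinality text column.
df = pd.DataFrame({"status": ["active", "dissolved"] * 50_000,
                   "n": range(100_000)})

df_str = df.set_index("status")                                  # string index
df_cat = df.astype({"status": "category"}).set_index("status")   # CategoricalIndex

rows_str = df_str.loc["dissolved"]
rows_cat = df_cat.loc["dissolved"]
print(type(df_cat.index).__name__, len(rows_cat))
```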

  10. float64 is default and a bit expensive

  11. float32 “half-price” and a bit faster
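
A sketch of the "half-price" trade-off on random stand-in data (an assumption; the talk's columns are not reproduced). It halves the storage and also checks how much precision is given up, since float32 keeps only about 7 significant decimal digits:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ser64 = pd.Series(rng.random(1_000_000))  # float64 is the default
ser32 = ser64.astype("float32")

bytes64 = ser64.memory_usage(index=False)
bytes32 = ser32.memory_usage(index=False)

# Worst-case rounding error introduced by the down-cast on this data.
max_abs_error = (ser64 - ser32).abs().max()
print(bytes64, bytes32, max_abs_error)
```

Whether that error is acceptable depends on the calculation; it is worth checking before down-casting money or accumulated sums.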

  12. Make choices to save RAM
    Including the index (previously we ignored it) we still save circa 50% RAM so you can fit in more rows of data

  13. “dtype_diet” gives you advice

  14. Drop to NumPy if you know you can
    Caveat – Pandas mean is not np mean, the fair comparison is to np nanmean
    which is slower – see my blog or PyDataAmsterdam 2020 talk for details
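
The caveat can be seen on a tiny made-up Series containing one missing value: Pandas skips NaN by default, `np.mean` does not, and `np.nanmean` is the like-for-like NumPy call.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])  # illustrative values only
arr = ser.to_numpy()

pandas_mean = ser.mean()         # skipna=True by default, NaN is ignored
numpy_mean = np.mean(arr)        # NaN propagates into the result
numpy_nanmean = np.nanmean(arr)  # ignores NaN, comparable to Pandas

print(pandas_mean, numpy_mean, numpy_nanmean)
```

So dropping to `np.mean` is only safe if you know the data has no missing values.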

  15. NumPy vs Pandas overhead (ser.sum())
    25 files, 83 functions
    Very few NumPy calls!
    Thanks!

  16. Overhead...

  17. Overhead with ser.values.sum()
    18 files, 51 functions
    Many fewer Pandas
    calls (but still a lot!)
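
A rough way to measure the call overhead yourself, sketched with `timeit` on a deliberately small, NaN-free Series (an assumption chosen so that fixed overhead dominates and both calls return the same answer). No timings are quoted here because they vary by machine and Pandas version:

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(1_000, dtype="float64"))

total_pandas = ser.sum()
total_numpy = ser.values.sum()

# Time 1,000 calls of each route to the same result.
t_pandas = timeit.timeit(lambda: ser.sum(), number=1_000)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=1_000)
print(f"ser.sum(): {t_pandas:.4f}s  ser.values.sum(): {t_numpy:.4f}s")
```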

  18. Is Pandas unnecessarily slow – NO!
    https://github.com/pandas-dev/pandas/issues/34773 – the truth is a bit complicated!


  19. Being highly performant
    Install optional (but great!) Pandas dependencies
    – bottleneck
    – numexpr

    Investigate https://github.com/ianozsvald/dtype_diet

    Investigate my ipython_memory_usage (PyPI/Conda)
    https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html

  20. Pure Python is “slow” and expressive
    Deliberately poor function – pretend this is clever but slow!

  21. Compile to Numba judiciously
    Near 10x speed-up!

  22. Parallelise with Dask for multi-core
    Make plain-Python
    code multi-core

    Note I had to drop text
    index column due to
    speed-hit

    Data copy cost can
    overwhelm any benefits
    so (always) profile &
    time


  23. Mistakes slow us down (PAY ATTENTION!)
    – Try nullable Int64 & boolean, forthcoming Float64
    – Write tests (unit & end-to-end)
    – Lots more material & my newsletter on my blog
    IanOzsvald.com
    – Time saving docs:
    Being highly performant
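
A small illustration of the nullable dtypes mentioned in the first bullet, on made-up values: with the default dtype a missing integer silently turns the column into float64, while `Int64` and `boolean` keep their type and mark the gap with `pd.NA`.

```python
import pandas as pd

raw = [1, 2, None, 4]  # illustrative values only

as_default = pd.Series(raw)                  # becomes float64 with NaN
as_nullable = pd.Series(raw, dtype="Int64")  # stays integer, gap is pd.NA
flags = pd.Series([True, None, False], dtype="boolean")

print(as_default.dtype, as_nullable.dtype, flags.dtype)
print(as_nullable.sum(), as_nullable.isna().sum())
```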


  24. Vaex / Modin
    Memory mapped & lazy computation
    – New string dtype (RAM efficient)

    Modin sits on Pandas, new “algebra” for dfs
    – Drop in replacement, easy to try
    See talks on my blog:


  25. Summary
    Make it right then make it fast

    Think about being performant

    See blog for my classes

    I’d love a postcard if you learned something new!

  26. Covid 19’s effect on UK Economy?
    Sharp decline in corporate registration after Lockdown – then apparent surge (perhaps just backed-up paperwork?).
    Will the recovery “last”?
    All open data, you can do similar things!
