Slide 1

Slide 1 text

Expert Briefing – The State of Higher Performance Python
Ian Ozsvald (@IanOzsvald, ianozsvald.com)
PyDataGlobal 2022

Slide 2

Slide 2 text

Today’s goal
• Compiling/Pandas/Scaling briefing
• Find bottlenecks so you can go faster - efficiently
• Q&A – what’s blocking you?
By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 3

Slide 3 text

Introductions
• Interim Chief Data Scientist
• 20+ years experience
• Team coaching & public courses – I’m sharing from my Higher Performance Python course
2nd Edition!

Slide 4

Slide 4 text

Where we’re at
• Pandas is default (but 5X memory usage hurts!)
• Introspection into tools like Pandas is possible but rare
• Python is interpreted, single-core, the GIL is bad, it can’t scale – switch to another language!? (obviously - no)

Slide 5

Slide 5 text

Python 3.11
• Python 3.11 runs code in 40-80% of the time of 3.10
• 3.12 will have a JIT
• This doesn’t affect NumPy, Pandas and friends (sadly)
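One way to check the interpreter speed-up on your own workload is to time a pure-Python function under each interpreter. The following is a minimal sketch using only the standard library; `pure_python_work` is a made-up stand-in, not code from the talk. Run the same script under 3.10 and 3.11 and compare the printed timings.

```python
import sys
import timeit


def pure_python_work(n):
    """Interpreter-bound loop: no NumPy or Pandas involved."""
    total = 0
    for i in range(n):
        total += i % 7
    return total


# Taking the best of several repeats is less sensitive to background noise.
timings = timeit.repeat(lambda: pure_python_work(100_000), number=5, repeat=3)
best = min(timings)
print(
    f"Python {sys.version_info.major}.{sys.version_info.minor}: "
    f"best of 3 repeats = {best:.4f}s for 5 calls"
)
```

Code that spends its time inside NumPy or Pandas should show little change, since the heavy lifting there happens in compiled extensions.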

Slide 6

Slide 6 text

VizTracer
• Time-based function viewer
• Notebook & module support
• Great for drilling into complex code – e.g. Pandas!
• See what’s happening down the call-stack by time-taken

Slide 7

Slide 7 text

Compilation with Numba
• Just-in-time compiler for math (NumPy & pure Python)
• No direct Pandas support (some support in Pandas)
• Specialises to your CPU, or general pre-compiled library (Ahead-of-time), optional strict type declarations
• GPU and parallel-CPU support
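A minimal sketch of the just-in-time idea, assuming a simple numeric loop over a NumPy array. `sum_of_squares` is an illustrative function, not the example shown in the talk, and the `ImportError` fallback is only there so the sketch still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # compiles the function on first call
except ImportError:
    # Assumption for this sketch only: fall back to plain Python.
    def njit(func):
        return func


@njit
def sum_of_squares(arr):
    """Explicit loop: slow in pure Python, fast once compiled."""
    total = 0.0
    for x in arr:
        total += x * x
    return total


data = np.arange(10, dtype=np.float64)
print(sum_of_squares(data))  # first call includes compilation time
```

When benchmarking, time the second call rather than the first, because the first call pays the one-off compilation cost.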

Slide 8

Slide 8 text

Compilation with Numba

Slide 9

Slide 9 text

Pandas v1.5+
• Platform – e.g. integrated with sklearn, plotting tools
• Increasing integration of Numba
• Making e.g. groupby.apply fast is hard/painful
• Arrow datatypes offer faster string dtype (experimental)
• Parallelisation is hard (don’t hold your breath, see Dask)

Slide 10

Slide 10 text

Pandas v1.5+
• 670MB vs 140MB for unique strings
• Bigger savings for repeated strings
• 6x speed-up for this operation

Slide 11

Slide 11 text

Rolling + Numba
• .rolling lets us make a “rolling window” on our data
• Great for time-series e.g. rolling 1w or 10min
• User-defined functions traditionally crazy-slow
• Numba can make faster functions for little effort

Slide 12

Slide 12 text

Rolling operations
• “raw arrays” are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
• 40x speed-up to Cython
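A sketch of a user-defined rolling function run through the Numba engine, assuming `pandas` and `numba` are installed. `value_range` is an illustrative function rather than the one from the slide, and the fallback to the default engine is an assumption added so the sketch still runs without Numba.

```python
import numpy as np
import pandas as pd

try:
    import numba  # noqa: F401  (only checking that it is available)
    engine = "numba"
except ImportError:
    engine = "cython"  # pandas' default engine, used here as a fallback


def value_range(window):
    """Spread of the values in one window; receives a raw NumPy array."""
    return window.max() - window.min()


ser = pd.Series(np.arange(1000, dtype=np.float64))

# raw=True passes NumPy arrays to the function, which the Numba engine requires.
baseline = ser.rolling(10).apply(value_range, raw=True)
fast = ser.rolling(10).apply(value_range, raw=True, engine=engine)
```

Both calls should give the same values; only the time taken differs, and the first Numba call includes compilation.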

Slide 13

Slide 13 text

Dask for larger data
• Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
• Popular for scaling Pandas
• Wraps Pandas DataFrames
4 Dask talks at PyDataGlobal!

Slide 14

Slide 14 text

Dask & Pandas for larger datasets
• Bigger than RAM or “I want to use all my cores”
• Generally you’ll use Parquet (or CSV or many choices)
• The Dashboard gives rich diagnostics
• Write Pandas-like code
• Lazy – use “.compute()”

Slide 15

Slide 15 text

Vaex/Polars/Modin/Bodo
• Vaex – medium data, HDF5 data, similar Pandas API
• Polars – in-memory data, Arrow dtypes, less-similar Pandas API
• Modin – extends Pandas (Ray/Dask) but low adoption
• Bodo - commercial Pandas compiler+paralleliser

Slide 16

Slide 16 text

Modin example with Ray (Dask worse)

Slide 17

Slide 17 text


Slide 18

Slide 18 text

Summary
• Take advantage of the new tools
• Newsletter:
• See blog for my classes + many past talks
• I’d love a postcard if you learned something new!
Get my cheatsheet: http://bit.ly/hpp_cheatsheet

Slide 19

Slide 19 text

Q&A...
• What blockers do you have?
• Have you had success with Vaex/Modin/Polars/Rapids/Bodo/…?
• What holds you up with Pandas?